Evaluation of the satellite-based Global Flood Detection System for measuring river discharge: influence of local factors

Introduction
Flooding is the most prevalent natural hazard at the global scale, often with dire humanitarian and economic effects. According to the International Disaster Database (EM-DAT), an average of 175 flood events per year occurred globally between 2002 and 2011, affecting an average of 116.5 million people and causing economic losses of USD 25.5 billion. According to MunichRe (2014), the costliest natural catastrophe worldwide in terms of overall economic losses in 2013 was the flooding in southern and eastern Germany and neighbouring states in May and June, with estimated damages of USD 15.2 billion. In June of the same year, flooding in India claimed 5000 lives, with a further 2 million affected (MunichRe, 2014; EM-DAT).
The Global Assessment Report (UNISDR, 2011) states that the proportion of world population living in flood-prone river basins increased by 114 % over four decades from 1970 to 2010. Additionally, while economic losses due to river floods have increased over the last 50 years, the number of casualties has decreased. The reduction in loss of life has been associated with the integration of early warning systems with emergency preparedness and planning at local and national levels (Golnaraghi et al., 2009; Kundzewicz et al., 2012).
Published by Copernicus Publications on behalf of the European Geosciences Union. B. Revilla-Romero et al.: Evaluation of the Global Flood Detection System for measuring river discharge
Global early warning systems are needed to improve international disaster management. These systems can be used for both early forecasting (for better preparedness) and early detection, as well as for an effective response and crisis management. Their necessity was emphasised in 2005, and since then they have been a key element of international initiatives such as the "Hyogo Framework for Action 2005-2015" and, on a continental level, the European Commission Flood Action Programme. After the 2002 flooding of the Elbe and Danube rivers, the European Commission supported the development of the European Flood Awareness System (EFAS) (Bartholmes et al., 2009; Thielen et al., 2009) by the Joint Research Centre to increase preparedness for riverine floods across Europe. Currently, a number of organisations are involved in rapid mapping activities after major (flood) disasters, such as UNOSAT (2013), GDACS (2013), "Space and Major Disasters" (Disaster Charter, 2014), the Committee on Earth Observation Satellites (CEOS) Flood Pilot and the online Dartmouth Flood Observatory (http://floodobservatory.colorado.edu/). In Europe, Copernicus is the Earth observation programme which actively supports the use of satellite technology in disaster management and early warning systems for improved emergency management.
Flood warning systems typically rely on forecasts from national meteorological services and in situ observations from hydrological gauging stations. However, this capacity is not equally developed across the globe, and is highly limited in flood-prone, developing countries. Ground-based hydrometeorological observations are often scarce and, in cases of transboundary rivers, data sharing among the riparian nations can be limited or absent. Therefore, satellite monitoring systems and global flood forecasting systems are a needed alternative source of information for national flood authorities not in a position to build up an adequate measuring network and early warning system. In recent years, there has been a notable development in the monitoring of floods using satellite remote sensing and meteorological and hydrological modelling (Schumann et al., 2009).
A variety of satellite-based monitoring systems measure characteristics of the Earth's surface, including terrestrial surface water, over large areas on a regular basis (van Westen, 2013). Such remote sensing is based on surface electromagnetic reflectance or radiance in the optical, infrared and microwave bands. Key advantages of microwave sensors are that they provide near-daily global coverage and, at selected frequencies, relatively little interference from cloud cover. Two microwave remote sensors with near-global coverage are the Tropical Rainfall Measuring Mission (TRMM), operational from 1998 to present, and the Advanced Microwave Scanning Radiometer for Earth Observation System (AMSR-E), which was active from June 2002 to October 2011. AMSR-E was succeeded by AMSR2, which was launched in May 2012 onboard the Japanese satellite GCOM-W1 and from which brightness temperature data are being distributed from January 2013 onwards. For future work, the European Space Agency (ESA) and NASA have other missions to put similar instruments in orbit, capturing passive microwave energy at 36.5 GHz, such as ESA's Sentinel-3 satellites (planned launch in 2015 and 2016) and NASA's Global Precipitation Mission (GPM) (launched in February 2014) to replace TRMM.
Using AMSR-E data initially, De Groeve et al. (2006) implemented a method for detecting major floods on a global scale, based on the surface water extent measured using passive microwave sensing. Also, Brakenridge et al. (2005, 2007) demonstrated that orbital remote sensing can be used to monitor river discharge changes. However, as underlined by Brakenridge et al. (2012, 2013), extracting the microwave signal and converting it into discharge measurements is not straightforward and depends on factors such as sensor calibration characteristics and perturbation of the signal by land surface changes. These changes can be found, for example, in irrigated agricultural zones and in areas where rivers flow along forested floodplains (Brakenridge et al., 2013). As river discharge increases, river level (stage), river width and river flow velocity all increase as well. The challenge is to measure one or more of these accurately enough to provide a reliable discharge estimator against a background of other surface changes that may affect what is measured from orbit.
There also remains the need to convert such discharge estimators to actual discharge units. Using ground discharge data or climate-driven runoff models for calibration and validation, methods to convert the remote sensing signal to river discharge have been previously tested at particular stations with output from the Global Flood Detection System (GFDS, http://www.gdacs.org/flooddetection/) and by different investigators (Brakenridge et al., 2007, 2012; Khan et al., 2012; Kugler and De Groeve, 2007; Moffitt et al., 2011; Hirpa et al., 2013; Zhang et al., 2013). Yet the results come from different approaches and are not easily comparable, making an assessment of the potential performance on a global scale difficult. Furthermore, definite conclusions about the influence of various environmental factors on the signal performance have not been reached. Therefore, in this study, a rigorous broad assessment of the method is undertaken, with a systematic evaluation of the relationship between the skill obtained when comparing ground- and satellite-based discharges and the local characteristics of the stations. Specifically, this study addresses mean observed discharges, river widths, land cover types, leaf area indices, climatic regions and flood hazard maps, as well as the presence or absence of large floodplains, wetlands, river ice and hydraulic control infrastructure.
Our goal is to assess the potential and limitations of the satellite-based surface water extent signal for river discharge measurements using a large number of stations. Moreover, the relationship between ground and satellite sets of discharge measurements and the local surface characteristics is examined in order to provide guidelines for the selection of observation sites. For this purpose, river catchments located in a range of different climatic and land cover types were selected in Africa, Asia, Europe, North America and South America. The remainder of the paper is structured as follows: Sect. 2 presents the study regions and data, Sect. 3 describes the analysis methodologies, and the results are discussed in Sect. 4.

Study regions and in situ discharge data
Figure 1 shows the study basins and in situ discharge locations. The selected stations are all located near major rivers of the world (Global Runoff Data Centre, 2007). The continental distribution and the upstream catchment area of the stations are summarised in Table 1. We selected the locations to be representative of a broad variety of local conditions: they belong to nine different main land cover classes (aggregated from GlobCover, 2009) and five main types of climate (Peel et al., 2007). The characteristics are listed in Table 2.
For Africa, Asia, Europe, North America and South America, daily in situ discharge values were used from the Global Runoff Data Centre (GRDC) database. In addition, for the South African stations, the discharge data were provided by the South African Water Affairs (DWA, http://www.dwa.gov.za/). The selected stations for all these continents include daily data between 1998 and 2010; however, not all stations have continuous data during this time period. From 1998, the length of the time series was required to be above 6 years. The longest time series available was 13 years, with a median value of 8.5 years. In situ discharge information may itself be affected by large and variable uncertainty, mostly in the measurement of the cross-sectional area of the channel and the mean flow velocity at the gauge or control site (Pelletier, 1988). Although generally unknown, these uncertainties are typically between 5 and 20 % at the 95 % confidence level, as highlighted in studies such as Hirsch and Costa (2004), Di Baldassarre and Montanari (2009), Le Coz et al. (2014) and Tomkins (2014). However, the uncertainty in river discharge is even higher during flood events, when the stage-discharge relationship, the so-called rating curve, is used. As evaluated by Pappenberger et al. (2006), the analysis of rating curve uncertainties leads to an uncertainty of the input of 18-25 % at peak discharge. Di Baldassarre and Montanari (2009) showed that the total rating curve errors increase when the river discharge increases and vary from 1.8 to 38.4 %, with a mean value of 21.2 %. For the purposes here, these data are nevertheless regarded as "ground truth". We acknowledge the possible errors and note that, for some river reaches, satellite-based methods may actually track discharge changes more accurately than ground-based measurements using stage; the extent to which this is true, however, needs to be fully investigated.

Satellite-derived data
The Global Flood Detection System (GFDS) produces near-real-time maps and alerts for major floods using satellite-based passive microwave observations of surface water extent and floodplains. It was developed and is maintained at the European Commission Joint Research Centre (JRC) in collaboration with the Dartmouth Flood Observatory (DFO). The surface water extent detection methodology using satellite-based microwave data is explained in Brakenridge et al. (2007) and Kugler and De Groeve (2007). Here, only the basic principles are recalled.
At each pixel, the method uses the difference in brightness temperature, at a frequency of 36.5 GHz, between water and land surface to detect the proportion of within-pixel water and land. The retrieved brightness temperature data are first gridded into a product with a pixel size of 10 km × 10 km near the Equator (0.09° × 0.09°), and the system provides a daily output. For our work, the merged TRMM/AMSR-E product was used (http://www.gdacs.org/flooddetection/download.aspx); the gridded data are provided in the GCS WGS 1984 projection. For our period of study, 1998-2010, the merged data product was employed for the time period of its availability (June 2002-2010), whereas standalone TRMM data were used for the remaining time period (1998 to June 2002) and available latitudes. Note that from 2013 the system has been providing the merged product TRMM/AMSR2; however, this period is outside our scope.
a Vegetation means a combination of grassland, shrubland and forest.
b Types of land cover and climate where the number of locations in each type was very low (e.g. 3) were excluded from their respective variable analyses as they would not be representative on a global scale.
In the GFDS, the microwave signal (s) is defined as the ratio between the measurement over the wet pixel (M) and the measurement over a 7 pixel × 7 pixel array of background calibration (C) pixels, known as the M/C ratio (Brakenridge et al., 2012; De Groeve, 2010). Better discharge signal values may be achieved when the measurement pixel is centred over a river reach and no hydraulic structures are present (Moffitt et al., 2011). However, this is sometimes difficult to achieve due to the desired co-location with gauging stations (Brakenridge et al., 2012) or because the potential measurement pixels within the raster are fixed geographically.
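As an illustration only, the M/C ratio can be sketched in Python as follows. The function name `mc_signal`, the nested-list grid layout and the choice of how the 7 × 7 background array is reduced to a single calibration value C are assumptions of this sketch; the text above does not state which statistic the operational system applies, so the reducer is left as a parameter (defaulting to the mean).

```python
from statistics import mean


def mc_signal(tb, row, col, window=7, background=mean):
    """Return M / C for the pixel at (row, col) of a brightness-temperature grid.

    tb         -- 2-D list of brightness temperatures (hypothetical layout)
    window     -- side length of the background array (7 in the text)
    background -- ASSUMPTION: statistic that turns the background array into C
    """
    half = window // 2
    m = tb[row][col]  # measurement over the (wet) pixel
    # Collect the window x window neighbourhood, clipped at the grid edges.
    # ASSUMPTION: the measurement pixel itself is part of the background array.
    neighbours = [
        tb[r][c]
        for r in range(max(0, row - half), min(len(tb), row + half + 1))
        for c in range(max(0, col - half), min(len(tb[0]), col + half + 1))
    ]
    c_value = background(neighbours)
    return m / c_value
```

Passing `background=max`, for instance, would calibrate against the warmest pixel of the array instead of its mean.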

Other important data sets and maps
The quality of the microwave signal detected by the satellite sensors can be influenced by local ground conditions, including extreme rainfall, snow/ice, land cover/use and topography (Brakenridge et al., 2012). For example, forest is a type of land cover which influences the microwave emission properties due to the biometric features of vegetation such as crown water content and the shape and size of leaves (Chukhlantsev, 2006). In this study, the effects of the local ground conditions on the performance of the satellite signal were analysed as a function of the following factors:
1. River width: channel width from Yamazaki et al. (2014), an estimate based on the SRTM Water Body Database and the HydroSHEDS flow direction map. The map was upscaled from 0.025 to 0.1°, taking the mean of the river grid values in each 4 × 4 area.
2. Mean observed discharge: for each station, a mean discharge value for the study period was calculated from daily ground data (mainly from the GRDC data set).
3. Upstream catchment area (GRDC 2007) data: the GRDC river network was used to visually select those stations located close to the "main rivers" classified by GRDC, and to use the values of the upstream catchment area for each station. Note that upstream catchment area values are missing for all South African stations from the DWA data provider.

4. Presence of floodplains, flooded forest and wetlands: this was obtained from the Global Lakes and Wetlands Database level 3, a global raster map at 30 s resolution which comprises lakes, reservoirs, rivers and different wetland types (Lehner and Döll, 2004).
5. Flood extent: we used the fractional coverage of potential flooding of 25 km by 25 km cells for a 100-year return period from the Global Flood Hazard Map derived using a model grid (HTESSEL + CaMa-Flood) (Pappenberger et al., 2012).
6. Land cover: we used land cover data from the Global Land Cover 2009 (Bontemps et al., 2010). The 19 labels were aggregated into 8 types of land cover depending on the vegetation type and density to synthesise the outputs (see Appendix Table A1). Further visual category checking was performed using Google Maps display for the sites, and, where necessary, land cover classes were changed accordingly. An additional category was added for sparse vegetation areas where crops are grown along or near the river channels.
7. Leaf area index: a global reprocessed leaf area index (LAI) from SPOT-VGT is available for the period 1999-2007 (http://wdc.dlr.de/data_products/SURFACE/LAI/). This LAI product is a global data set of 36 ten-day composites at the spatial resolution of the CYCLOPES products (1 km). For our analysis, a modified version of this product was used, which was upscaled to a spatial resolution of 10 km.
8. Climatic areas: we used the Köppen-Geiger climate map of the world (Peel et al., 2007) to distinguish the main climate areas: tropical, arid, temperate, cold and polar (see Table 2).
9. Presence of river ice: through the signal, the presence of river ice cover can also be detected in cold land regions. The Circum-Arctic map of permafrost and ground-ice conditions (Brown et al., 2002) was used here. Examples of these rivers are the Yukon and Mackenzie rivers in North America and the Lena River in Russia. As is the case on the ground, discharge under ice cover is left largely unmeasured, as both water area and stage are no longer responsive to discharge variation.
10. Dam location: hydraulic structures can disrupt the natural flow of water, and therefore may alter the expected performance of the satellite signal at that location. For this analysis the Global Reservoir and Dam (GRanD) (Lehner et al., 2008) data set was used.
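The upscaling of the river width map described in item 1 (mean of the river grid values in each 4 × 4 block) can be sketched as below. This is a minimal illustration and not the processing chain actually used: the function name `upscale_river_width`, the nested-list raster and the use of `None` to mark non-river cells are assumptions, as is the choice to exclude non-river cells from the block mean.

```python
def upscale_river_width(fine, factor=4, nodata=None):
    """Block-average a fine raster onto a grid `factor` times coarser.

    ASSUMPTION: cells equal to `nodata` are non-river cells and are left out
    of the mean, so a coarse cell holds the mean width of its river cells only.
    """
    rows, cols = len(fine), len(fine[0])
    coarse = []
    for r0 in range(0, rows, factor):
        coarse_row = []
        for c0 in range(0, cols, factor):
            # River cells inside the current factor x factor block
            values = [
                fine[r][c]
                for r in range(r0, min(r0 + factor, rows))
                for c in range(c0, min(c0 + factor, cols))
                if fine[r][c] != nodata
            ]
            # A block without any river cell stays nodata
            coarse_row.append(sum(values) / len(values) if values else nodata)
        coarse.append(coarse_row)
    return coarse
```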

Satellite signal extraction
In total, 398 locations for satellite-based measurement were selected which overlap spatially and temporally with available in situ stations providing daily measurements. Since satellites never pass directly over the same track at exactly the same time, the operational GFDS applies a 4-day forward-running mean to systematically calculate the signal; this also commonly fills any gaps from missing days (Kugler and De Groeve, 2007). Furthermore, for each observation site, the signal in the GFDS is calculated as the average signal of all measurement pixels under observation for each location (which can be one or more pixels) (GDACS, 2013). This is because, in some cases, even a 10 km pixel is not large enough as a measurement site and would entirely saturate with water during flooding; an array of measurement pixels is used instead. In this analysis, we used the signal values from the single pixel which contains the ground station, as well as a multiple-pixel selection. The latter includes, for each location, the pixel itself and also the three nearest neighbours on the 10 km × 10 km grid. In the case of multiple pixels, the signal value was calculated as the spatial median, average and maximum. Similar results were obtained globally when comparing the extracted signals (single or multiple pixels) with the in situ discharge observations. Therefore, we used the temporal and spatial averaging on the multiple-pixel array as in the operational GFDS. For each site, a visual check with Google Maps was carried out to ensure that the largest river section was included within the finalised measurement sites (see Fig. 2).
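The temporal and spatial averaging described above can be sketched as follows. The sketch rests on two assumptions that the text does not settle: the 4-day running mean is taken over each day and the three preceding days, and missing days (marked `None`) are simply skipped within the window. The function names are hypothetical.

```python
def running_mean(series, window=4):
    """Mean of each day and the (window - 1) preceding days.

    Missing days are given as None and are skipped, which also fills
    short gaps, as noted in the text.
    """
    out = []
    for i in range(len(series)):
        valid = [v for v in series[max(0, i - window + 1): i + 1] if v is not None]
        out.append(sum(valid) / len(valid) if valid else None)
    return out


def site_signal(pixel_series, window=4):
    """Average the smoothed daily signal of several pixels into one site signal."""
    smoothed = [running_mean(s, window) for s in pixel_series]
    site = []
    for day in zip(*smoothed):  # one tuple of pixel values per day
        valid = [v for v in day if v is not None]
        site.append(sum(valid) / len(valid) if valid else None)
    return site
```

Replacing the spatial mean in `site_signal` with a median or maximum would give the alternative aggregations that were also tested.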

Satellite signal calibration and validation
For those co-located ground stations and satellite measurement sites where both sets of data (signal and in situ discharge) were above 6 years in length, calibration and validation were performed using the ground information as reference. Several stations, mainly in North America, located close to man-made infrastructure such as weirs and generating stations were excluded from this analysis due to the rapidly changing behaviour of the in situ-observed discharge. Also, in a satellite-based approach to measure river discharge, the local river characteristics and floodplain channel geometry control the accuracy of rating curves, as is the case for gauging stations on the ground (Brakenridge et al., 2012; Khan et al., 2012; Moffitt et al., 2011). Thus we expect some measurement sites to exhibit a more robust response to discharge changes, and a higher signal-to-noise ratio, than others.
It has been acknowledged that, for large rivers, using the daily GFDS signal as a floodplain flow surface area indicator of discharge might result in a few days of lag when comparing with ground-based discharge (Brakenridge et al., 2013). Stage may immediately rise at a gauging station as a flood wave approaches, but flow expansion out into the floodplain requires some increment of time. This time lag may introduce error into the scatterplots used to calculate the rating equations and therefore lower the skill scores obtained when analysing both data sets. In addition, in previous studies (Khan et al., 2012; Zhang et al., 2013), it was observed that, in some cases, an overestimation of satellite-measured discharge existed during low-flow periods when using a single rating equation for the full period to calibrate signal into discharge units. For this reason, we decided to use a rating equation for each month individually. In this case the time series data for a fixed month can be treated as stationary, and the derived daily discharge values are also better adjusted during low-flow periods.
To calibrate the satellite signal into discharge measurements, the first 5 years of data were used for both satellite signal and ground discharge for each location. Regression equations were obtained using monthly means of daily values, and GFDS-measured discharge was derived from these.
For the sake of simplicity, the equations in this paper were restricted to linear equations. However, as the relation is purely empirical, we leave further research into a more flexible way to fit these relations as follow-on work. Note that fitting straight lines to curves will reduce goodness of fit and predictive accuracy. Power law fitting was also tested to calibrate the signal into discharge units, yielding similar results (see open discussion author's response no. 2).
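A minimal sketch of the monthly calibration is given below, assuming ordinary least squares for the linear rating equations and a record layout of `(year, month, signal, discharge)` per day. Both are assumptions made for illustration, and the function names are hypothetical. With 5 calibration years, each calendar month's line is fitted to five pairs of monthly means.

```python
from collections import defaultdict
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares slope and intercept (fails if all x are equal)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def calibrate_monthly(records):
    """Fit one linear rating equation per calendar month.

    records -- iterable of (year, month, signal, discharge) daily values
    Returns {month: (slope, intercept)}.
    """
    # Step 1: group daily values by (year, month)
    buckets = defaultdict(list)
    for year, month, s, q in records:
        buckets[(year, month)].append((s, q))
    # Step 2: one (mean signal, mean discharge) point per year for each month
    by_month = defaultdict(list)
    for (year, month), pairs in buckets.items():
        by_month[month].append((mean(p[0] for p in pairs), mean(p[1] for p in pairs)))
    # Step 3: regress discharge on signal separately for every calendar month
    return {
        m: fit_line([p[0] for p in pts], [p[1] for p in pts])
        for m, pts in by_month.items()
    }


def signal_to_discharge(equations, month, signal):
    """Apply the rating equation of the given month to a daily signal value."""
    slope, intercept = equations[month]
    return slope * signal + intercept
```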
The validation of the satellite-derived daily discharge data was carried out with daily in situ data over a 2-year period, and skill scores were calculated to quantify the agreement between satellite- and ground-measured discharge. We are aware of the limited number of years of data with available time series for both variables, which might influence the robustness of the calibration. In some cases there were longer time series available, but, in order to standardise the analysis for all the stations, we used 5 years for calibration (1998-2002 or 2003-2008 for northern stations with AMSR-E signal) and the following 2 years for validation purposes (2003-2004 and 2009-2010, respectively). Note that, for 36 out of the 322 stations available, data length was between 6 years and 3 months and almost 7 years. Validation was still carried out for the same period, but the data used for calibration were slightly reduced. As an example, Fig. 3a presents the scatterplot for the month of March for the Senanga station (long 23.25°, lat −16.116°) on the Zambezi River (Africa) with mean values derived from the period 1998 to 2002. For the same location, Fig. 3b shows the in situ-observed and the GFDS-measured discharge derived from the GFDS signal for the period 2003-2004.

Skill scores
The initial correlation of the remote sensing signal to in situ discharge was assessed for each station and site pair through the Pearson correlation coefficient (R). For the validation, the performance of the satellite-measured discharge was also assessed using the Nash-Sutcliffe efficiency (NSE) statistic in addition to the R skill score. Spearman's rank correlation coefficient (ρ) was also calculated to assess the validation performance.
One of the advantages of the R coefficient is that it is independent of the units of measurement, which permits the comparison of dimensionless GFDS signal data. A small value indicates a weak or non-linear relationship between the satellite signal and discharge. For this study, we grouped the computed R values into three ranges as follows: < 0.3, [0.3-0.7] and > 0.7. While Pearson benchmarks linear relationships, Spearman benchmarks monotonic relationships. The mean Spearman validation score was only 6 % higher than the mean Pearson score (see open discussion author's response no. 2). In this manuscript, results are analysed based on the scores obtained using the Pearson correlation coefficient.
Nash-Sutcliffe efficiency (NSE) (Nash and Sutcliffe, 1970) is typically used to assess the predictive power of hydrological models and was calculated here to describe the accuracy of satellite-derived discharge in comparison to gauge-observed discharge values. Higher values of the Nash-Sutcliffe statistic should indicate more correlated results, without other factors taken into account, such as autocorrelation (Brakenridge et al., 2012). However, the degree of correlation of these variables does not verify the discharge magnitudes (Brakenridge et al., 2013). An NSE value of 1 corresponds to a perfect match of modelled to observed data, whereas NSE = 0 indicates that the model predictions are as accurate as the mean of the observed data. The resulting scores are classified as in Zaraj et al. (2013): < 0, [0.2-0.5], [0.5-0.75] and > 0.75.
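For reference, the three skill scores can be written out in a few lines of Python using their standard textbook definitions. The function names are chosen here for illustration, and ties in the Spearman ranking are given average ranks, which is a common convention rather than something stated in the text.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient (linear association)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect match, 0 equals the observed mean."""
    mo = mean(observed)
    num = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    den = sum((o - mo) ** 2 for o in observed)
    return 1 - num / den
```

A monotonic but non-linear pair such as `[1, 2, 3, 4]` and `[1, 4, 9, 100]` gives ρ = 1 while R stays below 1, which is the distinction between the two coefficients drawn above.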

Factors affecting the satellite signal
Understanding the influence of local factors on the accuracy of the satellite flood detection is critical for practical use of the remotely sensed signal. We analysed the effects on accuracy of river width, mean daily discharge, upstream catchment area, presence of large floodplains, flooded forest and wetlands, potential flood extent, land cover type, LAI, climatic areas, presence of river ice and hydraulic structures. To assess their influence, the fractional coverage over the measurement site was retrieved for variables with spatial coverage.
First, we use the skill scores (R and NSE) obtained from a simple analysis for each individual factor or variable. Second, we seek to understand which of the surface variables have the greatest importance in determining sites with a good or poor performance. For this purpose, we use a decision tree technique called random forest (RF). Among other features, this allows for ranking of the relative importance of each variable. The technique is described by Breiman (2001) and implemented in R by Liaw and Wiener (2002), where the reader is referred for a more detailed explanation. As a summary of the RF algorithm, ntree bootstrap samples are randomly selected from the data set; a different subset is used for each bootstrap; and for each sample a tree is grown, obtaining ntree trees. RF is called an ensemble method because it applies the method for a number of decision trees, in this case 500, in order to improve the classification rate. Some stations are left out of the sample (out of bag, oob) and used to gain an internal unbiased estimate of the generalisation error (oob errors) and to obtain estimates of the importance of the variables (Breiman, 2001). These values are averaged over the ntree trees. For the variables classification, the node impurity is measured by the Gini index. Gini's mean difference was first introduced by Corrado Gini in 1912 as an alternative measure of variability. One of the parameters derived from it, the Gini index, is also referred to as the concentration ratio (Yitzhaki and Schechtman, 2013). The Gini index is mostly popular in economics; however, it is also used in other areas, such as building decision trees in statistics to measure the purity of possible child nodes, and it has been compared with other equality measures (Gonzalez et al., 2010). The variables with larger decreases in Gini values (lower Gini) are those with higher importance in the classification analysis.
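To make the role of the Gini index concrete, the sketch below computes the Gini impurity of a set of class labels and the decrease obtained by one candidate split of a single variable. It illustrates the node-level quantity only and is not the RF implementation of Liaw and Wiener (2002): a random forest accumulates such decreases over all splits of a variable and averages them over the ntree trees. Function names and the example values are hypothetical.

```python
def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(values, labels, threshold):
    """Impurity decrease when splitting the stations at `values <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    # Child impurities are weighted by the share of stations in each child
    weighted = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n
    return gini_impurity(labels) - weighted
```

For four hypothetical stations with mean discharges `[100, 200, 600, 900]` and performance labels `[False, False, True, True]`, a split at 500 separates the classes perfectly and yields the largest possible decrease for a balanced two-class node (0.5), whereas a split at 150 yields a smaller one.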
Although the information is hidden inside the model structure for "black-box models" such as RF, the prediction power is high (Palczewska et al., 2013). This method is relatively robust given outliers and noise because it uses randomly chosen subsets of variables at each split of each tree (Breiman, 2001; Chan and Paelinckx, 2008). To further increase robustness, Strobl et al. (2009) state that results from the RF and conditional variable importance should always be tested by doing multiple RF runs using different seeds and sufficiently large ntree values to obtain robust and stable results.
The quality index chosen to rank variable importance and classify good or poor locations in the RF analysis was the NSE score. A threshold of NSE = 0 splits the data into two groups, with about 50 % of the data above (true or good predictive) and below (false or poor predictive) that value of NSE. The results presented here are the average of 200 runs. Furthermore, four different training sets were drawn as a random 70, 75, 80 or 90 % of the stations and were validated with the remaining 30, 25, 20 or 10 % of stations, respectively.
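The labelling and the repeated random splits can be sketched as follows. This only prepares the inputs of a classifier and does not train one. The function names are hypothetical, and the treatment of a score of exactly 0 (labelled as poor here) is an assumption, since the text only distinguishes values above and below the threshold.

```python
import random


def label_sites(nse_scores, threshold=0.0):
    """True for good predictive sites (NSE above the threshold), False otherwise."""
    return [score > threshold for score in nse_scores]


def random_split(n_sites, train_fraction, seed):
    """Return (training, validation) station indices for one seeded random split."""
    rng = random.Random(seed)  # a different seed per run gives a different split
    indices = list(range(n_sites))
    rng.shuffle(indices)
    cut = round(n_sites * train_fraction)
    return indices[:cut], indices[cut:]
```

Calling `random_split` with 200 different seeds for each of the training fractions 0.70, 0.75, 0.80 and 0.90, and averaging the resulting importance rankings, would mirror the repetition scheme described above.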

Results and discussion
As a first step we analysed the relationship between the satellite signal and the in situ-observed discharge to have an initial understanding of the performance between the two data sets (Sect. 4.1). Then we calibrated the satellite signal with in situ discharge data. With the regression equations obtained, we calculated satellite discharge measurements. A 2-year validation was carried out for each station using the skill scores as described in Sect. 3.3 (Sect. 4.2). This was followed by an assessment of how different variables contribute in a positive or negative way to the overall skill (Sect. 4.3). Variables included in the analysis are daily mean river discharge, river width, upstream catchment area, potential flood hazard area, land cover, LAI, climatic zones, presence of large floodplains, flooded forest and wetlands, river ice and hydraulic structures. Finally, the relative importance of all variables in comparison to each other has been assessed (Sect. 4.4).
Before analysing the validation results, it is important to highlight two possible sources of error which might influence the outputs. Firstly, the signal-to-noise ratio might be low for a site, or intermittent instrument noise might occasionally produce positive spikes in discharge. Secondly, the rating curve may be offset, which will result in a consistent bias in the discharge values for that location even though the time series are strongly correlated.

Correlation of raw satellite data vs. gauge observations
The first step was to look at the "raw" correlation between daily ground-station-measured water discharge and the satellite signal and to calculate the empirical linear relation between these two variables for each site. The full time series, including low flows, were used for the calculation, which was executed for 398 stations. Figure 4 shows the R skills obtained. Of a total of 398 sites, 169 have an R > 0.3 and 42 of them have R > 0.5. Correlations might have been higher if the regression had not been restricted to linear equations (Brakenridge et al., 2007, 2012).

Satellite signal calibration, validation and evaluation through skill scores
For the stations with over 6 years of contemporary data for both in situ discharge and satellite signal, we obtained regression equations for each month of the year and station using the first 5 years of data. Next, using these equations, we carried out a calibration of the daily signal into discharge units. Afterwards, the validation of the GFDS-measured discharge was implemented for the following 2 years. In some regions, such as northern Asia, the lack of available recent long time series (after 2002) meant that the number of stations available for calibrating the satellite signal into discharge measurements was reduced. Stations where the number of years matching observed discharge and satellite signal was shorter than 6 years were excluded from the validation exercise despite performing well. Finally, out of 398, a total of 332 stations remained for calibration and validation.
For the NSE score, Fig. 5 shows that 154 out of 332 stations have values larger than 0: 13 located in Africa, 77 in North America, 62 in South America, 1 in Asia and 1 in Europe. Nevertheless, it needs to be noted that, in arid regions, results calculated with skill scores such as NSE are penalised by low average discharge compared to high-flow conditions. If, instead of using all the available time series, a "dry stream" threshold had been applied, the scores obtained for these sites could have been higher when analysing the remaining data set period where flow is present.

River width and presence of floodplain and wetlands
As a first step to analyse the potential relationship between the individual local characteristics and the performance of the locations in global terms, we study the validation R score for the 322 stations in relation to the maximum river width at each location (Fig. 6a). Results indicate that locations with a river width greater than 1 km are more likely to score an R larger than 0.3. For these locations, the mean R score is 0.60 and 26 out of 64 (∼ 41 %) have R > 0.75. However, a number of stations with a smaller river width also obtained high scores. This is because the retrieval of the satellite signal also depends on the floodplain geometry: as soon as the river floods and water goes over-bank, the proportion of water in the wet pixel greatly increases. Thus the score should also be high for small rivers with a proportionally large floodplain.
Figure 6b shows the R scores by location, grouped according to whether the majority of the area belongs to the floodplain, flooded forest or wetland categories, or to none of them. In our study, higher median scores were obtained for locations in large freshwater marshes and floodplains, followed by those in swamps and flooded forests.
These results give a first indication of the characteristics of the locations with better performance.

River discharge and potential flooding
Flooding is determined by the discharge as well as the potential flood hazard. Figure 7a shows that 84 out of 95 stations with R < 0.3 also have mean discharge values lower than 500 m³ s⁻¹ (log10(500) ≈ 2.7), of which 55 stations had a mean discharge lower than 200 m³ s⁻¹. These stations are mainly located in South Africa and in some areas of North America. Therefore, the mean discharge can be considered a key variable that determines the suitability of locations for deriving satellite discharges: as 77 % of the stations with Q < 500 m³ s⁻¹ have R < 0.3, while 91.5 % of the stations with Q > 500 m³ s⁻¹ have R > 0.3, locations with a discharge of less than 500 m³ s⁻¹ might not provide reliable results for a global satellite-based monitoring system. Alternatively, non-permanent rivers and streams exhibiting only seasonal or ephemeral flow (typical for dry regions) may require a different monitoring approach, wherein a "dry" threshold is established for the signal data.
After excluding the global stations with low skill scores due to low flows and studying the remaining stations, we can better understand the performance of the system in relation to other local characteristics. Figure 7b shows, for each location, the relationship between the validation R and the percentage of the pixel area covered by potential flooding during a 100-year return period flood event, obtained with the model grid (HTESSEL + CaMa-Flood) (downscaled from a 25 km × 25 km pixel; Pappenberger et al., 2012). A value of 100 means that the cell is completely flooded, 50 means that 50 % of the area within the cell is flooded, and 0 means that the area is not flooded. Although there is no clear trend across all the points, results indicate that locations with a potential flooded area larger than 40 % are expected to score an R larger than 0.3.

Land cover types and climatic areas
Figure 8 presents a global evaluation of the R score obtained during the validation and its classification by the land cover type of the stations. The bare land cover category was excluded from this study as only one of the selected locations belongs to that class. Looking at the medians of the box plots (see Fig. 8), we found that some of the locations with a higher density of vegetation, such as those located in "closed forest" and "mosaic with predominant vegetation" (including forest, scrublands and grasslands), obtained lower median scores. In contrast, the locations with lower vegetation density, such as the "sparse vegetation", "mosaics with predominant cropland/grasslands", "open forest" and "closed to open forest" land cover types, obtained larger median R scores, around 0.6-0.8. Similar results can be observed when looking at the interquartile range or spread of the box plots: "closed to open forest" and "mosaics with predominant cropland/grasslands" obtained better results, whereas "closed forest" and "mosaic with predominant vegetation" had lower scores. In addition, sites with a combination of sparse vegetation and crops growing near the river channel had a lower median value than those with sparse or mosaic crop land cover. Note that the sites denoted "sparse with crops" are located in arid climatic areas, whereas most of the "sparse" sites are in cold or polar regions and are therefore governed by different processes. In addition, sites with a majority of artificial/urban land cover (not shown) obtained a low median value of 0.267.
The relationship between the main Köppen-Geiger climatic areas (Peel et al., 2007) of the locations and the R score obtained is shown in Fig. 9. Globally, the tropical regions (Africa and South America) obtained the highest median scores (R ≈ 0.8), followed by cold regions (R ≈ 0.6). Lower median score values (R ≈ 0.3) were obtained for arid and temperate regions. It is important to clarify that these results are not only due to direct climate characteristics but also, for example, due to the characteristics of the rivers in those areas. In the case of the arid regions, it is mainly related to reduced daily average discharges, a characteristic of many of these stations. Note that polar climate was excluded from this evaluation as only three locations belong to that class.

Leaf area index (LAI)
LAI values typically range from 0 for bare ground to 6 or above for a dense forest; however, CYCLOPES underestimates LAI over dense vegetation (forest) (Zhu et al., 2013). Therefore, for this product the LAI range is limited to [0-4], as seen in our analysis. Despite this, CYCLOPES is the product most similar to the LAI reference maps (ibid.). According to the study carried out by Zhu et al. (2013) [...]
We decided to study the relationship between the mean LAI and the skill obtained in the validation for each location, also looking at complementary variables such as the land cover and the geographical region to which the stations belong. Figure 10 shows that locations with a mean [LAI > 2] predominantly have a "closed to open forest" type in South America (31 stations), of which 29 have an R score higher than 0.6. For [LAI > 2], there are also 12 North American locations with "closed forest" land cover, but in general scores are poorer for those locations. Additionally, 18 stations with mosaic vegetation from North and South America obtained [LAI > 2], and 16 of those obtained [R > 0.6]. For [LAI < 2], both the land cover types and the geographical locations are spread across the scatterplots, from poor to high correlations.

River ice
Figure 11a shows the scores obtained for the locations with presence or absence of river ice, including a range from continuous to sporadic (Brown et al., 2002). Stations located in areas with river ice tend to have a good correlation between in situ- and satellite-measured discharge (based on 33 stations), as the system tends to capture the annual spring ice break-up and the freezing well, as indicated in the studies by Brakenridge et al. (2007) and Kugler (2012). At these locations, once the river is ice-covered, the system has no sensing capability and the retrieved signal may seem analogous to low-flow conditions. However, there is an important difference between the signal time series of ice-covered high-latitude rivers and those of all-year-round low-flow rivers. When an ice-melting process takes place, an increase in river runoff occurs, and for many places this translates into a strong change in the signal values. For the other types of rivers, low flows are generally characteristic of most of the year, and if the signal-to-noise ratio is low, the signal retrieved is very noisy, which is one motivation for setting a "dry" threshold for such sites.

Hydraulic structures
The correlation between satellite and discharge data depends on both variables. Typically it is assumed that observed discharges are "ground truth"; however, when influenced by structures and dams, the ground discharge may not be well monitored with regard to flow area/flow width variation. For example, when there is a major increase in river discharge but a flood is avoided by artificial levees, we cannot expect the satellite signal to accurately capture the flood hydrograph; moreover, downstream flooding may be attenuated by an upstream flood control dam and reservoir, and thus the gauge location is critical. Figure 11b shows the influence of the presence or absence of a nearby dam using the Global Reservoir and Dam (GRanD) database (Lehner et al., 2008) or visually identified hydraulic control infrastructure. Locations where a dam or other structure was present (48 stations) obtained lower median R scores. Therefore, ideally, observation sites should be located in areas without hydraulic control infrastructure.

Variable importance
Based on the individual analysis of the factors that potentially influence the signal, we found that, in order to understand site performance, multiple variables sometimes need to be analysed simultaneously. For example, the generally low scores obtained at the eastern USA stations might be due to a number of factors: ∼ 64 % of these stations have a mean discharge lower than 500 m³ s⁻¹ and ∼ 88 % are located where the river width is lower than 1 km. In addition, ∼ 59 % of the stations are located in wetland areas. Another example of the importance of analysing several factors can be seen in the 11 stations which obtained a low R although their mean observed discharge is higher than 500 m³ s⁻¹. All of them have a potential probability of flooding lower than 21 %, the land cover for 10 out of 11 is forest, 5 of them are located in wetlands, and 2 of them have a nearby hydraulic structure. Despite a mean discharge greater than 500 m³ s⁻¹, these other local characteristics may be the cause of the poor performance.
Therefore, we decided to use a classification technique based on decision trees, random forest (RF), which splits the data set at each node according to the value of one variable at a time (the best split) from a selected set of variables, so as to understand the importance of each variable. RF is called an ensemble method because it builds a number of decision trees, in this case 500, in order to improve the classification rate.
The result presented here is the ranking of the variables by their importance for classifying a location as having good or poor performance. These values are obtained as an output of the RF analysis and are, in addition, the average of 200 independent runs. As explained in Sect. 3.4, the variable importance based on the mean decrease in Gini index was calculated for the NSE score obtained from the validation. We used NSE = 0 to distinguish between sites with good (above 0) and poor (below 0) performance, and we also tested a threshold NSE of 0.50.
Figure 12 presents the variable importance for the four test groups. Features which produced large values of the "mean decrease in Gini" are ranked as more important than features which produced small values. For our locations and the data available, the mean daily observed discharge has the highest importance, followed by the climatic region, land cover/mean LAI and upstream catchment area. At the same time, the presence of hydraulic structures (mainly dams) and of river ice have the lowest importance for classifying a location as having good or poor performance. However, this does not mean that they have no influence. Although discharge is correlated with upstream catchment area, and to some degree LAI with land cover type, both were included in this case to understand which variable might help us most to classify the sites.
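To make the "mean decrease in Gini" measure more concrete, the sketch below computes the decrease in Gini impurity produced by a single split on one variable for a binary good/poor classification. It is a simplified illustration and not the RF implementation used in this study: an RF accumulates such decreases over the nodes that split on a given variable and averages them over the trees. The station values and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of boolean labels (True = good performance)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_good = sum(labels) / n
    return 2.0 * p_good * (1.0 - p_good)  # equals 1 - p^2 - (1 - p)^2


def gini_decrease(values, labels, threshold):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
    return gini(labels) - weighted


# Hypothetical stations: mean discharge (m^3 s^-1), dam presence (0/1), NSE > 0.
mean_discharge = [100.0, 200.0, 300.0, 800.0, 1200.0, 2500.0]
has_dam = [0, 1, 0, 1, 0, 1]
good = [False, False, False, True, True, True]

split_on_discharge = gini_decrease(mean_discharge, good, 500.0)
split_on_dam = gini_decrease(has_dam, good, 0.5)
```

In this toy example the split on mean discharge separates the two classes completely and therefore yields a larger decrease than the split on dam presence, which is the sense in which one variable ranks as more important than another.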
Although the effect of the correlations on these measures has been studied recently (see Archer and Kimes, 2008; Strobl et al., 2009; Nicodemus and Malley, 2009; Nicodemus et al., 2010; Nicodemus, 2011; Auret and Aldrich, 2011; Tolosi and Lengauer, 2011; Grömping, 2009; Gregorutti et al., 2013), there is not yet a consensus on the interpretation of the importance measures when the predictors are correlated, or on the effect of this correlation on the importance measure. In order to test the effect on the results when correlated variables were included in our analysis, an independent RF analysis was carried out (not shown in this paper) for the same variables but excluding the river width and the presence of floodplains and wetlands variables. Results also showed that the mean daily observed discharge had the highest importance and that the presence of hydraulic structures (mainly dams) and river ice had the lowest importance for classifying a location as having good or poor performance.

Conclusions and future research
In this article we presented an evaluation of the skill of the Global Flood Detection System in measuring river discharge from the remote sensing signal. For the 322 stations validated, the average continental R skill scores are as follows: Africa 0.382, Asia 0.358, Europe 0.508, North America 0.451 and South America 0.694. Approximately 48 % of these stations have an NSE score higher than zero: 13 located in Africa, 77 in North America, 62 in South America, 1 in Asia and 1 in Europe. Results showed that the low skill scores obtained by stations were, in the majority of cases, due to low-flow conditions. For example, 84 out of 95 stations with R < 0.3 have mean discharge values lower than 500 m³ s⁻¹. These are located mainly in South Africa (25 cases) and North America (53 cases), which penalised the average skill of those continents. Note that our focus was on factors affecting the method globally, and that these skill values do not directly indicate measurement accuracy at a site (which could be improved, for example, by use of non-linear rating equations and/or accommodation of any phase shift or timing differences in flow-area- versus stage-based discharge monitoring).
In order to better understand the impact of the local conditions on the performance of the sites, we first looked at specific factors individually. In general terms, higher skill scores were obtained for locations with one or more of the following characteristics: a river width greater than 1 km; a large floodplain area; a location in flooded forest; a potential flooded area per pixel greater than 40 % during a 100-year return period flood event; a land cover type of sparse vegetation, croplands or grasslands, or closed to open and open forest; LAI above 2; a location in a tropical climatic area; and a location where no dams or hydraulic infrastructure are present. Also, among our locations, high-latitude rivers with seasonal ice cover tend to exhibit good performance.
Secondly, we performed a classification decision tree analysis, based on RF, to obtain the variable importance when classifying a site as good or poor. The output of this analysis showed that mean observed discharge, climatic region, land cover and mean LAI, and upstream catchment area were the variables with the highest importance, whereas river ice and dam presence obtained the lowest importance. Both the individual and the combined classification analyses of these local characteristics give us critical evidence of the relationship between the ground and satellite discharge measurements and of the conditions under which the system is expected to perform well. Furthermore, they provide a guideline for the future selection of measuring sites.
The locations with a very good performance will be selected for a potential future project where satellite-measured discharge could be calculated for longer periods and on a daily basis from the remote sensing signal, analogous to the Dartmouth Flood Observatory method. This will represent a major step forward in developing continental and global hydrological monitoring systems, as these data can fill the gaps where real-time ground discharge measurements are not available (the case at many locations globally). We found that some of the sites with good performance are located within international river basins such as the Niger, Volta and Zambezi in Africa. In addition, for the studied locations with good signal performance but rather short concurrent time series of in situ-observed discharge (such as the Siberian stations), the calibration of the signal to obtain discharge measurements could be carried out at any point when additional ground data become available. This will also be beneficial for all stations, including those with time series longer than 7 years.
Zhang et al. (2013) recently demonstrated the potential of integrating the satellite signal provided by the Global Flood Detection System to improve flood forecasting. This first attempt at data assimilation was carried out for a single station (Rundu, northern Namibia, included in this study) with the conceptually simple Hydrological MODel (HyMOD). Hence, a prospective study could include all these stations for post-processing, through data assimilation and error correction, of the streamflow forecast in hydrological models, for instance for the pre-operational Global Flood Awareness System (GloFAS) (Alfieri et al., 2013) and the African Flood Forecasting System (AFFS) (Thiemig et al., 2014), in a way analogous to what is already being done with ground-gauge-observed streamflow in the European Flood Awareness System (Bartholmes et al., 2009; Thielen et al., 2009). Work towards the integration of global flood detection and forecasting systems, such as GFDS and GloFAS, respectively, can thus provide more comprehensive information for decision makers.

Figure 1. Location of selected stations (398) and corresponding river basins (109). TRMM and AMSR-E brightness temperature product extents are also provided.

Figure 2. Example of a measurement site: Caracarai station (Rio Branco catchment, Brazil). The blue rectangles outline the measurement pixels and the background image is from 2014 (Google; Landsat, DigitalGlobe).

Figure 3. (a) Scatterplot for the Senanga station (long 23.25°, lat −16.116°) on the Zambezi River (Africa). Monthly mean for March from 1998 up to 2002. (b) Validation hydrograph for 2003-2004 and skill scores for Senanga. The (monthly) rating equations were used to calibrate the signal into discharge units. Different rating equations were used for different months.

Figure 4. Location of stations and R skill score between in situ-observed discharge and satellite signal (4 days and 4 pixels average). Globally, 169 sites have R > 0.3, of which 42 have R > 0.5.

Figure 6. (a) Relationship between R obtained from the validation of satellite-measured discharge and the maximum river width for each location. (b) Relationship between the same R score and the presence of significant floodplains, flooded forest and wetlands. The horizontal dotted lines show the R = 0.3 and R = 0.7 thresholds, and the vertical line marks a river width of 1 km.

Figure 7. (a) Relationship between R obtained from the validation of satellite-measured discharge and the mean in situ-observed discharge (log10 displayed) for each station. (b) Relationship between the same R score and the potential percentage of flooded area per pixel for a 100-year return period flood event (Pappenberger et al., 2012). The horizontal dotted line shows the R = 0.3 threshold, and the vertical line is the 40 % potential flooding threshold.

Figure 8. Global evaluation of the R score obtained during the validation and its classification by the land cover type of the stations. Land cover types were aggregated from GlobCover (2009) and modified by means of a visual check with Google Maps. Note that artificial and bare land cover were excluded from this figure.

Figure 9. Global evaluation of the R score obtained during the validation and its classification (main types only) by Köppen-Geiger climate area (Peel et al., 2007). Note that polar climate was excluded from this analysis as only three stations fell into this category.

Figure 10. Evaluation of the R score obtained during the validation and its classification by LAI according to (a) land cover and (b) geographical region (continent).

Figure 11. Evaluation of the R score obtained during the validation and its classification by (a) presence or absence of river ice (Brown et al., 2002), and (b) presence or absence of a nearby dam or hydraulic control infrastructure using the Global Reservoir and Dam (GRanD) database (Lehner et al., 2008) and a visual check with Google Maps. For the validated locations, it is worth noting that all stations with river ice (33) and most of those with dams (34 out of 48) are located in North America.

Figure 12. Average variable importance of 200 runs using the RF methodology. The Nash-Sutcliffe score was chosen as a quality index for categorising the stations as true (good predictive) or false (poor predictive). With a threshold of NSE = 0, we have about 50 % of the stations above and 50 % below that value. Results are shown for the different training and test groups. For all the test groups and runs, the average highest variable importance was obtained for mean observed discharge, climatic region, land cover/mean LAI and upstream catchment area, and the lowest for dam/hydraulic structure presence and river ice.

Table 1. Number of catchments by continent and range of upstream areas for the located stations.

Table 2. Climate and land cover type of the 322 sites selected for the calibration and validation, aggregated by continent, climate and land cover.