Hydrological models can provide estimates of streamflow pre- and
post-observations, which enable greater understanding of past hydrological
behaviour, and potential futures. In this paper, a new multi-objective
calibration method was derived and tested for 303 catchments in the UK, and
the calibrations were used to reconstruct river flows back to 1891, in order
to provide a much longer view of past hydrological variability, given the
brevity of most UK river flow records which began post-1960. A Latin
hypercube sample of 500 000 parameterisations for the GR4J model for each
catchment were evaluated against six evaluation metrics covering all aspects
of the flow regime from high, median, and low flows. The results of the top
ranking model parameterisation (LHS1), and also the top 500 (LHS500), for
each catchment were used to provide a deterministic result whilst also
accounting for parameter uncertainty. The calibrations are generally good at
capturing observed flows, with some exceptions in heavily groundwater-dominated catchments, and snowmelt and artificially influenced catchments
across the country. Reconstructed flows were appraised over 30-year moving
windows and were shown to provide good simulations of flow in the early
parts of the record, in cases where observations were available. To consider
the utility of the reconstructions for drought simulation, flow data for the
1975–1976 drought event were explored in detail in nine case study catchments.
The model's performance in reproducing the drought events was found to vary
by catchment, as did the level of uncertainty in the LHS500. The
Standardised Streamflow Index (SSI) was used to assess the model
simulations' ability to simulate extreme events. The peaks and troughs of
the SSI time series were well represented despite slight over- or
underestimations of past drought event magnitudes, while the accumulated
deficits of the drought events extracted from the SSI time series verified
that the model simulations were overall very good at simulating drought
events. This paper provides three key contributions: (1) a robust
multi-objective model calibration framework for calibrating catchment models
for use in both general and extreme hydrology; (2) model calibrations for the
303 UK catchments that could be used in further research, and operational
applications such as hydrological forecasting; and (3)
Hydrological extremes and associated flood and drought events threaten the security of water supply, food supply, livelihoods and welfare (Kundzewicz and Matczak, 2015). Managing the impacts of both rainfall excess and deficit on the hydrological system poses a significant challenge for authorities and water resource managers across the globe. These challenges are set to become more acute in future: the latest projections for Europe suggest increasing hydrological variability with more severe extremes (Collet et al., 2018; Guerreiro et al., 2018; Teuling, 2018) and further reductions in low flows in many regions (Wilby and Harris, 2006; Christierson et al., 2012; Prudhomme et al., 2012; Kay et al., 2018; Marx et al., 2018). Increasing demands due to a growing population and socio-economic changes also imply growing pressures on water resources in the future, necessitating considerable investment in long-term strategic water resources planning and adaptation (HM Government, 2017).
Understanding extremes of the past can help us prepare for future extreme events. Drought characteristics of events in the recent past can be used to stress test water supply systems (Mens et al., 2015), a practice that is commonly applied in UK water resource management and drought plans (e.g. Southern Water, 2013; Northumbrian Water, 2017). Similarly, drought severity estimates of past events have been used to investigate the impacts of increased drought frequency on water supply vulnerability (Herman et al., 2016). There is a growing trend towards testing water supply systems against events worse than those experienced, using either scenario-based methods (e.g. Stoelzle et al., 2014; Anderton et al., 2015) or stochastic approaches to generate simulated droughts with credible characteristics (e.g. Atkins, 2016). In addition, short-term water management planning can benefit from seasonal forecasting of reservoirs inflows and streamflow volumes (Prudhomme et al., 2017), so that periods of water deficit can be known in advance and appropriate measures put in place to manage resources and mitigate impacts. However, these methods are all dependent on having a good understanding of past variability and long hydrometric records which are used to train and validate stochastic approaches, and to create tools that enable the simulation of river flows as accurately as possible under a range of varied climate conditions.
Observations of global streamflow are sparse prior to the 1950s, with less than 20 % of stations in the Global Runoff Data Centre (GRDC, 2019) beginning pre-1950. Post-1960, the streamflow network expanded rapidly, a pattern that is mimicked by the UK gauging network, where 100 gauging stations in 1950 have increased to over 1300 today. Qualitative data sources and long rainfall records can identify significant drought events in the pre-instrumentational period (Pfister et al., 2006; Marsh et al., 2007; Brázdil et al., 2016). However, these cannot be used to determine whether these events were more or less severe in hydrological terms than those on the observational record, and there is a need for temporally and spatially coherent flow time series to allow systematic assessment of extreme events.
Meteorological records of rainfall and temperature generally extend further back than hydrological data, often providing data from the turn of the 21st century (New et al., 2000), and occasionally as far back as the mid-20th century. Modelled climate reanalysis data (e.g. Compo et al., 2011) and long-term reconstructed climate datasets (e.g. Casty et al., 2007) have been developed for use in scientific research and can be fed into hydrological models to quantitatively reconstruct river flows beyond the limits of the observational period. In the UK, quantitative reconstructions of river flows using simple hydrological models have previously been conducted, but only for a handful of catchments (e.g. Jones and Lister, 1998; Jones et al., 2006). Regional flow reconstructions have been used to explore the implications of drought events on water resources (e.g. Spraggs et al., 2015). Drought reconstruction has also been conducted in other countries using proxy data (Jones et al., 1984; Cook et al., 2015), precipitation data (Noone et al., 2017, Ireland), soil moisture models (Wu et al., 2011, China), and hydrological models (Caillouet et al., 2017, France). Generally, however, there are few extant studies that use hydrological models to derive plausible historical sequences.
Catchment hydrological models are tools that can generate streamflow time series from meteorological time series data, to provide continuous proxy river flow data that are otherwise not directly available. They can be used to extend flow records, creating very long sequences that extend back beyond the initiation of the observational network. Such long time series can enable thorough analysis of past variability and frequency of severe events (e.g. Caillouet et al., 2017); as vital input to short range and seasonal forecasting (Day, 1985; Harrigan et al., 2018), providing valuable early warnings and help preparedness; or for future projections for long-term planning accounting for possible future non-stationarity, for example due to global warming (e.g. Collet et al., 2018).
Calibrating a hydrological model for multiple purposes, e.g. flow reconstruction and forecasting, for high, low, and average flows, requires careful consideration. Currently, models are typically calibrated to minimise a specific type of error against observations, measured by an “evaluation metric” also known as an “objective function”. Commonly used metrics, such as the Nash–Sutcliffe efficiency (Nash and Sutcliffe, 1970) or root mean squared error, tend to focus on the correct estimation of high flows (Krause et al., 2005; Dawson et al., 2007), whilst more general metrics, such as mean absolute percent error and percent bias are also used to more systematically optimise the flows and the water balance respectively. There are few examples focusing on optimising low flow simulation. Most commonly, a single objective function is used, implemented using automatic algorithms to find a deterministic parameterisation of the model. Such algorithms are commonly categorised as “local” (e.g. PEST, Kim et al., 2007) or “global” (e.g. SCE, Duan et al., 1993), some examples of which have been compared by Wallner et al. (2012). However, seeking a single “optimum” parameter set to describe the observations has been argued to be a misconception with theoretical catchment models (Beven, 2012). The need for calibration techniques to maximise hydrological model performance against multiple elements of the flow regime has, however, been recognised, and multi-objective optimisation methods have been advancing since the turn of the century, though few studies explore more than three objectives (Efstratiadis and Koutsoyiannis, 2010). Multi-objective optimisation commonly involves seeking Pareto-optimal solutions that find a compromise between objective functions (e.g. Shafii and De Smedt, 2009; Kamali et al., 2013; Jung et al., 2017). Multi-objective methods may also be used to optimise more than one hydrological variable (e.g. Mostafaie et al., 2018). In addition, utilising multiple model parameterisations have been advocated to account for “equifinality” – that many different parameterisations may produce equally adequate simulations of past observations (see, for example, Beven and Binley, 1992; Beven, 2006).
Here, we develop a framework to establish a national network of catchment
hydrological models and evaluate their application to the reconstruction of
hydrological time series, with application to the UK over the period 1891 to
2015. The aims of this research are to
develop a robust method for multi-objective model calibration suitable for
use in simulating streamflow with associated uncertainty, apply that method to reconstruct historic streamflow time series from the
1890s across the UK, examine the performance of these time series where observations are
available, and explore the potential for application of these time series in evaluating
drought events.
This paper first outlines the datasets in Sect. 2, before detailing the
modelling methods in Sect. 3. Section 4 provides the results of the
performance of the model reconstructions compared with streamflow
observations, both generally and during drought events. Section 5 discusses
the potential limitations of this work, and suggests directions for further
research, before the paper is concluded in Sect. 6.
The hydrological model employed in this study (see Sect. 3.1) requires rainfall and potential evapotranspiration data to run, and observed flow data for calibration and validation. Means of access to the datasets used in this study are described in the Data availability section at the end of the paper.
A diverse set of 303 UK catchments were selected for model calibration.
Initially, 395 stations were considered, from the near-natural catchments
suitable for low flow analysis from the UK Benchmark Network
(Harrigan et al., 2017), and those which are part of the National
Hydrological Monitoring Programme (
Map of 303 catchments calibrated. Nine reconstruction case study catchments (one per region) are shown with black hatching.
The daily rainfall dataset used in this study was derived by the UK Met Office as a result of a large data rescue and digitisation programme. The 5 km gridded dataset, which covers the period 1891 to 2015, was derived using the same methodology as the UKCP09 data (Met Office, 2017), with interpolation carried out using inverse distance weighting (Perry and Hollis, 2005). The data rescue and digitisation programme added over 200 monthly and 38 daily gauges to the network during the period 1890 to 1910. Catchment averages were derived from the 5 km grids, using the catchment boundaries provided from the National River Flow Archive, for use in the hydrological model.
As the meteorological variables needed to derive potential evapotranspiration (PET) data using the Penman–Monteith equation (Monteith, 1965) are not available prior to 1961, the PET data used for the reconstructions was derived using the McGuinness–Bordne temperature-based PET equation (McGuinness and Bordne, 1972), calibrated for the UK. The temperature data for 1891–2015 were again provided by the UK Met Office following their data rescue programme. A detailed description of the generation of the PET dataset used in this study, following a rigorous analysis of seven temperature-based PET equations, four calibration techniques, and seven input temperature data sources or formats, can be found in Tanguy et al. (2018).
The GR4J (Génie Rural à 4 paramètres Journalier) daily lumped
rainfall-runoff model (Perrin et al., 2003) was used in this study
via the “airGR” R package version 1.0.2 (Coron et al.,
2017). The suite of daily GR models (GR4J, GR5J, and GR6J) is being
increasingly applied around the world, and GR4J was chosen for several
reasons:
GR models have been used for streamflow reconstructions previously
(Brigode et al., 2016; Caillouet et al., 2017). The GR4J model has demonstrated good performance in a diverse set of
catchments in the UK (Harrigan et al., 2018), as well
as good performance at simulating temporal transitions between wet and dry
periods (Broderick et al., 2016). The GR models are openly accessible. The model has a low computational demand, and can be run in parallel without
manual input requirement.
The model has four free parameters (X1–X4), requires daily precipitation
and potential evapotranspiration data as input, and routes water into two
stores: the production store and the routing store. The production store
(capacity X1) gains water from effective rainfall and loses water through
evaporation and percolation. Percolated water joins that which has bypassed
the production store and is routed with a fixed split, in which 90%
is routed via a unit hydrograph (time lag X4), followed by the non-linear
routing store (capacity X3); the remaining 10 % is routed by a
single unit hydrograph (time lag
The GR models include an optional snowmelt module, CemaNeige
(Valéry et al., 2014). Due to the high computational demand of
the snowmelt module, it was decided to calibrate the GR4J model without
snowmelt, as only 15 (5 %) of the 303 catchments experience a significant
fraction of precipitation falling as snow (
The GR4J model was calibrated for this study incorporating concepts from
GLUE-type Bayesian approaches (Beven and Freer, 2001), and
multi-objective Pareto-optimal solutions (Yapo et al., 1998). The
approach consisted of three stages, the details of which are further
elaborated in this sub-section: firstly the feasible parameter space was
determined, and sampled using Latin hypercube sampling (LHS) (McKay et
al., 1979); secondly the model was run, and six evaluation metrics were
calculated for each parameter set; and thirdly the top 500 parameter sets
for each catchment were selected using a very simple Pareto-optimising
ranking approach, accounting for non-acceptable trade-offs
(Efstratiadis and Koutsoyiannis, 2010). This method was formalised
for several reasons:
Latin hypercube sampling allowed the systematic sampling of the model
parameter space. Multiple evaluation metrics enabled the simultaneous optimisation of several
aspects of the flow regime, including general water balance and low flows. Model equifinality (Beven, 2006) could be addressed by accepting
multiple “behavioural” parameter sets. A deterministic “best” parameter set could also be selected.
LHS uses Latin square theory to ensure that the full range of each parameter is represented regardless of its resultant importance (Cheng and Druzdzel, 2000), whilst maximising efficiency in comparison to a simple random sampling approach. An LHS set of 500 000 model parameter sets (parameterisations) for the four model parameters was derived using the MATLAB package “lhsdesign” (The MathWorks Inc, 2016), using the “maximin” criterion to maximise the minimum distance between each point. In order to determine what values to ascribe to the upper and lower bounds of the parameters, a smaller experiment using 100 000 model parameterisations was run over 45 catchments as a “first pass”. This experiment used parameter limits that could be found in previous literature on the GR4J model (Pushpalatha et al., 2011; Perrin et al., 2003). It was found that good parameter sets for this first pass had storage values (X1 and X3) close to the limits that had been set from the literature. Therefore, in consultation with the developers of the airGR model package, it was decided to widen the ranges of parameter values, and then to increase the number of model parameterisations that were run to account for this increase in the parameter space. The parameter values were sampled from a uniform distribution, using the upper and lower limits given in Table 1. Lower bounds of 0.0001 were ascribed to the two storage parameters to represent a value of 0, without causing division errors.
Sampled parameter ranges.
For each of the 500 000 model parameterisations, six evaluation metrics were
calculated in order to employ a “multi-objective” approach to cover the
full range of the flow duration curve (see Table 2): the Nash–Sutcliffe efficiency (NSE) focusses on optimising high flows,
absolute percent bias (absPBIAS) maintains the water balance, mean absolute
percent error (MAPE) and NSE on logarithmic flows (logNSE) measure overall
agreement on the full range of flows, and absolute percent error in
Evaluation metrics used for model calibration, where
Post-calibration, the upper and lower daily limits of the 500 top ranking
parameterisations (see Sect. 3.2.3 for details on
the ranking process) were used to calculate two further model performance
metrics over the full observational record available for each catchment (a
maximum of 1891–2014):
The uncertainty width (UncW) – calculated by taking range of the minimum and
maximum LHS500 members each day and dividing it by the midpoint of the
LHS500 for that day. The mean of these values was then calculated over the
duration of the time series, as per The containment ratio (ContR) – calculated as the percentage of days that
the observations fell within the envelope of the minimum and maximum of the
LHS500 ensemble members for that day.
In order to optimise six evaluation metrics, the 500 000 model parameterisations were ranked from best to worst by their scores for each metric in turn, and these ranks were then summed to create a total rank. This total, or “basic”, rank was used to reorder the parameterisations for each catchment from best to worst, accounting for all metrics. However, the scores of the 500 000 model parameterisations were not normally distributed, and it was found that unacceptable trade-offs between metrics were occurring, whereby nominal increases in one metric were taking preference over quite significant decreases in other metrics. Therefore, a series of thresholds of acceptability were set, as shown in Table 3. A simple iterative search algorithm was then used to re-rank the list according to these thresholds, whilst retaining their original ranks within each threshold group. For example, if the first, third, and fourth parameterisations in the basic rank met the hardest threshold for all six metrics, but the second ranked parameterisation did not, the third and fourth would be bumped up the rankings, above the second, resulting in a list of [1, 3, 4, 2 …]. All parameterisations meeting the hardest thresholds were prioritised before the algorithm switched to search for those in meeting the middle thresholds, and so on. From this final list, the top ranking optimum parameter set was extracted for deterministic model applications, herein referred to as LHS1. Due to the variability of the performance across catchments, where hundreds of thousands of parameter sets met the hardest threshold in some catchments, whilst none met even the softest threshold in other catchments, it was decided that extracting behavioural parameter sets using a “limit of acceptability” approach after Beven (2006) would not be appropriate. Therefore, a proportion of the sampled model parameterisations, the top 500 (herein referred to as LHS500), were taken forward to provide an indication of parameter uncertainty within the flow simulations. The extent to which the threshold re-ranking influenced the rankings varied by catchment due to the differences in mode performance. Figure 2 shows the NSE and logNSE scores of the 500 000 model parameterisations (though this graph has been limited to show only those with positive scores for both metrics) for the River Greet in the Severn Trent region. This figure demonstrates how the basic ranking system identified 500 parameterisations close to the Pareto front of NSE vs. logNSE; however; parameterisations with scores that were lower for NSE than logNSE were selected. By applying the thresholds, parameterisations with an NSE lower than 0.4 were rejected and replaced with others within the acceptable range for all metrics.
Nash–Sutcliffe efficiency and log Nash–Sutcliffe efficiency
calibration scores for all sampled model parameterisations (grey), the top
500 from the basic ranking process (blue), and the top 500 after the
thresholds were applied to negate non-acceptable trade-offs (red). The
Thresholds for selecting acceptable model parameterisations.
Using these 500 model parameterisations per catchment, and the rainfall and potential evapotranspiration data described in Sect. 2, daily flow reconstructions were produced from January 1891 to November 2015 for the 303 catchments. Details on accessing this data are provided in the Data Availability section at the end of this paper.
The application of model results to drought analysis are conducted here using the “Standardised Streamflow Index” (SSI). The SSI has for some years been advocated as an equivalent to the Standardized Precipitation Index (e.g. Vicente-Serrano et al., 2012), being based on the cumulative probability of a given monthly mean streamflow occurring in a given catchment. The procedure involves fitting a statistical distribution to time series of accumulated streamflow over a baseline period, then transforming the data to a normal distribution to produce a dimensionless time series of the deviation of flow about the catchment mean. In this study, SSI was calculated using the 12-month accumulation period (SSI-12) and the Tweedie distribution (Svensson et al., 2017), over the baseline period 1961–2010. A 12-month accumulation period was chosen to provide summaries of long-term deficits that were likely to have significant impacts on water resources. The Tweedie distribution, which is a flexible three-parameter distribution that has a lower bound at zero, has been shown to perform effectively for UK river flows, across a wide range of near-natural Benchmark catchments (Svensson et al., 2017).
Using the SSI, accumulated over a 12-month
period, drought events were identified as periods where the SSI was
consecutively negative (i.e. below normal) with at least one month reaching
an SSI of
The threshold met by the LHS1 model parameterisation (shape), and the percentage of the LHS500 that met that threshold (colour), for the 303 study catchments. See Table 3 for the definition of the thresholds.
The map in Fig. 3 shows the threshold (as set out
in Table 3) met by the LHS1 runs and the percentage
of the LHS500 members that met that threshold. The map shows that the LHS1
runs for 272 of the 303 catchments met the hardest threshold set (shown as
triangles). However, there is a lot of variability within these catchments,
with 82 demonstrating all of the LHS500 met the hardest threshold (black
triangles), whilst 108 have less than 10 % of the LHS500 above the hardest
threshold (yellow triangles). The LHS1 run for 20 of the catchments met the
“middle” threshold, and very few catchments performed worse than this,
having
Polar plots of the scores for six evaluation metrics over the calibration period 1982–2014. Each blue bar and associated dot represents one of the 303 catchments, plotted around the perimeter of the circle, grouped by hydrometric region; see Fig. 1 for region abbreviations. Dark blue dots represent the LHS1 run, and blue shaded bars represent the range of the LHS500. The score is shown on the radial axis, with the outside of the circle representing best model performance.
Figure 4 shows the results of the six evaluation
metrics for each of the 305 flow reconstructions over the calibration period
(1982–2014), for both the LHS1 runs and the range of the LHS500. These polar
plots confirm the findings of Fig. 3, showing
that the model performance is generally very good, with most of the LHS1
runs for the 305 catchments satisfying the thresholds defined in
Table 3 with ease. This plot allows the assessment
of each performance metric individually and shows that performance varies
both between metrics and across catchments. The poorest scores, where the
LHS1 did not meet the softest threshold, can be mostly attributed to NSE, but
MAPE and MAM30
Figure 5 shows the scores of the uncertainty width
(UncW) and the containment ratio (ContR) for each of the 303 catchments. The
lower the UncW (a narrow range of model results among the 500 ensemble
members), and the higher the ContR (a high proportion of the observations
fitting within the band of model runs), the more accurate and reliable the
simulation. In these results, there appears to be a correlation between
UncW and ContR (Pearson correlation 0.52, with significance,
These graphs represent an overview of the performance of the model
calibrations across the UK. The model performance for individual catchments,
as well as time series of the reconstructed flow data from 1891–2015, can be
explored in more detail using the interactive web application at
Polar heat map showing
In order to evaluate the integrity of the reconstructed flow series, in the
earlier pre-calibration parts of the record, the six evaluation metrics for
the LHS1 runs specified in Table 2, as well as the
uncertainty width and the containment ratio for the LHS500, were calculated
over 30-year moving windows for all water years where flow observations
were available. These results have been plotted as polar heat maps in a
similar way to the polar plots showing the evaluation metrics over the
calibration period. Figure 6 shows the heat map for
In general, across all metrics and catchments, the scores are very stable: where bars are dark or pale, showing good and poorer model performance respectively, they remain similar colours throughout their length. There are some exceptions, which are most notable in the catchments with longer observational records. The Avon at Evesham in the ST region, the Dee at Manley Hall in the north-western England and northern Wales (NWENW) region, and the Bedford Ouse catchment in the Anglian (ANG) region show reduced model performance earlier in the record, with the bars moving through orange and yellow shades as they stretch towards the centre of the circle. It is worth noting, though, that these catchments are not part of the near-natural Benchmark Network (Harrigan et al., 2017) and have had reported issues with inhomogeneity in their observed records as a result of human influences. The Lee at Feildes Weir in the SE region (plotted at the boundary with the ANG region) also shows variation in performance across most metrics, although in this catchment, the performance is good (plotted in black) at the start and end of the record, with poorer performance (shown in yellow) around the start years of 1920–1940 (evaluation years of 1920 to 1970). In contrast to this, the Dee at Woodend in eastern Scotland and the Severn at Bewdley in Severn Trent region, which have the longest records in their regions, show more temporal stability in the model performances (with black colouring for the whole bar). This, coupled with the generally very stable results over the 20–30 years prior to the calibration period among with the catchments with shorter records, demonstrates that the flow series produced for this study are suitable for use in longer temporal studies, outside of the period of calibration (1982–2014).
In this section, the nine case study catchments (shown in Fig. 1) are used to examine the performance of both the LHS1 and the LHS500 modelled flows at simulating drought events.
The 1975–1976 event was chosen as a case study period to test the model's capability to reconstruct drought events. This event occurs before the model calibration period of 1982–2014 and was one of the most severe and widespread droughts of the 20th century in the UK (Marsh et al., 2007). Summary statistics showing the model performance for these catchments during both the calibration period (1982–2014) and the 10-year period surrounding this significant drought event (1971–1980) are provided in the Supplement Table S1. It is worth noting that the observational records in the Bush and Crimple did not begin until 1972, nor the Greet until 1974.
Monthly mean flows for 1970–1980. Each of the LHS500 ensemble members are shown in grey, with the LHS1 run shown in black, and the observations shown in red. The
SSI data for the LHS1 runs have been calculated for all 303 catchments and are freely available (Barker et al., 2018; see Data availability). These data, along with summaries of extracted drought events over space and time, can also be explored using an interactive web application at
In the Crimple, the UncW is especially wide during low flow events, and the
observations lie very close to the lowest of these model runs; however, the
LHS1 run lies close to the observational flow values. In other catchments,
such as the Otter, the observed and LHS1 flows sit more centrally within the
range of the LHS500. In the Avon, the observations sit centrally within the
uncertainty range, but the LHS1 run overestimates low flows. The LHS1
flows for the Cree tend to underestimate the low flows. The Avon and the
Bush display poor scores in the low flows metrics MAM30
Daily flow time series for the 1975–1976 drought event in each of the nine case study catchments. The
Daily flows for January 1975–December 1976 (shown in Fig. 8) highlight the difference in variability between the catchments in the northern and southern parts of the UK. The variability is generally well simulated, though the GR4J model exhibits some difficulty in simulating the low flow variability in the southern catchments, with very little inter-monthly variability in the simulated discharge, although significant peaks are identified among the ensemble members. Note that the abnormal peaks of the observational record on the Lambourn in September–December of both 1975 and 1976 are the result of the West Berkshire Groundwater Scheme (WBGS) that was implemented during the drought to alleviate the extreme low flows, and these are not accounted for in the model, which has no human influence representation. Generally, LHS1 simulations are low among the LHS500 runs in the Cree, Bush, and the Crimple (as seen in the monthly plots) but are close to the observations. This indicates that selecting the “best” simulation where a deterministic result is needed is more appropriate, in these cases, than extracting a mean or median from the ensemble.
As with the monthly flows, the Avon and the Bush show systematic overestimation of the low flows by the LHS1 run, whilst the Cree shows underestimation of low flows, with the exception of the most extreme low flows in July–September 1976. These mixed results that can be seen for the nine case study catchments highlight the variation in model performance among the 303 modelled catchments and emphasise the need for users to carefully appraise the evaluation metrics of the flow simulations for the catchments they are investigating. However, these catchments were deliberately selected to explore these variations, and the results shown in Sect. 4 demonstrate that the model performs well across a wide range of different catchment types at the national scale.
SSI data for the LHS1 runs have been calculated for all 303 catchments and
are freely available (Barker et al. 2018; see
Data availability) but have also been calculated
here for the LHS500 for the nine case study catchments. These data are used
to evaluate how well the ensemble simulations reproduce the drought event
accumulated deficit. For low flows, we consider SSI values of
Standardised Streamflow Index (SSI) using a 12-month accumulation period for nine case study catchments over the years 1971–1980. SSI values of 0, as well as those representing moderate (
Here, the SSI time series for the same 10-year period (1971 to 1980) are appraised and shown in Fig. 9. The uncertainty widths in the SSI plots shown vary substantially between catchments and directly reflect the ranges seen in the flow time series: with the Lambourn showing a very low UncW from the LHS500, whilst the Greet, Tove, and Otter show a wider range. In the Lambourn, Dee, and Bush catchments, the SSI derived from the observations frequently falls outside of the range of the LHS500, showing a low containment ratio (ContR). This behaviour is more pronounced in the SSI time series than the flow time series. The Dee catchment, for example, produced a ContR of 92.6 % for the daily flow data over 1971–1980, but the SSI-12 ContR is just 30 %. It is noticeable that the UncW values of the SSI data are fairly even throughout the time series, whilst in the flow data, they appear to be wider during the more extreme high and low flow periods. There are two factors which may have contributed to these differences: firstly that the smoothed nature of the SSI-12 reduces the short-term variability of the data (the ContR of the SSI-1 are closer to those of the flow data), and secondly, when deriving the SSI, the tails of the fitted distribution are more uncertain than for the average flows, which may result in convergence of the SSI values for the more extreme members of the LHS500 during periods of high and low flows.
For the Lambourn, the negative SSI values (below normal flows) are underestimated and the positive SSI values (above normal flows) are overestimated showing the model is overemphasising the extreme events. In the Avon catchment the most extreme SSI deficit occurs in 1973, and the 1976 event is classed as “severe” but not “extreme” for the observations and all but a few of the LHS500. The deficit in 1973 is simulated as being more extreme than the observations, but the 1976 event is better captured. The uncertainty range in the Greet catchment is very wide, particularly for the SSI peak (drought termination) in 1977; however, the 1976 SSI deficit has a lower range among the LHS500. For the Tove, the SSI of the 1976 drought event is well simulated, as are those for the Crimple, despite some underestimation of SSI at other times in the 1970s. The Otter shows very good simulation of SSI-12 during this 10-year period.
SSI time series plots over the longer period 1975 to 2015 are provided in Fig. S2. These plots show that although the exact magnitudes of the SSI deficits and excesses are not always captured by the model in some of the poorer performing catchments, the pattern of the SSI-12, the shape of the peaks and the troughs are very well represented.
Accumulated deficits of extracted drought events (using a
threshold of SSI
This section explores the accumulated deficits of extracted drought events between 1975 and 2015 (the common observed period for all nine catchments), which are presented in Fig. 10.
This plot shows that drought events are generally in good synchrony across the country. For these nine catchments, four major nationwide drought events using SSI-12 are evident: 1975–1978, 1989–1993, 1997–1998, and 2004–2006. Regional droughts include 1984 in the northern catchments and 2010–2012, which affected England and Wales, but not Scotland and Northern Ireland. There appears to be a relatively “drought-poor” period in the south between 1977 and 1988, whilst the north shows a lack of droughts in the more recent period of 2006–2015.
The observed events are very well captured by the model simulations. There
are only four out of a total of 40 observed drought events across all nine
catchments that are not detected by the simulated drought events: an event
in 1992 in the Crimple, 1994 and 2004 in the Dee, and 2006 in the Cree. In
each of these cases, the SSI of the model simulations fall below
In terms of timing and deficit, the 1995–1998 drought event demonstrates the most confidence among the simulations. The Crimple catchment shows some uncertainty about the timing of each of the events, and the majority of the LHS500 model simulations place the 2004–2006 event later than the observation. In Fig. S2, it can be see that this is due to the fact that the intensity of the 2005 deficit was overemphasised by the model. Similarly, in Fig. 10, the 1975–1978 event in the Bush shows a wide range of mid-point dates (centre of the circles), and the deficit also varies. Overall, the deficits of the events are well captured by the modelled data: for example, the 2004–2006 event in the north showed smaller deficits than the 1975–1978 event, and the modelled deficits reflect these differences. The modelled results for the 1997–1998 event in the Greet show two possible event timings, and the thickness of the circles indicates some differences in the simulated accumulated deficit among the model parameterisations, though these differences are relatively small.
On balance, the pattern of drought events is well simulated by the GR4J model, despite some small differences in magnitude and timing, with magnitude being better estimated than the timing. These results demonstrate that, despite the issues seen in the SSI time series plots, the dataset can provide good estimates of drought events and their characteristics. This highlights the potential of the model to reproduce hydrological drought events using just precipitation and evapotranspiration data and shows that the reconstructed flow time series will be valuable in appraising historic hydrological droughts over a longer period and wider spatial domain than the observations that are available.
The multi-objective calibration framework presented in this paper has produced modelled flow data with demonstrable high performance across a wide range of available observed records. This framework has been developed to enable nationally and temporally coherent flow simulation that can be applied to a wealth of applications, past, present, and future. In this paper, the calibration framework has been applied to a wide range of catchments across the UK, allowing for a detailed exploration of model performance across different hydrological regimes. Two potentially limiting factors in model performance were highlighted in this study: snowmelt and human influences.
The airGR snowmelt module was not employed in this study as only 15 of the 303 catchments showed snowmelt fractions greater than 0.15 (15 %). These catchments were located along 10 rivers, all in Scotland. Despite the lack of snowmelt processes here, all of the catchments met at least the “softest” evaluation thresholds set out in Sect. 3.2.3, with six, eight, and one catchment meeting the hardest, middle, and softest thresholds respectively. This implies that snowmelt only causes modelling issues for high-altitude Scottish catchments.
Human interactions are a common problem in hydrological modelling that remain largely understudied (Calvin and Bond-Lamberty, 2018). Whilst global scale models have been advancing in socio-hydrology, making use of satellite information and governmental estimates of total water consumption, the data to support such endeavours is lacking (Bierkens, 2015). Small-scale catchment models would need to rely on significant amounts of abstraction and licencing data as well as reservoir operation procedures, the details of which are often sensitive and/or unavailable. The lack of abstraction processes in GR4J is likely to be responsible for some reduced model performance, particularly in the regions of Anglian and southern England. The loss function (parameter X2 “inter-catchment exchange coefficient”) of the GR4J model can account for some systematic losses or gains; however, human influence is often non-stationary (e.g. construction and operation of reservoirs, irrigation, and water transfer schemes). For the Lee at Fieldes Weir and the Thames at Kingston naturalised river flow data, which attempt to remove the impact of human activity on the observed flow, were available. Whilst not included in this paper for consistency with the other 301 catchments, calibration scores were slightly better for the naturalised flow data in these catchments, though both naturalised and observed calibrations easily met the hardest thresholds. An alternative approach is to focus studies on the “near-natural” catchments, which are deemed to have minimal human influence. Of the 303 catchments included in this study, 115 are classified as near-natural and are part of the Low Flows Benchmark Network (Harrigan et al., 2017). Since many of the UK's most significant catchments are heavily influenced, they were not excluded from this study, and the model does successfully manage to implicitly account for human influences in these large rivers. Localised issues in the model's performance, and therefore the quality of the reconstructed flow data, highlight the need for users to take caution when choosing a catchment from this set of 303. Depending on their needs, an alternative nearby catchment where model performance is better may be more suitable if model performance is poor in the initially selected catchment.
The modelling framework developed in this study has explored model parameter uncertainty in order to account for equifinality (Beven, 2006). A total of 500 000 parameter realisations were run, and the best 500 of these were selected for each catchment to allow for uncertainty quantification in applications of these flow data. Here, the uncertainty in the model runs was shown to vary more between catchments than over time (from 1890 to 2015, where long observational records were available). Whilst model parameter uncertainty was considered in this study, further sources of uncertainty can contribute to variations in model performance, including model input data (precipitation and PET), flow data used for model calibration, and the choice of hydrological model (Smith et al., 2018b).
The impact of precipitation uncertainty has been shown to be more significant than PET in hydrological modelling (Paturel et al., 1995; Bastola et al., 2011; Guo et al., 2017). Perry and Hollis (2005) and Legg (2015) state that the accuracy of gridded data is dependent on the density of the rain gauge network, with greater errors associated with sparse coverage. Therefore errors in the reconstructed precipitation data applied in this study will be higher in the early part of the record when the station density was lower. Since the model is calibrated to the more recent period 1982–2014, uncertainty from the rainfall data may propagate through to the flow reconstructions in the early part of the record. However, from the moving window analysis of model performance (see Fig. 6), there does not appear to be significant degradation in the quality of flow simulations in the early part of the record. Tanguy et al. (2018) considered the impact of poorer quality and lower density of temperature data on the derivation of the PET dataset that was employed in this study and concluded that, whilst PET is an important variable for predicting runoff, the influence of degraded PET input that results from low quality temperature data on runoff simulation can be limited by the adequate calibration of hydrological models (Bai et al., 2016; Seiller and Anctil, 2016). Thus, the Tanguy et al. (2018) PET dataset is considered suitable for use in hydrological models, especially if they are calibrated to this dataset.
Uncertainties may also arise from the observational flow data used to
calibrate models. Uncertainties from the precision of the instruments that
measure the water level (stage) and uncertainties from the derivation of
the stage-discharge relationship are both particularly sensitive in the
extreme flow ranges. For example, a 10 mm error in stage measurement at the
Q95 flow can result in a 20 % error in flow for around a third of the UK's
gauging stations (National River Flow Archive, 2019). The dataset used
in this study was taken from a reputable source (the NRFA) who, in order to
minimise such errors, conduct rigorous quality control procedures using both
automatic and manual validation procedures annually (Dixon et
al., 2013). Nevertheless, hydrometric data quality does vary across the
network and errors tend to cluster in the extreme flow ranges, so
hydrometric uncertainty could be influential in some periods in catchments
used herein – we recommend users consult the NRFA's extensive station and
catchment metadata (available at
Whilst the parameter uncertainty in the model was evaluated here, applying different model types and model structures can also yield dramatically different results. Many multi-model experiments have been conducted to assess the differences between hydrological models (e.g. Warszawski et al., 2014; Vansteenkiste et al., 2014). Similarly, different structures of the same model (e.g. GR4J, GR5J, and GR6J) can influence the results. However, Smith (2016) found that model parameter uncertainty can be as wide as that from using different hydrological models, and initial testing of the GR5J and GR6J models showed significant parameter interactions that resulted in poor simulations in many UK catchments. It was therefore decided that considering the parameter uncertainty of the GR4J model would be sufficient to devise an ensemble of flow reconstructions for this dataset and study. Future work will investigate these simulations against a wider set of model runs using other model structures as part of a follow-up study.
The modelling framework developed here has been shown to be fit-for-purpose
for drought reconstruction, across a very wide range of catchment
behaviours. The reconstructed series can be used to shed light on historical
drought occurrence, characteristics (severity, duration, termination,
seasonality), and variability. A first exploration of hydrological drought
using the reconstructions is presented in a companion paper by
Barker et al. (2019). The data can also be used to support
drought and water resources planning activities, whether directly or to
provide context for stochastic approaches to drought generation. Ensembles
of historical drought events can be used to provide insight into the
probabilities of the termination of a current event over a certain time
period (e.g. Parry et al., 2018). Knowledge of historic events
can also be used to explore statistical correlations with atmospheric
drivers of droughts that may help predict the onset of events
(e.g. Lavers et al., 2015). In these approaches, extending
the hydrological record by
In this paper, a novel multi-objective calibration method was derived and tested for 303 catchments in the UK, and the calibrations were used to reconstruct river flows back to 1891. The GR4J model was applied and calibrated using Latin hypercube sampling (LHS) and six evaluation metrics simultaneously to allow for the evaluation of high, median, and low flows, thus optimising the calibrations for a wide range of potential applications. A best run (LHS1) and 500 model parameterisations (LHS500) were used to assess model uncertainty. Overall, the multi-objective calibration procedure has yielded excellent model results when compared to the observations, with the exception of only a few catchments. The reconstructed flows were appraised over 30-year moving windows, and were shown to provide good simulations of flow in the early parts of the record, where observations were available. Model performance and uncertainty during drought events was explored in nine case study catchments, and varied by catchment. The model simulations were used to derive the Standardised Streamflow Index, which allowed for an assessment of the model's ability to simulate significant deviations from a catchment's “norm”. The results showed that, despite observations regularly sitting outside the range of the LHS500, the peaks and troughs of the time series were well represented. Drought event accumulated deficits were extracted from the SSI data and the results were overall very good, demonstrating that the data from these model calibrations are suitable for the identification and characterisation of hydrological drought events in the UK.
The contributions of this paper are threefold: firstly, the multi-objective
model calibration framework applied here has been shown to provide robust
model calibrations that can be applied in studies of both general and
extreme hydrology. This framework could be applied elsewhere across Europe,
and indeed globally, to allow for spatially and temporally consistent
simulations of hydrology with far reaching potential applications. Secondly,
the model calibrations that have been derived for these 303 catchments in
the UK can be used in further research and operational applications, such as
for seasonal hydrological forecasting, or for assessing changes in river
flows under climate change. Finally, this study has produced a crucial
dataset of
The potential evapotranspiration dataset used in this study is
freely available from the Environmental Information Data Centre
(Tanguy et al., 2017,
Observed river flow data was accessed via the National
River Flow Archive, which provides daily and peak river flows for the UK for
over 1500 gauging stations (
The flow reconstructions produced in this study are
freely available on the Environmental Information Data Centre
(EIDC,
The Standardised Streamflow Index data derived from the LHS1 runs
are also freely available from the Environmental Information Data Centre
(Barker et al., 2018,
The supplement related to this article is available online at:
TPL and MT provided input data for the modelling. CP, SP, JH, and SH assisted KAS in designing the modelling framework. KAS calibrated the model, ran the reconstructions, and produced the plots. LJB calculated the SSI data and extracted drought events from the reconstructions. All authors contributed to the production of the manuscript, and its revisions.
The authors declare that they have no conflict of interest.
This research is an outcome of two projects, funded by the Natural Environment Research Council Drought and Water Scarcity Programme: Analysis of Historic Droughts and Water Scarcity in the UK (grant number NE/L01016X/1), and Improving Predictions of Drought for User Decision-Making [IMPETUS, grant number NE/L010267/1]. The authors would also like to thank the developers of the airGR model at IRSTEA for their guidance.
This research has been supported by the Natural Environment Research Council (grant nos. NE/L01016X/1 and NE/L010267/1).
This paper was edited by Nunzio Romano and reviewed by James Stagge and two anonymous referees.