Technical Note – RAT: a Robustness Assessment Test for calibrated and uncalibrated hydrological models



Highlights

• a new method (RAT) is proposed to assess the robustness of hydrological models, as an alternative to the classical split-sample test
• the RAT method does not require multiple calibrations of hydrological models: it is therefore applicable to uncalibrated models
• the RAT method can be used to determine whether a hydrological model cannot be safely used for climate change impact studies
• success at the RAT test is a necessary (but not sufficient) condition of model robustness

Abstract

Prior to their use under future changing climate conditions, all hydrological models should be thoroughly evaluated regarding their temporal transferability (application in different time periods) and extrapolation capacity (application beyond the range of known past conditions). This note presents a straightforward evaluation framework aimed at detecting potential undesirable climate dependencies in hydrological models: the robustness assessment test (RAT). Although it is conceptually inspired by the classic differential split-sample test of Klemeš (1986), the RAT presents the advantage of being applicable to all types of models, be they calibrated or not (i.e. regionalized or physically based). In this note, we present the RAT, illustrate its application on a set of 21 catchments, verify its applicability hypotheses and compare it to previously published tests. Results show that the RAT is an efficient evaluation approach; passing it successfully can be considered a prerequisite for any hydrological model to be used for climate change impact studies.

1 Introduction

1.1 All hydrological models should be evaluated for their robustness

Hydrologists are increasingly requested to provide predictions of the impact of climate change (Wilby, 2019).
Given the expected evolution of climate conditions, the actual ability of models to predict the corresponding evolution of hydrological variables should be verified (Beven, 2016). Indeed, when using a hydrological model for climate change impact assessment, we make two implicit hypotheses concerning:

• the capacity of extrapolation beyond known hydroclimatic conditions: we assume that the hydrological model used is able to extrapolate catchment behaviour under conditions not or rarely seen in the past. While we do not expect hydrological models to be able to simulate a behaviour which would result from a modification of catchment physical characteristics, we do expect them to be able to represent the catchment response to extreme climatic conditions (and possibly to conditions more extreme than those observed in the past);

• the independence of the model set-up period: we assume that the model functioning is independent of the climate it experienced during its set-up/calibration period. For those models which are calibrated, we assume that the parameters are generic and not specific to the calibration period, i.e. that they do not suffer from overcalibration on this period (Andréassian et al., 2012).

Hydrologists make the hypothesis that model structure and parameters are well identified over the calibration period and that parameters remain relevant over the future period, when climate conditions will be different. Unfortunately, the majority of hydrological models are not entirely independent of climate conditions (Refsgaard et al.).

The diagnostic tool most widely used to assess the robustness of hydrological models is the split-sample test (SST) (Klemeš, 1986), which is considered by most hydrologists as a "good modelling practice" (Refsgaard & Henriksen, 2004). The SST stipulates that when a model requires calibration (i.e.
when its parameters cannot be deduced directly from physical measurements or catchment descriptors), it should be evaluated twice: once on the data used for calibration and once on an independent dataset. This practice has been promoted in hydrology by Klemeš (1986).

A few authors also tried to propose improved implementations of these testing schemes. Seiller et al. (2012) used non-continuous periods or years selected on mean temperature and precipitation to enhance the contrast between testing periods. This idea of jointly using these two climate variables to select periods was further investigated by Gaborit et al. (2015), who assessed how the temporal robustness of models can be improved by advanced calibration schemes. They showed that the robustness of the tested model was improved when going from humid-cold to dry-warm or from dry-cold to humid-warm conditions when using regional calibration instead of local calibration. … length, and conclude that parameters obtained on dry periods may be more robust.

All these past studies show that there is still methodological work needed on the issue of model testing and robustness assessment. This note is a further step in that direction.

… one may have with a hydrological model to be used in a changing climate context. One of the problems of existing methods is the requirement of multiple calibrations of hydrological models: these are relatively easy to implement with parsimonious conceptual models but definitely not with complex models that require long interventions by expert modellers and, obviously, not for those models with a once-for-all parameterisation.

Scope of the technical note
Here, we propose a framework that is applicable with only one long period for which a model simulation is available. Thus, the proposed test is even applicable to those models that do not require calibration (or to those for which only a single calibration exists).

Section 2 presents and discusses the concept of the proposed test, section 3 presents the catchment set and the evaluation method, and section 4 illustrates the application of the test on a set of French catchments, with a comparison to a reference procedure.

2 The robustness assessment test (RAT) concept

The robustness assessment test (RAT) proposed in this note is inspired by the work of Coron et al. (2014). The specificity of the RAT is that it requires only one simulation covering a sufficiently long period (at least 20 years) with as much climatic variability as possible. Thus, it applies at the same time to simple conceptual models that can be calibrated automatically, to more complex models requiring expert calibration, and to uncalibrated models for which parameters are derived from the measurement of certain physical properties. The RAT consists in computing a relevant numeric bias criterion repeatedly each year and then exploring its correlation with a climatic factor deemed meaningful, in order to identify undesirable dependencies and thus to assess the extrapolation capacity.
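The core of the RAT procedure (an annual bias series correlated with a climate series by rank) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the data layout (one list of simulated and observed flows per year) and all function names are illustrative assumptions.

```python
# Minimal sketch of the RAT: compute an annual bias series from one long
# simulation, then test its rank correlation with a climate anomaly series.
# Data layout and names are illustrative assumptions, not the paper's code.

def annual_bias(sim_by_year, obs_by_year):
    """Relative streamflow bias for each year: (sum(sim) - sum(obs)) / sum(obs)."""
    return [(sum(s) - sum(o)) / sum(o) for s, o in zip(sim_by_year, obs_by_year)]

def ranks(values):
    """1-based average ranks (tied values get the mean rank of their block)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1  # average rank of the tied block
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In practice one would also need the p-value of the correlation (available, for instance, from `scipy.stats.spearmanr`); a significant correlation between the annual biases and the climate series would indicate an undesirable climate dependency.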

To summarize the results on the 21 catchments, we present in Figure 5 the slope and intercept of a linear regression computed between model streamflow bias and climatic variable anomaly, for the GSST and the RAT over the 21 catchments: the slopes of the regressions obtained for both methods are very similar, and the intercepts also exhibit a good match (although with somewhat larger differences).
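The per-catchment regression underlying this comparison is a simple ordinary-least-squares fit of annual bias against climate anomaly. A minimal sketch, assuming the two annual series are already available (the function name and inputs are illustrative):

```python
# Ordinary least squares fit of annual model bias against climate anomaly,
# returning the (slope, intercept) pair compared in Figure 5. Illustrative sketch.

def ols(anomaly, bias):
    """Fit bias = slope * anomaly + intercept by least squares."""
    n = len(anomaly)
    mx, my = sum(anomaly) / n, sum(bias) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(anomaly, bias)) / \
            sum((x - mx) ** 2 for x in anomaly)
    return slope, my - slope * mx
```

Computing this pair once per catchment for the GSST biases and once for the RAT biases gives the two point clouds compared in the figure.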

We can thus conclude that the RAT reproduces the results of the GSST, but at a much lower computational cost, which is what we were aiming at. One should, however, acknowledge that switching from the GSST to the RAT unavoidably reduces the severity of the climate anomalies to which the hydrological models can be exposed: indeed, the climate anomalies with the RAT are computed with respect to the mean over the whole period, whereas with the GSST they are computed between two shorter (and hence potentially more different) periods.

Application of the RAT procedure to the detection of climate dependencies

We now illustrate the different behaviours found among the 21 catchments when applying the RAT procedure. The significance of the link between model bias and climate anomalies was assessed using the Spearman correlation and a 5 % threshold. Five cases were identified:

1. No climate dependency (Figure 6): This is the case for 6 catchments out of 21 and the expected situation of a "robust" model. The different plots show a lack of dependence for temperature, precipitation and humidity index alike. For the catchment of Figure 6, the p-value of the Spearman correlation is high (between 0.23 and 0.98) and thus not significant.

…

• Significant climate dependency on temperature and humidity index but not on precipitation (Figure 10). This case happens for 5 of the 21 catchments.

The RAT can be seen as a kind of crash-test. As with all crash tests, it will end up identifying failures. But the fact that a car may be destroyed when projected against a wall does not mean that it is entirely unsafe; it rather means that it is not entirely safe. Although we are conscious of this, we keep driving cars… but we are also willing to pay (invest) more for a safer car (even if this safer-and-more-expensive toy did also ultimately fail the crash test). We believe that the same will occur with hydrological models: the RAT may help identify safer models, or safer ways to parameterize models. If applied on large datasets, it may help identify model flaws, and thus help us work to eliminate them. It will not, however, help identify perfect models: these do not exist.
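The case-by-case diagnosis above amounts to comparing, for each climate variable, the Spearman p-value against the 5 % threshold. A minimal sketch of that classification logic, assuming the p-values have already been computed (e.g. with `scipy.stats.spearmanr`); all names are illustrative:

```python
# Illustrative classification of one catchment from the p-values of the
# Spearman correlation between annual model bias and each climate variable.
# The 0.05 threshold matches the 5 % significance level used in the note.

ALPHA = 0.05

def classify(p_values, alpha=ALPHA):
    """p_values: dict mapping climate variable name -> Spearman p-value."""
    significant = [name for name, p in p_values.items() if p < alpha]
    if not significant:
        return "no climate dependency (robust)"
    return "dependency on: " + ", ".join(significant)
```

For instance, the catchment of Figure 6 (all p-values between 0.23 and 0.98) would fall in the "no climate dependency" case, while a catchment with low p-values for temperature and humidity index only would fall in the last case described above.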

The proposed robustness assessment test (RAT) is an easy-to-implement evaluation framework that allows the robustness of all types of hydrological models to be evaluated and compared, using only one long period for which model simulations are available.

The proposed test obviously has its limits. A first difficulty that we see in using the RAT is that it is only applicable in cases where the hypothesis of independence between the 1-year subperiods and the whole period holds. This is the case when long series are available (at least 20 years, see the last graph in the appendix). If this is not the case, the RAT procedure should not be used. Therefore, we would indeed recommend its use in cases where modellers cannot "afford" multiple calibrations, or where the parameterisation strategy is considered (by the modeller) as 'calibration free' (i.e. physically-based models). A few other limitations should be mentioned:

1. In this note, the RAT concept was illustrated with a rank-based test (Spearman correlation) and a significance threshold of 0.05. Like all thresholds, this one is arbitrary. Moreover, other non-parametric tests could be used and would probably yield slightly different results (we also tested the Kendall tau test, with very similar results, which are not shown here);

2. Detecting a relationship between model bias and a climate variable using the RAT does not allow one to directly conclude on a lack of model robustness, because even a robust model will be affected by a trend in input data, yielding the impression that the hydrological model lacks robustness. Such an erroneous conclusion could also be due to widespread changes in land use, the construction of an unaccounted-for storage reservoir or the evolution of water uses. Some of the lacks of robustness detected among the 21 catchments presented here could in fact be due to metrological causes;

3. Also, because of the ongoing rise of temperatures (over the last 40 years at least), there is a correlation between temperature and time since the beginning of streamgaging. If, for any reason, time has an impact on model bias, this may cause an artefact in the RAT in the form of a dependency between model bias and temperature;

4. Similarly to the differential split-sample test, the diagnostic of model climatic robustness is limited to the climatic variable against which the bias is compared. As such, the RAT should not be seen as an absolute test, but rather as a necessary but not sufficient condition to use a model for climate change studies: because the climatic variability present in past observations is limited to the historic range, so is the extrapolation test. In Popper's words (Popper, 1959)…

The GR models, including GR4J, are available from the airGR R package.

8 Appendix – Checking the impact of the partial overlap between calibration and validation periods in the RAT

In this appendix, we deal with calibrated models, for which we verify that the main hypothesis underlying the RAT is reasonable, i.e. that when considering a long calibration period, the weight of each individual year in the overall calibration process is almost negligible.
We then explore the limits of this hypothesis when reducing the length of the overall calibration period.

Evaluation method

In order to check the impact of the partial overlap between calibration and validation periods in the RAT, it is possible (provided one works with a calibrated model) to compare the RAT with a "leave-one-out" version of it, which is a classical variant of the split-sample test (SST): instead of computing the annual bias after a single calibration encompassing the whole period (RAT), we compute the annual bias with a different calibration each time, encompassing the whole period minus the year in question ("leave-one-out SST").

The comparison between the RAT and the SST can be quantified using the root mean square difference (RMSD) of annual biases:

$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(\mathrm{Bias}_{\mathrm{RAT}}(n) - \mathrm{Bias}_{\mathrm{SST}}(n)\right)^2}$ (Eq. 1)

where Bias_RAT(n) is the bias of validation year n when calibrating the model over the entire period (RAT procedure), and Bias_SST(n) the bias of validation year n when calibrating the model over the entire period minus year n (leave-one-out SST procedure).

The difference between the two approaches is schematized in Figure 11: the leave-one-out procedure consists in performing N calibrations over (N-1)-year-long periods, each followed by an independent evaluation on the remaining 1-year-long period. As shown in Figure 11, the two procedures result in the same number of validation points (N). Eq. 1 provides a way to quantify whether both methods differ, i.e. whether the partial overlap between calibration and validation periods in the RAT makes a difference.

It is also interesting to investigate the limit of our hypothesis (i.e. that the relative weight of one year within a long time series is very small) by progressively reducing the period length: indeed, the shorter the data series available to calibrate the model, the larger the relative weight of each individual year.
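Eq. 1 is straightforward to compute once the two annual-bias series are available. A minimal sketch (the series themselves would come from the single whole-period calibration and from the N leave-one-out calibrations described above; names are illustrative):

```python
# Root mean square difference (Eq. 1) between the annual biases of the RAT
# (single whole-period calibration) and of the leave-one-out SST
# (one calibration per left-out year). Inputs are illustrative series.

def rmsd(bias_rat, bias_sst):
    """RMSD between two equal-length series of annual biases."""
    n = len(bias_rat)
    return (sum((a - b) ** 2 for a, b in zip(bias_rat, bias_sst)) / n) ** 0.5
```

An RMSD close to zero means the partial calibration/validation overlap in the RAT makes little difference, which is the hypothesis this appendix checks.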
Figure 14 compares the annual bias obtained with the RAT procedure with the annual bias obtained with the leave-one-out SST, for 10-, 20-, 30- and 40-year period lengths (selection of the shorter periods was realized by sampling 10, 20, 30 and 40 years regularly among the complete time series). The shorter the calibration period, the larger the differences between both approaches (wider point scatter): there, we reach the limit of the single-calibration procedure. We would not advise using the RAT with time series of less than 20 years.

Figure 14. Annual bias obtained with the RAT procedure vs. annual bias obtained with leave-one-out SST.

These differences can be quantitatively measured by computing the RMSD (see Eq. 1) between the annual bias obtained with the RAT procedure and with the SST for different calibration period lengths (see Figure 15). The RMSD tends to increase when the number of years available to calibrate the model decreases, but it seems to be stable for periods longer than 20 years.