Technical note: PMR – a proxy metric to assess hydrological model robustness in a changing climate

The ability of hydrological models to perform in climatic conditions different from those encountered in calibration is crucial to ensure a reliable assessment of the impact of climate change on water management sectors. However, most evaluation studies based on the Differential Split-Sample Test (DSST) support the consensus that rainfall-runoff models lack climatic robustness. Models typically exhibit substantial errors on streamflow volumes when applied under climatically different conditions. In this technical note, we propose a new performance metric to evaluate model robustness without applying the DSST, requiring only a single hydrological model calibration. The Proxy for Model Robustness (PMR) is based on the systematic computation of model error on sliding sub-periods of the whole streamflow time series. We demonstrate that the metric shows patterns similar to those obtained with the DSST for a conceptual model on a set of 377 French catchments. An analysis of sensitivity to the length of the sub-periods shows that this length influences the values of the PMR and its agreement with DSST biases. We recommend sub-period lengths of a few years, although this choice should be context-dependent. Our work makes it possible to evaluate the temporal transferability of any hydrological model, including uncalibrated models, at a very low computational cost.


Introduction
In the context of climate change, quantifying the performance of the models used for assessing the impact of a changing climate is essential for informing model selection and estimating uncertainty. Assessing the impact of a changing climate typically involves a modeling chain ranging from general circulation models to impact models such as catchment hydrological models (Clark et al., 2016). It is now acknowledged that the contribution of hydrological models to the total uncertainty of projections may be significant and should be addressed along with other sources of uncertainty (e.g. Hagemann et al., 2013; Schewe et al., 2014; Vidal et al., 2016; Melsen et al., 2018). A key issue in the reduction of hydrological model uncertainty is the assessment of model robustness to climatic changes, i.e., the ability of models to perform in climatic conditions that differ from those encountered in calibration.
Advocating that hydrological models need to be tested under conditions that "represent a situation similar to which the data are to be generated," Klemeš (1986) suggested a series of tests to evaluate the robustness of hydrological models. Among these testing procedures, the most popular scheme to assess model robustness to varying climatic conditions is the Differential Split-Sample Test (DSST). The DSST consists of a calibration-evaluation exercise on two periods of the available time series chosen to be as climatically different as possible. Variants of the DSST have also been proposed for specific purposes, such as the Generalized Split-Sample Test (Coron et al., 2012), which consists of a systematic calibration-evaluation experiment on every pair of independent periods that can be defined. However, these variants all rely on the same principles as the DSST (e.g. Dakhlaoui et al., 2019).
Many studies report poor model simulations resulting from the application of the DSST in various modeling contexts (e.g. Thirel et al., 2015). Among the deficiencies observed in the tested models, a common feature is their tendency to produce biased streamflow simulations in evaluation conditions (e.g. Vaze et al., 2010; Merz et al., 2011; Broderick et al., 2016; Dakhlaoui et al., 2017; Mathevet et al., 2020). Although changes in catchment temperature and/or precipitation are usually associated with volume errors, these errors vary across the tested models and catchments (e.g. Vaze et al., 2010; Broderick et al., 2016; Dakhlaoui et al., 2017). The dire need to improve hydrological models is widely recognized and is considered one of the 23 unsolved problems in modern hydrology (Blöschl et al., 2019, UPH n°19). However, to improve models we first need good diagnostic methods, and the design of alternatives to the DSST for the evaluation of model robustness could contribute to these advancements.
The first shortcoming of the DSST is that it cannot be applied to a particular category of hydrological models. Indeed, Refsgaard et al. (2014) pointed out that split-sample procedures are not applied to models that are not calibrated. The evaluation of such models is usually performed by testing their spatial transferability with data from proxy sites. It is therefore difficult to compare the robustness of highly complex hydrological models with that of simpler models such as the ones typically tested in the aforementioned DSST studies. A further limitation is the necessity to determine a set of climatic variables to define contrasted calibration and evaluation periods. This is of course highly relevant in contexts where the direction of future changes is unambiguously predicted. In other situations, however, robustness assessment would benefit from evaluating the model on a wider spectrum of hydro-climatic changes. Variants of the DSST, such as the Generalized Split-Sample Test, may circumvent this problem, but at a high computational cost that not all modelers can afford (Coron et al., 2012). This technical note presents and assesses a way to quantify model robustness as a mathematical performance criterion computed without splitting time series into calibration and evaluation periods. This criterion is conceived to be a proxy for model robustness, i.e., to reproduce the average hydrological model error obtained by applying the DSST. It is based on the computation of interannual model bias and derives from graphical considerations in the work of Coron et al. (2014). In order to be reliable, the Proxy for Model Robustness (PMR) must indicate typical model biases in independent evaluation periods. It should also help to identify catchments where a model lacks robustness.
We summarize the important aspects discussed in the following with two research questions:
- Does the PMR faithfully relate to model robustness as assessed in DSST experiments?

- How do computation choices affect the results obtained when applying the PMR?
The first question will be addressed by comparing the metric with the model bias obtained in the DSST for a conceptual model across a large set of French catchments. The underpinning mathematical choices will be discussed in a sensitivity analysis comparing the metric with the results obtained by applying the DSST. The description of the PMR is given in Section 2. The hydrological model and the data are presented in Section 3. The reliability of the metric is assessed in Section 4, and the results of the sensitivity analysis are shown and discussed in Section 5.
2 Description of the Proxy for Model Robustness

2.1 Building the "moving bias curve"

Model robustness to climate change is the ability of a hydrological model to perform well under different climatic conditions without its parameters being recalibrated to match changes in the precipitation-streamflow relationship. A robust model should thus adequately simulate streamflow volumes for any type of climatic conditions experienced by a catchment. Coron et al. (2014) suggested a simple way to visualize model robustness by computing the bias of a model simulation on sliding sub-periods of the available time series (Figure 1). The curve of model bias on the moving sub-periods, named here the "moving bias curve," indicates the temporal evolution of model errors. Since a robust model should perform similarly well whatever the considered sub-period, the flatter the moving bias curve, the more robust the model. Coron et al. (2014) showed that hydrological models typically do not have the ability to flatten their associated moving bias curve. The authors calibrated model parameters on each sub-period of the data and plotted all the resulting moving bias curves on the same graph. One of the main conclusions of their study was that the obtained moving bias curves were almost parallel and that calibration conditions influenced the vertical positioning of the curves more than their shape. This observation held for models of different complexities across a small set of catchments. The phenomenon described by Coron et al. (2014) is illustrated in Figure 2.
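As a minimal sketch of this construction (assuming daily series stored as NumPy arrays and a 5-year sliding window stepped by one year; function and variable names are illustrative, not from the original study):

```python
import numpy as np

def moving_bias_curve(q_obs, q_sim, window_years=5, days_per_year=365):
    """Relative model bias on sliding sub-periods (the "moving bias curve").

    q_obs, q_sim: daily observed and simulated streamflow arrays.
    Returns one value per sub-period, stepped by one year:
    bias_i = mean(q_sim_i) / mean(q_obs_i) - 1 (0 means no volume error).
    """
    window = window_years * days_per_year
    n_years = len(q_obs) // days_per_year
    biases = []
    for start_year in range(n_years - window_years + 1):
        s = start_year * days_per_year
        biases.append(q_sim[s:s + window].mean() / q_obs[s:s + window].mean() - 1.0)
    return np.array(biases)
```

A perfectly robust and unbiased simulation would produce a flat curve at zero; the flatter the curve, the more robust the model.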

The moving bias curve obtained with the model calibrated on the blue sub-period (1984-1988, the coldest sub-period of the time series) is almost parallel to the moving bias curve derived from the calibration on the total period. The y-axis shift corresponds to a model bias on the calibration sub-period almost equal to zero. Calibrating the model on another sub-period (1999-2003, the warmest sub-period of the time series, in red) yields a different shift of the moving bias curve, which corresponds to a null model bias on the red calibration sub-period. Since the shape of the curve is almost identical whatever the calibration period in the illustrated case, it offers an interesting perspective on model robustness. The flatness of the curve is indeed almost independent of the period used for model calibration.
Whether they are parallel or not, depending on the modeling context (model, catchment, data, etc.), the moving bias curves appear to be a relevant tool for analyzing model robustness. Before performing calibration-evaluation tests, assessing the flatness of the moving bias curve obtained by calibrating a model as well as possible (i.e., with all available data) can be seen as a first estimate of model robustness. We thus propose a simple mathematical expression to calculate this flatness, i.e., a performance metric designed to be a PMR.

2.2 Computation of the Proxy for Model Robustness
The PMR is based on the computation of the average absolute difference between the actual moving bias curve computed on 5-year sub-periods and a hypothetical flat curve. This hypothetical flat curve is defined as the curve obtained for a hypothetical model that is perfectly robust, so that its bias on different time sub-periods would remain constant, but imperfect, so that this bias would be equal to the mean bias of the evaluated model on the total period. It should be noted that, if the evaluated model is unbiased, as is the case in Figure 2 for the moving bias curve obtained by calibrating the model on all the available data, then this reference hypothetical curve is simply defined by "y = 0." The PMR is computed as the mean of absolute differences between the model average error on the 5-year sub-periods and the model average error on the total period, normalized by the average observed streamflow (Equation (1)). It thus corresponds to the normalized area between the moving bias curve and the hypothetical flat curve:

PMR = (2 / N) Σ_{i=1}^{N} | (Q_sim,i − Q_obs,i) − (Q_sim − Q_obs) | / Q_obs    (1)
Q_obs and Q_sim are the respective averages of the observed and simulated streamflows on the total period. Q_obs,i and Q_sim,i are the respective averages of the observed and simulated streamflows on the sub-period whose index is i. N is the number of sub-periods that can be defined with a 5-year moving window. The factor of 2 is included to reproduce the bias that would be obtained in a DSST on sub-periods that are on the opposite side of the moving bias curve (see Figure 2).
Although the errors for each sub-period are calculated in absolute terms, the normalization by the average observed streamflow allows the resulting value to represent the average relative error produced by the model on the sub-periods as compared with mean observed streamflow.
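A direct transcription of this computation might look as follows (a sketch assuming daily NumPy arrays and 5-year sliding windows stepped by one year; names are illustrative):

```python
import numpy as np

def pmr(q_obs, q_sim, window_years=5, days_per_year=365):
    """Proxy for Model Robustness, following Equation (1): twice the mean
    absolute deviation of the sub-period volume error from the whole-period
    volume error, normalized by the mean observed streamflow."""
    err_total = q_sim.mean() - q_obs.mean()
    window = window_years * days_per_year
    n_years = len(q_obs) // days_per_year
    deviations = []
    for start_year in range(n_years - window_years + 1):
        s = start_year * days_per_year
        err_i = q_sim[s:s + window].mean() - q_obs[s:s + window].mean()
        deviations.append(abs(err_i - err_total))
    return 2.0 * np.mean(deviations) / q_obs.mean()
```

By construction, a model whose volume error stays constant across sub-periods gets PMR = 0, however biased it is on the whole period; only temporal variations of the error are penalized.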

One reason for computing errors relative to the average streamflow on the whole time series instead of the average streamflow of each sub-period is that this reduces the weight of very dry years. It also avoids dealing with zeros in the denominator in intermittent catchments. This choice is further discussed in Appendix B.

Catchment set

A set of 377 French catchments was used (Figure 3) (Delaigue et al., 2020). The selected catchments cover a variety of physical and hydroclimatic characteristics and were selected because they are little impacted by human activities and receive limited solid precipitation (< 10% of the total precipitation on average). Western France is characterized by an oceanic climate with no marked wet and dry seasons. The climate of the eastern part of the country is more continental, with a larger annual temperature range. Southeastern France has a Mediterranean climate, with humid springs and autumns and dry summers. The yearly average precipitation of the catchments ranges from 662 mm to almost 1926 mm, while the average temperatures vary from 8 to 14.4°C. Daily streamflow measurements at the outlet of the catchments were retrieved from the Banque HYDRO (http://www.hydro.eaufrance.fr/ (last accessed: 21 January 2019), Leleu et al., 2014). Daily meteorological data were supplied by the SAFRAN atmospheric reanalysis (Vidal et al., 2010), aggregated at catchment scale. We used the temperature- and radiation-based formula proposed by Oudin et al. (2005) to compute potential evaporation. In every catchment, streamflow observations cover at least 20 years (40 years on average).

Hydrological model
The tests were performed with GR4J (Perrin et al., 2003), a daily lumped hydrological model. The model is parsimonious (four parameters to calibrate, two reservoirs, two unit hydrographs) and has been widely used in research studies focusing on hydrological model robustness (e.g. Coron et al., 2014; Broderick et al., 2016; Fowler et al., 2016). The two-parameter CemaNeige degree-day snow module (Valéry et al., 2014) was used to account for solid precipitation. The parameters of the snow module were fixed to median values, as recommended by Valéry et al. (2014) for catchments with a limited impact of snow.
The GR4J and CemaNeige models are used with the airGR R package (Coron et al., 2017, 2018). The parameters of the hydrological model were calibrated by optimizing the Kling-Gupta Efficiency (KGE, Gupta et al., 2009) computed on the square root of streamflow in order to limit error heteroscedasticity. The optimization algorithm is a simple procedure consisting of a prior global screening on a coarse predefined grid, followed by a local descent search from the best parameter set of the grid. The procedure has been successfully used in multiple studies involving GR4J (e.g. Mathevet, 2005; Coron et al., 2014).
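The objective function and the screening step can be illustrated as follows. This is not the airGR routine: the toy model, grid and data are invented for the example, and only the coarse-grid screening is shown (the procedure described above follows it with a local descent from the best grid point).

```python
import numpy as np

def kge(obs, sim):
    """Kling-Gupta Efficiency (Gupta et al., 2009)."""
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()    # variability ratio
    beta = sim.mean() / obs.mean()   # bias ratio
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)

def kge_sqrt(obs, sim):
    """KGE on square-root transformed flows, limiting error heteroscedasticity."""
    return kge(np.sqrt(obs), np.sqrt(sim))

def screen_grid(model, obs, grid):
    """Global screening: return the grid parameter set maximizing KGE(sqrt Q).
    A local descent search would then refine this starting point."""
    return max(grid, key=lambda params: kge_sqrt(obs, model(params)))

# Toy illustration: a one-parameter linear "model" driven by a rainfall proxy.
rain = np.abs(np.sin(np.arange(1000.0))) + 0.1
obs = 0.5 * rain
best = screen_grid(lambda a: a * rain, obs, grid=[0.1, 0.3, 0.5, 0.7])
```

Screening a coarse grid first makes the subsequent local search much less sensitive to its starting point, which matters for models with interacting parameters such as GR4J.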

DSST experiments
DSST experiments consist in selecting contrasted periods (according to some hydrologically relevant indicator) and performing a calibration-evaluation experiment. Our DSST experiments are based on three hydroclimatic variables. The procedure consists in dividing the time series into sub-periods of L consecutive years and selecting six of these sub-periods, chosen to be:
- the driest and the wettest in terms of precipitation;
- the warmest and the coldest in terms of temperature;
- the least and the most productive in terms of runoff ratio (computed as the ratio of mean observed streamflow to mean precipitation).
The model parameters are then calibrated on each sub-period and transferred to the sub-period of opposite climate. The process is summarized in Table 1.

Table 1. Summary of the different setups of the Differential Split-Sample Test. Q, P and T respectively stand for average observed streamflow, precipitation and temperature computed on the sub-periods.

Name of the DSST setup | "dry"  | "humid" | "warm" | "cold" | "unproductive" | "productive"
Calibration            | min P  | max P   | max T  | min T  | min Q/P        | max Q/P
Evaluation             | max P  | min P   | min T  | max T  | max Q/P        | min Q/P
The runoff ratio was preferred to the humidity index, since the latter is highly correlated with average precipitation in France and would therefore be redundant with the DSST experiments based on precipitation. Since the runoff ratio is computed from average streamflow, it cannot be used to predict model biases in future climate conditions. However, it estimates how catchments respond to precipitation forcings. Its use in the DSST may thus indicate how well a model is able to represent variations in catchment response to climatic conditions.
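The selection of the six sub-periods in Table 1 can be sketched as follows (assuming annual series split into non-overlapping L-year blocks; function names are illustrative):

```python
import numpy as np

def block_means(x, L):
    """Means of consecutive, non-overlapping L-year blocks of an annual series."""
    n_blocks = len(x) // L
    return np.array([x[i * L:(i + 1) * L].mean() for i in range(n_blocks)])

def dsst_setups(P, T, Q, L=5):
    """Indices of the (calibration, evaluation) blocks for the six DSST setups
    of Table 1, from annual precipitation P, temperature T and streamflow Q."""
    p, t = block_means(P, L), block_means(T, L)
    rr = block_means(Q, L) / block_means(P, L)   # runoff ratio per block
    lo = lambda v: int(np.argmin(v))
    hi = lambda v: int(np.argmax(v))
    return {
        "dry":          (lo(p), hi(p)),
        "humid":        (hi(p), lo(p)),
        "warm":         (hi(t), lo(t)),
        "cold":         (lo(t), hi(t)),
        "unproductive": (lo(rr), hi(rr)),
        "productive":   (hi(rr), lo(rr)),
    }
```

Each setup pairs a calibration block with the climatically opposite evaluation block, as in Table 1.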
The sub-period length for the DSST experiments has been fixed at L = 5 years, so as to match the length of the sub-periods involved in the computation of the PMR. The length of the sub-periods used in the computation of the PMR is discussed in Section 5. The length of the sub-periods used for the DSST is discussed in Appendix C. We remind the reader that the PMR is computed from model simulations obtained by calibrating the model on the whole time series, while the DSST results are obtained through calibration-evaluation on sub-periods of the time series. It should also be mentioned that model biases obtained in the DSST were calculated as absolute differences to 1, so that they could be compared to PMR values, which are positive by definition, as follows:

Bias_DSST = | Q_sim,eval / Q_obs,eval − 1 |

where Q_sim,eval and Q_obs,eval are the average simulated and observed streamflows on the evaluation sub-period. A drawback of this computation is that it removes the sign of model errors. Therefore, the sign of model errors in the different DSST setups has been analyzed in Appendix A. In the following, model bias obtained in the DSST will systematically be calculated in absolute terms without further notice.

Comparison of the distributions of PMR values and DSST bias
The PMR is theoretically designed to quantify the average bias that would be obtained from DSSTs of the model if these biases were calculated in absolute terms. The bias obtained for GR4J in each type of DSST setup is plotted in Figure 4. Compared to the absolute biases obtained in the different DSST setups, PMR values have the same order of magnitude as biases in precipitation- or temperature-based experiments. However, the distribution of PMR values exhibits less spread than DSST biases. In the case of DSSTs designed on changes in runoff ratio, model biases are larger than PMR values. The PMR thus seems to relate rather well to model biases observed in typical differential calibration-validation experiments, but also appears to underestimate model biases in highly adverse transfer conditions (see Appendix A for more details about DSST results). In summary, the results presented in Figure 4 simply indicate that, on average, the PMR is of the same order of magnitude as model bias in the DSST.

Assessment of the predictive ability of the PMR for model robustness
To further investigate the link between the PMR and model robustness as measured by the DSST, we plotted the average model bias across the DSST setups for each catchment against PMR values (Figure 5). The reader is reminded that model bias is calculated in absolute terms, so that there is no compensation among the six model biases averaged for each catchment.

This comparison was made in order to evaluate the ability of the PMR to capture the variations of model robustness across the catchment set. Figure 5 shows an acceptable correlation between the two indicators. Overall, the PMR seems to be a satisfactory approximation of model robustness, even if PMR values underestimate model bias in the worst catchments (and thus overestimate model robustness). The predictive power of the PMR for model bias is further confirmed in Table 2. It should be noted that we did not find any particular differences in topographic or climatic properties between catchments where PMR values and DSST biases closely match and catchments where they do not. Even though the PMR as defined in Equation (1) already shows a good agreement with DSST biases, we conducted a sensitivity analysis with the objective of identifying the way to compute the PMR that best matches the bias that would be obtained by applying DSST procedures. Therefore, we strived to define the metric so that it corresponds as closely as possible to the errors on streamflow volumes typically made by the model in adverse simulation conditions.

An important element to discuss in the definition of the PMR is the length of sub-periods on which model errors are computed.
Shorter sub-periods make it possible to reduce compensations between model errors, while longer sub-periods reduce the weight of years when the model performs drastically worse than in others or when there are large measurement errors in the data. Sub-period length may also influence the agreement between model biases in the DSST and PMR values.

We tested the sensitivity of the metric values to the length of the sub-periods used for its computation (Figure 6). PMR values decrease when the sub-period length used in the computation increases. This result indicates that interannual model errors on streamflow volumes tend to compensate when the PMR is computed on longer sub-periods. Therefore, sub-period length should preferably not be too long, in order to avoid a loss of information about model bias across the years. This statement is corroborated by the slight decrease in metric variability when sub-period length increases (the standard deviation of the metric on the catchment set decreases from 7% to 5%), which suggests that differences in model robustness across the catchment set are less clear when sub-periods are too long.
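This compensation effect can be reproduced in miniature as follows (a self-contained sketch working on annual mean flows with sliding L-year windows; the synthetic series and noise model are invented for the illustration):

```python
import numpy as np

def pmr_annual(q_obs_y, q_sim_y, L):
    """PMR computed from annual mean flows with L-year sliding windows
    (a simplification of the daily computation for this sketch)."""
    err_total = q_sim_y.mean() - q_obs_y.mean()
    deviations = [abs((q_sim_y[i:i + L].mean() - q_obs_y[i:i + L].mean()) - err_total)
                  for i in range(len(q_obs_y) - L + 1)]
    return 2.0 * np.mean(deviations) / q_obs_y.mean()

rng = np.random.default_rng(42)
q_obs_y = np.full(40, 10.0)
q_sim_y = q_obs_y + rng.normal(0.0, 1.0, size=40)  # noisy interannual errors
values = {L: pmr_annual(q_obs_y, q_sim_y, L) for L in (1, 3, 5, 10)}
```

With purely interannual noise, errors average out over longer windows, so the computed PMR decreases as L grows, mirroring the behavior reported in Figure 6.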

Effect of sub-period length on the reliability of the PMR
Previous results show that the length of the sub-periods influences the values of the PMR. Given that the metric should relate to model biases in the DSST to be useful and reliable, it is necessary to verify that its correlation with DSST biases remains high when the length of the sub-periods on which the PMR is computed varies. Figure 7 displays the evolution of the correlation between PMR values and the DSST biases averaged for each catchment. The correlation of the PMR with DSST biases indeed depends on the length of the sub-periods. Overall, the best scores are obtained for lengths between 2 and 5 years. Therefore, the results described in Table 2 show a near-optimal situation for the reliability of the metric, although the worst correlation score, associated with 10-year sub-periods, still demonstrates a fair agreement between PMR values and DSST biases. Interestingly, computing the PMR on very short sub-periods does not benefit the reliability of the metric. We suggest that an adequate sub-period length for the PMR should be close to the length of the evaluation periods of the DSST, so that the compared model errors relate to similar temporal scales and patterns. This issue is further discussed in Appendix C.

The choice of an adequate sub-period length
Overall, the choice of the best sub-period length for the computation of the PMR must satisfy two conditions: i) be small enough to limit the loss of information about model robustness, and ii) maximize correlation scores with DSST biases in order to ensure the reliability of the metric. The first condition relates to the sensitivity of the PMR to the model's actual robustness, while the second condition relates to the reliability of the PMR in different DSST setups for the evaluation of model robustness. We would suggest that the interpretability of the moving bias curve associated with the PMR constitutes a third condition for the choice of the sub-period length. In order to be interpreted easily, the curve should be smooth enough to clearly distinguish periods during which a model overestimates or underestimates observed streamflow, and should thus involve long enough sub-periods. Of course, in cases where only PMR values are used, without any analysis of the moving bias curves, this issue is incidental.
Under the conditions of our experiment, we found that lengths between 2 and 5 years fulfilled the second requirement. The sensitivity requirement would lead to computing the PMR on 2-year sub-periods; however, we acknowledge from our experience with moving bias curves that such sub-periods are too short for quick visual analyses. Therefore, we consider 3-5 years to be adequate lengths for the computation of the PMR.
However, it should be pointed out that these results are likely to be context-dependent and might differ for other models or for another catchment set. For these reasons, the aim of the study was to demonstrate that it is possible to assess hydrological model robustness to climatic changes without performing a DSST, rather than to demonstrate that the PMR is perfectly reliable and should substitute for Split-Sample Tests. Moreover, the length of the sub-periods involved in the computation of the PMR should also reflect the particular needs of each model evaluation study.

Conclusions
Traditional methods to assess the robustness of hydrological models to changes in climatic conditions rely on calibration-evaluation exercises, preferably performed on climatically different periods of a time series. Although the DSST and its variants represent the most appropriate procedures available for model-robustness evaluation, they are never used on models that are so complex that they need to be calibrated on all the available data. Furthermore, the DSST is based on the selection of hydro-climatic variables whose change is supposed to place the model in unfavorable conditions, but whose actual link with robustness is strongly context-dependent.
In this technical note, we propose a performance metric able to evaluate model robustness on a single model realization. The PMR thus does not need multiple calibrations of the model on sub-periods of the time series and can be used for any kind of hydrological model. The PMR is constructed as an indicator of the flatness of the "moving bias curve," which is a graphical representation of the temporal evolution of model bias across sliding sub-periods of the data.
The reliability of the PMR was compared with the results obtained by applying different DSST setups to GR4J, a typical conceptual model, on a dataset of 377 French catchments. We tested the ability of the metric to estimate the model bias obtained by transferring model parameters from calibration periods to climatically opposite evaluation periods, for six types of hydro-climatic changes (changes in both directions of average precipitation, average temperature and average runoff ratio).
Our results show that the PMR relates well to absolute model biases in the DSST, especially when the biases derived from the six DSST setups are averaged. Although the metric values do not vary much across the catchment set, this sensitivity can be enhanced by reducing the length of the sub-periods on which the PMR is computed. An analysis of the correlation between the PMR and model biases in the DSST for different sub-period lengths showed that the reliability of the PMR was best when the metric was computed on sub-periods with lengths between 2 and 5 years. Ultimately, the need to find a balance between metric sensitivity and reliability led us to recommend computing the PMR on 3- to 5-year sub-periods for GR4J.
Our results should encourage hydrological modelers to include the PMR as part of their panoply of evaluation metrics to judge their models or to inform model selection in climate change impact studies as it can be applied to any kind of model.
Further work should examine the potential of the PMR to be incorporated as a hydrological signature in multi-objective calibration procedures, and as an additional constraint on model parameters governing temporal changes in catchment response to climatic conditions.
Code availability. The GR4J model is freely available in the airGR R package. The code for calculating the PMR can be made available upon request.
Data availability. Streamflow data were provided by the French database "Banque HYDRO" and are available at http://www.hydro.eaufrance.fr/.

Meteorological data were provided by Météo-France and must be requested from this institute.

Appendix A: Sign of model errors in the DSST experiments

In DSSTs based on runoff ratio, the model underestimates streamflow volumes when runoff ratio increases and, conversely, overestimates streamflow volumes when runoff ratio decreases. DSSTs based on temperature yield situations in between, since median model bias is slightly negative (respectively positive) when the model is calibrated on warmer (respectively colder) periods. When calculated in absolute terms, model bias was larger in DSSTs based on runoff ratio than in experiments based on temperature and precipitation (Figure A1). Therefore, robustness issues for the model appear to be caused less by climatic changes than by modifications of the catchment response to precipitation.

This result is in line with the conclusions of Saft et al. (2016), who tested a number of hydrological models in southeastern Australia during prolonged droughts. The authors observed that many of these models produced biased simulations of streamflow during the drought if, and only if, the catchments had experienced shifts in the rainfall-runoff relationship from pre-drought to drought conditions. Our results extend this statement for GR4J to situations where runoff ratio increases and show opposite model biases depending on the sign of the change.
Figure A1. Distribution of model biases in the DSST for each type of setup. The boxplots represent the 5, 25, 50, 75 and 95 quantiles and the small crosses denote outliers. Blue, red and green boxplots are respectively associated with DSST setups based on precipitation, temperature and runoff ratio.

Appendix B: The choice of an adequate mathematical expression
The mathematical expression of the PMR also results from a choice that needs to be discussed. For example, Coron et al. (2014) proposed computing the flatness of the moving bias curve as the standard deviation of model bias on the sub-periods.
We discussed the mathematical form chosen for the PMR by comparing the metrics defined in Equation (1) and Equation (B1). The latter computes the flatness of the moving bias curve as the standard deviation of the model bias across the sub-periods, each bias being normalized by the average observed streamflow of its own sub-period:

PMR_B1 = sqrt( (1/N) Σ_{i=1}^{N} (B_i − B̄)² ),  with B_i = Q_sim,i / Q_obs,i and B̄ the average of the B_i    (B1)

Figure B1 shows the differences between the metrics in Pearson's correlation with the model biases obtained in the DSST performed on 5-year periods. The length of the sub-periods used in the PMR varies from 1 to 10 years. It appears that short sub-periods benefit the reliability of the PMR (Equation (1)), whereas longer sub-periods benefit the alternative PMR (Equation (B1)). Choosing a 5-year sub-period for the computation of the PMR does not, on average, favor either formulation of the metric. As mentioned previously, we sought to formulate the PMR so that it maximizes correlation with DSST biases while enhancing the sensitivity of the metric. For this reason, the better agreement of the PMR as formulated in Equation (1) with DSST biases obtained for shorter sub-periods, where the PMR is most sensitive to model robustness, makes it more suitable. Therefore, the PMR computed as the mean of absolute model errors on 5-year sub-periods is best suited to evaluating model robustness.
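For comparison, the alternative metric can be sketched as follows (a reconstruction consistent with the description by Coron et al. (2014): the standard deviation of sub-period bias ratios, each normalized by its own sub-period observed mean; annual series and sliding windows are assumptions of this sketch):

```python
import numpy as np

def pmr_alternative(q_obs_y, q_sim_y, L):
    """Alternative PMR (Equation (B1)): standard deviation of the sub-period
    bias ratios B_i = mean(q_sim_i) / mean(q_obs_i) over sliding L-year windows."""
    b = np.array([q_sim_y[i:i + L].mean() / q_obs_y[i:i + L].mean()
                  for i in range(len(q_obs_y) - L + 1)])
    return float(b.std())
```

Note the different invariance: a constant multiplicative bias gives a zero value here, since all the B_i coincide.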
In addition, we note that the alternative PMR corresponds better overall to DSST experiments based on runoff ratio, which, we remind the reader, yielded the largest model biases. The fact that model biases are squared in the computation of the alternative PMR enhances the weight of sub-periods where the model simulations are worst, and thus potentially where the observed runoff ratio differs most from the average. It is possible that calculating the error differences in absolute terms rather than squared makes the metric less dependent on years when the model is drastically worse or on years with large measurement errors.

Furthermore, with the PMR as defined in Equation (1), the fact that the error on each sub-period is normalized by the average observed streamflow of the total period instead of the average observed streamflow of the sub-period may put less emphasis on very dry years, when observed streamflow is close to zero. Model bias in such dry years can be undesirably large, so the PMR as defined in Equation (1) could be a better option for arid catchments. This also makes it possible to compute the PMR in catchments where rivers might cease to flow for long periods of time, without any further adjustments to the data. In addition, the interpretation of the PMR is perceived as more straightforward in Equation (1) than in Equation (B1), as model error is simply compared to the observed streamflow averaged over the whole time series rather than to a quantity that varies across the sub-periods.
Appendix C: Reliability of the metric for different DSST sub-period lengths

Our results on the sensitivity of the PMR to the length of the sub-periods on which it is computed suggest that reducing the length of the sub-periods involved in the computation of the PMR might slightly reduce the loss of information about model robustness, and thus that selecting 1-year sub-periods is the best option. Computed this way, the PMR would represent average annual model bias. Since in DSST experiments model bias is usually computed on the whole evaluation period, i.e., on periods that may vary in length from 1 to many years, it is unclear whether the metric would be representative of model biases computed on periods longer than 1 year.

To evaluate the representativeness of the PMR in such conditions, we computed the correlation between PMR values obtained for different sub-period lengths of the moving bias curve, as in Figure 6, and DSST biases obtained for different sub-period lengths of the DSST experiments.
Note that, as previously, PMR was computed on the whole time series after calibrating the model on the whole time series.
The heat map of correlations between PMR values and average DSST biases for varying sub-period lengths is displayed in Figure C1.

Figure C1. Pearson's correlations between the PMR computed on varying sub-period lengths (vertical axis) and the average DSST biases obtained on varying sub-period lengths for calibration and evaluation (horizontal axis).

The heat map clearly shows that shorter sub-periods for the computation of the PMR generally relate better to shorter calibration periods in the DSST experiments. Conversely, longer sub-periods for the computation of the PMR relate better to longer calibration periods in the DSST experiments. This result is not surprising, given that the PMR computed on n-year sub-periods represents the average model bias as computed on n years, and should therefore show similar patterns to model biases computed in DSST experiments involving n-year periods.
However, some sub-period lengths for the PMR computation exhibit a high correlation with a wider range of DSST setups. By computing row-wise averages in the matrix, we observed that PMR computations based on 3- to 5-year sub-periods reach an average correlation of 0.74 with DSST biases across the range of sub-period lengths. In comparison, the correlation coefficient of PMR values computed on 1-year sub-periods is on average 0.67. Therefore, defining sub-periods with lengths between 3 and 5 years may be the most suitable choice to ensure PMR representativeness across a wide spectrum of possible DSST experiments.