Impact of bias nonstationarity on the performance of uni- and multivariate bias-adjusting methods

Climate change is one of the biggest challenges currently faced by society, with an impact on many systems, such as the hydrological cycle. To locally assess this impact, Regional Climate Model (RCM) simulations are often used as input for hydrological rainfall-runoff models. However, RCM results are still biased with respect to the observations. Many methods have been developed to adjust these biases, but only during the last few years have methods been proposed that also adjust biases in the correlation between the variables. This correlation adjustment is especially important for compound event impact analysis. As a simple example of such compound events, hydrological impact assessment is used here, as hydrological models often need multiple locally unbiased input variables to ensure an unbiased output. However, it has been suggested that multivariate bias-adjusting methods may perform poorly under climate change conditions because of bias nonstationarity. In this study, two univariate and three multivariate bias-adjusting methods are compared with respect to their performance under climate change conditions. To this end, the methods are calibrated in the late 20th century (1970-1989) and validated in the early 21st century (1998-2017), in which the effect of climate change is already visible. The variables adjusted are precipitation, evaporation and temperature, of which the former two are used as input for a rainfall-runoff model, to allow for the validation of the methods on discharge. Although not used for discharge modelling, temperature is a commonly adjusted variable in both uni- and multivariate settings and therefore important to take into account. The methods are also evaluated using indices based on the adjusted variables, the temporal structure, and the multivariate correlation. For precipitation, all methods decrease the bias in a comparable manner.
However, for many other indices the results differ considerably between the bias-adjusting methods. The multivariate methods often perform worse than the univariate methods, a result that is especially notable for temperature and evaporation. As these variables have already changed the most under climate change conditions, this reinforces the opinion that the multivariate bias-adjusting methods are not yet fit to cope with nonstationary climate conditions. Although the effect is slightly dampened by the hydrological model, our analysis still reveals that, to date, the simpler univariate bias-adjusting methods are preferred for assessing climate change impact.

https://doi.org/10.5194/hess-2020-639 Preprint. Discussion started: 8 December 2020. © Author(s) 2020. CC BY 4.0 License.


Introduction
The influence of climate change is felt throughout many regions of the world, as becomes evident from the higher frequency or intensity of natural hazards, such as floods, droughts, heatwaves and forest fires (IPCC, 2012). As these intensified natural hazards threaten society, it is essential to be prepared for them. Knowledge on future climate change is obtained by running Global Climate Models (GCMs), creating large ensemble outputs such as in phase 6 of the Coupled Model Intercomparison Project (CMIP6) (Eyring et al., 2016). Although they are informative on a global scale, the generated data are too coarse for local climate change impact assessments. To bridge the gap from the global to the local scale, Regional Climate Models have become a standard application (Jacob et al., 2014), using the output from GCMs as input or boundary conditions. Although the information provided by both GCMs and RCMs is very valuable, both are biased with respect to the observations, especially for precipitation. The biases can occur in any statistic and are commonly defined as "a systematic difference between a simulated climate statistic and the corresponding real-world climate statistic" (Maraun, 2016).
These biases are caused by temporal or spatial discretisation and unresolved or unrepresented physical processes (Teutschbein and Seibert, 2012; Cannon, 2016). An important example of the latter is convective precipitation, which can only be resolved by very high resolution models. Although the improvement of models is an important area of research (Prein et al., 2015; Kendon et al., 2017; Helsen et al., 2019; Fosser et al., 2020), such improved models are computationally expensive. As such, it is still necessary practice to statistically adapt the climate model output to adjust the biases (Christensen et al., 2008; Teutschbein and Seibert, 2012; Maraun, 2016). Many different bias-adjusting methods exist (Teutschbein and Seibert, 2012; Gutiérrez et al., 2019). They all calibrate a transfer function using the historical simulations and historical observations and apply this transfer function to the future simulations to generate future 'observed values' or an adjusted future. Of all the different methods, the quantile mapping method (Panofsky et al., 1958) was shown to be the generally best-performing method (Rojas et al., 2011; Gudmundsson et al., 2012). Quantile mapping adjusts biases in the full distribution, whereas most other methods only adjust biases in the mean and/or variance.
An important problem with quantile mapping and most other commonly used methods is that they are univariate and do not adjust biases in the multivariate correlation. Although quantile mapping can retain the climate model's multivariate correlation (Wilcke et al., 2013), the ability of univariate methods to improve this correlation has been questioned (Ehret et al., 2012; Hewitson et al., 2014). This is important for impact assessment, as local impact models often need multiple input variables and many high-impact events are caused by the co-occurrence of multiple phenomena, the so-called 'compound events' (Zscheischler et al., 2018, 2020). For example, floods can be characterised by a rainfall-runoff model using evaporation and precipitation time series as input. If the correlation between these variables is biased with respect to the observations, then it can be expected that the model output is biased as well. This results in a higher uncertainty when using these models and thus in the resulting assessment. During the past decade, multiple methods have been developed to counter this problem. The first methods focused on the adjustment of two jointly occurring variables, most often precipitation and temperature, such as those by Piani and Haerter (2012) and Li et al. (2014). Subsequent studies comparing the added value of multivariate methods reached mixed conclusions: some found that multivariate methods and univariate methods performed similarly, while Meyer et al. (2019) could not draw definitive conclusions. These studies vary in set-up, adjusted variables and study area, which all could have caused the difference in added value. In all three studies, the same method, namely the Multivariate Bias Correction in n dimensions (MBCn) (Cannon, 2018), was the basis for comparison. Only recently, the first studies comparing multiple multivariate bias-adjusting methods were published (François et al., 2020; Guo et al., 2020). The study by François et al. (2020) focused on the different principles underlying the multivariate bias-adjusting methods and concluded that the choice of method should be based on the end user's goal. In addition, they noticed that, so far, all multivariate methods fail to represent the temporal structure of a time series. In contrast to the focus of François et al. (2020), Guo et al. (2020) studied the performance of multivariate bias-adjusting methods for climate change impact assessment and concluded that multivariate methods could be interesting in this context. However, they also noticed that the performance of the multivariate methods was lower in the more recent validation period and suggested that this could be caused by bias nonstationarity. As the use of multivariate bias-adjusting methods could be an important tool for climate change impact assessment, this deserves more attention.
The bias stationarity (or bias time invariance) assumption is the most important assumption for bias correction. It implies that the bias is the same in the calibration and validation or future periods and that the transfer function based on the calibration period can consequently be used in the future period. However, this assumption does not hold due to different types of nonstationarity induced by climate change, which may cause problems (Milly et al., 2008; Derbyshire, 2017). In the context of bias adjustment, this problem has been known for several years (Christensen et al., 2008; Ehret et al., 2012), but has not received a lot of attention. A few authors have tried to propose new types of bias relationships (Buser et al., 2009; Ho et al., 2012; Sunyer et al., 2014; Kerkhoff et al., 2014). Recently, it has been suggested that it is best to assume a non-monotonic bias change (Van Schaeybroeck and Vannitsem, 2016). Some authors suggested that bias nonstationarity could be an important source of uncertainty (Chen et al., 2015; Velázquez et al., 2015; Wang et al., 2018; Hui et al., 2019), but not all found clear indications of bias nonstationarity (Maraun, 2012; Piani et al., 2010; Maurer et al., 2013).
The availability of new methods and more data enables a more coherent assessment of the bias (non)stationarity issue. By comparing three multivariate bias-adjusting methods in a climate change context with possible bias nonstationarity, some of the remaining questions in François et al. (2020) and Guo et al. (2020) can be answered. The three multivariate bias-adjusting methods that will be compared in this study are 'Multivariate Recursive Quantile Nesting Bias Correction' (MRQNBC, Mehrotra and Sharma (2016)), MBCn (Cannon, 2018) and 'dynamical Optimal Transport Correction' (dOTC, Robin et al. (2019)). These three methods give a broad view of the different multivariate bias adjustment principles, which we elaborate on in Section 3.3.
As a baseline, two univariate bias-adjusting methods will be used: Quantile Delta Mapping (QDM, Cannon et al. (2015)) and modified Quantile Delta Mapping (mQDM, Pham (2016)). QDM is a classical univariate bias-adjusting method and is chosen for this analysis as it is a robust and relatively common quantile mapping method, especially as one of the subroutines in the multivariate bias-adjusting methods (e.g. Nguyen et al., 2016; Cannon, 2018). mQDM, on the other hand, is one of the so-called 'delta change' methods, which are based on an adjustment of the historical time series. Using these univariate bias-adjusting methods, we can assess whether multivariate and univariate bias-adjusting methods differ in their response to possible bias nonstationarity. The methods will be compared by applying them for the bias adjustment of precipitation, potential evaporation and temperature. The bias-adjusted time series will be used as input for a hydrological model in order to simulate the discharge. Discharge time series are the basis for flood hazard calculation, but can also be considered an interesting source of validation themselves (Hakala et al., 2018). Although temperature is not needed as an input for the hydrological model, it is, together with precipitation, the most common variable to be adjusted in similar studies and therefore it is also included here. In order to mimic a climate change context, the 'historical' or calibration time series runs from 1970 to 1989 and the 'future' or validation time series runs from 1998 to 2017, which lies in the recent past. In the latter time frame, effects of climate change are already visible (IPCC, 2013). The change of some biases from the calibration to the validation time series will be calculated, to indicate the extent of the bias nonstationarity. Maurer et al. (2013) proposed the R index for this purpose (see Section 2.4).
Calculating the bias nonstationarity between both periods will give an indication of the impact of a changing bias on climate impact studies for the end of the 21st century. As Chen et al. (2015) put it: "If biases are not constant over two very close time periods, there is little hope they will be stationary for periods separated by 50 to 100 years."

Data and validation

Data
The observational data used were obtained from the Belgian Royal Meteorological Institute (RMI) Uccle observatory. The most important time series used is the 10-min precipitation amount, gauged with a Hellmann-Fuess pluviograph, from 1898 to 2018. An earlier version of this precipitation dataset was described by Demarée (2003) and analyzed in De Jongh et al. (2006).
Multiple other studies have used this time series (Verhoest et al., 1997; Verstraeten et al., 2006; Vandenberghe et al., 2011; Willems, 2013). The 10-min precipitation time series was aggregated to the daily level to be comparable with the other time series used.

For the multivariate methods, the precipitation time series was combined with 2 m air temperature and potential evaporation time series. The daily potential evaporation was calculated by the RMI from 1901 to 2019, using the Penman formula for a grass reference surface (Penman, 1948) with variables measured at the Uccle observatory. Daily average temperatures were obtained using measurements from 1901 to 2019. As the last complete year for precipitation was 2017, the data were used from 1901 to 2017, amounting to 117 years of daily data.
The IPCC report (IPCC, 2013) clearly states the influence of climate change on different variables. For Belgium, this is illustrated by Fig. 1, in which the temperature and evaporation anomalies for the 21st century are all higher than the long-term mean value. However, for precipitation, the effect of climate change is not yet visible.

For the simulations, data from the EURO-CORDEX project (Jacob et al., 2014) were used. The Rossby Centre regional climate model RCA4 was used (Strandberg et al., 2015), as it is one of the few RCMs with potential evaporation as an output variable. This RCM was forced with boundary conditions from the MPI-ESM-LR GCM (Popke et al., 2013). Historical data and scenario data for the grid cell comprising Uccle were obtained for 1970-2005 and 2006-2100, respectively. The former time frame is limited by the earliest available data from the RCM. The latter time frame was only used until 2017, in accordance with the observational data. As the climate change scenario, RCP4.5 forcing was used in this paper (Van Vuuren et al., 2011).
Since only 'near future' (from the model point of view) data were used, the choice of forcing does not have a large impact.

However, when studying scenarios in a time frame further away from the present, using an ensemble of forcings is more appropriate, to capture the uncertainty regarding future climate change impact. Evaluations of the RCA4 model have shown that there is a bias in precipitation, especially in winter (Strandberg et al., 2015), but this bias is in line with the biases from other EURO-CORDEX models.
As Uccle (near Brussels) is situated in a region with small topographic differences, it is assumed that the conditions in Uccle can be applied anywhere within the climate model grid cell and that the variance within this cell is about the same. This assumption can be made as long as the resulting adjusted data are not used for extremely localized studies, such as urban hydrology impact assessment.

Time frames
As mentioned in the introduction, it is important to assess bias-adjusting methods in the context in which they will be used, i.e. under climate change conditions. The time series used in this study were chosen accordingly: 1970-1989 was chosen as the 'historical' or calibration time period and 1998-2017 as the 'future' or validation time period. Time series of 20 years were used here, although it is advised to use 30 years of data for robust calculations (Berg et al., 2012; Reiter et al., 2018).
However, as no climate model data prior to 1970 are available, using 30 years of data would have led to overlapping time series.

Validation framework
An important aspect in bias adjustment is the validation of the methods. Different approaches are available, of which a pseudo-reality experiment (Maraun, 2012) is one of the most used. In this approach, each member of a model ensemble is in turn used as the reference in a cross-validation. However, while such a set-up is useful when comparing bias-adjustment methods, it only mimics a real application context. When sufficient observations are available, a 'pseudo-projection' set-up (Li et al., 2010) can be used. This set-up resembles 'differential split-sample testing' (Klemeš, 1986) and is more in agreement with a practical application of bias-adjusting methods. Differential split-sample testing has been used in a bias adjustment context by Teutschbein and Seibert (2013), by constructing two time series with respectively the driest and wettest years. In our case study, it is assumed that the two time series differ enough because of climate change. Consequently, the approach is simple, and as the validation is not set in the future, it is considered a 'pseudo-projection'.
Besides the choice of time frames and data, the choice of validation indices is also of key importance. Maraun and Widmann (2018a) stress that these indices should only be indirectly affected by the bias adjustment, as validating only on adjusted indices can be misleading. Such adjusted indices are the precipitation intensity, temperature and evaporation, which are used to build the transfer function in the historical setting and should be corrected by construction. Under bias stationarity, this correction will be carried over to the future, possibly hiding small inconsistencies that may arise for extreme values. If the bias is not stationary, the effect might differ between adjusted and indirectly affected indices. As such, besides the three adjusted variables (indices 1 to 3 in Table 1) and their correlations (indices 4 to 12, which are directly adjusted by some of the methods), also indices based on the precipitation occurrence and on the discharge Q are used. The occurrence-based indices (13 to 16) allow for assessing how the methods influence the precipitation time series structure. The discharge-based indices (17 and 18) allow for the assessment of the impact of the different bias-adjusting methods on simulated river flow. The discharge-based indices combine the information of the other indices by routing through the rainfall-runoff model. They are the most important aspect of the assessment, as they indicate the natural hazard. ETCCDI (Expert Team on Climate Change Detection and Indices) precipitation indices (Zhang et al., 2011) have also been considered and calculated. However, these are not included in this paper, as the differences in ETCCDI indices were minor and did not allow a clear distinction between the different methods. All
indices were calculated taking all days into account, instead of only calculating them on wet days, as some of the multivariate bias-adjusting methods do not discriminate between wet and dry days in their adjustment.

Bias nonstationarity
In a study on possible changes in bias, Maurer et al. (2013) proposed the R index:

R = 2 |bias_f - bias_h| / (|bias_f| + |bias_h|),  (1)

where bias_f and bias_h are the biases in respectively the future and historical time series, calculated on the basis of the observations and raw climate simulations. The R index takes a value between 0 and 2. If the index is greater than one, the difference in bias between the two sets is larger than the average bias of the model and it is likely that the bias adjustment would degrade the RCM output rather than improve it. The index is calculated for the indices used for validation in order to have an indication of the influence of bias nonstationarity on these indices. Besides the indices, the R index is also calculated for the average and standard deviation of each variable, in order to be able to more easily visualise the changes in distribution.
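The R index is the change in bias between the two periods scaled by the mean absolute bias of the model over both periods. A minimal sketch (the function name and the zero-bias guard are ours):

```python
def r_index(bias_h: float, bias_f: float) -> float:
    """R index of Maurer et al. (2013): the absolute change in bias
    between two periods, scaled by the mean absolute bias.
    Ranges from 0 (stationary bias) to 2 (sign-flipping bias)."""
    mean_abs_bias = (abs(bias_f) + abs(bias_h)) / 2.0
    if mean_abs_bias == 0.0:
        return 0.0  # no bias in either period, hence no bias change
    return abs(bias_f - bias_h) / mean_abs_bias
```

A value above 1 signals that the bias changed by more than the average bias itself, i.e. a stationarity-based adjustment may degrade rather than improve the simulation.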

Hydrological model
Similar to Pham et al. (2018), we use the Probability Distributed Model (PDM, Moore (2007)), a lumped conceptual rainfall-runoff model, to calculate the discharge for the Grote Nete watershed in Belgium. This model uses precipitation and evaporation time series as inputs to generate a discharge time series. The PDM as used here was calibrated by Cabus (2008) using the Particle Swarm Optimization algorithm (PSO, Eberhart and Kennedy (1995)). As in Pham et al. (2018), it was assumed that the differences between the meteorological conditions in the Grote Nete watershed and in Uccle are negligible, and that the adjusted data for the Uccle grid cell can thus be used as forcing for the PDM. Furthermore, the goal is not to make predictions, but to assess the impact of different bias-adjusting methods on the discharge values. To calculate the bias on the indices, observed, raw and adjusted RCM time series were used as forcing for this model. The discharge time series generated with the observations is considered to be the 'observed' discharge, and biases are calculated with respect to this time series.

The residual biases relative to the observations and to the model bias are often used in this paper to graphically present and interpret the results. These residual biases are based on the 'added value' concept (Di Luca et al., 2015) and enable a comparison based on two aspects. The first aspect is the performance in removing the bias; the second is the extent of the bias removal in comparison with the original value of the corresponding index in the observation time series. The use of the residual biases allows for a detailed study and comparison of the effect of bias adjustment on the different indices.

The residual bias relative to the observations RB_O for an index k is calculated as follows:

RB_O(k) = |adj(k) - obs(k)| / obs(k),  (2)

with raw(k) the raw climate model simulations, adj(k) the adjusted climate model simulations and obs(k) the observed values for index k. The residual bias relative to the model bias RB_MB for an index k is calculated as follows:

RB_MB(k) = |adj(k) - obs(k)| / |raw(k) - obs(k)|.  (3)

Absolute values are used in Eqs. (2) and (3), thus neglecting a possible change of sign of the bias. If the values of these residual biases are lower than 1 for an index, the method performs better than the raw RCM for this index. The best methods have low scores on both residual biases for as many indices as possible.
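Assuming the residual biases are defined as RB_O(k) = |adj(k) - obs(k)| / obs(k) and RB_MB(k) = |adj(k) - obs(k)| / |raw(k) - obs(k)| (our reading of the description; function names are ours), they can be sketched as:

```python
def rb_o(raw_k: float, adj_k: float, obs_k: float) -> float:
    """Residual bias relative to the observations for index k:
    remaining bias after adjustment, scaled by the observed value."""
    return abs(adj_k - obs_k) / abs(obs_k)

def rb_mb(raw_k: float, adj_k: float, obs_k: float) -> float:
    """Residual bias relative to the model bias for index k:
    remaining bias after adjustment, scaled by the raw model bias.
    Below 1 means the adjustment improved on the raw RCM."""
    return abs(adj_k - obs_k) / abs(raw_k - obs_k)
```

For example, if an index has observed value 5, raw value 10 and adjusted value 6, RB_MB = 1/5 = 0.2, so the adjustment removed 80 % of the raw model bias.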

Occurrence-bias adjustment: Thresholding
One of the deficiencies of RCMs, especially in Northwest Europe, are the so-called 'drizzle days' (Gutowski et al., 2003; Themeßl et al., 2012; Argüeso et al., 2013), i.e. the simulation of a small amount of precipitation on days that are supposed to be dry. This has an influence on the temporal structure of the simulated time series and should thus be adjusted (Ines and Hansen, 2006). In this study, we use the thresholding occurrence-bias-adjusting method. Thresholding is one of the most common occurrence-bias-adjusting methods and has been in use for many years (e.g. Hay and Clark (2003); Schmidli et al. (2006); Ines and Hansen (2006)). This method is only applicable in regions where the assumption holds that the simulated time series has more wet days than the observed time series. This is the case for Northwest Europe (Themeßl et al., 2012) and Belgium in particular. An advanced version of the thresholding method is used here. To adjust the number of wet days, the frequencies of dry days in the observations and in the simulations are calculated. The difference between the two frequencies, ∆N, is the number of days of the simulated time series that have to be adapted. The simulated series is adapted by first sorting the wet days and then setting the ∆N lowest wet days of the simulation time series to 0. ∆N is computed for the past and applied in the future and consequently relies on the bias stationarity assumption. However, as thresholding is used prior to all methods, the influence of possible bias nonstationarity on ∆N is assumed to be negligible.
In this advanced version of thresholding, some considerations are made. First, a day is considered wet if its simulated precipitation amount is above 0.1 mm, to account for measurement errors in the observations. Second, the adjustment is done on a monthly basis, to retain a realistic temporal structure. This implies that to correct the number of wet days in month m, all days of month m of the time series are selected. Third, both historical and future simulations are adjusted during the same calculation step, to ensure a sound comparison during the intensity phase of the adjustment. If only the historical or only the future time series had been adjusted, the assumption that the bias can be transferred from the historical to the future time period would be impaired. The thresholding method is summarized in Algorithm 1.
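The core of this wet-day thresholding can be sketched in a few lines (a single-series sketch assuming equal-length observed and simulated series, so dry-day counts can stand in for frequencies; the function name is ours, and the full method applies this per calendar month):

```python
def threshold_wet_days(obs, sim, wet=0.1):
    """Set the smallest simulated wet days to zero so that the
    simulated dry-day count matches the observed one."""
    n_dry_obs = sum(1 for p in obs if p <= wet)
    n_dry_sim = sum(1 for p in sim if p <= wet)
    delta_n = n_dry_obs - n_dry_sim  # extra dry days needed in sim
    if delta_n <= 0:
        return list(sim)  # 'more simulated wet days' assumption violated
    # sort the simulated wet-day amounts and zero the delta_n smallest
    wet_sorted = sorted((p, i) for i, p in enumerate(sim) if p > wet)
    out = list(sim)
    for p, i in wet_sorted[:delta_n]:
        out[i] = 0.0
    return out
```

Only the lowest wet-day amounts are removed, so the large precipitation events that dominate the intensity adjustment are left untouched.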

Algorithm 1 Thresholding
Input: Historical observations X^ho, historical simulations X^hs, future simulations X^fs
Output: Adjusted historical simulations X^hs_out and adjusted future simulations X^fs_out
for each month m do
    Select all days of month m in X^ho, X^hs and X^fs
    Calculate the dry-day frequencies of X^ho and X^hs for month m and their difference ∆N
    Sort the wet days of X^hs and X^fs for month m and set the ∆N lowest to 0
    Replace the data in X^hs_out with the adjusted data for month m
    Replace the data in X^fs_out with the adjusted data for month m
end for

Univariate intensity-bias-adjusting methods

Quantile Delta Mapping

The Quantile Delta Mapping (QDM) method was first proposed by Li et al. (2010). Its main idea is to preserve the climate simulation trends: it takes trend nonstationarity (changes in the simulated distribution) into account to a certain degree. Although it handles temperature adjustments well, it gives unrealistic values for precipitation and was therefore extended by Wang and Chen (2014) for precipitation adjustment. A comparison with other quantile mapping methods by Cannon et al. (2015) showed this method to perform best with respect to the preservation of trends. Cannon et al. (2015) bundled both the method by Li et al. (2010) (Equidistant CDF-matching) and Wang and Chen (2014) (Equiratio CDF-matching) under the name Quantile Delta Mapping, because of the similarity with delta change methods (which are described in e.g. Olsson et al. (2009), Willems and Vrac (2011) and Räty et al. (2014)).
Mathematically, this method can be written as

x^fa = x^fs + F_ho^{-1}(F_fs(x^fs)) - F_hs^{-1}(F_fs(x^fs))  (4)

in the additive case, and

x^fa = x^fs · F_ho^{-1}(F_fs(x^fs)) / F_hs^{-1}(F_fs(x^fs))  (5)

in the ratio or multiplicative case, with F the cumulative distribution function of the time series indicated by its subscript and F^{-1} the corresponding inverse. The superscripts hs, ho, fs and fa indicate respectively the historical simulations, the historical observations, the future simulations and the adjusted future. In this paper, the additive version is used for temperature time series and the multiplicative one for precipitation and evaporation time series. This choice is based on the work of Wang and Chen (2014), who have shown that using the additive adjustment for precipitation results in unrealistic precipitation values, and who introduced a multiplicative adjustment. For evaporation, we follow the few available studies (e.g. Lenderink et al. (2007)) in using the same adjustment as for precipitation.
For the implementation, empirical CDFs were used, calibrated with a 91-day moving window. It is important to note that for precipitation, Eq. (5) was applied only on the days considered wet, i.e. with a precipitation higher than 0.1 mm.
For consistency, a threshold of 0.1 mm was also used for evaporation. As no prior examples for evaporation adjustment were available, we assumed this consistency was the best option. It is important to note that although QDM is only applied on wet days, it can still transform low-precipitation wet days into days considered dry (i.e. with a precipitation amount < 0.1 mm) if the ratio in Eq. (5) is small enough.
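A minimal sketch of QDM with empirical CDFs (the function name is ours; the 91-day moving window and the wet-day threshold described above are omitted for brevity):

```python
import numpy as np

def qdm(x_ho, x_hs, x_fs, kind="+"):
    """Quantile Delta Mapping sketch: look up each future value's
    quantile in the future-simulation CDF, then apply the
    observed-vs-simulated historical correction at that quantile."""
    x_ho, x_hs, x_fs = map(np.asarray, (x_ho, x_hs, x_fs))
    # non-exceedance probability of each future value in its own series
    tau = np.array([(x_fs <= v).mean() for v in x_fs])
    # same quantiles in the historical observed and simulated CDFs
    q_ho = np.quantile(x_ho, tau)
    q_hs = np.quantile(x_hs, tau)
    if kind == "+":   # additive: temperature
        return x_fs + q_ho - q_hs
    # multiplicative: precipitation, evaporation
    return x_fs * q_ho / q_hs
```

Because the correction is a difference (or ratio) evaluated at the future value's own quantile, the simulated change between the historical and future distributions is preserved.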

Modified Quantile Delta Mapping
Pham (2016) proposed another version of QDM, following the delta change philosophy (Olsson et al., 2009; Willems and Vrac, 2011): the trend established by the RCM is assumed to be more trustworthy than the absolute values themselves. When applying this type of method, the simulated change between the historical and the future period is applied to the observations. Thus, instead of the future simulations, the historical observations are adjusted to obtain the future 'observations'. As Johnson and Sharma (2011) mention, this workflow could be problematic for future impact assessment, as it inherits the temporal structure of the historical observations. The method is mathematically very similar to the QDM method, exchanging the roles of x^fs and x^ho. Thus, it is named 'modified Quantile Delta Mapping' (mQDM), and can for the additive case be written as

x^fa = x^ho + F_fs^{-1}(F_ho(x^ho)) - F_hs^{-1}(F_ho(x^ho)).  (6)

The ratio version is mathematically written as

x^fa = x^ho · F_fs^{-1}(F_ho(x^ho)) / F_hs^{-1}(F_ho(x^ho)).  (7)

For the implementation, the same principles were used as for the QDM method: a 91-day moving window, empirical CDFs and a threshold of 0.1 mm/day to be considered as a wet day.
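The role exchange with respect to QDM can be made explicit in code; a sketch under the same simplifications as before (no moving window; the function name is ours):

```python
import numpy as np

def mqdm(x_ho, x_hs, x_fs, kind="+"):
    """modified QDM sketch (delta change): the simulated change
    between the historical and future simulations, evaluated at each
    observation's own quantile, is applied to the observations."""
    x_ho, x_hs, x_fs = map(np.asarray, (x_ho, x_hs, x_fs))
    # non-exceedance probability of each observed value in its own series
    tau = np.array([(x_ho <= v).mean() for v in x_ho])
    q_fs = np.quantile(x_fs, tau)
    q_hs = np.quantile(x_hs, tau)
    if kind == "+":   # additive: temperature
        return x_ho + q_fs - q_hs
    # multiplicative: precipitation, evaporation
    return x_ho * q_fs / q_hs
```

Note that the output inherits the length and chronology of the historical observations, which is exactly the property Johnson and Sharma (2011) flag as problematic for future impact assessment.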

Multivariate intensity-bias-adjusting methods
The increasing number of multivariate bias-adjusting methods throughout the 2010s urges the need to classify them according to their properties. One possible classification was proposed by Vrac (2018), who distinguished the 'marginal/dependence' from the 'successive conditional' approach. The former approach separately adjusts the 1D-marginal distributions and the dependence structure and is applied in e.g. Vrac and Friederichs (2015), Cannon (2018) and Vrac (2018). These two components are then recombined to obtain data that are close to the observations for both marginal and multivariate aspects. The latter approach consists of adjusting one given variable and then adjusting a second variable conditionally on the first one; this procedure is applied successively to each variable. Examples can be found in e.g. Piani and Haerter (2012), Li et al. (2014) and Dekens et al. (2017). According to Vrac (2018), the latter approach suffers from two main limitations. First, the adjustment is performed conditionally on the previously adjusted data. However, the adjustment is often applied in bins. As a result, for each variable, the amount of data available for each bin decreases, thus decreasing the robustness of the adjustment. Second, the ordering of the variables in the successive adjustments matters. For example, Li et al. (2014) point out that their 'Joint Bias Correction for temperature' (JBCt) and 'Joint Bias Correction for precipitation' (JBCp) methods, which respectively first adjust temperature and precipitation, differ in performance. For these two reasons, Vrac (2018) favours the 'marginal/dependence' approach.

Another perspective on the multivariate bias-adjusting methods is to consider the amount of temporal adjustment that is allowed or applied by the method. This is important, as the amount of temporal adjustment is intrinsically linked with the main goal, the adjustment of the multivariate distribution of the variables. This distribution, in which the dependence is characterised by the underlying copula (Nelsen, 2006; Schölzel and Friederichs, 2008), can be estimated using the ranks. Thus, to adjust the multivariate distribution, the ranks of the climate model are replaced by those of the observations, using methods such as the 'Schaake Shuffle' (Clark et al., 2004; Vrac and Friederichs, 2015). This implies that the temporal structure and trends of the climate model will be altered, which may have a considerable impact (François et al., 2020). This impact is especially large when multiday characteristics strongly matter, such as in the hydrological application we use in this study (Addor and Seibert, 2014). Vrac (2018) mentions this necessity to modify the temporal structure and rank chronology of the simulations. Yet, he also mentions that the extent of this modification is still a matter of debate. Cannon (2016) describes this as the 'knobs' that control whether marginal distributions, inter-variable or spatial dependence structure, and temporal structure are more informed by the climate model or the observations. Thus, the choice between the temporal structure of the climate model and unbiased dependence structures is a trade-off that has to be made. Some methods, such as those by Vrac and Friederichs (2015), Mehrotra and Sharma (2016) and Nguyen et al. (2018), rely on the observations for their temporal properties, while other methods try to find the middle ground (e.g. Vrac (2018) and Cannon (2018)).
Our choice of multivariate bias-adjusting methods takes the above classification into account. The oldest method in the comparison is 'Multivariate Recursive Quantile Nesting Bias Correction' (Mehrotra and Sharma, 2016). This method completely replaces the simulated correlations by those of the observations and is a 'marginal/dependence' method according to François et al. (2020). 'Multivariate Bias Correction in n dimensions' (Cannon, 2018) is both a 'marginal/dependence' method and a method that tries to combine information from the climate model and the observations. The most recent method, 'dynamical Optimal Transport Correction' (Robin et al., 2019), differs considerably from the other two methods: it generalises the 'transfer function' principle using the 'optimal transport' paradigm (Villani, 2008), thereby defining a new category of multivariate bias-adjusting methods: the above-mentioned all-in-one approach. Although far from exhaustive, a comparison of these three methods provides a broad view of the different approaches in multivariate bias adjustment.

Multivariate Recursive Quantile Nesting Bias Correction
In 2016, Mehrotra and Sharma proposed a new multivariate bias-adjusting method, named 'Multivariate Recursive Quantile Nesting Bias Correction' (MRQNBC), based on a combination of several older methods by Johnson and Sharma (2012), Mehrotra and Sharma (2012) and Mehrotra and Sharma (2015). The underlying idea of this method is to adjust on more than one timescale, an idea that most bias-adjusting methods do not incorporate. This adjustment on multiple timescales is applied by adjusting the biases in the lag-0 and lag-1 auto- and cross-correlation coefficients, i.e. the persistence attributes, instead of focusing on the mean or the distribution.
As a first step in this method, QDM is applied separately to each variable to adjust the empirical CDFs. This is followed by a multivariate bias adjustment to adjust the lag-0 and lag-1 auto- and cross-correlation coefficients. This combination of univariate and multivariate bias-adjusting methods is applied on all time scales. For the multivariate adjustment, two models are used: a multivariate first-order autoregressive (AR(1)) model with constant parameters at the daily and yearly level, and a multivariate AR(1) model with periodic parameters (Salas, 1980) at the monthly and seasonal level. All steps are applied to the different types of data: the historical observations of temperature, evaporation and precipitation (combined in the matrix $X^{ho}$) and the quantile-mapped future simulations (combined in the matrix $X^{fa}$). Using the standardised time series, the multivariate AR(1) models for the historical and future series can be expressed as:

$$\tilde{X}^{ho}_t = C\,\tilde{X}^{ho}_{t-1} + D\,\epsilon_t \qquad \text{and} \qquad \tilde{X}^{fa}_t = E\,\tilde{X}^{fa}_{t-1} + F\,\epsilon_t,$$

with $C$ and $D$ the coefficient matrices of $\tilde{X}^{ho}_t$, $E$ and $F$ the coefficient matrices of $\tilde{X}^{fa}_t$ and $\epsilon_t$ a white noise term. The coefficient matrices are calculated using the $N \times N$ lag-0 and lag-1 cross-correlation matrices $M_0$ and $M_1$. Using the standardised time series, the elements of these matrices can be expressed as (Salas, 1980):

$$M_0(i,j) = \mathrm{E}\!\left[\tilde{x}_{t,i}\,\tilde{x}_{t,j}\right], \qquad M_1(i,j) = \mathrm{E}\!\left[\tilde{x}_{t,i}\,\tilde{x}_{t-1,j}\right],$$

with $i$ and $j$ the column numbers of $\tilde{X}_t$, referring to the variables whose correlation is calculated. This enables the calculation of $C$ and $E$ as (Matalas, 1967):

$$C = M_1 M_0^{-1},$$

and of $D$ and $F$ via

$$D D^{T} = M_0 - M_1 M_0^{-1} M_1^{T},$$

which can be solved using the eigenvalues and eigenvectors of $DD^T$ or $FF^T$:

$$D D^{T} = V\,S\,V^{T}, \qquad D = V\,S^{1/2},$$

with V the matrix of eigenvectors and S a diagonal matrix with the corresponding eigenvalues.
The multivariate bias adjustment is then implemented by removing the lag-0 and lag-1 auto- and cross-correlations from the future time series $\tilde{X}^{fa}_t$ (the matrices $E$ and $F$) and applying the observed lag-0 and lag-1 auto- and cross-correlations ($C$ and $D$) to the future time series, thus creating a modified future time series $\hat{X}^{fa}_t$. These steps are applied by first rearranging the AR(1) model of the future series for $\epsilon_t$:

$$\epsilon_t = F^{-1}\left(\tilde{X}^{fa}_t - E\,\tilde{X}^{fa}_{t-1}\right),$$

which can be combined with the observed coefficient matrices as:

$$\hat{X}^{fa}_t = C\,\hat{X}^{fa}_{t-1} + D\,F^{-1}\left(\tilde{X}^{fa}_t - E\,\tilde{X}^{fa}_{t-1}\right).$$

This model preserves the observed persistence attributes. As a last step, destandardising results in the bias-adjusted time series.

When using the multivariate AR(1) model with periodic parameters (PMAR), the parameters are derived separately for each period to allow for periodicity. In this case, the vectors $X^{ho}_{t,\tau}$ and $X^{fa}_{t,\tau}$ respectively represent the observed and quantile-mapped GCM time series. The subscript $t$ refers to the year and the subscript $\tau$ to a specific period in the year.
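The estimation of the coefficient matrices and the subsequent persistence replacement can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: the function names are ours, the input series are assumed to be already standardised, and the first time step is simply copied.

```python
import numpy as np

def fit_mar1(Z):
    """Fit a multivariate AR(1) model to a standardised series Z (T x N).

    Returns (C, D) with C = M1 M0^{-1} (Matalas, 1967) and D obtained
    from D D^T = M0 - M1 M0^{-1} M1^T via an eigendecomposition.
    """
    T = Z.shape[0]
    M0 = Z.T @ Z / T                      # lag-0 cross-correlation matrix
    M1 = Z[1:].T @ Z[:-1] / (T - 1)       # lag-1 cross-correlation matrix
    C = M1 @ np.linalg.inv(M0)
    DDt = M0 - C @ M1.T
    DDt = (DDt + DDt.T) / 2               # enforce numerical symmetry
    vals, vecs = np.linalg.eigh(DDt)
    # D = V S^{1/2}; tiny negative eigenvalues are clipped to zero
    D = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None)))
    return C, D

def replace_persistence(Z_fut, C, D, E, F):
    """Strip the simulated persistence (E, F) from Z_fut and impose the
    observed persistence (C, D), following the rearranged AR(1) relation."""
    Finv = np.linalg.inv(F)
    out = np.empty_like(Z_fut)
    out[0] = Z_fut[0]                     # initialise with the first value
    for t in range(1, len(Z_fut)):
        eps = Finv @ (Z_fut[t] - E @ Z_fut[t - 1])   # recovered noise term
        out[t] = C @ out[t - 1] + D @ eps
    return out
```

Destandardising the output and nesting it across time scales, as described above, would complete the MRQNBC step.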
The elements of the periodic version of M 0 and M 1 can be calculated as (Salas, 1980):

$$M_{0,\tau}(i,j) = \frac{1}{T_\tau}\sum_{t=1}^{T_\tau} \frac{\left(x_{t,\tau,i} - \bar{x}_{\tau,i}\right)\left(x_{t,\tau,j} - \bar{x}_{\tau,j}\right)}{s_{\tau,i}\,s_{\tau,j}}, \qquad M_{1,\tau}(i,j) = \frac{1}{T_\tau}\sum_{t=1}^{T_\tau} \frac{\left(x_{t,\tau,i} - \bar{x}_{\tau,i}\right)\left(x_{t,\tau-1,j} - \bar{x}_{\tau-1,j}\right)}{s_{\tau,i}\,s_{\tau-1,j}},$$
with $T_\tau$ the number of time steps of the period $\tau$, $\bar{x}_\tau$ and $\bar{x}_{\tau-1}$ the means of periods $\tau$ and $\tau-1$ (for instance, if $\tau$ is summer, then $\tau-1$ is spring) and $s_\tau$ and $s_{\tau-1}$ the standard deviations of periods $\tau$ and $\tau-1$. The correlation matrices are calculated in the same way as in the non-periodic steps. The only difference is that they are calculated for every period (e.g. separately for every season or month). For every time step in period $\tau$, the corresponding value can be adjusted as follows to preserve the observed persistence attributes:

$$\hat{X}^{fa}_{t,\tau} = C_\tau\,\hat{X}^{fa}_{t,\tau-1} + D_\tau\,F_\tau^{-1}\left(\tilde{X}^{fa}_{t,\tau} - E_\tau\,\tilde{X}^{fa}_{t,\tau-1}\right).$$

The different time steps are combined with the nesting method proposed in Johnson and Sharma (2012) and Mehrotra and Sharma (2015). First, QDM (as described in Section 3.2.1) is applied at a daily level, followed by MAR. These adjusted time series are then aggregated and averaged to form a monthly time series, which is adjusted by QDM, standardised and adjusted by PMAR. Note that the standardisation of the aggregated time series does not imply that the variables of a period $\tau$ of that time series have zero mean and unit variance. The results of the monthly adjustment are aggregated and averaged to form seasonal time series, which are also adjusted using QDM, standardised and adjusted by PMAR. As a last nesting step, the results are once more aggregated and averaged to build an annual time series, which is adjusted using QDM and MAR. The outcomes of all these steps are combined into a weighting factor that is used to modify the daily time series accordingly (Srikanthan and Pegram, 2009):

$$X^{fa}_{t,j,s,i} = \hat{X}^{fa}_{t,j,s,i}\;\frac{Y^{fa}_{j,s,i}}{\bar{Y}^{fa}_{j,s,i}}\;\frac{Z^{fa}_{s,i}}{\bar{Z}^{fa}_{s,i}}\;\frac{A^{fa}_{i}}{\bar{A}^{fa}_{i}},$$

with $t$ the day, $j$ the month, $s$ the season, $i$ the year, $Y^{fa}_{j,s,i}$ the monthly adjusted value, $\bar{Y}^{fa}_{j,s,i}$ the aggregated-averaged monthly value, $Z^{fa}_{s,i}$ the seasonal adjusted value, $\bar{Z}^{fa}_{s,i}$ the aggregated-averaged seasonal value, $A^{fa}_i$ the adjusted yearly value and $\bar{A}^{fa}_i$ the aggregated-averaged yearly value. The full procedure is summarised in Algorithm 2.
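The nesting step that modifies each daily value by the ratios of adjusted to aggregated-averaged values can be sketched as follows; the function name and scalar interface are our own, chosen purely for illustration.

```python
def nested_daily_value(x_daily, y_adj, y_agg, z_adj, z_agg, a_adj, a_agg):
    """Weight a daily value by the ratio of adjusted to aggregated-averaged
    values at the monthly, seasonal and yearly level (after Srikanthan and
    Pegram, 2009)."""
    return x_daily * (y_adj / y_agg) * (z_adj / z_agg) * (a_adj / a_agg)

# Example: a daily value of 2.0 with monthly, seasonal and yearly
# correction ratios of 1.5, 1.25 and 1.2 respectively.
corrected = nested_daily_value(2.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0)  # 4.5
```

In practice the ratios would be computed per month, season and year from the aggregated series, and applied to every day of the corresponding period.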

The nesting method cannot fully remove biases at all time scales; thus, Mehrotra and Sharma (2016) suggested repeating the complete procedure multiple times. However, in our case this seemed to worsen the results, so the method was run only once.

Multivariate Bias Correction in n dimensions
In 2018, Cannon proposed the 'Multivariate Bias Correction in n dimensions' (MBCn) method as a flexible multivariate bias-adjusting method (Cannon, 2018). The method's flexibility has attracted some attention, as it has already been used in multiple studies (Räty et al., 2018; Zscheischler et al., 2019; Meyer et al., 2019; François et al., 2020). This method consists of three steps.
First, the multivariate data are rotated using a randomly generated orthogonal rotation matrix, adjusted with the additive form of QDM, and rotated back until the calibration period model simulations converge to the observations. This convergence is verified on the basis of the energy distance (Rizzo and Székely, 2016). Second, the validation period simulations are adjusted using QDM, as this method preserves the simulated trends. As the last step, these adjusted time series are shuffled using the Schaake Shuffle (Clark et al., 2004) based on the rank order of the rotated dataset.
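A random orthogonal rotation matrix can be generated with a QR decomposition followed by a sign correction, in the spirit of the algorithm of Mezzadri (2007); the numpy sketch below uses our own function name and is only one possible implementation.

```python
import numpy as np

def random_orthogonal(n, rng=None):
    """Draw a Haar-distributed orthogonal n x n matrix via QR decomposition
    of a Gaussian matrix, with a sign correction on the diagonal of R
    (after Mezzadri, 2007)."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    d = np.sign(np.diag(R))
    d[d == 0] = 1.0          # guard against (improbable) zero diagonal
    return Q * d             # rescale columns so the draw is uniform on O(n)
```

Without the sign correction, the QR decomposition alone does not sample uniformly from the orthogonal group, which would bias the sequence of rotations.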
Considering the j-th iteration of the method, denoted by the subscript $[j]$, the first step consists of rotating the data sets using an $N \times N$ randomly generated orthogonal rotation matrix $R_{[j]}$. This orthogonal rotation matrix was created using the algorithm by Mezzadri (2007, pg. 597). This rotation is formulated as

$$\tilde{X}^{ho}_{[j]} = X^{ho}_{[j]} R_{[j]}, \qquad \tilde{X}^{hs}_{[j]} = X^{hs}_{[j]} R_{[j]}, \qquad \tilde{X}^{fs}_{[j]} = X^{fs}_{[j]} R_{[j]}.$$

The rotated historical simulations are adjusted towards the rotated observations with the additive form of QDM and rotated back. These steps have to be repeated until the multivariate distribution of $X^{hs}_{[j+1]}$ matches that of $X^{ho}$. The similarity is measured using the (squared) energy distance (Székely and Rizzo, 2004, 2013; Rizzo and Székely, 2016), a measure of statistical discrepancy between two multivariate distributions. For two $N$-dimensional independent random vectors $x$ and $y$ with respective CDFs $F$ and $G$, this measure is given by:

$$d^2(F,G) = 2\,\mathrm{E}\,||x - y|| - \mathrm{E}\,||x - x'|| - \mathrm{E}\,||y - y'||,$$
with $\mathrm{E}$ the expected value, $||\cdot||$ the Euclidean norm and $x$ and $x'$, and $y$ and $y'$, i.i.d. A practical way of calculating this measure is as follows (Székely and Rizzo, 2013), with $X = (X_1, \ldots, X_{n_1})$ and $Y = (Y_1, \ldots, Y_{n_2})$:

$$\hat{d}^2(X,Y) = \frac{2}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{m=1}^{n_2}||X_i - Y_m|| - \frac{1}{n_1^2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_1}||X_i - X_j|| - \frac{1}{n_2^2}\sum_{l=1}^{n_2}\sum_{m=1}^{n_2}||Y_l - Y_m||,$$

with $i$, $j$, $l$ and $m$ denoting the time steps.
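The sample version of the squared energy distance can be computed directly from pairwise Euclidean distances. A small numpy sketch (function names are ours) is:

```python
import numpy as np

def energy_distance(X, Y):
    """Squared energy distance between two multivariate samples,
    2 E||X-Y|| - E||X-X'|| - E||Y-Y'|| (Székely and Rizzo, 2013).
    X and Y are arrays of shape (n_samples, n_variables)."""
    def mean_pairwise(A, B):
        # mean Euclidean distance over all pairs of rows of A and B
        diff = A[:, None, :] - B[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()
    return 2 * mean_pairwise(X, Y) - mean_pairwise(X, X) - mean_pairwise(Y, Y)
```

The measure is zero for identical samples and grows as the two multivariate distributions drift apart, which makes it a convenient convergence criterion for the MBCn iterations.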
As a last step, the preservation of trends of QDM has to be combined with the restoration of the multivariate ranks by the transformations. To do this, first either the additive or multiplicative version of quantile delta mapping (depending on the variable) has to be applied to each variable of the original data set X fs , using X hs and X ho as historical baseline data.

As a final step, the elements of each column of $X^{fa}$ are reordered following a method known as the 'Schaake Shuffle' (Clark et al., 2004; Vrac and Friederichs, 2015). In this method, the ranks of each variable of $X^{fa}$ are swapped with the ranks of the corresponding variables of $X^{fs}_{[j+1]}$, thus reordering the time series' structure according to the ranks of the observations. The Schaake Shuffle can be mathematically formulated as follows (Clark et al., 2004). Let $X$ be a vector of $n$ time steps of a variable, and $\chi$ be the sorted vector of $X$, that is:

$$\chi = \left(x_{(1)}, x_{(2)}, \ldots, x_{(n)}\right), \qquad x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n)}.$$

Let $Y$ be a vector of $n$ historical observations and $\gamma$ be the sorted vector of $Y$:

$$\gamma = \left(y_{(1)}, y_{(2)}, \ldots, y_{(n)}\right), \qquad y_{(1)} \leq y_{(2)} \leq \ldots \leq y_{(n)}.$$
B is then the vector of indices describing the original observation number that corresponds to the values in the ordered vector γ.
The main step of the Schaake Shuffle is to reconstruct the reordered vector $X^{ss}$:

$$X^{ss} = \left(x^{ss}_1, \ldots, x^{ss}_n\right), \qquad x^{ss}_q = x_{(r)}, \quad q = B_{[r]}, \quad r = 1, \ldots, n.$$
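A compact numpy sketch of the Schaake Shuffle as defined above (function name ours): the sorted values of the series to be shuffled are placed at the positions dictated by the ranks of the template series.

```python
import numpy as np

def schaake_shuffle(x, y):
    """Reorder x so that its rank order matches that of the template y
    (after Clark et al., 2004): the k-th smallest value of x is placed
    where y has its k-th smallest value."""
    x_sorted = np.sort(x)                  # chi: ordered values of x
    ranks_y = np.argsort(np.argsort(y))    # rank of each template entry
    return x_sorted[ranks_y]
```

In MBCn, `y` would be the corresponding column of the rotated-and-adjusted dataset rather than the raw observations, and a small random perturbation would be added to break ties before computing the ranks.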

In MBCn, X fa is sorted according to X fs [j+1] instead of respectively X and Y, resulting in a final adjusted data set X fa . To account for ties in this procedure, a small random value is added before calculating the ranks of X fs [j+1] . The version of QDM applied in the step prior to the Schaake Shuffle is the same as in Section 3.2.1. As such, the shuffling procedure is the only difference with the univariate bias-adjustment, implying that differences in performance can be related to it.
The MBCn method was shown by Cannon (2018) to outperform many earlier multivariate bias-adjusting methods, such as the EC-BC (Vrac and Friederichs, 2015), the JBC (Li et al., 2014) and the MBCr and MBCp methods (Cannon, 2016). In contrast, Cannon (2018) also pointed out some problems. Depending on the number of variables, the computational cost can become too high and the convergence speed too low, or overfitting might become an issue. The first problem can be tackled by ensuring a sufficient number of time steps when many variables are adjusted. Second, to address the convergence speed, Pitié et al. (2007) suggested using a deterministic selection of rotation matrices that maximises the distance between rotation axis sets instead of randomly generating them. It is also suggested by Cannon (2018) to use the most efficient form of quantile mapping and to limit the use of an advanced quantile mapping method to the last step. Third, to avoid overfitting (and to reduce the computational cost), early stopping is also suggested by Cannon (2018) (e.g. Prechelt (1998)). As the number of variables is limited in this case, overfitting did not seem to be a problem. Yet, to reduce unnecessary computational costs, the similarity between consecutive energy distances was used as a stopping criterion. A tolerance of 0.0001 was used: if the difference between two consecutively calculated energy distances was lower, the computation was halted. The full procedure is summarised in Algorithm 3.


Dynamical Optimal Transport Correction
Recently, Robin et al. (2019) indicated that the notion of a transfer function in quantile mapping can be generalised to the theory of optimal transport. Optimal transport is a way to measure the dissimilarity between two probability distributions and to use this as a means for transforming the distributions in the most optimal way (Villani, 2008; Peyré and Cuturi, 2019).
Optimal transport was used by Robin et al. (2019) to adjust the bias of a multivariate data set in the 'dynamical Optimal Transport Correction' method (dOTC), which extends the 'CDF-transform' (CDF-t) bias-adjusting method (Michelangeli et al., 2009). dOTC calculates the optimal transport plans from $X^{ho}$ to $X^{hs}$ (the bias between the observations and the simulations) and from $X^{hs}$ to $X^{fs}$ (the evolution of the model). The combination of both optimal transport plans allows for bias adjustment while preserving the trend of the model. Optimal transport is applied on the basis of an optimal transport plan. The optimal plan between $X^{ho}$ and $X^{hs}$ is denoted as $\gamma$. The second optimal plan, between $X^{hs}$ and $X^{fs}$, is denoted as $\varphi$. The goal is to transform $\varphi$ according to $\gamma$, defining a new plan $\hat{\varphi}$. This optimal plan estimates $\tilde{X}^{fa} = \hat{\varphi}\left(X^{ho}\right)$. Finally, $X^{fs}$ is adjusted with respect to $\tilde{X}^{fa}$, creating the final adjusted $X^{fa}$. These steps are summarised in Fig. 2.

470
For the definition of the optimal plan, the 'Optimal Transport Correction' (OTC) (Robin et al., 2019) is used. First, the empirical distributions $\hat{P}_{X^{ho}}$ and $\hat{P}_{X^{hs}}$ have to be calculated. To achieve this, the subspace of $\mathbb{R}^N$ that contains the data is partitioned into regularly spaced cells, generically denoted $c^*_i$, with $N$ the number of variables of $X^{ho}$ and $X^{hs}$. Using this notation, $\hat{P}_{X^{ho}}$ and $\hat{P}_{X^{hs}}$ can be estimated using the relative frequencies $p_{X^{ho}}$ and $p_{X^{hs}}$ as:

$$p_{X^{ho},i} = \frac{1}{m}\sum_{t=1}^{m}\mathbb{1}\left(X^{ho}_t \in c^*_i\right), \qquad p_{X^{hs},j} = \frac{1}{n}\sum_{t=1}^{n}\mathbb{1}\left(X^{hs}_t \in c^*_j\right),$$

with $\mathbb{1}$ the indicator function and $m$ and $n$ the total number of time steps of respectively $X^{ho}$ and $X^{hs}$. Thus, the distributions are essentially estimated by counting the number of observations of each time series within each cell. The optimal plan $\gamma$ between $X^{ho}$ and $X^{hs}$ is described by its coefficients $\gamma_{i,j}$, the probabilities to transform an observation of $X^{ho}$ in cell $c^*_i$ into an observation of $X^{hs}$ in cell $c^*_j$, with $I$ and $J$ the total number of cells containing observations of respectively $X^{ho}$ and $X^{hs}$. Note that from here on, $c^*_i$ and $c^*_j$ denote only those cells containing observations of respectively $X^{ho}$ and $X^{hs}$ and that 'observation' is used generically for both observed and simulated time series. The coefficients are unknown, but obey the marginal properties:

$$\sum_{j=1}^{J}\gamma_{i,j} = p_{X^{ho},i}, \qquad \sum_{i=1}^{I}\gamma_{i,j} = p_{X^{hs},j}.$$

Central to the optimal transport theorem is the cost function $C$ (Villani, 2008), which can here be approximated by

$$\hat{C}_{i,j} = ||c_i - c_j||^2,$$

with $||\cdot||$ the Euclidean norm and $c_i$ and $c_j$ the centres of the cells defined above. Finding $\gamma_{i,j}$ comes down to minimising the total cost $\sum_{i,j}\gamma_{i,j}\hat{C}_{i,j}$ subject to the marginal constraints above. Here, Sinkhorn's algorithm (Cuturi, 2013) is used to find the solution to this problem and thus the optimal plan $\gamma$. Using this optimal plan, a vector of probabilities with length $J$ can be defined for each cell $c^*_i$ of $X^{ho}$, each element $j$ corresponding to the probability that an observation in that cell $c^*_i$ will be transformed into an observation in cell $c^*_j$ of $X^{hs}$.
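A minimal implementation of Sinkhorn's algorithm for obtaining an entropy-regularised optimal plan between two discrete distributions could look as follows; the function name, the regularisation strength `reg` and the fixed iteration count are our own choices for illustration, not those of Robin et al. (2019).

```python
import numpy as np

def sinkhorn_plan(p, q, C, reg=0.05, n_iter=500):
    """Entropy-regularised optimal transport plan between discrete
    distributions p (length I) and q (length J) with cost matrix C
    (I x J), via Sinkhorn iterations (Cuturi, 2013)."""
    K = np.exp(-C / reg)          # Gibbs kernel of the cost matrix
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)         # match the column marginals
        u = p / (K @ v)           # match the row marginals
    return u[:, None] * K * v[None, :]
```

The returned matrix plays the role of the coefficients γ_{i,j}: its rows sum (approximately) to p and its columns to q, and a smaller `reg` brings it closer to the unregularised optimal plan at the price of slower convergence.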
In the transformation, the vectors of probabilities are used to introduce stochasticity, by sampling from these vectors the element $j$ corresponding to cell $c^*_j$. The stochastic transformation of an observation of $X^{ho}$ into an observation of $X^{hs}$ can be repeated to create an ensemble of results. This ensemble accounts for random weather effects and can thus be considered to be more similar to the true range of observations. The optimal plan $\varphi$ can be calculated analogously. This optimal plan $\varphi$ transforms an observation of $X^{hs}$ in cell $c^*_j$ into an observation of $X^{fs}$ in cell $c^*_k$, with $c^*_k$ defined analogously to $c^*_i$ and $c^*_j$. What distinguishes dOTC from OTC is the next phase, in which $\varphi$ is transformed according to $\gamma$, resulting in $\hat{\varphi}$. This is conducted in three steps, the first being the transformation of $\varphi$ into a vector. The vector $v_{jk} := c_k - c_j$ represents the climatic trend from an observation of $X^{hs}$ in cell $c^*_j$ to an observation of $X^{fs}$ in cell $c^*_k$. The second step is the transfer according to $\gamma$. The result $\hat{\varphi}$ can be defined by translating the observations of $X^{ho}$ along their respective vectors $v_{jk}$: an observation of $\tilde{X}^{fa}$ is then given by $X^{ho}_t + v_{jk}$, with $X^{ho}_t$ the observation of $X^{ho}$ at time step $t$. However, the translation of $X^{ho}_t$ along vector $v_{jk}$ does not always define an optimal transport plan: the vector has to be adapted to $X^{ho}$, which is the third step. In this step, a matrix factor $D$ is introduced, which rescales the vector $v_{jk}$. This rescaling is actually the replacement of the scale of $X^{hs}$ by that of $X^{ho}$. A Cholesky decomposition of the covariance matrix has been proposed for this rescaling (Bárdossy and Pegram, 2012; Cannon, 2016). Denoting the covariance matrix as $\Sigma$, and the Cholesky decomposition as $\mathrm{Cho}(\Sigma)$, Robin et al. (2019) proposed to multiply $v_{jk}$ by the following matrix:

$$D = \mathrm{Cho}\left(\Sigma_{X^{ho}}\right)\,\mathrm{Cho}\left(\Sigma_{X^{hs}}\right)^{-1}.$$

Robin et al. (2019) remark that the Cholesky decomposition only exists if $\Sigma$ is symmetric and positive-definite.
Some covariance matrices, such as those of highly correlated random variables, do not have this property. $\Sigma$ must then be slightly perturbed to be positive-definite (Higham, 1988; Knol and ten Berge, 1989). It is also possible for the Cholesky decomposition to be poorly estimated if the amount of available data is small compared to the dimension. In that case, it is suggested to replace the matrix $D$ by the diagonal matrix of the standard deviations: $D = \mathrm{diag}\left(\sigma_{X^{ho}}\,\sigma^{-1}_{X^{hs}}\right)$. An observation of $\tilde{X}^{fa}$ is then given by $X^{ho}_t + D \cdot v_{jk}$. To finalise, the empirical distribution $\hat{P}_{\tilde{X}^{fa}}$ can be calculated. Using this distribution, OTC can be applied to $X^{fs}$ and $\tilde{X}^{fa}$ to generate the final adjusted $X^{fa}$. A more elaborate mathematical explanation can be found in Robin et al. (2019); a summary is given in Algorithm 4.
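The rescaling matrix and its diagonal fallback can be sketched as follows (function name ours). By construction, $D\,\Sigma_{X^{hs}}\,D^T = \Sigma_{X^{ho}}$, which is exactly the replacement of the scale of the simulations by that of the observations described above.

```python
import numpy as np

def scaling_factor(X_ho, X_hs):
    """Rescaling matrix D = Cho(Cov(X_ho)) @ Cho(Cov(X_hs))^{-1}
    (after Robin et al., 2019), falling back to the diagonal matrix
    of standard-deviation ratios when the Cholesky factor fails."""
    try:
        L_ho = np.linalg.cholesky(np.cov(X_ho, rowvar=False))
        L_hs = np.linalg.cholesky(np.cov(X_hs, rowvar=False))
        return L_ho @ np.linalg.inv(L_hs)
    except np.linalg.LinAlgError:
        # Cov matrix not positive-definite: use std-deviation ratios
        return np.diag(X_ho.std(axis=0) / X_hs.std(axis=0))
```

Each trend vector v_jk would then be premultiplied by the returned matrix before being added to the corresponding observation.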

Input:
Historical observations $X^{ho}$
Historical simulations $X^{hs}$
Future simulations $X^{fs}$
Output: Adjusted future simulations $X^{fa}$

Calculate the empirical distributions $\hat{P}_{X^{ho}}$, $\hat{P}_{X^{hs}}$ and $\hat{P}_{X^{fs}}$
Calculate the optimal plan $\gamma$ between $\hat{P}_{X^{ho}}$ and $\hat{P}_{X^{hs}}$
Calculate the optimal plan $\varphi$ between $\hat{P}_{X^{fs}}$ and $\hat{P}_{X^{hs}}$
Calculate the Cholesky factor $D$
for all $X^{ho}_t$ do
    Find the cell $c^*_i$ containing $X^{ho}_t$
    Construct the vector of probabilities $\hat{\gamma}_{X^{ho}_t} = \left(\gamma_{i,1}, \ldots, \gamma_{i,J}\right)/p_{X^{ho},i}$
    Sample $j \in \{1, \ldots, J\}$ according to the probability vector $\hat{\gamma}_{X^{ho}_t}$
    Construct the vector of probabilities $\hat{\varphi}_{X^{hs}_t} = \left(\varphi_{j,1}, \ldots, \varphi_{j,K}\right)/p_{X^{hs},j}$
    Sample $k \in \{1, \ldots, K\}$ according to the probability vector $\hat{\varphi}_{X^{hs}_t}$
    Calculate the vector $v_{jk}$
    Calculate $\tilde{X}^{fa}_t = X^{ho}_t + D \cdot v_{jk}$
end for
Calculate the empirical distribution $\hat{P}_{\tilde{X}^{fa}}$
Apply OTC between $\tilde{X}^{fa}$ and $X^{fs}$ to generate $X^{fa}$

Experimental design
Prior to all intensity-bias-adjusting methods, the thresholding occurrence-adjusting method was applied. In the intensity-bias-adjustment step, a balance was sought between randomness and computational power. Methods with randomised steps were repeated: 10 calculations were made for dOTC, and the resulting values of each index were averaged for further comparison. Biases on the indices were always calculated as raw or adjusted simulations minus observations, indicating a positive bias if the raw or adjusted simulations are larger than the observations and a negative bias if they are smaller.

Results
In this section, the results will be shown first for the R index calculations for bias change, and then for the validation indices. For the validation indices, first the indices based on the adjusted variables are discussed, followed by an elaboration on the indices based on the derived variables. As the effect on discharge is the overarching goal of this paper and the discharge indices are affected by all other indices, those will be discussed last. All observed values and biases of both raw and adjusted simulations are presented in Table A1.

Bias change
The R index values for the variable averages, standard deviations and all indices are given in Table 2. The results vary considerably depending on the variable and/or index: for P, the bias can be considered almost stationary, as only the 99.5th percentile has an R index above one. In contrast, for E, the R index values are above 1 for the middle percentiles and for the standard deviation, indicating some major changes in parts of the distribution and, consequently, the bias. For T, the mean and the lower extremes are clearly influenced, although the bias on the higher extremes does not change. The different effects on the variables are linked with the effect on the (cross-)correlations. For example, the lag-1 cross-correlation between P and E has an R index value of only 0.19, whereas the R index value for the cross-correlation between E and T is 1.20. Although the R index values are low for P, this does not imply that the R index values for the precipitation occurrence indices are low. With an R index value of 1.44, the auto-correlation bias clearly changes between both periods. However, this is not reflected by the other precipitation occurrence indices, all but one of which have R index values lower than one.

The R index values in Table 2 indicate that the bias changes between the two periods considered here (1970-1989 versus 1998-2017) might already be large enough to have an effect on the bias adjustment. As these periods are only separated by 10 years, this is an important indicator for the bias adjustment of late 21st century data, just as Chen et al. (2015) mentioned. However, it does not suffice to calculate just a few of these R index values. The results vary substantially among variables and even among the percentiles of a variable under consideration: while the 5th T percentile has an R index value of 2, the value for the 95th percentile is only 0.07. This could give an indication of why the methods perform more poorly for some of these indices.
However, purely based on these results, it is impossible to say exactly what causes the bias nonstationarity. It is possible that observed changes (Papalexiou and Montanari, 2019) are poorly captured by the models, that limiting mechanisms such as soil moisture (Bellprat et al., 2013) are poorly modelled or that natural variability influences the biases (Addor and Fischer, 2015). However, discussing this in depth is out of the scope of the present study and deserves a separate study. In what follows, we will focus on the performance of the bias-adjusting methods and whether or not there is a link with these nonstationarities.

The percentiles could also have been plotted for wet days only (e.g. days with P higher than 0.1 mm/day), but as some methods change the number of dry days after the initial thresholding step, the dry days are also included in the calculation of the indices. For most plotted indices, residual bias values below 0.5 are obtained. As seen in Section 4.1, P was one of the few variables having almost all R index values below 1. This might be linked to the generally good results for P illustrated in this section. A surprising result for P is the high RB MB value for P 99.5 for MRQNBC. This percentile is too biased for all other methods to be plotted. As the R index value was close to 1, it is possible that bias nonstationarity slightly influences the performance.

Precipitation amount
For MRQNBC, however, the combination of QDM with the focus on correlation seemingly improves the performance of this method.

Temperature
For the temperature adjustment, the RB O and RB MB values indicate that all methods result in a performance better than the raw climate simulations, except for MRQNBC (Fig. 4). In contrast to all other methods, only the residual bias values of T 90 of MRQNBC are within the area delineated by the 1-1-lines. For all other indices, the bias is worse than in the raw climate simulations, with absolute biases up to 7ºC. The results for MRQNBC are interesting, as T is the best understood variable (Shepherd, 2014) and should thus not be hard to adjust. One line of reasoning could be the implementation of model trend preservation. While trend preservation is a prominent aspect in all other methods, persistence preservation is at least as important in MRQNBC.

The trade-off between both aspects of the bias adjustment thus seems to influence the result of MRQNBC, while the other methods can more easily adapt to and adjust the simulated T.
When comparing the indices for the other methods, the results are rather similar. For the lowest T values, all methods seem unable to handle the change in bias (as seen in Table 2): the RB MB values are all higher than 1. This poor performance, combined with the high values for RB O, might imply that it is better not to adjust T and to work with the raw climate simulations. However, for the extreme T values, the absolute biases can be more than 1ºC. Thus, depending on the research goal and the R index value, it might be important to consider whether or not T should be adjusted.

For E, only a few percentiles can be plotted, indicating that the performance after adjusting the bias is generally worse than that of the raw climate simulations. The indices plotted are E 25, E 99 and E 99.5. The index E 5 also performs well, but cannot be plotted as its observed value is 0 mm. Thus, for the lowest and the highest percentiles, the bias-adjusting methods perform well, but they fail to capture the nonstationarity at the middle percentiles. These middle percentiles have high R index values: they are all greater than or equal to one. Only for dOTC is it possible to plot a percentile between E 25 and E 99: E 95. However, for dOTC, this is also the only percentile for which it is possible to plot the RB O and RB MB values (respectively 1.00 and 0.86). For all other methods and all other percentiles, both RB O and RB MB values are higher than 1. The poor performance of dOTC might be related to its trend preservation characteristics. Of all methods, it is the one most explicitly designed to follow the simulated trend. This might thus imply that the nonstationarity for E is caused by poor model performance, although this should be investigated more in depth.

Potential evaporation
The percentiles that are plotted all have a high RB O value, which is in this case caused by rather low biases to adjust. For example, E 99.5 had an observed value of 5.24 mm/day and only a bias of 0.27 mm/day, or 5%. There is consequently not much room for improvement, though the RB MB values imply that the bias adjustment could still be improved, except for E 99.5, which consistently has an RB MB value lower than 0.5.
As in Section 4.3, there are interesting differences between QDM and mQDM. In contrast to the results for temperature, mQDM performs better. For mQDM, the highest percentiles (E 99 and E 99.5) have lower RB MB values than for QDM. In this case, it thus seems better to only use the climate change signal. However, given the generally poor performance of all methods, these results should be considered with care.
Given that the percentiles with a high R index value have a larger bias than the raw simulations after adjustment, and the added value for the other percentiles with respect to observed values is low, it can be advised not to adjust E. However, similar to T, this should be evaluated on a case-by-case basis.

Correlation
When considering the correlation (Fig. 6), the methods generally perform well: most of the correlation indices can be plotted.
However, there are some differences depending on the indices under consideration and the method. The indices that can always be plotted are the lag-0 cross-correlation between P and T, the lag-1 cross-correlation between P and T, the lag-1 cross-correlation between P and E and the correlation between P and T. Except for dOTC, all methods also perform well for the lag-0 cross-correlation between P and E. Yet, for the indices that can be plotted, the RB O and RB MB values show a considerable difference among the methods. For example, the lag-1 cross-correlation between P and E has RB O values ranging from 0.29 (mQDM) to 0.69 (dOTC) and RB MB values ranging from 0.08 (mQDM) to 0.60 (dOTC). The best method varies for each index: while dOTC does not perform well for most indices, it has the best performance for the correlation between P and T. It is interesting that the performance of dOTC seems to be either very good or very poor. As dOTC is built around the idea of trend preservation and all-in-one adjustment, it is possible that the adjustment performs very well when the trend is properly modelled. Three out of the four correlations that dOTC adjusts well are based on T and P, the two variables that are well understood in the time frame under consideration. All indices that are absent or less frequently present in the plots have one thing in common: they are based on the correlation between E and one of the other variables. The indices based on E consequently perform worst, except for the aforementioned lag-1 cross-correlation between P and E.

The correlation index performance seems to be related to the results for T (Section 4.3) and E (Section 4.4): correlations of E with another variable generally perform worse, and the cross-correlation between T and E performs the worst, in line with the R index values (Table 2). Although the multivariate bias-adjusting methods are supposed to adjust the correlation, they seem to be unable to do so, as they generally have larger biases (though not on all indices) than the univariate bias-adjusting methods for the indices with the lowest residual bias values. This seems to indicate that the multivariate bias-adjusting methods, and especially MBCn and dOTC, are unable to adjust the correlation exactly because of the nonstationarity in the correlation that has to be overcome. In contrast, the univariate bias-adjusting methods neglect the adjustment of correlation and consequently do not have to overcome nonstationarity in the correlation bias. This is confirmed by the results for MRQNBC. Together with mQDM, this is the only method to have six indices with RB MB values lower than one. Besides, it is also the only multivariate bias-adjusting method to have an RB MB value for corr E,T lower than one, although it is only slightly lower. One index is absent from the MBCn and dOTC plots. The differences between those two plots are also notable. For MBCn, only the number of dry days has a very low RB MB value (0), as the number of dry days is unaffected after the thresholding, whereas the lag-1 P auto-correlation and the wet-to-dry transition probability are more biased than the raw climate simulations. For dOTC, it is the other way around: the number of dry days is more biased after the application of dOTC, and the auto-correlation and the wet-to-dry transition probability perform well, with RB MB values of 0.50 or lower.

Precipitation occurrence
Another peculiar result that can be seen in Fig. 7 is the difference in transition probabilities among the methods. What this difference implies for the time series becomes clearer in Fig. 8. Although all adjusted simulations and the observations have more short wet spells than long ones, MBCn pronounces the short wet spell lengths more than the other methods, while the probability of longer wet spell lengths is lowered in comparison with the other methods. Closest to the observations is mQDM. QDM and MRQNBC also perform well, a conclusion similar to that of Fig. 7. The difference in performance between QDM and the other methods seems to demonstrate that most strategies for retaining or adjusting the temporal structure do not perform well. MRQNBC and mQDM depend heavily on the temporal structure of the observations, and MBCn and dOTC have an important shuffling or recalculation aspect, all of which lead to less reliable results at the end of the process. The poor performance of dOTC and MBCn for the temporal structure was also discussed by François et al. (2020). As for mQDM and MRQNBC, it is notable that the temporal structure does not change much from the calibration time series to the validation time series. At least, this is suggested by their relatively good performance, which is based on using the observed time series (mQDM) and observed persistence statistics (MRQNBC).
Yet, this is no guarantee that these methods will be able to realistically adjust climate model simulations for the end of the century.
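The occurrence-based indices discussed above can be made concrete with a short sketch. The code below computes the number of dry days, the wet-to-dry transition probability, and the wet spell lengths from a daily precipitation series; the 0.1 mm wet-day threshold and the function name are illustrative assumptions, not the exact choices of this study.

```python
import numpy as np

def occurrence_indices(precip, wet_threshold=0.1):
    """Occurrence-based indices for a daily precipitation series (mm/day).

    The 0.1 mm wet-day threshold is an assumed convention."""
    wet = precip >= wet_threshold
    n_dry = int(np.sum(~wet))

    # Wet-to-dry transition probability: P(dry at t+1 | wet at t)
    wet_today = wet[:-1]
    p_wd = np.sum(wet_today & ~wet[1:]) / max(int(np.sum(wet_today)), 1)

    # Wet spell lengths: run lengths of consecutive wet days
    spells, run = [], 0
    for w in wet:
        if w:
            run += 1
        elif run:
            spells.append(run)
            run = 0
    if run:
        spells.append(run)
    return n_dry, p_wd, spells

# Tiny synthetic example: wet (1.0) and dry (0.0) days
series = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
n_dry, p_wd, spells = occurrence_indices(series)
print(n_dry, round(p_wd, 2), spells)  # → 3 0.4 [2, 1, 3]
```

On such a series, MBCn overrepresenting short wet spells would show up as a distribution of `spells` shifted towards 1.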
Figure 7 also suggests that, despite the high R index value, the P lag-1 auto-correlation is not necessarily poorly adjusted.
For QDM, this index has relatively low RB O and RB MB values. This could imply that the performance still depends on the robustness of the methods under consideration to the bias nonstationarity. Or, as the other indices illustrate, the effect of bias (non)stationarity is not as large as the effect induced by the methods themselves. An example of this is the number of dry days: though it has a low R index value, the performance varies substantially among the methods.
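The RB O and RB MB values referred to throughout can be sketched as simple ratios: the residual bias of the adjusted index relative to the observed value, and relative to the raw model bias. The formulation below is a plausible reading rather than necessarily the exact definition used in this study, with the lag-1 auto-correlation included as an example index.

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 auto-correlation of a daily series."""
    x = np.asarray(x, dtype=float)
    x0, x1 = x[:-1] - x.mean(), x[1:] - x.mean()
    return np.sum(x0 * x1) / np.sum((x - x.mean()) ** 2)

def residual_biases(idx_obs, idx_raw, idx_adj):
    """Residual bias of an adjusted index relative to the observations
    (RB O) and to the raw model bias (RB MB). RB MB < 1 means the
    adjustment reduced the bias; these definitions are assumed here."""
    rb_o = abs(idx_adj - idx_obs) / abs(idx_obs)
    rb_mb = abs(idx_adj - idx_obs) / abs(idx_raw - idx_obs)
    return rb_o, rb_mb

# Illustrative values for, say, the lag-1 auto-correlation index:
# observed 0.60, raw model 0.40, adjusted 0.55.
rb_o, rb_mb = residual_biases(idx_obs=0.60, idx_raw=0.40, idx_adj=0.55)
print(round(rb_o, 3), round(rb_mb, 3))  # → 0.083 0.25
```

Here RB MB = 0.25 would mean the adjustment removed three quarters of the raw model bias for this index.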

Discharge
All bias-adjusting methods perform better for the discharge percentiles than for most other indices (Fig. 9). Although the discharge is influenced by a combination of many effects, these appear to be small in the end result. For example, the poor performance for E (Fig. 5) is barely visible in the discharge results. Only the most biased precipitation occurrence indices, such as those of MRQNBC (P lag1 and P P10), had a larger influence on the extreme discharge values. Both indices are partly based on the occurrence of wet days, and thus indicate that wet days need to be at the correct place in the time series for extreme floods to be correctly simulated.
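A percentile-based discharge comparison such as the one summarised in Fig. 9 can be set up as a simple diagnostic. In the sketch below, the percentile levels and the synthetic gamma-distributed discharge are illustrative assumptions, not the values used in this study.

```python
import numpy as np

def percentile_bias(q_obs, q_sim, percentiles=(5, 25, 50, 75, 95, 99)):
    """Relative bias of simulated discharge percentiles vs. observations.

    The percentile levels are illustrative; the high percentiles track
    the extreme (flood) discharges discussed in the text."""
    p_obs = np.percentile(q_obs, percentiles)
    p_sim = np.percentile(q_sim, percentiles)
    return dict(zip(percentiles, (p_sim - p_obs) / p_obs))

# Synthetic example: a simulated discharge series with a 10 % wet bias
rng = np.random.default_rng(0)
q_obs = rng.gamma(shape=2.0, scale=5.0, size=5000)
q_sim = 1.1 * rng.gamma(shape=2.0, scale=5.0, size=5000)
bias = percentile_bias(q_obs, q_sim)
print({p: round(b, 2) for p, b in bias.items()})
```

Comparing such dictionaries for the raw and the adjusted forcing shows directly which part of the discharge distribution an adjustment method improves.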

Discussion
In the previous section, the results for the bias adjustment by the different methods under climate change conditions were reported. The effect of climate change on the bias was evaluated through the R index, which showed that the bias for some indices cannot be considered stationary. For some of the indices (the lower percentiles of T and especially the middle percentiles of E) the methods performed poorly, which could often be linked to the R index values. The methods clearly handle this bias nonstationarity differently. The univariate bias-adjusting methods appear far more robust: even for indices with high R values, they are sometimes able to perform very well, with low RB O and RB MB values. This good performance thus seems to imply that the more indices a bias-adjusting method directly adjusts, the more susceptible it is to problems related to bias nonstationarity. However, this does not imply that QDM and mQDM are similar: while they are almost equally good for many variables, the poorer performance of mQDM for the precipitation occurrence indices indicates that assuming that the temporal structure of the past can be used for the future might be dangerous, as Johnson and Sharma (2011) and Kerkhoff et al. (2014) already mentioned. Given that mQDM performed worse for two time periods separated by only 10 years, it is unlikely to be safe to use this method, or other delta change-based methods, for impact assessments targeting the end of the 21st century.
The results of the multivariate bias-adjusting methods are also not without nuance: though they are generally worse than the univariate bias-adjusting methods, their performance depends heavily on the variable under consideration and on the method itself. A clear example of this dependence on the variables is the contrasting performance of dOTC in adjusting T (Fig. 4) versus E (Fig. 5): the adjustment of E is much worse.
This is a reminder that, in a multivariate context, the multivariate methods are far less robust and can perform relatively well and relatively poorly at the same time for different variables. There thus seems to be an interplay between the modelling of the variables and the method of calculation. Except for P, for which the results were similar, the methods performed differently for each variable. MRQNBC performed best in the context of the temporal structure, for which it was designed (Fig. 7). For T, MBCn and dOTC performed better (Fig. 4). This could be related to their trend-preserving properties, which are more pronounced for those methods than for MRQNBC. For E (Fig. 5) and correlation (Fig. 6), dOTC displayed the most divergent results. For the former, the all-in-one and trend-preservation approach did not seem robust enough. For the latter, the performance depended heavily on the type of correlation under consideration. These results seem to imply that the difference under bias-nonstationary conditions is not clear-cut for the different types of multivariate bias-adjusting methods.
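The role of the R index in this discussion can be illustrated with a minimal sketch: it compares the raw model bias of an index in the calibration period with that in the validation period. The ratio used below is an assumed formulation, not necessarily the exact definition in this study.

```python
import numpy as np

def r_index(obs_cal, sim_cal, obs_val, sim_val, index=np.mean):
    """Bias nonstationarity of a given index between two periods.

    R near 0: the bias is roughly stationary, so a correction calibrated
    on the first period should transfer well; a large R means the bias
    has changed. This ratio is an assumed formulation."""
    bias_cal = index(sim_cal) - index(obs_cal)
    bias_val = index(sim_val) - index(obs_val)
    return abs(bias_val - bias_cal) / abs(bias_cal)

# Illustrative values: the model runs 2.0 units too warm in the
# calibration period but only 1.2 units too warm in validation.
obs_cal, sim_cal = np.array([10.0, 12.0]), np.array([12.0, 14.0])
obs_val, sim_val = np.array([11.0, 13.0]), np.array([12.2, 14.2])
print(round(r_index(obs_cal, sim_cal, obs_val, sim_val), 3))  # → 0.4
```

Passing a different `index` callable (e.g. a percentile) reproduces the per-index R values discussed for T and E.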

For the 'marginal/dependence' vs. the 'all-in-one' approach, no clear conclusions can consequently be drawn. For the amount of temporal alteration, the outcome depends on the index under consideration. MRQNBC, which replaces the simulated correlations by those of the observations, performs well for the temporal structure, but worse for many other indices. For MBCn and dOTC, the effect of the difference in temporal alteration is less distinct and other properties, such as trend preservation, seem to have more influence.

To get a better view of how these results should be interpreted, the perspective of the end user should be considered (Maraun et al., 2015; Maraun and Widmann, 2018b). We used discharge as an example, modelled with the relatively simple PDM. Although the residual bias values for all methods for E (Fig. 5) indicate a poor performance, their influence on the discharge seems to be negligible (Fig. 9). Discharge is the variable of the highest importance for hydrological impact modelling, and the results indicate that most methods are able to adjust the forcing variables sufficiently to obtain a good simulation of the discharge. However, the small differences between the methods should still be taken into account. Overall, QDM and mQDM perform best in adjusting the variables such that the discharge rates are the least biased in comparison with the observations. This is also important considering that bias adjustment can be applied for many different types of impact assessment. In other impact assessments, the differences could affect the result more than for the discharge considered here. For example, forest fires (a typical compound event, discussed in a bias adjustment context in e.g. Yang et al. (2015), Cannon (2018) and Zscheischler et al. (2019)) depend more heavily on T and E to simulate fire weather conditions. Besides such compound events, other applications are ecosystem functioning (Sippel et al., 2016), agriculture (Galmarini et al., 2019) and climate zone classification (Beck et al., 2018). In such studies the effect of bias nonstationarity can be even worse, whereas in other studies, depending more on P or other (so far) less-affected variables, the need for a bias nonstationarity-proof bias-adjusting method is less compelling.
In any case, the inability of some methods to adjust the biases under nonstationary conditions implies that a thorough assessment of possible bias nonstationarity should be made before bias adjustment. If this is not done, the risk of reporting a wrong future projection increases. Given knowledge of the bias nonstationarity, such uncertainties can be better characterised.
Returning to the discharge, it is worth discussing whether or not the adjustment of E is truly needed. On the one hand, this variable is the most affected by bias nonstationarity. On the other hand, the discharge is far less influenced by this variable than by P or the temporal structure. The discharge has therefore also been calculated for this setting with raw E, the result of which is shown in Fig. 10.

Conclusions
The goal of this paper was to assess how five bias-adjusting methods handle a climate change context with possible bias nonstationarity. Three of the methods were multivariate bias-adjusting methods: MRQNBC, MBCn and dOTC. The two other methods were univariate: one was a traditional bias-adjusting method, while the other was almost the same method, but modified according to the delta change paradigm. These univariate methods were used as a baseline against which to compare the multivariate bias-adjusting methods. The climate change context, using 1970-1989 as calibration period and 1998-2017 as validation period, allowed us to calculate the change in bias between the periods, or the extent of bias stationarity, using the R index. All methods were applied and compared using different indices, for which the residual biases relative to the observations and to the model bias were calculated.
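As a point of reference for the univariate baseline, the sketch below shows a bare-bones quantile delta mapping for an additive variable such as T: each value keeps its own quantile, the observed quantile is taken as the base, and the model's change signal at that quantile is preserved. The empirical-quantile implementation and the additive formulation are simplifying assumptions, not the exact QDM configuration of this study.

```python
import numpy as np

def qdm_additive(obs_cal, sim_cal, sim_val):
    """Quantile delta mapping for an additive variable (e.g. temperature).

    Each validation-period value keeps its own quantile; the model's
    change signal at that quantile is preserved and added to the
    observed quantile. Empirical quantiles are a simplifying assumption.
    """
    sim_val = np.asarray(sim_val, dtype=float)
    # Empirical non-exceedance probability of each validation value
    tau = (np.argsort(np.argsort(sim_val)) + 0.5) / sim_val.size
    # Model-simulated change at the same quantile (additive)
    delta = sim_val - np.quantile(sim_cal, tau)
    # Observed quantile plus preserved change signal
    return np.quantile(obs_cal, tau) + delta

# Synthetic check: a constant +3 model bias and a +1 change signal
obs_cal = np.linspace(0.0, 10.0, 101)
sim_cal = obs_cal + 3.0            # biased calibration-period model
sim_val = sim_cal + 1.0            # validation period, +1 warming
adj = qdm_additive(obs_cal, sim_cal, sim_val)
# The bias is removed while the +1 change signal is preserved:
print(round(float(np.mean(adj - obs_cal)), 2))  # → 1.0
```

The delta change variant (mQDM) inverts this logic, applying the quantile-wise change signal to the observed series instead, which is why it inherits the observed temporal structure.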

The calculated R index values differed depending on the variable and the index under consideration, but generally demonstrated that the bias of some of these indices is not stationary under climate change conditions. These changes could in some cases be clearly linked to the poor performance of the bias-adjusting methods, as for the lower percentiles of T or the middle percentiles of E. The performance was often poorer for the multivariate bias-adjusting methods, which corroborates the conclusion of Guo et al. (2020) that bias nonstationarity influences the performance of multivariate bias-adjusting methods.

Although these methods have been developed during the last few years as a means to better adjust the biases, it seems that their more complex calculations make them more vulnerable to bias nonstationarity. Thus, the univariate bias-adjusting methods, computationally less complex and not taking (potentially changing) correlations into account, seem to be more robust. Although the effective difference in climate change impact is weakened by the hydrological model we used, the univariate methods still perform best. For other types of climate change impacts, the effect of bias nonstationarity could possibly be even larger than discussed here.
The validation results could only be obtained by analysing and comparing a broad combination of indices. Considering only the mean or other standard statistics would have hidden many of the results seen here. For example, in contrast to the results for the mean, the inclusion of both high and low extremes highlighted problems with bias nonstationarity for some variables. As such, this study does not contradict earlier studies such as Maraun (2012), in which the mean-based biases were found to be rather stable. Even a broader set of indices, such as the ETCCDI indices, was not enough to clearly discern between the methods. As such, we repeat the advice of Maraun and Widmann (2018a) to use indices not directly affected by the bias-adjusting methods and to analyse the user needs before deciding upon the bias adjustment validation method. An important limitation is that we only used one GCM-RCM combination. Using a model ensemble would be more informative, but could hide a single model's poor performance. On the other hand, similar assessments could also be used to discard poorly performing models (expanding upon methods such as those used in e.g. Brunner et al. (2019) or Tokarska et al. (2020)), based on the R index (as also suggested by Maurer et al. (2013)) or on the remaining bias after adjustment.
The results for the multivariate bias-adjusting methods assessed here are in line with those of François et al. (2020), especially for the problematic adjustment of occurrence. François et al. (2020) consequently state that the different multivariate bias-adjusting methods are based on different assumptions, and thus, the end user should make well-grounded choices about the method used.

This also became clear in our assessment. However, François et al. (2020) did not study the effect of climate change and bias (non)stationarity, focusing instead on model trend preservation, or trend nonstationarity. The results presented and discussed here, such as the contrasting results of MRQNBC and dOTC, imply that whether or not trend preservation was the focus of a method can have an influence on the bias adjustment. However, it is as yet unclear how trend nonstationarity and bias nonstationarity influence each other and how the most appropriate methods can be discerned, although it has been suggested to use trend-preserving methods whenever the models can be assumed to correctly simulate the atmospheric processes (Maraun, 2016).
Although the results of this paper are critical of their use, they do not imply that multivariate bias-adjusting methods are not helpful.
Many of the methods developed during the past few years can also be used for spatial bias adjustment, in which case the locations can be used as extra variables (see e.g. Vrac (2018)). A similar set-up has not been tested here, but the study by François et al. (2020) has shown the multivariate bias-adjusting methods to be informative and robust for spatial adjustment, capturing the spatial characteristics that influence local weather the most, such as orography.
Nonetheless, the results discussed in this paper indicate that many methods, and especially the multivariate bias-adjusting methods, fail to handle climate change and its resulting bias nonstationarity correctly. As authors have mentioned before (Ehret et al., 2012; Maraun, 2016; Nahar et al., 2017), this foremost implies that climate models have to become better at modelling the future: we need to be able to trust them as fully as possible. As long as this is not the case, bias-adjusting methods have to be developed that are more robust and that are able to help us assess the future correctly. Yet, impact assessment cannot wait for new methods to be developed and/or tested: we need to prepare ourselves for the future as soon as possible. For the current generation of methods, we can state that the fewer assumptions and calculations a method needs, the more robust it is when used in a climate change context. Given this, we advise using univariate bias-adjusting methods until it becomes clearer how multivariate methods can be ensured to perform well in a climate change context.