Hydrological model performance and parameter estimation in the wavelet-domain

This paper proposes a method for rainfall-runoff model calibration and performance analysis in the wavelet-domain, by fitting the estimated wavelet-power spectrum (a representation of the time-varying frequency content of a time series) of a simulated discharge series to that of the corresponding observed time series. As discussed in this paper, calibrating hydrological models to reproduce the time-varying frequency content of the observed signal can lead to different results than parameter estimation in the time-domain. Wavelet-domain parameter estimation therefore has the potential to give new insights into model performance and to reveal model structural deficiencies. We apply the proposed method to synthetic case studies and to a real-world discharge modeling case study, and discuss how model diagnosis can benefit from an analysis in the wavelet-domain. The results show that for the real-world case study of precipitation-runoff modeling for a high alpine catchment, the calibrated discharge simulation captures the dynamics of the observed time series better than the results obtained through calibration in the time-domain. In addition, the wavelet-domain


Introduction
Most hydrological models have parameters that cannot be related to measurable catchment characteristics and therefore have to be calibrated. Classically, this calibration determines the best parameter values such that the simulations match as closely as possible one or several observed system outputs (for an overview of calibration methods, see, e.g., Gupta et al., 2005). The uncertainties inherent in the simulations of such calibrated models (e.g. Beven and Freer, 2001; Vrugt et al., 2003; Kavetski et al., 2006a) and the question of how to reduce them are subject to intense research. Current strategies include a better description and understanding of the uncertainty inherent in the involved natural processes (e.g. Zehe et al., 2005), in the observation of these processes (e.g. Nicótina et al., 2008), or in the mathematical representation of these processes (e.g. Kavetski et al., 2006b). In parallel, the question of how to increase the value of observed data through an improved extraction of its information content receives constantly growing interest (e.g. Herbst and Casper, 2008; Reusser et al., 2008; Yilmaz et al., 2008).

Correspondence to: B. Schaefli (b.schaefli@tudelft.nl)
Model parameter estimation is linked to the question of how to measure model performance with suitable objective functions. The majority of parameter estimation methods are based on objective functions defined on the residuals, i.e. the differences between the observed and the simulated time series. Most methods minimize the mean squared error, i.e. the sum of the squared residuals (see, e.g., Gupta et al., 2005). By construction, the resulting calibrated model simulation fits the individual values of the observed reference time series well. Such an approach accounts only indirectly for differences in the autocorrelation of the observed and of the simulated time series. Assuming uncorrelated Gaussian residuals and inferring their variance in a full Bayesian approach (e.g. Kavetski et al., 2006a), for example, implicitly minimizes the difference in autocorrelation between the observed and the simulated series: if one of the time series shows a strong autocorrelation, e.g. at lag 1, the residuals cannot be uncorrelated and the assumptions are violated. At least partial explicit assessment occurs in Bayesian methods that assume correlated Gaussian residuals (e.g. Montanari and Toth, 2007; Schaefli et al., 2006) or in methods using calibration objective functions that minimize temporal slope differences between two time series (e.g. Reusser et al., 2008).
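The lag-1 argument above can be made concrete with a short numerical check: a strongly autocorrelated residual series cannot satisfy the uncorrelated-Gaussian assumption. The following is a minimal sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def lag1_autocorrelation(r):
    """Sample lag-1 autocorrelation of a residual series r."""
    r = np.asarray(r, dtype=float)
    r = r - r.mean()
    return float(np.sum(r[:-1] * r[1:]) / np.sum(r * r))

rng = np.random.default_rng(42)
white = rng.normal(size=5000)      # uncorrelated residuals: lag-1 near 0
ar1 = np.empty(5000)               # AR(1) residuals: lag-1 near 0.8
ar1[0] = white[0]
for t in range(1, 5000):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]
```

If the model systematically misses the dynamics of the observed series, the residuals behave like the `ar1` series here, and the variance inferred under the uncorrelated assumption is misleading.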
B. Schaefli and E. Zehe: Hydrological model performance in the wavelet-domain

As a good simulation should mimic the dynamics underlying an observed time series, it is tempting to think that explicitly assessing how well a model reproduces the autocorrelation properties of an observed system response is a promising choice for model calibration. Keeping in mind, furthermore, that time series of hydrological signatures exhibit periodicity at different time scales, model performance measures that are based on spectral information appear rather appealing. A straightforward choice is of course the power-density spectrum of a process, which equals the Fourier transform of its autocorrelation function (Priestley, 1981). This idea is indeed not new; Whittle (1953) proposed a method for parameter estimation in the Fourier-domain, matching the theoretical power-density spectrum of the model to the estimated power-density spectrum of the process observations. The Whittle estimator has recently been applied to rainfall-runoff models by Montanari and Toth (2007).
Whittle's Fourier-domain estimator is a consistent approximation of the classical time-domain likelihood. For infinitely long time series, it will thus yield the same result as time-domain estimation (see, e.g., Hannan, 1973; Yao and Brockwell, 2006). However, as shown by Contreras-Cristán et al. (2006), it can produce unreliable estimates for non-Gaussian processes or show an important loss of efficiency if the autocorrelation of the process is high.

Objectives of this paper
The overall idea is to present a new performance measure to assess how closely the time-varying frequency content of a simulated time series matches that of the observed series. This objective function is based on a continuous wavelet transform, which yields a representation of the time-varying frequency content of an observed time series, as opposed to a Fourier transform, where the moment of occurrence of the different frequencies is not preserved. The wavelet transform is particularly useful for natural processes, such as discharge, that integrate various time-varying small-scale processes at a larger spatial scale and that thus have time-varying autocorrelation properties. This time variation of the hydrological response of an ecosystem is, of course, partly induced by the time variation of the relevant input processes, e.g. precipitation and temperature. Furthermore, the rainfall-runoff response is essentially nonlinear, including threshold behavior (Zehe et al., 2007; Blöschl and Zehe, 2005). The input frequencies are thus nonlinearly filtered by the catchment and its biotic (e.g. vegetation) and abiotic characteristics. In glaciated catchments, the real-world case study of this paper, the overall time variability is particularly pronounced since discharge is induced by a combination of rainfall, ice melt and snowmelt (see Fig. 1a).
We will give evidence that an objective function that explicitly measures how well the time-varying frequency content of an observed series is reproduced can provide important new pieces of information to the puzzle of understanding the performance and structural deficits of hydrological models. Such an objective function furthermore allows estimation of model parameters. We intend to show that such an approach represents a valuable complement to parameter estimation in the time-domain. As the suggested approach depends crucially on how similarity between estimated wavelet-power spectra is defined, this is discussed in detail in Sect. 3, after an introduction to continuous wavelet transform in Sect. 2. We then illustrate the advantages and drawbacks of parameter estimation in the wavelet-domain through simple examples and synthetic case studies, i.e. using synthetic data generated either with a simple statistical model or with a conceptual, reservoir-based rainfall-runoff transformation model (Sects. 4 and 5). Finally, we apply the wavelet-domain objective function to parameter estimation of the GSM-SOCONT model (Schaefli et al., 2005) for a highly glacierized catchment in the Swiss Alps. Based on this case study, we discuss the potential of wavelet-domain calibration and performance analysis and show how it can contribute to improving the structure of hydrological models (Sect. 5). The main conclusions and open questions are summarized in Sect. 6.
In hydrology, continuous wavelet transform has become popular in different types of applications; it is for example used to characterize river regimes and to detect how discharge is related to climate variability indices (e.g. Labat, 2005), or to qualitatively analyze how certain features of the meteorological input time series are transferred to the hydrological system output (e.g. Gaucherel, 2002; Lafrenière and Sharp, 2003; Schaefli et al., 2008). Lane (2006) was the first to use it to investigate rainfall-runoff models, namely to investigate the impact of perturbing single model parameters on the resulting hydrographs.
Even though wavelet spectral analysis has found widespread application, few papers present all the mathematical details that we judge necessary to understand this paper and to interpret the results. Therefore, the following section might seem rather detailed to the reader with a background in wavelet spectral analysis.

Continuous wavelet transform
Given a stochastic process X(t), its wavelet transform W_g[\tau, s|X(t)] at time \tau and scale s with respect to the chosen wavelet g(t) is

W_g[\tau, s|X(t)] = \frac{1}{c(s)} \int_{-\infty}^{+\infty} X(t)\, g^*\!\left(\frac{t-\tau}{s}\right) dt,

where g^* denotes the complex conjugate of g and c(s) is a normalization constant (see Sect. 2.3). For a detailed discussion of continuous wavelet transform (CWT) and, for example, the requirements on the wavelet g(t), we refer the reader to the comprehensive literature (e.g. Daubechies, 1992; Holschneider, 1998).
The choice of the wavelet g(t) depends on the type of application. In geosciences applications, the Morlet wavelet is frequently used (for a short discussion of how to choose a wavelet for hydrological applications, see Schaefli et al., 2008):

g(\theta) = \pi^{-1/4}\, e^{i \omega_0 \theta}\, e^{-\theta^2/2},

where i = \sqrt{-1} and \theta = (t-\tau)/s. The parameter \omega_0 adjusts the time/scale resolution. In the present application, we use \omega_0 = 6, a choice that has empirically been shown to work well for geosciences applications (Labat, 2005; Si and Zeleke, 2005; Torrence and Compo, 1998).
For a Morlet wavelet, the relationship between the scale s and the frequency f reads as (e.g. Holschneider, 1998):

f = \frac{\omega_0 + \sqrt{2 + \omega_0^2}}{4 \pi s}.

Therefore, for \omega_0 = 6, f \approx 1/s.
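The scale-to-frequency conversion can be checked numerically; the factor below follows the Fourier-period formula given by Torrence and Compo (1998) for the Morlet wavelet (function name is ours):

```python
import math

def morlet_fourier_factor(omega0):
    """Ratio of equivalent Fourier period to wavelet scale for a Morlet
    wavelet (Torrence and Compo, 1998): lambda/s = 4*pi / (omega0 + sqrt(2 + omega0**2))."""
    return 4.0 * math.pi / (omega0 + math.sqrt(2.0 + omega0 ** 2))

factor = morlet_fourier_factor(6.0)  # approx 1.03, hence f ~ 1/s for omega0 = 6
```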
It is important to note that the CWT transforms a time series from one to two dimensions (time and scale). This transformation re-uses the same original information several times and therefore results in a considerable amount of redundancy. The inherent correlations of a CWT, given by the reproducing kernel (e.g. Holschneider, 1998), make statistical analysis of wavelet-power spectra a non-trivial task (Maraun et al., 2007; Schaefli et al., 2008). They represent a fundamental difference to estimated Fourier power-density spectra, where neighboring frequencies are asymptotically independent.

Wavelet-power spectrum
Analogously to Fourier analysis, the wavelet-power spectrum (WPS) is defined as the wavelet transform of the autocovariance function, which for a nonstationary process X(t) can be written as (e.g. Shumway and Stoffer, 2006):

C(\eta, \ell) = E\left[\left(X(\eta) - E[X(\eta)]\right)\, \mathrm{conj}\!\left(X(\eta+\ell) - E[X(\eta+\ell)]\right)\right],

where \eta is the time argument of the autocovariance function and \ell is the lag from time \eta. E[\cdot] is the expected value and "conj" denotes here the complex conjugate (elsewhere denoted by *). For simplicity of notation, let us assume in the following that X(t) is a zero-mean process, i.e. E[X(t)] = 0 for all t. The WPS then becomes (e.g. Holschneider, 1998):

S_g[\tau, s|X(t)] = E\left[W_g[\tau, s|X(t)]\, \mathrm{conj}\!\left(W_g[\tau, s|X(t)]\right)\right].

This last equation is often written in the following short form:

S_g[\tau, s|X(t)] = E\left[\left|W_g[\tau, s|X(t)]\right|^2\right].

The exact WPS of observed or simulated processes is generally unknown; we can estimate it based on the CWT of observed process realizations (observed time series):

\hat{S}_g\left[\tau, s|x^{(m)}(t), \mu(t)\right] = \left\langle \left|W_g\left[\tau, s|x^{(m)}(t) - \mu(t)\right]\right|^2 \right\rangle,

where \hat{S}_g[\tau, s|x^{(m)}(t), \mu(t)] is an estimator of S_g[\tau, s|X(t)], \langle\cdot\rangle denotes the averaging operator, x^{(m)}(t) is a matrix containing m realizations (time series) of the process X(t), and \mu(t) is an estimator of the expected value of X(t). In practice, an estimator of the true WPS is often obtained based on a single realization x(t) = x^{(1)}(t) of length N:

\hat{S}_g[\tau, s|x(t)] = \left|W_g[\tau, s|x(t) - \hat{\mu}]\right|^2,

where the estimator \hat{\mu} is obtained as the average of the realization, i.e. \hat{\mu} = \frac{1}{N}\sum_{t=1}^{N} x(t). In analogy to Fourier analysis, this estimator \hat{S}_g[\tau, s|x(t)] is called the wavelet periodogram. It is computed at a finite number of scales between the lowest and the highest resolvable scales, which depend on the sampling time step and the number of sampled time steps. The selected scales are usually:

s_i = s_0 \cdot 2^{(i-1)/N_{voice}},

with i = 1, \ldots, N_{voice} \times N_{octave} + 1. The lowest calculated scale is s_0, corresponding to a frequency lower than or equal to the Nyquist frequency (i.e. half the sampling frequency, see, e.g., Priestley, 1981), and the highest scale is s_0 \times 2^{N_{octave}}, where N_{octave} denotes the number of octaves (i.e. powers of two) and N_{voice} the number of voices (i.e. calculated scales) per octave.
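A brute-force sketch of the single-realization wavelet periodogram on this dyadic scale grid might look as follows (a direct, slow implementation for illustration only; function names are ours, and edge effects are ignored here):

```python
import numpy as np

def morlet(theta, omega0=6.0):
    """Morlet wavelet g(theta) = pi^(-1/4) * exp(i*omega0*theta) * exp(-theta^2/2)."""
    return np.pi ** -0.25 * np.exp(1j * omega0 * theta) * np.exp(-theta ** 2 / 2.0)

def wavelet_periodogram(x, s0=2.0, n_octave=4, n_voice=8, omega0=6.0):
    """Wavelet periodogram |W|^2 on the dyadic grid s_i = s0 * 2**((i-1)/n_voice),
    i = 1..n_voice*n_octave+1, with the L2-norm preserving normalization c(s) = sqrt(s)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                  # zero-mean estimate (mu_hat subtracted)
    n = x.size
    t = np.arange(n)
    scales = s0 * 2.0 ** (np.arange(n_voice * n_octave + 1) / n_voice)
    power = np.empty((scales.size, n))
    for i, s in enumerate(scales):
        for tau in range(n):
            w = np.sum(x * np.conj(morlet((t - tau) / s, omega0))) / np.sqrt(s)
            power[i, tau] = np.abs(w) ** 2
    return scales, power
```

For a pure sinusoid of period 16, the time-averaged power away from the edges peaks near the scale s for which f ≈ 1/s, i.e. near s = 16 for ω0 = 6.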
It is important to note that wavelet analysis has the subtlety that, since neighboring points in time and scale are correlated, the wavelet periodogram looks smooth even if the fluctuations around the true wavelet-power spectrum are not smaller than for a (Fourier) periodogram (see Maraun and Kurths, 2004). Accordingly, as for the Fourier periodogram, the wavelet periodogram has to be smoothed to obtain a consistent estimator of the true wavelet-power spectrum (see, e.g., Maraun and Kurths, 2004). For the present application, which uses the difference between the wavelet periodograms of two time series for model calibration, the consistency of the estimator is, however, of no relevance.

Normalization of the wavelet transform
The choice of the normalization constant in Eq. (1) is not uniquely determined: it can in principle be chosen arbitrarily (Kaiser, 1994, p. 62) and, just as in Fourier analysis, different conventions are in use.
To compute the wavelet-power-based performance criteria, we use the L^2-norm preserving normalization c(s) = \sqrt{s}, which ensures that the wavelet has the same energy at every scale (Kaiser, 1994, p. 63):

\int_{-\infty}^{+\infty} \frac{1}{s} \left| g\!\left(\frac{t-\tau}{s}\right) \right|^2 dt = \int_{-\infty}^{+\infty} |g(t)|^2\, dt \quad \text{for all } s.

Wavelet periodogram-based performance assessment

Visual inspection
Wavelet periodograms (i.e. estimated wavelet-power spectra) have the potential to efficiently distinguish between time series that seem similar in the time-domain but that have a (locally) different frequency content and thus locally different autocorrelation properties. We illustrate this potential based on the wavelet periodograms of the simulated and of the observed daily discharge series of the Rhone River at Gletsch. Further details on these time series are given in Sect. 4. The simulated series appears to be very similar to the observed series (Fig. 1a), with a linear correlation of 0.97 or, using the classical hydrological performance criterion proposed by Nash and Sutcliffe (1970), a Nash value of 0.94 (see Sect. 4).
The visual interpretation of the 2-D wavelet periodograms (Fig. 2a and b) is difficult and error-prone (Maraun and Kurths, 2004; Maraun et al., 2007). Therefore, Fig. 2c and d show the so-called wavelet bands, which correspond to the scale-averaged wavelet power over given ranges of scales, where the scale-averaged power is defined as the scale-weighted sum of the wavelet power (Torrence and Compo, 1998). The bands are normalized by the variance of the time series. As Fig. 2c and d in conjunction with Fig. 1a illustrate, such a plot reveals differences that are not readily seen in the original data. We see namely that for the band of scales of 64 days to 128 days, the calibrated model does not correctly reproduce the dynamics. These differences would also be visible in a detailed inspection of the time series, by comparing the weekly, monthly and seasonal statistics or by subtracting the series from each other and analyzing the obtained "residuals". However, inspecting wavelet bands provides several views of the signal at the same time and has the main advantage of yielding a rapid overview of the differences.
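Such a band can be obtained from a wavelet periodogram by scale-weighted summation, as in Torrence and Compo (1998), up to a constant factor; a minimal sketch (function name is ours):

```python
import numpy as np

def scale_band(power, scales, s_lo, s_hi):
    """Scale-averaged wavelet power over the band [s_lo, s_hi]: the
    scale-weighted sum of power[j, tau] / scales[j] over the scales
    falling in the band, one value per time step (up to a constant
    factor, cf. Torrence and Compo, 1998)."""
    scales = np.asarray(scales, dtype=float)
    mask = (scales >= s_lo) & (scales <= s_hi)
    return (power[mask] / scales[mask, None]).sum(axis=0)
```

Dividing the result by the series variance gives the normalized bands shown in Fig. 2c and d.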

Wavelet performance measure
We propose a method for the estimation of model parameters in the wavelet-domain.It is based on the following hypotheses: (i) the dynamics of two processes are similar if their time-varying autocorrelation properties are similar; (ii) these autocorrelation properties can be estimated based on the wavelet periodograms of process realizations.For model calibration, this translates into the assumption that the more similar the wavelet periodograms of a simulated and an observed time series are, the better the model mimics the behavior of the natural system.
Quantifying the similarity between the wavelet periodograms requires the definition of a distance metric that measures how different the periodograms are at a given time step. The overall distance of the two periodograms can then be expressed as the mean distance over all time steps. The choice of this distance metric has to take into account the fact that in the wavelet periodogram neighboring scales and neighboring time steps are correlated.

Hydrol. Earth Syst. Sci., 13, 1921-1936, 2009
We use a metric similar to the Kolmogorov-Smirnov distance, which is classically used to measure the distance between the probability distributions of two samples and which equals the maximum distance between their cumulative distribution functions. This metric is particularly useful to measure whether, at a given time step t, the power of the observed and of the simulated wavelet periodogram is similarly distributed over all scales: it is sensitive to the shape of the power distribution over the scales but, compared to a squared-error-based metric, it is much less sensitive to slight shifts in peaks and to the chosen normalization constant in Eq. (1).

The cumulative wavelet periodogram is defined as

\hat{C}_g[\tau, s|x(t)] = \sum_{u=s_0}^{s} \hat{S}_g[\tau, u|x(t)],

where s = s_0, \ldots, s_{max}(\tau). s_{max} is the maximum scale analyzed at each time step. This maximum scale varies from time step to time step because of edge effects. In CWT, the area of the wavelet periodogram that is influenced by edge effects is called the "cone of influence". In the present paper, we exclude edge effects by fixing a cone of influence that equals the e-folding time of the wavelet, which is defined as the time at which the wavelet power drops to 1/e^2 and which is a measure of the wavelet width at a given scale. For a Morlet wavelet, this equals \sqrt{2}\, s (for a discussion, see Torrence and Compo, 1998).
The Kolmogorov-Smirnov distance D_g[\tau|x(t), y(t)] at time step \tau between two process realizations x(t) and y(t) becomes

D_g[\tau|x(t), y(t)] = \max_{s} \left| F_g[\tau, s|x(t)] - F_g[\tau, s|y(t)] \right|,

where

F_g[\tau, s|x(t)] = \frac{\hat{C}_g[\tau, s|x(t)]}{\hat{C}_g[\tau, s_{max}(\tau)|x(t)]}

is the normalized cumulative wavelet periodogram of the process realization x(t) at time step \tau.
A good simulation should have a wavelet periodogram that fits the periodogram of the observed series at all time steps. Accordingly, the overall wavelet periodogram efficiency criterion R_W averages D_g[\tau|x(t), y(t)] over all time steps. For an observed time series y(t) and the corresponding simulated series x(t|\varphi), this becomes

R_W[\varphi|x(t|\varphi), y(t)] = \frac{1}{N} \sum_{\tau=1}^{N} D_g[\tau|x(t|\varphi), y(t)],

where N is the total number of time steps and \varphi a vector containing all model parameters. R_W takes values between 0 and 1 and has to be minimized during calibration.
In order to be applicable to parameter estimation, a distance metric has to fulfill formal requirements (e.g. Weisstein, 2008). For some general distance metric D(A, B) between A and B, it has to hold that (i) D(A, B) ≥ 0 for all A, B; (ii) D(A, B) = D(B, A); (iii) D(A, B) = 0 if and only if A = B; and (iv) D(A, C) ≤ D(A, B) + D(B, C). For the wavelet-based distance measure D_g[\tau|x(t), y(t)], it follows directly from its definition that conditions (i) and (ii) hold. Condition (iv), the so-called triangle inequality, holds since the maximum distance between two monotonically increasing functions between 0 and 1 can never be bigger than the sum of the maximum distances between these two functions and a third function. Since R_W results from a simple average of D_g[\tau|x(t), y(t)] over all \tau, conditions (i), (ii) and (iv) also hold for R_W. Condition (iii) does not necessarily hold for D_g[\tau|x(t), y(t)]: two process realizations x(t) and y(t) could, in theory, have locally exactly the same distribution of wavelet power without x(t) = y(t) for all t. However, R_W = 0 if and only if D_g[\tau|x(t), y(t)] = 0 for all \tau, which implies x(t) = y(t). We conclude that R_W satisfies the formal conditions of a distance metric.
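Given the two wavelet periodograms as scale-by-time arrays, R_W can be sketched in a few lines (function names are ours; the per-time-step cone-of-influence truncation is omitted for brevity):

```python
import numpy as np

def wavelet_ks_efficiency(power_sim, power_obs):
    """R_W sketch: for each time step, normalize the cumulative wavelet
    power over scales to [0, 1], take the maximum absolute difference
    between the two cumulative curves (a Kolmogorov-Smirnov-type
    distance), then average over all time steps. 0 means the power is
    identically distributed over scales at every time step."""
    def cumdist(p):
        c = np.cumsum(p, axis=0)   # cumulate over the scale axis
        return c / c[-1]           # normalize to 1 at the largest scale
    d = np.abs(cumdist(power_sim) - cumdist(power_obs)).max(axis=0)
    return float(d.mean())
```

Two series with identical periodograms yield 0; concentrating all power at the smallest versus the largest scale yields the maximum value of 1.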
R_W measures whether the wavelet-power content at every time step is distributed similarly in a simulated series and a reference series, i.e. it measures differences in the autocorrelation properties at a given time step. Accordingly, it does not explicitly measure differences in the mean or in the variance of two time series.
As in every parameter estimation procedure, preserving the mean, or in physical terms the mass balance, is a very important criterion to accept or reject simulations and the underlying model. Traditional time-domain calibration ensures preservation of the mean either through the assumptions on the residual distribution (e.g. zero-mean Gaussian residuals, Kavetski et al., 2006b) or through explicit exclusion of parameter sets leading to a too high bias between the observation and the simulation (see, e.g., Montanari and Toth, 2007). We retain this last solution by deteriorating the wavelet performance criterion R_W of a given simulation if its bias exceeds a certain tolerance factor. The exact value of this tolerance factor has to be fixed empirically. For perfect model situations, where the true (and hence unbiased) simulation exists, the tolerance factor does not affect the best identified parameter set but restrains the search space. For real-world applications, this restriction of the search space might influence the best identified parameter set since it excludes non-mass-conservative, i.e. physically meaningless, parameter sets.
In the present study, we use a tolerance factor of 10% for all case studies. For the real-world application, this choice is in line with the semi-automatic calibration method suggested by Schaefli et al. (2005). In general, the value of the tolerance factor should reflect the available information about the observational uncertainties of the different terms of the water balance. The exact penalization procedure based on this tolerance factor is discussed in Sect. 4.3.3.
For general stochastic processes with stationary mean, variance and autocorrelation properties, these are a priori unrelated properties and a good process model should preserve them.Discharge processes, on which we focus in the present paper, have a time-variable mean, variance and autocorrelation (see an example in Fig. 1).Since discharge results from a time-variable combination of different hydrological processes (infiltration, snowmelt etc.), these statistical properties are strongly related.For such processes, as our empirical results show, preserving the mean and the time-varying autocorrelation properties ensures the preservation of the process variance.
We would like to add here that, in the statistics literature, wavelet-based estimators were proposed in the 1990s to estimate long-memory parameters (see Velasco, 1999, p. 107), but their statistical properties have been analyzed only recently (e.g. Moulines et al., 2008). As the corresponding estimation problems are very different from the scope of the present paper, we do not discuss them here.
Experiment 2 uses the classical model structure, whereas experiment 3 is based on a slightly modified model version including a time-varying parameter. Details about all synthetic experiments are given hereafter. Additional illustrative toy examples as well as a synthetic case study with the well-known HYMOD model (e.g. Schaefli and Gupta, 2007) can be found in Schaefli and Zehe (2009).

Input time series
The synthetic experiments have been designed to illustrate the differences between parameter estimation in the time-domain and in the wavelet spectral domain. We therefore use as external forcing a nonstationary precipitation series, obtained by joining two precipitation series that have different statistical properties. To have a realistic situation, these two individual series are surrogate series generated based on the precipitation series observed between 1903 and 1999 at the station Bourg St. Pierre, located in the Southern Swiss Alps (1620 m a.s.l., 7.21° E, 45.95° N), which is also used for the real-world case study. The precipitation in this area is known to have undergone a substantial modification over the last century (Frei and Schär, 2001; Schmidli and Frei, 2005). The generation of the nonstationary rainfall series involves the following steps: (i) Generate a 250-day surrogate series based on the first 20 years of observed precipitation. The surrogate series is generated using the so-called Iterative Amplitude Adjusted Fourier Transform (IAAFT) algorithm (Schreiber and Schmitz, 2000). This is a classical method to obtain surrogates by first taking the Fourier transform of a time series, replacing the phases by randomly drawn phases and then completing the inverse Fourier transform. (ii) Generate a 250-day surrogate series based on the last 20 years of observed precipitation. (iii) Concatenate both series.
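The IAAFT step can be sketched as follows, after Schreiber and Schmitz (2000): alternately impose the Fourier amplitudes of the original series and remap the values to the original distribution by rank ordering. This is a simplified sketch without a convergence check; the function name is ours:

```python
import numpy as np

def iaaft(x, n_iter=100, seed=0):
    """Minimal IAAFT surrogate sketch (Schreiber and Schmitz, 2000):
    iterate between imposing the original Fourier amplitudes and
    restoring the original value distribution by rank-order remapping."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    amplitudes = np.abs(np.fft.rfft(x))   # target power spectrum
    sorted_x = np.sort(x)                 # target value distribution
    s = rng.permutation(x)                # start from a random shuffle
    for _ in range(n_iter):
        phases = np.angle(np.fft.rfft(s))
        s = np.fft.irfft(amplitudes * np.exp(1j * phases), n=x.size)
        s = sorted_x[np.argsort(np.argsort(s))]  # rank-order remap
    return s
```

Because the last step is the rank-order remap, the surrogate has exactly the same value distribution as the original, while its power spectrum approximates the original one.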
For experiments 2 and 3, we generate a longer series (10 years) and use the first half for calibration and the second half for validation.

Output time series
For experiment 1, the ARMAX process used is of the general form

y(t) = \sum_{i} a_i\, y(t-i) + \sum_{j} b_j\, z(t - n_k - j + 1) + e(t),

where t is the time step, z(t) is the input variable (in our case precipitation), and n_k is the delay parameter. The resulting series is perturbed with uncorrelated Gaussian noise having zero mean and standard deviation 0.4, corresponding to 25% of the standard deviation of the generated y(t).
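For illustration, a generic simulator of this form can be written as follows; the coefficients and delay used in the test below are placeholders, not the values of the paper's Eq. (14):

```python
import numpy as np

def simulate_armax(z, a, b, n_k, noise_std=0.0, seed=0):
    """Simulate y(t) = sum_i a[i]*y(t-i-1) + sum_j b[j]*z(t-n_k-j) + e(t),
    a generic ARMAX-type response; coefficients are placeholders, not the
    values of the paper's Eq. (14). e(t) is optional Gaussian noise."""
    rng = np.random.default_rng(seed)
    n = len(z)
    y = np.zeros(n)
    e = rng.normal(0.0, noise_std, n)
    for t in range(n):
        ar = sum(a[i] * y[t - i - 1] for i in range(len(a)) if t - i - 1 >= 0)
        ex = sum(b[j] * z[t - n_k - j] for j in range(len(b)) if t - n_k - j >= 0)
        y[t] = ar + ex + e[t]
    return y
```

With a unit impulse as input, a single exogenous coefficient and delay n_k simply shift and scale the impulse, and the autoregressive part produces the exponential recession that follows.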
Experiments 2 and 3 are based on a reference discharge series simulated with the hydrological model GSM-SOCONT (Schaefli et al., 2005) (see also Sect. 4), using the same precipitation series as in experiment 1. We assume that there is no glacier cover and use a temperature time series corresponding to a lowland station (the station Grono, 380 m a.s.l., 9.15° E, 46.25° N). This makes the discharge time series explicitly distinct from the real-world case study (see Sect. 4.2); in particular, the annual cycle of the discharge is less pronounced (see Fig. 1b).
For experiment 3, we generate a reference series with GSM-SOCONT having a time-variable snowmelt parameter (see Table 4) and then calibrate the model with a constant snowmelt parameter on this reference series.This experiment illustrates a typical example of model misspecification.
For both experiments 2 and 3, the synthetic realizations are perturbed by adding white noise before the parameter calibration process (see results section for details).

Real-world case study
For the real-world case study, we use the GSM-SOCONT model (Schaefli et al., 2005), a conceptual precipitation-runoff transformation model for high mountainous catchments with an ice-melt component. The discharge is simulated separately for the glacierized and the non-glacierized part of the catchment and, within each part, separately for 5 elevation bands. We apply it to a gauging station of the Rhone River located in Gletsch, in the Southern Swiss Alps (8.36° E, 46.56° N). This catchment is highly glacierized (around 50% of the surface is covered by glaciers) and has a mean altitude of around 2700 m. For a more detailed description and the meteorological input time series used, see Schaefli et al. (2005). We use the period 1981 to 1990 for calibration and 1991 to 1999 for validation. The meteorological conditions during these two periods were quite different. During the first period, there was in particular quite extensive snowfall (during this period, the number of advancing glaciers in the Swiss Alps was much higher than during the 1990s, e.g. Herren et al., 2002). As a result, the hydrological regimes of these two periods are quite distinct: the first period has its maximum mean monthly discharge in July, the second period in August.

Reference performance criteria
For comparison purposes, we use the classical squared-error-based Nash-Sutcliffe efficiency measure (Nash and Sutcliffe, 1970), called hereafter the Nash value:

L_N(\varphi) = 1 - \frac{\sum_{t=1}^{N} \left( y(t) - x(t|\varphi) \right)^2}{\sum_{t=1}^{N} \left( y(t) - \bar{y} \right)^2},

where y(t) is the observed discharge at time step t, x(t|\varphi) is the simulated discharge given parameter set \varphi, \bar{y} is the mean observed discharge, and N the number of observed and simulated time steps.
We define a Nash-based performance measure to be minimized as

R_N(\varphi) = 1 - L_N(\varphi).

For the synthetic case studies, where the (exact) best model parameter set exists, we also use a Fourier-domain performance measure based on the Whittle likelihood, computed according to Montanari and Toth (2007) as:

\log L_F = -\sum_{j} \left[ \log\left( J_x(\lambda_j|\varphi) + f_e(\lambda_j|\varphi) \right) + \frac{J_y(\lambda_j)}{J_x(\lambda_j|\varphi) + f_e(\lambda_j|\varphi)} \right], \quad (17)

where \lambda_j = 2\pi j/N are the Fourier frequencies, J_x is the periodogram of the simulated discharge time series and J_y the periodogram of the observed discharge time series. f_e is the Fourier-power spectrum of the modeling error (for details, refer to Montanari and Toth, 2007). We define the performance measure R_F as

R_F\left[\varphi|J_x(\lambda|\varphi), J_y(\lambda)\right] = -\log L_F\left[\varphi|J_x(\lambda|\varphi), J_y(\lambda)\right], \quad (18)

which has to be minimized.
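The two time-domain criteria are straightforward to compute; a minimal sketch (function names are ours):

```python
import numpy as np

def nash_sutcliffe(y_obs, y_sim):
    """Nash-Sutcliffe efficiency L_N = 1 - sum((y-x)^2) / sum((y-mean(y))^2).
    1 is a perfect fit; 0 means the model does no better than the
    observed mean used as a constant predictor."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_sim = np.asarray(y_sim, dtype=float)
    num = np.sum((y_obs - y_sim) ** 2)
    den = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - num / den

def r_n(y_obs, y_sim):
    """Criterion to be minimized: R_N = 1 - L_N."""
    return 1.0 - nash_sutcliffe(y_obs, y_sim)
```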

Search algorithm
We use a global optimization algorithm for model calibration. The range of possible parameter values is fixed based on a priori information. The optimizer used is a multi-objective evolutionary algorithm called Queueing Multi-Objective Optimiser (QMOO), developed by Leyland (2002) for energy system design. For an application of this optimizer to hydrology, see Schaefli et al. (2004) and Schaefli (2005).
The algorithm has been designed to identify difficult-to-find optima and to solve far more complex problems than the ones presented here, involving many more decision variables (parameters to identify) (see Leyland, 2002). We therefore assume that all identified parameter sets correspond to the best identifiable solutions of the optimization problem. The stopping criterion for the search algorithm is fixed as follows: we assume that the algorithm has converged to the optimum solution if the objective function value of the best found solution does not vary by more than 5‰ between two successively identified best solutions.

Penalization
As discussed in Sect. 3.2, we penalize solutions (parameter sets) that lead to a large bias between the observed and the simulated time series. The penalization is applied to the objective function value R_k, k ∈ {W, F, N} (to be minimized), as a function of the relative bias B between the observed and the simulated time series, computed as

B = \frac{\sum_{t=1}^{N} x(t|\varphi) - \sum_{t=1}^{N} y(t)}{\sum_{t=1}^{N} y(t)}.

This penalization has been chosen because, for all used performance criteria, B and R_k have the same order of magnitude for good solutions. The penalization should not be too strong for low biases because this would hinder the optimization algorithm from exploring the parameter space.
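Since the exact penalization formula is not reproduced in the text above, the following is only a hypothetical sketch of the scheme it describes: within the bias tolerance the criterion is returned unchanged, and beyond it the criterion is deteriorated in proportion to the excess bias:

```python
def penalized(r_k, bias, tol=0.10):
    """Hypothetical penalization sketch (not the paper's exact formula):
    leave the criterion R_k unchanged while |bias| <= tol, and add the
    excess |bias| - tol beyond the tolerance, so the penalty grows
    smoothly and stays of the same order of magnitude as R_k."""
    excess = max(0.0, abs(bias) - tol)
    return r_k + excess
```

A smooth penalty of this kind avoids hard rejection near the tolerance boundary, which would otherwise hinder the optimizer's exploration of the parameter space.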

Experiment 1
The parameter ranges used as search space for model calibration as well as the identified best parameter sets under each performance criterion are given in Table 1.
For the perturbed reference series for which the results are reported here, none of the performance criteria leads to an exact recovery of the ARMAX parameters. For the specific realization of white noise, there is a parameter set that fits the signal better in a least-squares sense (Table 1). As expected, for this theoretical Gaussian case with uncorrelated errors, the solutions in the time-domain and in the Fourier-domain are equivalent. The best parameter set under R W is different; b 2 even has the wrong sign. In fact, adding Gaussian white noise adds power at all scales (recall that the Fourier power-density spectrum of Gaussian white noise is constant and equals its variance; see, e.g. Priestley, 1981). This induces an offset between the wavelet-power spectrum of the perturbed reference series and that of the exact series. As a result, for the model of Eq. (14), there is a parameter set with a closer match to the wavelet-power spectrum of the perturbed reference series. This effect becomes even more important if we apply a stronger error (results not shown). In the unperturbed case, R W enables an exact recovery of the true parameter set.
The convergence criterion was reached for all ARMAX experiments between 3500 and 4000 model evaluations. There is no significant difference between the performance criteria in this respect. Another interesting question is whether one criterion needs a longer time series to converge efficiently. For all criteria, convergence is slowed down if the length of the calibration time series is reduced; for R W, this slowdown is more pronounced because the data length limits the number of resolvable scales. For this case study, below 50 data points, the convergence time becomes prohibitive (more than 10 000 evaluations).

Table 1. Exact parameter values of the ARMAX process, intervals delimiting the search space for parameter estimation and the identified best parameter sets under R W, R F and R N (columns 5-7). For each parameter set, the values of the performance criteria are given (instead of R N, the more familiar L N = 1 - R N is given). The criteria values listed under "exact" are calculated between the unperturbed ("unpert") original series and the perturbed ("pert") series; other abbreviations: corr: linear correlation; min: best possible criterion value; max: worst possible criterion value; inf: no absolute reference value; in bold: the best performance of each row.
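The effect described above, namely that white noise adds power on the order of its variance at all frequencies, can be verified numerically. The following NumPy sketch (the toy signal and noise level are arbitrary choices, not data from the paper) compares the power-density spectrum of a signal before and after perturbation:

```python
import numpy as np

rng = np.random.default_rng(42)

n = 4096
t = np.arange(n)
# A toy signal with two dominant frequencies.
signal = np.sin(2 * np.pi * t / 64) + 0.5 * np.sin(2 * np.pi * t / 16)

sigma = 0.5  # standard deviation of the added Gaussian white noise
noise = rng.normal(0.0, sigma, n)

def power_density(x):
    """One-sided periodogram estimate: |FFT|^2 normalized by the series length."""
    return np.abs(np.fft.rfft(x)) ** 2 / len(x)

p_clean = power_density(signal)
p_noisy = power_density(signal + noise)

# Averaged over the frequency bins, the offset approaches the noise variance.
offset = np.mean(p_noisy - p_clean)
print(f"mean spectral offset: {offset:.3f}  (noise variance: {sigma**2:.3f})")
```

The mean offset fluctuates around sigma squared for any single noise realization, which is exactly the offset between the perturbed and exact spectra discussed above.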

Experiment 2
The parameter set used for the generation of the synthetic reference discharge series is given in Table 3 and a zoom on the time series is shown in Fig. 1b. This exact series is perturbed with Gaussian white noise having zero mean and a standard deviation of 0.44 (corresponding to 25% of the standard deviation of the exact series). The parameter ranges used as search space for calibration are given in Table 2 and the identified best parameter sets under each performance criterion are given in Table 3.
For the case where the perfect model exists but the series is perturbed, R W as well as the other criteria recover the exact value of the most sensitive parameter, the degree-day factor for snowmelt (for details about parameter sensitivity, see Schaefli et al., 2005). For the three least sensitive parameters, the identified values are further from the true values than for a calibration under R N or R F. The performance difference between the identified best simulations under the three calibration criteria is, however, hardly detectable. There is, nonetheless, an interesting difference: the optimum parameter values under the two frequency-domain criteria are clearly much better defined than under R N (Fig. 3a and b). This holds in particular for the least sensitive parameter, the nonlinear direct runoff parameter β. It is noteworthy, however, that this does not indicate a better identifiability in the wavelet-domain in general but depends on the chosen formulation of the time-domain objective function (for a discussion of the shape of time-domain objective functions, refer to Beven and Freer, 2001).
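The degree-day melt relation at the core of this experiment can be stated in a few lines. A minimal sketch (the function name and the critical-temperature default are illustrative, not taken from GSM-SOCONT):

```python
def degree_day_melt(temperature, a_snow, t_crit=0.0):
    """Potential daily snowmelt (mm/day) from the classical degree-day relation:
    melt = a_snow * max(T - t_crit, 0), with a_snow in mm/(degC day)."""
    return a_snow * max(temperature - t_crit, 0.0)

# Example: with a degree-day factor of 5 mm/(degC day),
# a day with a mean temperature of 3 degC yields 15 mm of potential melt.
print(degree_day_melt(3.0, 5.0))   # 15.0
print(degree_day_melt(-2.0, 5.0))  # 0.0 (no melt below the critical temperature)
```

Because discharge during the melt season is nearly proportional to this single factor, it is plausible that all criteria identify it well while the less sensitive parameters remain poorly constrained.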

Experiment 3
If we try to estimate the model parameters on a reference series that was generated with a different model structure, i.e. with a variable degree-day factor for snowmelt, the performances under R N and R W are very close, but each criterion leads to a distinct solution for the constant degree-day factor (Table 4), both of which yield good simulations. The two solutions are hardly distinguishable based on the performance measures used (Table 4) and are very close to the generated reference series (see Fig. 4). A look at the average wavelet-power over certain ranges of scales, however, clearly shows that the simulations having a constant degree-day factor do not reproduce the true dynamics (Fig. 4), neither for the best parameter set in the time-domain nor for the best parameter set in the wavelet-domain.
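Once a wavelet-power estimate is available, the band-averaged diagnostic used in Fig. 4 reduces to an average over the scale axis. A minimal sketch (array names and shapes are assumptions, not the authors' code):

```python
import numpy as np

def band_average_power(power, scales, s_min, s_max):
    """Average wavelet power over the scales falling in [s_min, s_max].

    power  : array of shape (n_scales, n_times), estimated wavelet power
    scales : array of the n_scales analyzed scales (e.g. in days)
    Returns a time series of length n_times.
    """
    mask = (scales >= s_min) & (scales <= s_max)
    return power[mask].mean(axis=0)

# Toy example: 3 scales, 4 time steps; averaging the 16- and 64-day rows.
scales = np.array([8.0, 16.0, 64.0])
power = np.array([[1.0, 1.0, 1.0, 1.0],
                  [2.0, 2.0, 2.0, 2.0],
                  [4.0, 4.0, 4.0, 4.0]])
print(band_average_power(power, scales, 16, 64))   # [3. 3. 3. 3.]
```

Comparing such band averages for observed and simulated series, as in Fig. 4, isolates the dynamics at the temporal scales of interest instead of aggregating over the whole spectrum.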

Parameter estimation in the wavelet-domain versus in the time-domain
There is a certain trade-off between the time-domain and the wavelet-domain objective functions. The optimum for R N does not correspond to the optimum for R W (Table 5). It is noteworthy that for this case study, the apparently high Nash values (0.94 for the best simulation under R N, 0.91 under R W) do not necessarily mean that the hydrological model does a particularly good job; high Nash values are easy to achieve for time series with a strong annual cycle (see Schaefli and Gupta, 2007). At first glance, the optimal parameter sets do not seem to be fundamentally different under R N and under R W (see Table 5). A closer look, however, reveals some notable differences. Even if physically meaningless parameter sets are penalized, the optimization under R N leads to a global optimal solution where the degree-day factor for ice is smaller than for snow. This is physically highly questionable (e.g. Hock, 2003; Schaefli et al., 2005). The global optimal solution in the wavelet-domain respects this physical constraint. In addition, the parameters have (as for the synthetic case study) a better identified optimum under R W than under R N, especially for the parameters with the lowest sensitivity, the soil transfer parameters (Fig. 3c and d).
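For reference, the Nash values quoted above can be computed as sketched below, assuming the common formulation of the Nash-Sutcliffe efficiency (L N = 1 - R N, with R N the sum of squared residuals normalized by the variance of the observations); the exact definition used in the paper is not restated in this section:

```python
import numpy as np

def nash_sutcliffe(obs, sim):
    """Nash-Sutcliffe efficiency L_N = 1 - R_N, where R_N is the sum of squared
    residuals divided by the sum of squared deviations of the observations
    from their mean."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r_n = np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)
    return 1.0 - r_n

# A perfect simulation scores 1; a simulation equal to the observed mean scores 0.
obs = np.array([1.0, 3.0, 2.0, 4.0])
print(nash_sutcliffe(obs, obs))                      # 1.0
print(nash_sutcliffe(obs, np.full(4, obs.mean())))   # 0.0
```

This normalization against the observed variance is precisely why series with a strong annual cycle yield high Nash values so easily: the denominator is dominated by the cycle, which even a crude model reproduces.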
Close inspection of the discharge simulations based on the best parameters obtained in the time-domain and in the wavelet-domain, respectively, shows that the two parameter sets yield rather different discharge dynamics (see zooms on the simulations in Fig. 5 and scatter-plots of observed against simulated discharge in Fig. 6). Without further cross-validation data (e.g. observed ice melt data), it is difficult to judge which parameter set captures the observed dynamics better. An interesting hint is, however, given by the following analysis: we build a prediction interval based on 90 of the 100 best random simulations. Under R N, this interval includes around 80% of all observations for the calibration period. This success rate decreases steadily if we evaluate it only for discharges above a certain threshold (Fig. 7). For R W, the success rate, which is overall slightly lower than under R N, remains constant for all discharge thresholds. This clearly suggests that for this case study, good simulations in the wavelet-domain capture all discharge ranges equally well. Good simulations under R N, however, capture particularly well small discharges, i.e. the long periods of easy-to-predict low flows, which, due to their temporal dominance, tend to have a strong influence on the time-domain objective function for this case study.

Table 2. Parameter intervals delimiting the search space for the real-world case study and for the synthetic experiments 2 and 3; parameter sets that do not respect the imposed physical conditions are penalized during parameter set evaluation (for more details about these parameters, see Schaefli et al., 2005).

The above results suggest an important difference between the best parameter sets in the time-domain and the best parameter sets in the wavelet-domain. A look at the parameter space of the most sensitive parameters, the degree-day factors for snowmelt (a snow) and for ice melt (a ice), illustrates this difference: Fig. 8 shows a scatter-plot of a snow against a ice for all "physically feasible" parameter sets, i.e. parameter sets that lead to a bias smaller than 10% (the visible dependence between physically feasible snow and ice melt factors is a common result for this type of model, see Schaefli et al., 2005). The 100 parameter sets that, among all physically feasible parameter sets, have the lowest R W values are highlighted in red; the parameter sets with the lowest R N values are highlighted as triangles. These two groups show that the best parameter sets under R N correspond to a different area of the physically feasible parameter space than the best parameter sets under R W. Retaining good solutions under the quadratic error-based criterion R N further reduces the physically feasible parameter space. In contrast, the group of the best parameter sets under R W appears to show the same dependence between a snow and a ice as the overall group of physically feasible parameter sets. This result suggests that: 1. The bias criterion could be sufficient to ensure solutions that reproduce the dominant frequencies (which is a very interesting result for snow- and ice melt modeling).

Table 4. Experiment 3: parameters used to generate the synthetic reference discharge time series and the identified optimal parameter values using the R W and R N performance criteria; the reference time series has been generated using a variable a snow parameter throughout the year; the a snow parameter values for each month are {1.2, 1.4, 1.4, 2.0, 4.0, 5.0, 6.0, 7.0, 5.0, 2.0, 1.
This last hypothesis is supported by the fact that R N only leads to unbiased parameter estimates if the model residuals are Gaussian with a constant variance (for a discussion, see Kavetski et al., 2006a).For the present case study, the residuals clearly have a different variance during the summer and the winter season (see Schaefli et al., 2006).
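The threshold-dependent success rate evaluated in Fig. 7 can be sketched as follows. This is a simplified version: it uses the full min/max envelope of a toy ensemble rather than the paper's 90% prediction range of the 100 retained simulations, and all names are illustrative:

```python
import numpy as np

def success_rate(obs, sims, threshold):
    """Fraction of observed values at or above `threshold` that fall inside
    the envelope spanned by the ensemble of simulations (one row per
    simulation, one column per time step)."""
    obs = np.asarray(obs, float)
    sims = np.asarray(sims, float)
    lower, upper = sims.min(axis=0), sims.max(axis=0)
    above = obs >= threshold
    if not above.any():
        return np.nan  # no observations above the threshold
    inside = (obs >= lower) & (obs <= upper)
    return inside[above].mean()

# Toy ensemble of 3 simulations over 5 time steps; one observation (9.0)
# falls outside the envelope.
sims = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                 [1.5, 2.5, 3.5, 4.5, 5.5],
                 [0.5, 1.5, 2.5, 3.5, 4.5]])
obs = np.array([1.0, 2.0, 9.0, 4.0, 5.0])
print(success_rate(obs, sims, threshold=0.0))   # 0.8 (4 of 5 inside)
```

Sweeping the threshold over the range of observed discharges and plotting the resulting rates reproduces the kind of diagnostic shown in Fig. 7.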

Model diagnostics
A visual inspection of the average wavelet-power in bands of scales ranging from weeks to a few months shows that even the best models do not capture the observed dynamics well (see an example in Fig. 5c): neither of the best models (under R W or R N) reproduces the frequency content of the observed series well. The two models behave somewhat differently, but the plot of the average power at large scales suggests that, given the observed input and the current model structure, the model cannot produce a closer fit to the estimated wavelet-power of the observed series. The power content tends to be either largely over- or underestimated. The model's inability to reproduce the frequency content at large scales (months) suggests that a storage term is either missing or not well parameterized in the current model structure. In fact, the model does not separately simulate the melt and the transfer of firn, i.e. of old snow that persists for more than one season and that is a transition state between ice and snow. Firn has a degree-day factor and meltwater transfer times that lie between those of ice and snow (e.g. Klok et al., 2001) and induces a further partitioning of the discharge during the melt season. Its contribution to the total discharge depends on the annual snowfall.
The mismatch between the calibrated and observed wavelet periodograms at intermediate scales (weeks) could be a hint that the model does not fully capture the relationship between temperature and snow- and ice melt. This relationship is constant in the current model parameterization, but it is known to vary throughout the melt season (for example due to changes in the albedo; see, e.g. Rango and Martinec, 1995). The models' inability to reproduce these scales could, however, also indicate the need to improve the meltwater transfer through the glacier and the overlying snowpack. In the model, this transfer is encoded by a linear transfer function with a constant parameter. In reality, the transfer of meltwater through a glacier is highly variable in time, since the glacier drainage system evolves throughout the melt season (see, e.g. Willis, 2005). In warm years, it develops and evacuates water much faster than during cold years. A detailed analysis of the wavelet-power in different bands of the observed and simulated time series during years with high and low snowfall, respectively during cold and warm years, could help gain further insight into which of the above model structural deficiencies is more important. The next step would be to improve the model structure. Monitoring the model performance in the wavelet-domain will help to verify that a model modification really acts on the dynamics at the assumed ranges of temporal scales, or whether an achieved performance increase is just "pure chance", due for example to compensations between structural errors. This step requires a considerable reformulation of the model and is left for future research.

Computational costs
The computation of the wavelet-domain performance criterion involves first of all the computation of the wavelet periodogram of the analyzed time series, which requires convolving the signal (observed or simulated time series) with the wavelet at each scale. This implies a number of inverse Fast Fourier Transforms equal to the number of analyzed scales. This "pre-treatment" of the time series before the computation of the performance criterion increases the computational cost by a factor at least equal to the number of scales. The Kolmogorov-Smirnov distance-based performance criterion also involves more calculations than a squared error-based distance measure.
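The Kolmogorov-Smirnov distance mentioned above compares two empirical cumulative distribution functions; how exactly the paper applies it to the wavelet spectra is not restated in this section, but the basic two-sample distance can be sketched as:

```python
import numpy as np

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov distance: the maximum absolute difference
    between the empirical cumulative distribution functions of a and b."""
    a, b = np.sort(np.asarray(a, float)), np.sort(np.asarray(b, float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side='right') / len(a)
    cdf_b = np.searchsorted(b, grid, side='right') / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

print(ks_distance([1, 2, 3, 4], [1, 2, 3, 4]))   # 0.0 (identical samples)
print(ks_distance([0, 0, 0, 0], [1, 1, 1, 1]))   # 1.0 (disjoint samples)
```

The sort and the evaluation of both CDFs on the merged grid explain why this criterion is more expensive than a simple squared-error sum, though it remains cheap compared to the periodogram computation itself.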
In our case, using Matlab® code on a laptop with an Intel® Pentium® M 1.5 GHz processor, the computation of the inverse Fast Fourier Transform for one scale takes roughly twice as long as the computation of a Nash criterion over the entire time series. For a time series with 6939 time steps, computing the Nash efficiency typically takes 0.01 s, whereas computing the wavelet periodogram for 122 scales takes 1.9 s and computing the Kolmogorov-Smirnov distance takes another 0.6 s.
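The per-scale scheme described above (one inverse FFT per scale) can be sketched in NumPy using the Morlet wavelet evaluated in Fourier space, in the style of Torrence and Compo (1998). This is an illustration, not the authors' Matlab implementation; the wavelet parameter omega0 = 6 and the toy signal are assumptions, and no cone-of-influence handling is included:

```python
import numpy as np

def morlet_cwt_power(x, scales, dt=1.0, omega0=6.0):
    """Wavelet periodogram of x: |CWT|^2 with a Morlet wavelet, computed in
    Fourier space with one inverse FFT per analyzed scale."""
    n = len(x)
    x_hat = np.fft.fft(x - np.mean(x))
    omega = 2 * np.pi * np.fft.fftfreq(n, dt)   # angular frequencies
    power = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        # Fourier transform of the normalized Morlet wavelet at scale s,
        # nonzero for positive frequencies only.
        psi_hat = (np.pi ** -0.25) * np.sqrt(2 * np.pi * s / dt) \
                  * np.exp(-0.5 * (s * omega - omega0) ** 2) * (omega > 0)
        power[i] = np.abs(np.fft.ifft(x_hat * psi_hat)) ** 2
    return power

# A pure oscillation with a 64-step period should have its power maximum
# near the corresponding scale.
n = 512
x = np.sin(2 * np.pi * np.arange(n) / 64)
scales = 2.0 ** np.arange(2, 8)   # scales 4, 8, ..., 128
power = morlet_cwt_power(x, scales)
best = scales[np.argmax(power.mean(axis=1))]
print("scale with maximum mean power:", best)   # 64.0
```

The loop makes the cost structure explicit: one FFT of the signal, then one product and one inverse FFT per scale, so the total cost scales linearly with the number of analyzed scales, as observed in the timings above.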

Conclusions and outlook
The present paper discusses and illustrates parameter estimation and model performance analysis of rainfall-runoff models in the wavelet-domain, with the main purpose of showing how this could contribute to hydrological model diagnostics and to model structure improvement.
As discussed based on theoretical considerations and on the presented examples, parameter estimation in the wavelet-domain for at least partly misspecified models can yield different results than parameter estimation in the time-domain. Especially for observed time series with a strongly time-varying frequency content, the suggested approach allows the estimation of model parameter sets that are internally consistent and yield simulations with more plausible dynamics than parameter estimation in the time-domain. At the current stage, however, it is difficult to determine a priori in which cases a calibration in the wavelet-domain could yield better representations of the true system dynamics. Future case studies and theoretical developments should provide insights into this question. A key step will be the detailed study of the behavior of the wavelet-domain performance criterion in the presence of errors in the input or output data.
In general, a detailed investigation of the origin of the differences between the best solutions in the wavelet-domain and in the time-domain can offer additional and new pieces to the puzzle of understanding conceptual model behavior and shortcomings. For the real-world case study presented in this paper, for example, the best parameter sets in the wavelet-domain do not show the same dependence structure as the best parameter sets in the time-domain. Such a result is a hint that the model has structural deficiencies. These deficiencies can then be further investigated by analyzing in detail how the model performs over relevant ranges of temporal scales, by visually inspecting the power content of the wavelet periodograms or by computing wavelet performance measures over certain scales instead of over the entire range of resolvable scales. As illustrated for the real-world case study, this can give valuable indications of model deficiencies and of how to overcome them.
Just as different objective functions can be formulated in the time-domain, the presented wavelet-based criterion corresponds to one possible performance measure in the wavelet-domain. Other formulations (and other wavelets) are possible and would potentially yield other optimal parameter sets. While the statistical properties of different time-domain objective functions are well understood, applications of wavelet spectral analysis in the geosciences are still relatively recent and the related statistical questions have to be further evaluated. We would thus like to emphasize that the potential of parameter estimation in the wavelet-domain lies in the information that it yields for model improvement.
For very long time series, the computational cost of evaluating the wavelet criterion can become significant. This aspect is, however, counterbalanced by the gained insights. We are confident that future case studies, including not only discharge data but also other sources of validation data, will provide additional evidence for the potential of parameter estimation and model diagnostics in the wavelet-domain.
A Matlab ® code for the computation of the presented performance measure can be obtained from the first author.

Fig. 1. (a) Observed discharge time series of the Rhone River at Gletsch and the corresponding (time-domain) calibrated time series (Nash value L N = 1 - R N = 0.94), for the years 1995 (top) and 1997 (bottom); (b) synthetic discharge time series generated with GSM-SOCONT, unperturbed series for experiments 2 and 3 (top), perturbed series for experiment 2 (bottom).

Fig. 2. Estimated wavelet-power for the Rhone case study: (a) wavelet periodogram of the observed discharge (zoom on the 2nd half of the available time series, black line: cone of influence), (b) wavelet periodogram of the simulated discharge, (c) average estimated wavelet-power of both series between the scales of 16 days and 64 days, (d) same as (c) but between the scales of 64 days and 128 days.

B. Schaefli and E. Zehe: Hydrological model performance in the wavelet-domain

Fig. 3. Parameter sensitivity around the optimum value identified under the different calibration criteria; top: experiment 2, the most sensitive parameter a snow (a) and the least sensitive parameter β (b), the other parameters kept constant at the values of Table 3; bottom: real-world case study, the two least sensitive parameters lk (c) and A (d), the other parameters kept constant at the values of Table 5.

Fig. 4. Experiment 3, top: reference and calibrated discharge under R W (the calibrated discharge under R N is almost identical and thus not plotted); bottom: estimated mean wavelet-power between the scales of 16 days and 64 days for the reference discharge and for the simulated discharge calibrated under R W, respectively under R N.

Fig. 5. Real-world case study: (a) zoom on the observed discharge for the year 1986 (calibration period) and the simulation calibrated under R N ; (b) as (a) but simulation calibrated under R W ; (c) zoom on the average estimated wavelet-power between the scales of 128 days and 256 days for the observed discharge and the simulations calibrated under R W , respectively under R N .

Fig. 6. Real-world case study: scatter-plot of observed discharge (during the calibration period) against the best simulation under each of the two calibration criteria.

Fig. 7. Real-world case study: model performance for the best 100 random simulations (out of 20 000 random parameter sets) under R W, respectively R N; the success rate measures the relative number of observed daily discharges above a certain threshold that fall within the 90% prediction range of the retained 100 simulations; left: calibration period, right: entire period.

Fig. 8. Real-world case study: relationship between a snow and a ice for the 100 best parameter sets (out of 20 000 random parameter sets) under R N, respectively under R W.

Table 3. Experiment 2: parameter values used to generate the synthetic discharge time series and the identified optimal parameter sets under R W, R F and R N; the glacier surface is set to 0, which eliminates the parameters a ice, k ice and k snow; calib: calibration period; valid: validation period; for other abbreviations, see Table 1.

Table 5. Real-world case study: optimal parameter values identified using global optimization with R W and R N as objective functions; calib: calibration period; valid: validation period; var. bias: relative difference between the variances of the observed and simulated time series.