Hydrological objective functions and ensemble averaging with the Wasserstein distance

When working with hydrological data, the ability to quantify the similarity of different datasets is useful. The choice of how to make this quantification has a direct influence on the results, with different measures of similarity emphasising particular sources of error (for example, errors in amplitude as opposed to displacements in time and/or space). The Wasserstein distance considers the similarity of mass distributions through a transport lens. In a hydrological context, it measures the 'effort' required to rearrange one distribution of water into the other. While it is more broadly applicable, particular attention is paid to hydrographs in this work. The Wasserstein distance is adapted for working with hydrographs in two different ways, and is tested in the calibration and 'averaging' of hydrographs. This alternative definition of fit is shown to successfully account for timing errors due to imprecise rainfall measurements. The averaging of an ensemble of hydrographs is shown to be suitable when the differences among the members are in peak shape and timing, but not in total peak volume, where the traditional mean works well.


Motivation
A fundamental aspect of hydrology is understanding the distribution and movement of water through the Earth system. It is therefore necessary to quantify the similarity of a pair of spatial (e.g. precipitation fields) or temporal (e.g. hydrographs) distributions of water. The goal of this quantification may simply be to gauge the semblance of the distributions. In other cases, differences in both timing and amplitude are observed, but these are combined into a single amplitude error to give large residuals, even though the human eye might perceive the two distributions as being quite close.
A commonly used metric for comparing both hydrographs and spatial fields is the root mean square error (RMSE), given by Eq. (1),

RMSE(f, g) = sqrt( (1/N) Σ_{i=1}^{N} (f(x_i) - g(x_i))² ),    (1)

where f and g are the functions under comparison, and have known values at the N points x_i.
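As a minimal numerical sketch of this behaviour (not part of the original analysis; the triangular 'hydrograph' below is an invented toy signal), the RMSE of Eq. (1) assigns a large misfit to a pure timing error:

```python
import numpy as np

# Toy example: two identical triangular "hydrograph" peaks, one delayed.
t = np.arange(50)
peak = np.maximum(0.0, 10.0 - np.abs(t - 20))     # peak centred at t = 20
shifted = np.maximum(0.0, 10.0 - np.abs(t - 24))  # same peak, 4 steps late

def rmse(f, g):
    """Point-wise root mean square error of Eq. (1)."""
    return np.sqrt(np.mean((f - g) ** 2))

# A pure timing error yields a misfit of the same order as predicting
# no flow at all, despite the identical peak shape and volume.
print(rmse(peak, shifted))
print(rmse(peak, np.zeros_like(peak)))
```

Although the two peaks are identical in shape and volume, the point-wise comparison treats the delayed signal as a severe amplitude error.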
For the RMSE, f and g are compared by studying the difference in amplitude at specified points. While varying in the details, there is a variety of metrics, such as the Nash-Sutcliffe efficiency (NSE) and the L2 distance, that are similar in their derivation and compare the amplitude of the functions at specified locations. We will refer to metrics of this type collectively as point-wise metrics.

[Fig. 2 caption fragment: the residuals (green field minus purple field), which display the double penalty effect, as the error is in displacement rather than amplitude.]
As we can expect errors in both timing and amplitude when modelling streamflow, point-wise metrics alone are frequently inadequate, and some type of multi-objective function is therefore required (Yapo et al., 1998).
In a more general, multivariate setting, an analogous issue can occur. The misalignment of features displayed in Fig. 2 leads to the 'double penalty' effect, where large errors are assigned for small displacement errors, this occurring both where the model (purple in Fig. 2) has predicted the high amplitudes, and where the high amplitudes are observed (green) (Wernli et al., 2008; Farchi et al., 2016). This poor quantification of fit under the influence of the displacement of features has led to a range of other 'scores' being used when validating forecasts from numerical weather prediction (see, for example, Roberts and Lean 2008; Wernli et al. 2008). These attempt to account for both the differences in particular features of each field, and the distance between these identified features.
An inadequate measure of fit can also lead to spurious local minima in the parameter estimation objective function, making local optimisation algorithms ineffectual (Engquist and Froese, 2014). These are especially common when the recorded signal is multi-modal. In these cases, displacement of such multi-modal features can lead to the 'peaks' becoming incorrectly aligned and producing a locally low misfit (see Fig. 3).
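The emergence of such local minima can be sketched with a toy two-peak signal (an illustrative construction, not taken from the original figures): shifting the signal so that only one of its two peaks aligns with the data produces a spurious dip in the RMSE.

```python
import numpy as np

t = np.arange(60)

def two_peaks(centre):
    """Signal with two narrow Gaussian peaks separated by 8 time steps."""
    return (np.exp(-0.5 * ((t - centre) / 1.5) ** 2)
            + np.exp(-0.5 * ((t - centre - 8) / 1.5) ** 2))

observed = two_peaks(20)
shifts = np.arange(-12, 13)
misfit = np.array([np.sqrt(np.mean((two_peaks(20 + s) - observed) ** 2))
                   for s in shifts])

# The global minimum sits at zero shift, but shifts of -8 and +8, where
# only one pair of peaks is aligned, form spurious local minima.
best = shifts[np.argmin(misfit)]
```

A gradient-based search started near a shift of 8 would converge to the wrong alignment, mirroring the behaviour described above.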
Whether the misfit is unsuitable due to incorrect quantification of timing errors (Fig. 1), the double penalty effect (Fig. 2), or local minima (Fig. 3), the key downfall of a point-wise metric can be traced to the same source: function comparisons occur in complete isolation, independent of all other points. However, 'cross-domain' comparisons are fundamentally necessary to recognise the temporal or spatial displacements which generate these problems. These types of issues are not exclusive to hydrology. For example, seismic signals can be compared via differences in amplitude, or by considering differences in travel time of seismic phases and thus timing errors.

[Fig. 3 caption fragment: the far left and far right lower panels, with shift values of -4 and 4 respectively, both represent local minima in the RMSE function above, because the black curve has just a single peak aligned with the red dash-dot curve in each case, whereas the global minimum at shift value 0 has both peaks aligned.]

Local minima in the misfit function frequently arise when using amplitude-based misfit functions, this being the root of the infamous 'cycle-skipping' problem commonly encountered in full waveform inversion, resulting from similar misalignment issues to those displayed in Fig. 3. Due to these similarities, we are motivated to explore the Wasserstein distance, derived from the field of optimal transport, which has found recent use in seismology for quantifying waveform similarity in cases where amplitude-based misfit functions are inadequate (Engquist and Froese, 2014; Métivier et al., 2016).
While briefly addressed in Ehret and Zehe (2011), to the best of our knowledge the Wasserstein distance has had limited exposure in hydrology. It has been found beneficial for parameter estimation problems in geophysics, due to it comparing data holistically rather than in isolated pairings, with this key property being what we seek to exploit (Engquist and Froese, 2014), and it has also been used in some diagnostic data assimilation tasks with a similar rationale (Feyeux et al., 2018; Li et al., 2019).
Much like techniques considered in Roberts and Lean (2008) and Wernli et al. (2008), the Wasserstein distance is displacement-based, allowing more appropriate misfit quantification under temporal or spatial shifts. Its definition is derived from the concept of 'transporting' one mass distribution onto the other, so it naturally accounts for any such shifts. Consequently, small timing errors in hydrograph comparison do not incur large penalties resulting from the misalignment of peaks in flow, making it a more suitable measure of fit in the presence of timing errors than the RMSE or Nash-Sutcliffe efficiency. Use of the Wasserstein distance bears some similarity to the dynamic time warping technique of Ouyang et al. (2010), the series distance (Ehret and Zehe, 2011), the hydrograph matching algorithm of Ewen (2011), and the cumulative distribution method of Lerat and Anderssen (2015). While distinct from objective functions due to the need to understand the noise statistics of the data, some likelihood functions for Bayesian inference have attempted to account for temporally correlated residuals (e.g. Schoups and Vrugt 2010; Vrugt et al. 2022). However, these generally operate directly on the point-wise residuals, while we use a different definition of residual based upon timing differences.

In this work, rather than detailing highly specific applications with complex models, we instead introduce methods that are suited to a range of tasks in (but not limited to) hydrology. The main goal is to highlight some of the key concepts of optimal transport, and how they may find use in hydrology, via simple, illustrative examples. The limitations of its use are also discussed, along with where further research is required. We also limit our experiments to one-dimensional problems, in particular the comparison of hydrographs. However, Sect. 1.2 details optimal transport in a more general sense, as it conserves its useful properties in multiple dimensions and so can also be used for comparing spatial fields.

Optimal transport
Optimal transport (OT) provides the numerical machinery for interrelating density functions. As distributions of water can be well represented as a density function with appropriate scaling, we suggest that OT gives the required tools for making more satisfactory comparisons than can be achieved by point-wise misfit functions.

Before the more modern interpretation with regards to density functions, OT originated with the work of Monge (1781), who was tasked with finding the most cost-effective way of filling a hole with a pile of dirt. OT is hence fundamentally concerned with reshaping one distribution of mass into another whilst minimising the requisite effort.
Little progress was made on generalised solution methods for this problem until it was rephrased by Kantorovich in 1942. This forms the foundation of many modern computational methods (Peyré and Cuturi, 2019). However, we will introduce OT in a form closer to the original Monge definition, as this is based on continuous functions. While measured discretely, distributions of water have an underlying continuity in time or space, so this description is appropriate.
The central aspect of Monge's studies was an optimal transport map, acting as a directory between each location in the pile of dirt (known as the source) and an associated location in the hole (known as the target). By following this optimal map, the source distribution can be rearranged into the target distribution with minimal work. The relation between the source, target, and transport map is presented in Fig. 4. Key restrictions for transport are mass conservation, meaning the source and target must have equal mass, and non-negativity, meaning that there cannot be negative mass at any point in either. These properties have led to OT being framed as between probability densities, as these naturally obey both requirements. If we denote the source as f and the target as g, these functions must satisfy Eq. (2) and Eq. (3),

∫_X f(x) dx = ∫_Y g(y) dy = 1,    (2)

f(x) ≥ 0 for all x ∈ X,  g(y) ≥ 0 for all y ∈ Y.    (3)

Here, X is the space over which f is defined, and Y is the space over which g is defined. In the case of hydrographs, these will be the time window over which the streamflow was measured, while for spatial fields, they will be the spatial extent of the recordings.

With the overarching goal of transporting f to g with minimal work, we must first define some notion of 'effort' through the cost function, c(x, y), giving the amount of work it takes to transport one unit of mass from x to y. With this cost function in hand, the total cost of following some transport map T will be given by Eq. (4),

I[T] = ∫_X c(x, T(x)) f(x) dx.    (4)
Finding the optimal transport map (denoted as T*) therefore amounts to minimising Eq. (4), subject to the additional requirement that this map transports f exactly to g. Needless to say, this is a complicated functional minimisation problem, so it will not be discussed in depth here, as a simple analytical result is instead used. Direction to alternative methods is also provided in Sect. 2.1. As a mapping between densities, the optimal transport map is similar to transform sampling methods such as inverse transform sampling, or the trajectories in flow anamorphosis (van den Boogaart et al., 2017). However, we should reiterate that the optimal map differs in that it not only transforms between the densities, but also does so with minimal work.

Once found, the optimal transport map can be used to define the p-Wasserstein distance, the central concept of interest for this work. The p-Wasserstein distance (or simply, the Wasserstein distance) is defined using the p-norm raised to the p-th power as the cost function, giving the cost of Eq. (5),

c(x, y) = Σ_{k=1}^{D} |x_k - y_k|^p,    (5)

where D is the dimensionality of the densities and subscripts are vector components.
The choice of p reflects how severely mass transport is to be penalised. That is, if p is large, mass transport across long distances will require more 'work' relative to small transport distances and thus be avoided if possible. Upon finding the optimal transport map under the cost function Eq. (5), the Wasserstein distance is given by Eq. (6),

W_p(f, g) = ( ∫_X c(x, T*(x)) f(x) dx )^{1/p}.    (6)
The Wasserstein distance is symmetric, and a true metric on the space of probability density functions, a deeper discussion of which can be found in Ambrosio et al. (2008) and Villani (2009). However, for the purpose of the present discussion, it is sufficient to state that similar mass distributions require little work to transport between, and thus have a low Wasserstein distance in comparison to dissimilar distributions, which require more work for transport. We will pay particular attention to the square of the 2-Wasserstein distance, W_2^2, due to its convex properties that make it well suited for parameter estimation purposes (Engquist and Froese, 2014). We therefore set p = 2 in Eq. (5) and Eq. (6).

Armed with the Wasserstein distance we can, for example, gauge the similarity of two hydrographs, or a pair of precipitation fields. Again, note the key property that this is a global comparison, considering the total cost of transporting the entire source to the target, rather than locally quantifying the distance using point-wise comparisons. It therefore naturally accounts for displacements through the transport map and assigns a more appropriate penalty for this type of error. We can see this through the simple numerical experiment in Fig. 5, where the Wasserstein distance provides a consistent interpretation of displacement, even when the features are not overlapping. It also mitigates the occurrence of local minima, due to its cross-domain comparison. In a parameter estimation context, this provides a smooth, convex function well suited to gradient-based local optimisation methods.
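This behaviour can be sketched numerically (an illustration with invented pulse densities; scipy's built-in routine computes the 1-Wasserstein distance rather than the W_2^2 used later, but the qualitative contrast is the same): the transport-based misfit keeps growing linearly with displacement even after the features no longer overlap, while the RMSE saturates.

```python
import numpy as np
from scipy.stats import wasserstein_distance

t = np.arange(200, dtype=float)

def pulse(centre):
    """Narrow unit-mass density on the grid t."""
    p = np.exp(-0.5 * ((t - centre) / 3.0) ** 2)
    return p / p.sum()

base = pulse(50)
for d in (10, 40, 80):
    moved = pulse(50 + d)
    w = wasserstein_distance(t, t, base, moved)  # ~d: linear in displacement
    r = np.sqrt(np.mean((base - moved) ** 2))    # saturates once disjoint
    print(d, round(w, 2), round(r, 5))
```

Once the pulses are disjoint, the point-wise misfit is identical for displacements of 40 and 80, whereas the transport-based misfit still distinguishes them.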
These properties with respect to displacements have encouraged application in atmospheric chemistry and seismology alike (Farchi et al., 2016; Engquist and Froese, 2014; Métivier et al., 2016). However, while the Wasserstein distance naturally measures differences in shape and displacement, it is blind to differences in total mass between the source and target, as they are scaled to unit mass prior to computation. Potential methods to counter this are proposed in Sect. 3 for applications where this may prove to be problematic.

Wasserstein barycenters
Beyond a straightforward measure of fit, the Wasserstein distance can also be used to define an 'average' of density functions, known as the Wasserstein barycenter. The barycenter is the distribution that is the closest, in a Wasserstein sense, to the distributions we are finding the average of. This makes it the density that minimises a weighted sum of Wasserstein distances.
If the ensemble of densities is denoted g_i and the barycenter f, then f must obey both the requirements of a density, Eq. (2) and Eq. (3), whilst minimising Eq. (7),

Σ_i λ_i W_2^2(f, g_i),    (7)

with the relative weights of the component densities given by λ_i.
As OT makes cross-domain comparisons, the Wasserstein barycenter gives a more intuitive intermediate distribution when densities are separated in time or space. We can explore this through a simple example in Fig. 6.
Taking the arithmetic mean at each point of the two distributions gives a poor intermediate representation. Both of the densities have a single peak, but the intermediate in Fig. 6 has two, so has not captured the characteristics of either. However, the Wasserstein barycenter provides a more satisfactory average, conserving the shape of the peak. This again can be traced to how we define similarity. The conventional mean is defined with respect to amplitude, while the Wasserstein barycenter is an average with respect to the displacement pathway, also known as the geodesic (Ambrosio et al., 2008). Section 3 uses the Wasserstein barycenter to find the average hydrograph among an ensemble, each member perhaps derived from a different climate or hydrological model, allowing conservation of peak shape and number in the case where equivalent peaks are equal in volume.

A brief survey
Now that the favourable properties of the Wasserstein distance for mass transport problems have been explored, we must discuss the computational methods. This aspect is perhaps the greatest difficulty in large-scale implementation. There are a variety of available techniques, some of which have been explored previously in similar applications, whilst others are a source of future discussion. A brief survey follows, with further details given in the associated references, many of which have unexplored potential for hydrological applications.
It was shown by Brenier (1991) that minimising Eq. (4) can be transformed into a partial differential equation when the squared Euclidean distance is used as the cost function. This can then be solved numerically, an example of one such solution being given in Benamou et al. (2014). This was the preferred method of Engquist and Froese (2014) and Yang et al. (2018) for the comparison of seismic waveforms, although it is a technique not frequently seen in applied studies. The more commonly used approach, especially for machine learning and graphics applications, is to use the Kantorovich formulation and thus consider the source and target densities as discrete probability distributions, for which a linear program can be solved to obtain a transport plan (a generalised version of the transport map). While this is effective for a broader range of problems than the Monge-Ampère equation, which is specific to the squared Euclidean distance cost function, solving the linear program is computationally expensive and scales poorly with the size of the problem. This burden can be somewhat alleviated using the entropically regularised approach pioneered by Cuturi (2013), which has become the foundation for modern machine learning applications. Further efficiency can be achieved on regularly spaced data using the convolutional entropic regularisation developed in Solomon et al. (2015). Also derived from the entropically regularised approach are the stochastic optimisation methods of Genevay et al. (2016) and Seguy et al. (2018). These are particularly useful when samples can be drawn from the source and target densities.
There appears to be a bias in the literature towards discrete approaches, seemingly due to many machine learning problems not having a continuous interpretation, leaving the Monge-Ampère and related continuous approaches meaningless. We therefore should not be discouraged from using continuous methods for problems in the Earth sciences with spatial or temporal data, despite the relative lack of previous applications in the present literature. Indeed, we can see previous success of these methods in Engquist et al. (2016) and Yang et al. (2018).

One-dimensional transport
The method used in this work is structured around the special case of transport in one dimension, which accommodates a highly computationally efficient solution. When the source density f and target density g are both one-dimensional, the optimal transport map between them is given by Eq. (8),

T*(x) = G^{-1}(F(x)).    (8)
Here, we have used F and G to represent the cumulative distribution functions (CDFs) of the source and target, defined according to Eq. (9),

F(x) = ∫_{-∞}^{x} f(s) ds,   G(y) = ∫_{-∞}^{y} g(s) ds.    (9)
We should note here that in the one-dimensional case, using the optimal transport map is the same as the inverse transform sampling method. Upon substituting the optimal map Eq. (8) into Eq. (6) to obtain the p-Wasserstein distance, a change of variables gives the simple form Eq. (10),

W_p^p(f, g) = ∫_0^1 |F^{-1}(s) - G^{-1}(s)|^p ds.    (10)
This reveals that, in the one-dimensional case, the Wasserstein distance is a measure of discrepancy between the inverse CDFs of the source and target. A visualisation of the one-dimensional Wasserstein distance under this interpretation is given in Fig. 7. In this study, we use the Nelder-Mead algorithm for optimisation, so the gradient of the Wasserstein distance is not required.
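A minimal discretisation of Eq. (10) (an illustrative sketch, assuming densities sampled on a regular grid) only requires numerically inverted CDFs:

```python
import numpy as np

def quantile(t, density, s):
    """Inverse CDF of a density sampled on the grid t, at mass levels s."""
    cdf = np.cumsum(density)
    return np.interp(s, cdf / cdf[-1], t)

def wasserstein_1d(t, f, g, p=2):
    """Discretisation of Eq. (10): average |F^-1(s) - G^-1(s)|^p over s."""
    s = np.linspace(0.005, 0.995, 1000)  # avoid the flat tails of the CDFs
    return np.mean(np.abs(quantile(t, f, s) - quantile(t, g, s)) ** p)

t = np.linspace(0.0, 100.0, 2001)
f = np.exp(-0.5 * ((t - 40) / 5) ** 2)
g = np.exp(-0.5 * ((t - 55) / 5) ** 2)  # identical shape, delayed by 15

w22 = wasserstein_1d(t, f, g)  # approximates W_2^2 = 15**2 = 225
```

For two identically shaped densities offset by a pure time shift, the quantile difference is constant, so W_2^2 recovers the squared shift directly.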
However, if a gradient-based method were to be used, the derivative for the one-dimensional case can be derived directly from Eq. (10). Indeed, it was shown by Sambridge et al. (2022) that both the Wasserstein distance and its derivative can be computed through a sorting algorithm followed by a dot product. Engquist and Froese (2014) and Métivier et al. (2016) have also made use of the derivatives of the Wasserstein distance for parameter estimation, although they did not use the one-dimensional form.
Much like the Wasserstein distance, finding the Wasserstein barycenter of a series of densities is vastly simplified in one dimension. Rather than performing some type of optimisation scheme to minimise Eq. (7), we can find the barycenter directly from the inverse CDFs. Keeping the notation from Eq. (7), we have Eq. (11) (Bonneel et al., 2015),

F^{-1}(s) = Σ_i λ_i G_i^{-1}(s).    (11)
That is, the inverse CDF of the barycenter is the weighted sum of the inverse CDFs of the component densities. Finding the barycenter then becomes a problem of converting an inverse CDF evaluated at a set of discrete points into a density function. Interchanging the (s, F^{-1}) coordinate pairs converts from the inverse CDF F^{-1} to the CDF F. A finite difference differentiation method can then be used to move from the CDF to the density function f. Finally, this can be interpolated into a smooth density function.
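The procedure just described can be sketched as follows (an illustration with two synthetic single-peak densities; the inverse CDFs are obtained by interpolation rather than analytically):

```python
import numpy as np

t = np.linspace(0.0, 100.0, 2001)
g1 = np.exp(-0.5 * ((t - 30) / 4) ** 2)  # single peak at t = 30
g2 = np.exp(-0.5 * ((t - 70) / 4) ** 2)  # same peak, at t = 70

def inv_cdf(density, s):
    cdf = np.cumsum(density)
    return np.interp(s, cdf / cdf[-1], t)

# Eq. (11): the inverse CDF of the barycenter is the weighted sum of the
# component inverse CDFs (equal weights here).
s = np.linspace(0.001, 0.999, 999)
bary_inv = 0.5 * inv_cdf(g1, s) + 0.5 * inv_cdf(g2, s)

# Interchange the (s, F^-1) pairs to recover the CDF on the grid t, then
# differentiate to return to a density.
bary_cdf = np.interp(t, bary_inv, s)
bary = np.gradient(bary_cdf, t)

peak_at = t[np.argmax(bary)]       # single peak midway (~50), whereas the
mean_of_pair = 0.5 * (g1 + g2)     # point-wise mean keeps both peaks
```

The barycenter of the two shifted peaks is a single peak at the midpoint, while the point-wise mean is bimodal, mirroring Fig. 6.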
Overall, the Wasserstein barycenter definition of average is equivalent to the histogram interpolation of Read (1999), although the author appears to have proposed this idea independently. The relation between the densities and their barycenter is displayed in Fig. 8.
Again, it should be made clear that the computational results discussed here apply only to the one-dimensional case. Different methods must be used for multivariate problems, a selection of which are given in Sect. 2.1. There is also the potential to use the alternative sliced Wasserstein distance (Rabin et al., 2011) or the more general continuous extension known as the Radon Wasserstein distance (Bonneel et al., 2015), which both extend the one-dimensional results using one-dimensional projections of the multivariate mass distribution. The Radon and sliced Wasserstein distances display similar properties to the true Wasserstein distance, and are a potential alternative to the more computationally expensive methods.
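As a sketch of the sliced construction (assuming two-dimensional point clouds, and using the 1-Wasserstein distance from scipy for the one-dimensional projections):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

def sliced_wasserstein(x, y, n_dirs=50):
    """Monte Carlo sliced 1-Wasserstein distance between 2-D point clouds:
    the average 1-D distance over random projection directions."""
    total = 0.0
    for _ in range(n_dirs):
        angle = rng.uniform(0.0, np.pi)
        direction = np.array([np.cos(angle), np.sin(angle)])
        total += wasserstein_distance(x @ direction, y @ direction)
    return total / n_dirs

x = rng.normal(size=(500, 2))
y = x + np.array([3.0, 0.0])      # the same cloud, translated by (3, 0)
d = sliced_wasserstein(x, y)      # ~ 3 * 2/pi for a pure translation
```

Each projection reduces the multivariate problem to the efficient one-dimensional case, at the cost of a Monte Carlo average over directions.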

Adaptation to hydrology
OT is defined only for probability densities. For the applications we envisage, the non-negativity requirement will naturally be obeyed, as we are measuring inherently positive masses of water. The restriction of unit mass requires a little further consideration. We could take one of two differing philosophies: modify the data so it is compatible with OT, or redefine OT such that it works with the type of data used. Here, we will consider both. Firstly, we can scale the water distributions such that they have unit mass, and then compute the Wasserstein distance. The total mass of the densities is given by Eq. (12),

M_f = ∫_X f(x) dx,   M_g = ∫_Y g(y) dy.    (12)

The scaling to unit mass is then given by Eq. (13),

f̂(x) = f(x) / M_f,   ĝ(y) = g(y) / M_g.    (13)
Though this indeed makes the data compatible with OT, it also means that the Wasserstein distance becomes completely insensitive to differences in total mass. A multi-objective approach can be used to allow both shape (Wasserstein) and total mass (additional penalty term) to be accounted for, giving a misfit of the form Eq. (14),

J(f, g) = W_2^2(f̂, ĝ) + γ h(M_f, M_g).    (14)
Here, γ provides the relative importance of the total mass error compared to the Wasserstein distance, while h is the specific form of the penalty term. The quadratic function Eq. (15),

h(M_f, M_g) = (M_f - M_g)²,    (15)

is a simple choice for this penalty.
By selecting an appropriate weighting factor, we can develop a balanced misfit measure that contains the favourable properties of the Wasserstein distance, whilst also removing the insensitivity to total mass differences. Note however that this weighting is a choice, so may require some tuning procedure to attain the desired balance between the two aspects of fit.
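A sketch of the penalised misfit of Eq. (14) and Eq. (15) (assuming one-dimensional densities on a regular grid, with the shape term computed from the inverse CDFs as in Sect. 2.2):

```python
import numpy as np

def penalised_misfit(t, f, g, gamma=10.0):
    """Eq. (14) with the quadratic penalty of Eq. (15): the squared
    2-Wasserstein distance between the unit-mass rescaled densities,
    plus gamma times the squared total mass difference."""
    dt = t[1] - t[0]
    m_f, m_g = f.sum() * dt, g.sum() * dt  # Eq. (12): total masses
    s = np.linspace(0.005, 0.995, 500)
    def inv_cdf(h):
        c = np.cumsum(h)
        return np.interp(s, c / c[-1], t)  # normalisation applies Eq. (13)
    w22 = np.mean((inv_cdf(f) - inv_cdf(g)) ** 2)
    return w22 + gamma * (m_f - m_g) ** 2

t = np.linspace(0.0, 100.0, 1001)
f = np.exp(-0.5 * ((t - 50) / 5) ** 2)
# Same shape but double the mass: only the penalty term contributes.
print(penalised_misfit(t, f, 2.0 * f))
```

Because the CDF normalisation removes the mass difference from the shape term, the penalty is the only component that responds to a change in total volume, which is exactly the balance the weighting γ controls.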
We will now take the alternative approach of making a modification to OT to suit our hydrological purposes. The proposed result is derived for one-dimensional data, although it could potentially be extended to the multivariate case in the same manner as the Radon Wasserstein distance (Bonneel et al., 2015). We can build the new definition upon the foundation that we can compute the cumulative mass of a non-negative function, regardless of whether this mass sums to one, and that this function will be monotonically increasing. The key step of one-dimensional OT is that points of equal cumulative mass are matched in the optimal map (see Eq. (8) and Fig. 7). A similar concept with cumulative densities of arbitrary mass can be designed.

Inspired by the ordinary Wasserstein distance, we can map the points of half cumulative mass together (medians in the case of probability densities). To ensure these points are always aligned regardless of total mass, a shift can be applied such that the cumulative densities are centred at this halfway point using Eq. (16) and Eq. (17),

F_c(t) = F(t) - M_f / 2,    (16)

G_c(t) = G(t) - M_g / 2.    (17)
For the matching of points of equal cumulative mass, the inverses of these are required. As we are working in the time window [0, T], we can limit the inverse to this range, giving the piece-wise functions Eq. (18) and Eq. (19),

F_c^{-1}(s) = 0 for s ≤ -M_f/2;  F^{-1}(s + M_f/2) for -M_f/2 < s < M_f/2;  T for s ≥ M_f/2,    (18)

with G_c^{-1} defined analogously using M_g.    (19)
In a transport sense, we can see from Fig. 9 that this amounts to transporting the excess (or deficit) mass to (or from) the boundaries, which act as a source or sink to maintain the mass balance. In this case with a temporal domain, the boundaries are the start and end of the time window. The excess or deficit mass is therefore associated with times before t = 0 or after t = 500, whichever is closer in time. This makes it an alternate implementation of the method discussed in Farchi et al. (2016), who also proposed having sources or sinks at the boundaries of a spatial domain to complete the mass balance. Note that, like the traditional one-dimensional Wasserstein distance, this is a result that can be computed efficiently without the need to solve a partial differential equation, linear program, or iterative scheme. Now that the mapping and form of the inverse cumulative distributions have been proposed, the modified Wasserstein distance, under the interpretation that it is the total transport cost, can be defined according to Eq. (20),

W̃_2^2(f, g) = ∫_{-∞}^{∞} |F_c^{-1}(s) - G_c^{-1}(s)|² ds.    (20)

While Eq. (20) is expressed with infinite bounds, the integrand is only non-zero over the interval [-M_max/2, M_max/2], where M_max = max{M_f, M_g}, so it can easily be computed numerically over this range.
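A sketch of the modified distance (assuming densities on a regular grid over the window [0, T]; the clamping behaviour of np.interp plays the role of the piece-wise inverses of Eq. (18) and Eq. (19)):

```python
import numpy as np

def hydrograph_wasserstein(t, f, g, n=2000):
    """Modified distance of Eq. (20): centre the cumulative masses on
    their halfway points and clamp the inverses to the time window, so
    excess mass is transported to the boundaries."""
    dt = t[1] - t[0]
    F = np.cumsum(f) * dt  # cumulative mass, not rescaled to unit mass
    G = np.cumsum(g) * dt
    m_max = max(F[-1], G[-1])
    s = np.linspace(-m_max / 2, m_max / 2, n)
    f_inv = np.interp(s, F - F[-1] / 2, t)  # clamps to t[0] and t[-1]
    g_inv = np.interp(s, G - G[-1] / 2, t)
    return np.mean((f_inv - g_inv) ** 2) * m_max  # integrate over s

t = np.linspace(0.0, 500.0, 2001)
f = np.exp(-0.5 * ((t - 200) / 20) ** 2)
g = np.exp(-0.5 * ((t - 230) / 20) ** 2)  # the same event, 30 units late

d_shift = hydrograph_wasserstein(t, f, g)  # sensitive to the timing error
d_same = hydrograph_wasserstein(t, f, f)   # zero, identical hydrographs
```

For equal-mass hydrographs the construction reduces to the ordinary one-dimensional case, while unequal masses are carried to the window boundaries rather than being rescaled away.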
The adjustment of barycenters for distributions of arbitrary total mass is much simpler. We can use the scaling defined in Eq. (13) and find the barycenter of these distributions, then scale the result such that its total mass is the average of the composite distributions. This is equivalent to replacing the Wasserstein distance in Eq. (7) with the penalised distance Eq. (14) under a quadratic penalty Eq. (15).

Calibration
As a first application, the calibration of conceptual rainfall-runoff models using the Wasserstein distance as a minimisation objective was considered. This experiment is similar to that found in Lerat and Anderssen (2015), but uses all model parameters in the hydrological model.
A conceptual rainfall-runoff model gauges the relationship between a rainfall time-series (hyetograph) and streamflow time series (hydrograph) for a particular watershed. By calibrating the model parameters to best capture this relationship, the model can be used for forecasting future streamflow under chosen rainfall conditions, or to infer properties of the watershed itself.

Automatic calibration methods seek to do this by minimising the discrepancy between the simulated streamflow from the rainfall-runoff model and the streamflow that was observed at gauging stations (Nash and Sutcliffe, 1970).
As hydrographs are time-series, the efficient one-dimensional techniques described in Sect. 2.2 were used. The application of the Wasserstein distance to this problem bears some similarity to the technique of dynamic time warping (DTW), an application of which to hydrology can be found in Ouyang et al. (2010), the series distance developed in Ehret and Zehe (2011), the cumulative distribution method in Lerat and Anderssen (2015), and the hydrograph matching technique in Ewen (2011). These also strive to allow timing errors to be better accounted for when comparing hydrographs.
Following Lerat and Anderssen (2015), a non-linear storage model was employed to test the misfit functions. The storage in the model evolves according to the differential equation Eq. (21).
The outward streamflow flux is given by the non-linear model Eq. (22).
Using some initial volume of water in storage, we can update the storage and compute the streamflow at each time step for any given set of model parameters and hyetograph. This gives three model parameters (m1, m2 and m3) that must be calibrated using given rainfall and runoff measurements. To work in a controlled environment while testing the main characteristics of the Wasserstein distance, synthetic rainfall events were used and the model was used to generate the expected streamflow. We then attempted to recover the model parameters used to generate the synthetic observations. As precipitation is a more uncertain measurement than streamflow, we subjected the synthetic rainfall event to timing errors, recalling that these types of errors are poorly represented when using point-wise misfit metrics (Ehret and Zehe, 2011). Figure 10 shows the true synthetic hyetograph, the observed hyetograph corrupted by timing errors, and the streamflow generated by the true rainfall using the storage model defined in Eq. (21) and Eq. (22).
We then calibrated the three model parameters to this streamflow, using the erroneous rainfall measurements with a variety of misfit functions. Note that in the case of error-less data and a unique solution, all misfit functions discussed here would have a global minimum of zero at the true model parameters. Optimisation was performed with the RMSE, the penalised Wasserstein distance (γ = 10, 100), and the hydrograph-Wasserstein distance using the Nelder-Mead algorithm, implemented through the scipy Python package (Nelder and Mead, 1965; Virtanen et al., 2020). Figures 11, 12 and 13 show the misfit as a function of the model parameters, along with the true model parameters and the optimised parameters under each misfit.
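The experiment can be sketched as follows. The storage model of Eq. (21) and Eq. (22) is not reproduced here, so the toy model below is a hypothetical stand-in (chosen so that m1 and m2 control hydrograph shape and m3 the total output mass, as in the later discussion); scipy's 1-Wasserstein distance with a quadratic mass penalty stands in for the penalised misfit of Eq. (14):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import wasserstein_distance

t = np.arange(200, dtype=float)
rain = np.zeros_like(t)
rain[20:24] = 5.0                 # a synthetic rainfall burst

def simulate(params, p):
    """Hypothetical toy storage model (illustration only):
    dS/dt = m3 * p - q, with outflow q = m1 * S ** m2."""
    m1, m2, m3 = params
    S, q = 0.0, np.zeros_like(p)
    for i in range(len(p)):
        q[i] = m1 * max(S, 0.0) ** m2
        S = S + m3 * p[i] - q[i]
    return q

true_params = np.array([0.1, 1.2, 0.8])
observed = simulate(true_params, rain)
rain_late = np.roll(rain, 5)      # rainfall corrupted by a timing error

def misfit(params):
    try:
        q = simulate(params, rain_late)
    except (OverflowError, ZeroDivisionError):
        return 1e6                # reject unphysical parameter proposals
    if not np.all(np.isfinite(q)) or np.any(q < 0) or q.sum() <= 0:
        return 1e6
    w = wasserstein_distance(t, t, observed, q)
    return w + 10.0 * (q.sum() - observed.sum()) ** 2

result = minimize(misfit, x0=[0.05, 1.0, 1.0], method="Nelder-Mead")
```

The derivative-free Nelder-Mead search mirrors the study's setup, where no gradient of the misfit is required.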
The Wasserstein-based distances provided a misfit minimum closer to the true model parameters than the RMSE. A greater understanding of this can be garnered by examining the hydrographs produced by the calibrated model (Fig. 14).
The Wasserstein-based distances better recovered the model parameters, and visually give a better hydrograph fit. The RMSE calibration underestimated the maximum flow peaks, in favour of more sustained lowered flows. The reason for this can be traced back to Fig. 1, where misaligned steep peaks gave large residuals. This can be lessened by having wider peaks, even though this gives a visually less satisfactory hydrograph and poorer estimates of model parameters. We can conclude that, for this simple synthetic case study, the Wasserstein distance has promising attributes with respect to timing errors in hydrograph fitting. Of course, this only shows the efficacy of the Wasserstein distance for one particular rainfall event. Random synthetic rainfall events were therefore generated using the method described in Appendix A, with the model calibrated to each event.
The calibrated model parameters are compiled into boxplots in Fig. 15.
Whilst still having some variability and bias, the Wasserstein-based distances performed significantly better across the 500 trials, with a median result closer to the true value for all parameters and lower variability. The ordinary Wasserstein distance outperformed the hydrograph-Wasserstein distance for this model. The size of the penalty weighting factor γ had little effect on the location of the optimum for the Wasserstein distance, but altered the shape of the misfit surface (see Figs. 11, 12 and 13).
The better performance of the penalised Wasserstein distance compared to the hydrograph-Wasserstein distance may be due to the form of this model. As m1 and m2 purely control hydrograph shape and m3 purely controls total output mass, there is a complete separation in which model parameters affect which aspect of the misfit; that is, no model parameter simultaneously influences both shape and total mass. Consequently, there is no 'trade-off' between lowering the Wasserstein distance and lowering the penalty term; both are always possible, as the model parameters only influence both terms when mass is pushed outside the time window (hence there being any variability in the m3 panel of Fig. 15).

If there were model parameters that influenced both shape and total mass, this separation of aspects would not be present and the magnitude of γ may have increased importance due to the trade-off between fitting the shape and balancing the mass of the hydrographs when tuning a model parameter.

Instantaneous unit hydrograph model
Section 3.1 showed that the Wasserstein distance accounts for timing errors better than point-wise metrics. The effect of delay times from within the hydrological model itself is now considered. A simple unit hydrograph model was used to generate such delays between rainfall and the associated streamflow. The instantaneous unit hydrograph (IUH) is defined as the streamflow response for a unit impulse of effective rainfall over the catchment. Under the assumption of a linear response, the stream hydrograph can be modelled via a convolution of the rainfall and IUH, giving the model Eq. (23).
In reality, we can only measure the streamflow and rainfall, leaving us to infer the IUH from such measurements. To enforce some set of desired characteristics on the IUH, such as smoothness, positivity, and unit mass, it can be restricted to belonging to a parametric set, each member of which is of the correct form. A commonly used synthetic unit hydrograph is the Nash hydrograph (Nash, 1957). This assumes the watershed is composed of a series of linear reservoirs, giving a unit hydrograph of the form u(t) = (t/k)^(θ−1) e^(−t/k) / (kΓ(θ)). This leaves us with two parameters, θ and k, to calibrate. Although there are multiple existing ways of solving this problem, meaning the OT approach is not strictly necessary here, this application exhibits some of the key features we would like to explore more generally in rainfall-runoff model calibration whilst keeping the model structure simple.
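The Nash IUH and the convolution model can be sketched as follows; the discretisation chosen here is an assumption for illustration, and Eq. (23) itself is not reproduced in this excerpt.

```python
import numpy as np
from scipy.special import gamma as gamma_fn

def nash_iuh(t, theta, k):
    """Nash IUH: a gamma density with shape theta and scale k (unit mass)."""
    return (t / k) ** (theta - 1.0) * np.exp(-t / k) / (k * gamma_fn(theta))

dt = 0.1
t = np.arange(0.0, 60.0, dt)

rain = np.zeros_like(t)
rain[50:70] = 1.0  # one synthetic block of effective rainfall

iuh = nash_iuh(t, theta=3.0, k=5.0)

# Linear-response assumption: streamflow is the discrete convolution of
# effective rainfall with the IUH, truncated to the original time grid.
flow = np.convolve(rain, iuh)[: t.size] * dt
```

Because the IUH integrates to one, the simulated flow conserves the rainfall volume (up to truncation of the gamma tail), which is what makes calibrating θ and k a purely shape-and-timing problem.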
The effective rainfall time-series was generated using the method described in Appendix A with 10 rainfall events, giving the output displayed in Fig. 16, where the true model parameters of the IUH are θ = 3, k = 5.
The behaviour of both the Wasserstein distance and the RMSE was explored here with respect to the time delays induced by the IUH model. The misfit as a function of the two model parameters is shown in the left panels of Fig. 17.
The key difference between the misfit surfaces in Fig. 17 is that local optimisation methods will converge towards the true optimal point (marked red in the left panels) from a greater number of starting locations when using the Wasserstein-based distance. While many initial estimates with the RMSE will also converge to this same point, there are also alternative minima to which the optimisation scheme may converge. A grid of starting estimates of the model parameters was used to examine this more concretely in the right panels of Fig. 17. Furthermore, a good initial estimate may be generated using the method of moments for the Nash IUH. Nevertheless, these results show the ability of the Wasserstein distance to recognise the error in improperly aligned peak flows.
The power of the Wasserstein distance may come to the fore when a more complex watershed and thus model is used, perhaps producing multiple pulses per rainfall event and yielding multiple peaks in the IUH. These results also capture the general behaviour of models possessing significant delay times, allowing the simulated hydrograph to 'shift' in time across the observations, in a similar manner to Fig. 3.
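The contrast between a point-wise misfit, which saturates once peaks no longer overlap, and the Wasserstein distance, which grows linearly with the shift, can be seen in a small toy comparison (not taken from the paper):

```python
import numpy as np
from scipy.stats import wasserstein_distance

t = np.linspace(0.0, 50.0, 2001)

def pulse(centre, width=1.0):
    """Unit-mass Gaussian pulse centred at `centre`."""
    p = np.exp(-0.5 * ((t - centre) / width) ** 2)
    return p / p.sum()

f = pulse(20.0)
for shift in (2.0, 5.0, 10.0):
    g = pulse(20.0 + shift)
    rmse = np.sqrt(np.mean((f - g) ** 2))
    w = wasserstein_distance(t, t, u_weights=f, v_weights=g)
    print(f"shift={shift:4.1f}  RMSE={rmse:.5f}  W={w:.2f}")
```

For identically shaped pulses, the 1-D Wasserstein distance equals the time shift itself, whereas the RMSE is essentially the same for all non-overlapping shifts and so carries no information about how far the peak has moved.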

Hydrograph barycenters
Beyond the use of the Wasserstein distance purely as an objective function, attention will now be turned to the application of Wasserstein barycenters to hydrology, using hydrographs as the object of study. The power of the Wasserstein barycenter is that it gives a notion of 'average' when features have been displaced.
This means that an ensemble of hydrographs describing the same event, perhaps with different climate or hydrological models, can be 'averaged' into a single hydrograph that carries characteristics of each ensemble member. Note that this will only be true if the main source of difference between ensemble members is timing and peak shape differences, rather than peak volume differences.
To test this premise, two different IUH models were applied to the same synthetic rainfall event, giving the differing hydrographs shown in the top panel of Fig. 18. As they were from the same rainfall event and a unit hydrograph model was used, the total volume of water in each hydrograph is the same. Unlike the ordinary mean, the Wasserstein barycenter captured the correct number and general characteristics of the peak flows.
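In one dimension, the Wasserstein barycenter can be computed by averaging the quantile functions of the ensemble members. The sketch below is an assumed implementation with hypothetical hydrograph shapes, illustrating the behaviour described above.

```python
import numpy as np

def quantiles(t, h, q):
    """Quantile function of a hydrograph treated as a distribution over time."""
    cdf = np.cumsum(h) / h.sum()
    return np.interp(q, cdf, t)

def barycenter_1d(t, hydrographs, n_q=2000):
    """Equal-weight 1-D Wasserstein barycenter via quantile averaging."""
    q = (np.arange(n_q) + 0.5) / n_q
    mean_q = np.mean([quantiles(t, h, q) for h in hydrographs], axis=0)
    # Bin the averaged quantiles back onto the time grid (unit total mass).
    hist, _ = np.histogram(mean_q, bins=t.size, range=(t[0], t[-1]))
    return hist / hist.sum()

t = np.linspace(0.0, 20.0, 400)
h1 = np.exp(-0.5 * (t - 6.0) ** 2)   # peak at t = 6
h2 = np.exp(-0.5 * (t - 10.0) ** 2)  # same shape, peak at t = 10

bary = barycenter_1d(t, [h1, h2])
```

For these two identically shaped, time-shifted peaks the barycenter is a single peak near t = 8, whereas the point-wise mean shows two half-height peaks at t = 6 and t = 10.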
Again, this is built on the assumption that the peaks of each hydrograph mainly differ in shape and timing, but not volume. If they differed in volume, the barycenter could exhibit peaks in unusual locations, as the mass is being 'viewed' halfway through transport across the domain.
Consider the two hydrographs shown in the top panel of Fig. 19. They are the same in timing, but differ in the volume of water. In this case, the regular mean gave a much better notion of average, as the difference between the hydrographs is better explained in terms of amplitude, rather than displacement as was the case in Fig. 18.
We therefore see that while the Wasserstein barycenter is well suited for an ensemble of hydrographs with differing peak flow timings, it is not well suited for amplitude differences. This is to be expected from its definition. We suggest that the conventional mean is best suited for an ensemble with differing amplitudes but consistent timings, and the Wasserstein barycenter when timing is variable but the volumes of the peak flows are consistent.

Conclusions
Quantifying the similarity of temporal or spatial distributions of water occupies an important role within hydrology. The way in which a 'good' fit is defined directly influences the character of the results. Many commonly used misfit functions, such as the root mean square error and Nash-Sutcliffe efficiency, quantify fit by considering differences in amplitude. While this is perfectly acceptable when errors are restricted to amplitude, these measures of misfit do not well quantify displacement errors.
In this work, we have suggested the Wasserstein distance, derived from optimal transport, as a candidate misfit function for applications where displacement or timing errors are prominent. This quantifies difference in terms of the effort required to transform one mass distribution into another. A modification of traditional OT and the associated 'hydrograph-Wasserstein distance' was also developed.
While the Wasserstein-based measures certainly gave better calibration results than the RMSE, there was still a slight bias towards peaks of reduced amplitude, although to a much lesser extent. Some type of multi-objective method may circumvent this, by adding a measure comparing the amplitude of peak flows. Although the Wasserstein distance with a penalty term outperformed the hydrograph-Wasserstein distance in these synthetic tests, there may be broader implications for this second misfit function. Very little attention has been given thus far to the modification of OT for particular applications, with most studies preferring to force data into the density-function mould. Other ways of modifying OT, whilst still capturing its essence, are therefore a key point of further research, as they may allow the simultaneous capture of displacement and total-mass errors without the need for a user-defined weighting term between the two aspects. Focus in this work has been placed upon the Wasserstein distance, rather than the optimal map from which it is derived. There is unexplored potential in this optimal map for providing a two-way mapping between collected data and a reference distribution, in a similar vein to flow anamorphosis (van den Boogaart et al., 2017; Talebi et al., 2019).
As noted by Ehret and Zehe (2011), a major drawback of the Wasserstein distance for hydrographs is the mapping of water masses between functionally different parts of the hydrographs. For example, if the modelled volume of a particular storm event exceeds the observations, this excess water is mapped to the next (and unrelated) storm event. If the storm events are separated by long dry periods, a large transport cost is incurred by moving the mass between the events. Therefore, when fitting multiple events interspersed with dry spells, it may be more appropriate to segment the hydrographs into discrete events and compute the Wasserstein distance individually within each time window, to prevent this mapping between events.
The objective function would then become the summation of the Wasserstein distances for each time segment. This would not be possible for wet periods with overlapping rainfall events, but the penalty for transporting mass between events would be much smaller in this case as they occur closer together in time.
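A sketch of such an event-segmented objective might look like the following; this is an assumed implementation, with the event windows taken as given rather than detected automatically.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def segmented_wasserstein(t, obs, sim, windows):
    """Sum of per-event Wasserstein distances over (start, end) windows."""
    total = 0.0
    for start, end in windows:
        m = (t >= start) & (t < end)
        if obs[m].sum() > 0 and sim[m].sum() > 0:  # skip dry windows
            total += wasserstein_distance(
                t[m], t[m], u_weights=obs[m], v_weights=sim[m]
            )
    return total

t = np.arange(0.0, 100.0, 0.1)
peak = lambda c: np.exp(-0.5 * ((t - c) / 2.0) ** 2)

obs = peak(10.0) + peak(60.0)  # two observed storm events
sim = peak(12.0) + peak(63.0)  # simulated peaks shifted by 2 and 3
total = segmented_wasserstein(t, obs, sim, windows=[(0.0, 50.0), (50.0, 100.0)])
```

Because each event is compared only within its own window, the total here is approximately the sum of the two time shifts (2 + 3 = 5), with no cost incurred for transport across the intervening dry period.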
It is also important to remember that for more complex models and data, we cannot expect the Wasserstein distance to capture all aspects of fit. Much like the amplitude-based metrics, it prioritises particular aspects of fit, and has the potential to yield large amplitude errors at specific points so long as mass has not been shifted far from the target location. We are therefore led back to our introductory question: what is a good 'fit'? If we know, or expect, timing or spatial displacement errors in our measurements or model output, a transport-based definition of fit would seem appropriate. However, this may not be the case when other types of errors are at play. A multi-objective approach may therefore still be a necessary, and even encouraged, technique. The results here do, however, show that the Wasserstein distance is suitable for quantifying displacement errors, and is thus a valuable addition to the toolbox of misfit functions in operation in hydrology.
The length of each of these rainfall events is then chosen from a uniform distribution with specified maximum length t_max, so the length of the i-th event is given by l_i ∼ Uniform(0, t_max). Together, these give a start and end time for each storm, where the end time e_i is given by e_i = s_i + l_i. The intensity of the rainfall r_i for each storm is chosen independently from an exponential distribution with rate parameter λ, so that r_i ∼ Exponential(λ). In this study, we set λ = 1, as the rainfall units are arbitrary.

To corrupt the measurements with timing errors, the timing of each storm is shifted independently by a number δ_i drawn from a uniform distribution ranging between a negative and positive maximum time shift δ_max, so that δ_i ∼ Uniform(−δ_max, δ_max). This gives the altered start and end times for each storm as s'_i = s_i + δ_i and e'_i = e_i + δ_i. Between the start and end time of each rainfall event, constant rainfall of the randomly chosen intensity is applied. If rainfall events overlap, the intensities are added over the cross-over period.
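The generator described above can be sketched in code as follows. The way start times are drawn is an assumption, as that part of Appendix A is not reproduced in this excerpt; setting delta_max = 0 gives an uncorrupted series.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_rainfall(t, n_events=10, t_max=5.0, lam=1.0, delta_max=1.0):
    """Random block rainfall on grid t, per the Appendix A recipe (sketch)."""
    rain = np.zeros_like(t)
    starts = rng.uniform(t[0], t[-1] - t_max, size=n_events)      # assumed
    lengths = rng.uniform(0.0, t_max, size=n_events)              # l_i ~ U(0, t_max)
    intensities = rng.exponential(1.0 / lam, size=n_events)       # r_i ~ Exp(lambda)
    shifts = rng.uniform(-delta_max, delta_max, size=n_events)    # delta_i
    for s, l, r, d in zip(starts, lengths, intensities, shifts):
        # Constant rainfall of intensity r over the (shifted) event window;
        # overlapping events simply add their intensities.
        rain[(t >= s + d) & (t < s + l + d)] += r
    return rain

t = np.linspace(0.0, 100.0, 2001)
rain = synthetic_rainfall(t)
```

Note that numpy's `Generator.exponential` takes a scale parameter, so a rate λ corresponds to `scale = 1/lam`.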
Author contributions. J.C.M. wrote the manuscript and code for this paper. J.C.M. and M.S. both developed the theory of the work. M.S. assisted in editing and preparation of the manuscript, and supervised the project.