Numerical daemons of hydrological models are summoned by extreme precipitation

https://doi.org/10.5194/hess-2021-28. Preprint. Discussion started: 5 February 2021. © Author(s) 2021. CC BY 4.0 License.

Abstract. Hydrological models are usually systems of nonlinear differential equations for which no analytical solutions exist and thus rely on numerical solutions. While some studies have investigated the relationship between numerical method choice and model error, the extent to which extreme precipitation such as that observed during hurricanes Harvey and Katrina impacts numerical error of hydrological models is still unknown. This knowledge is relevant in light of climate change, where many regions will likely experience more intense precipitation. In this experiment, a large number of hydrographs are generated with the modular modeling framework FUSE (Framework for Understanding Structural Errors), using eight numerical techniques across a variety of forcing data sets. All constructed models are conceptual and lumped. Multiple model structures, parameter sets, and initial conditions are incorporated for generality. The computational cost and numerical error associated with each hydrograph were recorded. Numerical error is assessed via root mean square error and normalized root mean square error. It was found that the root mean square error usually increases with precipitation intensity and decreases with event duration. Some numerical methods constrain errors much more effectively than others, sometimes by many orders of magnitude. Of the tested numerical methods, a second-order adaptive explicit method is found to be the most efficient because it has both a small numerical error and a low computational cost. A small literature review indicates that many popular modeling codes use numerical techniques that were suggested by this experiment to be suboptimal. We conclude that relatively large numerical errors may be common in current models, highlighting the need for robust numerical techniques, in particular in the face of increasing precipitation extremes.

Introduction
Computational hydrological models describe the movement and distribution of water within a region. They enjoy frequent use within and outside of academia, addressing a diversity of topics, from determining catchment characteristics (Kirchner, 2009; Rempe and Dietrich, 2014; Wrede et al., 2015; Melsen et al., 2018) and assessing water supply security (Paton et al., 2013) to deciding which areas are in danger of flooding (Jasper et al., 2002; Madsen et al., 2014).
Hydrological models usually have state variables, which describe relevant physical quantities, and fluxes, which describe how the state variables change over time or space. For example, a state variable could be the amount of water in the unsaturated zone of a catchment, and fluxes interacting with that state variable could be evaporation, percolation to the saturated zone, discharge from the catchment, or precipitation, among others. Differential equations are used to describe the relationships between fluxes and state variables, and these equations are generally too complex to admit analytical solution strategies. Therefore, hydrological models contain mathematical relationships between state variables and fluxes (as well as relationships between state variables themselves) that need to be solved numerically (approximately) rather than analytically (exactly).
While discharge predictions resulting from hydrological models are often reasonably accurate (Refsgaard and Knudsen, 1996; Refsgaard, 1997; Addor et al., 2011), they are always subject to error. Total hydrological model error can be decomposed in a few ways, e.g., into observational, structural, and numerical components, among others. Observational errors are differences between real and observed values caused by inaccurate measurements, and structural errors are differences between model results and observed quantities due to the conceptual simplification or misrepresentation of processes in a model compared to reality. Numerical errors are differences between the exact and approximate solutions to the set of equations composing the model that result from the choice of numerical method used to find an approximate solution (Higham, 2002). While many recent efforts in hydrology have advanced observational data quality (Wrede et al., 2015; Kittel et al., 2018) or structural representations of nature (Wrede et al., 2015; Melsen et al., 2018; Melsen and Guse, 2019; Prancevic and Kirchner, 2019; Dralle et al., 2018; Coxon et al., 2014), relatively few studies investigate the effects of numerical choices on model error. Notably, it has been demonstrated that numerical method choice has a large impact on hydrological modeling error (Clark and Kavetski, 2010; Kavetski and Clark, 2010). These papers test the ability of various numerical methods to approximate exact solutions and to predict real discharges, including fixed step and adaptive methods, implicit, explicit, and semi-implicit methods, and first- and second-order methods. Changing these options leads to differences in how fluxes are calculated, and it is shown that certain combinations of these qualities allow for a relatively accurate approximation of the exact solution of a hydrological model.
The numerical daemons papers (Clark and Kavetski, 2010; Kavetski and Clark, 2010) provide useful numerical conventions in observational contexts, mainly under general hydroclimatic conditions. As the climate changes and extreme precipitation events become more common (Trenberth, 2011; Meehl et al., 2005; Prein et al., 2017; Huang et al., 2020), one might expect numerical errors to become larger. It is simple to demonstrate that numerical errors could depend on precipitation extremeness, as in Fig. 1.
This figure shows discharge hydrographs resulting from the same model using two different numerical methods. The conditions yielding the left and right plots are identical in every way, except for the precipitation intensity used to simulate discharge, which is 3 times larger in the graph on the right. The only differences between the blue and green hydrographs are caused by the choice of numerical method. Clearly, the two methods agree relatively poorly when higher intensity precipitation data are used; thus, there is a greater opportunity for a numerical method to provide an erroneous representation of the exact solution when precipitation is more intense.

Figure 1. Illustration of the impact of precipitation intensity on differences between discharge simulations from the same model but with two different numerical techniques. Discharge hydrographs are shown resulting from two different numerical methods, indicated by blue and green. These are the fixed step explicit Heun and adaptive implicit Heun methods, respectively, as discussed in Sect. 2. The only difference between the left and right plots is that the precipitation intensity is 3 times greater during days 2 through 6 for the right plot. With this figure, we do not suggest that any particular climate's precipitation will triple; this is merely used to demonstrate the relationship between precipitation intensity and potential numerical error.

This is compatible with the findings that models can perform relatively poorly under relatively intense precipitation regimes (Weerts and El Serafy, 2006; Noh et al., 2014; Jasper et al., 2002), where it is possible that numerical error contributes to total modeling error. Furthermore, one might intuitively expect greater numerical errors for more extreme precipitation. The amount of water precipitated by a more extreme storm may be comparable to the total storage of a model, whereas this is not the case for milder storms. Given a sufficiently large time step or a large enough flux, fixed step numerical solvers are not equipped to handle large precipitation events. In essence, it is known that numerical error can contribute significantly to total modeling error, and it is known that total modeling errors can increase with increasing precipitation extremeness, but it is currently not known how changing precipitation extremeness (intensity or duration) will impact the numerical error associated with a model.
In this paper, the same eight numerical methods as in Clark and Kavetski (2010) and Kavetski and Clark (2010) are studied (see Sect. 2 for descriptions of these methods). We aim to determine which qualities of these numerical methods contribute to numerical error as precipitation is varied from mild to extreme. To this end, the numerical error associated with each method is assessed over a broad range of precipitation intensities, ranging from mild to more intense than historically observed, for various event durations.
In order to thoroughly investigate the relationship between numerical error and precipitation extremeness for each numerical technique, testing must occur under a variety of conditions. Numerical error could depend on model structure (i.e., the specific relationship between state variables and fluxes, as well as the choice of which state variables to include), physical setting (given by a specific combination of model parameters), and antecedent state variable values. This dependence is due to the fact that each of these will affect the set of differential equations composing a model. In order to systematically test these different conditions, a modular modeling framework (MMF) is required. Modular modeling frameworks are tools that are able to rapidly create hydrological models with various processes or structures included, varying numerical schemes (with some MMFs), and other options related to model setup or input. This allows a comparison of model results as these settings are systematically altered. The Framework for Understanding Structural Errors (FUSE) is the MMF selected for this experiment.
We are also interested in the change in efficiency of numerical techniques with respect to precipitation extremeness, where efficiency involves both the numerical error associated with a method and its computational expense. Therefore, trends in computational expense with changing precipitation are also assessed.
It is possible that many existing hydrological modeling codes use numerical methods that yield large numerical errors. In order to gain insight into how often relatively erroneous numerical methods might be employed, a small literature review is conducted on 12 popular modeling codes, determining the numerical options available in each. When the numerical techniques used by the reviewed codes are placed in the context of this study, an estimate of the potential magnitude of numerical errors arising from these codes is obtained.
In summary, we use an MMF to study the numerical error and computational expenses associated with different numerical techniques as precipitation varies from mild to extreme. This is carried out under a variety of initial conditions, parameter sets, and model structures for further generality. Trends in error and efficiency are analyzed. Then, a literature review suggests the prevalence of potential numerical errors in popular modeling codes. Finally, we present a discussion of the key findings and their practical implications for modeling, where we hypothesize that numerical errors can be reduced in practice via a careful selection of the numerical method.
2 Overview of numerical methods used for approximating differential equation solutions in this experiment

A hydrological model is usually composed of a system of differential equations relating state variables and fluxes. These equations are usually highly nonlinear, which means that they cannot be solved in closed form (though this is not always the case; see Coxon et al., 2019). Thus, a numerical approximation must be used. In this section, a description of the numerical techniques studied in this paper is given, together with a justification for the selection of these specific methods. The reader might choose to skip this section if familiar with numerical methods for approximating solutions of systems of differential equations. In this experiment, each numerical technique is used to solve a system of coupled ordinary differential equations, where the equations are coupled because the flux from each state variable generally depends on multiple state variables. For example, the net flux from the saturated zone could be a function of the amount of water in the saturated zone and the amount of water in the unsaturated zone. These systems can be solved in two ways: via sequential solving (or operator splitting), where the equations are solved (and the state variables are updated) in some predetermined order, or via simultaneous solving, where all equations are solved at once (which requires a state-space formulation of the equations which compose a model). Sequential solvers are able to use different numerical methods for individual fluxes, which in some fields is desirable (for example, Glowinski et al., 2017 state that "splitting of diffusion terms and convection terms in a convection-diffusion partial differential equation" allows for a faster solution of the differential equation).
However, when using sequential solvers, the model output may depend on the sequence in which state variables are solved, which is undesirable when comparing different model structures (Glowinski et al., 2017). Using the simultaneous method, N equations will be solved at the same time with the same numerical technique, where N is the total number of state variables, so that n = 1, . . . , N in the equations of this section.
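The order dependence of sequential solving can be illustrated with a minimal sketch. The following is a toy two-reservoir model (the linear fluxes, parameter values, and function names are hypothetical illustrations, not FUSE code): an explicit Euler step is taken either simultaneously, with all fluxes evaluated from the same state vector, or sequentially, with the first store updated before its outflow feeds the second.

```python
# Toy illustration (not FUSE code): two coupled linear reservoirs,
#   dS1/dt = P - k1*S1,   dS2/dt = k1*S1 - k2*S2,
# stepped with explicit Euler either simultaneously or sequentially
# (operator splitting). Parameter values are arbitrary.

def fluxes(S, P, k1=0.3, k2=0.1):
    S1, S2 = S
    return (P - k1 * S1, k1 * S1 - k2 * S2)

def step_simultaneous(S, P, dt=1.0):
    # All fluxes evaluated from the same (old) state vector.
    f = fluxes(S, P)
    return tuple(s + dt * fi for s, fi in zip(S, f))

def step_sequential(S, P, dt=1.0, k1=0.3, k2=0.1):
    S1, S2 = S
    S1_new = S1 + dt * (P - k1 * S1)            # update S1 first ...
    S2_new = S2 + dt * (k1 * S1_new - k2 * S2)  # ... then S2 sees the new S1
    return (S1_new, S2_new)

S0 = (10.0, 5.0)
# The split solver "sees" the already-updated S1, so the two S2 values differ:
print(step_simultaneous(S0, P=2.0), step_sequential(S0, P=2.0))
```

Swapping the update order in the sequential variant would change the result again, which is exactly the order dependence the simultaneous (state-space) formulation avoids.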
The simplest numerical method studied in this paper is the fixed step explicit Euler method (credited to Leonhard Euler). This method works by evaluating the flux for a state variable at the start of a time step, then adding this flux to the state variable at the beginning of the time step to generate the value for the state variable at the end of the time step. An illustrated example is given in Fig. 2a. Symbolically, the method is as follows:

S_n(t + Δt) = S_n(t) + Δt f_n(S(t), t) + O(Δt),     (2)

where Δt is the time step, S_n is a state variable, f_n is the time derivative, or net flux, of S_n at time t and is a function of S and t, and S is a vector of all relevant state variables for computing f_n, determined by the model structure. The flux f is indicated to be a function of S and t because fluxes generally depend on both (for example, net flux to the unsaturated zone depends on precipitation, which is a function of time, and the amount of water in the unsaturated zone, which is a state variable). O(Δt) indicates that the truncation error, or the difference between this numerical approximation and an exact solution to the differential equation, is proportional to the size of the time step to the first power. This can be shown by equating the right-hand side of Eq. (2) to a Taylor series expansion of the left-hand side of Eq. (2), yielding error terms where the lowest order in Δt is 1. Thus, this method is first-order accurate. The method is known as fixed step because the time step can only be one value (in this experiment, all fixed steps are daily) and explicit because it calculates the flux using state variables that are already known (Süli and Mayers, 2003).

The fixed step implicit Euler method is similar to its explicit counterpart in that it is first-order accurate and has a fixed time step. It is different in that the derivative at the end of the time step, rather than the start of the time step, is used to calculate the state variable at the end of the time step. However, the derivative of a state variable is generally a function of the state variables, which are initially unknown for the end of the time step. So, the implicit Euler method must iteratively solve a system composed of equations in the style of Eq. (4) by searching for the values of S(t + Δt) which yield a derivative that matches S_n(t). In the graphical example of Fig. 2b, the value for y at the end of the time step is first approximated with an explicit Euler calculation, and then the method checks if the derivative there leads back to the value at the beginning of the time step within a given mass balance error tolerance (the mass balance error tolerance and monitoring methods are not studied in this paper). If not, this process is repeated with a new guess for S_n(t + Δt) until the derivative at the end of the time step leads back to the value of y at the beginning of the time step or, graphically, until the dotted orange line leads back to the value of y at t = 0. Symbolically, the method is given by the following:

S_n(t + Δt) = S_n(t) + Δt f_n(S(t + Δt), t + Δt) + O(Δt),     (4)

where f_n is now the derivative of S_n at time t + Δt. The iterations required by fixed step implicit methods can make them more computationally expensive than their explicit counterparts.
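The two Euler variants can be sketched in a few lines. The following toy one-store model (the flux f(S) = P - k*S², the parameter values, and the fixed-point iteration used to solve the implicit equation are illustrative assumptions, not FUSE's actual formulation) shows the explicit step and an iterative implicit step seeded with the explicit estimate, as in the description above.

```python
# Sketch of one fixed-step explicit and one implicit Euler step for a
# hypothetical one-store model with net flux f(S) = P - k*S**2.
# (Illustrative only; FUSE's flux equations are more involved.)

def f(S, P, k=0.01):
    return P - k * S ** 2

def explicit_euler_step(S, P, dt=1.0):
    # Flux evaluated at the *start* of the step.
    return S + dt * f(S, P)

def implicit_euler_step(S, P, dt=1.0, tol=1e-10, max_iter=100):
    # Flux evaluated at the (initially unknown) *end* of the step; here
    # solved by simple fixed-point iteration, seeded with the explicit guess.
    S_new = explicit_euler_step(S, P, dt)
    for _ in range(max_iter):
        S_next = S + dt * f(S_new, P)
        if abs(S_next - S_new) < tol:
            break
        S_new = S_next
    return S_new

S0, P = 20.0, 5.0
print(explicit_euler_step(S0, P), implicit_euler_step(S0, P))
```

On convergence, the implicit result satisfies S_new = S + Δt f(S_new) to within the iteration tolerance, which is the defining property of the method; production codes typically use Newton-style iterations rather than this simple fixed-point scheme.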
However, a property of implicit methods is that they are unconditionally stable (Jameson and Turkel, 1981) and can, in some situations, be more accurate than explicit methods. The fixed step semi-implicit Euler method works by first performing an explicit Euler calculation, followed by a single correction in the style of implicit Euler. Visually, this would yield a result at the second guess value in Fig. 2b, and, mathematically, this is achieved by a single iteration of Eq. (4) after the initial estimate of S_n(t + Δt). This method therefore typically has a computational expense somewhere between those of the fixed step implicit and explicit Euler methods and is first-order accurate. It is typically able to constrain instabilities (Kavetski et al., 2002).
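As a sketch, the semi-implicit Euler step amounts to an explicit Euler predictor followed by exactly one implicit-style correction. The one-store flux below is a hypothetical illustration (not FUSE's formulation):

```python
# Semi-implicit Euler: explicit Euler predictor, then a single corrector
# iteration. Toy flux f(S) = P - k*S**2 with arbitrary parameters.

def f(S, P, k=0.01):
    return P - k * S ** 2

def semi_implicit_euler_step(S, P, dt=1.0):
    S_star = S + dt * f(S, P)     # explicit Euler predictor
    return S + dt * f(S_star, P)  # single implicit-style correction

print(semi_implicit_euler_step(20.0, 5.0))
```

Because it stops after one correction, the cost per step is bounded and predictable, unlike the fully implicit method, whose iteration count varies.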
The fixed step explicit Heun method (credited to Karl Heun) works by first explicitly calculating the derivative at the start of a time step (in exactly the same style as fixed step explicit Euler), then explicitly calculating the derivative at the end of the time step using the initial explicit Euler prediction, averaging the derivatives from the start and end of the time step, and then using this corrected average derivative in order to make a final prediction of the state variable value at the end of the time step. In Fig. 2c, the blue slope, or the explicit Heun derivative, is simply the average of both dotted red slopes, where the dotted red slope on the right is calculated using the initial explicit Euler result y* at t = 1, and the dotted red slope on the left is the same slope as in Fig. 2a. Symbolically, the method is as follows:

S_n(t + Δt) = S_n(t) + (Δt/2) [f_n(S(t), t) + f_n(S*(t + Δt), t + Δt)] + O(Δt²),

where S*(t + Δt) is the initial explicit Euler prediction of the state variables. This is a second-order method; its truncation error is proportional to Δt², meaning that as Δt approaches 0, the truncation error approaches 0 faster than it would with first-order numerical methods.

The final tested fixed step method is the fixed step implicit Heun method, which is also second order. This method is commonly known as the implicit trapezoidal or trapezium method (Süli and Mayers, 2003). It is called the implicit Heun method in this paper because it is an implicit analog of the explicit Heun method and for consistency with Clark and Kavetski (2010). The two Heun methods are similar in that they both use the explicit Euler prediction at the beginning of the time step, but the derivative is calculated implicitly in the implicit Heun method, incorporating both the known value of the state variables at time t and the initially unknown value of the state variables at time t + Δt. Symbolically, the method is as follows:

S_n(t + Δt) = S_n(t) + (Δt/2) [f_n(S(t), t) + f_n(S(t + Δt), t + Δt)] + O(Δt²),     (8)

where Eq. (8) must be solved iteratively. Graphically, this will appear somewhat similar in method to Fig.
2b.

While so far only fixed step methods have been discussed, this paper also tests adaptive methods, which are able to adapt their step sizes based on various criteria. Adaptive methods can reduce numerical error because a reduced step size reduces the truncation error. They can also reduce computational expense because a small step size is not always necessary for high numerical accuracy (Press and Teukolsky, 1992). This paper examines the following three adaptive methods: the adaptive semi-implicit Euler method, the adaptive explicit Heun method, and the adaptive implicit Heun method, which are all adaptive analogs of their fixed step counterparts. The adaptive semi-implicit Euler method adapts its step size by calculating the error e, or the difference between its explicit Euler component and its final value. Specifically, after computing S_n(t + Δt), it checks whether

|e| ≤ τ_r S_n(t + Δt) + τ_a,

where τ_r is the relative error tolerance (unitless), and τ_a is the absolute error tolerance; respectively, these default to 0.01 and 0.01 mm. Both e and S_n have units of millimeters. Each state variable must satisfy this threshold. If the step size is accepted, then S_n(t + Δt) is taken as the state variable for the end of the time step, and the step size is adjusted according to Appendix B in Clark and Kavetski (2010). If the step size is rejected, then it is reduced incrementally until convergence criteria are satisfied, which is also detailed in Appendix B of Clark and Kavetski (2010). The adaptive implicit and explicit Heun methods make the same comparison, although they calculate e based on differences between a first-order component and the second-order prediction at the end of the time step rather than between two first-order predictions, as in the case of the adaptive semi-implicit Euler method. In all of these adaptive methods, various precalculated components are compared, so additional calculations are not needed for the sake of error control.
Therefore, these are embedded error control methods (Press and Teukolsky, 1992). In total, eight distinct numerical methods are employed in this experiment. These broadly represent popular choices in hydrologic models; this is further tentatively supported by the results of the literature review (see Sect. 4.7), where none of the numerical techniques used by the surveyed modeling codes were significantly different from those available in FUSE. Nonetheless, a few other choices of numerical methods are briefly described here. Midpoint methods are similar to Heun methods in that they are second order (Süli and Mayers, 2003). The difference is that midpoint methods calculate an intermediate flux in the middle of the specified time step and then average this result with another preliminary flux calculated at the end of the time step in order to produce a final flux for the end of the time step, whereas Heun methods perform the same strategy on subsequent time steps (rather than subdividing a single time step). Higher order Runge-Kutta methods (credited to Carl Runge and Wilhelm Kutta) are also promising choices to numerically solve systems of differential equations; these methods take further terms in the Taylor expansion that is used to approximate the exact solution of a system of differential equations. They therefore have numerical errors which are proportional to the time step raised to the order of the method (e.g., a fifth-order method, as in Schoups et al., 2010). Finally, note that the models in this experiment are lumped rather than spatially distributed, and, therefore, the backward Euler method, which is directly analogous to the implicit Euler method, is not considered.

Methods
This study aims to examine the relationship between precipitation extremeness and numerical error for a variety of numerical methods implemented in hydrological models. This section describes the conditions under which models are run, the methods by which numerical error and computational expense are assessed, and the conditions of the literature review by which the approximate magnitudes of errors resulting from popular modeling codes are assessed.

Modular modeling frameworks are fairly new tools in hydrology. These are pieces of code or software that are able to rapidly create hydrological models with various processes or structures included, varying numerical schemes (with some MMFs), and other options related to model setup or input. This allows for the controlled comparison of models, enabling studies on model structure, model uncertainty, and a wealth of other topics. In this case, an MMF offers the opportunity to study the change in numerical error resulting from the instantaneous (unrouted) discharge hydrograph as the dimensions of model structure, numerical method, parameters, initial conditions, and forcing data are systematically altered. This allows us to test for generality concerning the relation between numerical error and precipitation extremeness over a large dimensional space, as well as account for interactions among these dimensions. The Framework for Understanding Structural Errors (FUSE) was selected for this experiment. To the best of our knowledge, this is the only MMF that allows for implicit and explicit numerical methods of higher order than 1. Note that the update to FUSE allowing for higher-order numerical techniques was introduced by Clark and Kavetski (2010), after FUSE's initial development.
Five different FUSE members, i.e., five different model structures, are used to generate all hydrographs. Different combinations of state variables and fluxes are used in each. These include FUSE 070, 536, 550, 092, and 330, which are taken directly from Clark and Kavetski (2010) and are from a list of "models broadly representative of the wide spectrum of conceptual hydrological models used in research and practice". All of these models are lumped rather than distributed. The FUSE snow module is always off in this experiment.
The model runs are generated across 20 parameter sets. This makes the results more generalizable, simulating hydrograph behavior in different physical settings. The parameter sets are generated via a Latin hypercube (LH) method, ensuring they cover as large a region of the parameter space as possible. The LH technique is used on the total set of parameters used across all models; each model, having a different structure, does not require the full set of LH-optimized parameters. In order to determine whether a sufficient number of parameter sets was used, hydrographs resulting from both 20 and 80 parameter sets are separately generated and compared.
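The Latin hypercube step can be sketched with SciPy's quasi-Monte Carlo utilities. The parameter names and ranges below are hypothetical placeholders, not the actual FUSE parameter bounds used in the experiment.

```python
# Sketch of Latin hypercube sampling of parameter sets, in the spirit of
# the setup described above. Parameter names and ranges are hypothetical.
import numpy as np
from scipy.stats import qmc

param_bounds = {                      # hypothetical ranges, not FUSE's
    "max_storage_mm": (50.0, 500.0),
    "recession_k": (0.01, 0.9),
    "percolation_exp": (1.0, 10.0),
}
lower, upper = map(list, zip(*param_bounds.values()))

sampler = qmc.LatinHypercube(d=len(param_bounds), seed=42)
unit_sample = sampler.random(n=20)            # 20 sets in the unit hypercube
param_sets = qmc.scale(unit_sample, lower, upper)

# Each column places exactly one sample in each of the 20 equal-width
# strata, which is what spreads the sets across the parameter space.
print(param_sets.shape)
```

Each model structure would then draw only the subset of columns corresponding to the parameters it actually uses, as in the experiment described above.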

Synthetic forcing data
In order to determine the change in model numerical error as precipitation varies from mild to extreme, a clearer idea of what constitutes extreme precipitation is required. Here, extreme precipitation is some combination of intense and long-lasting precipitation. In order to test the effect of both changing precipitation intensity and duration on numerical error, precipitation data sets are synthetically generated along the intensity, duration, and frequency (IDF) curves shown in Fig. 3. All points in Fig. 3 are World Meteorological Organization world records (World Meteorological Organization, 1994) in total precipitation for a given duration. We generate six IDF curves, representing interpolations between the world record events, scaled by factors of 0.01, 0.025, 0.075, 0.25, 1, and 1.2. These factors logarithmically span a large range of rainfall intensities, ranging from mild to larger than historically observed. The inclusion of precipitation events that are more intense than have been historically observed is due to the projected increase in wet day precipitation intensity under emission scenario SRES (Special Report on Emissions Scenarios) A2; for many regions, this exceeds 20 % when comparing the simulated periods 2081-2100 and 1980-1999 (Seneviratne et al., 2017). Synthetic precipitation data were generated along these interpolations for the following three different event durations: 5, 10, and 20 d. These are selected with the motivation that flood modeling for larger catchments is often done for precipitation events at these timescales (Jasper et al., 2002; Weerts and El Serafy, 2006). Thus, in total, 18 synthetic precipitation data sets were generated. The shape of the rainfall signals used in the majority of the analysis is flat, such that each day has the same precipitation intensity for a given event, though the effect of using precipitation data based on real events is briefly discussed.
Total accumulated precipitation per event ranges between 40.1 mm (for the 5 d event with a daily intensity of 1 % of the average intensity of the interpolated 5 d historical maximum) and 9787.2 mm (for the 20 d event with a daily intensity of 120 % of the average intensity of the interpolated 20 d historical maximum).
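The construction of the flat events can be sketched as follows. The interpolated record totals used here are back-computed from the two event totals stated above (40.1 mm is 1 % of the 5 d interpolation and 9787.2 mm is 120 % of the 20 d interpolation); they are inferred for illustration, not quoted from the IDF fit itself.

```python
# Back-of-the-envelope construction of the flat synthetic rainfall events
# described above. The interpolated record totals are *inferred* from the
# stated event totals (40.1 mm = 1% of the 5 d record interpolation,
# 9787.2 mm = 120% of the 20 d one), not taken from the fitted IDF curve.

record_total_mm = {5: 4010.0, 20: 8156.0}   # inferred totals per duration (d)
scale_factors = [0.01, 0.025, 0.075, 0.25, 1.0, 1.2]

def flat_event(duration_d, factor):
    """Daily intensities (mm/d) for a flat event scaled from the record."""
    daily = factor * record_total_mm[duration_d] / duration_d
    return [daily] * duration_d

mild = flat_event(5, 0.01)     # 8.02 mm/d for 5 days, 40.1 mm in total
extreme = flat_event(20, 1.2)  # 489.36 mm/d for 20 days, 9787.2 mm in total
print(sum(mild), sum(extreme))
```

A full forcing set would pair each of the six scale factors with each duration, then prepend the constant-intensity spin-up period described below.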
It is also important to determine to what extent results depend on initial conditions. To this end, the forcing data are preceded by a 500 d spin-up period with one of three constant precipitation intensities, i.e., 2.5, 5, or 10 mm d⁻¹. Each spin-up intensity is used with each of the above-described precipitation events. The goal of using these three intensities is not necessarily to simulate initial conditions which are characteristic of three distinct climatologies but rather to establish a broad range of initial conditions in terms of percent of maximum storage. Because we incorporate a broad variety of model structures and parameter sets, the total storage in the model at the end of the spin-up period spans a variety of values, ranging from nearly empty (less than 1 % full) to approximately 70 % full. From least to most intense spin-up periods, median storages expressed as a percent of maximum storage at the end of the spin-up period are 21 %, 25 %, and 28 %, respectively. Comparing results from three different spin-up precipitation intensities allows for a systematic method by which to investigate initial conditions and their effect on the relative numerical errors of various numerical methods.

Figure 3. Synthetic IDF (intensity, duration, and frequency) curves along which synthetic forcing data were generated, where the vertical black lines show the precipitation event durations used. The IDF curves are interpolated between world record events for given durations, scaled by six different factors, ranging from 0.01 to 1.2, indicated by color. The precipitation world records are World Meteorological Organization data (World Meteorological Organization, 1994) and are indicated by orange points. The power function giving the best fit for the world records is shown. Black points are real events which are described here in order to give further context for the IDF curves. (a) The largest precipitation ever recorded in the Netherlands in a 24 h period (Brauer et al., 2011).
The synthetic forcing data sets are completed with potential evapotranspiration (ET) and temperature data. For all forcing data sets, constant potential ET values of 2 mm d −1 and temperatures of 10 °C are used. In most cases, the ET flux is small relative to the precipitation flux.

Method for determination of numerical error
For a fixed model structure (out of five), initial condition (out of three levels), parameter set (out of 20), and forcing data set (out of 18), there are nine instantaneous discharge hydrograph runs, corresponding to the eight tested numerical methods and the benchmark method. The benchmark method is the near-exact solution against which all other methods are compared. It is generated with the most sophisticated numerical technique available in FUSE, adaptive implicit Heun, with error tolerances 1000 times smaller than the defaults (τ r = 10 −5 ; τ a = 10 −5 mm for the benchmark). A desirable property of the benchmark is that it yields approximate solutions that are much closer to the exact solutions than those resulting from any other method. To ensure this, the hydrograph resulting from the benchmark method, but with 10 times larger error tolerances, was compared to the discharge hydrographs from all other methods. It was found that the larger tolerance benchmark had lower error (when compared to the real benchmark) than any other method more than 99 % of the time.
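The benchmark adequacy check described above amounts to counting, across runs, how often the loose-tolerance benchmark is closer to the true benchmark than every other method. A minimal sketch, with a hypothetical data layout:

```python
def fraction_benchmark_tightest(runs):
    """runs: list of (rmse_loose_benchmark, [rmse of each other method]),
    all measured against the true (tight-tolerance) benchmark.
    Returns the fraction of runs in which the loose-tolerance benchmark
    beats every other tested method (the >99 % check in the text)."""
    hits = sum(all(lb < r for r in others) for lb, others in runs)
    return hits / len(runs)
```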
The error for any tested run is determined by calculating the root mean square error (RMSE; millimeters per day, hereafter mm d −1 ) and normalized root mean square error (NRMSE; percent) with respect to the benchmark run. The RMSE is given by the following:

$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\hat{q}_t - q_t\right)^2},$$

where T is the total time evaluated in days, \hat{q}_t is the near-exact discharge resulting from the benchmark method at time t, and q_t is the discharge resulting from the numerical method being assessed at time t. The NRMSE is simply the RMSE normalized by the mean discharge of the associated benchmark run and then multiplied by 100 to become a percent. The time T for which the error is evaluated is 7 d for 5 d precipitation events, 14 d for 10 d precipitation events, and 28 d for 20 d precipitation events, always beginning on the first day of rainfall after the spin-up period. This is a somewhat arbitrary albeit valid period on which to evaluate hydrograph performance; it captures all or part of the rising limb, the crest segment, and the falling limb. Figure 4 shows example discharge hydrographs resulting from the fixed step explicit Heun method and the benchmark method. The total number of times the system of differential equations composing a model must be solved, or, alternatively, the number of system flux evaluations, is also recorded and represents the computational expense of generating the hydrograph. RMSE and NRMSE are both included in this experiment because they show errors in two different yet valid ways, namely in the original units of the discharge and as a relative measure of the near-exact discharge, respectively. RMSE is useful in that its units (mm d −1 ) are easy to interpret, directly showing the numerical error in the daily discharge, while the benefit of NRMSE is that it provides context for the error. These definitions of error are purely numerical and, therefore, differ from fidelity as described in Clark and Kavetski (2010), which further incorporates real discharge observations.
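The two error measures can be computed directly from the benchmark and tested hydrographs; a minimal sketch:

```python
import math

def rmse(q_bench, q_test):
    """Root mean square error (mm/d) of a tested hydrograph against
    the near-exact benchmark hydrograph over T daily values."""
    t = len(q_bench)
    return math.sqrt(sum((b - q) ** 2 for b, q in zip(q_bench, q_test)) / t)

def nrmse(q_bench, q_test):
    """RMSE normalized by the mean benchmark discharge, in percent."""
    mean_q = sum(q_bench) / len(q_bench)
    return 100.0 * rmse(q_bench, q_test) / mean_q
```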

Literature review
A small literature review of available numerical strategies was conducted on 12 hydrological modeling codes. This included seven off-the-shelf models, which are the objects of study in Addor and Melsen (2019), and five popular MMFs. The reviewed modeling codes are shown in Appendix A. Information is gathered on whether recently updated versions of each modeling code are capable of sequential or simultaneous solving, the available orders of numerical solver, the implicit or explicit (or other) nature of the numerical solver, and whether or not adaptive substepping is available. When solvers were found to have the option of simultaneous solving, all available numerical options were recorded. When solvers were found to only have sequential options (allowing for different numerical options per individual flux), the modeling code is considered to be restricted to the least sophisticated method specified among the fluxes encountered (specifically, we consider explicit methods to be less sophisticated than implicit methods, fixed step less sophisticated than adaptive, and lower order less sophisticated than higher order). This reflects our assumption that a numerically erroneous component of the model will produce errors which propagate to other components of the model; in other words, numerical accuracy is only as good as the most erroneous part of a model. This review has the objective of approximately placing other models or MMFs in the context of this experiment, which sheds light on what kinds of numerical errors might be expected as a result of using popular modeling codes.

Figure 4. Example hyetographs and discharge hydrographs, including results from the fixed step explicit Heun method and the benchmark method for 5, 10, and 20 d precipitation events. The vertical orange lines are the differences between the two hydrographs and represent the numerical error. From the shortest to the longest duration, the RMSEs are 0.55, 0.14, and 0.10 mm d −1 . The precipitation forcing data come from the IDF curve scaled at 2.5 % of the historical maxima interpolation. These hydrographs all result from the same structure (FUSE 092), the same parameter set, and the same spin-up precipitation intensity (5 mm d −1 ). The last day of the spin-up period is shown.
We restrict the sampled models to those examined by Addor and Melsen (2019) for two reasons. First, Addor and Melsen (2019) chose these models based on their popularity rather than their numerical techniques. In this way, we do not cherry-pick models which either support or refute the premise that sophisticated numerical methods are common in current hydrologic models. Second, we limit the number of reviewed modeling codes because compiling a list of all current hydrologic models and their numerical methods would provide a more accurate description of which numerical methods are widely used but represents a much larger project.

Numerical error and precipitation intensity for 5 d events

Figure 5 details the evolution of numerical errors for the tested numerical techniques for 5 d events of increasing precipitation intensity, for 20 parameter sets, all model structures, and all initial conditions. It is immediately apparent that, for all methods, median RMSE increases with increasing precipitation intensity, and second-order adaptive methods usually outperform fixed step methods (in some cases by a few orders of magnitude). The following two paragraphs explore trends in RMSE, shown in the top panel of Fig. 5, from left to right. For lower precipitation intensities, adaptive Heun methods usually outperform the other methods. The adaptive semi-implicit Euler method yielded more error than the other adaptive methods, putting this method on par with the second-order fixed step methods. The relatively low performance of adaptive semi-implicit Euler is likely due to the fact that the Heun methods adapt their step sizes based on comparisons between first- and second-order results, while the semi-implicit method adapts its step size based on a comparison between two first-order results. Here, all adaptive methods use the same default error tolerances; the effects of changing these are discussed later.

Among fixed step methods, for lower precipitation intensities, first-order methods (the Euler methods) are on average outperformed by second-order methods (the Heun methods). Apparently, the extra flux evaluations required by second-order methods are able to reduce error in these low-intensity cases.
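The low-intensity advantage of second-order fixed step methods can be reproduced on a toy linear bucket dS/dt = P − kS, which has a closed-form solution to compare against. The parameter values are invented for illustration; this is not a FUSE structure:

```python
import math

def step_euler(s, p, k, dt):
    """One explicit Euler (first-order) step of dS/dt = p - k*S."""
    return s + dt * (p - k * s)

def step_heun(s, p, k, dt):
    """One explicit Heun (second-order predictor-corrector) step."""
    f1 = p - k * s
    s_pred = s + dt * f1          # Euler predictor
    f2 = p - k * s_pred           # slope re-evaluated at the prediction
    return s + dt * 0.5 * (f1 + f2)

def run(step, s0=0.0, p=2.0, k=0.5, days=5, dt=1.0):
    s = s0
    for _ in range(int(days / dt)):
        s = step(s, p, k, dt)
    return s

# Closed-form solution: S(t) = (p/k) * (1 - exp(-k*t)) for S(0) = 0.
exact = (2.0 / 0.5) * (1.0 - math.exp(-0.5 * 5))
err_euler = abs(run(step_euler) - exact)
err_heun = abs(run(step_heun) - exact)
```

For these mild dynamics, the extra flux evaluation per step buys a clearly smaller truncation error, matching the low-intensity result above.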
As precipitation intensity increases, it is clear that adaptive methods outperform fixed step methods, where the adaptive semi-implicit Euler method begins to yield lower error than fixed step methods. For fixed step methods with higher precipitation intensity, it is no longer the case that second-order methods outperform first-order methods. Instead, implicit or semi-implicit methods outperform explicit methods; the fixed step explicit methods yield median errors above 70 mm d −1 for the two highest precipitation intensities. This indicates that instabilities contribute more significantly to numerical error as precipitation intensity increases, as implicit methods are unconditionally stable (Jameson and Turkel, 1981), and the semi-implicit method constrains instabilities more readily than the explicit methods (Kavetski et al., 2002). Furthermore, because fixed step explicit Heun clearly outperforms explicit Euler for low precipitation intensities but not for high precipitation intensities, and because fixed step implicit Heun and implicit Euler have this same property, we can conclude that higher-order truncation error is more sensitive to increasing precipitation intensity than first-order truncation error at the daily time step. This suggests that simply selecting an arbitrarily higher-order fixed step method will not always constrain error, and in some cases it might make the error worse. Evidently, adaptive methods are most suitable for constraining numerical errors.
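The instability of fixed step explicit methods under intense forcing can be demonstrated on the same kind of toy linear bucket: once k·Δt exceeds 2, explicit Euler oscillates and diverges, while implicit Euler, being unconditionally stable, converges to the equilibrium P/k. The parameter values are invented for illustration:

```python
def explicit_euler(s, p, k, dt):
    """Explicit Euler step: uses the flux at the start of the step."""
    return s + dt * (p - k * s)

def implicit_euler(s, p, k, dt):
    # Solve s_new = s + dt*(p - k*s_new); the ODE is linear, so the
    # implicit equation has the closed-form solution below.
    return (s + dt * p) / (1.0 + dt * k)

def simulate(step, s0=0.0, p=100.0, k=3.0, days=5, dt=1.0):
    s, path = s0, []
    for _ in range(days):
        s = step(s, p, k, dt)
        path.append(s)
    return path

exp_path = simulate(explicit_euler)   # k*dt = 3 > 2: oscillates, diverges
imp_path = simulate(implicit_euler)   # converges toward p/k = 100/3
```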
In the bottom panel of Fig. 5, the NRMSE is shown. All of the above-identified trends can be found here as well, except that the median error no longer always increases with increasing precipitation intensity. With many (especially adaptive) methods, NRMSE decreases with increasing precipitation intensity. This is because the RMSE usually increases, but not as fast as the discharge by which the RMSE is normalized. When this analysis was repeated with the Kling-Gupta efficiency (Gupta et al., 2009), the Nash-Sutcliffe efficiency (Nash and Sutcliffe, 1970), and the normalized error in maximum discharge, trends in performance and precipitation intensity similar to those for NRMSE were obtained.
To put the errors introduced by numerics into perspective, we compare them to other sources of error in hydrological modeling. Uncertainty in discharge measurements depends on the employed method, e.g., whether the observation is based on a rating curve or an acoustic Doppler current profiler (ADCP; McMillan et al., 2017). Using a rating curve, Westerberg et al. (2011) observed, for a specific case, a 20 % error for medium and high flows due to nonstationarity of the channel, and McMillan et al. (2012) estimate discharge measurement uncertainty of up to 40 % for medium to high flows. Estimating the error in precipitation observations is more challenging because it depends not only on the measurement device but also on the spatial representativeness of the measurement. Wood et al. (2000) estimate a 50 % error in rainfall observations when comparing radar and tipping buckets. When employing a fixed step explicit Euler method, numerical errors are of the same order as discharge and precipitation measurement errors. Observation errors in precipitation and in discharge both show heteroscedasticity, i.e., the error increases with an increasing value of the variable. This study shows that the numerical error increases along with increased precipitation values (Fig. 5; upper panel) and, for some numerical methods, also increases along with discharge values (Fig. 5; lower panel).
The effect of measurement uncertainty in forcing and discharge observations on parameter and model structure inference has already been explored in the literature (Kavetski et al., 2006a, b; Vrugt et al., 2008). This study shows that numerical errors, being of the same order of magnitude, can also hamper parameter identification, model structure identification, uncertainty estimation, or, in short, the testing of hydrologic theory.

Robustness of results
The previous subsection makes a variety of claims relating to numerical error and method choice. These rely on a ranking of numerical methods, based on numerical error, that differs between low- and high-intensity precipitation events. For low-intensity events, we state that the second-order adaptive methods form the group with the lowest error, the adaptive semi-implicit Euler method and the fixed step second-order methods form the group with the second lowest error, and the fixed step first-order methods form the group with the largest error. For high-intensity events, we state that the adaptive methods form the group with the lowest error, fixed step implicit and semi-implicit methods form the group with the second lowest error, and fixed step explicit methods form the group with the highest error. While these groupings are clearly supported by the median errors in RMSE and NRMSE (see Fig. 5), it could be that various model dimensions sometimes interact in such a way that these groupings are not observed. In this subsection, we determine how often these groupings are observed in the above-ranked orders over the chosen modeling dimensions. In short, we find that the above-described rankings are robust over the majority of the tested dimensional space.
When spin-up precipitation intensity, forcing data set, model structure, and parameter set are all fixed to a single choice, there are eight model runs which have their error calculated, corresponding to the eight tested numerical methods. For each individual set of eight hydrographs, for the most and least intense 5 d precipitation intensities, methods were given a rank of 1 to 8, based on RMSE, where 1 indicates the lowest RMSE and 8 indicates the highest. Any ranking with ties was discarded, which occurred about 20 % of the time for the lowest intensity and did not occur for the highest intensity. The numerical techniques were sorted into one of three groups based on their rank. For the lowest intensity, the two lowest error methods occupy ranked group 1, the next three methods occupy ranked group 2, and the three most erroneous methods occupy ranked group 3; for the highest intensity, the three lowest error methods occupy ranked group 1, the next three methods occupy ranked group 2, and the two most erroneous methods occupy ranked group 3. Then, the composition of each ranked group is reported. When a ranked group is mostly composed of a single selection of methods, these methods can be said to have the rank corresponding to the ranked group over most of the tested dimensional space. This is indicated by the predominance of a single color for a given ranked group in Fig. 6.
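This grouping procedure can be sketched as follows; the data layout (a dict of RMSE per method for each set of eight runs) is hypothetical:

```python
from collections import Counter

def rank_groups(rmse_by_method, group_sizes):
    """Rank methods by RMSE (1 = lowest) and partition them into
    consecutive groups of the given sizes. Returns None when any two
    RMSEs tie, mirroring the discarding of tied rankings."""
    if len(set(rmse_by_method.values())) < len(rmse_by_method):
        return None
    ordered = sorted(rmse_by_method, key=rmse_by_method.get)
    groups, i = [], 0
    for size in group_sizes:
        groups.append(frozenset(ordered[i:i + size]))
        i += size
    return tuple(groups)

def group_composition(runs, group_sizes, group_idx):
    """Across runs, the frequency with which each method lands in
    ranked group `group_idx` (ties discarded)."""
    counts, n = Counter(), 0
    for run in runs:
        g = rank_groups(run, group_sizes)
        if g is None:
            continue
        n += 1
        counts.update(g[group_idx])
    return {m: c / n for m, c in counts.items()}
```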
As a concrete example, consider the ranking wherein adaptive second-order methods generally outperform other methods for the least intense 5 d precipitation events. This is reliant upon the adaptive Heun methods occurring often in the ranked group with the lowest error (ranked group 1), given the least intense 5 d precipitation data. To determine how often this is the case, each numerical method in each set of eight runs was given a ranking based on RMSE, where 1 is low error and 8 is high error, and rankings with ties were discarded. Whenever the second-order adaptive methods had either rank 1 or 2 in an individual set of eight runs, they were placed into ranked group 1. As can be seen in Fig. 6, on average, adaptive explicit Heun and adaptive implicit Heun occur in ranked group 1 at a frequency of 74 % across the tested dimensional space for the lowest precipitation intensity for 5 d events, meaning it is usually the case that these two methods as a group have the lowest error for the lowest precipitation intensity for 5 d events. This indicates that the claim "adaptive second-order methods outperform other methods for the least intense 5 d precipitation events" is valid over the majority of the dimensional space tested in Fig. 6 (i.e., it is valid over a variety of choices in model structure, parameter set choice, and initial conditions). Note that there are different rankings for the lowest and highest precipitation intensities, as some numerical methods react very differently to increasing precipitation intensity (described in the previous subsection).
Because the frequency with which methods occurred in their desired groups is generally high, we can be confident that the trends in relative method performance in the previous subsection are usually true over the tested model structures, parameter sets, and initial conditions. The fraction of each ranked group composed of its dominant methods always changed by less than 2 % as a result of changing the initial conditions (see Fig. 6). We can conclude that the rankings are relatively insensitive to initial conditions. We determined that the ranking is robust over different model structures as well; the desired ranking was usually observed a high percentage of the time when individual model structures were assessed. However, we found that the model structures containing interflow process representations (namely, FUSE 536, FUSE 550, and FUSE 330) were less likely than structures without interflow to produce the dominant ranking of Fig. 6 for both the lowest and highest precipitation intensity for 5 d events. The most common deviations from the most common ranking for the lowest precipitation intensity occur due to fixed step implicit Heun outperforming adaptive explicit Heun or adaptive semi-implicit Euler performing approximately as well as the first-order fixed step methods. In order to determine if the rankings were sensitive to the number of parameter sets used, hydrographs were further generated over 80 parameter sets, where the spin-up precipitation intensity was set to 2.5 mm d −1 , with no change in the other dimensions. Then, the rankings were compared between the two cases where 80 or 20 parameter sets were used and the spin-up precipitation intensity was 2.5 mm d −1 . Because the frequency with which the required ranking is observed changed very little between the two cases (with the largest change for any grouping being 4.9 %), it is evident that the ranking is fairly insensitive to the number of parameter sets used. This indicates that enough parameter sets were used to establish generality over the parameter space. The hydrographs generated over 80 parameter sets were not used beyond investigating ranking robustness.

Figure 6. The robustness of rankings of numerical techniques in RMSE for high and low precipitation intensity events for three different initial conditions. A ranking is more robust when a selection of numerical methods is prevalent in a given ranked group, indicated by the predominance of a single color in a ranked group. All adaptive methods here use FUSE default error tolerances. These results incorporate hydrographs generated across five model structures and 20 parameter sets. Different rankings are observed for high and low precipitation intensities.
The same procedure was used to evaluate rankings in computational expense, using number of system flux evaluations rather than RMSE, because further analysis involving computational expense should also be robust over the tested model dimensional space. A total of five distinct ranked groups were discovered. From lowest to highest expense, these groups include the following: This ranking appeared to be common for both low and high precipitation intensity cases. In the style of Fig. 6, and over the same model dimensions, the dominant selection of method(s) never occurred in any ranked group less than 73 % of the time. This indicates that these rankings are robust over a variety of choices in initial condition, parameters, and model structure. Changing either number of parameter sets or initial conditions also always yielded a change in frequency of less than 12 %, demonstrating the insensitivity of computational expense ranking to initial conditions or number of parameter sets used. Based on the results of this subsection, we conclude that the impact of model structure, different parameter sets, and initial conditions on the ranking of numerical methods based on numerical error or computational expense is limited.

Impact of event duration on error
It was found that longer-duration events have lower associated RMSE, where the median RMSE across all methods decreases with increasing duration. This was analyzed across 20 parameter sets, all five model structures, and for all three spin-up precipitation intensities. The average reduction in median RMSE across all methods between 5 and 20 d events was 74 % for the least intense scaling factor (1 % of interpolated world record average intensity) and 54 % for the most intense scaling factor (120 % of interpolated world record average intensity), where the median RMSE of all individual methods decreased monotonically with increasing precipitation duration. This is, to some extent, a mathematical artifact; because the rainfall signal is flat, a larger portion of a longer time series is closer to an equilibrium discharge. This reduces the average error for the hydrograph. In order to test the extent to which RMSE depends on the geometry of the rainfall signal, the experiment was repeated with new forcing data based on CAMELS precipitation data observed during Hurricane Katrina (Newman et al., 2015). Original forcing data were scaled such that the total precipitation per event is consistent with the rest of the experiment, but relative daily precipitation intensities reflect real data. Under these conditions, the same dominant compositions of ranked groups are observed (and in fact become slightly more robust), and the trend of increasing median RMSE with increasing precipitation intensity is still clear. The decrease in error with increasing duration is still present, though less pronounced, with variable rainfall signals. In this case, the average reduction in median RMSE across all methods between 5 and 20 d events was 69 % for the least intense scaling factor (1 % of interpolated world record total precipitation) and 35 % for the most intense scaling factor (120 % of interpolated world record total precipitation). 
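The rescaling of the Katrina-based forcing, so that event totals match the flat events while the relative day-to-day shape is preserved, can be sketched as:

```python
def rescale_hyetograph(daily_p, target_total):
    """Scale an observed daily precipitation series (mm/d) so that it
    sums to `target_total` mm while keeping its relative day-to-day
    intensities; the input series is a hypothetical observed record."""
    observed_total = sum(daily_p)
    return [p * target_total / observed_total for p in daily_p]
```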
In either the case of the flat or variable rainfall signal, increasing the duration along an IDF curve necessarily means reducing the daily intensity of the event. Therefore, the reduction in numerical error due to increasing duration might simply be due to the lower daily intensity per event of longer events. Nonetheless, longer-duration events along an IDF curve do represent more total rainfall per event, and it appears as if models yield less error for longer-duration events along an IDF curve (not shown).

Impact of rainfall signal geometry on error
Here, we present the errors which arise when the experiment is repeated with a variable rainfall signal based on Hurricane Katrina (Newman et al., 2015; see the previous subsection). Magnitudes of median errors are generally larger with the variable intensity precipitation data but tend to be of the same order of magnitude as in the cases without variance; because a larger variance in the rainfall signal can increase numerical errors, we can expect that numerical errors in practice might be somewhat larger than reported in the previous subsection. Numerical errors resulting from the 5 d forcing data based on Hurricane Katrina can be found in Fig. 7. It is also interesting to note that precipitation intensities greater than the 1 d world record event were contained within the variable intensity 5 d forcing data. Because these data generally yield larger errors than the flat rainfall signals, we can conclude that shorter-duration events incorporating larger intensities can produce larger numerical errors. This further establishes the relationship between numerical error and precipitation intensity and especially underscores the importance of numerical technique choice for intense, short-duration events.

Computational efficiency
So far, this study makes assessments of numerical errors associated with various numerical techniques. As a result, and with default error tolerances, it seems as if adaptive second-order methods tend to yield the lowest numerical error. Low error is desirable, but computational expense can also be a decisive factor when selecting a numerical method. Figure 8 shows the computational efficiency of the tested numerical techniques. More efficient methods produce a relatively low median numerical error and take a relatively small median number of flux evaluations to achieve it, i.e., they occur closer to the leftmost gray line. Mathematically, we can also consider high efficiency to be achieved when the product of the number of function calls and the numerical error is minimized. We explore multiple error tolerances for the adaptive semi-implicit Euler method. These are examined in order to determine how efficient the adaptive semi-implicit method is when it yields approximately as much error as the adaptive Heun methods with default error tolerances.
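Reading efficiency as the product of median error and median cost, the selection of the most efficient method can be sketched with invented numbers; the values below are illustrative and are not results from the experiment:

```python
def efficiency_score(median_rmse, median_flux_evals):
    """Lower is better: the product of error and computational cost,
    matching the 'distance from the leftmost gray line' reading of Fig. 8."""
    return median_rmse * median_flux_evals

def most_efficient(methods):
    """methods: {name: (median_rmse, median_flux_evals)}; returns the
    name of the method minimizing the efficiency score."""
    return min(methods, key=lambda m: efficiency_score(*methods[m]))

# Illustrative (invented) numbers, not results from the experiment:
toy = {
    "adaptive explicit Heun": (0.01, 300),
    "adaptive implicit Heun": (0.01, 3000),
    "fixed explicit Euler":   (5.0, 100),
}
```

A method with a slightly higher cost but an error several orders of magnitude lower still wins under this score, which is why the adaptive second-order methods dominate Fig. 8.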
Because each precipitation intensity is shown for 5 d events, it is clear that the efficiency of each method tends to decrease with increasing precipitation intensity. Arbitrarily small errors can be achieved with adaptive methods by selecting smaller error tolerances. The semi-implicit method with the smallest error tolerance has the lowest error of any tested numerical method, though this comes at a large computational cost. Adjusting the error tolerances of the adaptive semi-implicit method does not seem to change its efficiency. The central feature of Fig. 8 is that the adaptive explicit Heun method emerges as a clear leader in terms of efficiency, significantly outperforming all other methods for any precipitation intensity. This was also established for the longer-duration events (not shown).
It is possible to examine individual sets of eight runs, rather than median errors and expenses, to determine how often the adaptive explicit Heun method has the highest efficiency. We find that the adaptive explicit Heun method is the most efficient method among those tested 85 % of the time, on average over all 18 forcing data sets, three initial condition levels, 20 parameter sets, and five model structures. When model runs were grouped based on the percentage of world record precipitation intensity, it was found that the percentage of the model runs for which the adaptive explicit Heun method was most efficient ranged between 80.3 % (when precipitation intensities were at 1 % of historical maxima) and

Figure 8. Relationship between computational efficiency and the numerical error, where methods are more efficient if they have a low RMSE and a low number of flux evaluations. All medians are calculated across all five model structures, 20 parameter sets, and three spin-up precipitation intensities. All six precipitation intensities for 5 d events are shown, where the light gray dot represents the lowest intensity event, darker gray dots represent increasingly intense precipitation events, and the black triangle represents the highest intensity event. Unless otherwise stated, adaptive methods use the FUSE default tolerances. Purple lines depict adaptive semi-implicit Euler methods with error tolerances below the FUSE default, where the brightest purple method has the smallest error tolerance. Gray lines are visual aids of the form y = 10 n x , where n is an integer, and x and y, respectively, represent RMSE and number of flux evaluations (gray lines do not represent real data and are used only for visualization purposes). Moving between adjacent gray lines represents an order of magnitude change in efficiency.

Numerical choices control numerical error more than structural choices
Earlier, numerical error was defined as the difference between the exact and the approximate solution to the set of equations composing the model. While numerical error ultimately comes from numerical method choice, the same method in a different model structural context might produce different error magnitudes, so model structure might contribute to numerical error. A basic analysis of variance (ANOVA) was conducted on NRMSE for all model runs using 20 parameter sets, all three spin-up precipitation intensities, and all 18 forcing data sets. The p values for groupings based on numerical method choice and structural choice were calculated. The p values for groupings based on the numerical method were always less than 1 × 10 −20 , demonstrating that numerical method choice strongly controls NRMSE. The p values for groupings based on structure were below 0.05 for zero of the six 5 d forcing data sets, two of the six 10 d forcing data sets, and three of the six 20 d forcing data sets. The p values for numerical groupings were always lower than the p values for structural groupings. This demonstrates that numerical choices control numerical error more strongly than model structural choices, and structural choices seem to become more important in controlling numerical error for longer-duration events.

Figure 9. Grouping the reviewed modeling codes (12 in total) into three categories, based on available numerical techniques. Category C codes are the most common.
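The ANOVA comparison can be sketched by computing the one-way F statistic for NRMSE values grouped by numerical method (or, alternatively, by structure); in practice the p values would come from the F distribution (e.g., via scipy.stats.f_oneway), but a pure-Python F statistic suffices to illustrate the computation:

```python
def f_statistic(groups):
    """One-way ANOVA F statistic: ratio of between-group to within-group
    mean squares. `groups` is a list of lists of NRMSE values, e.g.,
    one inner list per numerical method (layout hypothetical)."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total sample size
    grand = sum(sum(g) for g in groups) / n  # grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F (and hence a tiny p value) for the method-based grouping, as found here, indicates that the choice of numerical method explains far more of the NRMSE variance than the choice of structure.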

Literature review results
Finally, a short literature review was conducted to place the results of this study in context. The surveyed hydrological modeling codes were lumped into three categories (shown in Fig. 9). Presently, we discuss numerical errors that might result from each category.
Of the 12 reviewed codes, one belongs to category A, three belong to category B, and eight belong to category C; the highest order numerical method found in any code was second order. The codes of categories A and B, both having the option of adaptive substepping, can guarantee any level of numerical accuracy with sufficiently small error tolerances. The difference is that the numerical techniques of category B did not include higher-order (second order or higher) adaptive explicit methods and so are not identified as being optimally efficient, according to Fig. 8.
The codes of category C have fixed step solvers. These codes can be further subdivided into more categories, based on the ability or inability to specify a fixed time step, sequential or simultaneous solving, the order of the numerical method, or the implicit or explicit (or other) nature of the numerical methods available. However, the fixed time step causes either an extremely large computational expense when the time step can be arbitrarily set by the user to constrain error (Press and Teukolsky, 1992) or relatively large numerical inaccuracies when the time step cannot be adjusted by the user (as shown by this experiment, at least for the daily time step).
The most common method of category C is closely analogous to the fixed step explicit Euler method, albeit in sequential rather than simultaneous form. If one is willing to accept the flat rainfall signal and chosen dimensional space of Fig. 5 as physically possible, and to accept that FUSE can at least approximately mimic other codes, one could tentatively estimate that category C codes could often yield large median numerical errors from the perspective of RMSE or NRMSE. Specifically, the fixed step explicit Euler method yields a median NRMSE of more than 4 % for 5 d events where the precipitation intensity is 2.5 % of the average intensity for the 5 d historical maximum. This is already fairly large, but the median NRMSE for this method can become much larger; by the time precipitation is at 25 % of the world record, the median NRMSE is over 30 %. Similar trends can be found for longer-duration events, where median errors in RMSE and NRMSE for this method increase with increasing intensity. For 10 and 20 d events, the median NRMSE resulting from the explicit Euler method is maximized at 29 % and 23 %, respectively. When judging a numerical technique by its extremes in error rather than its average, the fixed step explicit Euler method performs poorly; in this experiment, extremes in NRMSE when using this numerical technique easily exceed 100 % for any forcing data set. These results indicate that the most frequently used numerical technique that we encountered in our short survey performs relatively poorly for any precipitation event, and especially poorly for (increasingly common) extreme precipitation events. Though the dimensional space tested in this experiment is reasonably large (incorporating various model structures, parameter sets, initial conditions, forcing data, and numerical techniques), it does not cover all possible behaviors of its dimensions.
There are model structures in common use outside of the five tested, especially considering that we do not incorporate processes related to freezing or snow. Additionally, we do not use an energy balance; instead, FUSE generates bucket-style hydrological models that only solve a mass balance. Because land surface models incorporate processes involving energy, these might have a different (perhaps more complicated) relationship between numerical error and precipitation extremeness. Furthermore, this experiment incorporates multiple FUSE members (model structures) for generality of results regarding numerical error over multiple structural choices. Our analysis does not extend to determining the extent to which individual physical processes are responsible for numerical error; this presents a clear opportunity for substantial future work. It is important to note that all models generated by FUSE are composed of systems of nonlinear first-order ordinary differential equations; these could have different error characteristics than partial differential equations (such as the Richards equation or shallow water equations) or higher-order differential equations. Next, even higher-order or differently implemented low-order numerical methods, more initial conditions, less intense precipitation, or longer or shorter precipitation durations remain untested. Still, the results were shown to be rather robust across the fairly broad tested dimensional space, which indicates that extrapolation of the results to other precipitation durations and intensities might be justified. All rainfall signals in this experiment are either constant with respect to time or are based on a single real event with variance.
As such, we do not account for a broad variety of precipitation regimes in terms of variable intensity, which might introduce complexity that is not accounted for in this experiment (Müller-Thomy and Sikorska-Senoner, 2019). However, similar results were obtained with different rainfall signal geometries, indicating that the results might be robust across rainfall geometries. Furthermore, the trends relating numerical error resulting from a numerical method and precipitation extremeness appeared to be rather insensitive to the other tested dimensions, providing robustness to the conclusions.
Because we used a Latin hypercube approach in generating parameter sets and demonstrated that using a larger number of parameter sets did not significantly change the results, we can conclude that we have sufficiently sampled the parameter space. However, in practice, it is likely that most parameters obey a given distribution, which would limit the parameter space to physically likely catchments. By sampling broadly rather than using calibrated parameter sets, we do not necessarily represent catchments that are commonly observed. Both approaches have merit; using only calibrated parameter sets might provide a greater assurance that results arise from only physically likely catchments but could bias the results via parameter restriction, while a broad sampling of the parameter space gives more weight to catchments that might be underrepresented but might not proportionately represent average or common physical settings. Nonetheless, a large variety of parameter combinations is used in Monte Carlo optimization methods regardless of physical realism, which further makes the investigation of numerical error resulting from a broad sampling of parameters worthwhile. While this experiment uses a broad sampling of the parameter space, using only calibrated parameter sets is also a defensible choice, although there is a risk of interaction between numerical error and calibration optimization.
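The Latin hypercube idea, sampling each equal-probability stratum of every parameter range exactly once, can be sketched with the standard library alone (the three parameter names and bounds below are hypothetical illustrations, not FUSE's actual parameters):

```python
import random

def latin_hypercube(n, bounds, seed=0):
    """Draw n parameter sets with Latin hypercube sampling (LHS).

    Every one of the n equal-probability strata of each parameter range
    is sampled exactly once, giving better coverage than plain random
    sampling for the same number of model runs.
    """
    rng = random.Random(seed)
    # one random point per stratum, per dimension, on the unit interval
    columns = [[(i + rng.random()) / n for i in range(n)] for _ in bounds]
    for column in columns:
        rng.shuffle(column)              # decouple strata across dimensions
    # scale unit-interval samples to the parameter bounds
    return [
        [lo + u * (hi - lo) for u, (lo, hi) in zip(row, bounds)]
        for row in zip(*columns)
    ]

# hypothetical bounds for three conceptual-model parameters
bounds = [(10.0, 500.0),   # storage capacity [mm]
          (0.01, 1.0),     # recession coefficient [1/d]
          (1.0, 5.0)]      # nonlinearity exponent [-]
sets = latin_hypercube(20, bounds)
print(sets[0])
```

The shuffle step is what distinguishes LHS from a simple stratified grid: strata remain exhaustively covered per dimension while their combinations across dimensions are randomized.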

Choices in temporal and spatial discretization
In this experiment, we examine the daily time step because it is commonly used with hydrologic models. However, hydrologic (especially flood) modeling is often carried out at a finer temporal resolution, e.g., hourly (Boithias et al., 2017; Ficchì et al., 2016). If this experiment were repeated with an hourly time step instead of a daily time step, one might expect similar trends in terms of the evolution of numerical error with respect to precipitation intensity and duration, simply with smaller magnitudes of error. This is supported by the theory described in Sect. 2, wherein numerical error is proportional to the time step raised to some integer power. With a smaller time step, one could expect smaller numerical errors, especially for higher-order numerical methods as opposed to lower-order numerical methods (with higher-order numerical methods, numerical error is proportional to the time step raised to a larger integer). Note that this speculative extension of results for a smaller time step also assumes a uniform precipitation intensity. It is possible that, when using real forcing data, rainfall that is locally intense at an hourly temporal resolution could be smoothed out if resampled to the daily temporal resolution. This implies that, with real hourly forcing data, it is possible that a large precipitation intensity which lasts for a short time can produce significant numerical error for a fraction of 1 d; this numerical error might be lessened if the forcing data were aggregated to the daily resolution. Still, the theory presented in Sect. 2 implies that, from a mathematical perspective, one would expect smaller numerical errors when using a smaller time step.
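The proportionality between time step and error can be checked directly on a toy linear reservoir with a known analytic solution (a sketch with hypothetical parameter values, not one of the FUSE models): halving the step should roughly halve the global error of the first-order Euler method and roughly quarter that of the second-order Heun method.

```python
import math

def integrate(method, k, S0, T, n):
    """Fixed step integration of dS/dt = -k*S over [0, T] with n steps."""
    dt, S = T / n, S0
    for _ in range(n):
        f0 = -k * S
        if method == "euler":               # first-order explicit Euler
            S += dt * f0
        else:                               # second-order explicit Heun
            f1 = -k * (S + dt * f0)         # flux at the Euler predictor
            S += 0.5 * dt * (f0 + f1)
    return S

k, S0, T = 0.8, 10.0, 1.0
exact = S0 * math.exp(-k * T)               # analytic solution at t = T
ratios = {}
for method in ("euler", "heun"):
    err_coarse = abs(integrate(method, k, S0, T, 10) - exact)
    err_fine = abs(integrate(method, k, S0, T, 20) - exact)
    ratios[method] = err_coarse / err_fine  # approaches 2**order as dt shrinks
    print(method, round(ratios[method], 2))
```

The ratios come out near 2 for Euler and near 4 for Heun, matching the statement that higher-order methods benefit more from a smaller time step.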
All models in this experiment are lumped rather than distributed and, therefore, do not contain numerical error related to choices in spatial discretization. In a distributed model, numerical error is proportional to the chosen spatial resolution raised to some power and to the time step raised to some power. This theoretically implies that numerical error could potentially increase in terms of maximum value in the context of distributed models under spatially and temporally uniform precipitation. However, a fully distributed model (with real, distributed forcing data) could have its local extremes in space smoothed out given the choice of spatial discretization, while this option is not available for lumped conceptual models. More generally, it is possible for numerical errors due to spatial and temporal discretizations to interact. Therefore, our results are not directly applicable to distributed models. How large the net numerical error is in case of a spatially explicit model, and what controls the magnitude of the errors involved, are interesting subjects for future research. Nonetheless, we tentatively suspect that similar trends might be found in distributed models; namely, that adaptive higher-order methods yield significantly lower numerical error than fixed step lower-order methods do.

Numerical details
Several numerical details remain untested by this experiment. First, FUSE employs embedded error control, which is more efficient than error control requiring extra calculations by about a factor of 2 (Press and Teukolsky, 1992). While we can assume that embedded error control methods are relatively efficient, unembedded methods remain untested, and it is possible that some hydrological models do not use embedded methods. It is possible that this could alter the relative efficiency of various adaptive methods, albeit probably not to a different order of magnitude. Second, all implicit techniques here use a single method to ensure that mass balance errors are sufficiently small; changing the treatment of mass balance errors could lead to somewhat different median computational expenses for implicit methods. However, the median number of flux evaluations per time step under a variety of mass balance error monitoring strategies for fixed step implicit Euler apparently ranges between approximately 8 and 21, where the median number of flux evaluations for fixed step implicit Euler in 5 d events ranges between 14 and 20 (from lowest to highest precipitation intensities) in this experiment. This suggests that the implicit Euler method studied in this paper would probably not achieve a new order of magnitude of efficiency, as shown in Fig. 8, regardless of mass balance error monitoring method. Third, this study does not examine different methods for enforcing solution constraints (e.g., making sure physically impossible storages do not occur as model results). The method for enforcing solution constraints is outlined by Clark and Kavetski (2010), and altering this method could affect numerical error (Shampine et al., 2005). Finally, the adaptive explicit Heun method appears to be the most efficient technique given the limited space of techniques tested; this may no longer be the case when even higher-order adaptive methods are studied.
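How embedded error control avoids extra flux evaluations can be sketched for the adaptive explicit Heun method: the Euler predictor computed inside each Heun step doubles as a lower-order solution, so the difference between the two gives a local error estimate for free. The tolerance, step-size limits, and test problem below are illustrative choices, not FUSE's actual settings:

```python
import math

def adaptive_heun(f, t0, y0, t_end, tol=1e-4, dt=0.5):
    """Adaptive explicit Heun with an embedded Euler error estimate.

    The Euler predictor is computed inside the Heun step anyway, so
    comparing the first- and second-order solutions estimates the local
    error at no extra cost in flux evaluations (embedded error control).
    """
    t, y = t0, y0
    while t_end - t > 1e-12:
        dt = min(dt, t_end - t)
        f0 = f(t, y)
        y_euler = y + dt * f0                    # first-order (embedded) solution
        f1 = f(t + dt, y_euler)
        y_heun = y + 0.5 * dt * (f0 + f1)        # second-order solution
        err = abs(y_heun - y_euler)              # free local error estimate
        if err <= tol:                           # accept the step
            t, y = t + dt, y_heun
        # grow or shrink the step from the error estimate (bounded factors)
        dt *= min(4.0, max(0.1, 0.9 * math.sqrt(tol / (err + 1e-15))))
    return y

# illustrative linear reservoir dS/dt = -k*S with a known analytic solution
k, S0, T = 0.8, 10.0, 5.0
approx = adaptive_heun(lambda t, y: -k * y, 0.0, S0, T)
exact = S0 * math.exp(-k * T)
print(abs(approx - exact))
```

Rejected steps cost one extra flux evaluation; an unembedded scheme would additionally pay for a second, independent solution at every step, which is the factor-of-2 difference noted above.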
For example, adaptive fourth- or fifth-order Runge-Kutta methods (Schoups et al., 2010) are untested here. A truly optimal solver might be able to dynamically change its order or other options associated with the solver, as in other fields (Karimov et al., 2017; Rackauckas et al., 2020; Lauritzen et al., 2010; Ullrich et al., 2017). Though this point is technically a subset of issues with total dimensional space sampling, testing even higher-order or further modularized numerical techniques would be a straightforward and useful advancement of this work.

Breadth of literature review
The literature review is not an exhaustive survey of all existing modeling codes; instead, it is a smaller investigation of some of the modeling codes that we consider to be widely used. Therefore, it should not be interpreted as a quantitative result of how common numerical errors are but rather a preliminary indication of what might be the case. Furthermore, note that we only suggest the magnitude and prevalence of errors arising from model runs that are not further scrutinized. Accordingly, our results give an indication of numerical errors arising purely from modeling codes, rather than the occurrence of these errors in published work. For example, it has been demonstrated that fixed step explicit methods are likely to produce large instabilities, but these would probably be identified as such and discarded by the majority of model users.
5.2 Are minimized numerical error and computational expense good metrics with which to choose a numerical method?
At a glance, it might seem as if an inexpensive numerical method with minimized numerical error, the adaptive explicit Heun method, would be an optimal choice for modeling. After all, Clark and Kavetski (2010) do show that numerical error can easily be a large source of error in a model. However, a numerical technique that minimizes numerical error might not always minimize total error in an observational context; in reality, numerical error can cancel out other error sources. This is possible, for example, when a model structure is not sufficiently diffuse while a first-order fixed step numerical method is overly diffuse. The numerical method introduces numerical diffusion (a numerical error), which interacts with the structural error in such a way that the total error is reduced when evaluated on discharge observations, even when compared to a near-exact solution. Numerical diffusion has even been intentionally used to represent physical diffusion for increased accuracy (Thober et al., 2019). Furthermore, during calibration, error due to parameter choice can cancel out numerical error, in the same way model structural error can. Whether or not these are desirable qualities of a numerical method (i.e., whether hydrological models should simply provide accurate predictions of observable data or accurately represent their intended structure regardless of physical realism) is an open question, where the answer is context dependent. On one hand, a more accurate, real-world prediction is ultimately more useful for anyone whose livelihood depends on model results, e.g., a municipality that has to decide whether or not to issue evacuation advisories due to flooding risk. On the other hand, having somewhat unknown errors cancel each other out is a classic instance of finding the right answer for the wrong reasons (Kirchner, 2006), which could pose a problem when the aim of a modeling effort is to gain understanding of hydrological processes.
Especially because this experiment shows that model structure can impact numerical error, it is possible that a structural development in a model yielding reduced error is interpreted as an improved representation of nature, where, in reality, the structural development changed the numerical error such that the total error is reduced. When we want to test multiple process-based hypotheses, it would be ideal if we could be more assured that our results are due to the structural changes we make rather than being conflated with numerical error. With this study, we do not claim to have identified optimal methods in an observational context. We do submit, however, that we have identified a few numerical techniques that are desirable (or undesirable) for future process-based hypothesis testing.

Recommendation for numerical technique
Of the studied methods, this experiment finds that the adaptive explicit Heun method provides the best economy of low error and computational expense on average and, accordingly, is a practical choice for the numerical method. This is consistent with Clark and Kavetski (2010) and Kavetski and Clark (2010). However, this study differs in that it does not recommend the fixed step implicit Euler method as an optimal solution, because this method, on average, has relatively large numerical error and computational expenses here. The difference in recommendation between the present study and Clark and Kavetski (2010) or Kavetski and Clark (2010) is likely due to two methodological differences. First, Clark and Kavetski (2010) and Kavetski and Clark (2010) use the concept of fidelity to determine how closely a numerical technique approximates an exact solution which incorporates real discharge observations, while the present study uses purely numerical definitions of error; these are simply different approaches. Second, the time periods on which errors were evaluated were different: the present study evaluates error on the timescale of days, whereas Clark and Kavetski (2010) and Kavetski and Clark (2010) evaluate error on the timescale of years. As this study shows, the RMSE depends on precipitation intensity and duration. It could be the case that the fixed step implicit Euler method produces small numerical errors for a majority of a long time series when precipitation is mild or absent, which would keep the numerical error small on average. In contrast, a shorter but more intense precipitation time series yields larger average numerical errors.
On balance, it is encouraging that the present study and the 2010 numerical daemon papers (Clark and Kavetski, 2010; Kavetski and Clark, 2010) agree that the fixed step explicit Euler method produces a large amount of error and that the adaptive explicit Heun method is a good choice.
It is further encouraging to see a high degree of similarity between the numerical method recommended by the presented experiment and the recommendation in Schoups et al. (2010). Both works indicate that a second-order explicit adaptive method offers the best intersection of low error and low computational expense. In the case of Schoups et al. (2010), it is further found that the second-order explicit adaptive method is superior to a variety of other methods when calibrating a hydrologic model with Markov chain Monte Carlo methods. This offers additional evidence that second-order explicit adaptive methods are preferred in the simulation of real (rather than just synthetic) streamflow. Furthermore, tasks such as parameter estimation (via calibration) and the simulation of state variables aside from streamflow can be compromised by using the fixed step explicit Euler method (Schoups et al., 2010; Kavetski and Clark, 2010).

Why are sophisticated numerical methods uncommon in conceptual hydrologic models?
The original numerical daemons papers and other notable work on numerics in hydrologic models (Kavetski and Clark, 2010; Schoups et al., 2010) were published over a decade ago; these papers and the present study recommend similar numerical techniques. While some new models use sufficiently advanced numerical methods (Buitink et al., 2020; David et al., 2019; Pilz et al., 2020), it appears as if many current hydrologic models continue to use relatively unsophisticated numerical methods (see Sect. 4.7), where the fixed step explicit Euler method is rather common.
Here, we offer some speculative reasoning as to why this might be and indicate an opportunity in hydrologic education with respect to numerical techniques.
Numerical techniques receive substantially more attention in several other environmental sciences disciplines. This is related to the nature of the processes in these disciplines: they may be subject to chaos. That is, small changes in initial conditions or small perturbations will eventually yield wildly different model results. Examples include predator-prey systems (Vano et al., 2006; Inoue and Kamifukumoto, 1984) and meteorological systems, such as the famous Lorenz butterfly (Lorenz, 1963). Attention to numerical schemes is much needed, since the positive feedback loops inherently present in these fields are very sensitive to numerical error. Hydrologic systems of equations, in contrast, tend to contain negative feedback loops. Therefore, numerical errors do not explode, albeit that they are still present, as shown in this study. This stability compared to other fields might offer an explanation for why numerics in hydrology have received relatively little attention. There is clearly a role for hydrology education here, where more attention to numerical techniques can make the next generation of hydrological modelers more aware.
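The contrast between chaotic and negative-feedback systems can be made concrete by perturbing the initial conditions of both (a stdlib-only sketch; the Lorenz parameters are the classical 1963 values, while the reservoir parameters and perturbation sizes are hypothetical):

```python
import math

def lorenz_step(state, dt, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One explicit Euler step of the chaotic Lorenz (1963) system."""
    x, y, z = state
    return (x + dt * sigma * (y - x),
            y + dt * (x * (rho - z) - y),
            z + dt * (x * y - beta * z))

def reservoir_step(S, dt, P=1.0, k=0.5):
    """One explicit Euler step of a linear reservoir dS/dt = P - k*S
    (negative feedback: perturbations decay instead of growing)."""
    return S + dt * (P - k * S)

dt, n = 0.005, 6000                                  # integrate to t = 30
a, b = (1.0, 1.0, 1.0), (1.0 + 1e-6, 1.0, 1.0)       # tiny Lorenz perturbation
Sa, Sb = 10.0, 10.0 + 1e-2                           # larger reservoir perturbation
max_gap = 0.0
for _ in range(n):
    a, b = lorenz_step(a, dt), lorenz_step(b, dt)
    Sa, Sb = reservoir_step(Sa, dt), reservoir_step(Sb, dt)
    max_gap = max(max_gap, math.dist(a, b))

print("Lorenz trajectories diverge to a gap of", max_gap)
print("reservoir perturbation shrinks to", abs(Sa - Sb))
```

The chaotic system amplifies a perturbation of 1e-6 to order one, while the reservoir damps a perturbation ten thousand times larger to nearly nothing, which is the stability argument made above.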
Besides the stability of the hydrological system, another explanation for the limited attention to numerics in hydrology is that numerical error can interact with other sources of error. Multiple sources of error can cancel each other out, such that hydrologic models produce accurate results, despite having significant numerical error. When hydrologic models are working, although not necessarily for the right reasons, there is likely to be less incentive to investigate numerical errors.

Conclusions
In this study, the effects of changing precipitation intensity and duration on the numerical errors and computational expenses resulting from various numerical techniques in the context of lumped hydrological models were examined. Precipitation events varying between 1 % and 120 % of interpolated historical precipitation intensity maxima for various durations were used as forcing data. Model results were generated over a variety of choices in parameters, initial conditions, and model structures for generality. It was found that median root mean square error (RMSE) usually increased with increasing precipitation intensity and decreased with increasing precipitation event duration. Rankings of numerical techniques in terms of RMSE and the number of flux evaluations were established for varying precipitation intensities. These were shown to be robust over the majority of choices in parameters, model structures, and initial conditions. Then, the efficiency of each technique was examined by comparing computational expense and numerical error. Of the tested methods, a clear leader in efficiency emerged for all precipitation intensities, namely the adaptive explicit Heun method. A small literature review was conducted on various hydrological modeling codes, demonstrating that this method, and even adaptive methods in general, might be uncommon in practice, and that a method with much larger numerical error might be common. Because process-based hydrological knowledge is enhanced by reductions in numerical error and computational expense, and because numerical errors are in many areas likely to increase with the changing climate (via more extreme precipitation), we advocate a more widespread use of highly efficient numerical methods.