Deep learning methods have frequently outperformed conceptual hydrologic models in rainfall-runoff modelling. Attempts to investigate such deep learning models internally are being made, but the traceability of model states and processes and their interrelations with model input and output is not yet fully understood. Direct interpretability of mechanistic processes has always been considered an asset of conceptual models that helps to gain system understanding in addition to predictive skill. We introduce hydrologic neural ordinary differential equation (ODE) models that perform as well as state-of-the-art deep learning methods in streamflow prediction while maintaining the ease of interpretability of conceptual hydrologic models. In neural ODEs, internal processes that are represented in differential equations are substituted by neural networks. Therefore, neural ODE models enable the fusion of deep learning with mechanistic modelling. We demonstrate the basin-specific predictive performance for 569 catchments of the continental United States. For exemplary basins, we analyse the dynamics of states and processes learned by the model-internal neural networks. Finally, we discuss the potential of neural ODE models in hydrology.

Deep learning models, in particular long short-term memory (LSTM) neural networks, have outperformed traditionally used conceptual models in
hydrologic modelling

First, machine-learning models are still not as easily interpretable as traditionally used physics-based conceptual hydrologic models are

Therefore, it is becoming more and more inaccurate to label machine-learning methods as black box models since techniques exist that shed light on the
internal information processing of machine-learning methods

Second, while the introduction of system memory as a physical principle (as in LSTM models) has turned out to be crucial for hydrograph prediction, other
basic physical principles have not necessarily been fulfilled yet. Currently used machine-learning approaches are limited to fixed time steps, which
restricts their usage. For instance, while LSTM approaches have been shown to perform well at the task of discharge prediction on daily resolution in many
different cases

Third, there is often prior knowledge that cannot be included in machine-learning models. Data-driven modelling demonstrates impressive abilities in
terms of mimicking and/or improving the translation from driving-force variables through the system into its output, like from precipitation to
discharge in hydrology. Yet, the question remains as to why such models should have to learn all the internal processes of the system from scratch using data alone. Much
knowledge about hydrology has been gathered in the past, so why not provide such knowledge, for example, mechanistic structure, reliable causal
interrelations and context-specific information

For conceptual hydrologic models, these gaps have been mostly closed over the last decades: the development of conceptual bucket-type models

Yet, there remains a dichotomy between bottom-up and top-down approaches in hydrology

Recently, hybrid attempts have been made to extend conceptual hydrologic models with machine-learning methods in order to alleviate their
shortcomings. For example,

Other approaches that combine principles from physical process-based models (PBMs) and deep learning have increasingly been developed in recent
years. Mixtures of PBMs and deep learning methods have been used to model the global hydrological cycle

Here, we introduce a modelling approach that addresses the above gaps regarding interpretability, physics and knowledge simultaneously and that
therefore has the potential to help dissolve the dichotomy in hydrology. The approach is also a hybrid of deep learning and differential equations,
but it does not apply deep learning only in a postprocessing step as in the hybrid CNN approach above, and it does not only use constraints from
differential equations in the loss function as in PINNs. We employ neural ordinary differential equation (ODE) models

Neural ODEs are part of so-called scientific machine learning

Deep learning methods in hydrology have proven their ability to process integrated site-specific information to improve discharge prediction
tremendously

The remainder of this article is structured as follows: in Sect.

Schemes of the neural ODE models M50

As a baseline conceptual framework, we work with a typical hydrologic bucket-type model. We employ the structure of the simple rainfall-runoff model EXP-Hydro

EXP-Hydro as originally developed by

We refer to our implementation of EXP-Hydro as model M0. In total, we set up three different models, with numbers in the model name indicating the
percentage of neural network fraction within the model. Our models M50 and M100 have terms in Eqs. (

As shown in Fig.

We use the data provided in the CAMELS (Catchment Attributes and Meteorology for Large-sample Studies) dataset

In our model evaluation, we also use lumped snow water equivalent (SWE) time series data for each basin. Besides the catchment-integrated time series
such as those for temperature or precipitation, the CAMELS dataset contains dynamic data provided for different elevation bands in each basin,
including SWE time series. Each elevation band is assigned a respective area as a fraction of the full catchment area. Using this information, we
integrate the SWE data as an area-weighted average in order to obtain lumped SWE data for each catchment. Note that SWE is not used as model input in
calibration. The observed SWE data are solely used for comparison with the dynamics of the snow storage
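The area-weighted lumping of SWE described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study (which was written in Julia); the function name `lumped_swe`, the data layout and the example numbers are hypothetical and serve only to show the arithmetic.

```python
from typing import List, Sequence


def lumped_swe(band_swe: Sequence[Sequence[float]],
               area_fractions: Sequence[float]) -> List[float]:
    """Area-weighted average of per-band SWE time series.

    band_swe[i][t] is the SWE of elevation band i at time step t;
    area_fractions[i] is that band's share of the full catchment area.
    """
    if len(band_swe) != len(area_fractions):
        raise ValueError("one area fraction per elevation band is required")
    total = sum(area_fractions)
    if total <= 0:
        raise ValueError("area fractions must sum to a positive value")
    # Normalise in case the fractions do not sum exactly to 1.
    weights = [a / total for a in area_fractions]
    n_steps = len(band_swe[0])
    return [
        sum(w * band[t] for w, band in zip(weights, band_swe))
        for t in range(n_steps)
    ]


# Hypothetical example: three bands covering 50 %, 30 % and 20 % of the area.
swe_bands = [[0.0, 10.0, 20.0], [5.0, 20.0, 40.0], [10.0, 50.0, 80.0]]
fractions = [0.5, 0.3, 0.2]
swe_lumped = lumped_swe(swe_bands, fractions)
```

In this made-up example the lumped series is 3.5, 21 and 38 (in the units of the band data), which is the sum of each band's SWE multiplied by its area fraction.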

From the 671 available catchments, we use the same 569 as in

Our models are calibrated to each catchment specifically and validated on the same catchment. The procedure is structured as follows, with steps 2
and 3 only applying to neural ODE models M50 or M100:

Over the different steps, we enable knowledge transfer between the models: results from the trained conceptual hydrologic model are used as an example
for the neural network(s) to learn general relations between input variables and output quantities. These relations are then improved and refined in
the neural ODE training step. After successful training, we conduct a twofold evaluation of the models with validation data from the test period between
1 October 2000 and 30 September 2010:

We benchmark the models by three metrics commonly used in hydrology

We analyse internal model states and processes between the conceptual (M0) and the neural ODE (M50, M100) models (see
Sect.

First, for benchmarking, the following metrics are used: the Nash–Sutcliffe efficiency (NSE), as defined in Eq. (
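For orientation, the three metrics can be sketched in a few lines of Python. The NSE follows its standard definition. The forms used here for mNSE (absolute instead of squared deviations) and FHV (percentage bias of the highest 2 % of flows, with observed and simulated flows ranked separately) are assumptions based on definitions commonly used in the literature; they are not taken from the equations of this paper, and the function names are hypothetical.

```python
from typing import Sequence


def nse(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency (optimum: 1)."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def mnse(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Modified NSE with absolute deviations (assumed form; optimum: 1)."""
    mean_obs = sum(obs) / len(obs)
    num = sum(abs(o - s) for o, s in zip(obs, sim))
    den = sum(abs(o - mean_obs) for o in obs)
    return 1.0 - num / den


def fhv(obs: Sequence[float], sim: Sequence[float],
        top_fraction: float = 0.02) -> float:
    """Percentage bias of the highest flows (assumed form; optimum: 0).

    Observed and simulated flows are ranked separately, as on a
    flow duration curve, and the top `top_fraction` is compared.
    """
    n_top = max(1, round(top_fraction * len(obs)))
    obs_top = sorted(obs, reverse=True)[:n_top]
    sim_top = sorted(sim, reverse=True)[:n_top]
    return 100.0 * sum(s - o for o, s in zip(obs_top, sim_top)) / sum(obs_top)


# Made-up discharge values; a top fraction of 20 % selects one value here.
q_obs = [1.0, 2.0, 3.0, 4.0, 10.0]
q_sim = [1.0, 2.0, 3.0, 4.0, 8.0]
scores = (nse(q_obs, q_sim), mnse(q_obs, q_sim), fhv(q_obs, q_sim, 0.2))
```

With these made-up values, the underestimated peak lowers NSE more strongly than mNSE because squared deviations weight large errors more heavily. The signed FHV returned here is negative when peaks are underestimated; whether a signed or an absolute value is reported is a further assumption of this sketch.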

Histograms of NSE (orange; optimal value: 1), FHV (yellow; optimal value: 0) and mNSE (blue; optimal value: 1) for the developed neural ODE models M100 and M50, the plain conceptual baseline model M0 and state-of-the-art LSTM and postprocessing hybrid CNN models

Another special case with

Second, the evaluation of internal model states and processes is conducted in direct comparison between the conceptual model M0 and the neural ODE
models M50 and M100: the dynamics of snow and water storages are inspected alongside the model-specific estimated streamflow. Further, the internal
processes for discharge, evapotranspiration and melting are isolated and explored over plausible ranges of input variables and model states, for example, discharge as a function of water storage. Additional input variables to the neural networks in M50 and M100 that are not being explored are kept
fixed with catchment-specific values (like mean temperature) as specified in Sect.

Figure

For both FHV and mNSE, M100 scores better in both mean and median than all other models. The distributions over all catchments show clear shifts towards the optimal scores of 0 for FHV and 1 for mNSE. Considering NSE, which is also the calibration metric, M100 outperforms all other models except for the hybrid CNN approach. Yet, mean and median NSE of the two models deviate only by a small margin. The histograms show that the scores of the hybrid CNN model accumulate slightly above the median for NSE and mNSE and slightly below the median for FHV. In contrast, M100 achieves substantially more high scores for NSE and mNSE and lower peak flow errors. At the tails of the histograms, M100 reduces the number of poor results (NSE and mNSE below 0 and FHV around 100 and above).

Time series of data and model predictions from models M0, M50 and M100 for discharge (top), snow storage (centre) and water storage (bottom; no data) for the test period in basin 1013500

Considering M50 and M0, the neural ODE model M50 achieves a substantial improvement in all metrics over the plain conceptual model: NSE mean and
median improve by about 0.15 and 0.23, respectively; mNSE increases in both mean and median by more than 0.1, while FHV drops by about 25 %.
This shows that the conceptual model already benefits clearly from substituting only two processes (ET and

All models except for M0 and the LSTM achieve performances in a similar range, with similar means and medians over all
metrics, although the distributions show noticeable differences. While M0 shows better FHV scores than the LSTM, with the whole distribution tending toward lower
values, the LSTM is considerably better regarding NSE and mNSE. Yet, all distributions for both models deviate clearly from those of the other models, showing
more poor scores, i.e. low values (around 0.0) for NSE and mNSE and high values for FHV. This is further discussed in
Sect.

As with conceptual hydrologic models, the temporal dynamics of processes and states can be inspected and analysed directly in the neural ODE
approach. We chose two exemplary basins for demonstration purposes: Fish River near Fort Kent, Maine (ID: 1013500), and Spearfish Creek, South Dakota
(ID: 6431500). The former is a comparatively large basin (

Figure

Relation between water storage and discharge

The two basins cover different magnitudes for all depicted variables. For the basin 1013500, model predictions of the three models are very
similar. Discharge predictions of all models match observations very well, which is also indicated by overall good metrics in
Table

Streamflow prediction performance based on NSE (optimum: 1), FHV (optimum: 0) and mNSE (optimum: 1) of the conceptual model (M0) and both neural ODE models (M50 and M100) for basins 1013500 and 6431500. Bold values indicate best performance.

In neither basin do the neural ODE models alter the snow storage component much from the plain conceptual model, although there are small differences
in specific years. Overall, the models capture the temporal pattern of snow accumulation, but there are discrepancies in the magnitude. The models for
basin 1013500 show acceptable estimates, while for basin 6431500 they tend to underestimate SWE systematically. At the end of each snow season, the
models predict snow to disappear much earlier than observed in most years. This issue is further discussed in
Sect.

Like in plain conceptual models, internal processes like discharge, evapotranspiration and melting can be analysed over plausible ranges of input
variables in neural ODE models. Figure

The discrepancies of the learned relations between the three models become even more apparent on the logarithmic scale in Fig.

Dependence of discharge on precipitation (rain) and water storage for the conceptual model M0 and the neural ODE models M50 and M100 in basin 1013500 (left;

Going beyond plain conceptual models, the neural ODE approach further allows us to directly analyse the (cross-)impact of variables additionally assigned to
specific processes. Both neural networks

For basin 1013500, models M50 and M100 show an overall similar pattern in Fig.

The expected trend of increasing discharge for increasing rain is clearly visible for both models in basin 6431500 (Fig.

Figure

For basin 1013500, M0 shows much higher ET estimates over a large range of temperature–water storage combinations compared to the other two
models. M50 only reaches maximal ET in the region of medium to high water storage and very high temperatures (extreme to unrealistic for the
considered basin) as shown in Fig.

In contrast to M50, M100 shows a much more regular dependence of ET on temperature and water storage, as shown in
Fig.

Dependence of evapotranspiration on temperature and water storage for the conceptual model M0 and the neural ODE models M50 and M100 in basin 1013500 (left;

In basin 6431500, both models show a much more similar pattern for the maxima of evapotranspiration (Fig.

Dependence of melting rate on temperature and snow storage for models M0/M50 and M100 in basin 1013500 (left;

The effect of snow storage and temperature on melting rates is displayed in Fig.

For M100, differences between the basins and from the hard-coded linear melting relationship in M0/M50 are clearly observable: for basin 1013500
(Fig.

Of course, the highest temperatures covered in the above analysis are unrealistic in combination with snow cover. Elevation information, which would make it possible to consider snow cover at high altitudes while temperatures in lower parts of the catchment are already warm, is neglected. Nonetheless, we demonstrate that a physical extrapolation and analysis of individual processes is possible with the neural ODE approach, just as is traditionally done with conceptual models.

All four machine-learning-based hydrologic models show a substantial improvement over the plain conceptual hydrologic model M0. Results indicate that
more information from training data can be leveraged by partially or purely data-driven models, and considerably higher scores are
achieved. Admittedly, EXP-Hydro is a very simplistic bucket model, and more sophisticated conceptual hydrologic models exist that achieve higher
scores (see SAC-SMA (Sacramento Soil Moisture Accounting Model) in Appendix

Note that the displayed results for LSTM are the original values from

Despite their success, machine-learning models in hydrology like LSTMs are known for often underestimating high flow events

The overall better performance of neural ODE models compared to plain conceptual models is associated with decisive differences in the model-internal dynamics and process relations. Results demonstrate that pre-training the neural networks to mimic hard-coded processes before the full neural ODE training does not prevent them from learning new and vastly different relations. With neural ODEs being built on the same conceptual model structure, individual states and processes can easily be analysed and compared between different models, or they can be investigated over specific ranges of input variables and model states. Ultimately, the dependencies learnt by the neural networks might help to develop more sophisticated relations for discharge and other processes.

In the variable ranges where many data were available, the neural ODE models elicited plausible relations for the investigated processes. Yet, the analyses indicated that, in the extreme ranges of the process-dependent variables, learned relations might be counter-intuitive or subject to uncertainty. This is partially caused by a lack of data: 20 years of training data for a single catchment typically do not provide enough information to extrapolate towards these limits with confidence. Although general process trends often appeared to be plausible, cases remain that are hard to explain (e.g. a decrease in melting rate with growing snow storage). More data might help to refine functional relations for broader data ranges to a higher level of accuracy and to turn parts of the extrapolation into an interpolation problem. Yet, this will only be one part of the solution since further extrapolation is always a challenging task, especially for purely data-driven methods. Here, we conjecture that the hybrid neural ODE models benefit from their physical structure, which enforces regularization. It informs the model parameters in addition to the data during training and naturally constrains predictions during interpolation and extrapolation tasks, just as the modelled natural system is constrained by physical limits. We think that this combination of more data and physical structure might help neural ODE models to elicit reliable functional relations that can then be evaluated in plausibility testing. An example of this could be a centennial rainfall-runoff event that might not be covered by our data, but for which we would still be able to judge qualitatively whether the extrapolated relation used to predict it is plausible.

None of the models were trained on snow water equivalent data, but due to their conceptual structure M0/M50 and M100 learned snow dynamics indirectly
via the snow storage state of the model. Despite the close agreement regarding their predictions of snow in both considered basins, all models exhibit
limitations of the lumped snow storage approach: melting of snow is often predicted earlier than the data show for the catchment (see
Fig.

Hydrologic neural ODE models fuse the modular bucket-type structure of conceptual hydrologic models with machine learning. Put simply, neural ODE
models are conceptual hydrologic models with deep learning cores. The presented models M50 and M100 are hydrologic implementations of the general
neural ODE approach

First, using the conceptual hydrologic model structure preserves the interpretability of the model as traditionally given by conceptual models
and appreciated by the hydrologic community. Internal model states and processes can directly be inspected for plausibility, and their physical
interpretation fosters system understanding. The neural ODE approach might further trigger advancement in a more fundamental manner of building
“conceptual” models: theoretically, modellers only need to set up the conceptual framework; they do not have to specify parameterizations within the
model but can let the neural networks learn plausible relations. Potentially, even features that are often neglected in typical conceptual models, like
hysteresis

Second, the neural ODE allows for physically constrained, continuous-time solutions. In principle, this also allows us to include data at an irregular temporal resolution for both training and testing. Physical principles and mechanistic structure act as guide rails that are naturally included and do not have to be learned or enforced as with pure machine-learning approaches. The physical constraints act as a regularization that bounds the variability of the model. At the same time, the method is flexible enough to learn constitutive relations from data.
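The point about continuous-time solutions can be made concrete with a small Python sketch. It integrates a single storage with a classical fourth-order Runge-Kutta scheme and reports the state at arbitrary, irregularly spaced times. The linear reservoir, the rate constant `K` and the helper names `rk4_step` and `solve_at` are illustrative assumptions; they are not the solver or the processes used in this study, where a neural network would take the place of the right-hand side.

```python
import math
from typing import Callable, List, Sequence


def rk4_step(f: Callable[[float, float], float],
             t: float, s: float, dt: float) -> float:
    """One classical fourth-order Runge-Kutta step for ds/dt = f(t, s)."""
    k1 = f(t, s)
    k2 = f(t + dt / 2, s + dt / 2 * k1)
    k3 = f(t + dt / 2, s + dt / 2 * k2)
    k4 = f(t + dt, s + dt * k3)
    return s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def solve_at(f: Callable[[float, float], float], s0: float,
             times: Sequence[float], max_dt: float = 0.1) -> List[float]:
    """Integrate ds/dt = f(t, s) and return the state at each requested time.

    The requested times may be irregularly spaced; the solver takes as many
    internal sub-steps (each no longer than max_dt) as needed in between.
    """
    states = [s0]
    t, s = times[0], s0
    for t_next in times[1:]:
        n_sub = max(1, math.ceil((t_next - t) / max_dt))
        dt = (t_next - t) / n_sub
        for _ in range(n_sub):
            s = rk4_step(f, t, s, dt)
            t += dt
        t = t_next  # avoid accumulating floating-point drift in time
        states.append(s)
    return states


K = 0.3  # hypothetical recession constant (1/d)


def linear_reservoir(t: float, s: float) -> float:
    """Storage drains proportionally to its content; no inflow."""
    return -K * s


obs_times = [0.0, 0.4, 1.0, 3.5, 3.6, 7.0]  # irregular observation times (d)
states = solve_at(linear_reservoir, 10.0, obs_times)
```

Because the model is defined by its right-hand side rather than by a fixed update rule, nothing in the model itself changes when the observation times change; only the times passed to the solver differ.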

Third, our approach invites prior physical knowledge to be incorporated into the model. For instance, the neural ODE approach allows us to include processes that are fully known as hard-coded features, like a sewage treatment plant discharging into the stream at a known temporal pattern. Locally, expert knowledge about hydrologic systems might be available that can be accounted for. Purely data-driven methods might not be able to infer this knowledge from data alone, and purely mechanistic models might not provide the flexibility that neural ODE models offer.
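A minimal Python sketch of this idea is given below: the right-hand side of the storage equation combines a hard-coded, fully known inflow with a flexible process that would normally be a trained neural network. Everything specific in it is hypothetical, including the release pattern in `plant_inflow`, the made-up weights in `learned_discharge` and the simple forward Euler integration, which is used only to keep the example short.

```python
import math
from typing import Callable


def plant_inflow(t_days: float) -> float:
    """Known release pattern (hypothetical): 2 mm/d in the first half of each day."""
    return 2.0 if (t_days % 1.0) < 0.5 else 0.0


def no_inflow(t_days: float) -> float:
    """Reference case without the treatment plant."""
    return 0.0


def learned_discharge(storage: float) -> float:
    """Stand-in for a trained neural network; the weights are made up."""
    hidden = math.tanh(0.05 * storage)
    return 3.0 * max(hidden, 0.0)


def simulate(storage0: float, rain: float, t_end: float,
             inflow: Callable[[float], float] = plant_inflow,
             dt: float = 0.01) -> float:
    """Forward Euler integration of dS/dt = rain + inflow(t) - Q(S)."""
    n_steps = int(round(t_end / dt))
    storage = storage0
    for i in range(n_steps):
        t = i * dt
        storage += dt * (rain + inflow(t) - learned_discharge(storage))
    return storage


with_plant = simulate(50.0, rain=1.0, t_end=5.0)
without_plant = simulate(50.0, rain=1.0, t_end=5.0, inflow=no_inflow)
```

The known term enters the water balance exactly as specified and does not have to be inferred from data, while the learned term remains free to adapt during training.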

In principle, the introduced approach can be applied to any conceptual hydrologic model. Numerous alternative bucket-type models and frameworks exist that can be fused with neural networks partially or entirely. The number of states and processes is adjustable according to the specific requirements of the modelling problem at hand or to a more generic setup for multiple catchments. Even the EXP-Hydro model, used here as a rather simplistic example of a conceptual model, facilitated a drastic improvement in model performance when used as a basis for neural ODE models. Many sophisticated conceptual models exist (like SAC-SMA) that could also serve as a framework for more sophisticated hydrologic neural ODE models.

With the hydrological neural ODE model, we seek to introduce a tool in between existing top-down and bottom-up approaches that paves the way for
various subsequent research routes. For example, the deterministic model can be made probabilistic to enable uncertainty assessments as currently
performed for stochastic hydrologic models

The simple rainfall-runoff model EXP-Hydro

EXP-Hydro parameter definitions, meaning and units

For ease of readability and comparability to

Precipitation as snow or rain:

Evapotranspiration:

Melting:

Discharge:

With the substitutions from M0 to M50, we want to highlight two important features of the neural ODE modelling approach. First, physical knowledge can
directly be included in the model: the ET prescription uses potential evapotranspiration based on Hamon's formula

Histograms of NSE (orange; optimal value: 1), FHV (yellow; optimal value: 0) and mNSE (blue; optimal value: 1) for the SAC-SMA model over 569 basins.

Second, in hydrologic models, discharge is often split up into (at least) a base flow component and an excess or peak flow component that acts above a
certain threshold of the water storage. In the neural ODE approach, these two flow components can be substituted by a neural network with a single
output node because neural networks are particularly suited to learn nonlinearities. Hence, rather than defining an “artificial” threshold beyond
which a new process is added, NNs can learn a continuous relation between water storage and model inputs to discharge. Unlike the
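The contrast between a thresholded discharge formulation and a single-output neural network can be sketched in Python as follows. The functional form in `conceptual_discharge` (exponential base flow plus spill above a storage threshold) and all parameter values are assumptions in the style of simple bucket models, not the equations of this paper. The network in `nn_discharge` uses made-up weights, and the softplus output that keeps discharge non-negative is likewise an assumption of this sketch.

```python
import math
from typing import List, Sequence, Tuple


def conceptual_discharge(storage: float, s_max: float = 100.0,
                         q_max: float = 10.0, f: float = 0.05) -> float:
    """Base flow plus excess flow above a storage threshold (assumed form)."""
    if storage <= 0.0:
        return 0.0
    if storage <= s_max:
        return q_max * math.exp(-f * (s_max - storage))
    # Above the threshold, a second process (spill) is switched on.
    return q_max + (storage - s_max)


Weights = Tuple[List[List[float]], List[float], List[float], float]


def nn_discharge(storage: float, precip: float, weights: Weights) -> float:
    """One hidden tanh layer and a single output node; no explicit threshold."""
    w1, b1, w2, b2 = weights
    hidden = [math.tanh(w[0] * storage + w[1] * precip + b)
              for w, b in zip(w1, b1)]
    z = sum(w * h for w, h in zip(w2, hidden)) + b2
    # Numerically stable softplus keeps the predicted discharge positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical, untrained weights for two hidden nodes.
example_weights: Weights = (
    [[0.02, 0.1], [-0.01, 0.05]],  # input-to-hidden weights (storage, precip)
    [0.0, 0.1],                    # hidden biases
    [1.5, -0.5],                   # hidden-to-output weights
    -1.0,                          # output bias
)

q_threshold = [conceptual_discharge(s) for s in (50.0, 100.0, 110.0)]
q_network = [nn_discharge(s, 2.0, example_weights) for s in (50.0, 100.0, 110.0)]
```

The conceptual function changes its form at the threshold, whereas the network output is a smooth function of its inputs whose shape is determined entirely by the trained weights.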

M50 is meant to demonstrate how strongly predictive performance can be increased by including some more flexible, data-driven model parts, i.e. only
partial modifications within the traditional modelling approach. This approach is similar to the one in

In the next step from M50 to M100, the other mechanistic processes that are “hard-coded” in the plain EXP-Hydro are also substituted. These are the
partitioning of precipitation into rain and snow and the melting process that transfers water from the snow storage unit to the main storage unit. As
opposed to ET and

The current benchmark hydrologic model for the CAMELS US dataset is the Sacramento Soil Moisture Accounting Model

Note, however, that training and testing periods for the SAC-SMA were different from those used here. The SAC-SMA was calibrated with a split-sample
approach, where 30 years of data (1 October 1980 to 30 September 2010) was split up into two parts, each covering 15 years. For details, refer to

Figure

All software was written in the programming language Julia (

MH had the original idea and developed the conceptualization and methodology of the study. MH developed the software with initial support by AS. MH conducted all model simulations and their formal analysis. Results were discussed and further research steps planned between CA, MBJ, AS, FF and MH. The visualizations and the original draft of the manuscript were prepared by MH, and reviewing and editing were provided by MBJ, CA, AS and FF. Funding was acquired by FF. All authors have read and agreed to the current version of the paper.

At least one of the (co-)authors is a member of the editorial board of

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The authors would like to thank Shijie Jiang for providing the original results from

This paper was edited by Marnik Vanclooster and reviewed by Miyuru Gunathilake and Andreas Wunsch.