Deep learning rainfall-runoff predictions of extreme events
- 1National Water Center, National Oceanic and Atmospheric Administration, Tuscaloosa, AL, United States
- 2University of Alabama, Tuscaloosa, AL, United States
- 3LIT AI Lab & Institute for Machine Learning, Johannes Kepler University, Linz, Austria
- 4Google Research, Tel Aviv, Israel
- 5The University of Arizona, Tucson, AZ, United States
- 6Google Research, Mountain View, CA, United States
- 7University of California Davis, Department of Land, Air & Water Resources, Davis, CA, United States
Abstract. The most accurate rainfall-runoff predictions are currently based on deep learning. There is a concern among hydrologists that data-driven models based on deep learning may not be reliable in extrapolation or for predicting extreme events. This study tests that hypothesis using Long Short-Term Memory networks (LSTMs) and an LSTM variant that is architecturally constrained to conserve mass. The LSTM (and the mass-conserving LSTM variant) remained relatively accurate in predicting extreme (high return-period) events compared to both a conceptual model (the Sacramento Model) and a process-based model (US National Water Model), even when extreme events were not included in the training period. Adding mass balance constraints to the data-driven model (LSTM) reduced model skill during extreme events.
Jonathan Frame et al.
Status: closed
-
CC1: 'Comment on hess-2021-423', John Ding, 20 Aug 2021
I’m intrigued by the opening sentence in the Abstract that “(t)he most accurate rainfall-runoff predictions are currently based on deep learning.” And after a quick read, I’d have to concur with the authors in their findings.
One question I have is about one of their previous works. Kratzert et al. (2019a, Section 3.1, paragraph 2) wrote that “This [LSTM] is not dissimilar to any standard hydrological simulation model (i.e., is it not a one-step-ahead forecast model).”
The word “it” inside the parentheses seems out of place. I’d appreciate a clarification of this.
The conceptual model the authors adopt for benchmarking is the Sacramento Soil Moisture Accounting model (SAC-SMA). This includes “a [linear] unit hydrograph routing function” (Line 122).
As a proponent of using a nonlinear response function to simulate what I’ve called Childs-Minshall phenomenon (Ding, 2011, Figures 1 and 2), I feel the SAC-SMA can be improved by moving to a nonlinear store or storage. But then advancing the state of the art of a standard conceptual model is a separate issue.
References
Ding, J. Y.: A measure of watershed nonlinearity: interpreting a variable instantaneous unit hydrograph model on two vastly different sized watersheds, Hydrol. Earth Syst. Sci., 15, 405–423, https://doi.org/10.5194/hess-15-405-2011, 2011.
-
CC2: 'Clarify on CC1', John Ding, 30 Oct 2021
I’d like to make use of this extended discussion period to clarify the one question I have about some co-authors’ previous statement, the latter part of which reads “This [LSTM] ..... (i.e., is it not a one-step-ahead forecast model).” (CC1, paragraph 2).
Among the autoregressive (AR) class of time series models for prediction, the simplest is a one-step-ahead extrapolation/forecast model. A second-order one, written AR(2, 2, -1), is ŷ(t+1) = 2·y(t) − y(t−1).
The drawback of the AR(2) is that it always overshoots by one time step the timing of peaks and troughs of an observed hydrograph (Mizukami et al., 2019, SC1 therein; Ding, 2018).
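A minimal sketch of this extrapolation rule on synthetic data (illustrative only) reproduces the one-step lag:
```python
import numpy as np

# AR(2, 2, -1) rule: y_hat(t+1) = 2*y(t) - y(t-1), i.e., linear extrapolation
# through the last two observed values (synthetic data, for illustration only).
y = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])   # idealized triangular hydrograph
t = np.arange(2, len(y))                             # times with two antecedent values
y_hat = 2 * y[t - 1] - y[t - 2]                      # forecast of y at each time in t

print("observed peak at t =", np.argmax(y))          # t = 3 (peak value 3.0)
print("forecast peak at t =", t[np.argmax(y_hat)])   # t = 4 (value 4.0): one step late, and overshoots
```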
Isn’t AR(2, 2, -1) a special case of the LSTM network models?
References
Ding, J.: Interactive comment on “On the choice of calibration metrics for “high flow” estimation using hydrologic models” by Naoki Mizukami et al., Hydrol. Earth Syst. Sci. Discuss., https://doi.org/10.5194/hess-2018-391-SC1, 2018.
Mizukami, N., Rakovec, O., Newman, A. J., Clark, M. P., Wood, A. W., Gupta, H. V., and Kumar, R.: On the choice of calibration metrics for “high-flow” estimation using hydrologic models, Hydrol. Earth Syst. Sci., 23, 2601–2614, https://doi.org/10.5194/hess-23-2601-2019, 2019.
-
AC1: 'Reply on CC2', Jonathan Frame, 29 Dec 2021
Thank you very much for your comment on our paper. I am glad that you concur with the findings in this paper “Deep learning rainfall-runoff predictions of extreme events”.
[One question I have is about one of their previous works. Kratzert et al. (2019a, Section 3.1, paragraph 2) wrote that “This [LSTM] is not dissimilar to any standard hydrological simulation model (i.e., is it not a one-step-ahead forecast model).“]
I think the problem is simply that "is it" should have been "it is". Please note that this discussion is about our current manuscript and not meant to discuss an already published manuscript. Next time, please consider writing these questions to the contact author of the corresponding paper.
We chose to use the SAC-SMA model as developed for the CAMELS dataset (https://ral.ucar.edu/solutions/products/camels) because it is a model that is known to produce good streamflow predictions on lumped catchments. It is true that there are many different variations of conceptual and process-based models that might perform better in extreme events than this particular version of SAC-SMA and the U.S. National Water Model included in this paper. But because our hypothesis concerns data-driven models, i.e., “data-driven streamflow models are likely to become unreliable in extreme or out-of-sample events”, we do not feel that it is necessary to benchmark all possible variations of process-based models to address it.
As for your question “Isn’t AR(2, 2, -1) a special case of the LSTM network models?”, I do not agree that AR(2, 2, -1) is a special case of the LSTM network models. The LSTM is a general state-space model that does not require an autoregressive input; in this paper we do not use an autoregressive input, while the AR model you describe uses only an autoregressive input. While it is true that an LSTM can be trained to reproduce the results of an AR(2, 2, -1) model, that does not make the latter a special case of the LSTM; rather, it is a consequence of the universal approximation characteristic of neural networks.
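To make the distinction concrete, here is a minimal PyTorch sketch of an LSTM driven only by exogenous inputs; the layer sizes and shapes are illustrative assumptions, not the configuration used in the paper:
```python
import torch
import torch.nn as nn

# Sketch: the LSTM state evolves from forcings and static attributes alone;
# no past discharge is fed back in, unlike an autoregressive model.
n_forcings, n_static, hidden = 5, 27, 64          # illustrative sizes only

lstm = nn.LSTM(input_size=n_forcings + n_static, hidden_size=hidden, batch_first=True)
head = nn.Linear(hidden, 1)                       # maps hidden states to discharge

x = torch.randn(8, 365, n_forcings + n_static)    # batch of year-long input sequences
h, _ = lstm(x)                                    # states driven by exogenous inputs only
q_sim = head(h)                                   # simulated discharge at every time step
print(q_sim.shape)                                # torch.Size([8, 365, 1])
```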
Thank you very much for your comments.
Jonathan Frame
-
RC1: 'Comment on hess-2021-423', Anonymous Referee #1, 05 Sep 2021
This is an interesting paper that addresses a most relevant issue of incorporating background knowledge about processes in question into machine learning (ML) algorithms.
Clearly, in ML we are hoping to program computers by telling them what we want to achieve without having to explicitly instruct them how to achieve such goals. But, what is it that we want from those programs? Do they just need to be accurate or should we also be able to interpret them? In scientific contexts, the ambition is clear: we are looking for a learning machine capable of finding an accurate approximation of a natural phenomenon, as well as expressing it in the form of a meaningful or an interpretable model. The authors adopt this view, and I fully subscribe to such a working hypothesis.
At the same time, this bias towards meaningfulness and interpretability opens several additional issues. The computer-generated hypotheses should take advantage of the already existing body of knowledge about the domain in question. In the case of two equally good approximations of the same data set: the one which blindly fits the data and the other which, in addition to the fit, also respects the background knowledge, we should be biased toward the latter one.
However, there are a few questions that I would appreciate the authors addressing in the manuscript to a greater depth.
1.) The fashion in which we express knowledge about the processes and make it available to the learning machine remains rather unclear. One can insist on strict adherence to the background knowledge principles - such as 100% mass balance accuracy. We declare this desire and hence this is referred to as declarative bias. Alternatively, one can treat the bias as an additional objective that should be treated simultaneously with the goodness of fit in the learning process. This is referred to as preferential bias. Declarative bias reduces search space but results in so-called broken ergodicity. Preferential bias results in a Pareto-optimal set of solutions. For further discussion in the context of water management see:
M Keijzer and V Babovic, 2002, Declarative and preferential bias in GP-based scientific discovery, Genetic Programming and Evolvable Machines 3 (1), 41-79
In the present paper, it would appear that the authors prefer declarative treatment of background knowledge. However, I would appreciate further analysis, comparison, and, if that is not possible, at least a discussion on preferential vs. declarative bias in the case studies described in their work.
2.) Bias Variance Tradeoff. Arguably incorporation of the knowledge bias affects model variance. In this case, bias denotes the difference between the average prediction of a model and the correct value which it is trying to predict. Variance is the variability of model prediction for a given data point or a value that tells us the spread of our data. For in-depth discussion see:
Hastie, T; Tibshirani, R; Friedman, J. H. (2009). The Elements of Statistical Learning, Springer
I would love to see a more in-depth analysis of the bias-variance tradeoff in the present case, and am looking forward to reading more about it in the revised version of the manuscript.
3.) Models vs. Predictions.
LSTM-type of ML models are extremely good at forecasting. The authors have eloquently argued in favour of the approach in this (as well as in previous) published research works. At the same time, one must consider if such a n ML approaches induce models or forecasters. On this topic I would advise the following recent works:
HMVV Herath et al., 2021, Hydrologically informed machine learning for rainfall–runoff modelling: towards distributed modelling, Hydrology and Earth System Sciences 25 (8), 4373–4401
HMVV Herath et al., 2021, Genetic programming for hydrological applications: to model or forecast that is the question, Journal of Hydroinformatics
In general, this is an interesting and potentially valuable contribution to the hydrological community. I am looking forward to reading the revised version of the manuscript.
-
AC2: 'Reply on RC1', Jonathan Frame, 29 Dec 2021
Thank you for your thoughtful review.
You ask a few questions that I think might be rhetorical, but I will answer them assuming that they are genuine.
RC1: [“in ML we are hoping to program computers by telling them what we want to achieve without having to explicitly instruct them how to achieve such goals.”]
The algorithms in ML are explicit instructions.
RC1: [“what is it that we want from those programs?”]
We want these programs to simulate the physical watershed processes resulting in streamflow during extreme runoff events.
RC1: [“Do they just need to be accurate or should we also be able to interpret them?”]
Accuracy and interpretation are not competing interests. ML models are just as interpretable as numerical solutions to partial differential equations (PDEs). As a matter of fact, the interpretation of PDEs and ML happens in the same way, through sensitivity analysis and visualization of 1) the relationship between inputs and diagnostic variables, and 2) the phase space of model states.
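As a sketch of what such a sensitivity analysis can look like for a sequence model (untrained weights and arbitrary sizes, purely to show the mechanics):
```python
import torch
import torch.nn as nn

# Input-gradient sensitivity: how strongly does the simulated peak flow
# respond to each forcing input at each time step?
lstm, head = nn.LSTM(5, 32, batch_first=True), nn.Linear(32, 1)

x = torch.randn(1, 365, 5, requires_grad=True)    # one year of 5 forcing variables
h, _ = lstm(x)
q = head(h).squeeze(-1)                           # simulated hydrograph, shape (1, 365)
q[0].max().backward()                             # scalar of interest: the peak flow

sens = x.grad[0].abs()                            # |d(peak flow) / d(input)| per day and forcing
print(sens.sum(dim=0))                            # aggregate sensitivity per forcing variable
```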
RC1: [“In the case of two equally good approximations of the same data set: the one which blindly fits the data and the other which, in addition to the fit, also respects the background knowledge, we should be biased toward the latter one. ”]
If background knowledge is desirable, and it does not hinder the value of the products, then it is surely better to have it. But I do not understand your phrase “blindly fits the data”. Is this an assumption that one can make good predictions without understanding how they are made? If so, it is not applicable here, as we do in fact know how the LSTM makes good predictions of streamflow during extreme events: the LSTM was trained to simulate the hydrological processes of a watershed by finding model parameters that represent dynamic relationships between static watershed attributes, dynamic atmospheric forcings, and streamflow response, exactly as any other dynamic hydrological model is.
RC1: [“The fashion in which we express knowledge about the processes and make it available to the learning machine remains rather unclear.”]
We express knowledge about the processes (I assume you mean hydrological processes) by setting up the input data (static watershed attributes and atmospheric forcings) that we assume will be informative to predict the target (streamflow). The “learning machine” has available to it a sample of the input and target data. We have set up our experiment here to test the hypothesis that the sample data is sufficient to learn a relationship that is suitable for extremely high runoff events.
RC1: [“One can insist on strict adherence to the background knowledge principles - such as 100% mass balance accuracy. We declare this desire and hence this is referred to as declarative bias. ”]
We do not have access to enough measurement data to attempt 100% mass balance accuracy. We cannot distinguish between losses from our watershed-scale control volume to the atmosphere or to the ground. We can only attempt to parameterize these losses through our model architecture.
RC1: [“Alternatively, one can treat the bias as an additional objective that should be treated simultaneously with the goodness of fit in the learning process. This is referred to as preferential bias. ”]
Preferential bias is “preference for one class of concepts over another class”. The LSTM, MC-LSTM and SAC-SMA are trained and calibrated to prefer a higher NSE score. We then test these models on their performance predicting the peak flow to address our hypothesis.
RC1: [“Declarative bias reduces search space but results in so-called broken ergodicity. Preferential bias results in a Pareto-optimal set of solutions.”]
And according to Keijzer and Babovic, reducing the search space does not help finding better solutions faster. We address this in our paper on lines 236-242.
RC1: [“In the present paper, it would appear that the authors prefer declarative treatment of background knowledge. However, I would appreciate further analysis, comparison, and, if that is not possible, at least a discussion on preferential vs. declarative bias in the case studies described in their work.”]
No, the authors do not prefer declarative treatment of background knowledge. According to Keijzer and Babovic, reducing the search space “does not help finding better solutions faster. In fact, for the class of scientific discovery problems the opposite seems to be the case.” We come to the same conclusion in our paper (lines 234-242): “It is important to understand that there is only one type of situation in which adding any type of constraint (physically-based or otherwise) to a data-driven model can add value: if constraints help optimization. Helping optimization is meant here in a very general sense, which might include processes such as smoothing the loss surface, casting the optimization into a convex problem, restricting the search space, etc. Neural networks (and recurrent neural networks) can emulate large classes of functions (Hornik et al., 1989; Schäfer and Zimmermann, 2007), and by adding constraints to this type of model we can only restrict (not expand) the space of possible functions that the network can emulate. This form of regularization is valuable only if it helps locate a better (in some general sense) local minimum on the optimization response surface (Mitchell, 1980). And it is only in this sense that constraints imposed by physical theory can add information relative to what is available purely from data.” We do not believe that further discussion is required.
RC1: [“Bias Variance Tradeoff. Arguably incorporation of the knowledge bias affects model variance. In this case, bias denotes the difference between the average prediction of a model and the correct value which it is trying to predict. Variance is the variability of model prediction for a given data point or a value that tells us the spread of our data.“]
This is well covered in our paper. We train/calibrate our models using the Nash-Sutcliffe Efficiency (NSE), which when decomposed includes a term for bias (the ratio of the means of observed and simulated flow) and a term for variance (the ratio of the standard deviations of observed and simulated flow). There is certainly a bias-variance tradeoff in our trained/calibrated models, but the NSE as a loss function is a good way to include these two terms. In Table 2 we present the results of both the Alpha-NSE (the variance term) and the Beta-NSE (the bias term).
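For concreteness, a sketch of the standard decomposition (Gupta et al., 2009), under the assumption that the paper’s Alpha-NSE and Beta-NSE follow the same convention:
```python
import numpy as np

# NSE = 2*alpha*r - alpha**2 - beta_n**2 (Gupta et al., 2009), with alpha the
# ratio of standard deviations (variance term), beta_n the normalized mean
# bias (bias term), and r the sim/obs linear correlation.
def nse_decomposition(sim, obs):
    alpha = np.std(sim) / np.std(obs)
    beta_n = (np.mean(sim) - np.mean(obs)) / np.std(obs)
    r = np.corrcoef(sim, obs)[0, 1]
    return 2 * alpha * r - alpha**2 - beta_n**2, alpha, beta_n

obs = np.array([1.0, 3.0, 8.0, 4.0, 2.0])         # toy hydrographs
sim = np.array([1.2, 2.7, 7.1, 4.4, 2.1])
print(nse_decomposition(sim, obs))                # (NSE, Alpha-NSE, Beta-NSE)
```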
According to The Elements of Statistical Learning, the goal is “to trade bias off with variance in such a way as to minimize the test error.” The bias-variance tradeoff is an analysis of how a model generalizes to data that are not part of the training set. We show in our results that the LSTM model generalizes from a training set without extremely large runoff events to low-probability, high-flow events that are not included in the training set. High bias can cause an algorithm to miss the relevant relations between features and target outputs; high variance may result from an algorithm modeling the random noise in the training data.
RC1: [“LSTM-type of ML models are extremely good at forecasting. The authors have eloquently argued in favour of the approach in this (as well as in previous) published research works. At the same time, one must consider if such a n ML approaches induce models or forecasters.”]
I am not sure of the question here. I think there were some typos, and I am not able to make out the last sentence. The Herath et al. papers are good, but not applicable to this study, which tests the specific hypothesis of deep learning predictions of extremely large runoff events that are not included in the training data.
-
RC2: 'Comment on hess-2021-423', Anonymous Referee #2, 23 Sep 2021
Recently, deep learning has produced more accurate rainfall-runoff predictions than conceptual and physically-based models; this is to be expected, as deep learning models are trained to be predictively accurate. It is also interesting that these models can generalise to different catchments and are being used to explain some of the mechanisms leading to runoff.
The paper provides a first-hand comparison of how different models (deep learning vs. conceptual vs. physical) fare on unseen extreme events. The comparison makes clear that the deep learning-based model outperforms the others on different accuracy metrics.
The paper also argues that deep learning provides the hydrological sciences community's most accurate rainfall-runoff simulations. While this might be true, it certainly requires comparison with the many different models available in different parts of the globe. Moreover, it might only be true when a large amount of data is available. Nevertheless, I would agree that deep learning provides one of the most accurate rainfall-runoff predictions, and in the future there is much potential for deep learning-based rainfall-runoff prediction.
One of the vital points raised in the paper concerns the conceptual flux equations, and it highlights the potential for improvement with a comparison to MC-LSTM.
Another interesting point concerns the FLV (bottom 30% low-flow bias). Analysing why the FLV increases (in magnitude) so drastically for the MC-LSTM could be an interesting direction to explore, especially for the low-probability years, as theoretically the machine learning model should have seen such low-flow data. The authors illustrate that any constraint restricts the space of possible functions that the network can emulate, yet the MC-LSTM was developed primarily to model exactly this type of situation, where an entity is conserved. Furthermore, unlike the other metrics, which did not deteriorate much, we see a drastic drop (increase in magnitude) in FLV. More analysis in this direction would be interesting for the readers as well.
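A sketch of this metric as commonly defined (after Yilmaz et al., 2008; whether the paper uses exactly this convention is an assumption here):
```python
import numpy as np

# %BiasFLV: percent bias in the low-segment volume of the flow duration curve,
# computed in log space over the lowest `fraction` of flows.
def fdc_flv(sim, obs, fraction=0.3):
    sim = np.sort(sim)[: int(fraction * len(sim))]    # lowest 30% of each series
    obs = np.sort(obs)[: int(fraction * len(obs))]
    sim = np.log(np.maximum(sim, 1e-6))               # guard against zero flows
    obs = np.log(np.maximum(obs, 1e-6))
    qsl = np.sum(sim - sim.min())                     # simulated low-segment volume
    qol = np.sum(obs - obs.min())                     # observed low-segment volume
    return -100.0 * (qsl - qol) / qol

rng = np.random.default_rng(0)
obs = rng.lognormal(0.0, 1.0, 1000)                   # toy streamflow series
sim = obs * rng.lognormal(0.0, 0.1, 1000)             # noisy "simulation"
print(fdc_flv(sim, obs))
```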
The paper highlights the potential of deep learning models to predict extreme events, while the hypothesis is that data-driven models lose reliability in extreme events more than models based on process understanding. The notion of reliability can be somewhat vague and should be clarified; the paper focuses only on predictive reliability here.
Overall, the paper provides the first comparison of predictive accuracy on unseen extreme events for a deep learning model, and it is a valuable contribution to the hydrological community.
- AC3: 'Reply on RC2', Jonathan Frame, 29 Dec 2021
-
RC3: 'Comment on hess-2021-423', Anonymous Referee #3, 26 Nov 2021
This is a really well motivated and focused paper. Please see the attached PDF for more detailed comments. I recommend accepting the paper as is, but wanted to outline some of my remaining questions while reviewing the manuscript. Within the PDF one will find suggestions that remain up to the authors and the editor to decide whether they are necessary for publication. Ultimately, while I clearly think that these suggestions are interesting avenues to explore, the current paper is extremely well focused and that should be commended.
- AC4: 'Reply on RC3', Jonathan Frame, 29 Dec 2021
Data sets
CAMELS return period analysis Jonathan M. Frame https://doi.org/10.4211/hs.c7739f47e2ca4a92989ec34b7a2e78dd
Model code and software
Model results analysis Jonathan M. Frame https://github.com/jmframe/mclstm_2021_extrapolate/tree/main/results
NeuralHydrology Frederik Kratzert https://github.com/neuralhydrology/neuralhydrology
Code for calibrating SAC-SMA Grey S. Nearing https://github.com/Upstream-Tech/SACSMA-SNOW17