Evaluation and interpretation of convolutional long short-term memory networks for regional hydrological modelling

Anderson, Sam; Radić, Valentina

doi:https://doi.org/10.5194/hess-26-795-2022

Articles | Volume 26, issue 3

https://doi.org/10.5194/hess-26-795-2022

© Author(s) 2022. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article | 14 Feb 2022

Final revised paper (published on 14 Feb 2022)
Supplement to the final revised paper
Preprint (discussion started on 11 Mar 2021)

Interactive discussion

Status: closed

RC1: 'Comment on hess-2021-113', Anonymous Referee #1, 20 Apr 2021

Manuscript Number: HESS-2021-113

Title: Evaluation and interpretation of convolutional-recurrent networks for regional hydrological modelling

This manuscript presents an interesting application of a deep learning (DL) approach for modelling streamflow responses across 226 streamflow gauges in southwestern Canada. The paper is well written and mostly easy to follow. I find the application of the DL approach to streamflow simulation quite innovative and a worthy addition to the growing body of literature in this field. The paper will be of widespread interest to the community. Overall, the paper offers plenty of interesting work; however, some effort is needed to communicate the results more effectively and to highlight the key findings and novelty for the journal audience, e.g., by better describing the temperature and spatial perturbation results and linking them with previous studies. Additional discussion is also needed on what the DL method brings to the table in comparison with traditional process-based models. Furthermore, while the application of DL methods to streamflow simulation is interesting, it is not entirely clear how this approach could be used for real-world applications. There are also questionable choices in some of the methods and data used.

I also find most figure captions lacking in detail; they could be expanded so that readers do not have to scroll up and down the paper to understand the figures. I hope these comments are helpful and I look forward to reading the revised manuscript. My detailed comments are given below.

Major comments

It is not clear what the application of the DL method brings to the table in comparison with traditional process-based hydrologic models. This is an important question, as the application of DL methods may be limited to predicting within the range of the training datasets. Additionally, while the authors state the development and evaluation of a DL method for streamflow simulation as their objectives, they do not state how the DL method could be used for real-world applications beyond the proof-of-concept approach presented in this paper.

The clustering method divided the study region into six clusters based on seasonal streamflow, latitude and longitude in order to fine-tune the model training. However, a number of studies describe the spatial heterogeneity of the region. For instance, streamflow responses on the lee- and windward sides of the Coast and Rocky Mountains, as well as in the mountainous areas and interior plains, are known to be quite different (e.g., Moore 1991; Shrestha et al. 2012). Therefore, I would think that including variables like slope and aspect would better characterize the spatial heterogeneity and provide clusters that better capture the variability in the streamflow response. Better clustering can potentially improve the model fine-tuning and model performance in several regions, especially on the eastern slopes of the Rocky Mountains, where the model performed relatively poorly.

It is surprising to see that the study used 0.75° x 0.75° resolution ERA5 reanalysis data, especially given that the authors stated that finer-resolution climate data may improve model performance (L637). I wonder why the authors did not use the finer-resolution data readily available for the region (e.g., Werner et al. 2019).

The authors described the DL methods as if the study were on image/video processing. While the methods may be the same as in image/video processing, section 4 needs to be rephrased in terms of hydro-climatic modelling.

Specific comments

L95-110: The objective and novelty need to be revised by clearly describing what the DL method brings to the table compared to the process-based models, and how the DL method could be used for real-world applications.

L135: What is the range of basin areas for the selected stations?

L138: Naturalized flow generally means regulated flow adjusted with regulation/abstraction removed. The correct term is natural flow.

L140-145: 40% missing data can lead to challenges in model setup. I wonder whether model performance was poorer for basins with missing data than for basins with complete data sets.

L158-170: As stated earlier, including slope and aspect may improve the cluster selection and model performance.

Figure 2: State in figure caption how the discharge values are normalized. Similarly, the authors need to provide more details in all Figure captions.

L189: It is not clear how the gridded weather data are mapped to the streamflow stations: are the nearest grid cells used, or mean values from several grid cells?

L315-316: Since the previous 365 days of data are required, is Jan. 1, 1980 the first day used for streamflow training?

L364: The spatial perturbation section is hard to follow. How was the amplitude of 1 used in the perturbation of climate fields?

L411: Also say that the temperature perturbations are constant throughout the time period.

L425: On what basis/reference were the criteria for freshet timing defined?

The results in Figure 4b have not been adequately described in the text.

Figure 6: Name the basins for which these example results are presented.

L492: Given that streamflow at a hydrometric station is a response to precipitation and temperature over the entire drainage basin, higher sensitivity to areas near the station and within its basin is to be expected. This needs to be clarified in the context of how big the drainage basins are and whether the inclusion of precipitation and temperature variables from a wider region improved the model performance.

L562: How is the intensity of freshet calculated?

L562-580: The results in Figures 9 and 10 seem to be consistent with previous climate change impact studies in the region. This is quite promising and is perhaps one of the results the authors can highlight further. I suggest expanding the discussion in this section by linking it with previous climate impact studies.

L589: Rephrase the sentence, it appears as if previous studies also used deep learning.

L626-627: While it is true that the non-contributing areas may have played a part in the DL results in parts of the eastern cluster, the cited studies are outside of the study region and not directly comparable. Also, non-contributing areas may not be a factor for the entire region. There are maps available which outline the extent of non-contributing areas.

Table 2 heading: State the period of the test set used. Also, it should be clarified in the heading that various validation periods were used in the reference models.

Discussion and Conclusions: the changes suggested above also apply to the results and discussion section.

References

Moore, R. D., 1991: Hydrology and water supply in the Fraser River basin. Water in Sustainable Development: Exploring Our Common Future in the Fraser River Basin, 21–40.

Shrestha, R. R., M. A. Schnorbus, A. T. Werner, and A. J. Berland, 2012: Modelling spatial and temporal variability of hydrologic impacts of climate change in the Fraser River basin, British Columbia, Canada. Hydrological Processes, 26, 1840–1860, https://doi.org/10.1002/hyp.9283.

Werner, A. T., M. A. Schnorbus, R. R. Shrestha, A. J. Cannon, F. W. Zwiers, G. Dayon, and F. Anslow, 2019: A long-term, temporally consistent, gridded daily meteorological dataset for northwestern North America. Scientific Data, 6, 180299, https://doi.org/10.1038/sdata.2018.299.
Citation: https://doi.org/10.5194/hess-2021-113-RC1
- AC1: 'Reply on RC1', Sam Anderson, 30 Jun 2021
  
  See reply on RC1 in attached file.
  
  Citation: https://doi.org/10.5194/hess-2021-113-AC1
RC2: 'Comment on hess-2021-113', Anonymous Referee #2, 03 Jun 2021

General comments:

This is an intriguing study that combines two distinct deep-learning technologies (the convolutional neural network, CNN, and long short-term memory neural network, LSTM) to create a new method for regional daily streamflow prediction that integrates complex spatiotemporal structures and dependencies. The method is applied to streamflow data from the southern portion of Canada’s two westernmost provinces, which is a geophysically complex and interesting region. Some effort is also made to address physical interpretation and meaningfulness of the technique. It is a promising study with widely relevant results that has strong potential for publication in a top-tier hydrology journal like HESS.

That said, the submission as it currently stands appears to have some substantial issues that need to be addressed before it can be considered for publication. The overall feel of how the manuscript is written is one of technical naivete and oversimplification, undermining the credibility of the study. For example, the text of the paper and possibly some of the analytical steps suggest a superficial understanding of the physical hydrology of western Canadian rivers and their associated datasets; and overall, the literature review around machine learning and its hydrologic applications is wholly inadequate and does not provide the reader with accurate and meaningful context to the study. Additionally, several basic elements one normally expects of a machine learning paper today seem to be missing, like clear descriptions of training vs. testing vs. validation data subsets, or the use of informative benchmark models to evaluate the new model against. The study is also not reproducible based on the limited information provided in the paper.

My recommendation is to accept the paper for publication in HESS pending major revisions. I hope the detailed comments provided below, as well as the references section that follows those detailed comments, will be helpful to the authors as they revise their manuscript.

Detailed comments:

* Line 30: Should also cite Hsu et al. (1995) here, as to my knowledge it was the first peer-reviewed journal paper to present the use of machine learning for rainfall-runoff modeling. (Full literature citations are provided below.)

* Lines 34-37: This feels like an overstatement/misstatement of both the limitations of conventional machine learning and the advantages of deep learning in a hydrologic prediction context. For one thing, a basic result in AI, dating back to the late 1980s or so, is that non-deep ANNs (in particular, multilayer perceptrons having a single hidden layer) are theoretically capable of learning any continuous relationship. Another issue: contrary to what is implied in the passage, non-deep ANNs are not the only kind of non-deep machine learning – there are several other major classes (random forests, support vector machines, and so forth). There also continues to be intense research in non-deep ML to create new kinds of AI, including new kinds of neural networks, having certain useful characteristics that have been successfully applied to river prediction; online sequential learning is an obvious example (e.g., Lima et al., 2015, 2016, 2017). Indeed, new kinds of non-deep machine learning algorithms are being developed specifically for hydrometeorological analysis and prediction tasks (e.g., Cannon, 2010, 2011, 2018; Fleming et al., 2015, 2019, 2021). On the other hand, deep learning applications in hydrology are currently in vogue and seem to be very promising in certain circumstances, but the body of work on the subject – particularly around streamflow prediction – remains exceedingly small, and the ultimate suitability of deep learning to this task, including capabilities and limitations, remains unclear at this point. A more mature way of looking at deep learning in hydrologic prediction is that work to date suggests it is a promising research direction that could potentially offer an alternative or complementary approach to non-deep machine learning for certain tasks.

* Lines 49-50: The use of point observations (of weather, presumably) does not necessarily imply that a model is spatially lumped. It is very common in process-based hydrologic modeling, including semi-distributed and fully distributed models, to spatially interpolate measurements from point data sources. In fact, some process-based models even integrate that spatial interpolation step into the software platform, along with adjustments for adiabatic lapse rates, etc., etc.

* Lines 67-70: Explainability is an issue for all machine learning models, not just deep learning models; it feels like this passage is conflating ML generally with DL specifically. For a recent example of a new non-deep ML technique specifically introduced to improve interpretability of a practical hydrologic prediction model, see Fleming et al. (2021), which also provides a much better explanation of exactly why geophysical explainability is a key requirement for practical applications of machine learning in hydrologic prediction.

* Lines 78-79: The authors are not using the terms white-box and (in particular) black-box in the way they are usually used. Most people working in hydrology, in particular, would regard any physically explainable ML as being white-box in some sense. The term “black-box” is normally reserved for machine learning algorithms that do not offer any physical interpretability, which is to say, most of them.

* Lines 85-86: It would be useful to note the similarities and differences between recurrent and LSTM neural networks here for a general readership. The text seems to be haphazardly switching between the two, which are related but not identical; LSTM is essentially a specific and advanced form of recurrent ANN. This applies to the title of the paper too; why "recurrent" instead of "long short-term memory"?

* Lines 104-105 are a bit off as well. There seems to be an implication here that more complex models are better models, and that in contrast this study is aiming for parsimonious models. That’s an odd way of looking at the desirability of different modeling approaches and structures. Most modelers view a parsimonious model as being fundamentally better, holding all else equal, i.e., so-called Occam’s razor.

* In addition to the various other papers referenced in this review that should be cited in the paper but were not, the authors may also wish to read and cite the review articles by Reichstein et al. (2019) and McGovern et al. (2019). Citing prior applications of machine learning to hydrologic and related modeling in the study area would also be appropriate. Some examples that come to mind include Rasouli et al. (2012), Lima et al. (2015, 2016, 2017), Snauffer et al. (2018), Fleming et al. (2015), Hsieh et al. (2003), and Shrestha et al. (2021).

* Figure 1 would be much better, especially for an international readership that is unlikely to be strongly familiar with the study area, if it was a multi-panel figure that additionally illustrated topography, mean annual temperature, mean annual precipitation, and perhaps mean April 1 snow water equivalent.

* Lines 137-139: Perhaps this passage is merely poorly written, but as it stands, the text implies a disturbing lack of understanding of the streamflow data being modeled. Naturalized flow data are flow data that have been adjusted for upstream water management activities – diversions, withdrawals, reservoir operations, etc. Data for stations upstream of dams are not necessarily naturalized, contrary to what is implied in this passage of the paper, and certainly in datasets like the HYDAT database used here, that step has not been undertaken and in many cases is unnecessary. Similarly, dams are not the only disturbance that results in non-natural streamflow data that would in principle require naturalization prior to use in a hydrologic modeling study of the sort done here; another obvious example is land use change. Why not use the Reference Hydrometric Basin Network (RHBN) stations or something similar? There is no mention here at all of the RHBN station network, which has been very widely used for decades for hydrological analysis and modeling studies in Canada. Also, I think quite a few hydrologists would raise their eyebrows at the specific data selection and processing procedures described in the first paragraph of section 3.1.

* The second paragraph of section 3.1 is also muddled. All that’s needed here is a concise statement that hydrometric network density is much higher in southern than northern Canada, and so, for the purposes of this study, the authors focused on the former.

* While the approach described on lines 159-170 is interesting and perhaps sufficient for the purposes of this study, overall it appears to be a naïve representation of spatiotemporal pattern formation in streamflow regimes in this study area. At an absolute minimum, some acknowledgement of prior work, and some caveats about the simple method and assumptions used here for regime classification, are needed. See in particular Halverson and Fleming (2015) and references cited therein. A particularly notable omission is that glacier-fed rivers are not identified as a distinct regime, whereas glacial cover is well-known to be a major control of streamflow dynamics in several areas within this region; see Moore et al. (2009), Fleming et al. (2016), Jost et al. (2012), and Bidlack et al. (2021).

* Section 3.1: I think reproducibility requires that the hydrometric station list used here be shown to readers. A table in an appendix or supplementary materials would be fine.

* Section 3.5: provide information about the latency of the ERA5 reanalysis product – is it available in near-real time? Some reanalysis products are, and some aren’t. It’s a crucial question if one were interested in operationalizing a hydrologic prediction system like this for actual use in flood forecasting or another similar practical hydrologic prediction application. If ERA5 products are not available in near-real time, then briefly but clearly state that limitation and its implications for wider use of the modeling framework introduced here.

* “data” = plural

* Somewhere in Section 3 or 4 there needs to be an explicit and clear description of what the training vs. testing vs. validation datasets are. There is a very brief mention of training vs validation but it is inadequate. The reader is not provided with information about how the training vs validation split is made, nor whether another subset is reserved for out-of-sample hyperparameter selection. These are standard practices in machine learning, and information about them is needed for transparency, reproducibility, and credibility of the study.
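
To illustrate the kind of explicit split description requested in this comment, the following is a minimal sketch of a chronological train/validation/test partition of a daily record. The function name and the period lengths are hypothetical and are not taken from the manuscript; the only point is that the three subsets are contiguous, ordered in time, and non-overlapping, with the validation period available for out-of-sample hyperparameter selection.

```python
import numpy as np

def chronological_split(n_days, n_val, n_test):
    """Split day indices 0..n_days-1 into contiguous train, validation and
    test periods, in that order. All lengths here are illustrative."""
    if n_val <= 0 or n_test <= 0 or n_val + n_test >= n_days:
        raise ValueError("validation and test periods must leave room for training")
    idx = np.arange(n_days)
    train_end = n_days - n_val - n_test
    val_end = n_days - n_test
    # Earliest days train the model, the middle block tunes hyperparameters,
    # and the most recent block is held out for the final evaluation.
    return idx[:train_end], idx[train_end:val_end], idx[val_end:]

# Hypothetical example: 1000 days, the last 200 for testing and the
# 200 before that for validation.
train_idx, val_idx, test_idx = chronological_split(1000, n_val=200, n_test=200)
```

Stating the actual calendar dates of each period in the manuscript would serve the same purpose as the index arrays do here.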

* A modern paper on machine learning applications to hydrologic prediction requires, in general, a performance comparison against some relevant benchmark model. Linear regression using precisely the same input dataset as the deep learning method introduced here is an obvious starting point and can provide a meaningful assessment of how much nonlinearity, interactions, etc contribute to the (presumably better) performance of the new technique. A conventional ANN and an LSTM would also be useful, if more ambitious, points of comparison. The only significant attempt the paper makes at this is Table 2, which scours the peer-reviewed journal literature for examples of hydrologic models that have been developed previously for a few of the locations considered in this study. That comparison is interesting and probably worth including in the paper, but it also has limited meaningfulness as different date ranges etc were used in the studies. Moreover, Table 2 relies on a small handful of academic studies and misses a lot of existing models within the study area operated by pragmatic water-management organizations like a large government-owned hydroelectric utility (BC Hydro), a provincial ministry (BC River Forecast Center), regional water management authorities (e.g., the MIKE-SHE model operated in the Okanagan Basin), and so forth. Moreover, given that even the simplest machine learning architecture outperforms process-based models in most cases, the somewhat mixed results in Table 2 are a little surprising. In Section 5 there is also a very brief verbal comparison against the LSTM-based work of Kratzert et al. (2018) but that study used a completely different set of basins and data, so again, the comparison is extremely approximate. 
I get that the purpose of this study is more around demonstrating a new technology, and perhaps delving a little into the question of explainability, but I suspect most readers would like to see more meaningful inter-model performance comparisons here.
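
The linear benchmark suggested in this comment could look roughly like the sketch below: an ordinary least-squares model fitted on the training period and scored on held-out data with the Nash-Sutcliffe efficiency (NSE). The predictors and target are synthetic stand-ins, and the function names are hypothetical; in practice the inputs would be the same temperature and precipitation data given to the DL model.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the mean of obs."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def fit_linear_benchmark(X_train, y_train):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply the fitted coefficients to new predictors."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Synthetic stand-in data (assumption): two predictors and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 2.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)

# Fit on the first 400 samples, evaluate on the remaining 100.
coef = fit_linear_benchmark(X[:400], y[:400])
benchmark_nse = nse(y[400:], predict_linear(coef, X[400:]))
```

The gap between this benchmark NSE and the NSE of the DL model on the same test period gives a rough measure of how much nonlinearity and interactions contribute.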

* Estimating predictive uncertainty is a key element of a hydrologic prediction system. Figure 6 and its caption suggest that predictive uncertainty is quantitatively estimated here but are vague about the method. It appears that an ensemble of 10 different models is formed, and twice the standard deviation of the predictions from those 10 models on a given day is used as the de facto prediction bound for that day. This is a reasonable first-cut approach, I think. However, the method needs to be described in the methods section, and some capabilities and limitations need to be mentioned; I suspect that because weather uncertainty is not factored in (as far as I can tell from the manuscript as submitted) the ensemble spread will be substantially under-dispersive.
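
The ensemble-based bound described in this comment can be sketched as follows. This is only an illustration of the procedure as the referee infers it from Figure 6, not the authors' code: the function names, the use of the sample standard deviation (ddof=1), and the toy ensemble are assumptions. The coverage helper shows one simple way to check for under-dispersion, by counting how often observations fall inside the bounds.

```python
import numpy as np

def ensemble_bounds(member_predictions, k=2.0):
    """Daily ensemble mean and mean +/- k standard deviations.

    member_predictions has shape (n_members, n_days).
    """
    preds = np.asarray(member_predictions, dtype=float)
    mean = preds.mean(axis=0)
    spread = preds.std(axis=0, ddof=1)  # sample standard deviation across members
    return mean, mean - k * spread, mean + k * spread

def coverage(obs, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    obs = np.asarray(obs, dtype=float)
    return float(np.mean((obs >= lower) & (obs <= upper)))

# Toy ensemble (assumption): 10 members that differ only by small random noise
# around a seasonal signal, standing in for 10 independently trained models.
rng = np.random.default_rng(1)
signal = 5.0 + 3.0 * np.sin(np.linspace(0.0, 2.0 * np.pi, 365))
members = signal + 0.2 * rng.normal(size=(10, 365))
mean, lower, upper = ensemble_bounds(members)
```

If the coverage of real observations is well below what a two-standard-deviation bound would nominally imply, the spread is under-dispersive in the sense raised above.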

* The bar for explainability does not seem to be set very high here. The sensitivity analyses included in the paper are very useful, but they really amount to more of a plausibility test than an interpretability test. In particular, the paper demonstrates, through observing the CNN-LSTM responses to perturbations in the meteorological driving data, that its streamflow predictions (a) are most sensitive to weather in and near the basin as opposed to further away, and (b) are sensitive to temperature regimes and, in particular, demonstrate hydrograph timing shifts corresponding to changes in snow accumulation and melt driven by temperature perturbations. Those results suggest the CNN-LSTM model is capturing key geophysical processes more-or-less correctly, but they do not clearly reveal physical explanations of the input-output relationships – only that the behaviors are consistent with some basic physical expectations. I think the paper is publishable without diving further into explainability, but the authors ought to phrase their outcomes a little more precisely around the question of interpretability and may wish to consider some additional sleuthing to demonstrate that the CNN-LSTM reveals physical processes. There is some precedent for this in machine learning-based streamflow modeling, and looking closely at those precedents may be useful to the authors; examples include Fleming (2007), Kratzert et al. (2018), and Fleming et al. (2021). Looking even more broadly across the literature than this would likely lead to even more suggestions of how to examine the geophysical relationships the model is capturing.

* Line 629, “it is notable that the CNN-LSTM model achieves good streamflow simulation with only temperature and precipitation forcing data” – well, in practice the most widely applied hydrologic models tend to use only these two types of forcing because that’s all that is usually available, so I guess this point might be worth mentioning here but it’s not particularly “notable” to most streamflow modelers.

* Lines 635-638: is it possible that, through its empirical and complex meteorological input-hydrologic output mappings – effectively, a transfer function linking the meteorological data to the point streamflow observations – the CNN-LSTM effectively downscaled the reanalysis data, at least to some degree? May be worth talking about here.

* Lines 646-653: are the authors sure their method requires less data than an LSTM, as claimed here? Doesn’t the CNN-LSTM still ultimately need data for all N basins? This passage needs further explanation/clarification.

References:

Bidlack AL, Bisbing SM, Buma BJ, Diefenderfer HL, Fellman JB, Floyd WC, Giesbrecht I, Lally A, Lertzman KP, Perakis SS, Butman DE, D’Amore DV, Fleming SW, Hood EW, Hunt BPV, Kiffney PM, McNicol G, Menounos B, Tank SE. 2021. Climate-mediated changes to linked terrestrial and marine ecosystems across the Northeast Pacific Coastal Temperate Rainforest margin. Bioscience, doi.org/10.1093/biosci/biaa171.

Cannon AJ. 2010. A flexible nonlinear modelling framework for nonstationary generalized extreme value analysis in hydroclimatology. Hydrological Processes, 24, 673-685.

Cannon AJ. 2011. Quantile regression neural networks: implementation in R and application to precipitation downscaling. Computers and Geosciences, 37, 1277-1274.

Cannon AJ. 2018. Non-crossing nonlinear regression quantiles by monotone composite quantile regression neural network, with application to rainfall extremes. Stochastic Environmental Research and Risk Assessment, 32, 3207-3225.

Fleming SW. 2007. Artificial neural network forecasting of nonlinear Markov processes. Canadian Journal of Physics, 85, 279-294.

Fleming SW, Bourdin DR, Campbell D, Stull RB, Gardner T. 2015. Development and operational testing of a super-ensemble artificial intelligence flood-forecast model for a Pacific Northwest river. Journal of the American Water Resources Association, 51, 502-512.

Fleming SW, Goodbody AG. 2019. A machine learning metasystem for robust probabilistic nonlinear regression-based forecasting of seasonal water availability in the US West. IEEE Access, 7, 119943-119964.

Fleming SW, Hood E, Dahlke HE, O’Neel S. 2016. Seasonal flows of international British Columbia-Alaska rivers: the nonlinear influence of ocean-atmosphere circulation patterns. Advances in Water Resources, 87, 42-55.

Fleming SW, Vesselinov VV, Goodbody AG. 2021. Augmenting geophysical interpretation of data-driven operational water supply forecast modeling for a western US river using a hybrid machine learning approach. Journal of Hydrology, 597, 126327.

Halverson MJ, Fleming SW. 2015. Complex network theory, streamflow, and hydrometric monitoring system design. Hydrology and Earth System Sciences, 19, 3301-3318.

Hsieh WW, Yuval, Li J, Shabbar A, Smith S. 2003. Seasonal prediction with error estimation of Columbia River streamflow in British Columbia. Journal of Water Resources Planning and Management, 129, 146-149.

Hsu K, Gupta HV, Sorooshian S. 1995. Artificial neural network modeling of the rainfall-runoff process. Water Resources Research, 31, 2517-2530.

Jost G, Moore RD, Menounos B, Wheate R. 2012. Quantifying the contribution of glacier runoff to streamflow in the upper Columbia River Basin, Canada. Hydrology and Earth System Sciences, 16, 849-860.

Kratzert F, Klotz D, Brenner C, Schulz K, Herrnegger M. 2018. Rainfall-runoff modelling using Long Short-Term Memory (LSTM) networks. Hydrology and Earth System Sciences, 22, 6005-6022.

Lima AR, Cannon AJ, Hsieh WW. 2015. Nonlinear regression in environmental sciences using extreme learning machines: a comparative evaluation. Environmental Modelling and Software, 73, 175-188.

Lima AR, Cannon AJ, Hsieh WW. 2016. Forecasting daily streamflow using online sequential extreme learning machines. Journal of Hydrology, 537, 431-443.

Lima AR, Hsieh WW, Cannon AJ. 2017. Variable complexity online sequential extreme learning machine, with applications to streamflow prediction. Journal of Hydrology, 555, 983-994.

McGovern A, Lagerquist R, Gagne DJ II, Jergensen GE, Elmore KL, Homeyer CF, Smith T. 2019. Making the black box more transparent: understanding the physical implications of machine learning. Bulletin of the American Meteorological Society, November, 2175-2199.

Moore RD, Fleming SW, Menounos B, Wheate R, Fountain A, Stahl K, Holm K, Jakob M. 2009. Glacier change in western North America: influences on hydrology, geomorphic hazards, and water quality. Hydrological Processes, 23, 42-61.

Rasouli K, Hsieh WW, Cannon AJ. 2012. Daily streamflow forecasting by machine learning methods with weather and climate inputs. Journal of Hydrology, 414/415, 284-293.

Reichstein M, Camps-Valls G, Stevens B, Jung M, Denzler J, Carvalhais N, Prabhat. 2019. Deep learning and process understanding for data-driven Earth system science. Nature, 566, 195-204.

Shrestha RR, Bonsal BR, Bonnyman JM, Cannon AJ, Najafi MR. 2021. Heterogeneous snowpack response and snow drought occurrence across river basins of northwestern North America under 1.0 °C to 4.0 °C global warming. Climatic Change, 164, 40.

Snauffer AM, Hsieh WW, Cannon AJ, Schnorbus MA. 2018. Improving gridded snow water equivalent in British Columbia, Canada: multi-source data fusion by neural network methods. The Cryosphere, 12, 891-905.

Citation: https://doi.org/10.5194/hess-2021-113-RC2
- AC2: 'Reply on RC2', Sam Anderson, 30 Jun 2021
  
  See reply on RC2 in attached file.
  
  Citation: https://doi.org/10.5194/hess-2021-113-AC2

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

ED: Reconsider after major revisions (further review by editor and referees) (11 Aug 2021) by Jim Freer

AR by Sam Anderson on behalf of the Authors (11 Aug 2021) Author's response Manuscript

ED: Referee Nomination & Report Request started (25 Aug 2021) by Jim Freer

RR by Anonymous Referee #2 (01 Sep 2021)

Suggestions for revision or reasons for rejection

General comments:

The idea of this study is to develop a regional streamflow model using a convolutional long short-term memory artificial neural network, which is the merger of two distinct deep learning (DL) techniques. This and several other innovations presented in the paper are quite impressive, and the overall performance of the model seems good. The revised paper is also in much better shape than the original submission, with considerably more detail given, much better figures, and some significant new analysis to shore up some weak points in the original study, such as including a linear benchmark model for comparison and additional sensitivity analyses demonstrating physically reasonable responses to perturbations in temperature fields, which (to some degree) ties into broader goals like explainable machine learning.

Unfortunately, the quality of the writing and explanations remains somewhat inadequate. The general impression one receives when reading this manuscript (which may or may not be true, but it is the impression one gets from the writing) is that the authors have some background with areas of geophysical science adjacent to watershed hydrology, but not watershed hydrology itself, and certainly not any aspect of operational hydrology or streamflow modeling. Similarly, treatments of machine learning in the manuscript seem to suggest familiarity with a very narrow range of sophisticated techniques but not a great awareness of the overall field of machine learning and, in particular, prior work on its application to streamflow modeling. Sadly, this may only serve to reinforce negative impressions among the water resource community as a whole about the general usefulness and credibility of machine learning – impressions that have been crippling in some important ways to the advancement of the field of hydrology.

Additionally, the primary technical innovation presented here – taking the long short-term memory (LSTM) neural network from time series analysis, which has recently seen several high-profile research applications to streamflow modeling, and adding in a convolutional neural network (CNN) from image analysis – involves an incremental advance over the existing LSTM approach, yet the existing LSTM approach is never implemented here. As a result, the paper cannot provide information on how much of an advantage, if any, the addition of a CNN architecture provides. Perhaps this is not strictly needed for publication, but it is an obvious limitation of the study that may compromise the adoption of this novel streamflow modeling technique by others.

Overall, this manuscript contains many clever and potentially powerful ideas but seems to be poorly executed, and it feels like an appropriate recommendation is for publication pending major revisions.

Detailed comments:

Line 10, delete “the region of”

Lines 31-32, not clear what the authors mean by a spatially distributed DL model. Use of the concepts of lumped, semi-distributed, and fully distributed hydrologic models has been pretty much exclusive to process-based models and it’s not clear how it can be extended to ML-based hydrologic models. Reading the rest of the paper, I can make an educated guess as to what the authors are trying to get at here, but the reader should not have to do this and it’s still not clear that the terms really apply as such to machine learning models in hydrology. Are they referring to multiple forecast points (corresponding to gages)? If so, that’s captured in the concept of a regional hydrology model, which is not necessarily the same thing as a fully distributed hydrology model (as many fully distributed models make predictions at only a single location, for example). Are they referring to fully spatially distributed (e.g. gridded) inputs? If so, that was successfully tackled decades ago in ML-based streamflow modeling by Hsieh et al. (2003). Moreover, the regional DL models with many input data and output prediction locations introduced by Kratzert et al. (2018, 2019a, 2019b) feel like they may be just as spatially distributed as the DL model introduced in this submission. The authors need to be much clearer and more specific on what they mean here and consider whether the confusion created by this mixing-and-matching of terminology is really beneficial to their ultimate purpose and the clarity and credibility of this submission. I suspect this is another case (the problem was widespread in the original submission) of slightly misusing standard hydrologic nomenclature. That said, see comment below re: line 137, where the manuscript handles all this much better.

Line 41: replace “total April-August streamflow” with “seasonal water supply”, which was the point of the exercise and is a major, mainstream task in water resource forecasting and management.

Lines 45-48: this basic description of ML in hydrology is clunky and imprecise. It could be easily read to imply the authors think that a Bayesian neural network is not an ANN, or that they think ANNs aren’t non-deep (in truth, traditional feedforward-error backpropagation ANNs of the sort being referred to here may be deep or non-deep depending on how many hidden layers they have), etc. Mistakes like this up-front in the introductory section may immediately draw the paper’s basic credibility into question, no matter how innovative and correct the actual technical work subsequently presented in the paper may be.

Line 53: here the authors are implying that ANNs are non-deep, whereas that may or may not be the case (see preceding comment). This error is just sloppy writing and is totally avoidable. Again, the overall impression one gets from these passages is that the authors are not very familiar or comfortable with the field of ML in general, which is not a helpful image to present to the reader.

Lines 56-57, comment about advantages of deep learning relative to "labour-intensive manual feature extraction often required for non-deep models" - essentially true but also substantially exaggerated, which again undermines the credibility of the manuscript. Automated predictor selection and feature creation techniques have been used in statistical modeling for decades and have appeared in non-deep machine learning too. A recent example is Fleming et al. (2021b).

Lines 70-75: good description, but might want to consider mentioning here that Kratzert et al. (2019a) additionally used spatially heterogeneous physical basin characteristics as predictors in regional LSTM models. I believe this may be mentioned later in the manuscript but ought to be briefly pointed out here.

Line 80: beach state classification in coastal geomorphology is another example; see Ellenson et al. (2020).

The introductory section’s discussion of explainability in machine learning is inadequate and under-referenced, especially from the viewpoint of socially relevant hydrologic model applications, i.e., things such as actual flood and water supply forecasting at government agencies and the like. At a minimum, on line 100, after the sentence ending with “making”, add the following: “Practical methods are beginning to appear that allow users to easily identify and geophysically interpret, in detail, spatiotemporal patterns or input-output relationships identified by, respectively, new unsupervised learning (e.g., Fleming et al., 2021a) and supervised learning (e.g., Fleming et al., 2021b) algorithms designed for applied operational hydrological modelling environments where interpretability is key. However, there is still much work to be done on developing new and better ways to further the goal of explainable machine learning for hydrology, in both deep and non-deep contexts and both operations and research settings.” Both of these cited manuscripts describe new but non-deep ML methods that are far more focused on, and successful at, providing extensive and complete geophysical interpretations than the method introduced in this submission or other deep learning work in hydrology so far.

Line 107: add “or practical” after “research” and before “questions”

Line 137: yes, nicely done! Compared to lines 31-32 (see comment above), this is a much better description of what’s being meant by a “distributed” model in the context of DL in this paper, though it’s still not clear that using the term to describe the CNN-LSTM application is particularly helpful.

Point 2 on line 140, the authors should modify the text slightly to be explicit that they’re referring to the spatially distributed input data

Line 153: this region is never referred to by anyone strongly familiar with it as “the south-central domain of western Canada” and this description doesn’t even make geographic sense (what’s “central” about it?). Ironically, statements like this are likely to undermine the credibility of the work specifically with hydrologists working in the paper’s study area. Maybe try just calling it what it is: “southwestern Canada” and/or “the southern portions of the western Canadian provinces of British Columbia and Alberta”.

Line 187, after “acceptable” insert “for the purposes of this study”

Lines 205 and 207, “maximized” is not quite the right word here; the term suggests optimization

Line 210, is “uniquely” the right word here? As described here, this is not unique to this region, as similar processes happen in other parts of the Canadian Prairies and presumably elsewhere as well. It’s the only part of the study region with that characteristic, though, which maybe is what the authors are trying to say?

Line 224: the standard statistical nomenclature is “unit variance” not “unity variance”

Line 331: insert “(from which CNN technologies primarily originated)” after “image processing”

Line 331: I’m not confident hydroclimatologists would view this type of work as “hydro-climatic modelling”

Line 332: after “the three weather predictors”, should “at all grid cells” be added to further clarify? I think that’s what’s meant here, but it needs to be entirely clear given the analogies being drawn here between video processing and spatiotemporal climate fields

Lines 370-375: this is mostly sound logic, except perhaps for the PDO, which is generally thought to typically remain in one state for a couple decades or so between regime shifts

Section 4.3.1: this is the first place in the manuscript that ensemble modeling is introduced, and strangely, no effort is made to explain here or anywhere else in the paper how that ensemble is formed. Between this treatment (or lack thereof) in the manuscript on the one hand and their rebuttal letter on the other hand, it’s not clear the authors are aware that there is more than one way to create an ensemble of ML models, and they do need to provide a brief explanation of how they did it (one sentence will suffice).
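
To illustrate the point that an ensemble can be formed in more than one way, the following minimal sketch builds one by bootstrap resampling of the training data (bagging). It is an assumption-laden illustration only: a straight-line fit stands in for the neural network, the data are synthetic, and all names (`fit_line`, `n_members`) are hypothetical. It does not describe how the ensemble in the manuscript was formed. The other common route, training identical models from different random weight initializations, would replace the resampling step with a change of random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): a noisy straight line.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, x.size)

def fit_line(x, y):
    """Least-squares line fit, used here as a stand-in for model training."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

# Bagging: each member is trained on a bootstrap resample of the data.
n_members = 10
members = []
for _ in range(n_members):
    idx = rng.integers(0, x.size, x.size)  # sample indices with replacement
    members.append(fit_line(x[idx], y[idx]))

# One prediction series per member, then the ensemble mean across members.
preds = np.array([slope * x + intercept for slope, intercept in members])
ensemble_mean = preds.mean(axis=0)
```

The spread of `preds` across members gives a rough indication of the uncertainty attributable to the training sample.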

Line 460: “Gaussian distribution” not just “gaussian” which is informal slang

Lines 505-507: temperature degree-day models go a lot further back than this; provide several more references, including references specific to the use of this type of snow sub-model within standard watershed hydrology models

Figure 6: it is interesting that the overall test-phase NSE reported here for the Englishman River is substantially lower than that for the ensemble non-deep ANN for this same river described by Fleming et al. (2015). This may suggest that the advantage of simultaneous regional modeling across a large domain by the CNN-LSTM network introduced in this paper is accompanied by the disadvantage of weaker performance on a single given river of particular interest, which is often what water resource professionals are primarily concerned with – a particular river with socially destructive flooding events, for example, or a tributary to a reservoir that requires inflow predictions. This result might be traceable, at least in part, to the benefits of making specific choices, on the basis of general expertise in physical hydrology and familiarity with the particular watershed in question, that can be easily made when modeling a single watershed but are more cumbersome in a regional model. For example, the Englishman River-specific ensemble non-deep ANNs of Fleming et al. (2015) included snow pillow and antecedent streamflow data as inputs. That does not imply that the submitted study has done anything wrong – on the contrary, the result likely reflects an expected trade-off between scale and detail in a modeling system. The paper should explicitly acknowledge this point using the Englishman River, and the comparison to previous AI work in that basin, as what appears to be a clear example. For these reasons, the Fleming et al. (2015) study should also obviously be added to Table 2, giving an additional point of comparison for the Englishman River, which is already included in that table but only for a very old study presenting a lower-performing process-based hydrologic model.
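
For readers less familiar with the metric being compared, the Nash-Sutcliffe efficiency (NSE) is one minus the ratio of the squared simulation error to the variance of the observations about their mean. The sketch below is a minimal illustration using made-up numbers. The function name `nse` and the sample values are hypothetical and are not taken from either study.

```python
import numpy as np

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    obs = np.asarray(observed, dtype=float)
    sim = np.asarray(simulated, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

# Hypothetical daily streamflow values, for illustration only.
obs = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

perfect = nse(obs, obs)                                 # simulation equals observations
mean_only = nse(obs, np.full_like(obs, obs.mean()))     # simulation is the observed mean
```

A simulation that reproduces the observations exactly scores 1, and one that predicts only the observed mean scores 0, which is why the NSE of a regional model at one gauge can be compared directly with that of a single-basin model over the same test period.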

Drop the term “heat map” from the manuscript. It’s a standard term in graphics production, but in the context of a manuscript dealing with various geophysical quantities including temperature, it’s unnecessarily ambiguous.

Lines 624-626: can the authors offer a specific hypothesis or two why the eastern and northeastern clusters show such a strong sensitivity to coastal conditions? Could it perhaps reflect some meteorological setup, e.g., jet stream position, storm tracks, etc.? It’s a very prominent feature of the results.

A major point of the article is that the resulting CNN-LSTM neural network provides results that are physically explainable in the sense that perturbations to driving fields yield the streamflow responses one would expect on the basis of physical hydrologic knowledge. That’s great, but it should be made very clear that this is not necessarily a unique attribute of deep learning – that has not been at all demonstrated here, and much the same might be expected from non-deep machine learning or even statistical models in the same application, provided they are built correctly.

Line 777: this is a grossly inadequate explanation

Line 800: “seasonal-scale input time series” – really? Decades-long time series with a daily sampling interval were used in this study, were they not? So, what are the authors trying to express here?

Lines 813-816: this supposition is inconsistent with the basics of physical hydrology. Interactions between streamflow and geology (aquifers, soil moisture storage, etc) directly and nonlinearly affect the temporal dynamics of streamflow responses to forcing meteorology. The passage is also inconsistent with the work of Kratzert et al. (2019a), who demonstrated that including static catchment characteristics as predictors in a LSTM streamflow model substantially improves performance, and Kratzert et al. (2019b), who demonstrated that a new variant they developed of the LSTM can extract features corresponding to static basin characteristics that capture geological and other watershed properties.

Line 880, “a single year of temperature and precipitation alone” – elsewhere the manuscript states that the data were over 1980-2015 (or 1979-2015, the paper is inconsistent on that point, e.g., between the abstract and conclusions). Even after subsetting the data into training and testing datasets, that still leaves several years, so where does the “single year” comment come from?

I believe the terms “validation” and “testing” are used inconsistently across the manuscript. Moreover, for the benefit of readers less familiar with machine learning, the manuscript should clearly explain the difference between training, validation, and testing datasets in the context of the CNN-LSTM network used here.
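
As a concrete illustration of the distinction, the sketch below splits a daily record chronologically into three non-overlapping periods: training data are used to fit the model weights, validation data are used to decide when to stop training or to choose hyperparameters, and testing data are held back entirely for the final performance assessment. The split years and the function name `chronological_split` are hypothetical. Only the overall 1980-2015 span is taken from the comments above, and the manuscript's actual periods may differ.

```python
import numpy as np

def chronological_split(years, train_end, val_end):
    """Return boolean masks for training, validation and testing periods."""
    years = np.asarray(years)
    train = years <= train_end                      # used to fit model weights
    val = (years > train_end) & (years <= val_end)  # used for tuning and early stopping
    test = years > val_end                          # used once, for final evaluation
    return train, val, test

years = np.arange(1980, 2016)  # 1980 to 2015 inclusive
# Hypothetical split boundaries, chosen only for illustration.
train, val, test = chronological_split(years, train_end=2000, val_end=2010)
```

Because the masks do not overlap, no year used for fitting or tuning contributes to the reported test-phase skill.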

References:

Ellenson A, Simmons JA, Wilson GW, Hesser TJ, Splinter KD. 2020. Beach state recognition using Argus imagery and convolutional neural networks. Remote Sensing, 12, 3953.

Fleming SW, Bourdin DR, Campbell D, Stull RB, Gardner T. 2015. Development and operational testing of a super-ensemble artificial intelligence flood-forecast model for a Pacific Northwest river. Journal of the American Water Resources Association, 51, 502-512.

Fleming SW, Garen DC, Goodbody AG, McCarthy CS, Landers LC. 2021b. Assessing the new Natural Resources Conservation Service water supply forecast model for the American West: a challenging test of explainable, automated, ensemble artificial intelligence. Journal of Hydrology, 602, 126782.

Fleming SW, Vesselinov VV, Goodbody AG. 2021a. Augmenting geophysical interpretation of data-driven operational water supply forecast modeling for a western US river using a hybrid machine learning approach. Journal of Hydrology, 597, 126327.

Hsieh WW, Yuval, Li J, Shabbar A, Smith S. 2003. Seasonal prediction with error estimation of Columbia River streamflow in British Columbia. Journal of Water Resources Planning and Management, 129, 146-149.

Kratzert F, Klotz D, Brenner C, Schulz K, Herrnegger M. 2018. Rainfall-runoff modelling using long short-term memory (LSTM) networks. Hydrology and Earth System Sciences, 22, 6005-6022.

Kratzert F, Klotz D, Herrnegger M, Sampson AK, Hochreiter S, Nearing GS. 2019a. Toward improved predictions in ungauged basins: exploiting the power of machine learning. Water Resources Research, 55, 11344-11354.

Kratzert F, Klotz D, Shalev G, Klambauer G, Hochreiter S, Nearing G. 2019b. Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets. Hydrology and Earth System Sciences, 23, 5089-5110.

RR by Anonymous Referee #1 (01 Oct 2021)

ED: Publish subject to revisions (further review by editor and referees) (29 Oct 2021) by Jim Freer

AR by Sam Anderson on behalf of the Authors (21 Nov 2021)

ED: Publish subject to minor revisions (further review by editor) (14 Dec 2021) by Jim Freer

AR by Sam Anderson on behalf of the Authors (27 Dec 2021)

ED: Publish as is (14 Jan 2022) by Jim Freer

AR by Sam Anderson on behalf of the Authors (17 Jan 2022)

Short summary

We develop and interpret a spatiotemporal deep learning model for regional streamflow prediction at more than 200 stream gauge stations in western Canada. We find the novel modelling style to work very well for daily streamflow prediction. Importantly, we interpret model learning to show that it has learned to focus on physically interpretable and physically relevant information, which is a highly desirable quality of machine-learning-based hydrological models.