General comments:
The idea of this study is to develop a regional streamflow model using a convolutional long short-term memory artificial neural network, which is the merger of two distinct deep learning (DL) techniques. This and several other innovations presented in the paper are quite impressive, and the overall performance of the model seems good. The revised paper is also in much better shape than the original submission, with considerably more detail given, much better figures, and some significant new analysis to shore up some weak points in the original study, such as including a linear benchmark model for comparison and additional sensitivity analyses demonstrating physically reasonable responses to perturbations in temperature fields, which (to some degree) ties into broader goals like explainable machine learning.
Unfortunately, the quality of the writing and explanations remains somewhat inadequate. The general impression one receives when reading this manuscript (which may or may not be true, but it is the impression one gets from the writing) is that the authors have some background with areas of geophysical science adjacent to watershed hydrology, but not watershed hydrology itself, and certainly not any aspect of operational hydrology or streamflow modeling. Similarly, treatments of machine learning in the manuscript seem to suggest familiarity with a very narrow range of sophisticated techniques but not a great awareness of the overall field of machine learning and, in particular, prior work on its application to streamflow modeling. Sadly, this may only serve to reinforce negative impressions among the water resource community as a whole about the general usefulness and credibility of machine learning – impressions that have been crippling in some important ways to the advancement of the field of hydrology.
Additionally, the primary technical innovation presented here – taking the long short-term memory (LSTM) neural network from time series analysis, which has recently seen several high-profile research applications to streamflow modeling, and adding in a convolutional neural network (CNN) from image analysis – involves an incremental advance over the existing LSTM approach, yet the existing LSTM approach is never implemented here. As a result, the paper cannot provide information on how much of an advantage, if any, the addition of a CNN architecture provides. Perhaps this is not strictly needed for publication, but it is an obvious limitation of the study that may compromise the adoption of this novel streamflow modeling technique by others.
Overall, this manuscript contains many clever and potentially powerful ideas but seems to be poorly executed, and an appropriate recommendation appears to be publication pending major revisions.
Detailed comments:
Line 10, delete “the region of”
Lines 31-32, not clear what the authors mean by a spatially distributed DL model. Use of the concepts of lumped, semi-distributed, and fully distributed hydrologic models has been pretty much exclusive to process-based models and it’s not clear how it can be extended to ML-based hydrologic models. Reading the rest of the paper, I can make an educated guess as to what the authors are trying to get at here, but the reader should not have to do this and it’s still not clear that the terms really apply as such to machine learning models in hydrology. Are they referring to multiple forecast points (corresponding to gages)? If so, that’s captured in the concept of a regional hydrology model, which is not necessarily the same thing as a fully distributed hydrology model (as many fully distributed models make predictions at only a single location, for example). Are they referring to fully spatially distributed (e.g. gridded) inputs? If so, that was successfully tackled decades ago in ML-based streamflow modeling by Hsieh et al. (2003). Moreover, the regional DL models with many input data and output prediction locations introduced by Kratzert et al. (2018, 2019a, 2019b) feel like they may be just as spatially distributed as the DL model introduced in this submission. The authors need to be much clearer and more specific on what they mean here and consider whether the confusion created by this mixing-and-matching of terminology is really beneficial to their ultimate purpose and the clarity and credibility of this submission. I suspect this is another case (the problem was widespread in the original submission) of slightly misusing standard hydrologic nomenclature. That said, see comment below re: line 137, where the manuscript handles all this much better.
Line 41: replace “total April-August streamflow” with “seasonal water supply”, which was the point of the exercise and is a major, mainstream task in water resource forecasting and management.
Lines 45-48: this basic description of ML in hydrology is clunky and imprecise. It could be easily read to imply the authors think that a Bayesian neural network is not an ANN, or that they think ANNs aren’t non-deep (in truth, traditional feedforward-error backpropagation ANNs of the sort being referred to here may be deep or non-deep depending on how many hidden layers they have), etc. Mistakes like this up-front in the introductory section may immediately draw the paper’s basic credibility into question, no matter how innovative and correct the actual technical work subsequently presented in the paper may be.
Line 53: here the authors are implying that ANNs are non-deep, whereas that may or may not be the case (see preceding comment). This error is just sloppy writing and is totally avoidable. Again, the overall impression one gets from these passages is that the authors are not very familiar or comfortable with the field of ML in general, which is not a helpful image to present to the reader.
Lines 56-57, comment about advantages of deep learning relative to "labour-intensive manual feature extraction often required for non-deep models" - essentially true but also substantially exaggerated, which again undermines the credibility of the manuscript. Automated predictor selection and feature creation techniques have been used in statistical modeling for decades and have appeared in non-deep machine learning too. A recent example is Fleming et al. (2021b).
Lines 70-75: good description, but might want to consider mentioning here that Kratzert et al. (2019a) additionally used spatially heterogeneous physical basin characteristics as predictors in regional LSTM models. I believe this may be mentioned later in the manuscript but ought to be briefly pointed out here.
Line 80: beach state classification in coastal geomorphology is another example; see Ellenson et al. (2020).
The introductory section’s discussion of explainability in machine learning is inadequate and under-referenced, especially from the viewpoint of socially relevant hydrologic model applications, i.e., things such as actual flood and water supply forecasting at government agencies and the like. At a minimum, on line 100, after the sentence ending with “making”, add the following: “Practical methods are beginning to appear that allow users to easily identify and geophysically interpret, in detail, spatiotemporal patterns or input-output relationships identified by, respectively, new unsupervised learning (e.g., Fleming et al., 2021a) and supervised learning (e.g., Fleming et al., 2021b) algorithms designed for applied operational hydrological modelling environments where interpretability is key. However, there is still much work to be done on developing new and better ways to further the goal of explainable machine learning for hydrology, in both deep and non-deep contexts and both operations and research settings.” Both of these cited manuscripts describe new but non-deep ML methods that are far more focused on, and successful at, providing extensive and complete geophysical interpretations than the method introduced in this submission or other deep learning work in hydrology so far.
Line 107: add “or practical” after “research” and before “questions”
Line 137: yes, nicely done! Compared to lines 31-32 (see comment above), this is a much better description of what’s being meant by a “distributed” model in the context of DL in this paper, though it’s still not clear that using the term to describe the CNN-LSTM application is particularly helpful.
Point 2 on line 140, the authors should modify the text slightly to be explicit that they’re referring to the spatially distributed input data
Line 153: this region is never referred to by anyone strongly familiar with it as “the south-central domain of western Canada” and this description doesn’t even make geographic sense (what’s “central” about it?). Ironically, statements like this are likely to undermine the credibility of the work specifically with hydrologists working in the paper’s study area. Maybe try just calling it what it is: “southwestern Canada” and/or “the southern portions of the western Canadian provinces of British Columbia and Alberta”.
Line 187, after “acceptable” insert “for the purposes of this study”
Lines 205 and 207, “maximized” is not quite the right word here; the term suggests optimization
Line 210, is “uniquely” the right word here? As described here, this is not unique to this region, as similar processes happen in other parts of the Canadian Prairies and presumably elsewhere as well. It’s the only part of the study region with that characteristic, though, which maybe is what the authors are trying to say?
Line 224: the standard statistical nomenclature is “unit variance” not “unity variance”
Line 331: insert “(from which CNN technologies primarily originated)” after “image processing”
Line 331: I’m not confident hydroclimatologists would view this type of work as “hydro-climatic modelling”
Line 332: after “the three weather predictors”, should “at all grid cells” be added to further clarify? I think that’s what’s meant here, but it needs to be entirely clear given the analogies being drawn here between video processing and spatiotemporal climate fields
Lines 370-375: this is mostly sound logic, except perhaps for the PDO, which is generally thought to typically remain in one state for a couple decades or so between regime shifts
Section 4.3.1: this is the first place in the manuscript that ensemble modeling is introduced, and strangely, no effort is made to explain here or anywhere else in the paper how that ensemble is formed. Between this treatment (or lack thereof) in the manuscript on the one hand and their rebuttal letter on the other hand, it’s not clear the authors are aware that there is more than one way to create an ensemble of ML models, and they do need to provide a brief explanation of how they did it (one sentence will suffice).
Line 460: “Gaussian distribution” not just “gaussian” which is informal slang
Lines 505-507: temperature degree-day models go a lot further back than this; provide several more references, including references specific to the use of this type of snow sub-model within standard watershed hydrology models
Figure 6: it is interesting that the overall test-phase NSE reported here for the Englishman River is substantially lower than that for the ensemble non-deep ANN for this same river described by Fleming et al. (2015). This may suggest that the advantage of simultaneous regional modeling across a large domain by the CNN-LSTM network introduced in this paper is accompanied by the disadvantage of weaker performance on a single given river of particular interest, which is often what water resource professionals are primarily concerned with – a particular river with socially destructive flooding events, for example, or a tributary to a reservoir that requires inflow predictions. This result might be traceable, at least in part, to the benefits of making specific choices, on the basis of general expertise in physical hydrology and familiarity with the particular watershed in question, that can be easily made when modeling a single watershed but are more cumbersome in a regional model. For example, the Englishman River-specific ensemble non-deep ANNs of Fleming et al. (2015) included snow pillow and antecedent streamflow data as inputs. That does not imply that the submitted study has done anything wrong – on the contrary, the result likely reflects an expected trade-off between scale and detail in a modeling system. The paper should explicitly acknowledge this point using the Englishman River, and the comparison to previous AI work in that basin, as what appears to be a clear example. For these reasons, the Fleming et al. (2015) study should also obviously be added to Table 2, giving an additional point of comparison for the Englishman River, which is already included in that table but only for a very old study presenting a lower-performing process-based hydrologic model.
Drop the term “heat map” from the manuscript. It’s a standard term in graphics production, but in the context of a manuscript dealing with various geophysical quantities including temperature, it’s unnecessarily ambiguous.
Lines 624-626: can the authors offer a specific hypothesis or two why the eastern and northeastern clusters show such a strong sensitivity to coastal conditions? Could it perhaps reflect some meteorological setup, e.g., jet stream position, storm tracks, etc.? It’s a very prominent feature of the results.
A major point of the article is that the resulting CNN-LSTM neural network provides results that are physically explainable in the sense that perturbations to driving fields yield the streamflow responses one would expect on the basis of physical hydrologic knowledge. That’s great, but it should be made very clear that this is not necessarily a unique attribute of deep learning – that has not been at all demonstrated here, and much the same might be expected from non-deep machine learning or even statistical models in the same application, provided they are built correctly.
Line 777: this is a grossly inadequate explanation
Line 800: “seasonal-scale input time series” – really? Decades-long time series with a daily sampling interval were used in this study, were they not? So, what are the authors trying to express here?
Lines 813-816: this supposition is inconsistent with the basics of physical hydrology. Interactions between streamflow and geology (aquifers, soil moisture storage, etc) directly and nonlinearly affect the temporal dynamics of streamflow responses to forcing meteorology. The passage is also inconsistent with the work of Kratzert et al. (2019a), who demonstrated that including static catchment characteristics as predictors in a LSTM streamflow model substantially improves performance, and Kratzert et al. (2019b), who demonstrated that a new variant they developed of the LSTM can extract features corresponding to static basin characteristics that capture geological and other watershed properties.
Line 880, “a single year of temperature and precipitation alone” – elsewhere the manuscript states that the data were over 1980-2015 (or 1979-2015, the paper is inconsistent on that point, e.g., between the abstract and conclusions). Even after subsetting the data into training and testing datasets, that still leaves several years, so where does the “single year” comment come from?
I believe the terms “validation” and “testing” are used inconsistently across the manuscript. Moreover, for the benefit of readers less familiar with machine learning, the manuscript should clearly explain the difference between training, validation, and testing datasets in the context of the CNN-LSTM network used here.
References:
Ellenson A, Simmons JA, Wilson GW, Hesser TJ, Splinter KD. 2020. Beach state recognition using Argus imagery and convolutional neural networks. Remote Sensing, 12, 3953.
Fleming SW, Bourdin DR, Campbell D, Stull RB, Gardner T. 2015. Development and operational testing of a super-ensemble artificial intelligence flood-forecast model for a Pacific Northwest river. Journal of the American Water Resources Association, 51, 502-512.
Fleming SW, Garen DC, Goodbody AG, McCarthy CS, Landers LC. 2021b. Assessing the new Natural Resources Conservation Service water supply forecast model for the American West: a challenging test of explainable, automated, ensemble artificial intelligence. Journal of Hydrology, 602, 126782.
Fleming SW, Vesselinov VV, Goodbody AG. 2021a. Augmenting geophysical interpretation of data-driven operational water supply forecast modeling for a western US river using a hybrid machine learning approach. Journal of Hydrology, 597, 126327.
Hsieh WW, Yuval, Li J, Shabbar A, Smith S. 2003. Seasonal prediction with error estimation of Columbia River streamflow in British Columbia. Journal of Water Resources Planning and Management, 129, 146-149.
Kratzert F, Klotz D, Brenner C, Schulz K, Herrnegger M. 2018. Rainfall-runoff modelling using long short-term memory (LSTM) networks. Hydrology and Earth System Sciences, 22, 6005-6022.
Kratzert F, Klotz D, Herrnegger M, Sampson AK, Hochreiter S, Nearing GS. 2019a. Toward improved predictions in ungauged basins: exploiting the power of machine learning. Water Resources Research, 55, 11344-11354.
Kratzert F, Klotz D, Shalev G, Klambauer G, Hochreiter S, Nearing G. 2019b. Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets. Hydrology and Earth System Sciences, 23, 5089-5110.