the Creative Commons Attribution 4.0 License.
Machine Learning in Stream/River Water Temperature Modelling: a review and metrics for evaluation
Abstract. As climate change continues to affect stream/river (henceforth stream) systems worldwide, stream water temperature (SWT) is an increasingly important indicator of distribution patterns and mortality rates among fish, amphibians, and macroinvertebrates. Technological advances tracing back to the mid-20th century have improved our ability to measure SWT at varying spatial and temporal resolutions for the fundamental goal of better understanding stream function and ensuring ecosystem health. Despite significant advances, there continue to be numerous stream reaches, stream segments and entire catchments that are difficult to access for a myriad of reasons, including but not limited to physical limitations. Moreover, there are noted access issues, financial constraints, and temporal and spatial inconsistencies or failures with in situ instrumentation. Over the last few decades and in response to these limitations, statistical methods and physically based computer models have been steadily employed to examine SWT dynamics and controls. Most recently, the use of artificial intelligence, specifically machine learning (ML) algorithms, has garnered significant attention and utility in hydrologic sciences, specifically as a novel tool to learn undiscovered patterns from complex data and try to fill data streams and knowledge gaps. Our review found that the most recent five years (2020–2024) saw a similar number of publications using ML (27) as the previous 20 years (2000–2019), for a total of 54.
The aim of this work is three-fold: first, to provide a concise review of the use of ML algorithms in SWT modeling and prediction; second, to review ML performance evaluation metrics as they pertain to SWT modeling and prediction, identify the commonly used metrics, and suggest guidelines for easier comparison of ML performance across SWT studies; and third, to examine how ML use in SWT modeling has enhanced our understanding of spatial and temporal patterns of SWT and where progress is still needed.
Status: final response (author comments only)
- RC1: 'Comment on hess-2024-256', Anonymous Referee #1, 23 Sep 2024
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2024-256/hess-2024-256-RC1-supplement.pdf
- AC2: 'Reply on RC1', Claudia Corona, 28 Nov 2024
- RC2: 'Comment on hess-2024-256', Anonymous Referee #2, 30 Sep 2024
This is a meaningful manuscript that provides a thorough review of machine learning approaches for stream water temperature modeling and their evaluation metrics. I believe that the current scientific community has indeed developed a broad understanding of the integration of machine learning into stream temperature modeling. Hence, while the manuscript presents a comprehensive overview, incorporating more in-depth insights could enhance its appeal to readers and significantly increase its contribution to the field.
The review covers a wealth of content, including recent articles and other reviews, but the sections are somewhat loosely structured, with key points relatively briefly mentioned.
For instance, in the first section (Overview: Stream Water Temperature Model Types), the author provides a solid overview of statistical, physical, and machine learning models. However, a more detailed analysis of the comparative strengths and weaknesses of physical and machine learning models would strengthen the discussion. The models are presented in a nearly linear developmental order in this review, but it would be beneficial to address some questions. For example, if physical models perform well, why are machine learning models adopted? How can the trust of traditional model users in machine learning methods be gained? (This question is inherently challenging, as model users often have preferences based on their own familiarity with certain models and may exhibit biases against alternative approaches. However, it may be worth acknowledging this in the review.) This discussion could extend to the choice between different machine learning models as well, as conclusions favoring one model over another often depend on the specific context of the study. Many conclusions are applicable only under particular circumstances, so a generalization such as “a certain model is better suited to a particular type of problem” is more appropriate.
Furthermore, the author may not clearly (and separately) present the generalization capabilities of machine learning models in temporal and spatial contexts, which is crucial for how the data are split. A model's ability to generalize over time is particularly meaningful for climate change studies, where overfitting (very common in machine learning studies) may lead to highly unreliable projections. Spatial generalization is highly useful for applying models to new regions or watersheds (ungauged streams, rivers, or watersheds).
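As a rough illustration of the distinction between temporal and spatial generalization drawn above, the following minimal Python sketch splits a small set of records in two ways: by time (train on earlier years, test on later ones) and by site (hold out an entire gauge as if it were ungauged). The `Record` type, the function names, the site labels, and the temperature values are all hypothetical and made up for illustration; they are not taken from the manuscript or from any reviewed study.

```python
from typing import NamedTuple


class Record(NamedTuple):
    site: str     # hypothetical gauge identifier
    year: int     # year of the observation
    swt_c: float  # stream water temperature in degrees Celsius (made-up values)


def temporal_split(records, first_test_year):
    """Train on years before first_test_year, test on that year and later."""
    train = [r for r in records if r.year < first_test_year]
    test = [r for r in records if r.year >= first_test_year]
    return train, test


def spatial_split(records, test_sites):
    """Hold out whole sites so the test set mimics ungauged locations."""
    held_out = set(test_sites)
    train = [r for r in records if r.site not in held_out]
    test = [r for r in records if r.site in held_out]
    return train, test


# Illustrative records only: three sites, four years each.
records = [
    Record(site, year, 10.0 + offset + 0.1 * (year - 2018))
    for offset, site in enumerate(["A", "B", "C"])
    for year in range(2018, 2022)
]

time_train, time_test = temporal_split(records, first_test_year=2020)
space_train, space_test = spatial_split(records, test_sites=["C"])

print(len(time_train), len(time_test))    # records before 2020 vs. 2020 onward
print(len(space_train), len(space_test))  # sites A and B vs. held-out site C
```

A random shuffle of the same records would mix years and sites across both sets, which is why metrics from a random split say little about performance in unseen periods or at unseen locations.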
Additionally, the review does not systematically address the critical issue of model input selection, which is essential in machine learning modeling. Model inputs for stream water temperature modeling may include hydrometeorological and physical parameters (or other attributes used in different studies). They play a role in model performance and should be discussed in this part.
In the second section, the authors do an excellent job summarizing model evaluation metrics. However, considering that machine learning models are often optimized to achieve superior performance on these metrics, there is (always) a risk of overfitting. Thus, beyond focusing on metrics, the review should also highlight the importance of more rigorous evaluation to further assess generalization ability. For instance, if a stream water temperature model is built to run climate change scenarios, additional testing and more rigorous designs are essential to evaluate the model's ability to generalize over time. For robust long-term predictions, the model is supposed to maintain robust predictive performance in completely unseen periods, rather than being limited to a specific temporal range.
Overall, this review is informative and well-researched, and with more refined organization and deeper exploration of these key issues, it could make a substantial contribution to the field of stream water temperature research.
Citation: https://doi.org/10.5194/hess-2024-256-RC2
- AC1: 'Reply on RC2', Claudia Corona, 28 Nov 2024
- RC3: 'Comment on hess-2024-256', Jeremy Diaz, 07 Oct 2024
I believe that this manuscript is a very useful and extensive methods literature review regarding stream temperature modeling. I would recommend approval with minor revisions to provide additional details from the reviewed literature and correct minor writing aspects; I had no problem with the general structure/flow or quality.
- Section 2.3.3 (“Newer/recent ML algorithms”) introduces RNNs, CNNs, and GNNs sufficiently, but it should probably give some description and reference to attention-based transformers. I am not aware of their application to SWT, but they are responsible for broader interest in ML (e.g., ChatGPT, which was cited earlier) and have had mixed success in hydrologic modeling. This class of models seems easily placed as a future direction.
- There are some examples of unusual subsection and paragraph formatting. For example, section 1.1 is one paragraph which is approximately 1 page long. It seems that this is excessively large for one paragraph and that a named subsection should perhaps be more than just one (regularly sized) paragraph. Line 201 has another approximately 1-page-long paragraph; this area might be better organized with another level of subsections rather than fitting the more extensive references of decision trees into 1 paragraph.
- There is an extensive background of traditional ANNs (2.3.2) which is debatably too extensive given the description of ANN variants and backpropagation alternatives (e.g., lines 284-320), which are relatively niche and rare. The content already exists and is not wrong, but if length were a concern, I would reduce this area.
- This work does not address predictive uncertainty or the lack thereof associated with the ML literature review. I think that would be a worthwhile addition because I suspect most efforts lack that (e.g., referring to https://doi.org/10.5194/hess-26-1673-2022). A counterexample to the lack of uncertainty quantification, which may also be relevant to section 2.5, could be work led by Jacob Zwart focusing on SWT for reservoir operations (thermal releases). Examples being https://doi.org/10.1111/1752-1688.13093 or https://doi.org/10.3389/frwa.2023.1184992
- In section 3 (e.g., 3.1, 3.3, 3.4), I would recommend adding some discussion regarding the equivalence, or lack thereof, between lower-case r and r-squared, upper-case R-squared, and NSE. I am very comfortable stating that for the purpose of this continuously valued model evaluation, upper-case R-squared and NSE are equivalent, but I am less comfortable making the assertion that lower-case r and r-squared are (in all the papers reporting this value). This is likely further complicated by the reviewed literature using the lower-case r-squared and R-squared interchangeably, but given the 0-1 range, the high value skew, and the special case/conditional equivalences, I believe these values should all be reported together to characterize goodness of fit – especially that upper-case R-squared and NSE should not be separated.
- In line 761, it feels controversial and a step too far to say ML models should be held to a higher standard. It feels less problematic to apply these higher, seemingly attainable standards to all SWT models. For example, a physics-based model is not "very good" by virtue of being a physics-based model, instead it's the same "satisfactory" label because its physics are not sufficient or accurate enough to do what the ML models can.
- If possible, in addition to considering spatial extents and temporal resolution of the papers, it would be interesting to know the aggregation level of the data, if that is reported, and what all the possibilities are. For example, individual gages with input data collected in situ at the same gage location, or remotely sensed data subset to the drainage area for the reach that a gage is on. Are any works modeling dense transects along a river, or modeling raster grid cells up and across a river (i.e., the 2D surface area)?
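The section 3 comment above about lower-case r-squared, upper-case R-squared, and NSE can be illustrated with a small Python sketch. It assumes the common definitions: lower-case r is the Pearson correlation, and NSE is 1 - SS_res / SS_tot, which is the same formula as the coefficient of determination when predictions are scored directly against observations. The temperature values are made up, and the function names are illustrative only; papers in the reviewed literature may define their reported values differently.

```python
import math
from statistics import mean


def pearson_r(obs, sim):
    """Lower-case r: Pearson correlation between observed and simulated values."""
    obs_mean, sim_mean = mean(obs), mean(sim)
    cov = sum((o - obs_mean) * (s - sim_mean) for o, s in zip(obs, sim))
    var_obs = sum((o - obs_mean) ** 2 for o in obs)
    var_sim = sum((s - sim_mean) ** 2 for s in sim)
    return cov / math.sqrt(var_obs * var_sim)


def nse(obs, sim):
    """Nash-Sutcliffe efficiency, 1 - SS_res / SS_tot.

    Same formula as the coefficient of determination (upper-case R-squared)
    when predictions are compared directly with observations.
    """
    obs_mean = mean(obs)
    ss_res = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_tot = sum((o - obs_mean) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


# Made-up temperatures (degrees Celsius); the "model" is exactly 3 degrees too warm.
observed = [10.0, 12.0, 14.0, 16.0, 18.0]
simulated = [o + 3.0 for o in observed]

r_sq = pearson_r(observed, simulated) ** 2  # correlation ignores the constant bias
nse_value = nse(observed, simulated)        # NSE penalizes the constant bias

print(f"r-squared = {r_sq:.3f}, NSE = {nse_value:.3f}")
```

With this constant warm bias, squared Pearson r is 1.000 while NSE is -0.125, so the two only agree in special cases (for example, an unbiased least-squares fit scored on its own fitting data), which is why reporting them together is informative.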
Additional literature to consider. Not necessary
- The paragraph at line 385 related to process guidance prompted me to recommend https://doi.org/10.1029/2023WR035327 as very relevant. The reference is concerned with comparing different hybrid ML methods for SWT modeling to represent groundwater processes which aren’t as represented here (e.g., relative to reservoir influence/reservoir adjacent modeling)
- In section 4.2, https://doi.org/10.1029/2020WR028091 may be a very relevant addition in-line with the author’s narrative.
Minor writing comments:
- The sentence beginning on line 51 perhaps uses too bold language when stating “AI … create reasonable choices”. Many users of AI and scientists have concerns regarding the reasonableness of AI. Maybe it would be more accurate to further connect with the latter part of that sentence and say that “AI … learn optimal patterns to meet stated objectives” (which may or may not be broadly reasonable)
- Starting at line 131, “We define newer ML as those introduced in hydrologic modeling in the few years,” perhaps this should say “in recent years”?
- At line 380, although it can be inferred, “WNN” is never explicitly defined.
- At line 541, “all journals examined used least one”, perhaps this should say, “at least one”
- By typo/mistake, it appears that two subsections in section 3 are titled "Model Performance Metrics: Error Indices"
- At line 610, there is a typo claiming an upper bound of -1
I have the benefit of reviewing 3rd, so I read the other reviewers’ comments after making my own. I agree that a characterization of the validation and test sets used would be very beneficial (e.g., spatial, temporal, spatiotemporal exclusion, etc.), but I believe the concerns of overfitting are potentially overstated by the other reviewers given that this manuscript reports train, validation, and test set metrics (and the very strong agreement between the three).
Disclaimer: I propose some additional literature (n = 4-5), and I am a coauthor on 1 of them. I do not view including that literature as mandatory, and only proposed additional sources based on their relevance to the content of this manuscript. I selected "No" to anonymity to avoid any appearance of subversive influence.
Citation: https://doi.org/10.5194/hess-2024-256-RC3
- AC3: 'Reply on RC3', Claudia Corona, 28 Nov 2024
Viewed
HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
---|---|---|---|---|---|---|
455 | 180 | 199 | 834 | 39 | 5 | 6 |