the Creative Commons Attribution 4.0 License.
Machine-learning methods for stream water temperature prediction
Moritz Feigl
Katharina Lebiedzinski
Mathew Herrnegger
Karsten Schulz
Download
- Final revised paper (published on 31 May 2021)
- Supplement to the final revised paper
- Preprint (discussion started on 14 Jan 2021)
Interactive discussion
Status: closed
- RC1: 'Comment on hess-2020-670', Salim Heddam, 30 Jan 2021
- AC1: 'Authors answers to RC1', Moritz Feigl, 02 Mar 2021
Dear Reviewer,
We thank you for your encouraging and positive feedback and sincerely thank you for your insightful comments and suggestions. Please find our answers to your comments below.
Review:
The comparison of model results with in situ measured data using only error metrics is insufficient and does not help in providing robust conclusions regarding model accuracy, robustness and fitting capabilities. Specifically, using several kinds of goodness-of-fit indicators would be more useful: the coefficient of determination (R²), the Nash-Sutcliffe efficiency (NSE), and the index of agreement d are highly recommended for hydrological model evaluation (Legates and McCabe 1999; Moriasi et al. 2007; Harmel and Smith 2007; Gupta 1998, 2008; Krause et al. 2005).
Answer:
We agree that by choosing a variety of metrics, a more complete picture of model performance can be shown. However, we noticed that some metrics are not sensitive enough to compare the results of this study. The NSE values of the presented models were all >0.9 and usually around 0.98. While we could still see differences in model performance in RMSE and MAE, there were no or hardly any differences in the first three decimals of the NSE. Similar observations were made for the coefficient of determination and the index of agreement. Therefore, we think that adding these metrics to the main text would decrease readability while adding little information, but we will add them to the appendix to allow for future comparisons.
The following table contains an overview of all metrics for the best ML models and the two benchmark models per catchment:
We used the index of agreement by Willmott, 1981.
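For reference, the metrics discussed here (RMSE, MAE, NSE, and Willmott's index of agreement d) can be computed in a few lines. The sketch below is purely illustrative, with invented example values; it is not the evaluation code used in the study:

```python
import numpy as np

def rmse(obs, sim):
    """Root mean square error."""
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def mae(obs, sim):
    """Mean absolute error."""
    return float(np.mean(np.abs(obs - sim)))

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    return float(1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))

def willmott_d(obs, sim):
    """Index of agreement (Willmott, 1981)."""
    denom = np.sum((np.abs(sim - obs.mean()) + np.abs(obs - obs.mean())) ** 2)
    return float(1 - np.sum((obs - sim) ** 2) / denom)

# Made-up daily stream temperatures (°C) for illustration only
obs = np.array([10.1, 11.3, 12.0, 13.2, 12.5])
sim = np.array([10.0, 11.5, 12.2, 13.0, 12.4])
print(rmse(obs, sim), mae(obs, sim), nse(obs, sim), willmott_d(obs, sim))
```

A perfect simulation gives NSE = d = 1; as noted above, near-perfect fits compress NSE and d into a narrow range, which is why RMSE and MAE discriminate better between these models.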
References:
Willmott, C. J. (1981). On the validation of models, Physical Geography, 2(2), 184–194. https://doi.org/10.1080/02723646.1981.10642213
Review:
Model structures need to be clarified. In Lines 173-175, the authors argued that including the lags of all variables for the 4 previous days can help in improving model accuracy according to Webb et al. (2003). First, using only 4 previous lags should be justified; on which basis was it selected (i.e., cross-correlation analysis can be helpful for answering this question)? Second, according to Webb et al. (2003), adopting previous lags as input variables can be useful only in the hourly data scenario. Therefore, a comparison between models with and without lagged data may be a good option.
Answer:
This is indeed an important point and we agree that we should explain our reasoning. Our decision on the number of lags was based on the results of an explorative data analysis of daily data we carried out before starting to work on the model structures. This included three major parts that guided our decision:
- Assessing partial autocorrelation plots of water temperatures: They showed significant partial autocorrelations usually up to the 4th lag, as illustrated in the following plot showing the partial autocorrelation of stream water temperatures of the Ybbs catchment.
- Assessing variable significance in a linear regression model using multiple lags: This also pointed to the fact that 4 lags are an adequate range for the given basins.
- Assessing variable importance in a simple Random Forest (RF) Model, which also pointed to the fact that 4 lags are a reasonable choice, as is visible from the following plot showing the RF variable importance for predicting water temperatures in the Ybbs catchment. Variable importance refers to how much a given model "uses" that variable to make accurate predictions.
Our initial analysis indicated that lags are important and that 4 is a good choice for our set of basins, which resulted in our decision for this study. Nevertheless, this is a relevant initial decision. Our results lead us to assume that this might be quite different for other basins and that a more dynamic approach might be valuable. This could, for example, be choosing the lag time depending on the mean concentration time or the size of a catchment, and should be explored in future studies.
We will add a short summary of these initial analyses after Line 173 to explain our choice of numbers of lags:
“The lag period of 4 days was chosen based on an initial data analysis that included (i) assessing partial autocorrelation plots of water temperatures, (ii) testing for significance of lags in linear regression models, and (iii) checking variable importance of lags in a random forest model.”
Regarding the study of Webb et al. (2003): While they only showed the importance of lagged variables for hourly data, they also summarized findings regarding lagged air temperature for daily data (Grant, 1977; Jeppesen & Iversen, 1987; Stefan & Preud’homme, 1993). These previous findings, together with our initial analyses (especially the RF variable importance), made it clear that daily lags are relevant inputs and necessary for a high model performance.
References:
Grant, P. J. (1977). Water temperatures of the Ngaruroro river at three stations. Journal of Hydrology, 16(2), 148–157. https://www.jstor.org/stable/43944413?seq=1
Jeppesen, E., & Iversen, T. M. (1987). Two Simple Models for Estimating Daily Mean Water Temperatures and Diel Variations in a Danish Low Gradient Stream. Oikos, 49(2), 149. https://doi.org/10.2307/3566020
Stefan, H. G., & Preud’homme, E. B. (1993). Stream temperature estimation from air temperature. JAWRA Journal of the American Water Resources Association, 29(1), 27–45. https://doi.org/10.1111/j.1752-1688.1993.tb01502.x
Review:
The introduction is not deeply written and in some cases needs improvement. Specifically, the ML approaches reported in the literature should be presented and discussed, and highlighting the strengths and weaknesses of each one would be more useful and effective. Using lumped references does not help in understanding the main contributions of the work.
Answer:
Thank you for pointing this out; comparing the different models instead of only giving an overview of past applications will make this section more informative. For the revised manuscript, we propose to include a more general overview of the different model approaches and their strengths and weaknesses.
Review:
Research gap. What are the main contributions of the present study in comparison to what is already done? What does it add to the existing literature?
Answer:
We agree, this might not be stated clearly enough in the manuscript yet. We think that the summary of the important contributions of this study in your general comments is very much on point. We will include it in the last paragraph of the introduction.
Review:
Lines 47 to 50, from Austria to characteristics. In our opinion this paragraph is more suitable to be moved to Section 2.1.
Answer:
Agreed, we will change it according to your suggestion.
Review:
Line 79: “To the author’s knowledge, RF has not been applied for river water temperature prediction yet”. This statement is incorrect. RF was recently reported as a powerful tool for predicting river water temperature (Heddam et al. 2020).
Answer:
Thank you for pointing this out. This is indeed an interesting and relevant publication. We propose to change the sentence in line 79 to the following:
“To date, only one previous study, by Heddam et al. (2020), has applied RF for predicting lake surface temperatures.”
Review:
Model comparison using cross-station scenarios can help in providing more conclusions and a clear idea about model capabilities outside of their own catchment area: model calibration using data from one station and validation for other stations (i.e., see Zhu and Heddam 2019).
Answer:
We do agree that application outside the initially trained catchment is an important type of application. However, we found it necessary to focus only on the model prediction capabilities in single catchments, to derive the general applicability of different model types and data inputs. These should be used as a foundation to derive transferable models and modelling approaches. The transferability of these models cannot be adequately tested in this study, as no information is provided to the tested models to conduct this transfer. In our opinion, this transfer would need additional basin characteristics as inputs and consequently a larger number of basins for training and testing (multi-basin training). We certainly do see this as an important next step, but would refrain from applying the single-basin trained models for this task. We thank the reviewer for this thought and propose to add this topic to the conclusions regarding future research fields.
Citation: https://doi.org/10.5194/hess-2020-670-AC1
- RC2: 'Comment on hess-2020-670', Adrien Michel, 15 Feb 2021
Review of “Machine-learning methods for stream water temperature prediction”
Dear authors, dear editor,
The submitted paper discusses the usage of machine learning (ML) models to simulate water temperature in 10 catchments located in Austria. The results obtained from the ML models are compared to a linear regression model and to the model air2stream. The authors show that ML models achieve better results for water temperature simulations than the two benchmark models used. They also show that the choice of the hyperparameters of the ML models plays an important role in the performance of the models, and they present a method to reduce the computational time required to optimize the values of these parameters. Finally, the choice of the input variables forcing the ML models and its impact on the models' performance is discussed. This work is part of a recent ongoing effort to apply ML models in hydrology, and it brings some new and interesting insights. The comparison with two benchmark models really allows one to correctly assess the performance of the ML models.
In general, the paper is clear and well written and shows clean figures. In addition, all the source code is provided with clear instructions and in-code documentation. As I detail below, there are some important points to be addressed regarding the amount of information provided in order to allow reproducibility, the clarity of some parts of the methods, and, finally, the possible applications discussed (short-term prediction and climate change impact studies). I have no doubt that these points can be clarified and/or enhanced by the authors and that a revised version will be fit for publication in HESS. Indeed, water temperature has historically received less attention than discharge in modelling, while becoming a more and more important variable with ongoing and foreseen climate change. The contribution brought by this article is therefore really valuable.
I have to mention here that my expertise is on the water temperature side and not on the machine learning side, and that the editor might want a review by an expert from the ML community.
Do not hesitate to contact me for further discussions.
Best regards,
Adrien Michel (adrien.michel@epfl.ch)
Major comments:
Impact of snow/glacier cover and catchment size
The authors discuss the importance of snow/glacier cover in the perspective of climate change (CC) applications (I discuss CC below). However, this is not discussed for the application done in the paper. First, I suggest adding to Table 1 the mean catchment elevation and the percentage of glacier cover in the catchments, in order to allow a quick overview of the contribution of glacier and snow melt we can expect in each catchment. I would expect the TQ experiments to perform significantly better than the TP ones in catchments where snow plays an important role. Indeed, the snowmelt dynamics are captured in Q, while I doubt the TP experiments will be able to capture them. This is difficult to see from Figs. 4c and A1, so I would suggest adding further information about the TP vs. TQ comparison in high-elevation catchments, especially since the authors mention TP performance as an argument for usage in CC impact studies (lines 532 to 537).
Catchment size seems to have a clear influence on the results. Indeed, if we neglect the Danube catchment, Figures 4c and A1 show a reduction of RMSE and MAE with increasing catchment size. This is not surprising, since I would expect local-scale effects, harder to capture in models, to be smoothed out with increasing catchment size, leading to an increase of the model performance. It could be interesting to replace catchment with catchment size in the linear regression for test RMSE, or use both, in order to really assess it (in any case this would mean a regression with both discrete and continuous variables). This size effect is currently not discussed (except for the Danube), while I think it should definitely be mentioned.
ML models details
As mentioned above, I have no expertise in the ML domain beyond basic knowledge. As a novice, I found Section 2.4 quite technical (especially Section 2.4.5). Having in mind the target audience of HESS, I would suggest keeping in the main text an overview of the different ML models used, and moving the most technical parts to the Supplementary Material along with the details requested below.
This would allow the authors to present more details about the reproducibility of the study, which are not yet presented. Indeed, Appendix A only shows the hyperparameter bounds used for the Bayesian optimization. The final set of hyperparameters should be provided (along with the parameters of the two benchmark models). It is not completely clear to me whether the Bayesian optimization is done in general or per catchment. This should be stated. Also, is the optimization run for each separate experiment, or only once? And in that case, for which experiments? If the Bayesian optimization is done per catchment separately, is there not a risk of overfitting? In summary, some details and clarity are missing about how the Bayesian optimization is done (which catchments, experiments and time periods), and how the models' training is done. Note that the calibration procedure of air2stream should also be presented.
The computational cost seems to be a major concern when using ML models and is mentioned multiple times throughout the paper. It would be interesting to have indications about the hardware used and the total time needed for the Bayesian optimization, the learning phase, and the running phase, along with the time needed to calibrate and run the two benchmark models. This would help the reader to appreciate the computational implications of using ML models.
Finally, the training is done on really different time period lengths. The results seem to suggest that there is no correlation between the length of the training period and RMSE. This difference in period length is not really discussed in the paper. In general, having similar training periods would be beneficial to really compare the models' performances across catchments. Indeed, with the data provided, we do not know if the differences observed across catchments are due to catchment characteristics or to the training time period length. I imagine that the heavy computational time forbids re-running all catchments using similar data. However, a re-run on a single catchment with > 30 yrs of data, but using only 10 years as for the Enns, could be interesting to assess the impact of the length of the training time series on the results. Note that this question of the length of the time series available for training is an important point from an application perspective. Indeed, water temperature measurement networks are usually quite recent (a few decades), compared to the time series available for discharge.
Models evaluation and application
Section 3.3 focuses on the catchment obtaining the second-lowest RMSE. It would be interesting to perform a similar analysis for a catchment obtaining a higher RMSE. For example, analyzing the Kleine Muehl, where the median of the ML RMSE is close to air2stream, would be interesting. In any case, some plots of the whole time series for all catchments should be presented in the Supplementary Material in order to allow the reader to look at the real outputs and not only have access to RMSE boxplots (I mention here that I particularly appreciate Figure 5, where all outputs are always shown in grey in the background).
Two main applications are mentioned: short-term predictions (especially high summer water temperature peaks) and climate change (CC). For the first application, further discussion could be added in Section 3.3. Indeed, in Figure 5 we do see that many short-lived high-temperature events are not captured during the summer, which would be problematic for predictive applications. Metrics do exist to assess the quality of models in capturing such events (see e.g. the two-alternative forced choice score used in Griessinger et al. (2019)). While this paper of course does not pretend to deliver operational models – I rather see the paper as a demonstration of current capabilities of ML models for water temperature predictions along with the pros and cons of different models – a discussion of the performance of ML models regarding short-term high-temperature events would be a great addition.
The authors chose an interesting approach to assess the ability of the model to cope with non-stationarity in time series by using the CC signal present in the measurement time series. They show that ML models still obtain good performance in the warm year 2015. This could be enhanced by showing the whole time series in order to see if the error grows with time (and to better compare with benchmark models). Using one catchment with a >30 yr time series and training it only with the first 10 years (as suggested above to assess the effect of training time period length) could also be interesting in this regard. Indeed, the temperature increase between the 1980s and 2015 will be more important than the one in the time series of the Inn catchment used in the paper to do this validation. Note that the year 2003, which shows an important water temperature anomaly, is also an interesting benchmark year.
While these tests are important, and showing that models are able to correctly predict water temperature outside of the training range is a really good point for ML models, the increase of water temperature expected with CC is far above the range tested here. As a consequence, I do not think that it is really possible to assess ML models' ability to correctly predict CC impacts on water temperature with historical data. A comparison with physically based models could be an approach, but is beyond the scope of this work. Consequently, I would suggest revising lines 535 and 536 in the discussion.
Minor comments:
Figure 1: Please specify where you obtained the catchment delineations. Also, the figure mentions “Danube catchment”, while Donau is used elsewhere in the paper. Maybe the English name Danube should be used; it would be more accessible to international readers.
Table 1: In addition to what is already mentioned above, add calibration and testing periods to the table.
Line 48: date -> data
Lines 67-69: “Another main concern is that parametric statistical models showed higher prediction performances on weekly, monthly or seasonal time scales in the past (Caissie, 2006) leading to a loss of temporal variation (DeWeber and Wagner, 2014).”
Higher prediction performances compared to what (models and/or temporal resolution)?
Line 177: moths -> months
Line 255: Misplaced parenthesis and inversed sum bounds in eq (5)
Section 2.4.5: There is an overall inconsistency between parts of the text and equation in the usage of bold font for vector and matrix terms. E.g. in the paragraph at lines 308-310 they are in bold, while in the following paragraph they are in italic. I would suggest the usage of bold fonts everywhere.
Line 329: The meaning of “hyperparameters” is never really defined in the paper.
Line 356: The difference between validation and testing periods is not really clear. I understand the validation is used to choose the best models, and then test period is used to compare the set of best models (lines 377-380). This should be clarified. In addition, it is not clear if validation is done for the hyperparameters selection (the 5 setups mentioned at line 372) or between different trained version of the model using the same hyperparameters set. Also, is there first a phase to select the hyperparameters (which require to train and test the model), and then a new training phase, or are they both done at once?
Lines 366-367: Do you mean 60% -> training, 20% -> validation? Please clarify.
Lines 368-369: What is the “standard way of training neural networks…” — the 50-times training or your approach?
Lines 395-397: Please add some citation to the statements made here.
Line 398-399: I do not understand this sentence.
Lines 438-439: Do you have any explanation regarding the difference of performance observed?
Lines 450-455: I’m not completely convinced by the significance of this regression. Indeed, simply from boxplots we do see that the difference between catchments is by far the most important predictor.
Line 487: What is the number of time steps and optimal time step here?
Lines 489-490: Total time should also be provided in order to see how this ~2h decrease is important. Also, how is the p-value obtained here?
Line 536: The claim about short-term predictions and CC is too ambitious here. What is shown is the increase in performance compared to benchmark models, which is already a really important step. For short-term prediction and CC, see my longer comments above, but I think more work and discussion are needed to really assess the ability of the models for these applications.
Lines 546-547: Can the improvement here be explained by the Bayesian optimization method used?
Lines 548-550: What do you mean by “spatial information at different scales”? Indeed, ML models do not provide any spatially distributed output (which can be achieved with distributed physical models), but only point information.
References
Griessinger, N., Schirmer, M., Helbig, N., Winstral, A., Michel, A., & Jonas, T. (2019). Implications of observation-enhanced energy-balance snowmelt simulations for runoff modeling of Alpine catchments. Advances in Water Resources, 133, 103410. https://doi.org/10.1016/j.advwatres.2019.103410
Citation: https://doi.org/10.5194/hess-2020-670-RC2
- AC2: 'Authors answers to RC2', Moritz Feigl, 04 Mar 2021
Dear Adrien Michel,
We thank you for your time and effort in bringing up these thoughtful questions, and we sincerely thank you for your insightful comments and suggestions on the manuscript. To address them more easily, we split them into several parts. Please find our answers to your comments below.
Review:
The authors discuss the importance of snow/glacier cover in the perspective of climate change (CC) applications (I discuss CC below). However, this is not discussed for the application done in the paper. First, I suggest adding to Table 1 the mean catchment elevation and the percentage of glacier cover in the catchments, in order to allow a quick overview of the contribution of glacier and snow melt we can expect in each catchment.
Answer:
These points are highly interesting and will be addressed in more detail in a subsequent publication, which is currently in preparation. To answer your questions here, we computed the percentage of glacier and perpetual snow cover from the CORINE Land Cover data 2012 and the mean catchment elevation from the EU-DEM v1.1 digital elevation model with 25x25 m resolution, which we will include in Table 1 of the revised manuscript. The following table shows the corresponding values for all study catchments:
| Catchment   | Glaciers and perpetual snow (km²) | Glaciers and perpetual snow (% of total catchment area) | Mean catchment elevation (m NAP) |
|-------------|-----------------------------------|---------------------------------------------------------|----------------------------------|
| Kleine Mühl | 0.0                               | 0                                                       | 602                              |
| Aschach     | 0.0                               | 0                                                       | 435                              |
| Erlauf      | 0.0                               | 0                                                       | 661                              |
| Traisen     | 0.0                               | 0                                                       | 697                              |
| Ybbs        | 0.0                               | 0                                                       | 691                              |
| Saalach     | 0.0                               | 0                                                       | 1196                             |
| Enns        | 0.1                               | 0.003                                                   | 1419                             |
| Inn         | 60.0                              | 2.8                                                     | 2244                             |
| Salzach     | 60.3                              | 1.4                                                     | 1475                             |
| Donau       | 339.0                             | 0.4                                                     | 827                              |
Review:
I would expect the TQ experiments to perform significantly better than the TP ones in catchments where snow plays an important role. Indeed, the snowmelt dynamics are captured in Q, while I doubt the TP experiments will be able to capture them. This is difficult to see from Figs. 4c and A1, so I would suggest adding further information about the TP vs. TQ comparison in high-elevation catchments, especially since the authors mention TP performance as an argument for usage in CC impact studies (lines 532 to 537).
Answer:
We believe that these are important questions, and we again want to refer to our subsequent publication for a detailed answer. To give an overview here, we analyzed the differences in TP and TQ performance depending on elevation and glacier cover. TQ generally shows a significantly lower test RMSE. At the same time, the difference between TQ and TP performance does not show any changing pattern when evaluating high-elevation catchments in our data. This is visible in the following figure, which contains test RMSE values for both experiments 2 (TP) and 3 (TQ) for all catchments. The catchments are sorted by elevation and glacier cover fraction, from left (low elevation, no glacier) to right (high elevation, glacier).
While we do not observe the relationship you expected, it is also important to keep in mind that our study catchments are mainly larger catchments and limited in their number (n = 10). They were also intentionally chosen to be quite different from each other. Therefore, it might be difficult to test this hypothesis, since we lack the necessary number of small high-elevation catchments.
As you expected, there is a significant relationship between mean catchment elevation, glacier fraction and test RMSE. We tested this with a linear model using mean catchment elevation, glacier fraction in % of the total catchment area, total catchment area and the experiments as independent variables and test RMSE as dependent variable. This is illustrated in the following figure which shows test RMSE values and mean catchment elevations for all experiments, with additional information of the glacier area.
We propose to add a sentence regarding the relationship between test RMSE and mean elevation in the results section 3.2 as proposed in the answer of the third comment.
Review:
Catchment size seems to have a clear influence on the results. Indeed, if we neglect the Danube catchment, Figures 4c and A1 show a reduction of RMSE and MAE with increasing catchment size. This is not surprising, since I would expect local-scale effects, harder to capture in models, to be smoothed out with increasing catchment size, leading to an increase of the model performance. It could be interesting to replace catchment with catchment size in the linear regression for test RMSE, or use both, in order to really assess it (in any case this would mean a regression with both discrete and continuous variables). This size effect is currently not discussed (except for the Danube), while I think it should definitely be mentioned.
Answer:
We agree, catchment size seems to have an effect on model performance. When substituting catchment with catchment area in the linear regression model for test RMSE, we find a significant (p-value = 3.91 × 10⁻⁴) influence. This is also visible when plotting the test RMSE of all experiments together with the logarithm of the catchment area:
Overall, we agree that due to aggregation, local small scale effects on stream water temperature are smoothed out, making it easier to perform predictions based on catchment means. The fact that the Danube shows a reduced performance instead of an increased one for all models except RNNs, leads us to the conclusion that while larger catchments are easier to model, they need additional lagged information due to longer-term dependencies.
We propose to add the findings regarding catchment area, elevation and glacier cover in the results section 3.2 after line 460:
“The relationship between mean catchment elevation, glacier fraction and test RMSE was analyzed with a linear model using mean catchment elevation, glacier fraction in % of the total catchment area, total catchment area and the experiments as independent variables and test RMSE as dependent variable. This resulted in a significant association of elevation (p-value < 2 × 10⁻¹⁶) with lower RMSE values, and significant associations of catchment area (p-value = 3.91 × 10⁻⁴) and glacier cover (p-value = 9.79 × 10⁻⁵) with higher RMSE values. Applying the same model without the data of the largest catchment, the Danube, resulted in a significant (p-value = 2.12 × 10⁻¹¹) association between catchment area and lower RMSE values, while the direction of the other associations stayed the same.”
Furthermore, we propose to add a statement in the discussion at the end of line 562:
“The current results suggest a strong influence of catchment properties on general ML model performance. While associations of performance with elevation, glacier cover and catchment area were apparent, we could not come to a conclusion, as even the direction of the relationship for one variable changed when removing one catchment from the analysis. We believe that a number of factors influence these associations, and more in-depth investigations on a larger number of basins are needed to further understand the relationships between ML model performance and catchment properties and their implications.”
Review:
As mentioned above, I have no expertise in the ML domain beyond basic knowledge. As a novice, I found Section 2.4 quite technical (especially Section 2.4.5). Having in mind the target audience of HESS, I would suggest keeping in the main text an overview of the different ML models used, and moving the most technical parts to the Supplementary Material along with the details requested below.
Answer:
Thank you for this important comment; we agree that Section 2.4 is quite technical. We propose to make the model descriptions more easily understandable and to include a statement about their main characteristics. This should help the reader gain a more high-level overview of the general ideas behind these models and their differences.
Review:
This would allow the authors to present more details about the reproducibility of the study, which are not yet presented. Indeed, Appendix A only shows the hyperparameter bounds used for the Bayesian optimization. The final set of hyperparameters should be provided (along with the parameters of the two benchmark models). It is not completely clear to me whether the Bayesian optimization is done in general or per catchment. This should be stated. Also, is the optimization run for each separate experiment, or only once? And in that case, for which experiments? If the Bayesian optimization is done per catchment separately, is there not a risk of overfitting? In summary, some details and clarity are missing about how the Bayesian optimization is done (which catchments, experiments and time periods), and how the models' training is done. Note that the calibration procedure of air2stream should also be presented.
Answer:
Thank you for raising this point. We agree that providing the estimated hyperparameter sets in the supplementary material would be helpful and we will include these in the revised version. We would like to note that the code we used and provided for reproducibility purposes included fixed seeds in the random number generation to make our results reproduceable. Therefore, by running the provided code, anyone should be able to reproduce our results on any computer.
Since we are only investigating single-basin models, Bayesian hyperparameter optimization was applied for each combination of catchment, model, and experiment. Should there be any overfitting, it would result in lower performance in the test time period, which was never used for any model selection or training. Using cross-validation and training/validation splits during training should prevent potential overfitting.
To make the whole optimization procedure and overall training setup more comprehensible, we propose to add a paragraph and a table at the end of section 2.7 (Experimental setup). While the paragraph would include a summary of the whole modelling/optimization procedure (including the air2stream optimization), the table would contain the following information: catchment name, training/validation time period, test time period, cross-validation yes/no, number of hyperparameters, number of iterations of hyperparameter tuning.
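For illustration, fixing the random seeds as mentioned in our answer might look like the following minimal Python sketch. The helper name is hypothetical and not taken from the study's code; it only demonstrates the principle that identical seeds yield identical runs.

```python
import random

def set_global_seed(seed: int = 42) -> None:
    """Fix the random number generator state so repeated runs produce
    identical results (hypothetical helper, not from the study's code).
    For the ML models one would additionally seed NumPy and the deep
    learning framework, e.g. np.random.seed(seed) or tf.random.set_seed(seed).
    """
    random.seed(seed)

# identical seeds lead to identical pseudo-random draws
set_global_seed(42)
first_run = [random.random() for _ in range(3)]
set_global_seed(42)
second_run = [random.random() for _ in range(3)]
assert first_run == second_run
```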
Review:
The computational cost seems to be a major concern using ML models and is mentioned multiple times throughout the paper. It would be interesting to have indications about the hardware used and the total time needed for the Bayesian optimization, the learning phase, and the running phase along with the time needed to calibrate and run the two benchmark models. This would help the reader to apprehend the computational implication of using ML models.
Answer:
We absolutely agree with this remark. We propose to provide the median and interquartile range of the run time for each ML model in the results section 3.2. Additionally, we can include a statement about the run times of the benchmark models: (1) the run time of the LM model is negligibly small (< 1 s), (2) air2stream has a run time of < 2 min in all catchments. We also propose to add details about the hardware used in section 2.7 (Experimental setup).
Run time overview:
Model      Run time (min), median (IQR)
step LM    698.9 (158.8, 1733.8)
RF          54.3 (44.3, 74.6)
XGBoost    172.9 (153.6, 204.0)
FNN         30.8 (28.5, 41.5)
RNN-LSTM   748.6 (520.9, 1111.6)
RNN-GRU    767.8 (583.9, 1171.1)
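The "median (IQR)" entries in the table can be computed, for example, with Python's standard-library statistics module. This is an illustrative sketch, not the study's code, and the helper name is hypothetical.

```python
from statistics import quantiles

def median_iqr(values):
    """Return (median, (Q1, Q3)) for a sample, matching the
    "median (IQR)" format used in the run-time table above.
    Illustrative helper; the name is hypothetical."""
    # n=4 yields the three quartile cut points Q1, Q2 (median), Q3;
    # method="inclusive" interpolates between data points
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q2, (q1, q3)

# e.g. hypothetical per-run times (min) for one model
med, (q1, q3) = median_iqr([30.1, 28.5, 41.5, 30.8, 35.0])
```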
Hardware overview:
All models were run on the Vienna Scientific Cluster, where each run had access to two Intel Xeon E5-2650v2 CPUs (2.6 GHz, 8 cores each) and 65 GB RAM.
Review:
Finally, the training is done on very different time period lengths. The results seem to suggest that there is no correlation between the length of the training period and RMSE. This difference in period length is not really discussed in the paper. In general, having similar training periods would be beneficial to really compare the models’ performances across catchments. Indeed, with the data provided, we do not know if the differences observed across catchments are due to catchment characteristics or to the training time period length. I imagine that the heavy computational time forbids re-running all catchments using similar data. However, a re-run on a single catchment with > 30 yrs of data, but using only 10 years as for the Enns, could be interesting to assess the impact of the length of the training time series on the results. Note that this question of the length of the time series available for training is an important point from an application perspective. Indeed, water temperature measurement networks are usually quite recent (a few decades), compared to time series available for discharge.
Answer:
From the current results we expect little difference in performance between a model using 30 years and a model using 10 years of data. Our results show that even short time series (6 years of training data in all experiment 5 & 6 models) can produce state-of-the-art results. It is nevertheless true that we do not know the data-length threshold below which model performance is significantly reduced. However, we believe that this threshold is highly dependent on the catchment and the main processes that are present in the catchment. We share your interest in these points and thank you for your suggestions; at this point, however, we would like to kindly point to the follow-up study we are currently working on.
As you mentioned, there is no apparent correlation between RMSE and the length of the training period in our results. Based on these results, we conclude that the influence of the catchment is most likely larger than the influence of the time series length. However, we believe that, since the number of investigated catchments is limited, effects such as the relationship between time series length and model performance cannot be estimated with certainty. There may be a strong correlation between time series length and RMSE in certain types of catchments (e.g. small alpine catchments), but finding these relationships is not possible with the given study setup. Thus, we would like to refrain from making assumptions we cannot yet prove.
Review:
The Section 3.3 focuses on the catchment obtaining the second lowest RMSE. It would be interesting to perform a similar analysis for a catchment obtaining higher RMSE. For example, analyzing the Kleine Muehl, where the median of ML RMSE is close to air2stream, would be interesting. In any case, some plots of the whole time series for all catchments should be presented in Supplementary in order to allow the reader to look at the real outputs and not only have access to RMSE boxplots (I mention here that I do particularly appreciate Figure 5 where all outputs are always shown in grey in the background).
Answer:
As suggested, we propose to add the time series plots of all other catchments in the supplementary material. We agree that there are additional aspects of these results that would also be interesting to further investigate and discuss, but as the manuscript was already quite extensive, we only chose one catchment. We hope that adding the remaining catchments in the supplementary material is a reasonable compromise. Please find the comparison of the predictions of all tested model types for the Kleine Mühl catchment for the year 2015 below. As you can see, while air2stream indeed performs much better in the Kleine Mühl than in the Inn catchment, the overall characteristics of the models are quite similar.
Review:
Two main applications are mentioned: short term predictions (especially high summer water temperature peaks) and climate change (CC). For the first application, further discussion could be added in Section 3.3. Indeed, in Figure 5 we do see that many short-lived high temperature events are not captured during the summer, which would be problematic for predictive applications. Metrics do exist to assess the quality of models in capturing such events (see e.g. the two-alternative forced choice score used in Greissinger et al. (2019)). While this paper of course does not pretend to deliver operational models – I rather see the paper as a demonstration of current capabilities of ML models for water temperature predictions along with pros and cons of different models – a discussion of the performance of ML models regarding short term high temperature events would be a great addition.
Answer:
This is indeed an important point and we agree that our description of the possible applications was not precise enough. In general, we believe it is very important to distinguish between simulation and forecast models (Beven & Young, 2013). While simulation models aim to predict a certain variable (e.g. water temperature) by using inputs that reflect our process understanding, forecast models use all available previous information, including previous observations of this variable. While simulation models can show important connections between input variables and the output variable and can be used to learn more about a system, forecast models will most likely always provide better prediction performance. As you mention, we do not aim to provide the best possible operational (i.e. forecasting) model, but investigate model types, input data and training procedures for water temperature (simulation) modelling. Based on these results, developing an operational forecasting model would mainly consist of choosing the model type and data depending on the aims and data availability, and including water temperature from previous time steps as an additional input.
We propose to adapt our statement on short term forecast in the discussion (line 536) to:
“… could be used for climate change studies and as basis for short term forecast models.”
Reference:
Beven, K. and Young, P. (2013) ‘A guide to good practice in modeling semantics for authors and referees’, Water Resources Research. Blackwell Publishing Ltd, 49(8), pp. 5092–5098. doi: 10.1002/wrcr.20393.
Review:
The authors chose an interesting approach to assess the ability of the models to cope with non-stationarity in time series by using the CC signal present in the measurement time series. They show that ML models still obtain good performance in the warm year 2015. This could be enhanced by showing the whole time series in order to see if the error grows with time (and to better compare with benchmark models). Using one catchment with a > 30 yrs time series and training it only with the first 10 years (as suggested above to assess the effect of training time period length) could also be interesting in this regard. Indeed, the temperature increase between the 80’s and 2015 will be more important than the one in the time series of the Inn catchment used in the paper to do this validation. Note that the year 2003, which shows an important water temperature anomaly, is also an interesting benchmark year.
Answer:
We actually only noticed that the ML models were able to cope with signals in the observations that were outside the range of the training data after all model simulations were performed. We agree that assessing the performance of a model trained with data from the 1980s on data from 2015 would be a much harder and more comprehensive test. We hope for your understanding that this test cannot be added to the already extensive manuscript. There are three important reasons: First, only three catchments have time series that are long enough (> 30 years), leaving us with a very limited sample size. Secondly, data collection technology was updated at some point in the 1990s for all catchments. While this is not a large source of error for the current model setup, where enough data points from later years are available, it can be assumed that this would most likely reduce performance. Consequently, additional analysis of the underlying observations would be necessary to better estimate the origin of potential modelling errors. Thirdly, the manuscript focuses on the comparison of several ML approaches and is already quite extensive. Adding the “CC test” would therefore be beyond the scope of the current manuscript.
Review:
While these tests are important and showing that models are able to correctly predict water temperature when out of the training range is a really good point for ML models, the increase of water temperature expected with CC is far above the range tested here. As a consequence, I do not think that it is really possible to assess ML models’ ability to correctly predict CC impacts on water temperature with historical data. A comparison with physically based models could be an approach, but is beyond the scope of this work. Consequently, I would suggest to revise the lines 535 and 536 in the discussion.
Answer:
We agree that this statement is too general and is not supported by our findings. We will change it to: “Thus, application of this set of widely available data inputs is able to produce prediction performance improving the current state of the art and could be used for short term forecasts and assessing near future predictions (5-10 years) under climate change. The ability of ML approaches to simulate processes and signals from a system under prolonged climate change is important and a topic of future research.”
Review:
Figure 1: Please specify where you obtained the catchments delineation. Also, the figure mention “Danube catchment” while Donau is used elsewhere in the paper. Maybe the English name Danube should be used, it would be more accessible to international readers.
Answer:
We will add the delineation sources (Bayrisches Landesamt für Umwelt; HAO: Hydrological Atlas of Austria (digHAO), 3. Delivery, Federal Ministry of Agriculture, Regions and Tourism, Vienna, Austria, 2007) and change “Donau” to “Danube” in the manuscript as you suggested.
Review:
Table 1: In addition to what is already mentioned above, add calibration and testing periods to the table.
Answer:
This will be included in the additional table in section 2.7 (Experimental setup) that we proposed in your major comment.
Review:
Line 48: date -> data
Answer:
The suggested change will be included in the revised manuscript.
Review:
Lines 67-69: “Another main concern is that parametric statistical models showed higher prediction performances on weekly, monthly or seasonal time scales in the past (Caissie, 2006) leading to a loss of temporal variation (DeWeber and Wagner, 2014).” Higher prediction performances compared to what (models and/or temporal resolution)?
Answer:
In this section, we give an overview of the strengths of different modelling types that were used for river water temperature modelling in the past. With the sentence you are addressing in your comment we aimed to point out that parametric statistical models achieved higher performances on coarser time scales than on finer temporal resolutions – especially when considering air temperature as the sole surrogate. We suggest to include an explicit addition to the sentence in the revised manuscript: “Another main concern is that parametric statistical models showed higher prediction performances on weekly, monthly or seasonal time scales in the past compared to finer temporal resolutions (Caissie, 2006) leading to a loss of temporal variation and thus undesirable information (DeWeber and Wagner, 2014).”
Review:
Line 177: moths -> months
Answer:
The suggested change will be included in the revised manuscript.
Review:
Line 255: Misplaced parenthesis and inversed sum bounds in eq (5)
Answer:
The suggested change will be included in the revised manuscript.
Review:
Section 2.4.5: There is an overall inconsistency between parts of the text and equation in the usage of bold font for vector and matrix terms. E.g. in the paragraph at lines 308-310 they are in bold, while in the following paragraph they are in italic. I would suggest the usage of bold fonts everywhere.
Answer:
Thank you for this observation and comment, we are going to adapt the formulas as you suggested.
Review:
Line 329: Hyperparameters meaning is never really defined in the paper
Answer:
We propose to add an additional sentence in line 117: “The term ‘hyperparameters’ refers to any model parameter that is chosen before training the model (e.g. the neural network structure).”
Review:
Line 356: The difference between validation and testing periods is not really clear. I understand the validation is used to choose the best models, and then test period is used to compare the set of best models (lines 377-380). This should be clarified. In addition, it is not clear if validation is done for the hyperparameters selection (the 5 setups mentioned at line 372) or between different trained version of the model using the same hyperparameters set. Also, is there first a phase to select the hyperparameters (which require to train and test the model), and then a new training phase, or are they both done at once?
Answer:
We hope that our proposed additional paragraph and table in section 2.7 (Experimental setup) will clarify these points:
The validation period is used to estimate the performance of a specific set of hyperparameters and therefore to choose the best fitting set of hyperparameters. The testing period is never part of any training or parameter or model selection procedure and can therefore be understood as a test of the predictive ability of a model on a new data set. The training/validation split is only applied for the neural networks (FNN, RNN-LSTM, RNN-GRU), while all other models use cross-validation (i.e. multiple training/validation splits are applied and the validation performance is estimated more robustly).
Review:
Lines 366-367: Do you mean 60% -> training, 20% -> validation? Please clarify.
Answer:
We will change the sentence to: “… was done by using training/validation split with 60% data for training and 20% data for validation.“
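The 60/20/20 chronological split described in our answer can be sketched as follows. This is an illustrative Python sketch under our stated setup (60 % training, 20 % validation, last 20 % test, no shuffling); the function name is hypothetical and not taken from the study's code.

```python
def chronological_split(n_samples, train_frac=0.6, val_frac=0.2):
    """Split time-ordered sample indices into training, validation and
    test segments without shuffling (illustrative sketch; the function
    name is hypothetical). The last block is the test period, which is
    never used for hyperparameter selection or training."""
    train_end = round(n_samples * train_frac)
    val_end = round(n_samples * (train_frac + val_frac))
    train = list(range(train_end))              # earliest 60 %
    val = list(range(train_end, val_end))       # next 20 %
    test = list(range(val_end, n_samples))      # most recent 20 %
    return train, val, test

train_idx, val_idx, test_idx = chronological_split(10)
assert (len(train_idx), len(val_idx), len(test_idx)) == (6, 2, 2)
```

Because the segments are contiguous in time, the test period always corresponds to the most recent observations, consistent with our statement that the test data is always a subset of the years 2008-2015.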
Review:
Lines 368-369: what is the “standard way of training neural networks…”, the 50 times training or your approach?
Answer:
We will change the sentence to: “Furthermore, the training/validation split is the standard way of training neural networks for real world applications.”
Review:
Lines 395-397: Please add some citation to the statements made here.
Answer:
Thank you for your suggestion. We will include some references to the statement in line 395 in the revised manuscript and change it to: “Due to climate change induced warming trends, both air temperatures and water temperatures are steadily increasing (Mohseni, Erickson and Stefan, 1999; Pedersen and Sand-Jensen, 2007; Harvey et al., 2011; Kędra, 2020). This is clearly visible when comparing the change in number of extreme warm days and the increase of mean water temperature in all studied catchments with time.”
References:
Harvey, R. et al. (2011) ‘The influence of air temperature on water temperature and the concentration of dissolved oxygen in Newfoundland Rivers’, Canadian Water Resources Journal. Taylor & Francis Group , 36(2), pp. 171–192. doi: 10.4296/cwrj3602849.
Kędra, M. (2020) ‘Regional Response to Global Warming: Water Temperature Trends in Semi-Natural Mountain River Systems’, Water. MDPI AG, 12(1), p. 283. doi: 10.3390/w12010283.
Mohseni, O., Erickson, T. R. and Stefan, H. G. (1999) ‘Sensitivity of stream temperatures in the United States to air temperatures projected under a global warming scenario’, Water Resources Research. John Wiley & Sons, Ltd, 35(12), pp. 3723–3733. doi: 10.1029/1999WR900193.
Pedersen, N. L. and Sand-Jensen, K. (2007) ‘Temperature in lowland Danish streams: contemporary patterns, empirical models and future scenarios’, Hydrological Processes. John Wiley & Sons, Ltd, 21(3), pp. 348–358. doi: 10.1002/hyp.6237.
Review:
Line 398-399: I do not understand this sentence.
Answer:
We are sorry for this unclear phrasing, we propose to change it to:
“Since test data consists of the last 20% of the overall data, the exact length of these time series is dependent on the catchment but is always a subset of the years 2008-2015.”
Review:
Lines 438-439: Do you have any explanation regarding the difference of performance observed?
Answer:
Thank you for this interesting question. As mentioned above, the relationship between catchment characteristics and ML model performance is highly interesting and will be addressed in detail in a subsequent publication. However, we think that the general performance of statistical/ML approaches for stream water temperature prediction is most certainly related to a few driving catchment characteristics and believe that we will be able to answer this in the above-mentioned subsequent paper. This will include an analysis of the relation to specific catchment characteristics and corresponding driving hydrological processes.
Review:
Lines 450-455: I’m not completely convinced by the significance of this regression. Indeed, simply from boxplots we do see that the difference between catchments is by far the most important predictor.
Answer:
In general, the significance of multiple factors (catchment, experiment, model type) in the regression model does not mean that they are equally important. As you state, the catchment influence is by far larger than the influence of the other regressors. However, these results show that the other two regressors also explain part of the overall variance. Since we know that the catchment influence is by far the largest, we might include this statement here as well to avoid potential misunderstandings. Thus, we propose to add a sentence in line 455: “Overall, the influence of the catchment is larger than the influence of model type and experiment, which is clearly shown by its roughly one order of magnitude larger coefficients.”
Review:
Line 487: What is the number of time steps and optimal time step here?
Answer:
The number of time steps is an RNN model hyperparameter which defines the number of previous time steps used as model input. With “optimal time steps” we meant the number of time steps estimated with the Bayesian hyperparameter optimization. We propose to clarify this by changing it to:
“By removing time information from the inputs, the number of time steps estimated by the Bayesian hyperparameter optimization is on average 37.78 days larger than when using time information as an additional input.”
Review:
Lines 489-490: Total time should also be provided in order to see how this ~2h decrease is important. Also, how is the p-value obtained here?
Answer:
The proposed overview of model run times (related to one previous major comment) should enable comparison in the revised manuscript. The p-value is obtained by comparing the distribution of training times of the RNN models with and without time information (fuzzy months) with a Kruskal-Wallis test.
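The Kruskal-Wallis comparison mentioned above can be illustrated with a minimal pure-Python implementation of the H statistic (tied values receive average ranks; the tie correction factor is omitted). In practice one would use scipy.stats.kruskal; this sketch with hypothetical run-time samples only shows the principle of comparing two run-time distributions.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic for comparing the distributions of
    several independent samples (illustrative sketch; ties get average
    ranks, tie correction factor omitted)."""
    pooled = sorted((value, gi) for gi, g in enumerate(groups) for value in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        # find the block of tied values and assign them their average rank
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        i = j
    return 12 / (n * (n + 1)) * sum(
        rs ** 2 / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)

# hypothetical run times (min) with and without fuzzy-month inputs
h = kruskal_wallis_h([748.6, 760.2, 735.1], [620.4, 615.9, 640.3])
```

A large H (compared against a chi-squared distribution with k-1 degrees of freedom) indicates that the run-time distributions differ, which is what the reported p-value summarizes.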
Review:
Line 536: The claim about short term predictions and CC is too ambitious here. What is shown is the increase in performance compared to benchmark models, which is already a really important step. For short term prediction and CC, see my longer comments above, but I think more work and discussion are needed to really assess the ability of the models for these applications.
Answer:
We agree. Please find our suggestions for the revised manuscript in the comments 1: 3 and 9:11 above. We also kindly refer to our next paper at this point.
Review:
Lines 546-547: The improvement here can be explained by the Bayesian optimization method used?
Answer:
While an adequate hyperparameter optimization was necessary for some models (e.g. FNN, XGBoost), others would also produce good results without it (e.g. RF). Thus, the improvement can be attributed to the combination of an adequate representation of time (fuzzy months) as data input, the applied hyperparameter optimization, the choice of lagged time steps and the used input variables. To clarify this, we propose to change the statement to:
“Consequently, our presented approaches show a significant improvement compared to existing machine learning daily stream water temperature prediction models, which can be attributed to the adequate representation of time (fuzzy months) as data input, the applied hyperparameter optimization, the choice of lagged time steps and the used input variables.”
Review:
Lines 548-550: What do you mean by “spatial information at different scales”? Indeed, ML models do not provide any spatially distributed output (which can be achieved with distributed physical models), but only point information.
Answer:
Here we refer to the general ability of ML models to learn from given sequences (e.g. time series) or objects (e.g. orthophotos). By writing “spatial information” we mean that catchment characteristics derived from such objects could potentially be used as well. We will revise the statement to make it clearer and change it to: “However, machine learning methods are more powerful and flexible than previous modelling approaches and are able to simultaneously use spatial and temporal information at different scales (Reichstein et al., 2019).”
Citation: https://doi.org/10.5194/hess-2020-670-AC2
-
AC2: 'Authors answers to RC2', Moritz Feigl, 04 Mar 2021