Streamflow Estimation in Ungauged Regions using Machine Learning: Quantifying Uncertainties in Geographic Extrapolation

Le, Manh-Hung; Kim, Hyunglok; Adam, Stephen; Do, Hong Xuan; Beling, Peter; Lakshmi, Venkataraman

doi:https://doi.org/10.5194/hess-2022-320

Preprints

https://doi.org/10.5194/hess-2022-320

Preprints

09 Sep 2022

| 09 Sep 2022

Status: this preprint was under review for the journal HESS but the revision was not accepted.

Streamflow Estimation in Ungauged Regions using Machine Learning: Quantifying Uncertainties in Geographic Extrapolation

Manh-Hung Le, Hyunglok Kim, Stephen Adam, Hong Xuan Do, Peter Beling, and Venkataraman Lakshmi

Abstract. The majority of ungauged regions around the world are in protected areas and rivers with non-perennial flow regimes, which are vital to water security and conservation. There is a limited amount of ground data available in such regions, making it difficult to obtain streamflow information. This study examines how in situ streamflow datasets in data- rich regions can be used to extrapolate streamflow information into regions with poor data availability. These data-rich regions include North America (987 catchments), South America (813 catchments), and Western Europe (457 catchments). South Africa and Central Asia are defined as data-poor regions. We obtained 81 catchments and 133 catchments for these two data-poor regions, respectively, and assumed they are pseudo ungauged regions for our analysis. We trained machine learning (ML) algorithms using climate and catchments attributes input variables in data-rich (i.e., source) regions and analyzed the possibility of using these pre-trained ML models to estimate climatological monthly streamflow over data-poor (i.e., target) regions. We found that including diverse climate and catchment attributes in training data sets can greatly improve ML algorithms' performance regardless of significant geographical distance between input datasets. The pre-trained ML models over North America and South America could be used effectively to estimate streamflow over data-poor regions. This study provides insight into the selection of input datasets and ML algorithms with different sets of hyperparameters for a geographic streamflow extrapolation.

Received: 08 Sep 2022 – Discussion started: 09 Sep 2022

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Download & links

Preprint (PDF, 1831 KB)

Supplement (351 KB)

Download & links

Manh-Hung Le, Hyunglok Kim, Stephen Adam, Hong Xuan Do, Peter Beling, and Venkataraman Lakshmi

Status: closed

CC1:
'Comment on hess-2022-320', Alex Sun, 14 Sep 2022

PUB is an important and challenging research topic. This work applies the transfer learning concept by leveraging gages from the extended CAMELS data family. However, I feel the authors stopped short from addressing how much information is actually "transferrable" between different CAMELS datasets. My student and I recently assessed the feasibility of transfer learning between CAMELS-UK and CAMELS-US by using a relatively simple and yet popular tool arising from data clustering, namely, UMAP. We used the following common CAMELS attributes
common_attributes = [

'q_mean',

'runoff_ratio',

'slope_fdc',

'baseflow_index',

'stream_elas',

'high_q_freq',

'high_q_dur',

'low_q_freq',

'low_q_dur',

'zero_q_freq',

'hfd_mean',

'soil_depth_pelletier',

'p_mean',

'pet_mean',

'p_seasonality',

'frac_snow',

'aridity',

'high_prec_freq',

'high_prec_dur',

'low_prec_freq',

'low_prec_dur',

'root_depth_50',

'elev_mean']

The attached figure shows CAMELS-UK and CAMELS-US gages share little in common. They pretty much exist in their own clusters. We welcome the authors to check our calculations. If true, this indicates most information is extracted from local neighbors (i.e., the neighbors of a US gage most likely exist in the US). I suggest the authors add some sort of spatial pattern analysis to untangle the source-target relations (see Sun et al., 2021 and 2022).
References:
Sun, A. Y., Jiang, P., Mudunuru, M. K., & Chen, X. (2021). Explore Spatio‐Temporal Learning of Large Sample Hydrology Using Graph Neural Networks. Water Resources Research, 57(12), e2021WR030394.
Sun, A. Y., Jiang, P., Yang, Z. L., Xie, Y., & Chen, X. (2022). A graph neural network approach to basin-scale river network learning: The role of physics-based connectivity and data fusion. Hydrology and Earth System Sciences Discussions, 1-35.

Citation: https://doi.org/10.5194/hess-2022-320-CC1
- CC2: 'Reply on CC1', Manh-Hung Le, 05 Nov 2022
  
  Thank you very much for your constructive comments. Since our study aims to examine our method in a real-world case study, we focus on transforming climatological streamflow information. CAMELS databases are only accessible in the U.S., UK, Chile, Brazil, and Australia. It would be possible to collect more samples for our analysis by averaging streamflow to get a long-term hydroclimatology value per catchment. We have highlighted this limitation in our revised manuscript [In Section 5 – Limitations and further studies].
  
  “The purpose of our study is to investigate our method in the context of a real-world case study, so we focus on transforming climatological streamflow information. We understand climatological streamflow may have practical limitations, however, by averaging streamflow to get a long-term hydroclimatology value per catchment, it might be possible to collect more samples for our analysis. There are publicly available CAMELS datasets with finer temporal resolution (daily time step). However, CAMELS databases are only accessible in the U.S., UK, Chile, Brazil, and Australia. Further refinements of streamflow prediction (e.g., daily) could be investigated in the future when CAMELS becomes available in a wider region.”
  With respect to your suggested references, we have incorporated these references in the revised manuscript because they are pertinent to our research.
  As per your suggestion, we conducted additional analysis using UMAP. Figure 1 (in the attached file), for example, (you can find all month’s UMAP result in the Supporting Information document), shows the spatial pattern analysis using UMAP which untangles the inputs for source and target regions for January (Other months' results are pretty comparable). It is interesting to note that the target catchments (rectangular) are mostly within catchments from the source, demonstrating a possibility that pre-trained ML over the source regions (circle, cross and plus marks) can predict the output at the target regions. However, it is worth noting that that the UMAP only displays the relative distance. Therefore, it would be premature to assume that pre-train ML models cannot predict streamflow in target regions even if the target and source UMAP values are far off.
  Manh-Hung Le,
  On behalf of all authors
  
  Citation: https://doi.org/10.5194/hess-2022-320-CC2
RC1:
'Comment on hess-2022-320', Anonymous Referee #1, 08 Oct 2022

Title: Streamflow Estimation in Ungauged Regions using Machine Learning: Quantifying Uncertainties in Geographic Extrapolation

General:

This paper attempts to make predictions of monthly averaged streamflow in data scarce regions with machine learning models that were trained in data rich regions. The test their predictions with different permutations of training regions. As expected, the models perform better with different climates and catchments attributes in the training set. Interestingly, however, the results suggest that models trained in North and South America are more reliable than models trained in Europe. They also find, as expected, that extreme gradient boost outperforms support vector mache and random forest. The paper is written fairly well, with exceptions noted below, and provides additional support for the well established conclusion that machine learning models trained on diverse data sets can be useful outside the basins which they are trained. This paper expands that conclusion by transferring the learned models to entirely new regions, in particular to data sparse regions, which is important, as the authors point out.

It was not clear to me if these models were forward looking or backward. I am not entirely sure how useful a monthly average streamflow prediction is in practice, especially if the forcings which drive the prediction are aggregated over that particular month, which would have the prediction a backward estimate. If, however, the forcings are aggregates from the previous month, then this is valuable to water resources management. I ask the authors to make this clarification in their data and methodology sections.

This paper omits non-machine learning models from the study because they are harder to set up. And unfortunately there is no benchmark model/s presented. I believe that this could potentially draw criticism. I do fully understand the need for easy-to-use models in some situations. I would encourage the authors to rethink their framing of the model selection in the introduction and conclusion. Perhaps it would be good to make a case for the benefits of easy-to-use models, and make a case that these shallow learning models are suitable for monthly averaged streamflow over the state-of-the-art LSTM mode, which has been shown again and again to outperform other streamflow models, even when trained out of sample.

Abstract:

Line 21 has double periods.

Introduction:

Lines 38 and 39 claim about stream gauges being the most accurate way to measure streamflow is vague and trivial. Are you making a distinction between remote sensing and in situ measurements? There are many methods of gauging a stream, some more accurate than others. I’m not sure what is the purpose of the sentence, remove or clarify.

Lines 78 and 79: If there is a good argument that ML is not the most promising approach, I’d like to see a citation. Otherwise just state it directly as “machine learning models are arguably one of the most promising approaches”

Line 86: I’m not sure it is obvious what at “traditional” hydrological model is.

Line 107-108: I assumed your hypothesis was about ML model’s ability to transfer learning from one region to another, but here you claim that you use ML models because they are easier to set up?

Lines 108-109: I think the last sentence of this paragraph is fragmented. What kind of water resources prediction? In what context are the water resources secure or insecure?

Data:

Line 127: What is the rational for removing values greater than 2,000 cms?

Line 140: Can you make it clear if your model is making a forward or backward prediction? Is your monthly forcing aggregates from the same month in which your monthly averaged streamflow comes from?

Figure 1: What unit is catchment density?

Methodology:

Lines 207-209: This wording is a little confusing. Can you rephrase to make it clear that the validation set was used to tune the hyper-parameters? Meaning, your training set is used to get the model weights, and then you check the quality of those weights by calculating an error on the validation set, then modify a hyper-parameter and train again, then check the quality of the new weights on the validation set. And to be clear, you do not calculate any error on the test set until the hyper-parameters have been chosen, right?

Table 3: Consider moving the regions into the table header, instead of as a note.

Results and Discussion:

Line 236: “The local-based models also served as benchmark models” This should be moved to the methods section.

Limitations and further studies:

Line 326: In parentheses you have “daily or monthly”, but I think you meant “daily or hourly”

Conclusions

Line 334-335: “ML algorithms to quickly test our hypothesis since ML algorithms could be easier to set up than traditional hydrological models.” I think this is a bad reason to us ML. There is no use doing a study with one tool instead of another simply because it is easier.

Line 351: double periods.

Citation: https://doi.org/10.5194/hess-2022-320-RC1
- AC1: 'Reply on RC1', Hyunglok Kim, 19 Dec 2022
  
  Dear Reviewer,
  We appreciate your taking the time to review this manuscript and suggest many valuable comments.
  Please see the attached file.
  Sincerely,
  Manh-Hung Le
  
  Hyunglok Kim
  
  Stephen Adam
  
  Hong Xuan Do
  
  Peter A. Beling
  
  Venkataraman Lakshmi
  
  Citation: https://doi.org/10.5194/hess-2022-320-AC1
RC2:
'Comment on hess-2022-320', Anonymous Referee #2, 20 Oct 2022

This paper investigated the monthly streamflow prediction in ungauged regions using Machine Learning (ML) based methods. The authors compared three ML methods in global basins with two large regions as data-poor targets. The overall structure of this ms is clear to follow and the topic is intriguing to me. I have some comments as shown below on better clarifying the methodology and performing more profound discussions on the results to safely draw the conclusion. Hopefully these suggestions can help to improve the quality of this study.

Introduction: The authors did a good job here with a comprehensive review on the present studies and I enjoyed reading this part.

Methodology:

To my knowledge, the present cutting-edge ML applications in streamflow prediction mainly focus on daily prediction with deep learning models like LSTM which show superior performance over other models as shown by several studies already cited in this ms. The advantage of DL models over traditional ML was not only shown in hydrology but also in many other fields. I feel the authors may discuss more on the motivation of their choices on monthly prediction and model selection with traditional ML methods.

Better clarification on the framework and experiment design is needed to help readers easily understand the method section. I am quite confused about the meaning of “100” mentioned in line 219 and throughout the ms. Does this mean a 100-fold cross validation to cover all the available data? If so, there would be no basin overlapping for each testing but how the 100 simulation range comes then? I also didn’t understand how the training, validation and testing dataset were formed with limited details given. How do you organize and divide data in the time dimension? The streamflow prediction is a time dynamic problem and I see the authors use data across multiple decades, however I only find the results reported for 12 individual months without time continuous information given.

If I understand correctly, the authors train individual models for different months. I am just curious how this choice was made and how the model would behave with one model trained on data from all months instead, especially given the power of ML models handling big data.

Results:

Reading through the result section, I hope the authors can do a more profound analysis and discussion on their results, not limited to simply describing the figures. The present figures are kind of redundant to me especially without many discussions involved. You may consider removing some unnecessary ones.

For the PUR performance evaluation, the authors need to clarify more about the absolute performance in target regions, not only the performance difference from the local models. It’s quite intuitive to get worse PUR performance compared with local models, but the readers care more about the direct evaluation, like how will ML models behave and can we get functional models for predictions in ungauged regions? Looking at Figure 8, I feel the absolute PUR performances are mostly close to KGE value of 0.0 (y axis starting at -2.0 can be somehow misleading to readers), which implies unsatisfactory performance for a functional model.

It’s quite interesting and also surprising to me for the statement of line 290 that including more training data (EX7 here) leads to lower performance. I hope the authors can have more investigation and discussions on this point, which could be quite controversial given the common agreement that ML models usually benefit from bigger data. Thinking about this, I feel it may depend on different scenarios, such as different types of models used with different capacities to handle large data, and how you train and evaluate the model - the model with more input data may not get optimized which leads to underfitting. Taking one example, for experiments EX1-EX7, the optimized hyperparameters can be different with varying training data availability, and a fair comparison should be built on the optimized conditions of different models.

I didn’t understand the results shown in Figure 3 well. Are these the results on source (gauged) or target (ungauged) basins? Are they reported on the testing data, and if so how did you divide the testing data?

Conclusion:

As mentioned in the above comments, I feel the two key points in line 341 and line 343 are kind of contradictory regarding whether more diverse data can lead to better performance or not. The authors should carefully investigate this point before drawing a conclusion here. In addition, as mentioned previously, more analyses on the absolute PUR performance are needed to get the strong conclusion in line 351 that these models can be capable of solving PUR problems in ungauged regions, especially given the deteriorated performance shown in Figure 8.

Citation: https://doi.org/10.5194/hess-2022-320-RC2
- AC2: 'Reply on RC2', Hyunglok Kim, 19 Dec 2022
  
  Dear Reviewer,
  We appreciate your taking the time to review this manuscript and suggest many valuable comments.
  Please see the attached file.
  Sincerely,
  Manh-Hung Le
  
  Hyunglok Kim
  
  Stephen Adam
  
  Hong Xuan Do
  
  Peter A. Beling
  
  Venkataraman Lakshmi
  
  Citation: https://doi.org/10.5194/hess-2022-320-AC2
RC3:
'Comment on hess-2022-320', Anonymous Referee #3, 21 Oct 2022
I found this manuscript is very confusing. I am not sure about their numerical experiments. Before I have a good understanding of their experiments I cannot give a good review on the results. Below are my comments for now. I am happy to give more detailed review after I have a better understanding of their numerical experiments from their revised manuscirpt.

The title mentioned “quantifying uncertainties in geographic extrapolation”. I am wondering how the authors quantified the uncertainties. This uncertainty quantification is one of the objectives of this study if I understand the authors correctly, but I did not see any related discussion in the introduction till the results analysis.

The conclusion in the abstract said “This study provides insight into the selection of input datasets and ML algorithms with different sets of hyperparameters for a geographic streamflow extrapolation.” I am wondering what the insights are specifically.

The effectiveness of transfer learning depends on the similarity of the source and the targe. I am wondering whether the authors performed a similarity analysis which I think it is important to analyze the effectiveness of the extrapolation. And it might explain that adding more sample data from the souces did not improve the performance in predicting the targets.

Line 107, what “hypothesis”?

Why specifically chose these three ML methods? How about the more recently widely used LSTM network? It is known that these three chosen ML methods cannot learn the temporal dependence and the memory effects of the dynamic inputs on streamflow outputs.

Did the authors consider the influence of lagged P and T on current streamflow when they designed the numerical simulations?

Please be specific about the input and output data. Both spatial and temporal data were considered, how the authors split the data for training-validation-testing in terms of both space (i.e., catchments) and time period. The description of 25%-25%-50% of the total number of data is very vague. I do not know what the total number of data represent?

I am confused about the local-based models. It said “using target catchments to train the ML algorithms”, did it also include the source catchments or just target catchments?

Figure 2. I am confused about the total data, i.e., training is about 25% of total. Did this total data include all five regions (source +target) or just source/target?

Table 3 and the 7 experiments need more explanation. I am not sure what these 7 experiments are.

Line 241, for each of these 100 simulations, the hyperparameter tuning was performed and the best results were presented? Please clarify.
Citation: https://doi.org/10.5194/hess-2022-320-RC3
- AC3: 'Reply on RC3', Hyunglok Kim, 19 Dec 2022
  
  Dear Reviewer,
  We appreciate your taking the time to review this manuscript and suggest many valuable comments.
  Please see the attached file.
  Sincerely,
  Manh-Hung Le
  
  Hyunglok Kim
  
  Stephen Adam
  
  Hong Xuan Do
  
  Peter A. Beling
  
  Venkataraman Lakshmi
  
  Citation: https://doi.org/10.5194/hess-2022-320-AC3

Status: closed

CC1:
'Comment on hess-2022-320', Alex Sun, 14 Sep 2022

PUB is an important and challenging research topic. This work applies the transfer learning concept by leveraging gages from the extended CAMELS data family. However, I feel the authors stopped short from addressing how much information is actually "transferrable" between different CAMELS datasets. My student and I recently assessed the feasibility of transfer learning between CAMELS-UK and CAMELS-US by using a relatively simple and yet popular tool arising from data clustering, namely, UMAP. We used the following common CAMELS attributes
common_attributes = [

'q_mean',

'runoff_ratio',

'slope_fdc',

'baseflow_index',

'stream_elas',

'high_q_freq',

'high_q_dur',

'low_q_freq',

'low_q_dur',

'zero_q_freq',

'hfd_mean',

'soil_depth_pelletier',

'p_mean',

'pet_mean',

'p_seasonality',

'frac_snow',

'aridity',

'high_prec_freq',

'high_prec_dur',

'low_prec_freq',

'low_prec_dur',

'root_depth_50',

'elev_mean']

The attached figure shows CAMELS-UK and CAMELS-US gages share little in common. They pretty much exist in their own clusters. We welcome the authors to check our calculations. If true, this indicates most information is extracted from local neighbors (i.e., the neighbors of a US gage most likely exist in the US). I suggest the authors add some sort of spatial pattern analysis to untangle the source-target relations (see Sun et al., 2021 and 2022).
References:
Sun, A. Y., Jiang, P., Mudunuru, M. K., & Chen, X. (2021). Explore Spatio‐Temporal Learning of Large Sample Hydrology Using Graph Neural Networks. Water Resources Research, 57(12), e2021WR030394.
Sun, A. Y., Jiang, P., Yang, Z. L., Xie, Y., & Chen, X. (2022). A graph neural network approach to basin-scale river network learning: The role of physics-based connectivity and data fusion. Hydrology and Earth System Sciences Discussions, 1-35.

Citation: https://doi.org/10.5194/hess-2022-320-CC1
- CC2: 'Reply on CC1', Manh-Hung Le, 05 Nov 2022
  
  Thank you very much for your constructive comments. Since our study aims to examine our method in a real-world case study, we focus on transforming climatological streamflow information. CAMELS databases are only accessible in the U.S., UK, Chile, Brazil, and Australia. It would be possible to collect more samples for our analysis by averaging streamflow to get a long-term hydroclimatology value per catchment. We have highlighted this limitation in our revised manuscript [In Section 5 – Limitations and further studies].
  
  “The purpose of our study is to investigate our method in the context of a real-world case study, so we focus on transforming climatological streamflow information. We understand climatological streamflow may have practical limitations, however, by averaging streamflow to get a long-term hydroclimatology value per catchment, it might be possible to collect more samples for our analysis. There are publicly available CAMELS datasets with finer temporal resolution (daily time step). However, CAMELS databases are only accessible in the U.S., UK, Chile, Brazil, and Australia. Further refinements of streamflow prediction (e.g., daily) could be investigated in the future when CAMELS becomes available in a wider region.”
  With respect to your suggested references, we have incorporated these references in the revised manuscript because they are pertinent to our research.
  As per your suggestion, we conducted additional analysis using UMAP. Figure 1 (in the attached file), for example, (you can find all month’s UMAP result in the Supporting Information document), shows the spatial pattern analysis using UMAP which untangles the inputs for source and target regions for January (Other months' results are pretty comparable). It is interesting to note that the target catchments (rectangular) are mostly within catchments from the source, demonstrating a possibility that pre-trained ML over the source regions (circle, cross and plus marks) can predict the output at the target regions. However, it is worth noting that that the UMAP only displays the relative distance. Therefore, it would be premature to assume that pre-train ML models cannot predict streamflow in target regions even if the target and source UMAP values are far off.
  Manh-Hung Le,
  On behalf of all authors
  
  Citation: https://doi.org/10.5194/hess-2022-320-CC2
RC1:
'Comment on hess-2022-320', Anonymous Referee #1, 08 Oct 2022

Title: Streamflow Estimation in Ungauged Regions using Machine Learning: Quantifying Uncertainties in Geographic Extrapolation

General:

This paper attempts to make predictions of monthly averaged streamflow in data scarce regions with machine learning models that were trained in data rich regions. The test their predictions with different permutations of training regions. As expected, the models perform better with different climates and catchments attributes in the training set. Interestingly, however, the results suggest that models trained in North and South America are more reliable than models trained in Europe. They also find, as expected, that extreme gradient boost outperforms support vector mache and random forest. The paper is written fairly well, with exceptions noted below, and provides additional support for the well established conclusion that machine learning models trained on diverse data sets can be useful outside the basins which they are trained. This paper expands that conclusion by transferring the learned models to entirely new regions, in particular to data sparse regions, which is important, as the authors point out.

It was not clear to me if these models were forward looking or backward. I am not entirely sure how useful a monthly average streamflow prediction is in practice, especially if the forcings which drive the prediction are aggregated over that particular month, which would have the prediction a backward estimate. If, however, the forcings are aggregates from the previous month, then this is valuable to water resources management. I ask the authors to make this clarification in their data and methodology sections.

This paper omits non-machine learning models from the study because they are harder to set up. And unfortunately there is no benchmark model/s presented. I believe that this could potentially draw criticism. I do fully understand the need for easy-to-use models in some situations. I would encourage the authors to rethink their framing of the model selection in the introduction and conclusion. Perhaps it would be good to make a case for the benefits of easy-to-use models, and make a case that these shallow learning models are suitable for monthly averaged streamflow over the state-of-the-art LSTM mode, which has been shown again and again to outperform other streamflow models, even when trained out of sample.

Abstract:

Line 21 has double periods.

Introduction:

Lines 38 and 39 claim about stream gauges being the most accurate way to measure streamflow is vague and trivial. Are you making a distinction between remote sensing and in situ measurements? There are many methods of gauging a stream, some more accurate than others. I’m not sure what is the purpose of the sentence, remove or clarify.

Lines 78 and 79: If there is a good argument that ML is not the most promising approach, I’d like to see a citation. Otherwise just state it directly as “machine learning models are arguably one of the most promising approaches”

Line 86: I’m not sure it is obvious what at “traditional” hydrological model is.

Line 107-108: I assumed your hypothesis was about ML model’s ability to transfer learning from one region to another, but here you claim that you use ML models because they are easier to set up?

Lines 108-109: I think the last sentence of this paragraph is fragmented. What kind of water resources prediction? In what context are the water resources secure or insecure?

Data:

Line 127: What is the rational for removing values greater than 2,000 cms?

Line 140: Can you make it clear if your model is making a forward or backward prediction? Is your monthly forcing aggregates from the same month in which your monthly averaged streamflow comes from?

Figure 1: What unit is catchment density?

Methodology:

Lines 207-209: This wording is a little confusing. Can you rephrase to make it clear that the validation set was used to tune the hyper-parameters? Meaning, your training set is used to get the model weights, and then you check the quality of those weights by calculating an error on the validation set, then modify a hyper-parameter and train again, then check the quality of the new weights on the validation set. And to be clear, you do not calculate any error on the test set until the hyper-parameters have been chosen, right?

Table 3: Consider moving the regions into the table header, instead of as a note.

Results and Discussion:

Line 236: “The local-based models also served as benchmark models” This should be moved to the methods section.

Limitations and further studies:

Line 326: In parentheses you have “daily or monthly”, but I think you meant “daily or hourly”

Conclusions

Line 334-335: “ML algorithms to quickly test our hypothesis since ML algorithms could be easier to set up than traditional hydrological models.” I think this is a bad reason to us ML. There is no use doing a study with one tool instead of another simply because it is easier.

Line 351: double periods.

Citation: https://doi.org/10.5194/hess-2022-320-RC1
- AC1: 'Reply on RC1', Hyunglok Kim, 19 Dec 2022
  
  Dear Reviewer,
  We appreciate your taking the time to review this manuscript and suggest many valuable comments.
  Please see the attached file.
  Sincerely,
  Manh-Hung Le
  
  Hyunglok Kim
  
  Stephen Adam
  
  Hong Xuan Do
  
  Peter A. Beling
  
  Venkataraman Lakshmi
  
  Citation: https://doi.org/10.5194/hess-2022-320-AC1
RC2:
'Comment on hess-2022-320', Anonymous Referee #2, 20 Oct 2022

This paper investigated the monthly streamflow prediction in ungauged regions using Machine Learning (ML) based methods. The authors compared three ML methods in global basins with two large regions as data-poor targets. The overall structure of this ms is clear to follow and the topic is intriguing to me. I have some comments as shown below on better clarifying the methodology and performing more profound discussions on the results to safely draw the conclusion. Hopefully these suggestions can help to improve the quality of this study.

Introduction: The authors did a good job here with a comprehensive review on the present studies and I enjoyed reading this part.

Methodology:

To my knowledge, the present cutting-edge ML applications in streamflow prediction mainly focus on daily prediction with deep learning models like LSTM which show superior performance over other models as shown by several studies already cited in this ms. The advantage of DL models over traditional ML was not only shown in hydrology but also in many other fields. I feel the authors may discuss more on the motivation of their choices on monthly prediction and model selection with traditional ML methods.

Better clarification on the framework and experiment design is needed to help readers easily understand the method section. I am quite confused about the meaning of “100” mentioned in line 219 and throughout the ms. Does this mean a 100-fold cross validation to cover all the available data? If so, there would be no basin overlapping for each testing but how the 100 simulation range comes then? I also didn’t understand how the training, validation and testing dataset were formed with limited details given. How do you organize and divide data in the time dimension? The streamflow prediction is a time dynamic problem and I see the authors use data across multiple decades, however I only find the results reported for 12 individual months without time continuous information given.

If I understand correctly, the authors train individual models for different months. I am just curious how this choice was made and how the model would behave with one model trained on data from all months instead, especially given the power of ML models handling big data.

Results:

Reading through the result section, I hope the authors can do a more profound analysis and discussion on their results, not limited to simply describing the figures. The present figures are kind of redundant to me especially without many discussions involved. You may consider removing some unnecessary ones.

For the PUR performance evaluation, the authors need to clarify more about the absolute performance in target regions, not only the performance difference from the local models. It’s quite intuitive to get worse PUR performance compared with local models, but the readers care more about the direct evaluation, like how will ML models behave and can we get functional models for predictions in ungauged regions? Looking at Figure 8, I feel the absolute PUR performances are mostly close to KGE value of 0.0 (y axis starting at -2.0 can be somehow misleading to readers), which implies unsatisfactory performance for a functional model.

It’s quite interesting and also surprising to me for the statement of line 290 that including more training data (EX7 here) leads to lower performance. I hope the authors can have more investigation and discussions on this point, which could be quite controversial given the common agreement that ML models usually benefit from bigger data. Thinking about this, I feel it may depend on different scenarios, such as different types of models used with different capacities to handle large data, and how you train and evaluate the model - the model with more input data may not get optimized which leads to underfitting. Taking one example, for experiments EX1-EX7, the optimized hyperparameters can be different with varying training data availability, and a fair comparison should be built on the optimized conditions of different models.

I didn’t understand the results shown in Figure 3 well. Are these the results on source (gauged) or target (ungauged) basins? Are they reported on the testing data, and if so how did you divide the testing data?

Conclusion:

As mentioned in the above comments, I feel the two key points in line 341 and line 343 are kind of contradictory regarding whether more diverse data can lead to better performance or not. The authors should carefully investigate this point before drawing a conclusion here. In addition, as mentioned previously, more analyses on the absolute PUR performance are needed to get the strong conclusion in line 351 that these models can be capable of solving PUR problems in ungauged regions, especially given the deteriorated performance shown in Figure 8.

Citation: https://doi.org/10.5194/hess-2022-320-RC2
- AC2: 'Reply on RC2', Hyunglok Kim, 19 Dec 2022
  
  Dear Reviewer,
  We appreciate your taking the time to review this manuscript and suggest many valuable comments.
  Please see the attached file.
  Sincerely,
  Manh-Hung Le
  
  Hyunglok Kim
  
  Stephen Adam
  
  Hong Xuan Do
  
  Peter A. Beling
  
  Venkataraman Lakshmi
  
  Citation: https://doi.org/10.5194/hess-2022-320-AC2
RC3:
'Comment on hess-2022-320', Anonymous Referee #3, 21 Oct 2022
I found this manuscript is very confusing. I am not sure about their numerical experiments. Before I have a good understanding of their experiments I cannot give a good review on the results. Below are my comments for now. I am happy to give more detailed review after I have a better understanding of their numerical experiments from their revised manuscirpt.

The title mentioned “quantifying uncertainties in geographic extrapolation”. I am wondering how the authors quantified the uncertainties. This uncertainty quantification is one of the objectives of this study if I understand the authors correctly, but I did not see any related discussion in the introduction till the results analysis.

The conclusion in the abstract said “This study provides insight into the selection of input datasets and ML algorithms with different sets of hyperparameters for a geographic streamflow extrapolation.” I am wondering what the insights are specifically.

The effectiveness of transfer learning depends on the similarity of the source and the targe. I am wondering whether the authors performed a similarity analysis which I think it is important to analyze the effectiveness of the extrapolation. And it might explain that adding more sample data from the souces did not improve the performance in predicting the targets.

Line 107, what “hypothesis”?

Why specifically chose these three ML methods? How about the more recently widely used LSTM network? It is known that these three chosen ML methods cannot learn the temporal dependence and the memory effects of the dynamic inputs on streamflow outputs.

Did the authors consider the influence of lagged P and T on current streamflow when they designed the numerical simulations?

Please be specific about the input and output data. Both spatial and temporal data were considered, how the authors split the data for training-validation-testing in terms of both space (i.e., catchments) and time period. The description of 25%-25%-50% of the total number of data is very vague. I do not know what the total number of data represent?

I am confused about the local-based models. It said “using target catchments to train the ML algorithms”, did it also include the source catchments or just target catchments?

Figure 2. I am confused about the total data, i.e., training is about 25% of total. Did this total data include all five regions (source +target) or just source/target?

Table 3 and the 7 experiments need more explanation. I am not sure what these 7 experiments are.

Line 241, for each of these 100 simulations, the hyperparameter tuning was performed and the best results were presented? Please clarify.
Citation: https://doi.org/10.5194/hess-2022-320-RC3
- AC3: 'Reply on RC3', Hyunglok Kim, 19 Dec 2022
  
  Dear Reviewer,
  We appreciate your taking the time to review this manuscript and suggest many valuable comments.
  Please see the attached file.
  Sincerely,
  Manh-Hung Le
  
  Hyunglok Kim
  
  Stephen Adam
  
  Hong Xuan Do
  
  Peter A. Beling
  
  Venkataraman Lakshmi
  
  Citation: https://doi.org/10.5194/hess-2022-320-AC3

Manh-Hung Le, Hyunglok Kim, Stephen Adam, Hong Xuan Do, Peter Beling, and Venkataraman Lakshmi

Supplement

https://doi.org/10.5194/hess-2022-320-supplement

Manh-Hung Le, Hyunglok Kim, Stephen Adam, Hong Xuan Do, Peter Beling, and Venkataraman Lakshmi

Viewed

Total article views: 2,100 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
1,337	704	59	2,100	124	55	66

HTML: 1,337
PDF: 704
XML: 59
Total: 2,100
Supplement: 124
BibTeX: 55
EndNote: 66

Views and downloads (calculated since 09 Sep 2022)

Month	HTML	PDF	XML	Total
Sep 2022	351	129	7	487
Oct 2022	181	54	7	242
Nov 2022	72	39	2	113
Dec 2022	60	30	7	97
Jan 2023	35	24	0	59
Feb 2023	47	49	0	96
Mar 2023	28	26	1	55
Apr 2023	45	21	0	66
May 2023	40	7	0	47
Jun 2023	30	6	1	37
Jul 2023	33	36	1	70
Aug 2023	22	18	2	42
Sep 2023	27	22	2	51
Oct 2023	25	13	0	38
Nov 2023	14	9	0	23
Dec 2023	14	9	1	24
Jan 2024	17	27	2	46
Feb 2024	16	13	2	31
Mar 2024	24	25	2	51
Apr 2024	28	8	5	41
May 2024	16	13	3	32
Jun 2024	32	8	3	43
Jul 2024	14	11	1	26
Aug 2024	16	3	0	19
Sep 2024	14	5	0	19
Oct 2024	11	4	0	15
Nov 2024	11	6	1	18
Dec 2024	9	11	0	20
Jan 2025	8	11	2	21
Feb 2025	16	7	1	24
Mar 2025	17	7	3	27
Apr 2025	13	11	0	24
May 2025	18	9	2	29
Jun 2025	28	24	1	53
Jul 2025	5	9	0	14

Cumulative views and downloads (calculated since 09 Sep 2022)

Month	HTML	PDF	XML	Total
Sep 2022	351	129	7	487
Oct 2022	181	54	7	242
Nov 2022	72	39	2	113
Dec 2022	60	30	7	97
Jan 2023	35	24	0	59
Feb 2023	47	49	0	96
Mar 2023	28	26	1	55
Apr 2023	45	21	0	66
May 2023	40	7	0	47
Jun 2023	30	6	1	37
Jul 2023	33	36	1	70
Aug 2023	22	18	2	42
Sep 2023	27	22	2	51
Oct 2023	25	13	0	38
Nov 2023	14	9	0	23
Dec 2023	14	9	1	24
Jan 2024	17	27	2	46
Feb 2024	16	13	2	31
Mar 2024	24	25	2	51
Apr 2024	28	8	5	41
May 2024	16	13	3	32
Jun 2024	32	8	3	43
Jul 2024	14	11	1	26
Aug 2024	16	3	0	19
Sep 2024	14	5	0	19
Oct 2024	11	4	0	15
Nov 2024	11	6	1	18
Dec 2024	9	11	0	20
Jan 2025	8	11	2	21
Feb 2025	16	7	1	24
Mar 2025	17	7	3	27
Apr 2025	13	11	0	24
May 2025	18	9	2	29
Jun 2025	28	24	1	53
Jul 2025	5	9	0	14

Viewed (geographical distribution)

Total article views: 1,996 (including HTML, PDF, and XML) Thereof 1,996 with geography defined and 0 with unknown origin.

Country	#	Views	%

Cited

Latest update: 18 Jul 2025

Short summary

Limited ground data makes streamflow information difficult to obtain in ungauged regions. We demonstrate how data-rich areas (North America, South America, and Western Europe) can provide streamflow information to data-poor areas (South Africa, Central Asia). By using machine learning algorithms, we observed diverse climate and catchment attributes that could be useful for our demonstration. In this study, we attempt to understand the uncertainty associated with geographic extrapolation.


Total:	0
HTML:	0
PDF:	0
XML:	0

Streamflow Estimation in Ungauged Regions using Machine Learning: Quantifying Uncertainties in Geographic Extrapolation

Supplement

Viewed

Viewed (geographical distribution)

Cited

3 citations as recorded by crossref.