the Creative Commons Attribution 4.0 License.
A National Scale Hybrid Model for Enhanced Streamflow Estimation – Consolidating a Physically Based Hydrological Model with Long Short-term Memory Networks
Abstract. Accurate streamflow estimation is essential for effective water resources management and adapting to extreme events in the face of changing climate conditions. Hydrological models have been the conventional approach for streamflow inter/extrapolation in time and space for the past decades. However, their large-scale applications have encountered challenges, including issues related to efficiency, complex parameterization, and constrained performance. Deep learning methods, such as Long Short-Term Memory networks (LSTM), have emerged as a promising and efficient approach for large-scale streamflow estimation. In this study, we conducted a series of experiments to identify optimal hybrid modelling schemes to consolidate physically based models with LSTM aimed at enhancing streamflow estimation in Denmark.
The results showed that the hybrid modelling schemes outperformed the Danish Water Resources Model (DKM) in both gauged and ungauged basins. While the standalone LSTM rainfall-runoff model outperformed DKM in many basins, it faced challenges when predicting streamflow in groundwater-dependent catchments. A serial hybrid modelling scheme (LSTM-q), which used DKM outputs and climate forcings as dynamic inputs for LSTM training, demonstrated higher performance. LSTM-q improved the median Nash-Sutcliffe Efficiency (NSE) by 0.18 in gauged basins and 0.11 in ungauged basins compared to DKM. Similar accuracy improvements were achieved with alternative hybrid schemes, i.e., by predicting the residuals between DKM-simulated streamflow and observations using an LSTM. Moreover, the developed hybrid models enhanced the accuracy of extreme events, which encourages the integration of hybrid models within an operational forecasting framework. This study highlights the advantages of synergizing existing physically based hydrological models with LSTM models, and the proposed hybrid schemes hold the potential to achieve high-quality, large-scale streamflow estimations.
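For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of the standard Nash-Sutcliffe Efficiency, together with the general idea behind a residual-error scheme (adding a predicted residual to a baseline simulation). The function names and the toy numbers are illustrative assumptions; they are not taken from the manuscript, and the "predicted" residual here is simply the true residual, standing in for what a trained model would output.

```python
from statistics import fmean


def nse(observed, simulated):
    """Nash-Sutcliffe Efficiency: 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = fmean(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    if sst == 0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - sse / sst


def apply_residual_correction(baseline, predicted_residual):
    """Add a predicted residual (observed minus baseline) to a baseline simulation."""
    return [b + r for b, r in zip(baseline, predicted_residual)]


# Toy values only (hypothetical, not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
baseline = [1.5, 2.5, 2.5, 3.5]  # stands in for a physically based simulation
# Stand-in for a learned residual: here the exact residual, for illustration.
residual = [o - b for o, b in zip(observed, baseline)]
corrected = apply_residual_correction(baseline, residual)

print(nse(observed, baseline))
print(nse(observed, corrected))
```

An NSE of 1 indicates a perfect match, and an NSE of 0 means the simulation is no better than the mean of the observations.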
Status: final response (author comments only)
-
RC1: 'Comment on hess-2023-235', Anonymous Referee #1, 26 Dec 2023
The authors compare the performance of different types of hybrid models based on LSTM networks and a physically-based model, DKM, in estimating streamflow for Denmark. Generally, the hybrid models outperform the LSTM rainfall-runoff model (LSTM-rr) in ungauged basins. They find that the hybrid dynamic inputs LSTM model (LSTM-q) and the LSTM residual error model (LSTM-qr) have the best overall performance. The hybrid models also improve streamflow estimates in groundwater-dependent basins. The study is interesting, and the authors provide a comprehensive discussion. My concerns about the study are as follows.
In this study, LSTM-q and LSTM-qr are the optimal streamflow models for Denmark. Will these models be optimal in other regions of the world? I am worried that this might be a local conclusion. The authors discuss it in the Discussion section (Lines 504-512) by comparing their conclusion with that of Konapala et al. (2020). They argue that the difference in the best hybrid model between the two studies is due to the higher accuracy of DKM. Is it possible that the difference is also due to the different study domains, i.e., Denmark in this study and CONUS in Konapala et al. (2020)? In general, CONUS has a much deeper groundwater table than Denmark. The authors may evaluate the performance of the different hybrid models across various groundwater table depth ranges to check whether the conclusion changes.
The common practice is to separate the data into training, validation and testing sets in chronological order. Why do the authors choose 2011-2019 as the validation period and 1990-1999 as the testing period? Does the selected testing period have fewer human impacts on streamflow? In addition, Section 3.2 compares the event performance of the LSTM hybrid models during the validation period (2011-2019). At least 80% of the data used in that section are validation data, which the hybrid models have already seen during hyperparameter tuning. The performance of the hybrid models is therefore expected to be good, but it may not provide reliable information.
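The concern above is about keeping the evaluation periods disjoint from the data used for tuning. A minimal sketch of a period-based split with an overlap check is given below. The validation and testing years are those quoted in this comment; the training years are a hypothetical placeholder, since the comment does not state them, and the helper name and toy records are likewise assumptions.

```python
def split_by_period(records, periods):
    """Assign (year, value) records to named periods given inclusive (start, end) years."""
    spans = sorted(periods.values())
    # Reject overlapping periods so no year can leak between subsets.
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        if next_start <= prev_end:
            raise ValueError("periods overlap")
    subsets = {name: [] for name in periods}
    for year, value in records:
        for name, (start, end) in periods.items():
            if start <= year <= end:
                subsets[name].append((year, value))
                break
    return subsets


periods = {
    "testing": (1990, 1999),     # as quoted in the comment
    "training": (2000, 2010),    # hypothetical placeholder, not stated in the comment
    "validation": (2011, 2019),  # as quoted in the comment
}

# Toy annual records (hypothetical values), one per year.
records = [(year, float(year - 1990)) for year in range(1990, 2020)]
subsets = split_by_period(records, periods)

for name, rows in subsets.items():
    print(name, rows[0][0], rows[-1][0], len(rows))
```

With such a split, any event analysis meant to reflect out-of-sample skill would draw only on the testing subset.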
In Table 1, why do the hybrid models include phreatic depths at both 100 and 500 m resolutions as inputs?
Please improve the quality of the figures, particularly Figure 8. If possible, please also increase the font sizes in the figures.
Specific comments:
- Line 129-130: Might change “hidden unit sizes” to “hidden neurons”.
- Section 2.5: “Table 3” in Lines 314 and 330 should be “Table 4”.
- Table 5: Maybe only write the best evaluation scores in bold for better visualization.
- Figure 8: The word “bias” does not seem to be the right word to use in the caption, as it suggests an error. Please consider replacing it with a word like “difference”.
Citation: https://doi.org/10.5194/hess-2023-235-RC1
AC1: 'Reply on RC1', Jun Liu, 16 Jan 2024
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2023-235/hess-2023-235-AC1-supplement.pdf
-
RC2: 'Comment on hess-2023-235', Anonymous Referee #2, 27 Dec 2023
This manuscript proposes different combinations of long short-term memory (LSTM) networks with a physical model, the Danish Water Resources Model (DKM), to improve the accuracy of streamflow prediction. It is suggested that the hybrid models improve accuracy in both ungauged and gauged basins. The authors further point out that the hybrid models could enhance the accuracy of extreme events. The knowledge gap is convincing and the paper is clear. However, I have some comments that should be addressed before publication.
Specific comments:
1) In the introduction, the authors claim that the DKM model is a well-established groundwater modeling system (Line 100). However, the paper lacks observational evidence, particularly in the results, to support this claim. Supporting references are suggested.
2) Sections 2.1 and 2.2.4 overlap and could be merged into a single section.
3) A table is suggested to compare the inputs and outputs of each type of model in section 2.3.
4) The caption of Figure 2 is not clear. Please explain each of the subplots.
5) In section 3.2, the operational forecasting framework uses only observations of meteorological factors as model inputs. However, many studies also include historical simulations or observations as model inputs, which contributes to forecast skill. Please comment on the impact of including such historical series as model inputs in this case.
6) In the discussion part, it is interesting to see the conclusion on model performance: LSTM-qr ≈ LSTM-q > LSTM-rr > LSTM-qf > LSTM-pf ≈ DKM. Could the authors explain why LSTM-rr performed better than LSTM-pf, given that LSTM-pf is a pretrained and fine-tuned LSTM-rr?
Citation: https://doi.org/10.5194/hess-2023-235-RC2
AC2: 'Reply on RC2', Jun Liu, 16 Jan 2024
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2023-235/hess-2023-235-AC2-supplement.pdf
-
RC3: 'Comment on hess-2023-235', Anonymous Referee #3, 03 Jan 2024
The paper aims to investigate the advantages of utilizing a distributed Process-based Model (PBM) in implementing an LSTM representation for streamflow prediction. The researchers tested various traditional combinations to analyze the pros and cons of each configuration. They concluded that LSTM with the output of the PBM as input (LSTM-q) and an LSTM model learning the residual error of PBM (LSTM-qr) were the best models.
One of the interesting findings of the study is that the hybrid model requires less memory (sequence length) than a simple LSTM (LSTM-rr). This indicates that by using PBM in LSTM, it can incorporate longer temporal dependences, which mitigates one of the issues of LSTM representation. Another notable finding related to this is that LSTM decreases performance in groundwater-dominated catchments, as suggested by other studies.
However, I have some major comments about the lack of clarity in defining the criteria for the best model and the explanation of some figures. Although the authors defined several metrics to evaluate the model, it was not clear which one or what combination of them was used to define the best model. Additionally, in many cases, the differences in performance are so small that they are probably not statistically significant.
Moreover, some figures were presented without any further explanation. For instance, Figure 3 shows 16 subplots, but the text only mentions two lines about it. It's crucial to present figures that support the story presented in the paper, and if a figure isn't explained, it should go to the appendix. However, the authors should try to analyze each figure as much as possible because they will find more details that support their findings.
Minor comments.
Line 29: If you mention extra/interpolation, please explain why it is important for your goal.
Line 59-60: The statement could mislead readers to believe that only DL methods experience a decline in performance. Please modify it.
Line 97: Please use a software or a method to verify references as I found an incorrect citation. It should be De la Fuente et al. (https://doi.org/10.5194/hess-2023-252) instead of Fuente et al.
Line 98: I agree with the sentence, but you should provide references, considering the "limited attention" given to the topic.
Table 1: It would be very useful to add some summary statistics, such as range and mean.
Line 291. Why did you change the loss function? Different loss functions emphasize different components of the error.
Line 307. The hyperparameter search could generate some inconsistencies because only 6.2% of the parameter space is being explored (100/1620). To address this, it is recommended to fix hyperparameters with low sensitivity such as LR, BC, NE, and DR. This way, a more detailed exploration of the hyperparameters that matter can be carried out.
Table 3. How much is the difference with the second best? It is a little strange that models with the simulated streamflow as input have more hidden cells than the baseline that is learning the entire dynamic of the system.
Table 5. If the difference between the mean and median is not mentioned, it's recommended to either delete one of them or move the discussion to the appendix.
Line 348. LSTM-q outperforms LSTM-qr only for NSE2. Please, check your analyses.
Line 353. In the text, it is mentioned that LSTM-rr has a lower NSE_log than DKM, but the table shows the opposite.
Figure 3. You did not analyze the figure. Therefore, you should either delete or move it to the appendix. However, I believe that this figure is more informative than Table 5, so I encourage you to describe it.
Figure 3. Cumulative distribution functions (CDFs) are typically displayed with metrics on the x-axis and probability on the y-axis. To improve clarity, it is suggested to limit the axis for KGE and NSE between [0,1]. This will provide a better visualization of the behavior of each line. Additionally, you may consider showing only some of the metrics in detail, while the rest could be placed in the appendix.
Line 360. A value greater than zero is still unsatisfactory. Please rephrase this sentence.
Line 363. It can be difficult for someone from another country to identify specific areas. Adding a latitude-longitude grid can help with this issue.
Figure 4. If you are not going to describe the other maps, move them to the appendix and present only the histogram. The current color scheme makes it difficult to identify patterns. To address this, Figure a) could benefit from a traffic light palette (green for good, yellow for regular, and red for bad). Meanwhile, the other figures could use a palette that transitions from white in the center to either red or blue on the extremes, respectively. This approach will allow readers to better focus on relevant changes.
Figure 5. It is difficult to distinguish between the colors of LSTM-rr and LSTM-qf. Adding the KGE or NSE of each model to the legend would provide an additional comparison.
Line 392. Please provide references to support the fact that this finding is not surprising. As far as I know, lumped models tend to perform poorly in predicting larger areas when there is a non-uniform distribution of precipitation. However, larger catchments are usually influenced more by baseflow which should be easier to predict. Hence, to validate this conclusion, it would be helpful to compare it with other studies conducted in the region.
Line 397. To gain a better understanding of the region, please add the range of groundwater levels.
Line 402. I would like to draw your attention to the fact that the correlation analysis was conducted using only one variable. This means that any interaction between attributes has been overlooked. Additionally, the Pearson correlation only represents linear correlations, which underestimates more complex relationships. To address these issues, I suggest using the Spearman correlation and discarding low correlation values. Alternatively, you could build a random forest model to examine how the combination of attributes affects performance.
Figure 6. Please double-check the color bar and ensure that the white color is set to zero.
Line 428. The sentence "chosen based on their superior performance" is not accurate as the performance of the LSTM models is similar. Additionally, the models were not statistically compared.
Figure 7. The figure contains too much information with the density function and histogram. It would be better to only show the density function and widen the figure.
Line 445-446. Could you describe the locations of these regions?
Line 448. Figure 7 exhibits an underestimation of the streamflow. Can you explain the source of this statement?
Line 460. Could you please clarify on which results this conclusion is based?
Figure 9. Why aren't the values from Table 5 included in this figure? Also, why is LSTM-rr showing lower values than the others?
Line 481. If there are not many studies that have made this comparison, you should include references to those studies.
Line 498. Can we attribute this improvement to the use of longer memory, as shown in Figure 9?
Line 500. You must explain your ranking process as you are using multiple metrics. Are you using one of them, their average, or some combination? Additionally, many of the models may not be statistically significant.
Line 525. Could you please clarify what you mean by shallow structures? It seems contradictory to say that 256 hidden cells in parallel with 365 days of sequence length is a shallow structure. Many studies have shown that using more than 2 layers in a series does not result in significant improvement. This suggests that a single layer has sufficient complexity to capture the necessary processes.
Line 526. I do not believe that the features and attributes are deeply hidden. The problem lies in the representation and inputs used to extract information.
Line 529-530. Instead of making your representation deeper to increase its complexity, you should try using a multi-representation approach. Different representations or architectures can capture different pieces of information. Using local models may alleviate the issue, but it does not solve it. This is like trying to approximate a high-degree polynomial by using only order-2 polynomial segments. Adding more order-2 polynomial segments does not increase the complexity; it only segments the extraction of information.
Line 542. Some researchers have used graph neural networks with or without routing parameters in training. Mention them.
Line 559. Are these values significantly different from those of LSTM-rr? For this reason, you must be more specific about your final multi-objective criteria.
Line 561-562. Why were LSTM-qr and LSTM-q not always able to beat the DKM model despite using its outputs?
Line 567-568. Do you have any suggestions or ideas?
Line 570. It would be helpful if you could mention the complex hydrological processes.
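To illustrate the point raised for Line 402, the sketch below contrasts Pearson and Spearman correlation on a monotonic but non-linear relationship. It is a plain-Python illustration under stated assumptions: the variable names and toy values are hypothetical and unrelated to the manuscript's data, and Spearman is computed here as the Pearson correlation of average ranks.

```python
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient (measures linear association)."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical catchment attribute and a cubic (monotonic, non-linear) response.
attribute = [1.0, 2.0, 3.0, 4.0, 5.0]
score_change = [a ** 3 for a in attribute]

print(pearson(attribute, score_change))
print(spearman(attribute, score_change))
```

In this toy case the rank correlation is perfect while the linear correlation is not, which is the kind of relationship a Pearson-only analysis would understate. Neither coefficient captures interactions between attributes, which is where the suggested random forest analysis would come in.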
Citation: https://doi.org/10.5194/hess-2023-235-RC3
AC3: 'Reply on RC3', Jun Liu, 16 Jan 2024
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2023-235/hess-2023-235-AC3-supplement.pdf