This work is distributed under the Creative Commons Attribution 4.0 License.
Comparing machine learning and deep learning models for probabilistic post-processing of satellite precipitation-driven streamflow simulation
Yuhang Zhang
Phu Nguyen
Bita Analui
Soroosh Sorooshian
Kuolin Hsu
Yuxuan Wang
Abstract. Deep learning (DL) models are popular but computationally expensive, whereas machine learning (ML) models are older but more efficient. How the two differ in hydrological probabilistic post-processing remains unclear. This study conducts a systematic comparison between the quantile regression forest (QRF) model and the probabilistic long short-term memory (PLSTM) model as hydrological probabilistic post-processors. Specifically, we compare how the two models handle biased streamflow simulations driven by three kinds of satellite precipitation products in 522 sub-basins of the Yalong River basin, China. Model performance is comprehensively assessed with a series of scoring metrics from both probabilistic and deterministic perspectives. In general, the QRF and PLSTM models are comparable in probabilistic prediction, and their relative performance is closely related to the flow accumulation area of the sub-basin. For sub-basins with a flow accumulation area smaller than 60,000 km2, the QRF model outperforms the PLSTM model in most sub-basins; for sub-basins larger than 60,000 km2, the PLSTM model has a clear advantage. In terms of deterministic prediction, the PLSTM model is preferable to the QRF model, especially when the raw streamflow simulation used as input is poor. Setting predictive performance aside, the QRF model is more efficient in all cases, requiring about half the computing time of the PLSTM model. This study deepens our understanding of ML and DL models in hydrological post-processing and supports more appropriate model selection in practice.
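To make the post-processing setup concrete, here is a minimal, hypothetical sketch of QRF-style quantile prediction; it is not the authors' implementation, and all variable names and data are invented. A random forest maps the raw simulated streamflow and auxiliary predictors to the reference streamflow, and predictive quantiles are approximated from the spread of per-tree predictions (Meinshausen's original QRF instead uses the weighted empirical distribution of the training samples in the leaves; the per-tree spread is a common, cheaper approximation).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical predictors: raw simulated streamflow plus auxiliary inputs
# (e.g. satellite precipitation, day of year); target: reference streamflow.
rng = np.random.default_rng(0)
X = rng.random((1000, 3))                        # placeholder training features
y = 10.0 * X[:, 0] + rng.normal(0.0, 1.0, 1000)  # placeholder target

forest = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=10, random_state=0
).fit(X, y)

# Approximate predictive quantiles from the spread of per-tree predictions.
X_new = rng.random((5, 3))
tree_preds = np.stack([t.predict(X_new) for t in forest.estimators_])
q05, q50, q95 = np.percentile(tree_preds, [5, 50, 95], axis=0)
print(q50, q95 - q05)  # median prediction and width of the 90 % interval
```

The PLSTM side of the comparison would, in the same spirit, replace the forest with an LSTM whose output layer parameterizes a predictive distribution, trained on the same predictor-target pairs.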
Status: closed
RC1: 'Comment on hess-2022-377', Anonymous Referee #1, 15 Dec 2022
This study compares two post-processing methods for streamflow simulations obtained using different satellite-based precipitation products. A comprehensive evaluation is performed on 522 sub-catchments located in China to assess performance in terms of reliability, sharpness, and various hydrological skill measures. The paper is well written and complete, the figures are clear, and the interpretations of the results are convincing. My recommendation is that the paper can be accepted for publication after the minor corrections listed below.
l.44-46: I strongly disagree with this statement. There is no evidence that satellite precipitation estimation is the most promising hydrological model input. As an example, ERA5 is mostly driven by satellite data and yet is not able to reproduce most precipitation features at high spatial resolution (Bandhauer et al., 2022; Reder et al., 2022), does not capture the strong relationships between precipitation characteristics and topography in mountainous areas, underestimates hourly and daily extreme values, and overestimates the number of wet days (Bandhauer et al., 2022). At high spatial and temporal resolutions, the assimilation of ground measurements and/or radar data is needed to reproduce extreme events (Reder et al., 2022). However, I agree that satellite precipitation estimation is valuable in regions where ground measurements are scarce.
l.75: A more recent application of the MOS method is provided by Bellier et al. (2018).
l.80: short memory: I guess that ‘term’ is missing between ‘short‘ and ‘memory’.
l.123: serval -> several.
l.195: “so the model is reliable”. Is it possible to rephrase the sentence to indicate that this is an assumption and not your personal judgement? As the authors do not provide evidence that the model is able to reproduce the natural runoff process (I understand that it is not possible), it would be fairer.
l.247: Klotze -> Klotz.
l.255-256: The terms “single-model” and “multi-model” are a bit misleading, as I understand that the authors refer to precipitation products here. I suggest replacing them with “single-precipitation product” and “multi-precipitation product” or something similar.
l.348: Missing dot after “threshold”.
l.448: “Little precipitation events”: I was not sure if the authors refer to localized precipitation events here, or with moderate intensities. Is it possible to be more specific?
Bandhauer, Moritz, Francesco Isotta, Mónika Lakatos, Cristian Lussana, Line Båserud, Beatrix Izsák, Olivér Szentes, Ole Einar Tveito, and Christoph Frei. 2022. “Evaluation of Daily Precipitation Analyses in E-OBS (V19.0e) and ERA5 by Comparison to Regional High-Resolution Datasets in European Regions.” International Journal of Climatology 42 (2): 727–47. https://doi.org/10.1002/joc.7269.
Bellier, Joseph, Isabella Zin, and Guillaume Bontron. 2018. “Generating Coherent Ensemble Forecasts After Hydrological Postprocessing: Adaptations of ECC-Based Methods.” Water Resources Research 54 (8): 5741–62. https://doi.org/10.1029/2018WR022601.
Reder, A., M. Raffa, R. Padulano, G. Rianna, and P. Mercogliano. 2022. “Characterizing Extreme Values of Precipitation at Very High Resolution: An Experiment over Twenty European Cities.” Weather and Climate Extremes 35 (March): 100407. https://doi.org/10.1016/j.wace.2022.100407.
Citation: https://doi.org/10.5194/hess-2022-377-RC1
AC1: 'Reply on RC1', aizhong ye, 08 Feb 2023
We would like to thank Referee #1 for his/her constructive comments on our manuscript. Please find our responses in the supplement.
Citation: https://doi.org/10.5194/hess-2022-377-AC1
- AC2: 'Reply on RC1', aizhong ye, 08 Feb 2023
RC2: 'Comment on hess-2022-377', Anonymous Referee #2, 19 Dec 2022
I reviewed the manuscript entitled “Comparing machine learning and deep learning models for probabilistic post-processing of satellite precipitation-driven streamflow simulation” by Zhang et al. The manuscript compares the use of a machine learning method (QRF) and a deep learning method (PLSTM) for bias-correcting streamflow simulations. Because of the limited data availability of the region, the study uses streamflow driven by a reference precipitation product, rather than observed streamflow, as the reference for the bias correction. Overall, I have five major concerns.
- Lack of interpretation of the results
This study used several statistics for model performance evaluation, namely the continuous ranked probability score (CRPS), the weighted CRPS, the reliability diagram, and sharpness, and the figures/tables were used to demonstrate those statistics. My first and biggest concern is the lack of interpretation of what the figures/tables show. For example, I am less familiar with the concept of a reliability diagram; after reading section 4.2.4, I was still not able to understand what Figures 7 and 8 were showing. It seems that the optimum is to have lines following the diagonal, but how is “close to the diagonal line” defined quantitatively? If a line is close, the prediction is reliable, but what exactly is meant by a “reliable prediction”? If a line is mostly located above (below) the diagonal, it is an underestimation (overestimation) of what? Another example is the concept of sharpness. I was not able to understand this concept after reading lines 312-315, where it was introduced, and after reading section 4.2.5, the section dedicated to the sharpness-related results, I was even more puzzled. The section compared the variability of the different streamflow estimations, and it seems that if those statistics show smaller values (lower variability), then the model is better. Again, better for what, and why? The meaning is hard to interpret, probably because of the lack of description of those two methods (the reliability diagram and sharpness). Beyond those points, I also found the use of CRPS and twCRPS redundant (see the same pattern between panels a and c and panels b and d in Figure 4). The patterns of Figure 3 also need to be interpreted properly.
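For readers who share this difficulty: a reliability diagram plots, for each nominal quantile level τ, the fraction of observations that fall below the predicted τ-quantile, so a reliable forecast lies on the diagonal; sharpness is a property of the forecasts alone (e.g. prediction-interval width), and narrower is better only subject to reliability. Below is a minimal illustrative sketch of these computations and of a sample-based CRPS, independent of the manuscript's code; all names and data are invented.

```python
import numpy as np
from scipy.stats import norm

def crps_ensemble(members, obs):
    """Sample-based CRPS for one forecast: E|X - y| - 0.5 * E|X - X'|.
    Lower is better; 0 corresponds to a perfect deterministic hit."""
    m = np.asarray(members, dtype=float)
    return np.mean(np.abs(m - obs)) - 0.5 * np.mean(np.abs(m[:, None] - m[None, :]))

def reliability_points(pred_quantiles, obs, levels):
    """For each nominal level tau, the observed coverage: the fraction of
    observations falling below the predicted tau-quantile.  Points on the
    diagonal (coverage == tau) indicate reliability; a curve above the
    diagonal means the predicted quantiles are too high, below it too low."""
    obs = np.asarray(obs)
    return [(tau, float(np.mean(obs <= pred_quantiles[:, i])))
            for i, tau in enumerate(levels)]

# Toy example: 1000 time steps with predictive quantiles from a known Gaussian.
rng = np.random.default_rng(1)
obs = rng.normal(0.0, 1.0, 1000)                    # toy "observations"
levels = [0.1, 0.25, 0.5, 0.75, 0.9]
pred_q = np.tile(norm.ppf(levels), (obs.size, 1))   # perfectly calibrated quantiles
print(reliability_points(pred_q, obs, levels))      # coverage ~ nominal levels
print(crps_ensemble(rng.normal(0.0, 1.0, 50), 0.3)) # CRPS of a 50-member ensemble
```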
- Drainage area thresholds
The authors provided scatter plots between drainage area and CRPS (CRPSS) in Figure 6. Two different drainage area thresholds (20,000 and 60,000 km2) were used to split the plots for CRPS and CRPSS, respectively. I was not sure how those thresholds were selected; they appear to have been chosen arbitrarily by the authors. Moreover, in the later Figures 7 and 8, only 60,000 km2 was used as the threshold, while in Figure 12, 20,000 km2 was used again. I cannot see a clear reason for switching between thresholds. Beyond that, the authors mention in several places that the statistics depend on the drainage area. I do not disagree that the patterns are non-random (see Figures 6 and 12), but how do the authors explain those patterns? The current descriptions merely report the appearance of the plots without a convincing explanation.
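As a side note, the CRPSS in Figures 6 and 12 is presumably the standard skill-score form, CRPSS = 1 - CRPS / CRPS_ref, so positive values indicate improvement over the reference forecast; the paper defines the reference, and this one-liner is only an illustrative assumption:

```python
def crpss(crps_model: float, crps_reference: float) -> float:
    """Continuous ranked probability skill score: 1 is perfect,
    0 means no skill over the reference, negative means worse."""
    return 1.0 - crps_model / crps_reference
```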
- Selection of typical sub-basins
The manuscript dedicates two sections (4.2.6 and 4.3.5) to a pilot analysis of two sub-basins. However, I cannot see a clear reason for those pilot analyses, nor any convincing argument that the two chosen sub-basins are “typical”; the definition of “typical” is never given. The authors need to state what such a pilot analysis demonstrates (i.e., why the two sub-basins deserve emphasis) and how it helps tell the story. Besides, please add the criteria for choosing the “typical” sub-basins to the methodology section.
- Composition of the discussion section
I found the current discussion section superficial, lacking in-depth explanations of some critical observations from the results section. For example, in section 5.2, the authors state, “In their study, the CMAL-LSTM model achieved the best model performance, which is why we chose it.” But what was used in this study is PLSTM, not the CMAL-LSTM of Klotz et al. (2022). Another example is the use of global vs. local models in section 5.3. The authors explain that they trained local models because of limited computational resources. I am not against training either one global model or several local models; that is simply the user's choice. But I found this content irrelevant to the science of this study. In my opinion, several observations from the results section are worth explaining. First, on hydrological model performance: why do the headwater and downstream catchments show worse performance than the other catchments (Figure 3)? Why is the gauge-adjusted GSMaP worse than the near-real-time PDIR? Second, on the relative performance of QRF and PLSTM: why is QRF better than PLSTM in the probabilistic evaluation, while the reverse holds in the deterministic evaluation? Third, on the dependency between the metrics and drainage area: as mentioned in the previous comment, the patterns are not random, so how should they be interpreted? Besides, I think it is too much to dedicate two sub-sections (sections 5.4 and 5.5) to future research directions. Consider merging them and making the writing concise. Discuss something valuable from your results rather than general facts from the literature.
- Presentation of materials and writing of the manuscript
I think the structure of the sections needs to be improved. Both the methodology and results sections reach three hierarchical levels, and some sub-sections have only one paragraph. There is clear room to make the structure more concise by limiting it to two hierarchical levels (see my detailed writing tips in the annotated manuscript). In addition, the writing of the manuscript needs significant improvement: I can identify grammatical issues and poorly structured sentences. Please pay specific attention to tense (past vs. present), the use of articles, plural vs. singular forms, and the rules for using acronyms.
- AC4: 'Reply on RC2', aizhong ye, 08 Feb 2023
RC3: 'Comment on hess-2022-377', Anonymous Referee #3, 08 Jan 2023
This is a well-written manuscript. This reviewer only has minor technical comments. See attached for details.
- AC3: 'Reply on RC3', aizhong ye, 08 Feb 2023
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 809 | 293 | 28 | 1,130 | 37 | 14 | 12 |