the Creative Commons Attribution 4.0 License.
A diversity centric strategy for the selection of spatio-temporal training data for LSTM-based streamflow forecasting
Abstract. Deep learning models are increasingly being applied to streamflow forecasting problems. Their success is in part attributed to the large and hydrologically diverse datasets on which they are trained. However, common data selection methods fail to explicitly account for the hydrological diversity contained within training data. In this research, clustering is used to characterise temporal and spatial diversity, in order to better understand the importance of hydrological diversity within regional training datasets. This study presents a novel, diversity-based resampling approach to creating hydrologically diverse datasets. First, the procedure is applied to undersample temporal data, showing that the amount of temporal data needed to train models can be halved without any loss in performance. Next, it is applied to reduce the number of basins in the training dataset. While basins cannot be omitted from training without some loss in performance, we show that hydrologically dissimilar basins are highly beneficial to model performance. This is shown empirically for Canadian basins: models trained to sets of basins separated by thousands of kilometres outperform models trained to localised clusters. We strongly recommend an approach to training data selection that encourages a broad representation of diverse hydrological processes.
Status: final response (author comments only)
RC1: 'Comment on hess-2024-169', Anonymous Referee #1, 16 Jul 2024
Review of "A diversity centric strategy for the selection of spatio-temporal training data for LSTM-based streamflow forecasting" by Everett Snieder and Usman T. Khan
This study investigates the impact of forming hydrologically diverse training datasets on model performance and generalization. It aims to quantify hydrological diversity using clustering and to evaluate the effects of adding similar and dissimilar basins to training datasets. The study successfully achieves its objective by demonstrating the importance of diverse training data in improving model performance and generalization.
The study's findings and discussions offer meaningful implications for future research and model training practices, making it a useful contribution to hydrology. The paper is well written, well structured and clear.
Main comments:
- The rationale for choosing the specific dynamic features is not entirely clear to me. I would appreciate it if the authors could elaborate slightly on this.
- In the methodology, the clustering of catchment static attributes and hydrological dynamics is done independently. Can the authors clarify the motivation for this? What would happen if this were done in one step?
- The authors mention that regulated catchments may behave differently than expected from the selected catchment attributes. How many of the catchments are regulated? Is there a catchment attribute that accounts for this and would it be worth adding such? In the authors' opinion, would it affect the equal sample size, which was one of the constraints for the clustering approach?
- Although the focus of the study is on LSTM-based streamflow forecasting, there is a body of literature on the value of data for calibration, regionalization and on model testing that I would consider relevant in this context. This starts, for example, with Klemes (KLEMEŠ, V. (1986). Operational testing of hydrological simulation models. Hydrological Sciences Journal, 31(1), 13–24. https://doi.org/10.1080/02626668609491024)
- Some of the figures have line widths that are too large for the plot size, which makes them difficult to read (Figure 3, Figure 4 (mainly upper panel), Figure 5). Consider making the ratio more readable.
I have a few minor and editorial comments, which I list line by line below:
L40 I would already here add the references for the mentioned CAMELS data sets
L48 References to these many studies? or say that they are mentioned below
L52 SOM is mentioned here for the first time; please introduce the full term
L56 Direct citation
L66 all available basins
L62-79 Here it would be good to see if different clustering approaches were used in the mentioned studies or if all used k-means and for which motivation
L109 basin label
L131 , -> .
L148 direct citation
Figure 2 caption add .
Figure 4 It would be helpful for reading the figure if the last sentence of the caption could be included in the plot itself, for instance by using facets.
Citation: https://doi.org/10.5194/hess-2024-169-RC1
AC1: 'Reply on RC1', Everett Snieder, 06 Sep 2024
Thank you for your encouraging comments and thoughtful feedback. We greatly appreciate the time and effort you took to share your perspectives with us. We hope that the detailed response provided below meets your expectations and addresses all of your concerns.
Main comments:
- The rationale for choosing the specific dynamic features is not entirely clear to me. I would appreciate it if the authors could elaborate slightly on this.
Assuming the dynamic features are those used for temporal clustering, the five features are streamflow, day of year of the sample (2 features), and change in flow (2 features). They were chosen based on expert knowledge and on features used in previous studies:
Day-of-year attributes allow the clustering model to distinguish between different seasonal periods (Abrahart et al., 2001), and
Sequential flows provide the model with the change in flow (Toth, 2009).
We will add these details in the revised version of the manuscript where we introduce the dynamic features for Exp 1a.
Abrahart, R. J., See, L., and Kneale, P. E.: Investigating the role of saliency analysis with a neural network rainfall-runoff model, Computers & Geosciences, 27, 921–928, https://doi.org/10.1016/S0098-3004(00)00131-X, 2001.
Toth, E.: Classification of hydro-meteorological conditions and multiple artificial neural networks for streamflow forecasting, Hydrology and Earth System Sciences, 13, 1555–1566, https://doi.org/10.5194/hess-13-1555-2009, 2009.
- In the methodology, the clustering of catchment static attributes and hydrological dynamics is done independently. Can the authors clarify the motivation for this? What would happen if this were done in one step?
The motivation for clustering basins and dynamic features independently was twofold: (1) it allowed them to be evaluated independently from one another, and (2) it ensured that an even number of samples would be drawn from each basin, which simplifies those of our experiments that compare varying quantities of training data from hydrologically dissimilar basins. A one-step unified approach would be very interesting and potentially useful in selecting training data, but we did not deem it necessary to meet our research objectives. We believe that a one-step clustering method would have a similar effect to clustering in series. We will include this motivation for the two-step approach in the revised manuscript.
- The authors mention that regulated catchments may behave differently than expected from the selected catchment attributes. How many of the catchments are regulated? Is there a catchment attribute that accounts for this and would it be worth adding such? In the authors' opinion, would it affect the equal sample size, which was one of the constraints for the clustering approach?
Unfortunately, there are no catchment attributes within HYSETS that indicate whether a catchment is regulated, and the number of regulated catchments is not known. While there is an increasing number of hydraulic infrastructure databases (Zhang and Gu, 2023; Mulligan et al., 2020), they are incomplete and lack key information such as year of construction. The effects of infrastructure on regional learning and spatial generalisation are a very interesting topic and worthy of their own study.
Filtering for regulated catchments would certainly reduce the number of available basins and potentially bias the geographical distributions of basins (e.g., Quebec and British Columbia have high numbers of regulated catchments). We can include a short discussion of this topic in the revised manuscript.
Mulligan, M., van Soesbergen, A., and Sáenz, L.: GOODD, a global dataset of more than 38,000 georeferenced dams, Sci Data, 7, 31, https://doi.org/10.1038/s41597-020-0362-5, 2020.
Zhang, A. T. and Gu, V. X.: Global Dam Tracker: A database of more than 35,000 dams with location, catchment, and attribute information, Sci Data, 10, 111, https://doi.org/10.1038/s41597-023-02008-2, 2023.
- Although the focus of the study is on LSTM-based streamflow forecasting, there is a body of literature on the value of data for calibration, regionalization and on model testing that I would consider relevant in this context. This starts, for example, with Klemes (KLEMEŠ, V. (1986). Operational testing of hydrological simulation models. Hydrological Sciences Journal, 31(1), 13–24. https://doi.org/10.1080/02626668609491024)
Thank you for the recommended article. Our submitted manuscript uses the split-sample approach described in Klemeš (1986), and we have added the citation in-text.
- Some of the figures have line widths that are too large for the plot size, which makes them difficult to read (Figure 3, Figure 4 (mainly upper panel), Figure 5). Consider making the ratio more readable.
We apologise for the lack of clarity in the figures and have reduced the line width in the plots. The updated figures with thinner lines will be included in the revised manuscript. See Figures RC1a, RC1b, and RC1c in the attached document.
I have a few minor and editorial comments, which I list line by line below:
L40 I would already here add the references for the mentioned CAMELS data sets
Thank you for this recommendation. We’ve replaced the mentions of CAMELS with a single reference and citation for Caravan, which includes all the CAMELS datasets.
L48 References to these many studies? or say that they are mentioned below
We apologise for the lack of clarity and have modified the text as follows:
“Many studies, which are reviewed below, have applied clustering to spatial and temporal data as a means to quantify hydrological diversity.”
L52 SOM is mentioned here for the first time; please introduce the full term
SOM refers to a self-organising map; we apologise for the omission and have updated the manuscript to reflect this change.
L56 Direct citation
We apologise for this error and have fixed the reference.
L66 all available basins
Thank you for identifying this error – we have made the correction.
L62-79 Here it would be good to see if different clustering approaches were used in the mentioned studies or if all used k-means and for which motivation
Thank you for this recommendation; we have added detail about the classification methods used in each study. The updated text is as follows:
“However, other studies have used clustering to estimate hydrological diversity, such that basin selection can explicitly account for hydrological diversity. These cases tend to use some form of clustering (either supervised or unsupervised) to quantify hydrological diversity within training data and the effects it has on model generalisation. Zhang et al. (2022) applied K-means clustering to a set of 35 mountainous basins in China based on hydroclimatic attributes, finding that a model trained to all available basins typically outperformed those trained to individual clusters. Hashemi et al. (2022) applied a similar approach by classifying basins into distinct hydrological regimes based on hydrometeorological thresholds. As in Zhang et al. (2022), their study compared locally and globally trained models, finding only minor differences in performance between the two. A common problem in comparing globally and locally trained models is that these comparisons typically do not control for sample size. As a result, the improved performance of the global model may partly reflect the regularisation effect of the larger sample size. In other words, deep learning models trained to small datasets may be overfitted and thus poorly generalised. Fang et al. (2022) account for this potential issue. Their study used an existing 'ecoregion' basin classification, defined by the United States Environmental Protection Agency, and evaluated the effects of additional training basins at three similarity intervals.”
L109 basin label
Thank you for identifying this error – we have made the correction.
L131 , -> .
Thank you for identifying this error – we have made the correction.
Figure 2 caption add .
Thank you for identifying this error – we have made the correction.
Figure 4 It would be helpful for reading the figure if the last sentence of the caption could be included in the plot itself, for instance by using facets.
Thank you for this recommendation – we’ve added the lead time to the subplot titles, which is shown in Figure RC1b.
-
RC2: 'Comment on hess-2024-169', Anonymous Referee #2, 16 Jul 2024
I agree with the authors that large-scale training of LSTM models can lead to performance trade-offs between different regions. However, a diverse training dataset is crucial for improving overall model performance. This study demonstrates that temporal undersampling does not compromise model accuracy and enhances model efficiency, while spatial undersampling results in only a marginal performance decrease. Surprisingly, the authors found that adding dissimilar basins leads to greater improvements than adding more similar basins. The paper provides a discussion on the underlying reasons for this observation. Overall, the paper is well-written and thoroughly demonstrated, and I recommend its publication with moderate revisions.
General Comments:
I suggest mentioning the computational time used for the CUS, RUS, and baseline model training to highlight the practical benefits this undersampling method can bring to modeling.
Detailed Comments:
Line 52: What does SOM represent?
Line 56: Please correct the citation format.
Line 103-104: Which forcing dataset is used here?
Line 127-132: How are these hyperparameters tuned? For a dataset with 2000 basins, the batch size seems small.
Line 183: Please clarify what the two day of year features are. It is unclear.
Line 234: Which method is used for clustering in experiment 2? Is it K-means?
Line 240: The configurations of experiment 2b are unclear to me. Please clarify.
Line 248: Can you provide the names/locations of the basins/gages for those unfamiliar with the labels of the basins?
Figure 2: How does CUS-Q work for multiple features, such as streamflow, gradient, and the two day of year features? I.e., how do you define the distance between samples with multiple features in K-means?
Figure 4: It is hard to see the solid black line in Figure 4.
Line 254-255: Please rephrase this sentence.
Line 277: What are the testing basins for all configurations? Are they all 128 basins? It seems the undersampling of basins would harm model performance to some extent, and the benefits of CUS_B are limited.
Line 306-307: This finding is really interesting. I wonder if this is because the cluster number (2) is too small, causing them to still share many similarities. It is worth studying with a larger number of clusters.
Line 317: I suggest moving all the discussion below to a separate discussion section.
Figure 10: This figure is hard to read and understand.
Line 319-321: Which boxes in Figure 10 correspond to these two cases?
Figure 10: The number of evaluation basins (4) is too small to represent the spatial generalizability of the model.
Citation: https://doi.org/10.5194/hess-2024-169-RC2
AC2: 'Reply on RC2', Everett Snieder, 06 Sep 2024
Thank you for your kind comments and valuable feedback. We hope that the comprehensive response below meets your expectations and thoroughly addresses any concerns you may have.
General Comments:
I suggest mentioning the computational time used for the CUS, RUS, and baseline model training to highlight the practical benefits this undersampling method can bring to modeling.
Thank you for this recommendation; we have added training times to the SI. The proposed undersampling methods offer practical benefits in terms of reduced computational time, as shown by the runtimes in Tables RC2a and RC2b in the attached supplement.
Detailed Comments:
Line 52: What does SOM represent?
SOM represents “Self-organising map”; we apologise for the omission and have updated the manuscript.
Line 56: Please correct the citation format.
Thank you for identifying this error – we have corrected the citation.
Line 103-104: Which forcing dataset is used here?
The forcing dataset is obtained from the HYSETS database (the Canadian portion of which is also included in Caravan). The text has been clarified as follows:
“Additional forcing data from the HYSETS database includes daily basin-averaged minimum temperature, maximum temperature, precipitation, and snow water equivalent (SWE).”
Line 127-132: How are these hyperparameters tuned? For a dataset with 2000 basins, the batch size seems small.
Hyperparameters were tuned on the training dataset and validated on an independent partition of 12 years (L115), which is independent of the test partition reported in the manuscript. An ad-hoc grid search (one hyperparameter modification at a time) was conducted using a model trained to 64 randomly sampled basins (rather than using the full dataset of 2500). Varying the batch size was not found to significantly impact model performance. We are happy to include this analysis in the revised manuscript, if deemed necessary. Some hyperparameter results are shown in Figure RC2a of the attached file.
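As an illustration of what a one-at-a-time search of this kind can look like, the sketch below varies a single hyperparameter around a fixed baseline while holding the others constant. It is not the authors' code: the hyperparameter names, the candidate values, and the function toy_validation_score (a stand-in for training a model and scoring it on the validation partition) are all hypothetical.

```python
def one_at_a_time_search(baseline, candidates, evaluate):
    """Vary one hyperparameter at a time around a fixed baseline configuration."""
    results = []
    for name, values in candidates.items():
        for value in values:
            # Copy the baseline and change only the hyperparameter under test.
            config = {**baseline, name: value}
            results.append((name, value, evaluate(config)))
    return results


def toy_validation_score(config):
    """Hypothetical stand-in for a validation score (higher is better)."""
    return 1.0 - abs(config["hidden_size"] - 128) / 256


# Hypothetical baseline and candidate values, for illustration only.
baseline = {"hidden_size": 64, "batch_size": 32}
candidates = {"hidden_size": [32, 128, 256], "batch_size": [16, 64, 128]}

results = one_at_a_time_search(baseline, candidates, toy_validation_score)
best = max(results, key=lambda r: r[2])
print(best)
```

Because only one setting changes per run, the number of runs grows with the sum of the candidate lists rather than their product, at the cost of missing interactions between hyperparameters.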
Line 183: Please clarify what the two day of year features are. It is unclear.
We have rewritten the day of year feature description and corrected a small error in it. The day of year features are calculated using the day of year (from 1 (Jan. 1) to 365 (Dec. 31)) and are used to allow the model to distinguish flows occurring in different seasons. Both sin and cos transformations are applied to the day of the year to create continuity across each new year (from Dec. 31 to Jan. 1). Mathematically, these features are calculated as $\sin(2\pi \cdot \mathrm{DOY}/365)$ and $\cos(2\pi \cdot \mathrm{DOY}/365)$, where DOY is the day of year.
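A minimal sketch of this kind of cyclic encoding follows, assuming the standard form in which the day of year is mapped to the angle 2π·DOY/365 on the unit circle. The function names are illustrative and leap years are ignored.

```python
import math


def doy_features(doy, days_in_year=365):
    """Map a day of year (1..days_in_year) to a (sin, cos) pair on the unit circle."""
    angle = 2.0 * math.pi * doy / days_in_year
    return math.sin(angle), math.cos(angle)


def feature_distance(doy_a, doy_b):
    """Euclidean distance between two days in the (sin, cos) feature space."""
    sin_a, cos_a = doy_features(doy_a)
    sin_b, cos_b = doy_features(doy_b)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)


# Dec. 31 and Jan. 1 are neighbours in feature space even though the raw
# day numbers differ by 364; days half a year apart are far from each other.
print(feature_distance(365, 1))
print(feature_distance(1, 183))
```

Using both the sine and the cosine is what makes the encoding unambiguous: either one alone assigns the same value to two different days of the year.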
Line 234: Which method is used for clustering in experiment 2? Is it K-means?
Yes, K-means was used (Constrained K-means variant). We have added a reference to the corresponding description in the methods section.
“In experiment 2a, basins are divided into two clusters (which are referred to as C0 and C1), using the K-means method described in Sec. 2.5.”
Line 240: The configurations of experiment 2b are unclear to me. Please clarify.
We apologise for the lack of clarity. The objective of this experiment is to determine whether temporal and spatial undersampling can be used in series to further reduce the computational requirements of training. The configuration of experiment 2b mirrors that of experiment 2a, but cluster-based temporal undersampling is applied to the training dataset of each model, such that the models are trained to half the volume of data. The description of experiment 2b has been updated as follows:
“Next in experiment 2b, experiment 2a is repeated, but with cluster-based streamflow undersampling applied to the training dataset. The resulting models are trained to half the amount of training data as those in 2a.”
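The reply does not spell out the sampling rule behind cluster-based undersampling, so the following is only one plausible reading, not the authors' procedure: given a cluster label for each sample, samples are drawn in turn from each cluster until the target fraction is reached, so that small clusters are kept whole before large clusters are thinned. The function name and the round-robin rule are assumptions.

```python
import random
from collections import defaultdict


def cluster_undersample(labels, keep_fraction=0.5, seed=0):
    """Return sorted indices of retained samples, drawn evenly across clusters."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, label in enumerate(labels):
        by_cluster[label].append(index)

    # Shuffle each cluster so that the samples kept within it are random.
    pools = {label: rng.sample(members, len(members))
             for label, members in by_cluster.items()}

    target = min(len(labels), int(len(labels) * keep_fraction))
    kept = []
    # Round-robin over clusters: each pass takes at most one sample per cluster.
    while len(kept) < target:
        for label in sorted(pools):
            if pools[label] and len(kept) < target:
                kept.append(pools[label].pop())
    return sorted(kept)


# Toy labels: 90 samples in a common cluster (0) and 10 in a rare cluster (1).
labels = [0] * 90 + [1] * 10
kept = cluster_undersample(labels, keep_fraction=0.5)
rare_kept = sum(1 for i in kept if labels[i] == 1)
print(len(kept), rare_kept)
```

In this toy case random undersampling would be expected to keep about half of the rare samples, whereas the round-robin rule retains all of them, which is the sense in which such a scheme favours diversity.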
Line 248: Can you provide the names/locations of the basins/gages for those unfamiliar with the labels of the basins?
Thank you for this recommendation. We have added the names and provinces to the first mentions of each basin code, and a reference to the HYDAT database where the codes can be queried. For example, the updated figure 1 caption is as follows:
“Left column, from top to bottom: unclustered (a), clustered (c), and undersampled streamflow (e) for basin 01AD003 (HYDAT ID; located along St. Francis River in New Brunswick).”
Figure 2: How does CUS-Q work for multiple features, such as streamflow, gradient, and the two day of year features? I.e., how do you define the distance between samples with multiple features in K-means?
The CUS-Q feature set includes all of the features described on lines 181-184. Since CUS-Q is applied to basins individually, this results in a matrix with a shape of 4380 (samples) by 5 (clustering features). The standard Euclidean distance metric is used in K-means.
“The engineered feature set includes streamflow ($q_t$), two streamflow gradient features (given as $(q_{t-3} - q_t)/3$ and $q_{t-1} - q_t$), and two day of year features (given as $\sin(2\pi \cdot \mathrm{DOY}/365)$ and $\cos(2\pi \cdot \mathrm{DOY}/365)$, where DOY is the day of year {1,...,365}). The sin and cos functions ensure that the day of year features change continuously from December to January. Features such as the streamflow gradient encourage the representation of rising and falling limbs within the clusters, which would be indistinguishable based solely on streamflow value.”
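To make the shape of this feature matrix concrete, the sketch below builds one five-element row per time step from a daily flow series and measures the Euclidean distance between two rows. It is an illustration under stated assumptions: the function names and the toy flow values are hypothetical, the day of year angle is taken as 2π·DOY/365, leap years are ignored, and the first three time steps are dropped because the three-day gradient is undefined there.

```python
import math


def cus_q_features(flow, start_doy=1, days_in_year=365):
    """Build rows of [q_t, (q_{t-3} - q_t)/3, q_{t-1} - q_t, sin(DOY), cos(DOY)]."""
    rows = []
    for t in range(3, len(flow)):
        # Day of year of sample t, assuming flow[0] falls on start_doy.
        doy = (start_doy - 1 + t) % days_in_year + 1
        angle = 2.0 * math.pi * doy / days_in_year
        rows.append([
            flow[t],
            (flow[t - 3] - flow[t]) / 3.0,
            flow[t - 1] - flow[t],
            math.sin(angle),
            math.cos(angle),
        ])
    return rows


def euclidean(a, b):
    """Standard Euclidean distance between two feature rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy hydrograph: a rising limb followed by a recession.
flow = [1.0, 1.2, 2.5, 4.0, 3.1, 2.4, 2.0]
features = cus_q_features(flow)
print(len(features), len(features[0]))
print(euclidean(features[0], features[1]))
```

With the sign convention quoted above, both gradient features are negative on a rising limb and positive on a falling limb. Whether the features were rescaled before clustering is not stated in the reply; with unscaled inputs the Euclidean distance would be dominated by whichever feature has the largest numerical range.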
Figure 4: It is hard to see the solid black line in Figure 4.
We apologise for the lack of clarity in the figure and have adjusted the line colours and styles in the revised manuscript. The updated figure is included as Figure RC2b in the attached supplement.
Line 254-255: Please rephrase this sentence.
We apologise for the unclear wording and have rephrased the sentence as follows:
“The performance of models trained on a set of 64 randomly sampled basins is shown in Fig. 4 in terms of NSE (a-c) and PI (d-f) for three cases: without resampling, with cluster-based temporal undersampling CUS_Q, and random temporal undersampling.”
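For readers unfamiliar with the two metrics, the sketch below gives their usual definitions. It assumes that NSE is the Nash-Sutcliffe efficiency and that PI denotes the persistence index, which compares a forecast against a naive forecast that repeats the last observed flow; the discussion itself does not expand either abbreviation, and the function names are illustrative.

```python
def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus squared error relative to the observed mean."""
    mean_obs = sum(obs) / len(obs)
    error = sum((o - s) ** 2 for o, s in zip(obs, sim))
    variance = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - error / variance


def persistence_index(obs, sim, lead=1):
    """1 minus squared error relative to a naive forecast of obs[t - lead]."""
    steps = range(lead, len(obs))
    error = sum((obs[t] - sim[t]) ** 2 for t in steps)
    naive_error = sum((obs[t] - obs[t - lead]) ** 2 for t in steps)
    return 1.0 - error / naive_error


obs = [1.0, 2.0, 4.0, 3.0, 2.0]
naive = [1.0, 1.0, 2.0, 4.0, 3.0]  # yesterday's observed flow used as today's forecast

print(nse(obs, obs), persistence_index(obs, obs))
print(persistence_index(obs, naive))
```

Both scores equal 1 for a perfect forecast. A persistence index of 0 means the model is no better than repeating the last observation, which is a demanding baseline at short lead times.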
Line 277: What are the testing basins for all configurations? Are they all 128 basins? It seems the undersampling of basins would harm model performance to some extent, and the benefits of CUS_B are limited.
Correct, models are evaluated on all 128 basins. The purpose of this experiment is not to achieve better performance than the baseline model, it is to determine the extent to which models trained to a subset of basins (identified using spatial clustering) can generalise to the entire set.
Line 306-307: This finding is really interesting. I wonder if this is because the cluster number (2) is too small, causing them to still share many similarities. It is worth studying with a larger number of clusters.
We agree that a larger number of clusters would be better; however, larger numbers of clusters result in clusters that do not contain enough basins for validation. A workaround would be to specify a minimum cluster size constraint in the K-means clustering algorithm; however, that results in clusters with low cohesion, as quantified using the silhouette score. The relationship between the minimum cluster size and the average silhouette score is shown in Figure S1 in the Supplementary Information. This figure is copied in the attached supplement as Figure RC2c.
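As background on the cohesion measure mentioned here, the sketch below computes the mean silhouette score for a labelled set of points using the standard definition with Euclidean distance. It is a plain illustration rather than the authors' analysis: the function name and the toy points are hypothetical, at least two clusters are assumed, and singleton clusters are scored as 0 by convention.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette score; values near 1 indicate compact, well-separated clusters."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
cohesive = mean_silhouette(points, [0, 0, 1, 1])  # clusters follow the two groups
forced = mean_silhouette(points, [0, 1, 0, 1])    # clusters cut across the groups
print(cohesive, forced)
```

The second labelling mimics what a size constraint can do: when basins are forced into clusters that cut across the natural groups, the score drops and can become negative.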
Line 317: I suggest moving all the discussion below to a separate discussion section.
Thank you for this recommendation - we will add a separate discussion subsection (Sec. 3.3.1) in the revised manuscript.
Figure 10: This figure is hard to read and understand.
We apologise for the lack of clarity. In our revised manuscript, we have split the subplots in half, so that each ‘experiment’ and metric has its own subplot. We have also added the experiment name to the subplot titles, with the hope that it improves the interpretability. See example in Figure RC2d.
Line 319-321: Which boxes in Figure 10 correspond to these two cases?
The result is true for both experiments. We have adjusted the text to reference the modified figure, in which experiments are contained in distinct subplots to improve clarity.
Figure 10: The number of evaluation basins (4) is too small to represent the spatial generalizability of the model.
We agree that the sample size is a limitation of this experiment and have rerun the experiment to include another set of 4 basins. Since it’s difficult to aggregate the figure included in the submitted manuscript, the new figures are shown in the attached supplement (Figures RC2e and RC2f) and will be included with the SI.