This work is distributed under the Creative Commons Attribution 4.0 License.
Semi-supervised learning approach to improve the predictability of data-driven rainfall-runoff model in hydrological data-sparse regions
Abstract. Numerous data-driven models have been introduced to establish reliable predictions of the rainfall-runoff relationship. The majority of these models are trained using a supervised learning (SL) approach, with paired observed samples of climate and streamflow data. However, in practice, the availability of such paired observations is often constrained due to sparse data from streamflow gauges worldwide, which typically cover only a few years. This limited number of paired samples can significantly impede the learning ability of the data-driven model. The semi-supervised learning approach, an emerging machine learning paradigm that additionally incorporates unpaired samples, has the potential to be a highly effective method for modeling rainfall-runoff relationships. In this study, we present a novel semi-supervised learning-based framework for rainfall-runoff modeling. Our framework introduces a unique loss function designed to handle two distinct types of samples, namely paired and unpaired samples, effectively during the training process. To validate the effectiveness of the proposed framework, we conducted an extensive set of experiments employing a diverse range of designs, all of which utilized the LSTM network. The experiments are based on 531 basins from the freely available CAMELS dataset, which spans the contiguous United States. Results indicate that the proposed framework shows significantly enhanced performance compared to the baseline models. Results also show that the framework can serve as a viable alternative to previously developed fully supervised approaches. Lastly, we address potential avenues for enhancing the model and provide an outline of our future research plans in this domain.
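To make the abstract's central idea concrete, the following is a minimal sketch, not the authors' released code, of a composite loss that treats paired (observed) and unpaired (pseudo-labeled) samples separately during training. It assumes a PyTorch setup; the weighting term `lam` and all names are hypothetical.

```python
# Minimal sketch (assumed PyTorch setup, not the paper's actual implementation)
# of a loss that combines a supervised term on paired climate-streamflow samples
# with a pseudo-label term on unpaired samples.
import torch
import torch.nn as nn

mse = nn.MSELoss()

def semi_supervised_loss(pred_paired, obs_paired,
                         pred_unpaired, pseudo_unpaired, lam=0.5):
    """Weighted sum of a supervised loss on observed streamflow and a
    pseudo-label loss on unpaired samples; `lam` is a hypothetical weight."""
    supervised = mse(pred_paired, obs_paired)        # paired (observed) samples
    pseudo = mse(pred_unpaired, pseudo_unpaired)     # unpaired samples with pseudo labels
    return supervised + lam * pseudo

if __name__ == "__main__":
    # Toy usage with random tensors standing in for model outputs and targets.
    torch.manual_seed(0)
    loss = semi_supervised_loss(torch.randn(32, 1), torch.randn(32, 1),
                                torch.randn(64, 1), torch.randn(64, 1))
    print(float(loss))
```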
Status: closed
RC1: 'Comment on hess-2023-148', Anonymous Referee #1, 03 Oct 2023
General comments:
This research presents an LSTM-based semi-supervised framework that leverages self-training to model the rainfall-runoff relationship across U.S. regions. The study emphasizes simulations under data-scarce conditions and makes greater use of extended meteorological data sequences. However, the logic of the experimental design is not clearly detailed, and the framework's effectiveness could benefit from a more comprehensive evaluation. Claims of "significantly enhanced performance" are open to debate and require additional supporting information.
Specific comments:
- The title of the article reads "Semi-supervised learning approach", yet the core of the methods is fundamentally rooted in self-training. This process involves generating pseudo-labels with the model, which are then utilized as new training data to further optimize the model. A revision to offer a more precise description is recommended.
- The narrative in the methods section is somewhat disorganized. It would be beneficial to restructure the sections pertaining to the model and experimental design for greater clarity.
- Regarding lines 223-226: Could you elaborate on what is meant by the "pre-trained model" and how it was obtained? This aspect may have implications for the subsequent performance of idv-LSTM and rgn-LSTM.
- Lines 285-287: Please provide a brief rationale for the choice "Ψ takes on values of 10, 30, and 50".
- Lines 289-298: The current explanation seems somewhat ambiguous. It might be helpful to provide a visual representation of the framework and further clarify the model's approach to handling both labeled and unlabeled data during its training and validation phases, inclusive of the pre-trained model.
- Lines 318-321: The conclusion mentions a baseline model, but the related information is insufficient, which makes it hard to judge the resulting evaluation.
- Lines 358-368: These lines mention the use of CAMELS-GB as a source model. It should be noted that a transferred source model tends to capture more variety than local data. This is feasible when the number of regional basins is between 30 and 150, but the significance might wane when considering 600 or more basins.
- As for Figure 7 and its associated descriptions: The enhancements seen in the regional model fluctuate, with the median improvement oscillating between 0.010 and 0.027. Considering the dataset's size and spatial distribution differences, one could question the framework's generalizability, especially when the difference between the single-year and multi-year teacher models amounts to a mere 1 and 3 years.
- In the results section, the bulk of the evaluations and the resulting findings hinge on differences in the NSE metric to distinguish between models (the standard NSE definition is recalled after this list). Such assessments can offer a skewed perspective: an increase of 0.1 has different implications when starting from an NSE of 0.3 versus 0.8. Without a thorough evaluation of basin selection (accounting for spatial variation) and basin count, the framework's improvements may be more limited than claimed.
- The figure captions, including that of Figure 5, lack essential descriptive content. For instance, the number of basins in the dataset is often left unspecified.
- Figure 2 could benefit from a more detailed depiction of the framework.
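For reference, the Nash-Sutcliffe efficiency (NSE) that these comparisons rely on is conventionally defined as below. Because the residual error variance scales with 1 - NSE, a gain of 0.1 from an NSE of 0.8 halves the remaining error variance, whereas the same gain from 0.3 removes only about one seventh of it, which underlines the reviewer's point about interpreting NSE differences.

```latex
% Standard Nash-Sutcliffe efficiency for observed (Q_obs) and simulated (Q_sim)
% streamflow over T time steps; \overline{Q}_{obs} is the mean of the observations.
\[
\mathrm{NSE} = 1 - \frac{\sum_{t=1}^{T}\bigl(Q_{\mathrm{obs},t} - Q_{\mathrm{sim},t}\bigr)^{2}}
                        {\sum_{t=1}^{T}\bigl(Q_{\mathrm{obs},t} - \overline{Q}_{\mathrm{obs}}\bigr)^{2}}
\]
```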
Citation: https://doi.org/10.5194/hess-2023-148-RC1
AC1: 'Reply on RC1', Kuk-Hyun Ahn, 09 Oct 2023
RC2: 'Comment on hess-2023-148', Anonymous Referee #2, 23 Oct 2023
This paper introduces a method in which the authors pretrain an LSTM model on a source dataset and then use that model to make predictions for a new basin (pseudo labels), which are used in the loss function along with new observations in the new region. This method was compared against previously published transfer learning methods, and the authors argue that their approach is better. Overall, it is of some interest to me to see how well their method works, as such teacher-student models have been widely used in AI but not extensively evaluated in hydrology. Unfortunately, however, the execution of the paper leaves much to be desired. Thus, if they can address the following questions, I could review it again.
0. It is not clear to me why this method is called semi-supervised learning. Fundamentally, the source model is trained with supervised learning, and the new model is also trained using supervised learning. While some similar work in AI that used a student-teacher paradigm has described itself as such, those studies typically used much more unlabeled data and the structures within it. Semi-supervised learning typically involves leveraging a small amount of labeled data alongside a larger pool of unlabeled data to improve learning. In this setup, the pseudo labels generated for the new site's data act as a mechanism to utilize unlabeled data. However, in the case of this paper, it is not clear to me what information is leveraged from the target domain's data, especially the other unlabeled data points.
1a. The biggest technical problem, as I see it, is that I cannot independently verify that their baseline performance is state-of-the-art. They also lack comparable results against any other studies. Hence I cannot tell whether the benefits are truly as claimed. Since their primary target of comparison is transfer learning (Ma et al., 2021), they should try to compare with that paper. If they cannot, another paper that works on CAMELS and may be comparable is Feng et al. (2021, doi: 10.1029/2021GL092999).
1b. There should be at least two alternative approaches: (i) directly training on all the training data and simply making forward runs on the test basins; (ii) transfer learning. It is not clear to me whether their named experiments (rgn-LSTM) use either of these two, or are just their student-teacher approach.
1c. It should be acknowledged that transfer learning can use different input items across source and target regions, even different amounts of inputs. The authors' approach cannot allow this (without additional changes).
2. The experimental design and comparisons are very confusing and difficult to remember. There are so many different versions, experiments, and acronyms that it is torture to ask readers or reviewers to remember them all. You should more clearly present the core comparison, making it really easy to see the benefits, and then expand on additional experiments. You could also consider removing some unimportant experiments.
3. When you apply the student-teacher paradigm, the source dataset should be large and diverse. The UK does not have very diverse geography. It would make more sense to use CAMELS USA as the source data. You may see different comparisons that way.
4. The organization of the paper is very poor. You have to dig in carefully and read all the way to Section 2.3 to get the main idea: "... the pseudo labels for unlabeled dataset are generated from a pre-trained teacher model trained on labeled dataset. Student model is trained in supervised manner on both the labeled and (pseudo label assigned) unlabeled datasets." This should be made clear in the abstract, and related work should be mentioned there as well.
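To illustrate the teacher-student pattern quoted above, here is a minimal, hypothetical sketch in PyTorch (not the paper's actual code): a frozen pre-trained teacher LSTM generates pseudo labels for unlabeled forcing sequences, and the student is then trained on observed and pseudo-labeled data together. All class and variable names are assumptions for illustration.

```python
# Illustrative teacher-student (self-training) sketch; names and shapes are
# hypothetical and not taken from the paper's implementation.
import torch
import torch.nn as nn

class RunoffLSTM(nn.Module):
    def __init__(self, n_inputs=5, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])        # one-step runoff prediction

teacher = RunoffLSTM()                         # assume weights loaded from a source-domain model
student = RunoffLSTM()

x_labeled = torch.randn(16, 30, 5)             # forcings with observed streamflow
y_labeled = torch.randn(16, 1)
x_unlabeled = torch.randn(48, 30, 5)           # forcings without observations

with torch.no_grad():                          # teacher is frozen while generating pseudo labels
    y_pseudo = teacher(x_unlabeled)

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
mse = nn.MSELoss()
for _ in range(5):                             # a few toy epochs
    opt.zero_grad()
    loss = mse(student(x_labeled), y_labeled) + mse(student(x_unlabeled), y_pseudo)
    loss.backward()
    opt.step()
```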
Some minor points.
1. Earlier papers (Gauch et al., 2021, as cited, and Fang et al., 2022, doi: 10.1029/2021WR029583) have already examined how to form the training dataset. The general conclusion is that one should use all of the training data, and the larger and more diverse the training data, the better. Hence, some of the sentences motivating the paper need to be revised.
2. Their introduction should clearly direct the readers to understand why the unlabeled data are useful.
Citation: https://doi.org/10.5194/hess-2023-148-RC2
AC2: 'Reply on RC2', Kuk-Hyun Ahn, 26 Oct 2023
AC3: 'Reply on RC2', Kuk-Hyun Ahn, 26 Oct 2023
Publisher’s note: this comment is a copy of AC2 and its content was therefore removed.
Citation: https://doi.org/10.5194/hess-2023-148-AC3
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 536 | 180 | 43 | 759 | 53 | 25 | 30 |