Sensitivity of hydrological machine learning prediction accuracy to information quantity and quality

Jeung, Minhyuk; Her, Younggu; Baek, Sang-Soo; Yoon, Kwangsik

doi:https://doi.org/10.5194/hess-2024-284

Preprints

07 Oct 2024

Status: this preprint was under review for the journal HESS but the revision was not accepted.

Abstract. Machine learning (ML) is now commonly employed as a tool for hydrological prediction due to recent advances in computing resources and increases in data volume. The prediction accuracy of ML (or data-driven) modeling is known to be improved through training with additional data; however, the improvement mechanism needs to be better understood and documented. This study explores the connection between the amount of information contained in the data used to train an ML model and the model’s prediction accuracy. The amount of information was quantified using Shannon’s information theory, including marginal and transfer entropy. Three ML models were trained to predict the flow discharge, sediment, total nitrogen, and total phosphorus loads of four watersheds. The amount of information contained in the training data was increased by sequentially adding weather data and the simulation outputs of uncalibrated and/or calibrated mechanistic (or theory-driven) models. The reliability of training data was considered a surrogate of information quality, and accuracy statistics were used to measure the quality (or reliability) of the uncalibrated and calibrated theory-driven modeling outputs to be provided as training data for ML modeling. The results demonstrated that the prediction accuracy of hydrological ML modeling depends on the quality and quantity of information contained in the training data. The use of all types of training data provided the best hydrological ML prediction accuracy. ML models trained only with weather data and calibrated theory-driven modeling outputs could most efficiently improve accuracy in terms of information use. This study thus illustrates how a theory-driven approach can help improve the accuracy of data-driven modeling by providing quality information about a system of interest.
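The two information measures named in the abstract, marginal entropy and transfer entropy, can be illustrated with a minimal histogram-based sketch. This is not the authors' implementation: the equal-width binning, the bin count of 8, the lag of one time step, and the function names `marginal_entropy` and `transfer_entropy` are all assumptions made for illustration, since this page does not state how the estimates were computed.

```python
import numpy as np


def marginal_entropy(x, bins=8):
    """Shannon entropy H(X) in bits, estimated from an equal-width histogram."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def _joint_entropy(*cols):
    """Entropy in bits of the joint distribution of integer-coded columns."""
    stacked = np.column_stack(cols)
    _, counts = np.unique(stacked, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def _discretize(v, bins):
    """Map each value to an equal-width bin index in 0..bins-1."""
    edges = np.histogram_bin_edges(v, bins=bins)
    return np.digitize(v, edges[1:-1])


def transfer_entropy(source, target, bins=8):
    """Transfer entropy TE(source -> target) in bits, with an assumed lag of 1.

    TE = H(Y_next | Y_now) - H(Y_next | Y_now, X_now)
       = H(Y_next, Y_now) - H(Y_now)
         - H(Y_next, Y_now, X_now) + H(Y_now, X_now)
    """
    x = _discretize(np.asarray(source, dtype=float), bins)
    y = _discretize(np.asarray(target, dtype=float), bins)
    y_next, y_now, x_now = y[1:], y[:-1], x[:-1]
    return (
        _joint_entropy(y_next, y_now)
        - _joint_entropy(y_now)
        - _joint_entropy(y_next, y_now, x_now)
        + _joint_entropy(y_now, x_now)
    )


if __name__ == "__main__":
    # Synthetic stand-ins (not data from the study): a "rainfall" series
    # that drives a "discharge" series one step later.
    rng = np.random.default_rng(0)
    rain = rng.normal(size=2000)
    flow = np.zeros(2000)
    flow[1:] = rain[:-1] + 0.1 * rng.normal(size=1999)
    print("H(rain)        =", marginal_entropy(rain))
    print("TE(rain->flow) =", transfer_entropy(rain, flow))
    print("TE(flow->rain) =", transfer_entropy(flow, rain))
```

In this synthetic case the transfer entropy from the driving series to the driven series should come out well above the reverse direction. Plug-in histogram estimates are biased upward for short records and fine binning, so comparisons across inputs are more meaningful than the absolute values.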

Received: 10 Sep 2024 – Discussion started: 07 Oct 2024

Status: closed

RC1:
'Comment on hess-2024-284', Anonymous Referee #1, 03 Nov 2024
The manuscript entitled "Sensitivity of hydrological machine learning prediction accuracy to information quantity and quality" presents a valuable discussion about the influence of information quantity and quality on the performance of machine-learning-based (ML) models for hydrological prediction.
Below are some points regarding its methodology, results, and potential areas for improvement:
It is quite trivial that calibrated models can offer training samples with high quality and thus help machine learning models achieve significant performance improvement. Could you please further clarify which key scientific findings/insights can be offered by this study?

Figure 1 classifies Random Forest (RF) and Support Vector Machine (SVM) as clustering methods and Artificial Neural Network (ANN) as a neural network method. What are the essential differences between the two categories of ML models, and will such differences influence the subsequent discussion?

For Sect. 2.2, the input variables of machine learning models are not clear. It might need further explanation about the setting-up process of machine learning models.

Line 151: why is the threshold correlation arbitrarily selected as 0.30?

Figure 4 uses 3D plotting, which might make comparison between different cases and models difficult. Could you please use a 2D figure with legends instead?
Citation: https://doi.org/10.5194/hess-2024-284-RC1
- AC1: 'Reply on RC1', Minhyuk Jeung, 22 Dec 2024
  
  Dear reviewer and editor,
  We are deeply thankful for your thorough and insightful comments. The manuscript has been revised in accordance with the valuable comments and suggestions of the reviewers.
  Please find my detailed review attached.
  Kind regards,
  
  Citation: https://doi.org/10.5194/hess-2024-284-AC1
RC2:
'Comment on hess-2024-284', Anonymous Referee #2, 24 Nov 2024

Dear authors and editor,
Thank you for the opportunity to review this manuscript. Please find my detailed review attached.
Warmly,
RC2

Citation: https://doi.org/10.5194/hess-2024-284-RC2
- AC2: 'Reply on RC2', Minhyuk Jeung, 22 Dec 2024
  
  Dear reviewer and editor,
  We are deeply thankful for your thorough and insightful comments. The manuscript has been revised in accordance with the valuable comments and suggestions of the reviewers.
  Please find my detailed review attached.
  Kind regards,
  
  Citation: https://doi.org/10.5194/hess-2024-284-AC2

Supplement

https://doi.org/10.5194/hess-2024-284-supplement

Viewed

Total article views: 1,252 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
782	157	313	1,252	51	33	46

Views and downloads (calculated since 07 Oct 2024)

Month	HTML	PDF	XML	Total
Oct 2024	127	16	4	147
Nov 2024	59	22	32	113
Dec 2024	52	13	51	116
Jan 2025	19	9	24	52
Feb 2025	12	7	27	46
Mar 2025	22	10	57	89
Apr 2025	21	12	48	81
May 2025	24	4	46	74
Jun 2025	19	13	21	53
Jul 2025	17	7	0	24
Aug 2025	75	9	2	86
Sep 2025	314	20	1	335
Oct 2025	21	15	0	36

Viewed (geographical distribution)

Total article views: 1,234 (including HTML, PDF, and XML), thereof 1,234 with geography defined and 0 with unknown origin.

Latest update: 22 Oct 2025

Short summary

Machine learning (ML) techniques have become widely used due to the availability of large data repositories and advancements in computing resources and methods. Our study explored the connection between a model’s accuracy and the information content of input data. Results showed that the accuracy of three ML models significantly improved when high-quality input data were included. These findings highlight the importance of data quality in ML model training.

