14 Jun 2024
Status: this preprint is currently under review for the journal HESS.

A diversity centric strategy for the selection of spatio-temporal training data for LSTM-based streamflow forecasting

Everett Snieder and Usman T. Khan

Abstract. Deep learning models are increasingly being applied to streamflow forecasting problems. Their success is attributed in part to the large and hydrologically diverse datasets on which they are trained. However, common data selection methods fail to account explicitly for the hydrological diversity contained within training data. In this research, clustering is used to characterise temporal and spatial diversity in order to better understand the importance of hydrological diversity within regional training datasets. This study presents a novel, diversity-based resampling approach for creating hydrologically diverse datasets. First, the approach is used to undersample temporal data, showing that the amount of temporal data needed to train models can be halved without any loss in performance. Next, it is applied to reduce the number of basins in the training dataset. While basins cannot be omitted from training without some loss in performance, we show that hydrologically dissimilar basins are highly beneficial to model performance. This is shown empirically for Canadian basins: models trained on sets of basins separated by thousands of kilometres outperform models trained on localised clusters. We strongly recommend an approach to training data selection that encourages a broad representation of diverse hydrological processes.
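The abstract does not state which clustering algorithm or sampling rule is used, so the following is only a minimal sketch of one way a cluster-based undersampling step could look, not the authors' procedure. It assumes cluster labels have already been assigned to each training sample by some clustering method; the function name `cluster_balanced_undersample`, the round-robin rule, and the toy labels are all hypothetical choices made for illustration.

```python
import numpy as np


def cluster_balanced_undersample(labels, n_keep, seed=0):
    """Pick n_keep sample indices spread as evenly as possible across clusters.

    Illustrative assumption only: `labels` holds one cluster id per training
    sample, produced beforehand by any clustering method.
    """
    labels = np.asarray(labels)
    if not 0 < n_keep <= labels.size:
        raise ValueError("n_keep must be between 1 and the number of samples")

    rng = np.random.default_rng(seed)
    # Shuffle each cluster's member indices once, so that taking a prefix
    # of the shuffled pool is equivalent to a random draw without replacement.
    pools = [rng.permutation(np.flatnonzero(labels == c)) for c in np.unique(labels)]

    taken = [0] * len(pools)
    remaining = n_keep
    # Round-robin: each pass takes one more sample from every cluster that
    # still has members left, so small clusters are not crowded out.
    while remaining > 0:
        for i, pool in enumerate(pools):
            if remaining == 0:
                break
            if taken[i] < len(pool):
                taken[i] += 1
                remaining -= 1

    keep = np.concatenate([pool[:k] for pool, k in zip(pools, taken)])
    return np.sort(keep)


# Toy example: halve a dataset in which one cluster dominates.
labels = np.array([0] * 8 + [1] * 2)
keep = cluster_balanced_undersample(labels, n_keep=5)
print(np.bincount(labels[keep]))
```

In this toy case the rare cluster keeps both of its samples while the dominant cluster is thinned, which is the general effect a diversity-oriented undersampler aims for. A real application would need a feature representation and clustering step appropriate to the hydrometeorological data.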

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Status: open (until 09 Aug 2024)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on hess-2024-169', Anonymous Referee #1, 16 Jul 2024
  • RC2: 'Comment on hess-2024-169', Anonymous Referee #2, 16 Jul 2024


Total article views: 286 (including HTML, PDF, and XML)
  • HTML: 198
  • PDF: 72
  • XML: 16
  • Total: 286
  • Supplement: 25
  • BibTeX: 10
  • EndNote: 10
Latest update: 18 Jul 2024
Short summary
Improving the accuracy of flood forecasts is paramount to minimising flood damage. Machine-learning models are increasingly being applied for flood forecasting. Such models are typically trained on large historic hydrometeorological datasets. In this work, we evaluate methods for selecting training datasets that maximise the spatiotemporal diversity of the represented hydrological processes. Empirical results showcase the importance of hydrological diversity in training ML models.