
Neural networks in catchment hydrology: a comparative study of different algorithms in an ensemble of ungauged basins in Germany
Max Weißenborn
Lutz Breuer
Tobias Houska
This study presents a comparative analysis of different neural network models, including convolutional neural networks (CNN), long short-term memory (LSTM) and gated recurrent units (GRUs), with regard to predicting discharge within ungauged basins in Hesse, Germany. All models were trained on 54 catchments with 28 years of daily meteorological data, either including or excluding 11 static catchment attributes. The training process for each model scenario combination was repeated 100 times using a Latin hypercube sampler for hyperparameter optimisation with batch sizes of 256 and 2048. The evaluation was carried out using data from 35 additional catchments (6 years) to ensure predictions in basins that were not part of the training data. This evaluation assessed predictive accuracy and computational efficiency concerning varying batch sizes and input configurations and conducted a sensitivity analysis of dynamic input features. The findings indicated that all examined artificial neural networks demonstrated significant predictive capabilities, with a CNN model exhibiting slightly superior performance, closely followed by LSTM and GRU models. The integration of static features was found to improve performance across all models, highlighting the importance of feature selection. Furthermore, models utilising larger batch sizes displayed reduced performance. The analysis of computational efficiency revealed that a GRU model was 41 % faster than the CNN model and 59 % faster than the LSTM model. Despite a modest disparity in performance among the models (<3.9 %), the GRU model's advantageous computational speed rendered it an optimal compromise between predictive accuracy and computational demand.
Artificial intelligence (AI) is increasingly being used to answer scientific questions, including those in the realm of hydrology (Kratzert et al., 2019a, b; Afzaal et al., 2019; Nabipour et al., 2020). The predictive accuracy of AI in these hydrological studies, particularly concerning discharge, is of paramount importance for flood control, watershed management or the estimation of water availability (Sharma and Machiwal, 2021; Brunner et al., 2021). In the era of climate change, which causes tremendous variability in rainfall patterns and increases evapotranspiration, the role of precise hydrological forecasts becomes even more essential (Tabari, 2020). An area of particular challenge is prediction in ungauged basins (PUB), an endeavour fraught with substantial uncertainty due to the lack of empirical data for model calibration (Blöschl, 2016). Effective models for PUB should thus possess robust generalisation capabilities across diverse watershed behaviours, enabling more universal basin type predictions (Sivapalan et al., 2003).
As demonstrated by Kratzert et al. (2019a), an artificial neural network (ANN) model, namely the long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997), showed unprecedented accuracy in PUB. The employed LSTM model exhibited the ability to generalise rainfall–runoff predictions across a substantial number of basins (531), surpassing the performance of traditional hydrological models that typically operate best when independently calibrated for each separate basin. Further comparative analyses, such as those by Le et al. (2023), evaluated the performance of LSTM against other ANNs like multilayer perceptrons (MLPs) and convolutional neural networks (CNNs) in daily streamflow prediction. This study revealed superior performance by LSTM and CNN models over conventional ANNs, with LSTM exhibiting a marginal edge over CNN. Moreover, a novel approach proposed by Ghimire et al. (2021) involved a hybrid CNN–LSTM model, designed for hourly discharge predictions. When benchmarked against various ANNs (CNN; LSTM; deep neural network – DNN), traditional AI models (extreme learning machine, MLP) and ensemble methods (decision tree, gradient boosting regression, extreme gradient boosting, multivariate adaptive regression splines), the CNN–LSTM model displayed superior performance with regard to multiple evaluation metrics, although all ANNs exhibited high efficacy. This is evidence that deep learning, a subset of machine learning characterised by multilayered ANNs, holds substantial promise for streamflow prediction. However, while numerous studies have explored discharge prediction using ANNs, only a limited number have conducted comparative analyses of different ANN architectures. Table 1 summarises these studies from 2020 to December 2023, noting that most incorporate lagged target variables as inputs. This methodology, though effective, is less applicable for PUB due to the absence of discharge data in ungauged or poorly gauged regions, necessitating the use of discharge-independent inputs.
Among the studies shown in Table 1, three specifically addressed this constraint. The first, by Nguyen et al. (2023a), evaluated CNN and LSTM models for daily discharge prediction in the 3S river basin, exclusively using daily mean temperature and precipitation data. This study adopted a “regional” approach, akin to that of Kratzert et al. (2019a), training both model architectures with data from all three sub-basins. The LSTM was found to outperform the CNN, although the latter's results were not extensively discussed. The second study, by Wegayehu and Muluneh (2023), contrasted three super ensemble learners against eight base models, including LSTM, a gated recurrent unit model (GRU) and a compound CNN–GRU model, for daily discharge prediction. Here, the LSTM ranked among the top three in four out of five scenarios based on R2 metrics. However, its performance declined significantly in the absence of feature selection, indicating susceptibility to redundant features. Notably, this study trained separate models for each basin, thus not directly addressing PUB generalisation capabilities. The third study, by Oliveira et al. (2023), compared three ANN models (LSTM, CNN and MLP) for daily discharge estimation in a single basin, where the CNN model exhibited superior performance (Nash–Sutcliffe efficiency (NSE) of 0.86). However, this does not imply generalisability in non-calibrated catchments as both calibration and testing occurred within the same basin. Regrettably, this limitation pertains to all three studies. Consequently, this research aimed to bridge the existing literature gap by comparing the performance of three distinct ANN architectures for predicting discharge in ungauged basins. Through a comparative analysis, this study not only addresses a significant gap in hydrological literature but also provides valuable insights into the relative strengths and limitations of each ANN model, thereby guiding future applications and development in the field of hydrological prediction. Furthermore, a comprehensive sensitivity analysis was conducted to identify key drivers affecting the prediction of each model. This methodological approach contributes to refining model selection and calibration strategies in hydrological forecasting.
The first architecture under examination was the LSTM, which demonstrated robust performance in numerous studies (Kratzert et al., 2019a, b; Le et al., 2023; Nguyen et al., 2023a). Although LSTM models demonstrated promising performance, the inherent sequential architecture of LSTM led to higher computational costs. This resulted in a relative decrease in computational efficiency when compared to feed-forward neural networks or CNNs, as discussed in Gauch et al. (2021). In pursuit of addressing these limitations and challenges that are inherent to LSTM models, the second architecture chosen for examination was the CNN. This model is characterised by its parallel processing capabilities, significantly boosting computational efficiency, a critical factor when handling large-scale, high-resolution time series data; extensive input sequences; and a multitude of input features (Bai et al., 2018). The third architecture under consideration was the gated recurrent unit. GRU, a variant of LSTM, is recognised for its proficiency in effectively capturing temporal dependencies in time series data while imposing less computational burden (Cho et al., 2014).
Given that PUB is often characterised by data scarcity, this study incorporated two distinct scenarios: the first involving the use of only daily forcing data and the second extending this with additional static catchment features. This approach allowed for an evaluation of the model's generalisation capacity when constrained to minimal data. Additionally, it provided insights into the degree to which static catchment features could contribute to enhancing model performance, as indicated by Kratzert et al. (2019a). Accordingly, the objectives of this study were delineated as follows:
-
to evaluate the potential of predicting discharge in ungauged basins by means of daily forcing data with ANNs, namely LSTM, CNN and GRU;
-
to compare the computational efficiency of LSTM, CNN and GRU models for daily time series prediction;
-
to investigate the potential of static features to enhance prediction performance; and
-
to assess the impact of batch size on model performance and computational efficiency.
Table 1. Overview of recent studies focused on comparing discharge prediction using various artificial neural networks. "Target independence" indicates that discharge data were not utilised as input features during model training and/or testing. "Ungauged" indicates model evaluation with catchments that were not part of the training dataset. "Multi-catchment" denotes that the models were evaluated on multiple catchments.

ANFIS denotes adaptive neuro-fuzzy inference system, ANN denotes artificial neural network, BiLSTM denotes bidirectional LSTM, CNN denotes convolutional neural network, DT denotes decision tree, DTR denotes decision tree regressor, FNN denotes feed-forward neural network, GB denotes gradient boosting, GRU denotes gated recurrent unit, LSTM denotes long short-term memory, LR denotes linear regression, MLP denotes multilayer perceptron, LASSO denotes least absolute shrinkage and selection operator, PSO denotes particle swarm optimisation, Res denotes residual, RF denotes random forest, RNN denotes recurrent neural network, SVR denotes support vector regression, and XGB denotes extreme gradient boosting. a Only the results of the LSTM model are stated. b Hyperparameter configuration nontransparent.
2.1 Study area
All basins analysed in this study are located in the federal state of Hesse, Germany (Fig. 1). The climate of this region is temperate–humid and is characterised by moderate temperature and precipitation levels (Heitkamp et al., 2020). The topography of Hesse, characterised by a complex blend of lowlands, hilly terrains and modest mountain ranges, fosters a multifaceted hydrological setting. A variety of geological formations and soil types within the region contribute to the mixed pattern of infiltration rates, groundwater recharge and surface runoff (Jehn et al., 2021).
2.2 Data sources
The dataset used in this study was derived from Jehn et al. (2021). For each catchment, daily sums of precipitation [mm], daily sums of evapotranspiration [mm] and soil temperature at 5 cm soil depth [°C] were available along with the corresponding discharge [mm]. The discharge data were obtained from a gauging station located within the respective catchment. In addition, the dataset included 11 static catchment features corresponding to every catchment (Table 2). As suggested by Kratzert et al. (2019a), the inclusion of static catchment attributes can improve the performance of machine learning models. Table 2 provides an understanding of the underlying aggregation of data, spatial resolution and units. Apart from discharge data, which are accessible upon contacting the Hessian Agency for Nature Conservation, Environment and Geology, all other datasets are publicly available within the associated repository of Jehn (2020).
2.3 Data preprocessing
The preprocessing of the input data was an essential step to ensure that the quality and integrity of the data were maintained. This process entailed a detailed analysis of data continuity, encoding of non-numerical values, and splitting of the dataset into training and validation subsets, followed by data normalisation and subsequent transformation. The data analysis revealed discontinuities in the discharge data across the time series of 39 catchments. In order to provide the longest possible time series for the training process, a total of 54 out of the full set of 95 catchments were selected for model training. These catchments covered 28 years (1991–2018). Of the remaining 39 catchments, 35 were utilised for testing, each with a time series spanning 6 years (1997–2002). Rivers containing artificial constructions that impede discharge through impoundments (e.g. reservoirs) were not considered in this analysis. However, it should be noted that a subset of the selected rivers might be equipped with hydraulic control mechanisms, such as floodgates (Jehn et al., 2021).
For both training and testing datasets, all categorical features (Table 2) were encoded using label encoding. In this approach, every unique category of a categorical feature was replaced by a distinct integer value (Lin et al., 2020). This method was preferred over the frequently recommended one-hot-encoding technique (Duan, 2019; Cerda and Varoquaux, 2022) in order to circumvent an increase in the total feature count equivalent to the number of unique feature categories, as occurs with one-hot encoding (Ul Haq et al., 2019). Moreover, label encoding accommodates ordinal scales, which suit hierarchical features such as permeability. In contrast, categorical features without a meaningful order, such as soil type or soil texture, are better handled by one-hot encoding, which treats each category independently. Furthermore, Potdar et al. (2017) indicated that label encoding yielded the lowest performance among various investigated encoding methods, so it cannot be unequivocally asserted that this method was the optimal approach. Nevertheless, to avoid further increasing the number of static input features, label encoding was selected.
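As a concrete illustration, the minimal sketch below applies label encoding with scikit-learn; the feature names and values are hypothetical and do not reproduce the actual attributes of Table 2.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical static-feature table; column names and values are illustrative only
static = pd.DataFrame({
    "soil_type": ["cambisol", "luvisol", "cambisol", "gleysol"],
    "permeability": ["low", "medium", "high", "low"],
})

# Replace every unique category with a distinct integer, one encoder per column
for col in static.columns:
    static[col] = LabelEncoder().fit_transform(static[col])
```

Note that LabelEncoder assigns integers in alphabetical order; preserving a true ordinal ranking (e.g. low < medium < high for permeability) would require an explicit mapping.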
The training dataset of 54 catchments was then further divided, using 80 % of the data for training and 20 % for validation. Subsequently, the two datasets were normalised by employing a min–max scaling method, with a range of [0,1] chosen as the boundaries. The choice of this scaling method was made empirically based on the observed performance in the dataset and model configuration. Concurrently, the precision of the data representation was configured to adhere to a float32 format. The target variable was scaled independently of the features. Moreover, to prevent data leakage, each feature normalisation was established solely based on the training dataset.
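The following minimal sketch illustrates this normalisation scheme with scikit-learn, using placeholder arrays: the scalers are fitted on the training split only to prevent leakage, the target is scaled independently of the features, and the outputs are cast to float32.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_train, X_val = rng.random((800, 14)), rng.random((200, 14))  # placeholder data
y_train, y_val = rng.random(800), rng.random(200)

feature_scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)  # training data only
target_scaler = MinMaxScaler(feature_range=(0, 1)).fit(y_train.reshape(-1, 1))

X_train = feature_scaler.transform(X_train).astype(np.float32)
X_val = feature_scaler.transform(X_val).astype(np.float32)   # reuses training statistics
y_train = target_scaler.transform(y_train.reshape(-1, 1)).astype(np.float32)
y_val = target_scaler.transform(y_val.reshape(-1, 1)).astype(np.float32)
```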
The normalised training dataset exhibited a shape of N×D for each catchment, where N signified the number of samples in time, and D represented the number of features. To assess the impact of additional static features, two distinct datasets were created. The first dataset included only three features with daily forcing data and assumed a shape of N×3, while the second incorporated all 11 static features and took a shape of N×14.
To transform the datasets into training batches, a two-dimensional moving window, characterised by dimensions of T×D, was subsequently implemented, where T represents the moving-window size, also known as the look-back period or sequence length (Fig. 2). This window is continuously incremented by a single period in the dimension of N, with the initial window encompassing observations [N1, NT]. The consecutive window encapsulates observations [N2, NT+1], and this pattern is maintained until the window reaches the final element of the dataset (Nn). Consequently, the entire dataset was partitioned into m = N − T + 1 subsamples for every catchment. All subsamples were combined into a three-dimensional array (m×T×D). The transformed catchment datasets were stacked into one final training set with the shape of (C⋅m)×T×D, where C was equal to the number of catchments. The identical transformation was implemented for both validation and test datasets, encompassing those with and without static features.
It is important to note that the transformation of the data is already part of the hyperparameterisation process, a concept further elucidated below.
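A minimal NumPy sketch of this moving-window transformation is given below; the array names and the example window size are illustrative.

```python
import numpy as np

def sliding_windows(data: np.ndarray, T: int) -> np.ndarray:
    """Transform one catchment's series of shape (N, D) into overlapping
    subsamples of shape (m, T, D), with m = N - T + 1."""
    N = data.shape[0]
    return np.stack([data[i:i + T] for i in range(N - T + 1)])

# Stack C catchments into one training set of shape ((C * m), T, D);
# `catchments` is a hypothetical list of C arrays, each of shape (N, D)
catchments = [np.random.rand(1000, 14) for _ in range(3)]
X = np.concatenate([sliding_windows(c, T=180) for c in catchments])
```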

Figure 2. Schematic procedure of data transformation by applying a moving window: this procedure primarily involves the partitioning of the data into distinct sections, employing a window (blue) that slides across the dataset, effectively creating a temporal snapshot (m). T delineates the window size within the temporal dimension, D represents the feature dimension, and N signifies the temporal samples with a daily resolution.
2.4 Hyperparameterisation
The performance of machine learning models is influenced by the optimisation of their respective hyperparameters (Shekhar et al., 2022; Ozaki et al., 2021). In the domain of machine learning, hyperparameters are variables that define the configuration of the models and are set prior to the training process (Bhattacharjee et al., 2021), while the term parameter refers to the variables that the model learns during training (Goodfellow et al., 2016). The selection of an appropriate tool for hyperparameter optimisation is a critical step. Consequently, this task was conducted utilising a Python framework known as Spotpy (Houska et al., 2015). The framework offers computational optimisation techniques for calibrating models, such as a Latin hypercube sampler (LHS), an appropriate method for selecting input variable values within a specified range given its ability to generate near-random samples from a multidimensional hyperparameter distribution (McKay et al., 1979).
The hyperparameters of the models are contingent upon the architectural design. In this study, three distinct model architectures were explored: LSTM, GRU and CNN. LSTM and GRU are both types of recurrent neural networks (RNNs), specifically designed to handle sequential data, such as time series. As the employed LSTM and GRU models possess an identical layer structure, both models share an equivalent set of hyperparameters. A detailed overview of the utilised hyperparameters can be found in Table 3.
The hyperparameter T denotes the window size employed in the moving-window mechanism and signifies the length of the sequence, representing how many time steps (past days) are used to predict the discharge of the following day. The feature maps F quantify the number of results or features generated within the convolution process. This is achieved by utilising a kernel of size k, referred to as the filter size, which is systematically applied over the data to extract essential patterns and characteristics, thereby transforming the input data.
In the context of LSTM and GRU models, the unit U refers to the number of hidden neurons within the RNN layer. This quantity not only characterises the internal complexity of the layer but also corresponds to the output dimension. The final hyperparameter under consideration is the dropout rate p, which represents the fraction of the neurons that are randomly set to zero during training (Srivastava et al., 2014).
The ranges of the hyperparameters were delineated in preliminary experiments by repeatedly training each model employing LHS over wider ranges. Any hyperparameter that fell below or exceeded the minimum and maximum bounds of Table 3 demonstrated inferior performance on average. The final training process was executed with a sampling size of 100 for each model and batch size combination, with and without static features. This culminated in a total of 12 distinct sampling processes.
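Although the hyperparameter sampling in this study was performed with Spotpy's LHS, the sketch below illustrates the same idea with SciPy's quasi-Monte Carlo module as a stand-in; the parameter names and bounds are illustrative, not the exact ranges of Table 3.

```python
import numpy as np
from scipy.stats import qmc

names = ["T", "U", "dropout"]          # window size, hidden units, dropout rate
lower, upper = [30, 32, 0.0], [365, 512, 0.5]

sampler = qmc.LatinHypercube(d=len(names), seed=42)
unit_samples = sampler.random(n=100)              # 100 near-random points in [0, 1]^d
samples = qmc.scale(unit_samples, lower, upper)   # rescale to the hyperparameter bounds

# One training configuration per sampled row (integers rounded where required)
configs = [{"T": int(round(t)), "U": int(round(u)), "dropout": p}
           for t, u, p in samples]
```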
2.5 Model architectures
The architecture of the LSTM was first introduced by Hochreiter and Schmidhuber (1997). An LSTM consists of a memory cell governed by four specific gate units, granting the capacity to preserve information over extended periods (Cho et al., 2014). Through this architectural design, LSTMs possess the capability to mitigate the challenges associated with exploding or vanishing gradients, as encountered in traditional RNNs. While the nuanced workings of LSTM cells and their concomitant advantages are pertinent (Hochreiter and Schmidhuber, 1997), they have been extensively discussed in prior research and thus will not be repeated within this study.
The architectural design of a GRU model is inspired by the structure of LSTMs, with the distinction that it incorporates only two gates to regulate the information flow. This results in reduced computational complexity, thereby rendering GRUs more computationally efficient while still addressing the exploding- and vanishing-gradient problem (Cho et al., 2014).
In contrast, CNNs are tailored for grid-like data structures, including images. The CNN architecture was first introduced by Fukushima (1980). The term convolutional neural network was introduced by LeCun et al. (1989), who developed a model for handwritten digit recognition.
CNN models possess a significant advantage in that the convolution operation is inherently parallelisable, allowing for the simultaneous execution of numerous calculations. An additional merit is the ability to extract features irrespective of the exact location where the feature is found. This reduces the number of input samples needed for training and thus improves computational efficiency (Lecun et al., 1998). Note that these extracted features are distinct from those listed in Table 2.
The architectural configurations of the three models employed in this study are depicted in Fig. 3, with further explanations provided in the subsequent sections.

Figure 3. Schematic diagrams of the architectures of the three utilised models: (a) long short-term memory (LSTM), (b) gated recurrent unit (GRU) and (c) convolutional neural network (CNN).
2.5.1 LSTM
The LSTM model comprises a single LSTM layer configured with a designated number of hidden units (U). To mitigate overfitting and promote generalisation, a dropout layer is directly connected to the LSTM layer, introducing regularisation by randomly deactivating a specific fraction (dropout rate) of the hidden units (Srivastava et al., 2014).
The final layer is a dense layer that applies a sigmoid activation function, which converts the output into a probability value between 0 and 1 (Fig. 4c). The adoption of this specific activation function was motivated by the need to prevent the generation of negative discharge predictions, which were previously encountered with the use of alternative activation functions like LeakyReLU or a linear function. Such negative predictions are hydrologically implausible and undermine the validity of the model outputs. However, the utilisation of a sigmoid function, in conjunction with a min–max scaling technique, introduces a structural limitation wherein the model is incapable of extrapolating beyond the maximum discharge values observed during the training phase. Considering these trade-offs, the sigmoid function was chosen as a compromise to balance model stability and physical realism.
A comprehensive examination of all activation functions employed within the models is provided in Fig. 4. This illustration delineates the specific characteristics of each function, highlighting that both the rectified linear unit (ReLU) and sigmoid functions are designed to avoid negative values. The ReLU function, in particular, suppresses negative values by setting them to 0, while the sigmoid function, recognised by its characteristic S shape, maps any input into values between 0 and 1. Pertinently, in the context of deep learning, especially in image recognition, ReLU is often favoured for its expedited learning capabilities, yielding enhanced performance and superior generalisation attributes (Krizhevsky et al., 2017). However, it was observed in preliminary experimental setups that the sigmoid function exhibits greater stability, while ReLU demonstrated a higher propensity to induce gradient exploding. The complete architectural design of the LSTM model is illustrated in Fig. 3a.
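A Keras-style sketch of this architecture is shown below; the paper does not name its deep learning framework, so TensorFlow/Keras is an assumption, and the hyperparameter values are illustrative.

```python
import tensorflow as tf

T, D = 232, 14   # window size and feature count (illustrative)
U, p = 128, 0.2  # hidden units and dropout rate (illustrative)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(T, D)),
    tf.keras.layers.LSTM(U),                         # single LSTM layer with U hidden units
    tf.keras.layers.Dropout(p),                      # randomly deactivates a fraction p of units
    tf.keras.layers.Dense(1, activation="sigmoid"),  # constrains predictions to (0, 1)
])
```

The GRU model of Sect. 2.5.2 follows from the same sketch by replacing tf.keras.layers.LSTM with tf.keras.layers.GRU.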
2.5.2 GRU
The architecture of the GRU model shares a structure similar to that of the previously described LSTM model, with the primary difference being the substitution of the LSTM layer with a GRU layer (Fig. 3b). Similarly to the LSTM model, the GRU model contains a single layer configured with a designated number of hidden units (U) and employs a dropout layer directly connected to the GRU layer to mitigate overfitting and to promote generalisation. The final dense layer similarly employs a sigmoid activation function to ensure that all predicted discharge values remain within a physically plausible range.
2.5.3 CNN
The CNN is composed of a series of three convolution cells, each containing a one-dimensional convolution layer followed by a pooling layer. The convolution layers incorporate a ReLU activation function (Fig. 4b) and employ a sliding-window mechanism known as a kernel that traverses the input data for processing. As previously elucidated, this kernel is responsible for extracting feature maps (F) from time-dependent input features. The kernel, with a size of k, is applied uniformly across all convolution layers. In each successive convolution layer, the quantity of feature maps is increased by a factor of 2, thereby enhancing the model's capacity to extract and represent complex features.
In the initial pair of convolution cells, the temporal dimension (T array) within the pooling layer is reduced by a factor of 2 by employing a stride of size 2 across each T array, while the third pooling layer extracts a single set of feature maps along the temporal axis of all T arrays. To preserve the temporal dimension during the convolution process, each convolution layer incorporates symmetric zero-padding. This technique involves adding zeros around the input data, ensuring that the processed dimension remains unchanged after applying the convolution operation.
The last layer of the model is a dense layer that compresses the model dimensions to produce a single output value for each prediction. This layer is fully connected to the preceding layer and uses a leaky rectified linear unit (LeakyReLU) activation function as depicted in Fig. 4b. The LeakyReLU, akin to the standard ReLU (shown in the same figure), differs by introducing a small, non-zero slope for negative values. This characteristic enhances gradient propagation and mitigates the issue of vanishing gradients (Ramachandran et al., 2021).
The selection of the LeakyReLU over the standard linear activation function (Fig. 4a) was driven by the latter's propensity to generate negative predictions for the discharge values. Although LeakyReLU does not entirely preclude negative predictions, it effectively modulates them into marginally negative outputs and therefore reduces the extent of negative predictions. Although the sigmoid function is effectively utilised in LSTM and GRU models to prevent negative discharge predictions, its application within the CNN model framework yielded suboptimal results in preliminary trials, particularly when compared to the performance achieved using the LeakyReLU activation function. This informed the decision to opt for LeakyReLU in our work.
A visual representation of the complete architectural design of the CNN model is presented in Fig. 3c.
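A Keras-style sketch of this CNN is given below; the framework and the type of pooling (max rather than average) are assumptions, as the paper does not state them, and the hyperparameter values are illustrative.

```python
import tensorflow as tf

T, D = 179, 14  # window size and feature count (illustrative)
F, k = 32, 3    # initial feature maps and kernel size (illustrative)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(T, D)),
    tf.keras.layers.Conv1D(F, k, padding="same", activation="relu"),      # symmetric zero-padding
    tf.keras.layers.MaxPooling1D(pool_size=2, strides=2),                 # halves the temporal dimension
    tf.keras.layers.Conv1D(2 * F, k, padding="same", activation="relu"),  # feature maps doubled
    tf.keras.layers.MaxPooling1D(pool_size=2, strides=2),                 # halves it again
    tf.keras.layers.Conv1D(4 * F, k, padding="same", activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),          # one set of feature maps along the temporal axis
    tf.keras.layers.Dense(1),
    tf.keras.layers.LeakyReLU(),                   # small slope for negative pre-activations
])
```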

Figure 4. Visualisation of the three activation functions utilised within the employed models. The diagrams show the graphical representations and functional ranges of (a) the linear function, which preserves the raw, untransformed input; (b) the rectified linear unit (ReLU) function, which maps negative inputs to zero and passes positive inputs unchanged; and (c) the sigmoid function, characterised by its distinct S shape, which compresses any input into a range between 0 and 1. Note the different y-axis scales.
2.6 Loss function
In machine learning algorithms, the role of the loss function is paramount as it quantifies the discrepancy between the model's predictions and the actual data (Wang et al., 2022). The optimiser, an algorithm designed to minimise the loss, regulates the process of updating the model's parameters. This optimiser strives to enhance model performance by iteratively determining the loss and then adjusting the model parameters to reduce this loss. This is achieved by computing the gradient (derivative) of the loss function, which points in the direction of steepest ascent, and updating the parameters in the opposite direction, thereby descending towards a local minimum. Thus, by minimising the loss, the machine learning model can improve its predictive accuracy.
The optimiser used for all models in this study is the Adam optimiser (Kingma and Ba, 2017). This algorithm provides high computational efficiency for gradient-based optimisation and is suitable for large models that include a high number of parameter sets.
The choice of loss function is dictated by the specific task at hand. A commonly used loss function when predicting continuous data is the mean square error (MSE), which is favoured for its computational efficiency. However, MSE suffers from sensitivity to outliers due to its quadratic penalty and exhibits scale dependence, rendering it less interpretable and comparably challenging when evaluating models across disparate output scales (Liano, 1996; Gupta et al., 2009).
Another metric used to capture model performance, traditionally employed in hydrology, is the Nash–Sutcliffe efficiency (NSE) (Knoben et al., 2019). Based on the close similarities between MSE and NSE and, hence, the inherent disadvantages, NSE is not an ideal choice as a loss function either (Gupta et al., 2009).
To mitigate the systematic issues encountered in optimisation processes that arise from formulations linked to the MSE or NSE, we decided to utilise the more resilient Kling–Gupta efficiency (KGE). KGE corrects for underestimation of variability by providing a direct evaluation of four different facets of the discharge time series, encompassing shape, timing, water balance and variability (Santos et al., 2018). The definition of KGE is delineated in Eq. (1):

KGE = 1 − √((r − 1)² + (α − 1)² + (β − 1)²), (1)

with

α = σ_sim / σ_obs, β = μ_sim / μ_obs,
where μ is the mean, σ is the standard deviation, and r is the linear correlation factor between observations and simulations. The variable α is a measure of how well the model captures the variability of the observed data, and β defines a bias term indicating how much the model's predictions systematically deviate from the true values (Knoben et al., 2019).
Analogously to NSE, KGE also indicates the highest performance when equal to 1. However, the goal of the loss function is to minimise the error; thus, the discrepancy between simulation and observation should approach 0. Therefore, the implemented loss function L results in Eq. (2):

L = 1 − KGE. (2)
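A possible TensorFlow implementation of this loss is sketched below (the framework is an assumption); since KGE = 1 − ED, where ED is the Euclidean distance of the three components from their ideal value of 1, minimising L = 1 − KGE reduces to minimising ED.

```python
import tensorflow as tf

def kge_loss(y_true, y_pred):
    """L = 1 - KGE (Eqs. 1 and 2), computed over a batch."""
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    mu_o, mu_s = tf.reduce_mean(y_true), tf.reduce_mean(y_pred)
    sig_o, sig_s = tf.math.reduce_std(y_true), tf.math.reduce_std(y_pred)
    r = tf.reduce_mean((y_true - mu_o) * (y_pred - mu_s)) / (sig_o * sig_s)
    alpha, beta = sig_s / sig_o, mu_s / mu_o          # variability and bias ratios
    return tf.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)
```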
2.7 Model training
The training process was conducted using a GeForce RTX 3090 graphics card equipped with 24 GB of memory. Each model was subjected to training with batch sizes of 256 and 2048. The batch size is a fraction of the total number of training samples and represents the number of samples utilised to train the model prior to an update of the internal parameters (Radiuk, 2017).
The batch size has no physical interpretation in the context of hydrological processes but functions as a crucial hyperparameter in the training of neural networks. Prior studies, such as that of Kratzert et al. (2019a, b), have demonstrated the successful application of a batch size of 256. In this study, this batch size was also adopted and served as the baseline. To further explore the impact of larger batch sizes, a multiple of 256 was employed. A batch size of 2048 was then utilised, representing the upper limit of the memory capacity of the graphics card used.
The maximum number of epochs designated for training was set to 60. An epoch refers to a single iteration over the entire training dataset during which the model's parameters are adjusted to minimise loss. However, the training process was configured to terminate when the validation loss failed to show improvement throughout five consecutive epochs. An enhancement was recognised when the validation loss decreased by a minimum of 0.001 during these five epochs. This mechanism is called early stopping.
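In Keras terms (an assumed framework, as before), this early-stopping criterion corresponds to a callback such as the following.

```python
import tensorflow as tf

# Stop training when the validation loss fails to improve by at least
# 0.001 over five consecutive epochs, as described above
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", min_delta=0.001, patience=5)
```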
Given that the input data for the training procedure are arranged by catchments, shuffling of data was implemented to circumvent the potential for overfitting to a specific catchment. Furthermore, each model was trained both with and without the inclusion of static features for the two specified batch sizes. This leads to a total of four distinct training phases for every model with a specific hyperparameter set.
The static features were processed within the models in the same way as the daily features. The learning rate, frequently acknowledged as the paramount hyperparameter to tune, exerts a considerable influence on the training of models that employ gradient descent algorithms (Xu et al., 2019).
When the learning rate is too high, the optimiser may diverge from the local minimum, while setting it too low can result in a protracted learning process (Zeiler, 2012). To efficiently address this behaviour, a dynamic adjustment of the learning rate was integrated into the training process using a learning-rate scheduler.
This algorithm modifies the learning rate based on the current epoch number. During the warm-up period, the learning rate linearly increased from the initial rate to the base rate throughout three epochs. The warm-up period was followed by a decay period lasting 10 epochs, during which the learning rate linearly decreased from the base rate to the minimum rate. Following the decay phase, the learning rate was kept constant at the minimum rate for the remaining epochs. Detailed information can be found in Table 4.
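A sketch of such a scheduler is given below; the rates are placeholders for the actual values of Table 4, and Keras is again assumed.

```python
import tensorflow as tf

INIT_LR, BASE_LR, MIN_LR = 1e-4, 1e-3, 1e-5  # placeholders for the Table 4 values
WARMUP, DECAY = 3, 10                        # warm-up and decay lengths in epochs

def schedule(epoch, lr):
    if epoch < WARMUP:          # linear warm-up from the initial to the base rate
        return INIT_LR + (BASE_LR - INIT_LR) * (epoch + 1) / WARMUP
    if epoch < WARMUP + DECAY:  # linear decay from the base to the minimum rate
        return BASE_LR - (BASE_LR - MIN_LR) * (epoch - WARMUP + 1) / DECAY
    return MIN_LR               # constant thereafter

scheduler = tf.keras.callbacks.LearningRateScheduler(schedule)
```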
3.1 Model performance
The analysis depicted in Fig. 5 delineates a comparative evaluation of model performance concerning architectural variations, batch sizing and the incorporation of supplementary static attributes. The findings reveal that employing CNN models in conjunction with static features yielded a mean KGE of 0.80 and 0.78 for batch sizes of 256 and 2048, respectively. The inclusion of static features provides a performance benefit as the mean accuracy drops to 0.71 and 0.67 when static features are omitted for batch sizes of 256 and 2048, respectively. This aligns with the findings presented by Kratzert et al. (2019b), who assert that static catchment attributes enhance overall model performance by improving the distinction between different catchment-specific rainfall–runoff behaviours.
Notably, the maximum KGE in the absence of static features reached 0.97 and 0.92 for batch sizes of 256 and 2048, respectively, highlighting the potential for high model performance even without static features. On the contrary, when static features are omitted, the minimum KGE drops to −0.21 and −0.26 for batch sizes of 256 and 2048, respectively, the lowest minimum performance of all models. This suggests a deficiency in the model's ability to generalise, a phenomenon frequently observed when overfitting occurs (Srivastava et al., 2014).
Regarding the minimum KGE values when utilising static features, the CNN models demonstrated the third and fourth highest minimum values, registering at 0.24 and 0.20 for batch sizes of 256 and 2048, respectively.

Figure 5. Evaluation of performance discrepancies in the applied models relative to batch size and additional static catchment attributes during the testing period. The number represents the average KGE over all 35 catchments. The dotted line displays the percentile intervals.
In the case of LSTM networks, mean KGE values of 0.78 and 0.73 with static features for batch sizes of 256 and 2048, respectively, can be noted. The mean KGE declined to 0.73 and 0.68 when static features were omitted for batch sizes of 256 and 2048, respectively. Notable is the maximum performance achieved with static features, which reached 0.94 for a batch size of 256. In contrast, the LSTM with a batch size of 2048 exhibited the lowest minimum value of 0.05 across all models with static features. For models run without static features, the LSTM with a batch size of 256 recorded the highest minimum value of 0.09. Conversely, the LSTM model with no static features and a batch size of 2048 presented the lowest maximum KGE of 0.86.
For GRU, the mean KGE exhibited similar trends with the inclusion of static features, reaching 0.77 and 0.75 for batch sizes of 256 and 2048, respectively. The mean performance declined to 0.71 and 0.69 when static features were omitted for batch sizes of 256 and 2048, respectively. The GRU model with a batch size of 2048 demonstrated the highest minimum KGE value of 0.37 among all models when static features were incorporated. Following closely, the GRU model with a batch size of 256 under the same feature scenario presented the second highest minimum KGE of 0.28.
Upon examining the performance range, the GRU model with static features and a batch size of 2048 exhibited the narrowest performance range of 0.52. Subsequently, the GRU model with static features and a batch size of 256 displayed a performance range of 0.63, indicating robust generalisation capabilities for these two models. Notably, for both batch sizes, the GRU model demonstrated a marginally higher maximum KGE when static features were omitted. This finding contradicts the outcomes of all other models, where the inclusion of static features consistently increased the maximum KGE, regardless of batch size. The sole exception to this pattern was observed in the CNN model with a batch size of 256, which reached its maximum KGE without static features.
Overall, when analysing the influence of batch size across various models, it becomes evident that an increase in batch size correlates with a decrease in performance. This observation is consistent with the study of Masters and Luschi (2018), who found that smaller batch sizes contribute to enhanced training stability and generalisation performance when employing CNN models for image classification. Additionally, Kandel and Castelli (2020) identified a strong correlation between learning rate and batch size, proposing that higher learning rates should be employed when utilising larger batch sizes. However, the learning rate remained constant across varying batch sizes throughout this study.
Altogether, these results suggest the following:
-
The smaller batch size of 256 contributes to better model performance with regard to mean KGE values.
-
Static features generally improved the mean KGE across all architectures and batch sizes.
-
The CNN model with static features and a batch size of 256 showed the highest mean KGE and therefore slightly outperformed the LSTM and GRU models.
-
The KGE performance ranges for models with static features are substantially smaller and at a higher level than the ranges for models without static features.
-
Overall, the GRU model with a batch size of 256 and static features exhibited favourable KGE performances akin to LSTM and CNN models and mitigated poor predictions across all test catchments.
Comparing evaluation metrics
To further investigate the efficacy of the applied models, additional performance metrics were incorporated. Among these, the NSE was selected to facilitate comparison with prior studies that conventionally utilise this metric. Moreover, the percent bias (PBIAS) was employed to gauge the systematic deviation of the modelled data from observed values, indicating whether the model predictions consistently overestimate or underestimate the observations (Moriasi et al., 2007). The mean absolute error (MAE) was integrated as a metric to quantify the absolute discrepancies between model predictions and actual observations, serving as a direct assessment of model precision (Siqueira et al., 2016). Lastly, the coefficient of determination (R2) was adopted as an indicator for evaluating the degree of alignment between simulations and observed data, reflecting the model's “goodness of fit” (Onyutha, 2022). A comparative view of the results of all of the employed performance metrics is shown in Table 5.
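For reference, the sketch below gives standard NumPy formulations of these metrics; the paper's exact Eqs. (3)–(6) are not reproduced in this section, so these are common textbook definitions (with R² taken as the squared Pearson correlation) rather than guaranteed matches.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency (cf. Eq. 3)."""
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def pbias(obs, sim):
    """Percent bias (cf. Eq. 4); positive values indicate underestimation."""
    return 100.0 * np.sum(obs - sim) / np.sum(obs)

def mae(obs, sim):
    """Mean absolute error (cf. Eq. 5), here in mm."""
    return np.mean(np.abs(obs - sim))

def r_squared(obs, sim):
    """Coefficient of determination (cf. Eq. 6) as squared Pearson correlation."""
    return np.corrcoef(obs, sim)[0, 1] ** 2
```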
Overall, the presented data indicate that NSE metrics are marginally lower than the KGE values. This phenomenon could potentially stem from the presence of counterbalancing errors, an inherent limitation associated with the KGE metric. Such counterbalancing errors materialise through concurrent overestimation and underestimation of the predictive target. Given that bias and variability collectively constitute two-thirds of the KGE, their effects may augment the aggregate score without necessarily indicating a more accurate or relevant model (Cinkus et al., 2023).
Notably, the CNN and LSTM models, when configured with a batch size of 256 and incorporating static features, achieved the highest NSE (Eq. 3) values of 0.76 and 0.75, respectively. In comparison, the GRU model under identical configurations exhibited a slightly inferior performance, marked by an NSE of 0.72. In the context of existing literature, Nguyen et al. (2023a) reported an NSE of 0.66 for an LSTM model calibrated across three distinct catchments, each with its own separate calibration and not extending to ungauged scenarios.
While models calibrated to individual basins often perform better than those generalised across multiple catchments, particularly in PUB, our results demonstrate that the generalised models trained here achieve even better results than these specialised models. Kratzert et al. (2019a) documented an NSE of 0.54 for an LSTM model, which, despite being lower, is deemed to be more robust due to its validation across 531 catchments using k-fold cross-validation. Nonetheless, the observation that NSE values surpass 0.7 for the most efficacious model of each architecture underscores the potential of these ANN models, provided that optimal hyperparameter tuning is applied and that sufficient data are available to support the learning process.
All CNN models universally exhibit a positive PBIAS (Eq. 4), signifying a consistent underestimation of discharge rates, regardless of variations in batch size or feature scenarios. Notably, CNN models lacking static features underestimate discharge by approximately 7 % on average, marking them as the models with the most pronounced underestimation. Conversely, the CNN model employing a batch size of 256 alongside static features demonstrates the smallest PBIAS, recorded at 0.06 %.
In contrast, LSTM models display a PBIAS pattern that does not adhere to a discernible trend. The LSTM model achieving the highest KGE metric overestimates the discharge by an average of 3.46 %. The LSTM models with a batch size of 2048 and inclusion of static features exhibit the most substantial overestimation, with a PBIAS of −5.1 %. The absence of static features in LSTM models tends to yield PBIAS values closer to zero, which is preferable.
GRU models reveal a negative PBIAS when static features are incorporated and show a positive PBIAS without them. The most favourable PBIAS among GRU models, −0.48 %, is observed in the model with a batch size of 256 and static features, closely aligning with the best-performing CNN model's PBIAS of 0.06 %. Overall, GRU models display the least average deviation in PBIAS.
Regarding MAE (Eq. 5), most models exhibit comparable outcomes, with an MAE of around 0.3 mm. However, LSTM and GRU models with a batch size of 2048 are exceptions, showing a slightly elevated MAE around 0.4 mm. Despite this, the models generally demonstrate an ability to minimise this error metric, particularly evident in CNN models with higher PBIAS values where the cancellation of positive and negative predictive errors does not occur.
The R2 (Eq. 6) scores of every model architecture show a slightly better fit without static features when comparing equal batch sizes. One exception to this trend is the GRU model with a batch size of 2048, where the model incorporating static features shows a higher fit than that without static features. Furthermore, the R2 values confirm the analysis of the KGE performance, which showed better performance with smaller batch sizes.
After considering the effects of batch size, feature scenarios and resulting performance metrics, it is also instructive to examine the chosen window sizes across the employed models, which may offer further insight into how each model processes temporal dependencies.
Across architectures, CNN models generally utilise smaller window sizes compared to LSTM models, with GRU models employing window sizes that lie between the two. This trend might reflect the intrinsic architectural efficiencies of CNN models in handling spatial–temporal data more compactly, while LSTM models, designed to capture long-term dependencies, benefit from broader temporal windows. The GRU models, with their simpler architectural design, may not manage extensive temporal sequences as effectively as the more complex LSTM models. Regarding batch sizes, there is an observable trend where smaller window sizes are generally favoured when larger batch sizes are used, with the exception of GRU models. The usage of static features does not directly influence the choice of window size but consistently correlates with enhanced performance across all window sizes and models.
Furthermore, for GRU models and, to a certain extent, for LSTM models at a batch size of 256, a decline in performance with increasing window size is observed, suggesting a potential overload of contextual information that may not be essential for accurate predictions. Conversely, for CNN and LSTM models at a batch size of 2048, an increase in window size correlates with improved performance.
Overall, these observations indicate that, while window size is a critical parameter in model configuration, its impact on performance is significantly modulated by other factors such as model architecture; batch size; and, especially, the inclusion of static features. In summary, the insights of Table 5 corroborate that CNN models, when incorporating static features, manifest superior efficacy, particularly in the context of the metrics assessed for validation.
Statistical variability across model runs
To assess whether the differences in performance among the best-performing CNN, LSTM and GRU models with a batch size of 256 and incorporating static features stem from random initialisation, each model was trained 20 times with distinct random seeds. The results are summarised in Fig. 6, which illustrates the distribution of KGE values across the repeated runs.
The mean KGE for CNN, LSTM and GRU models remained consistent within the range of the initial single-run results, registering at 0.76, 0.75 and 0.76, respectively. The interquartile range (IQR) for each model is relatively small, indicating low variability in performance due to random initialisation. Notably, the GRU model exhibits the narrowest IQR, reflecting its robustness across multiple runs. The LSTM model exhibits slightly greater variability, though its performance distribution largely overlaps with that of the GRU model. In comparison, the CNN model displays the widest IQR; however, the majority of its distribution is positioned at higher KGE values relative to the other models. Furthermore, the CNN model achieves the highest reported KGE value (0.80) but also includes the lowest outlier at 0.62.
These findings confirm that the CNN model exhibits a slight performance advantage over the LSTM and GRU models in terms of KGE. This observed difference is not predominantly influenced by random initialisation but instead reflects distinctions in the architectural design of the models and their respective capacities for generalisation. However, while the observed difference is relatively small, it is important to note that the overall performance of all models is strong, inherently leaving limited room for substantial improvement.
3.2 Runtime
To investigate the computational efficiency associated with the models employed, the runtime of the training process was measured for each model, considering variations in both batch size and the combination of features.
Both the batch size and the integration of additional static features significantly influence the runtime of models across all employed architectures, as evidenced in Fig. 7. The CNN model with a batch size of 2048 and without static features presented the shortest runtime of approximately 2.3 min. Although this CNN model demonstrated rapid convergence towards its optimal minimum error, it simultaneously exhibited the lowest performance, as delineated in Fig. 5. This suggests that such rapid convergence did not give the model sufficient opportunity to discern the intrinsic patterns in the data.
Using an identical batch size and feature configuration, the GRU model, along with the CNN model configured with a batch size of 256 and no static features, had the second shortest runtimes of approximately 4.2 min.
The introduction of static features resulted in a notable increase in the runtime for all models, barring the GRU model with a batch size of 256, where the inclusion of static features marginally reduced the runtime, rendering it the fastest among all models that utilised static features. The runtime augmentation was especially pronounced in the CNN model with a batch size of 2048, showing a more than 12-fold increase, thereby marking it as the most time-consuming model across all evaluated scenarios. LSTM models also exhibited a substantial increase in runtime across both batch sizes upon the incorporation of static features.

Figure 7. Comparison of model runtime across three different architectures (CNN, LSTM and GRU) with varying batch sizes (256 and 2048) and the presence or absence of static features.
Within identical model architectures, it is observed that larger batch sizes contribute to faster runtimes in the absence of static features. Conversely, when static features are employed, models tend to exhibit faster runtimes with smaller batch sizes, with the exception of the LSTM models. For these models, an escalation in batch size consistently results in accelerated runtimes, irrespective of the feature configuration.
The differing effect of additional features on training runtime across batch sizes is unexpected and cannot be explained solely by considering the batch size and feature scenarios. As reported by Radiuk (2017), larger batch sizes correlate with increased runtimes, which is attributable to the higher computational utilisation required to process an increased quantity of training samples for the purpose of updating model weights. Nonetheless, this assertion assumes that the models under comparison diverge only in terms of batch and feature size. This presumption does not apply to the present study, where each model is also characterised by a unique optimised combination of hyperparameters (Table 3). A possible explanation might be that all models exhibiting a more protracted runtime require additional epochs to converge. This behaviour could be facilitated by the early-stopping mechanism deployed in model training, which permits the termination of the training process when the optimised metric ceases to demonstrate improvement.
Altogether, when static features are incorporated, the GRU model utilising a batch size of 256 demonstrates the fastest runtime (9.5 min). In contrast, the CNN model, configured identically with respect to batch size and employed features, exhibited a runtime of 16.1 min, consequently rendering the runtime of the GRU model 41 % faster. In the final analysis, it becomes evident that the GRU model exhibits superior runtime performance compared to both the CNN and LSTM models, specifically when employing a batch size of 256 and utilising static features. In the context of RNN models, with a focus on runtime, GRU models were found to be superior in terms of efficiency compared to LSTM models. This stands in alignment with the findings of Yang et al. (2020), who reported that GRU was 29 % faster than LSTM when processing the identical dataset. However, as stated before, the examined models in this study not only exhibit disparities in terms of batch size but also encompass other architectural parameters such as the number of utilised epochs, hidden units and the window size (Table 6). These differences may result in altered computational efforts.
Apart from the different model architectures, the specific configuration of hyperparameters in each model yields varying computational effort. For example, an increase in window size results in a more extended sequence to process, thereby necessitating additional computational effort. In the context of the CNN models, the computational effort is contingent on the window size, feature maps, kernel size and quantity of input features. Models incorporating static features (+SF) possess 14 input features, whereas those without static features (−SF) contain only 3 dynamic features. In contrast, the computational effort of the LSTM and GRU models is determined by the units within the corresponding cell, the input feature size and the window size.
The observed increase in computational time for the GRU model, when running with a batch size of 256 and no static features, is mainly due to a significantly larger window size, which increased from 87 to 298. This expansion, in the absence of static features, requires a more extensive computational effort. In contrast, for CNN models employing a batch size of 2048, the pronounced augmentation in execution time is primarily induced by an increase in the quantity of feature maps, presenting a 2.3-fold increase. Generally, the marked prolongation in computational duration for CNN models incorporating static features, as opposed to those excluding them, can be elucidated by the incorporation of a considerably higher number of feature maps in the former. This enlargement is a direct consequence of the increased data volume processed by the models when supplemented with static features. Notably, CNN models utilising a batch size of 2048 manifest a reduction in window size, implying that the model may encounter challenges in generalising from extended input sequences due to potentially excessive variability among the samples within a batch.
For the LSTM models with a batch size of 2048, an 83 % increase in the number of hidden units when static features are introduced is the primary factor contributing to the substantial increase in runtime for this configuration. Notably, the GRU model with a batch size of 256 and static features exhibits the smallest window size (87) among all recurrent models and achieves the fastest runtime of the models incorporating static features, a result directly attributable to this reduced window size, while still maintaining commendable predictive performance.
Table 6. Selection of utilised hyperparameters for the employed CNN, LSTM and GRU models: a comparative examination of different feature scenarios, including scenarios with static features (+SF) and without static features (−SF), across two distinct batch sizes (256 and 2048).

The architectural differences between CNN models and recurrent models (LSTM and GRU) render direct comparisons of their hyperparameter configurations impracticable, with the exception of window size. As indicated in Table 6, the window sizes of CNN models are smaller than those observed in recurrent models, except for the GRU model employing a batch size of 256 and incorporating static features.
Moreover, an assessment of the best-performing models within each architecture (all configured with a batch size of 256 and incorporating static features) reveals that the aforementioned GRU model possesses the smallest window size (87), followed by the CNN (179) and the LSTM (232) models. The increased length of input sequences implies greater computational demands, which partly accounts for the elevated runtime observed in the specified CNN model despite its inherent capacity for parallel processing. As outlined in Sect. 2.5, this attribute is typical of CNN models, whereas the sequential-processing nature of LSTM and GRU models limits such parallelisation.
In conclusion, the comparative analysis suggests that the GRU model, particularly when utilising a batch size of 256 and incorporating static features, emerges as the optimal choice for hydrological applications prioritising computational efficiency alongside predictive performance. Furthermore, the differential impact of batch sizes and feature configurations on the runtime across CNN, GRU and LSTM models underscores the critical role of tailored hyperparameter optimisation in achieving computational efficiency without compromising model performance.
Given the observed favourable outcomes when employing a batch size of 256 with static features, subsequent analyses will focus exclusively on models adhering to this configuration.
Assessment of flow segment performance
To deepen the performance analysis, the recorded discharge data from all evaluated catchments, corresponding to the highest-performing model within each architectural category, were divided into quartiles. First, the discharge data for each catchment were sorted in ascending order. The sorted data were then divided into four quartiles, each containing 25 % of the sorted values for that catchment, thereby forming four distinct flow segments. Subsequently, for each segment, the KGE and PBIAS of the predicted discharge were calculated in relation to the observed values, as illustrated in Fig. 8.
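A minimal sketch of this segmentation and scoring procedure is given below. It is our reconstruction, not the authors' code: function and variable names are illustrative, and the PBIAS sign convention follows the text, with positive values indicating overestimation.

```python
import numpy as np

def kge(obs, sim):
    """Kling-Gupta efficiency (Gupta et al., 2009)."""
    r = np.corrcoef(obs, sim)[0, 1]        # timing/correlation
    alpha = np.std(sim) / np.std(obs)      # variability ratio
    beta = np.mean(sim) / np.mean(obs)     # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def pbias(obs, sim):
    """Percent bias; positive values indicate overestimation here."""
    return 100.0 * np.sum(sim - obs) / np.sum(obs)

def flow_segment_scores(obs, sim):
    """Sort observed discharge ascending, cut it into four equally sized
    segments (Q1 = lowest flows ... Q4 = highest flows) and score the
    co-indexed predictions within each segment."""
    order = np.argsort(obs)
    scores = {}
    for i, idx in enumerate(np.array_split(order, 4), start=1):
        scores[f"Q{i}"] = (kge(obs[idx], sim[idx]), pbias(obs[idx], sim[idx]))
    return scores
```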
Across all models, a noticeable increase in KGE is observed from the lowest to the highest flow segments, with the exception of Q2, which represents lower flow levels and records the lowest KGE values. Remarkably, a positive KGE is observed only for the highest flows. This implies that the models predominantly treat peak flow events as the critical signal to learn, regarding low flows as less informative or as noise to be suppressed.
This phenomenon may be attributed to a bias in the KGE towards elevated flows, which inadequately penalises inaccuracies in lower flow predictions. Specifically, the KGE comprises three components: the Pearson correlation coefficient r, the variability ratio α and the bias ratio β (Eq. 1). Because peak flows typically exhibit larger numerical values than lower flows, they tend to dominate the overall variance; slight improvements in capturing these high-flow events can thus yield relatively large gains in all three components, thereby improving the overall KGE score.
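For reference, the KGE as defined by Gupta et al. (2009), restated here from Eq. (1):

\mathrm{KGE} = 1 - \sqrt{(r-1)^{2} + (\alpha-1)^{2} + (\beta-1)^{2}}, \qquad \alpha = \sigma_{\mathrm{sim}}/\sigma_{\mathrm{obs}}, \qquad \beta = \mu_{\mathrm{sim}}/\mu_{\mathrm{obs}},

where σ and μ denote the standard deviation and mean of the simulated and observed discharge, respectively; KGE = 1 indicates a perfect fit.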
Consequently, forthcoming research should explore evaluation metrics that facilitate a more holistic optimisation approach. Regarding the highest flows, the KGE metrics exhibit close resemblance across models, with the CNN model leading slightly with a KGE of 0.69. Conversely, the LSTM model demonstrates superior efficacy in modelling Q1 and Q2 flow segments.

Figure 8 Comparative performance of CNN, LSTM and GRU models incorporating static features across different flow segments. The top row (a–c) displays the Kling–Gupta efficiency (KGE), and the bottom row (d–f) shows the percent bias (PBIAS) for the lowest flows (Q1), lower flows (Q2), higher flows (Q3) and highest flows (Q4). Each violin plot represents the distribution of model performance metrics for all evaluated catchments within each flow segment. The black dots indicate the mean values for each segment.
Turning to the PBIAS, the pattern of improved model performance with increasing flow magnitude, noted for the KGE, persists, as evidenced by the narrowing spread of the violin plots. Intriguingly, except for the Q4 segment, the PBIAS remains positive across all models and flow segments, indicating a general overestimation of the lowest to higher flows and a mild underestimation of peak flows. The latter may be attributed to the limitation described in Sect. 2.5.1, whereby combining a sigmoid activation function with a min–max scaler inherently caps predictions at the maximum discharge observed during training.
Notably, the CNN model's predictions for the lowest flows exhibit the most pronounced bias, particularly towards positive values, pointing to inadequate generalisation for this flow segment.
A further decomposition of the KGE is illustrated in Fig. 9, where each of the three KGE components (Pearson correlation coefficient r, variability α and bias β) is presented separately. These components offer insights into distinct aspects of model performance. The Pearson correlation coefficient r measures the strength and direction of the linear relationship between observed and simulated data: a value of 1 indicates perfect positive correlation, −1 perfect negative correlation and 0 no correlation. The variability α measures the model's ability to capture the observed variability: a value of 1 indicates that the simulated variability matches the observed variability, values greater than 1 indicate higher simulated variability and values less than 1 indicate lower variability. The bias term β indicates systematic overestimation or underestimation: a value of 1 means no bias, values greater than 1 indicate overestimation and values less than 1 indicate underestimation.
Figure 9 reveals that r is more consistent across Q1 to Q4 for the LSTM model, whereas the CNN and GRU models display a wider spread of r, extending below 0.25. This indicates that the LSTM model is better at matching the timing of low-flow predictions. A similar pattern is observed for α, where the LSTM and GRU models exhibit higher variability, particularly for the lowest flows (Q1). However, the GRU model has difficulty capturing the variability of lower and higher flows (Q2 and Q3), with values of 3.96 and 2.63, respectively, compared to the LSTM and CNN models.
The bias term (β) shows that the CNN model achieves the best score for the highest flows (Q4). Nevertheless, it also exhibits the largest bias for the lowest flows (Q1) among all models. Conversely, the LSTM model demonstrates superior performance for Q1 through Q3.
Overall, this analysis suggests that the LSTM model achieves favourable results across all KGE components. Appendix A presents the three best-performing and three worst-performing hydrographs of each model. Within the poorly performing hydrographs, it becomes evident that, while the timing of the flow events is mostly accurate, their magnitude is poorly captured and the base flow is often underestimated. This suggests that these catchments may exhibit hydrological behaviours different from those of the better-predicted catchments, indicating the need for more diverse catchments in the training dataset. Furthermore, Appendix A4 compares the simulated hydrographs of all three models for the same basins. Consistent performance trends are observed across all models, with either poor or high performance in the same basin. However, one plot exhibits mixed performance, where the LSTM and GRU models perform well while the CNN model performs poorly. Notably, this is the only validated catchment where such a strong discrepancy is observed.
In summary, the evaluation of flow segment performance has provided valuable insights into the performance distribution. While the CNN model showed superior average performance, as demonstrated within the preceding sections, the LSTM model exhibited a higher degree of consistent performance across all flow segments. Additionally, the recurrent models displayed enhanced generalisation capabilities for the lowest flow rates in each catchment.

Figure 9 Components of the Kling–Gupta efficiency (KGE) for the employed CNN, LSTM and GRU models with a batch size of 256 and incorporating static features, evaluated across four flow segments: lowest flows (Q1), lower flows (Q2), higher flows (Q3) and highest flows (Q4). From top to bottom, the rows represent the Pearson correlation coefficient (r), the variability ratio (α) and the bias (β). Each violin plot illustrates the distribution of these metrics for all evaluated catchments within each flow segment, with black dots indicating the mean values for each segment. The ideal value for all three metrics is 1, indicating perfect performance.
3.3 Model sensitivity
To elucidate the effect of the input features on discharge prediction, a sensitivity analysis was conducted. Each daily input feature was uniformly increased by 10 %, and the prediction was then executed again with the modified inputs. The newly predicted discharge values were averaged over both time and all catchments, yielding a single metric per feature. Variations in this mean discharge provide insights into the comparative importance of each evaluated feature within the model. The analysis focuses solely on dynamic features because of the limited number of evaluation catchments (35): with only 35 samples for the static features, the models lack sufficient variability in the input to interpret these features reliably.
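The following sketch outlines such a one-at-a-time perturbation. It is again our reconstruction under stated assumptions: a fitted Keras-style model exposing a .predict() method and an input array of shape (samples, window, features); the function name is illustrative.

```python
import numpy as np

def feature_sensitivity(model, X, feature_idx, factor=1.10):
    """Uniformly scale one dynamic input feature (e.g. by +10 %),
    re-run the prediction and return the percentage change in mean
    predicted discharge, averaged over all time steps and catchments."""
    baseline = model.predict(X).ravel()
    X_perturbed = X.copy()
    X_perturbed[..., feature_idx] *= factor   # perturb one feature everywhere
    perturbed = model.predict(X_perturbed).ravel()
    return 100.0 * (perturbed.mean() - baseline.mean()) / baseline.mean()
```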
The results of this analysis are shown in Fig. 10, representing the mean percentage change in discharge, calculated by averaging over all daily predictions and across all 35 catchments.
For the CNN model, precipitation exhibited the strongest positive impact, changing the predicted discharge by 11.1 % (Fig. 10a) and underscoring its pivotal role in driving the output of the CNN model. Increasing the daily soil temperature reduced discharge by 2 %, likely reflecting increasing atmospheric water losses with rising temperature through higher actual soil evaporation and plant transpiration. The daily evapotranspiration forcing showed a small positive impact of 0.4 %. The observation that discharge increases with daily evapotranspiration is seemingly counterintuitive. However, the daily evapotranspiration derived from Jehn et al. (2021) represents actual evapotranspiration, which can increase under wetter conditions and therefore also correlate positively with discharge.
Although this may offer a plausible explanation for the observed anomalous behaviour, it is unlikely within the context of this study: since all models share the same input features, the LSTM and GRU models should then exhibit similar behaviour, which is not observed (see Fig. 10).
Analogously to the CNN model, the LSTM model corroborated that precipitation exerts the most substantial positive impact on discharge, with an increase of 15 % (Fig. 10b). Conversely, the daily evapotranspiration sum impacted discharge negatively, producing a decrease of 2.2 %. Compared to the CNN model, the LSTM model displays a substantially higher sensitivity to precipitation, implying that this feature serves as the principal driving force for this model. The daily soil temperature feature caused a decrease of 3.3 %.
The sensitivity analysis of the GRU model parallels the findings for the LSTM model. Precipitation exerts a strong positive effect on discharge, with an increase of 13.3 % (Fig. 10c). Evapotranspiration impacted discharge negatively by 3.1 %, making the GRU model the most sensitive to this feature. Soil temperature reduced discharge by 3.3 %, identical to the LSTM model.

Figure 10 Sensitivity analysis of the CNN (a), LSTM (b) and GRU (c) models with static features and a batch size of 256. All features have been uniformly increased by 10 % to evaluate their impact on discharge prediction.
In summary, the GRU model's sensitivity analysis reveals a high degree of concordance with the LSTM model in terms of feature influences on discharge predictions. All daily input features of these two models exhibited expected behaviours, aligning with established hydrological principles. This indicates a robust understanding of the input features' influences by both models.
The similarity in effects across all input features suggests that GRU models are also adept at accurately discerning hydrological processes despite their simpler architecture compared to LSTM models. The CNN model exhibits counterintuitive results for the daily evapotranspiration feature, indicating potential limitations in handling this input, although certain static features may have had a greater influence on this model's performance.
Overall, the sensitivity analysis of the LSTM and GRU models revealed a more realistic representation for evapotranspiration compared to the CNN model. These findings emphasise the importance of considering various input parameters and their interactions in improving discharge prediction models for hydrological applications.
This study conducted a comparative evaluation of CNN, LSTM and GRU models for predicting daily discharge in ungauged basins across Hesse, Germany. All three deep-learning architectures exhibited significant predictive capabilities. Specifically, the CNN model yielded marginally higher accuracy (KGE = 0.8) compared to the recurrent models, effectively capturing local short-term rainfall–runoff dynamics. Conversely, the LSTM model (KGE = 0.78) demonstrated superior consistency across the entire flow spectrum, maintaining balanced performance from low to high flows rather than disproportionately excelling at peak events, as observed with CNN models. The GRU model (KGE = 0.77) provided a robust balance between computational efficiency and predictive accuracy. The minor performance gaps observed indicate that no single architecture dominates significantly in terms of predictive skill.
Consistent with the findings of Kratzert et al. (2019a), augmenting models with static catchment attributes improved prediction performance, underscoring the critical importance of integrating catchment-specific information into ungauged basin modelling. Additionally, models trained with smaller batch sizes yielded better KGE scores than those with larger batch sizes, suggesting that optimisation dynamics such as gradient noise and update frequency substantially influence generalisation performance. These results reinforce existing evidence that modern deep-learning methods achieve robust streamflow predictions even in data-scarce basins (Nabipour et al., 2020; Afzaal et al., 2019).
Evaluation across varying flow conditions further revealed that the model architecture substantially influenced the prediction accuracy of peak and low-flow events. The LSTM demonstrated superior generalisation under the lowest-flow conditions, indicating reduced systematic errors during extended dry spells. This capability can be attributed to the LSTM model's gated recurrent structure, which effectively captures long-term dependencies associated with baseflow and recession periods. Conversely, the CNN model, which employs fixed-size convolutional filters optimised for identifying short-term flow patterns, particularly sharp rises following precipitation events, exhibited limited capability in capturing slower hydrological processes such as evapotranspiration-driven drawdown.
Sensitivity analyses confirmed precipitation to be the primary discharge driver across all models. However, CNN models showed reduced sensitivity to daily evapotranspiration signals. This characteristic suggests that the CNN architecture may inadequately represent cumulative drying effects, potentially explaining its comparatively weaker performance during low-flow periods. These architectural distinctions highlight how internal model designs significantly affect learned hydrological behaviours. Recurrent networks inherently integrate temporal information, aiding the modelling of sustained processes, whereas convolution-based models may necessitate additional mechanisms or expanded receptive fields to achieve equivalent long-term awareness. Despite these nuances, CNN models still attained the highest aggregate accuracy (KGE), suggesting that accurate peak-flow predictions compensated for deficits in low-flow estimations. Consequently, alternative metrics focused specifically on low-flow performance might rank the LSTM ahead of the CNN.
Regarding computational efficiency, clear distinctions emerged. The GRU model trained significantly faster (over 40 % runtime reduction compared to the CNN model and nearly 60 % faster runtime than the LSTM model), attributable to its streamlined gating mechanism, with fewer parameters and simpler operations (Chung et al., 2014). CNN models, despite being marginally slower than GRU models, benefited from parallelisable convolutional operations and exhibited competitive runtimes coupled with the highest accuracy. In contrast, LSTM models' sequential processing and complex gating incurred greater computational demands (Goodfellow et al., 2016). Additionally, Ebtehaj and Bonakdari (2024) reported equivalent performance of LSTM and CNN models for high-precipitation events, yet observed that CNN models outperformed LSTM models for significant precipitation events at short lead times, thereby reinforcing our results.
Furthermore, our findings align with recent literature on data-driven streamflow forecasting. Oliveira et al. (2023) similarly reported superior CNN model performance relative to LSTM and multilayer perceptron models within calibrated basins. However, that result, obtained from a calibrated basin, did not guarantee broader generalisability. Our multi-basin study confirms CNN model efficacy even in ungauged basins, alongside consistently strong performances by LSTM and GRU models. The minor accuracy differences align with Farfán-Durán and Cea (2024), emphasising context-dependent model performance. For example, the GRU model excelled at very short lead times in one basin (Spain), whereas, in another basin, CNN, LSTM and GRU models performed comparably. Additionally, the computational efficiency advantages observed for GRU and CNN models corroborate prior studies, highlighting parallelism and simplified gating mechanisms as significant computational benefits. Nonetheless, GRU models' simplified gating may reduce performance relative to LSTM models, as Wegayehu and Muluneh (2023) demonstrated that LSTM models generally outperform GRU models regardless of the input data.
Certain design choices and limitations must be acknowledged. Both recurrent models (LSTM and GRU) constrained outputs to non-negative discharges within the training-data range using sigmoid activation and min–max normalisation. This constraint ensures physically plausible predictions but prevents extrapolation beyond the maximum flows observed during training; the resulting saturation effect may attenuate extreme flood peaks. For practical applications requiring accurate flood forecasting (primarily focusing on high discharge), alternative activation functions such as LeakyReLU, which allow unbounded outputs, may offer greater flexibility and should be considered in future model designs.
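The numerical consequence of this design can be sketched as follows (illustrative values only; the scaler range and the network outputs are hypothetical):

```python
import numpy as np

# Illustrative sketch: a sigmoid output unit combined with min-max
# scaling caps every prediction at the training-period maximum.
q_train_min, q_train_max = 0.0, 120.0      # hypothetical training range (m3 s-1)

def inverse_minmax(y_scaled):
    """Map the sigmoid output from (0, 1) back to discharge units."""
    return y_scaled * (q_train_max - q_train_min) + q_train_min

pre_activation = np.array([-2.0, 0.0, 4.0, 10.0])   # arbitrary network outputs
y_scaled = 1.0 / (1.0 + np.exp(-pre_activation))    # sigmoid in (0, 1)
print(inverse_minmax(y_scaled))   # approaches, but never exceeds, 120.0
```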
Furthermore, our analysis was confined to Hesse, Germany, potentially limiting generalisability to different hydro-climatic contexts such as arid or monsoon climates. Hybrid or ensemble models combining CNN and LSTM layers were outside the scope of this comparison.
Future research should explore loss functions better aligned with hydrological objectives and sequence length handling through longer sliding windows or emerging self-attention transformers (Lim et al., 2021). Investigating architectures that seamlessly fuse static and dynamic inputs via attention mechanisms or dedicated subnetworks could improve the use of catchment attributes and remote sensing data, thereby enhancing generalisation (Lim et al., 2021).
These insights serve as guidance for researchers utilising neural networks in hydrology and contribute to a comprehensive framework for evaluating algorithms. By systematically comparing CNN, LSTM and GRU models in multiple ungauged basins, this work bridges a critical gap in hydrological modelling literature and paves the way for more informed and effective application of artificial intelligence in hydrology. In summary, successful prediction in ungauged basins accentuates the potential of neural networks in advancing streamflow forecasting.
A1 Hydrographs of the CNN model with static features and batch size of 256
A1.1 Highest performance

Figure A1 Hydrograph at gauge no. 25840253 illustrating high performance of the CNN model, with observed discharge (blue) and predicted discharge (orange), evaluated using the Kling–Gupta efficiency (KGE).

Figure A2 Hydrograph at gauge no. 25840650 illustrating high performance of the CNN model, with observed discharge (blue) and predicted discharge (orange), evaluated using the Kling–Gupta efficiency (KGE).
A1.2 Lowest performance

Figure A4 Hydrograph at gauge no. 41510205 illustrating low performance of the CNN model, with observed discharge (blue) and predicted discharge (orange), evaluated using the Kling–Gupta efficiency (KGE).

Figure A5 Hydrograph at gauge no. 41860900 illustrating low performance of the CNN model, with observed discharge (blue) and predicted discharge (orange), evaluated using the Kling–Gupta efficiency (KGE).
A2 Hydrographs of the LSTM model with static features and batch size of 256
A2.1 Highest performance

Figure A7 Hydrograph at gauge no. 25840708 illustrating high performance of the LSTM model, with observed discharge (blue) and predicted discharge (orange), evaluated using the Kling–Gupta efficiency (KGE).

Figure A8 Hydrograph at gauge no. 25810558 illustrating high performance of the LSTM model, with observed discharge (blue) and predicted discharge (orange), evaluated using the Kling–Gupta efficiency (KGE).
A2.2 Lowest performance

Figure A10 Hydrograph at gauge no. 24880208 illustrating low performance of the LSTM model, with observed discharge (blue) and predicted discharge (orange), evaluated using the Kling–Gupta efficiency (KGE).

Figure A11 Hydrograph at gauge no. 41510205 illustrating low performance of the LSTM model, with observed discharge (blue) and predicted discharge (orange), evaluated using the Kling–Gupta efficiency (KGE).
A3 Hydrographs of the GRU model with static features and batch size of 256
A3.1 Highest performance

Figure A13 Hydrograph at gauge no. 25840708 illustrating high performance of the GRU model, with observed discharge (blue) and predicted discharge (orange), evaluated using the Kling–Gupta efficiency (KGE).

Figure A14 Hydrograph at gauge no. 25880305 illustrating high performance of the GRU model, with observed discharge (blue) and predicted discharge (orange), evaluated using the Kling–Gupta efficiency (KGE).
A3.2 Lowest performance

Figure A16 Hydrograph at gauge no. 44950055 illustrating low performance of the GRU model, with observed discharge (blue) and predicted discharge (orange), evaluated using the Kling–Gupta efficiency (KGE).

Figure A17 Hydrograph at gauge no. 24480695 illustrating low performance of the GRU model, with observed discharge (blue) and predicted discharge (orange), evaluated using the Kling–Gupta efficiency (KGE).
A4 Hydrograph comparison of the best-performing models with static features and batch size of 256
A4.1 Mixed performance
A4.2 High performance for all models
A4.3 Low performance of all models
The entire code, along with the datasets upon which this study relies, except for the discharge data, can be accessed publicly in the following repository: https://doi.org/10.5281/zenodo.17289802 (Weißenborn, 2025).
MW: writing (original draft, lead; review and editing, lead), formal analysis (lead). LB: writing (review and editing, supporting). TH: conceptualisation (lead), writing (review and editing, equal).
The contact author has declared that none of the authors has any competing interests.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.
AI-assisted technologies were used to improve the initial readability of this paper.
This paper was edited by Albrecht Weerts and reviewed by two anonymous referees.
Afzaal, H., Farooque, A. A., Abbas, F., Acharya, B., and Esau, T.: Groundwater Estimation from Major Physical Hydrology Components Using Artificial Neural Networks and Deep Learning, Water, 12, 5, https://doi.org/10.3390/w12010005, 2019.
Bai, S., Kolter, J. Z., and Koltun, V.: An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling, arXiv [preprint], arXiv:1803.01271, https://doi.org/10.48550/arXiv.1803.01271, 2018.
Bhattacharjee, R., Ghosh, D., and Mazumder, A.: A review on hyper-parameter optimisation by deep learning experiments, Journal of Mathematical Sciences & Computational Mathematics, 2, 532–541, https://doi.org/10.15864/jmscm.2407, 2021.
Blöschl, G.: Predictions in ungauged basins – where do we stand?, Proc. IAHS, 373, 57–60, https://doi.org/10.5194/piahs-373-57-2016, 2016.
Brunner, M. I., Slater, L., Tallaksen, L. M., and Clark, M.: Challenges in modeling and predicting floods and droughts: A review, WIREs Water, 8, e1520, https://doi.org/10.1002/wat2.1520, 2021.
Cerda, P. and Varoquaux, G.: Encoding High-Cardinality String Categorical Variables, IEEE Transactions on Knowledge and Data Engineering, 34, 1164–1176, https://doi.org/10.1109/TKDE.2020.2992529, 2022.
Cheng, M., Fang, F., Kinouchi, T., Navon, I., and Pain, C.: Long lead-time daily and monthly streamflow forecasting using machine learning methods, J. Hydrol., 590, 125376, https://doi.org/10.1016/j.jhydrol.2020.125376, 2020.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y.: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 1724–1734, https://doi.org/10.3115/v1/D14-1179, 2014.
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y.: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, arXiv [preprint], https://doi.org/10.48550/arXiv.1412.3555, 2014.
Cinkus, G., Mazzilli, N., Jourde, H., Wunsch, A., Liesch, T., Ravbar, N., Chen, Z., and Goldscheider, N.: When best is the enemy of good – critical evaluation of performance criteria in hydrological models, Hydrol. Earth Syst. Sci., 27, 2397–2411, https://doi.org/10.5194/hess-27-2397-2023, 2023.
Deng, H., Chen, W., and Huang, G.: Deep insight into daily runoff forecasting based on a CNN-LSTM model, Natural Hazards, 113, 1675–1696, https://doi.org/10.1007/s11069-022-05363-2, 2022.
Duan, J.: Financial system modeling using deep neural networks (DNNs) for effective risk assessment and prediction, Journal of the Franklin Institute, 356, 4716–4731, https://doi.org/10.1016/j.jfranklin.2019.01.046, 2019.
Ebtehaj, I. and Bonakdari, H.: CNN vs. LSTM: A Comparative Study of Hourly Precipitation Intensity Prediction as a Key Factor in Flood Forecasting Frameworks, Atmosphere, 15, 1082, https://doi.org/10.3390/atmos15091082, 2024.
Farfán-Durán, J. F. and Cea, L.: Streamflow forecasting with deep learning models: A side-by-side comparison in Northwest Spain, Earth Sci. Inf., 17, 5289–5315, https://doi.org/10.1007/s12145-024-01454-9, 2024.
Fukushima, K.: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics, 36, 193–202, https://doi.org/10.1007/BF00344251, 1980.
Gauch, M., Kratzert, F., Klotz, D., Nearing, G., Lin, J., and Hochreiter, S.: Rainfall–runoff prediction at multiple timescales with a single Long Short-Term Memory network, Hydrol. Earth Syst. Sci., 25, 2045–2062, https://doi.org/10.5194/hess-25-2045-2021, 2021.
Ghimire, S., Yaseen, Z. M., Farooque, A. A., Deo, R. C., Zhang, J., and Tao, X.: Streamflow prediction using an integrated methodology based on convolutional neural network and long short-term memory networks, Sci. Rep., 11, 17497, https://doi.org/10.1038/s41598-021-96751-4, 2021.
Goodfellow, I., Bengio, Y., and Courville, A.: Deep learning, Adaptive computation and machine learning, The MIT Press, Cambridge, Massachusetts, ISBN 978-0-262-03561-3, 2016.
Gupta, H. V., Kling, H., Yilmaz, K. K., and Martinez, G. F.: Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling, J. Hydrol., 377, 80–91, https://doi.org/10.1016/j.jhydrol.2009.08.003, 2009.
Haznedar, B., Kilinc, H. C., Ozkan, F., and Yurtsever, A.: Streamflow forecasting using a hybrid LSTM-PSO approach: the case of Seyhan Basin, Natural Hazards, 117, 681–701, https://doi.org/10.1007/s11069-023-05877-3, 2023.
Heitkamp, F., Ahrends, B., Evers, J., Steinicke, C., and Meesenburg, H.: Inference of forest soil nutrient regimes by integrating soil chemistry with fuzzy-logic: Regionwide application for stakeholders of Hesse, Germany, Geoderma Regional, 23, e00340, https://doi.org/10.1016/j.geodrs.2020.e00340, 2020.
Herbert, Z. C., Asghar, Z., and Oroza, C. A.: Long-term reservoir inflow forecasts: enhanced water supply and inflow volume accuracy using deep learning, J. Hydrol., 601, 126676, https://doi.org/10.1016/j.jhydrol.2021.126676, 2021.
Hochreiter, S. and Schmidhuber, J.: Long Short-Term Memory, Neural Computation, 9, 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735, 1997.
Hong, J., Lee, S., Bae, J. H., Lee, J., Park, W. J., Lee, D., Kim, J., and Lim, K. J.: Development and evaluation of the combined machine learning models for the prediction of dam inflow, Water, 12, 2927, https://doi.org/10.3390/w12102927, 2020.
Hong, J., Lee, S., Lee, G., Yang, D., Bae, J. H., Kim, J., Kim, K., and Lim, K. J.: Comparison of machine learning algorithms for discharge prediction of multipurpose dam, Water, 13, 3369, https://doi.org/10.3390/w13233369, 2021.
Houska, T., Kraft, P., Chamorro-Chavez, A., and Breuer, L.: Spotting model parameters using a ready-made python package, PLOS ONE, 10, e0145180, https://doi.org/10.1371/journal.pone.0145180, 2015.
Jehn, F. U.: zutn/Simple-Catchments-Hesse: Updated version for reviewed paper in WRR, Zenodo [code], https://doi.org/10.5281/ZENODO.4008963, 2020.
Jehn, F. U., Breuer, L., Kraft, P., Bestian, K., and Houska, T.: Simple catchments and where to find them: the storage-discharge relationship as a proxy for catchment complexity, Front. Water, 3, 631651, https://doi.org/10.3389/frwa.2021.631651, 2021.
Kandel, I. and Castelli, M.: The effect of batch size on the generalizability of the convolutional neural networks on a histopathology dataset, ICT Express, 6, 312–315, https://doi.org/10.1016/j.icte.2020.04.010, 2020.
Kingma, D. P. and Ba, J.: Adam: a method for stochastic optimization, arXiv [preprint], https://doi.org/10.48550/arXiv.1412.6980, 2017.
Knoben, W. J. M., Freer, J. E., and Woods, R. A.: Technical note: Inherent benchmark or not? Comparing Nash–Sutcliffe and Kling–Gupta efficiency scores, Hydrol. Earth Syst. Sci., 23, 4323–4331, https://doi.org/10.5194/hess-23-4323-2019, 2019.
Kratzert, F., Klotz, D., Herrnegger, M., Sampson, A. K., Hochreiter, S., and Nearing, G. S.: Toward Improved Predictions in Ungauged Basins: Exploiting the Power of Machine Learning, Water Resour. Res., 55, 11344–11354, https://doi.org/10.1029/2019WR026065, 2019a.
Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., and Nearing, G.: Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets, Hydrol. Earth Syst. Sci., 23, 5089–5110, https://doi.org/10.5194/hess-23-5089-2019, 2019b.
Krizhevsky, A., Sutskever, I., and Hinton, G. E.: ImageNet classification with deep convolutional neural networks, Communications of the ACM, 60, 84–90, https://doi.org/10.1145/3065386, 2017.
Le, X.-H., Nguyen, D.-H., Jung, S., Yeon, M., and Lee, G.: Comparison of deep learning techniques for river streamflow forecasting, IEEE Access, 9, 71805–71820, https://doi.org/10.1109/ACCESS.2021.3077703, 2021.
Le, X.-H., Nguyen, D. H., Jung, S., and Lee, G.: Deep neural network-based discharge prediction for upstream hydrological stations: a comparative study, Earth Sci. Inform., 16, 3113–3124, https://doi.org/10.1007/s12145-023-01082-9, 2023.
LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, L.: Handwritten digit recognition with a back-propagation network, in: Advances in Neural Information Processing Systems, edited by: Touretzky, D., vol. 2, Morgan-Kaufmann, https://proceedings.neurips.cc/paper_files/paper/1989/file/53c3bce66e43be4f209556518c2fcb54-Paper.pdf (last access: 16 June 2024), 1989.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P.: Gradient-based learning applied to document recognition, Proc. IEEE, 86, 2278–2324, https://doi.org/10.1109/5.726791, 1998.
Li, J., Qian, K., Liu, Y., Yan, W., Yang, X., Luo, G., and Ma, X.: LSTM-Based Model for Predicting Inland River Runoff in Arid Region: A Case Study on Yarkant River, Northwest China, Water, 14, 1745, https://doi.org/10.3390/w14111745, 2022.
Liano, K.: Robust error measure for supervised neural network learning with outliers, IEEE Transactions on Neural Networks, 7, 246–250, https://doi.org/10.1109/72.478411, 1996.
Lim, B., Arık, S. Ö., Loeff, N., and Pfister, T.: Temporal Fusion Transformers for interpretable multi-horizon time series forecasting, Int. J. Forecast., 37, 1748–1764, https://doi.org/10.1016/j.ijforecast.2021.03.012, 2021.
Lin, C., Wu, D., Liu, H., Xia, X., and Bhattarai, N.: Factor Identification and Prediction for Teen Driver Crash Severity Using Machine Learning: A Case Study, Appl. Sci., 10, 1675, https://doi.org/10.3390/app10051675, 2020.
Masters, D. and Luschi, C.: Revisiting Small Batch Training for Deep Neural Networks, arXiv [preprint], https://doi.org/10.48550/arXiv.1804.07612, 2018.
McKay, M. D., Beckman, R. J., and Conover, W. J.: Comparison of three methods for selecting values of input variables in the analysis of output from a computer code, Technometrics, 21, 239–245, https://doi.org/10.1080/00401706.1979.10489755, 1979.
Moriasi, D. N., Arnold, J. G., Liew, M. W. V., Bingner, R. L., Harmel, R. D., and Veith, T. L.: Model evaluation guidelines for systematic quantification of accuracy in watershed simulations, Transactions of the ASABE, 50, 885–900, https://doi.org/10.13031/2013.23153, 2007.
Nabipour, N., Dehghani, M., Mosavi, A., and Shamshirband, S.: Short-term hydrological drought forecasting based on different nature-inspired optimization algorithms hybridized with artificial neural networks, IEEE Access, 8, 15210–15222, https://doi.org/10.1109/ACCESS.2020.2964584, 2020.
Nguyen, Q., Shrestha, S., Ghimire, S., Sundaram, S. M., Xue, W., Virdis, S. G. P., and Maharjan, M.: Application of machine learning models in assessing the hydrological changes under climate change in the transboundary 3S River Basin, Journal of Water and Climate Change, 14, 2902–2918, https://doi.org/10.2166/wcc.2023.313, 2023a.
Nguyen, T.-T.-H., Vu, D.-Q., Mai, S. T., and Dang, T. D.: Streamflow Prediction in the Mekong River Basin Using Deep Neural Networks, IEEE Access, 11, 97930–97943, https://doi.org/10.1109/ACCESS.2023.3301153, 2023b.
Oliveira, A. R., Ramos, T. B., and Neves, R.: Streamflow estimation in a mediterranean watershed using neural network models: a detailed description of the implementation and optimization, Water, 15, 947, https://doi.org/10.3390/w15050947, 2023.
Onyutha, C.: A hydrological model skill score and revised R-squared, Hydrol. Res., 53, 51–64, https://doi.org/10.2166/nh.2021.071, 2022.
Ozaki, S., Ooka, R., and Ikeda, S.: Energy demand prediction with machine learning supported by auto-tuning: a case study, J. Phys. Conf. Ser., 2069, 012143, https://doi.org/10.1088/1742-6596/2069/1/012143, 2021.
Potdar, K., Pardawala, T. S., and Pai, C. D.: A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers, International Journal of Computer Applications, 175, 7–9, https://doi.org/10.5120/ijca2017915495, 2017.
Radiuk, P. M.: Impact of Training Set Batch Size on the Performance of Convolutional Neural Networks for Diverse Datasets, Information Technology and Management Science, 20, https://doi.org/10.1515/itms-2017-0003, 2017.
Ramachandran, P., Agarwal, S., Mondal, A., Shah, A., and Gupta, D.: S++: A Fast and Deployable Secure-Computation Framework for Privacy-Preserving Neural Network Training, arXiv [preprint], https://doi.org/10.48550/ARXIV.2101.12078, 2021.
Santos, L., Thirel, G., and Perrin, C.: Technical note: Pitfalls in using log-transformed flows within the KGE criterion, Hydrol. Earth Syst. Sci., 22, 4583–4591, https://doi.org/10.5194/hess-22-4583-2018, 2018.
Sharma, P. and Machiwal, D.: Streamflow forecasting, in: Advances in Streamflow Forecasting, 1–50, Elsevier, ISBN 978-0-12-820673-7, https://doi.org/10.1016/B978-0-12-820673-7.00013-5, 2021.
Shekhar, S., Bansode, A., and Salim, A.: A Comparative study of Hyper-Parameter Optimization Tools, arXiv [preprint], https://doi.org/10.48550/ARXIV.2201.06433, 2022.
Siqueira, V. A., Collischonn, W., Fan, F. M., and Chou, S. C.: Ensemble flood forecasting based on operational forecasts of the regional Eta EPS in the Taquari-Antas basin, RBRH, 21, 587–602, https://doi.org/10.1590/2318-0331.011616004, 2016.
Sivapalan, M., Takeuchi, K., Franks, S. W., Gupta, V. K., Karambiri, H., Lakshmi, V., Liang, X., McDonnell, J. J., Mendiondo, E. M., O'Connell, P. E., Oki, T., Pomeroy, J. W., Schertzer, D., Uhlenbrook, S., and Zehe, E.: IAHS Decade on Predictions in Ungauged Basins (PUB), 2003–2012: Shaping an exciting future for the hydrological sciences, Hydrol. Sci. J., 48, 857–880, https://doi.org/10.1623/hysj.48.6.857.51421, 2003.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, J. Mach. Learn. Res., 15, 1929–1958, 2014.
Tabari, H.: Climate change impact on flood and extreme precipitation increases with water availability, Sci. Rep., 10, 13768, https://doi.org/10.1038/s41598-020-70816-2, 2020.
Ul Haq, I., Gondal, I., Vamplew, P., and Brown, S.: Categorical Features Transformation with Compact One-Hot Encoder for Fraud Detection in Distributed Environment, in: Data Mining, Communications in Computer and Information Science, edited by: Islam, R., Koh, Y. S., Zhao, Y., Warwick, G., Stirling, D., Li, C.-T., and Islam, Z., vol. 996, 69–80, Springer Singapore, Singapore, ISBN 9789811366604 9789811366611, https://doi.org/10.1007/978-981-13-6661-1_6, 2019.
Vatanchi, S. M., Etemadfard, H., Maghrebi, M. F., and Shad, R.: A Comparative Study on Forecasting of Long-term Daily Streamflow using ANN, ANFIS, BiLSTM and CNN-GRU-LSTM, Water Resour. Manage., 37, 4769–4785, https://doi.org/10.1007/s11269-023-03579-w, 2023.
Wang, Q., Ma, Y., Zhao, K., and Tian, Y.: A Comprehensive Survey of Loss Functions in Machine Learning, Annals of Data Science, 9, 187–212, https://doi.org/10.1007/s40745-020-00253-5, 2022.
Wegayehu, E. B. and Muluneh, F. B.: Super ensemble based streamflow simulation using multi-source remote sensing and ground gauged rainfall data fusion, Heliyon, 9, e17982, https://doi.org/10.1016/j.heliyon.2023.e17982, 2023.
Weißenborn, M.: Neural networks in catchment hydrology: A comparative study of different algorithms in an ensemble of ungauged basins in Germany (1.0), Zenodo [code and data set], https://doi.org/10.5281/zenodo.17289802, 2025.
Xu, Z., Dai, A. M., Kemp, J., and Metz, L.: Learning an Adaptive Learning Rate Schedule, arXiv [preprint], https://doi.org/10.48550/ARXIV.1909.09712, 2019.
Yang, S., Yu, X., and Zhou, Y.: LSTM and GRU Neural Network Performance Comparison Study: Taking Yelp Review Dataset as an Example, in: 2020 International Workshop on Electronic Communication and Artificial Intelligence (IWECAI), 98–101, IEEE, Shanghai, China, ISBN 978-1-72818-149-3, https://doi.org/10.1109/IWECAI50956.2020.00027, 2020.
Zeiler, M. D.: ADADELTA: An Adaptive Learning Rate Method, arXiv [preprint], https://doi.org/10.48550/arXiv.1212.5701, 2012.