Parsimonious statistical learning models for low flow estimation

Statistical learning methods offer a promising approach for low flow regionalization. We examine seven statistical learning models (lasso, linear and non-linear model based boosting, sparse partial least squares, principal component regression, random forest, and support vector machine regression) for the prediction of winter and summer low flow based on a hydrologically diverse dataset of 260 catchments in Austria. In order to produce sparse models, we adapt the recursive feature elimination for variable preselection and propose to use three different variable ranking methods (conditional forest, lasso and linear model based boosting) for each of the prediction models. Results are evaluated for the low flow characteristic Q95 (Pr(Q > Q95) = 0.95) standardized by catchment area using a repeated nested cross validation scheme. We found a generally high prediction accuracy for winter (R²_CV of 0.66 to 0.7) and summer (R²_CV of 0.83 to 0.86). The models perform similarly to or slightly better than a Top-kriging model that constitutes the current benchmark for the study area. The best performing models are support vector machine regression (winter) and non-linear model based boosting (summer), but linear models exhibit similar prediction accuracy. The use of variable preselection can significantly reduce the complexity of all models with only a small loss of performance. The learning models so obtained are more parsimonious, thus easier to interpret and more robust when predicting at ungauged sites. A direct comparison of linear and non-linear models reveals that non-linear processes can be sufficiently captured by linear learning models, so there is no need to use more complex models or to add non-linear effects.
When performing low flow regionalization in a seasonal climate, the temporal stratification into summer and winter low flows was shown to increase the predictive performance of all learning models, offering an alternative to catchment grouping that is recommended otherwise. Model performance is evaluated by a repeated nested cross validation (CV) scheme, which provides a confident assessment of how well the models perform at ungauged sites.

is a fast selection method, whose main downside is that it is restricted to the underlying model. Filter based methods, whose main advantage is usually computation time, may suffer from weaker predictive performance, as the filtering has no link to the final prediction model (Kuhn and Johnson, 2019; Guyon and Elisseeff, 2003). In contrast, wrapper methods have a higher computational burden, and greedy search algorithms such as RFE may only find a local minimum, especially if interactions are present (Kuhn and Johnson, 2019). Like any variable selection method, RFE can suffer from a bias in the selection procedure if poor validation strategies are chosen (Ambroise and McLachlan, 2002). Nevertheless, RFE can be an efficient technique to substantially reduce the predictor set (Kuhn and Johnson, 2019).
Although there appears to be a general consensus among hydrologists that parsimonious models offer a number of advantages over more complex models, including better parameter interpretability and robustness, surprisingly little effort has been undertaken to assess the merits of variable selection for statistical low flow regionalization. This is especially the case for statistical learning methods, which generally allow for higher complexity than classical regionalization approaches. Apart from stepwise regression procedures (e.g. Laaha and Blöschl, 2007; Kroll and Song, 2013), Tyralis et al. (2021) tested the inherent variable selection of the boosting algorithm and found the number of selected variables to depend on the runoff characteristic to be predicted. Another approach, which at least uses inherent variable rankings for predictor variables, is to employ variable importance as the decrease-in-accuracy criterion of RF (Worland et al., 2018). They additionally used partial dependence plots, which they found beneficial for analysing relationships between predictors and the response. However, these approaches can be misleading when variables in boosting models are falsely selected (Meinshausen and Bühlmann, 2010; Hofner et al., 2015), or variables in RF are ranked mistakenly high (Strobl et al., 2007). The most comprehensive approach was used by Ferreira et al. (2021), who employed the RFE for three different learning methods (OLS, Earth, RF), but did not include other prospective learning methods. While all of these studies found variable selection or calculation of variable importance to be a crucial step, a broader assessment is missing that sheds light on the value of variable selection for different statistical learning approaches.
Therefore, we propose to use RFE for our variable selection and suggest three different approaches for computation of the variable ranking to disconnect the variable ranking and the final prediction model.
In this paper we perform a comparative assessment of seven statistical learning models for a comprehensive Austrian data set covering 260 stations. With our study, we specifically address the lack of research comparing these methods in a strongly seasonal climate with summer and winter low flow regimes. The following research questions will be addressed: (i) How well do statistical learning models perform as compared to established models (and which of the methods perform best)? (ii) What is the effect of different variable preselection methods on the performance of these models? (iii) What is the relative value of non-linear learning models compared to linear ones? (iv) Which variables can be identified as the most important drivers of low flow for Austria? Model performance is evaluated by a repeated nested cross validation (CV) scheme, which provides a confident assessment of how well the models perform at ungauged sites.
Our study area consists of 260 gauging stations in Austria (Fig. 1). Austria can be described as physiographically and hydrologically diverse and is therefore a suitable testbed for regionalisation models. The altitude of gauges ranges from 143 to 1891 m a.s.l. Runoff data are available at the Austrian Hydrological Service (ehyd) and catchment characteristics were available from previous studies (e.g. Laaha and Blöschl, 2006). Low flow and catchment characteristics are both based on the total upstream catchment of each gauging station. We calculated Q95, where Q95 (Pr(Q > Q95) = 0.95) is the flow that is exceeded on 95 % of all days. Low flow in Austria can be separated into two major seasonal regimes, with low flows in the Alpine region occurring mainly in the winter half-year and in the lowlands mainly in the summer half-year. To account for these different processes, we calculated Q95 for the summer season (May to October) and for the winter season (November to April), and both low flows (winter and summer) are analyzed for the full study domain. Summer and winter Q95 were subsequently standardized by the catchment area. The resulting specific low flow discharges q95 (l s⁻¹ km⁻²) were considered in the further analyses. For the study area, the average winter low flow is 6.0 l s⁻¹ km⁻², which is considerably lower than the summer low flow of 8.9 l s⁻¹ km⁻² on average. Figure 2 shows that summer q95 tends to have more near-zero values than winter q95, and summer low flow additionally has a higher variation (standard deviation of 3.2 l s⁻¹ km⁻² in winter and 6.7 l s⁻¹ km⁻² in summer). Summer q95 was transformed by the square root transformation to reach a symmetric distribution. After model fitting, the predictions were back-transformed for performance evaluation.
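The derivation of the seasonal characteristic can be illustrated with a short sketch (our own toy code, not the study's processing chain; the simple empirical quantile rule and the toy series are assumptions):

```python
import math

def q95(flows):
    """Q95: the flow exceeded on 95 % of days, i.e. the 5 % quantile of the series
    (simple empirical quantile; the study may use a different estimator)."""
    s = sorted(flows)
    idx = max(0, int(round(0.05 * len(s))) - 1)
    return s[idx]

def specific_q95(flows_l_per_s, area_km2):
    """Standardize Q95 by catchment area -> specific low flow q95 in l s^-1 km^-2."""
    return q95(flows_l_per_s) / area_km2

# toy series: 100 days with flows 1..100 l/s; the value 5 is exceeded on 95 days
flows = list(range(1, 101))
assert q95(flows) == 5
assert specific_q95(flows, area_km2=2.0) == 2.5

# summer q95 is additionally square-root transformed before model fitting
summer_q95_transformed = math.sqrt(specific_q95(flows, area_km2=2.0))
```

In practice the series would first be split into the winter (November to April) and summer (May to October) half-years before applying the same computation to each season.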

Catchment characteristics
We use a set of 87 covariables as possible predictors, some of which are highly correlated. These covariables can be separated into catchment and climate characteristics. The catchment characteristics used in this study are fully described in e.g. Blöschl (2005, 2006). They consist of nine land use categories, nine geological categories and information about catchment altitude, stream network density and steepness of the slope in the catchment. An overview is given in Table 1.

Climate characteristics
The calculation of the climate characteristics is based on the SPARTACUS dataset of daily precipitation (Hiebl and Frei, 2018) and daily minimum and maximum temperature (Hiebl and Frei, 2016). Data are available from 1961 to 2018 at a spatial resolution of 1 × 1 km. Additionally, we retrieved the HISTALP dataset of the fraction of solid precipitation (Efthymiadis et al., 2006; Chimani et al., 2011), which has a coarser spatial resolution of 5 × 5 min and a temporal range from 1801 to 2014.

Methods
This section is divided into two parts. The first part considers the seven statistical learning models used for prediction of summer and winter q95. The second part gives a short description of the RFE algorithm and the proposed variable ranking methods. Additionally, there will be an overview of our repeated nested cross validation scheme.

We considered seven statistical learning models that can be structured as follows. Two prediction models use dimension reduction: (i) principal component regression (PCR) and (ii) sparse partial least squares (sPLS). Additionally, we used two linear models that possess an inherent variable selection method: (iii) the LASSO and (iv) a linear model based boosting approach (GLM). If simple linear terms are not sufficient, we can extend the GLM model by non-linear smoothing functions.
All models can be considered as regression models where the response variable Y is a vector of length N (N = 260) catchment observations, which can either be summer q95 or winter q95. The predictor matrix X is a N × p matrix with elements x_ij representing the values of p = 87 numeric predictors for the i-th catchment.

LASSO
The LASSO was originally introduced by Tibshirani (1996), where the regression coefficients β_lasso are defined as:

β̂_lasso = argmin_β { Σ_{i=1}^{N} (y_i − β_0 − Σ_{j=1}^{p} x_ij β_j)² + λ Σ_{j=1}^{p} |β_j| }
with β_0 as the intercept and β_j as the regression coefficients. The LASSO performs a penalized model optimization known as L1 regularization that reduces parameters and shrinks the model. A tuning parameter λ controls the strength of the penalty and thus the parsimony of the model. Setting λ to zero results in the ordinary least squares regression estimates, whereas large values of λ lead to a simple intercept model. In between these limits, λ performs a continuous subset selection (Hastie et al., 2009). We use the glmnet package in R for computation (Simon et al., 2011), where the coefficients are estimated by cyclical coordinate descent (Friedman et al., 2010). An optimal solution for λ is chosen by 10-fold cross validation, where we choose λ by the 1-standard-error rule. We prefer this over using the λ with minimum error, as it results in sparser and more robust models in our case. The lasso approach can handle highly correlated data: coefficients can be shrunk to zero, so that correlated variables do not enter the final model (Friedman et al., 2010).
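A minimal sketch of cyclical coordinate descent with soft-thresholding (illustrative only; glmnet adds warm starts along a λ path, covariance updates and the CV machinery):

```python
def soft_threshold(z, lam):
    """S(z, lambda): the lasso shrinkage operator."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclical coordinate descent for standardized predictors
    (columns with mean 0 and mean of x_j^2 equal to 1)."""
    n, p = len(X), len(X[0])
    beta0 = sum(y) / n          # intercept equals mean(y) for centered X
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # correlation of predictor j with the partial residual excluding j
            rho = sum(
                X[i][j] * (y[i] - beta0 - sum(X[i][k] * beta[k]
                                              for k in range(p) if k != j))
                for i in range(n)
            ) / n
            beta[j] = soft_threshold(rho, lam)
    return beta0, beta

# a moderate penalty shrinks the coefficient, a large one sets it to zero
b0, b = lasso_cd([[-1.0], [1.0]], [0.0, 2.0], lam=0.5)
assert b0 == 1.0 and abs(b[0] - 0.5) < 1e-9
b0, b = lasso_cd([[-1.0], [1.0]], [0.0, 2.0], lam=2.0)
assert b[0] == 0.0
```

The shrink-to-zero behavior in the last line is exactly what produces the inherent variable selection described above.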
Principal component regression (PCR)

PCR regresses the response on the first M principal components z_m of X, with regression coefficients θ_m = Cov(z_m, y)/Var(z_m). As for the LASSO, X has to be standardized in PCR before estimating the regression coefficients, as the principal components depend on the scaling of the initial variables. The number of principal components M is optimized by a 10-fold cross validation and the regression coefficients are estimated by ordinary least squares. We fit our PCR model using the pls package in R (Mevik et al., 2020).
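The regression step can be sketched as follows (assuming the principal-component scores Z have already been computed, e.g. by an SVD of the standardized X; the helper names are our own):

```python
def pcr_coefficients(Z, y):
    """Given centered, mutually orthogonal principal-component scores Z (n x M)
    and a centered response y, the PCR coefficients are univariate:
    theta_m = Cov(z_m, y) / Var(z_m) = sum(z_m * y) / sum(z_m^2)."""
    n, M = len(Z), len(Z[0])
    theta = []
    for m in range(M):
        z = [Z[i][m] for i in range(n)]
        theta.append(sum(zi * yi for zi, yi in zip(z, y)) /
                     sum(zi * zi for zi in z))
    return theta

def pcr_predict(Z, ybar, theta):
    """PCR fit: ybar + sum_m theta_m * z_m."""
    return [ybar + sum(t * zi for t, zi in zip(theta, row)) for row in Z]

# two orthogonal, centered score columns and a centered response
Z = [[1, 0], [-1, 0], [0, 1], [0, -1]]
y = [2, -2, 3, -3]
theta = pcr_coefficients(Z, y)
assert theta == [2.0, 3.0]
assert pcr_predict(Z, 0.0, theta) == [2.0, -2.0, 3.0, -3.0]
```

Because the components are orthogonal, each θ_m can be estimated independently, which is why choosing M directly controls model complexity.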

Sparse partial least squares regression (sPLS)
In addition to LASSO and PCR we use a third dimension reduction method, partial least squares regression (PLS). PLS uses linear combinations of X for the regression of Y, but these linear combinations are now constructed in dependence on Y (Hastie et al., 2009). This overcomes the drawback of PCR, which cannot guarantee that the first principal components of X are the most suited to predict Y. PLS, originally developed by Wold (1966), is an iterative process that starts (i) with centering the response variable (U_1) and the predictor variables (V_1j). Next (ii), p univariate regression models are constructed by regressing U_1 against each centered predictor variable V_1j, which gives p regression coefficients. These regression coefficients are then used to compute the first PLS component, which is the weighted average (iii) defined as:

T_1 = Σ_{j=1}^{p} w_j b_j V_1j / Σ_{j=1}^{p} w_j
where b_j are the univariate regression coefficients defined as b_j = Cov(V_1j, U_1)/Var(V_1j), and the weights are w_j = Var(V_1j).
In a next step (iv), U_2 is estimated by computing the residuals of the regression model U_1 ∼ T_1. Furthermore, the predictor matrix V_2j is updated by the residuals of the models V_1j ∼ T_1, where each predictor variable is regressed against the first PLS component. This process is repeated until M PLS components are extracted (de Jong, 1993). As PLS yields a high variability in the performance evaluation, we used an adapted PLS procedure called sparse PLS (sPLS). This method was introduced by Chun and Keleş (2010) and includes a variable selection, as an L1 penalty is added to the calculation of the PLS components. The model is tuned by 10-fold CV using the spls package in R (Chung et al., 2019).
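Steps (i)-(iii) and the deflation step (iv) for the first component can be sketched as follows (our own minimal version; the L1 penalty of sPLS is not shown):

```python
def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def first_pls_component(X_cols, y):
    """First PLS component as the weighted average of the centered predictors:
    T1 = sum_j w_j * b_j * V1j / sum_j w_j, with b_j = Cov(V1j, U1)/Var(V1j)
    and w_j = Var(V1j), so that w_j * b_j = Cov(V1j, U1)."""
    n = len(y)
    U1 = center(y)                              # step (i): center the response
    V1 = [center(col) for col in X_cols]        # step (i): center the predictors
    cov = [sum(v * u for v, u in zip(col, U1)) / n for col in V1]  # w_j * b_j
    var = [sum(v * v for v in col) / n for col in V1]              # w_j
    wsum = sum(var)
    return [sum(cov[j] * V1[j][i] for j in range(len(V1))) / wsum
            for i in range(n)]

def deflate(col, T1):
    """Step (iv): replace a centered variable by its residuals after
    regression on T1, before the next component is extracted."""
    b = sum(c * t for c, t in zip(col, T1)) / sum(t * t for t in T1)
    return [c - b * t for c, t in zip(col, T1)]

# one predictor perfectly aligned with y: T1 is the centered predictor itself,
# and deflation leaves nothing to explain
T1 = first_pls_component([[1, 2, 3]], [1, 2, 3])
assert T1 == [-1.0, 0.0, 1.0]
assert deflate([-1.0, 0.0, 1.0], T1) == [0.0, 0.0, 0.0]
```

Repeating the pair (component extraction, deflation) M times yields the M PLS components described in the text.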

Linear and non-linear model based boosting
In this section the model based boosting algorithm is presented, which is used for fitting a simple linear model (GLM) and a generalized additive model (GAM). Boosting refers to an ensemble learning approach that converts a set of weak models, termed learners, into a strong model with better model fit. A current approach is functional gradient descent boosting, a stage-wise, additive approach, which improves a fitted model by adding, at each step, a new learner that reduces the model errors. When predictors X are entered as separate learners, so that the prediction function f is an additive estimate based on simple linear terms f_j(x_j) = β_j x_j, the approach allows one to obtain an inherent variable selection and can penalize regression coefficients (Mayr and Hofner, 2018). In addition, model based boosting can deal with multicollinearity and can handle e.g. linear, non-linear, spatial or random effects (Hofner et al., 2014).
Model based boosting, as applied in this study, aims to minimize an empirical risk based on the so-called loss function ρ(y, f) characterizing the inadequacy of the fitted model. For regression problems including GLM and GAM we use the squared error loss function, which results in a stage-wise least-squares minimization of the residuals. The boosting algorithm is an iterative process with the following steps (Bühlmann and Hothorn, 2007; Mayr and Hofner, 2018; Melcher et al., 2017): 1. In a first step, all base learners are defined. A base learner can be e.g. a linear, non-linear, spatial or random effect.
The two models used in this study incorporate linear base learners for the linear model (GLM), and linear and non-linear effects for the GAM model. As shown by initial analysis, spatial effects or higher order interaction effects did not improve the prediction performance, hence they were discarded from the analysis. Non-linear effects are modeled as P-splines (Schmid and Hothorn, 2008), which are decomposed into an unpenalized linear base learner and a penalized non-linear base learner, each with 1 degree of freedom. The non-linear base learner is centered by subtracting the unpenalized linear part. This approach is proposed by Kneib et al. (2009) and Fahrmeir et al. (2004) and offers the possibility to spot whether a predictor requires a non-linear effect at all. 2. In the first iteration, the counter m, which is the number of boosting steps, is set to 0 and the first function estimate f^[0] is determined by an offset, which is the mean of the response for our purpose (f^[0] := ȳ). 3. The following steps are now repeated until the maximum number of boosting steps is reached, which was fixed to 1000 in this study: - The tuning parameter m is increased by 1.
- The negative gradient −dρ/df is computed and evaluated at the function estimate of the previous iteration f^[m−1], resulting in the negative gradient vector u^[m].

- Each base learner is now fitted by univariate regression against u^[m] and the best fitting base learner ĝ^[m] is selected. - The function estimate is updated as f^[m] = f^[m−1] + ν · ĝ^[m], where ν is a step length between 0 and 1; if ν is sufficiently low, the risk of finding only a local minimum is reduced. Therefore, ν was set to 0.1 in this study. In each boosting step (m > 0) only one base learner is selected, and it can be chosen again in later iterations. The number of boosting steps is optimized by a 10-fold CV. Although studies indicate that repeated CV would yield more robust results (Seibold et al., 2018), due to computational costs and the low risk of overfitting, 10-fold CV seemed sufficient. The model boosting was performed using the mboost package in R (Hothorn et al., 2021).
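The iteration above can be sketched for simple linear base learners (our own minimal L2 boosting; mboost additionally offers P-spline learners and CV-based selection of the stopping iteration):

```python
def boost(X_cols, y, n_steps=1000, nu=0.1):
    """Componentwise L2 boosting sketch: squared error loss, simple linear
    base learners beta_j * x_j (predictors assumed centered)."""
    n = len(y)
    f = [sum(y) / n] * n                       # step 2: offset f[0] = mean(y)
    coef = {}                                  # selected learners -> inherent selection
    for _ in range(n_steps):                   # step 3
        u = [yi - fi for yi, fi in zip(y, f)]  # negative gradient = residuals
        best = None
        for j, col in enumerate(X_cols):       # fit every base learner to u
            b = sum(x * r for x, r in zip(col, u)) / sum(x * x for x in col)
            sse = sum((r - b * x) ** 2 for x, r in zip(col, u))
            if best is None or sse < best[0]:
                best = (sse, j, b)
        _, j, b = best                         # best fitting base learner
        coef[j] = coef.get(j, 0.0) + nu * b    # update damped by step length nu
        f = [fi + nu * b * x for fi, x in zip(f, X_cols[j])]
    return f, coef

# one centered predictor perfectly explaining y: the accumulated coefficient
# converges towards the OLS value 1 and the fit towards y
f, coef = boost([[-1.0, 0.0, 1.0]], [1.0, 2.0, 3.0], n_steps=200)
assert abs(coef[0] - 1.0) < 1e-6
assert all(abs(fi - yi) < 1e-6 for fi, yi in zip(f, [1.0, 2.0, 3.0]))
```

Stopping the loop early (the tuned m_stop) leaves some coefficients at zero, which is the inherent variable selection exploited by the glm_rank method later on.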

Random Forest (RF)
Random Forest (RF) is a bagging (bootstrap aggregating) method originally developed by Breiman (2001). In a RF model, multiple regression trees are generated using bootstrap samples and their predictions are averaged to yield the RF estimate.
Bootstrapping decorrelates the individual trees and adds some randomness to the predictions. There are several packages in R that can estimate RF models, but due to the computational burden of the study we used the fast ranger package (Wright and Ziegler, 2017). An RF model has several hyperparameters that can be tuned. The number of trees used for bagging is one of these parameters, but it just has to be sufficiently high, as more trees normally do not impair the prediction performance.

In this study we used 500 trees. Another parameter that needs to be optimized is the size of each bootstrap sample. We used a grid search from 0.7 to 0.9 for finding an optimal sample size for each bootstrap sample. Sampling was applied without replacement. Further, the number of variables that are randomly chosen for each split needs to be set, which was determined as p/3. We used the estimated response variance as splitting criterion, because rank-based approaches or the use of extremely randomized trees (Geurts et al., 2006) did not improve prediction accuracy.

Support Vector Machine Regression (SVR)

Support Vector Machines have their origin in classification but can be extended to regression problems. The method in its basic form uses a training dataset to create a line (or hyperplane) that separates the data into classes. The support vectors are the data points closest to the line or hyperplane and have the most influence on parameter estimation. In SVR, each of the predictor variables can be transformed to a set of basis functions h_m(x_j). Hence the regression function f(x) can be approximated by (Hastie et al., 2009):

f(x) = β_0 + Σ_{m=1}^{M} β_m h_m(x)

The number M of basis functions is not limited and, to estimate the coefficients β_0 and β_m, H(β, β_0) has to be minimized:

H(β, β_0) = Σ_{i=1}^{N} V(y_i − f(x_i)) + (λ/2) Σ_{m=1}^{M} β_m²

V can be any loss function, but in the initial idea of Vapnik (2000) it is defined by a threshold r: if the residuals are higher than this value, they are included in the penalization of the model; if the residuals are lower, they are discarded (Worland et al., 2018). Hence, SVR is insensitive to outliers, as coefficients are penalized with respect to them. Fast computation of the coefficients is achieved by only computing the inner product of each x_j. Therefore, different kernels can be used, and we decided to use the non-linear radial kernel in our study.
The SVR model was estimated by the e1071 package in R (Meyer et al., 2021).
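The threshold loss and the radial kernel can be illustrated as follows (a sketch of the concepts only, not the e1071 implementation; the values of `r` and `gamma` are arbitrary):

```python
import math

def eps_insensitive_loss(residuals, r=1.0):
    """Vapnik's threshold loss: residuals within +/- r contribute nothing,
    larger ones are penalized linearly by their excess over r."""
    return sum(max(abs(e) - r, 0.0) for e in residuals)

def rbf_kernel(x, z, gamma=1.0):
    """Radial (RBF) kernel: inner product in an implicit feature space,
    computed directly from the squared distance between two points."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

# residuals of 0.5 and -0.8 lie inside the tube and are ignored
assert eps_insensitive_loss([0.5, -0.8], r=1.0) == 0.0
# a residual of 3 exceeds the threshold by 2
assert eps_insensitive_loss([3.0], r=1.0) == 2.0
# a point compared with itself has kernel value 1
assert rbf_kernel([1.0, 2.0], [1.0, 2.0]) == 1.0
```

The kernel trick is what makes the non-linear radial SVR tractable: only these pairwise kernel values enter the optimization, never the basis functions h_m themselves.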

Variable preselection for parsimonious models
The variable selection procedure of this study is based on the recursive feature elimination (RFE) algorithm. RFE is a backward selection method that initially ranks the predictor variables according to some measure of importance, and the least important variables are removed in a backward procedure (Granitto et al., 2006). The final number of variables is determined by an error measurement on an independent test set. In this part we present our three approaches for variable ranking, the error measurement to define the number of variables, and a short overview of the validation scheme.
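The backward elimination loop can be sketched generically (our own helper; `rank` and `evaluate` stand in for a variable ranking method and a test-set error):

```python
def rfe(variables, rank, evaluate, drop=1):
    """Backward RFE sketch: evaluate the current subset, drop the least
    important variable(s), and record the error for each subset size.
    `rank` maps a variable list to an importance-ordered list (best first);
    `evaluate` returns an error (e.g. RMSE) for a variable subset."""
    path = []                                   # (subset, error) per step
    subset = list(variables)
    while subset:
        path.append((list(subset), evaluate(subset)))
        subset = rank(subset)[:-drop] if len(subset) > drop else []
    return path

# toy run: alphabetical order plays the role of the importance ranking,
# and the subset size plays the role of the error measurement
path = rfe(['a', 'b', 'c'], rank=sorted, evaluate=len)
assert [len(s) for s, _ in path] == [3, 2, 1]
assert path[1][0] == ['a', 'b']
```

The error path returned here is exactly what the RMSE-versus-number-of-variables curve in Fig. 4 summarizes, from which the final number of variables is later read off.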

Variable ranking methods
We test three different methods for the variable ranking of the RFE. Thus, we can differentiate between the prediction accuracy of the prediction models and the capability of different variable ranking methods for producing more parsimonious models.
- The first variable ranking method is the lasso (lasso_rank), which is applied as described in Sect. 3.1.1, except that standardized coefficients are calculated.
- Second, we use a linear model based boosting approach (glm_rank). Every model is estimated by 500 boosting steps, since boosting shows only slow overfitting behavior (Fig. 3). One disadvantage is that some non-influential variables will be ranked, but computation time is reduced. For calculation of the standardized coefficients we apply the variable importance function of the caret package in R (Kuhn, 2021).

- The third method (cf_rank) uses conditional forests (Hothorn et al., 2006; Strobl et al., 2007, 2009) for variable ranking, where the standardized coefficients are a sum of the main and interaction effects for each variable. Standardized coefficients are again calculated using the variable importance function of the caret package. Our variable ranking is computed over bootstrap samples to improve its robustness with respect to the data. Therefore, the initial dataset D is split up into 25 bootstrap samples (B = (b_1, b_2, b_3, ..., b_25)). Sampling is performed without replacement and a sample size of 68.2 % is used.
In a next step, each of the bootstrap samples is fitted by each of the variable ranking methods. For each bootstrap sample and each method, standardized coefficients denoted as β_method_j,b are returned, where b refers to the bootstrap sample, method to the variable ranking method (lasso_rank, glm_rank, cf_rank), and j to the considered predictor. The variable importance of each bootstrap sample for each selected coefficient is calculated from these standardized coefficients, and final variable rankings are computed by averaging over all 25 bootstrap samples (Eq. 9). The variables are ranked from highest to lowest.
Variables that were never selected in the 25 bootstrap samples are not considered for the variable selection. This yields n_var preselected variables for each method.
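The aggregation over bootstrap samples might be sketched as follows (a stand-in, since Eqs. 8-9 are not reproduced here; averaging absolute standardized coefficients is our own assumption):

```python
def aggregate_ranking(beta_by_sample):
    """Average (absolute) standardized coefficients over bootstrap samples and
    rank variables from highest to lowest; variables never selected
    (coefficient absent or zero in every sample) are excluded."""
    totals = {}
    for betas in beta_by_sample:               # one {variable: beta} per sample
        for var, b in betas.items():
            totals[var] = totals.get(var, 0.0) + abs(b)
    B = len(beta_by_sample)
    importance = {v: s / B for v, s in totals.items() if s > 0}
    return sorted(importance, key=importance.get, reverse=True)

# two toy bootstrap samples: 'alt' dominates, 'rain' is selected once with a
# non-zero coefficient, 'slope' appears only once
ranks = aggregate_ranking([{'alt': 1.0, 'rain': 0.5},
                           {'alt': 0.8, 'slope': 0.2, 'rain': 0.0}])
assert ranks == ['alt', 'rain', 'slope']
```

A variable with zero coefficients in all 25 samples would drop out entirely, matching the preselection rule described above.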

The final number of variables (n_final) of a method is then obtained from the relationship of RMSE_l versus the number of variables l, as exemplified in Fig. 4. A common option is to use min(RMSE) for determining n_final, but this would yield only a small reduction in the number of variables. We therefore propose to use a somewhat (+5 %) higher residual error, i.e. 1.05 × min(RMSE), as reference point. This makes the models more parsimonious with only a slight loss in performance.
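The +5 % rule can be written in a few lines (our own sketch; the RMSE values are invented):

```python
def n_final(rmse_by_size, tol=0.05):
    """Pick the smallest number of variables whose RMSE is within
    (1 + tol) * min(RMSE), i.e. the 1.05 x min(RMSE) reference point."""
    threshold = (1 + tol) * min(rmse_by_size.values())
    return min(l for l, e in rmse_by_size.items() if e <= threshold)

# invented RMSE path: the minimum (1.98) needs 20 variables, but 5 variables
# already stay within 5 % of it
rmse = {1: 3.0, 2: 2.2, 5: 2.05, 10: 2.0, 20: 1.98}
assert n_final(rmse) == 5
```

Setting `tol=0` recovers the min(RMSE) choice, which in this toy example would keep all 20 variables.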

Performance evaluation metrics
Model evaluation was performed by the root mean squared error RMSE_CV and the relative root mean squared error RRMSE_CV of each CV repetition:

RMSE_CV = sqrt( (1/N) Σ_{i=1}^{N} (ŷ_i − y_i)² ),  RRMSE_CV = RMSE_CV / ȳ

where ȳ is the mean of all observations. We further compare our results by the cross validated R²_CV defined as:

R²_CV = 1 − Σ_{i=1}^{N} (ŷ_i − y_i)² / Σ_{i=1}^{N} (y_i − ȳ)²

For a more focused assessment of individual catchments in terms of how CV performance depends on climate and catchment characteristics, we use the absolute normalized error ANE_CV of the i-th catchment:

ANE_CV,i = |ŷ_i − y_i| / y_i

Results

Model performance without variable preselection

Figure 5 shows the performance of all models without variable preselection and Table 2
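The evaluation metrics used throughout the results can be computed in a few lines (our own sketch of the standard definitions; `obs` plays the role of y, `pred` of ŷ):

```python
import math

def metrics(obs, pred):
    """RMSE, relative RMSE, cross-validated R^2 and per-site absolute
    normalized error for one set of CV predictions."""
    n = len(obs)
    ybar = sum(obs) / n
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    rmse = math.sqrt(sse / n)
    rrmse = rmse / ybar                                   # RMSE relative to mean
    r2 = 1 - sse / sum((o - ybar) ** 2 for o in obs)      # = NSE
    ane = [abs(p - o) / o for o, p in zip(obs, pred)]     # per catchment
    return rmse, rrmse, r2, ane

# toy check: predicting the mean everywhere gives R^2 = 0
rmse, rrmse, r2, ane = metrics([2.0, 4.0], [3.0, 3.0])
assert rmse == 1.0 and r2 == 0.0
assert ane == [0.5, 0.25]
```

Note that R² defined this way equals the NSE mentioned in the benchmarking discussion, which is why the two are treated as equivalent there.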

Effect of variable preselection
The use of variable preselection can significantly reduce the complexity of all models with only a small loss of performance (Fig. 6). For winter low flow, variable selection leads to a median R²_CV decrease of 5.1 % (cf_rank), 5.8 % (glm_rank) and 7.1 % (lasso_rank) over all models. The spread in performance across models is a bit higher for lasso_rank, with an interquartile range (IQR) of 4.1 %, compared to an IQR of 3 % for cf_rank and 2.8 % for glm_rank. Although cf_rank yields a slightly better performance than glm_rank and lasso_rank, the conditional forest approach requires 35 variables, whereas glm_rank and lasso_rank use only 12 and 14 variables for winter q95. The glm_rank and lasso_rank are therefore much more effective. The number of variables for the three variable ranking methods shows almost no dispersion, with an IQR of 1 (glm_rank), 2 (cf_rank) and 3.5 (lasso_rank). Interestingly, the highest performance loss is observed for the linear boosting model (GLM) for all three variable ranking methods. This is due to the nature of boosting methods, whose main strength is efficient parameter estimation for high-dimensional multicollinear data sets. Clearly, variable preselection affects the performance of the method.
In contrast to winter low flow, where the performance loss corresponds well with the +5 % residual error specification, variable selection for summer low flow only leads to a minor loss in performance (Fig. 6b). Here, the median R²_CV decrease is only 1 % for cf_rank and lasso_rank, and 0.8 % for glm_rank. Also the differences between models are very small. Again, cf_rank yields a substantially higher number of variables (22) than the glm_rank method (8 variables) and the lasso_rank (9 variables).

Figure 6. Reduction of R²_CV is computed with respect to the model performance without any variable preselection. Each point is the median number of variables of one CV-run and one prediction model and the related R²_CV. Each line shows the upper and lower quartile (horizontal: number of variables; vertical: reduction in R²_CV) for a variable ranking method. Colours can be found in Ram and Wickham (2018).
Furthermore, the IQR of the number of variables shows that the selected number of variables is about the same in all models for lasso_rank (IQR = 3) and glm_rank (IQR = 2), but greatly differs between the models based on cf_rank (IQR = 23). This spread reflects a much lower parameter-reduction efficiency of cf_rank for the linear models (22 predictors for PCR, 35 for LASSO, 42 for sPLS, 50 for GLM) than for the non-linear models (12 predictors for RF, 17 for SVR, 18 for GAM).
Among all models, the RF provides the model with the lowest number of variables for summer low flows. It consists (on median) of only three variables when lasso_rank and glm_rank are used for variable selection, and has a performance loss of less than 1 %.

Importance of predictors (from variable rankings)
We performed variable rankings for each of the three ranking methods 1000 times inside the CV runs. In this section we discuss the 10 best ranked variables for each variable ranking method, defined by the average rank over all 1000 repetitions. We focus on the two linear ranking methods, as the non-linear method cf_rank did not perform well. Figure 7 gives an overview of the rank counts for winter low flow and shows that the catchment altitude is on average the highest ranked variable. Meteorologically based variables appear four (glm_rank) and five (lasso_rank) times among the ten best ranked variables. Predictors rated by both methods are aridity and snowmelt in the winter months. The glm_rank lists preconditions such as precipitation or dry days in summer as important variables. However, lasso_rank found dry days in winter and over the whole year to be more important. Geological variables such as the percentage of Quaternary sediments or limestone enter the models as indicators for catchment processes. Finally, land use characteristics such as the proportion of agricultural area and wasteland rocks were found for the glm_rank, whereas only the fraction of grassland is rated by lasso_rank. Both are correlated with the proportion of lowland/high mountain areas and can be interpreted as topological characteristics as well.
A slightly different picture emerges from the assessment of summer low flow (Fig. 8), where the three highest ranked variables are maximum catchment altitude, mean catchment slope and dry days in summer for both variable ranking methods. Topological descriptors play a somewhat more dominant role for summer low flow, as four (glm_rank) and five (lasso_rank) such variables are highly ranked. Apart from mean catchment slope and maximum catchment altitude, difference in catchment altitude and stream network density are also found for both variable ranking methods. In addition to the dry days in summer, two meteorological variables are found by each of the methods for summer q95: aridity in winter (both methods) or autumn (lasso_rank) and the annual temperature range (lasso_rank). Finally, geological features such as the proportion of Flysch, or land use variables such as the fraction of grassland, are highly ranked for both approaches.

Predictive performance and benchmarking
We showed that statistical learning models can yield high prediction accuracy. It is now interesting to assess how the models fit into the picture of existing national and international studies. We first assess the performance relative to low flow regionalization studies for the Austrian study area. Laaha and Blöschl (2006) fitted a multiple regression model to annual low flow q95 on 325 (sub-)catchments. They reported a performance of R²_CV = 0.70 when the study area was subdivided into regions that exhibit similar seasonal characteristics of low flow. In a subsequent study, Laaha et al. (2014) showed that Top-kriging can outperform regional regression, especially when interpolating between gauges at the larger rivers. Top-kriging (TK, Skøien et al., 2006) is a geostatistical method that uses stream-network distance for low flow prediction and was shown to be more adequate than ordinary kriging approaches. For comparison of our results with the current benchmark (TK), we used the same 10-fold CV runs as for our statistical learning models. TK yields a median R²_CV of 0.68 for winter low flow, which is slightly below the SVR and the GLM model and equivalent to the RF model. TK performs similarly to most models for summer q95, with a median R²_CV of 0.84, quite similar to the two boosting approaches. Hence, we show that statistical learning models can perform as well as or even better than the current benchmark TK for summer and winter low flow in the Austrian study area.

405
It is also interesting to compare our findings to existing studies that assess statistical learning methods for low flow estimation.
However, comparison of performance metrics across studies is not straightforward. Worland et al. (2018) and Ferreira et al. (2021) applied their prediction models to low flow characteristics such as Q95 and the quite similar characteristic 7Q10, but these were not standardized by catchment area as in our study. This can lead to superior performance metrics, particularly if there are significant variations in catchment size within the sample. Worland et al. (2018) reported a NSE (which is equivalent to the R²_CV in this study) of 0.92 for the meta cubist model, and Ferreira et al. (2021) reported a NSE value of almost 1. However, the scatterplots of the studies suggest that errors are still considerable, especially for the low observation values. Although the studies are not directly comparable to our study in terms of performance, a qualitative comparison is still warranted. Both studies found tree based methods, including the meta cubist model (Worland et al., 2018), the RF (Zhang et al., 2018), and tree based boosting (Tyralis et al., 2021), to have a higher prediction accuracy than the other models. Our study complements existing studies by examining additional learning models. Our results suggest that the SVR and the GAM boosting model can outperform tree-based models for the Austrian setting. However, differences in performance are rather small, so that other methods (e.g. GLM, LASSO, and tree-based RF) can also be considered well suited.
One major research gap addressed by this study is the separate evaluation of statistical learning models for seasonal low flow processes. All statistical learning models of this paper can be classified as global models, as all gauges are considered in the same model without catchment grouping. Earlier studies showed that regional regression can increase the prediction accuracy compared to global regression (Laaha and Blöschl, 2006, 2007), as different low flow generation processes apply to summer and winter regions. Here we pursue a different strategy in which we separate summer and winter processes by a temporal stratification into summer and winter low flows. We found that analysing winter and summer low flow indices individually leads to increased prediction accuracy, especially for summer q95. This emphasizes that the prediction accuracy of a specific model is influenced by the underlying hydrological process, and different models can be suitable for different applications (Worland et al., 2018). For comparison, preliminary results without consideration of the seasonal regime of our study area led to a median prediction accuracy (R²CV) of 0.66 to 0.74, which is very similar to the earlier studies on annual low flow (TK 0.75).
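The value of separating the two regimes can be illustrated with a toy experiment: records generated by two different (hypothetical) seasonal processes are predicted by one global linear model versus two season-specific ones. This simply splits synthetic records by season, a simplification of the paper's stratification into seasonal low flow indices:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 260
X = rng.normal(size=(n, 6))
season = rng.integers(0, 2, size=n)          # 0 = winter record, 1 = summer record
# Hypothetical seasonal processes: different coefficients per season.
y = np.where(season == 0,
             X[:, 0] - 0.5 * X[:, 1],
             0.2 * X[:, 0] + X[:, 2]) + rng.normal(scale=0.3, size=n)

# One global model over all records ...
global_r2 = float(np.median(cross_val_score(RidgeCV(), X, y, cv=10, scoring="r2")))
print(f"global: {global_r2:.2f}")

# ... versus separate seasonal models.
seasonal = {}
for s, label in [(0, "winter"), (1, "summer")]:
    m = season == s
    seasonal[label] = float(np.median(
        cross_val_score(RidgeCV(), X[m], y[m], cv=10, scoring="r2")))
    print(f"{label}: {seasonal[label]:.2f}")
```

When the underlying processes differ, the season-specific fits recover each process cleanly while the global model averages over both, mirroring the accuracy gain from stratification reported above.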

In a comparative assessment of low flow studies based on the PUB assessment report, Salinas et al. (2013) showed that prediction accuracy is not only a function of model selection, but also of the specific setting of the study area.
The assessment did not contain statistical learning models, so we want to embed our results in their findings. Figure 9, which is equivalent to Fig. 5 of Salinas et al. (2013), shows the ANE CV as a function of the aridity index, the catchment area and the catchment altitude. Our study confirms the finding of the PUB assessment report that prediction accuracy decreases as the aridity of the study area increases (Salinas et al., 2013). Although stations with an aridity index over one are missing, the trend is clearly evident. No trend is evident for the winter low flows, which are more driven by freezing processes than by a climatic water balance deficit. Decreasing performance in arid regions for drought detection was also found by Haslinger et al. (2014).
This effect may be further intensified because in arid regions the observed values can be near zero.
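Assuming the usual definition of the absolute normalized error, ANE = |predicted − observed| / observed (our reading of the metric, not quoted from the paper), the inflation near zero is easy to demonstrate:

```python
import numpy as np

def ane(obs, sim):
    """Absolute normalized error per catchment (assumed: |sim - obs| / obs)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return np.abs(sim - obs) / obs

obs = np.array([2.0, 4.0, 0.1])   # last catchment: near-zero low flow
sim = np.array([2.2, 3.8, 0.2])
errors = ane(obs, sim)
print(errors)  # the near-zero catchment dominates despite a tiny absolute error
```

A 0.1-unit error on a flow of 0.1 yields ANE = 1.0, while the same absolute error on a flow of 2.0 yields only 0.05, which is why arid, near-zero catchments appear disproportionately poor.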
Regarding catchment altitude, we found remarkably divergent results for summer and winter low flow. Whereas our findings for summer low flow are in line with Salinas et al. (2013), we could not identify a clear tendency for winter low flow. Catchments located in lowlands and mountainous areas have a somewhat larger ANE CV than catchments with an elevation between 450 and 1500 meters. This suggests that winter low flows are better predictable in colder mid-mountain catchments than in warmer lowland catchments, where the occurrence of frost events varies from year to year. Finally, we can show that prediction accuracy increases with catchment size, which is fully consistent with Salinas et al. (2013).
Another finding of Salinas et al. (2013) that is not captured by Fig. 9 is that predictions of low flows in cold climates reach a lower prediction accuracy than in humid, thus warmer, climates. A comparable effect can be observed when comparing the results for winter and summer low flow, where the best performing model has an R²CV of 0.70 in winter and 0.86 in summer. This divergence can be explained by the more complex hydrological processes of winter low flows (Salinas et al., 2013). It is shown here that this performance gap applies to seasonal climates with a warm season and a frost period in the same way as between cold and humid climates.

Linear vs. non-linear models
All studies that conducted a comparative assessment of statistical learning models for low flow estimation highlighted that non-linear models are superior to linear approaches (e.g. Worland et al., 2018; Ferreira et al., 2021; Zhang et al., 2018; Tyralis et al., 2021). In principle this is consistent with our findings, where winter low flow is best predicted by the SVR model, and summer q95 is best approached by the GAM model. However, we showed that linear statistical learning models such as GLM or sPLS perform almost as well in our study. To shed more light on this issue we assessed the relative value of the GAM over the GLM boosting model in more detail. Both models are equivalent in the case of linear relationships, but the GAM offers the possibility to extend the GLM with non-linear relationships if these improve the model. The comparison yields that the GAM selects additional non-linear effects to increase the goodness-of-fit. However, the additional effects do not increase the predictive performance of the model. In fact, the R²CV of the GAM is only 1 % higher for summer but 3 % lower for winter low flow when using the model without variable preselection. This suggests that non-linear processes, which are to be expected in such a heterogeneous study area as Austria, can be sufficiently captured by the superposition of linear terms, so there is no need to add non-linear effects or to use a non-linear model. This is additionally supported by the performance of the three variable ranking methods. The non-linear approach of conditional forest leads only to a small reduction of the set of predictors, with performance similar (winter q95) or worse (summer q95) than the two linear ranking methods.
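The linear-versus-non-linear comparison can be mimicked generically: fit a linear and a non-linear learner to the same (synthetic, mildly non-linear) data under 10-fold CV and compare median R². This is an illustrative stand-in, not the GLM/GAM boosting pair used in the paper:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(260, 8))
# Mostly linear truth plus one smooth, mildly non-linear effect.
y = X[:, 0] + 0.5 * X[:, 1] + 0.3 * np.tanh(X[:, 2]) + rng.normal(scale=0.3, size=260)

results = {}
for name, model in [("linear (ridge)", RidgeCV()), ("non-linear (SVR)", SVR())]:
    r2 = cross_val_score(model, X, y, cv=10, scoring="r2")
    results[name] = float(np.median(r2))
    print(f"{name}: median R2_CV = {results[name]:.2f}")
```

When non-linearity is weak relative to noise, the linear model's cross-validated score is close to the non-linear one even though the non-linear model fits the training data better, which is the pattern observed for the GAM above.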

Conclusions
In this study we investigated a broad range of statistical learning methods for a comprehensive dataset of 260 catchments in Austria. The results showed that all statistical learning models perform well and are therefore well suited for low flow regionalization. Performance is particularly high for summer low flow (R²CV = 0.86), but still leads to satisfactory results for winter low flow (R²CV = 0.70). The best performing models are support vector machine regression (winter) and non-linear model based boosting (summer), but linear models exhibit similar prediction accuracy. No superior model could be found for both low flow processes, as relative differences between learning methods are actually small. The models perform similar to or slightly better than a Top-kriging model that constitutes the current benchmark for the study area.

Variable preselection is shown to reduce the predictor set (on median) from 87 variables to 12 for winter and 8 for summer low flow. This comes at a small loss in performance, which is about 5 % for winter low flow and only 1 % for summer low flow. The results suggest that variable preselection can help to create parsimonious learning models that are easier to interpret and more robust when predicting at ungauged sites. The RF (summer) provides the model with the smallest number of predictors, which consists of only three variables and has a performance loss of less than 1 %.
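A generic sketch of forest-ranked recursive feature elimination in the spirit of this preselection step, using scikit-learn's RFE on synthetic data (the paper's actual pipeline, with three ranking methods inside a nested CV, is more elaborate):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in: 87 catchment descriptors, only a few of them informative.
X, y = make_regression(n_samples=260, n_features=87, n_informative=8,
                       noise=5.0, random_state=0)

# Recursive feature elimination with a forest-based variable ranking,
# dropping five predictors per iteration until eight remain.
rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=0),
          n_features_to_select=8, step=5).fit(X, y)
print("selected predictors:", np.flatnonzero(rfe.support_))
```

The final prediction model is then refit on the retained columns only, which is what makes the resulting model parsimonious and easier to interpret.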
Linear prediction models such as the linear model based boosting reveal high prediction accuracy. Non-linear terms were shown to increase the goodness-of-fit of the models, but did not improve predictions at ungauged sites. Our results suggest that non-linear low flow relationships can be sufficiently captured by linear learning models, so there is no need to use more complex models or to add non-linear effects. This finding is confirmed by our variable ranking methods, where linear approaches seem to be sufficient for our estimation problem.
Finally, the study shows that when performing low flow regionalization in a seasonal climate with a cold winter season, the temporal stratification into summer and winter low flows increases the predictive performance of all learning models. This suggests that conducting separate analyses of winter and summer low flows provides a data-efficient alternative to catchment grouping that is recommended otherwise.
Author contributions. JL designed the research layout and GL contributed to its conceptualization. JL performed the formal analyses and prepared the draft paper. MM supported the analyses. GL supervised the overall study. All authors contributed to the interpretation of the results and writing of the paper.