Coupled machine learning and the limits of acceptability approach applied in parameter identification for a distributed hydrological model

Monte Carlo (MC) methods have been widely used in uncertainty analysis and parameter identification for hydrological models. The main challenge with these approaches is, however, the prohibitive number of model runs required to get an adequate sample size which may take from days to months especially when the simulations are run in distributed 10 mode. In the past, emulators have been used to minimize the computational burden of the MC simulation through direct estimation of the residual-based response surfaces. Here, we apply emulators of MC simulation in parameter identification for a distributed conceptual hydrological model using two likelihood measures, i.e. the absolute bias of model predictions (Score) and another based on the time relaxed limits of acceptability concept (pLoA). Three machine learning models (MLMs) were built using model parameter sets and response surfaces with limited number of model realizations (4000). The 15 developed MLMs were applied to predict pLoA and Score for a large set of model parameters (95000). The behavioural parameter sets were identified using a time relaxed limits of acceptability approach based on the predicted pLoA values; and applied to estimate the quantile streamflow predictions weighted by their respective Score. The three MLMs were able to adequately mimic the response surfaces directly estimated from MC simulations with an R value of 0.7 to 0.92. Similarly, the models identified using the coupled ML emulators and the limits of acceptability approach have performed very well in 20 reproducing the median streamflow prediction both during the calibration and validation periods with an average NashSutcliffe efficiency value of 0.89 and 0.83, respectively.


Introduction
Conceptual hydrological models have wide range of applications in solving various water quantity and quality related problems. A conceptual model typically comprises one or more calibration parameters as part of the user's perception of the 25 hydrological processes in the catchment and the corresponding simplifications that are assumed to be acceptable for the intended modelling purpose (Beven, 1989;Refsgaard et al., 1997). One of the major challenges in using conceptual models, however, is the identification of model parameters to a particular catchment (e.g. Bárdossy and Singh, 2008). The failure to set parameter values in accordance to their theoretical bounds, the interaction between these parameters, as well as the absence of a unique best set of parameters are some of the causes of parameter uncertainty (Abebe and Price, 2003;Renard 30 et al., 2010). In light of the different sources of uncertainty, previous studies have pointed out the need for a rigorous uncertainty analysis and communicating model simulation results in terms of uncertainty bounds rather than with only crisp values (e.g. Uhlenbrook et al., 1999).
The GLUE methodology was inspired by the generalized sensitivity analysis concept of Hornberger and Spear (1981) and it is the most widely used uncertainty analysis framework in hydrology (Stedinger et al., 2008;Xiong et al., 2008;Shen et al., 2012). The residual-based version of this framework allows the user to choose a likelihood and the threshold value for 5 identification of behavioural and non-behavioural models. The limits of acceptability based GLUE methodology (GLUE LoA) (Beven, 2006) overcomes limitations of the residual-based GLUE, that arise from the subjectivity in choosing the likelihood and the threshold value, by setting error bounds around the observed values. Models whose prediction falling within the error bounds for all observations are considered behavioural. The original GLUE LoA, which was formulated as a rejectionist framework in testing environmental models as hypothesis, is too stringent to be used for calibration purpose especially in continuous rainfall-runoff modelling. In the past, different approaches have been made to minimize the rejection of useful models when using GLUE LoA. These approaches include relaxing the limits (e.g. Blazkova and Beven, 2009;Liu et al., 2009), using different model realizations for different periods of a hydrological year (e.g., Choi and Beven, 2007) and using a time relaxed approach with the degree of relaxation constrained by an additional efficiency criterion (Teweldebrhan et al., 2018). The time relaxed GLUE LoA approach (hereafter referred as GLUE pLoA) was based on the 15 empirical relationship between model efficiency and uncertainty in response to the percentage of model predictions that fall within the observation error bounds (pLoA). In a case study involving this approach and an operational hydrological model, the ensemble of model realizations with only 30-40 % of their predictions in a hydrologic year falling within the observation error bounds were able to predict streamflow during the evaluation period with an acceptable degree of accuracy for the intended use based on the commonly used efficiency criteria.

20
Monte Carlo (MC) simulation is commonly employed to quantify the uncertainty propagated from model parameters to predictions in model calibration and uncertainty analysis frameworks including the GLUE methodology. MC simulation involves the sampling of very large parameter sets from their respective parameter dimension. This is especially true when the parameter distribution is not known a priori and hence a uniform distribution is assumed. Although, the MC simulation is a widely accepted stochastic modelling techniques, it suffers from heavy computational burden (Yu et al., 2015). The 25 computational time and resources required by the MC simulation could be prohibitively expensive in the case of computationally intensive models such as those with a distributed setup (e.g. Shrestha et al., 2014). In the past, different approaches have been introduced to minimize the computational burden by reducing the number of model realizations in MC simulation. These include the use of more efficient parameter sampling techniques such as the Latin hypercube sampling (e.g. McKay et al., 1979;Iman and Conover, 1980) and adaptive Markov chain MC sampling (e.g. Blasone et al., 2008;Vrugt et 30 al., 2009) as well as through use of emulators (e.g. Wang et al., 2015). An emulator or surrogate model is a computationally efficient model that is calibrated over a small dataset obtained by the simulation of a computationally demanding model and used in its place during computationally expensive tasks (Pianosi et al., 2016).
In hydrology, surrogate modelling has been mainly used in optimization and sensitivity analysis frameworks (Oakley and O'Hagan, 2004;Emmerich et al., 2006;Razavi et al., 2012). This approach involves a limited number of model realizations

35
to build a surrogate model using the parameter sets and model outputs as covariates and independent variable, respectively.
Statistical (e.g. Jones, 2001;Hussain et al., 2002;Regis and Shoemaker, 2004), Gaussian processes (Kennedy and O'Hagan, 2001;Yang et al., 2018) and machine learning models (MLMs) (e.g. Yu et al., 2015) have been used as surrogate models to emulate MC simulations. A machine learning model estimates the dependency between the inputs and outputs of a system from the available data (Mitchell, 1977).
In this study three MLMs, i.e. random forest (RF), K-nearest neighbours (KNN), and artificial neural network (NNET) are built using limited number of model parameter sets and response surfaces to emulate the MC simulation through coupling with the limits of acceptability approach. In hydrology, machine learning approaches have been increasingly used in different areas of application following the improvement in computation power. MLMs have been used in direct prediction of different water quantity variables such as streamflow Modaresi et al., 2018; 5 Senent-Aparicio et al., 2018), evapotranspiration (Torres et al., 2011) and snow water equivalent (Marofi et al.,2011;Buckingham et al., 2015;Bair et al., 2017). Similarly MLMs have been used to predict water quality related variables such as nitrate concentration (Ransom et al., 2017) and sediment transport (Bhattacharya et al., 2017). MLMs have also been used to forecast the residuals of a conceptual rainfall-runoff model (Abebe and Prince, 2003) and as emulator for conducting parameter uncertainty analysis of a conceptual hydrological model in order to overcome the high computational cost of the 10 MC simulation . The main goal of this study is to emulate the time consuming MC simulation for parameter identification through coupling of the machine learning models with the time relaxed limits of acceptability approach. The first objective is to assess the possibility of using pLoA as a likelihood measure for identification of behavioural models using the coupled MLMs and the limits of acceptability approach, instead of the previously used residual-based likelihood measures. The second objective is to compare the relative performances of RF and KNN as 15 emulators of the MC simulation in relation to the standard machine learning based emulator, i.e. NNET. As of our best knowledge, RF and KNN have not been used before as emulators of the MC simulation in parameter identification for hydrological models. The third objective is to compare the performance of the MLMs trained using pLoA against those trained using the absolute bias based criterion (Score) as target variables in conducting sensitivity analysis in order to assess the relative influence of the model parameters on the simulation result.

20
This paper is structured as follows: Section 2 presents a brief introduction to parameter identification using the time relaxed GLUE LoA approach as well as the MLMs used in this study. This section will also present the procedure followed in coupling the MLMs with the time relaxed GLUE LoA to emulate the MC simulation. Section 3 introduces the hydrological model and the study area used in the case study. Section 4 presents the validation results of the ML models through comparison of the predicted response surfaces against those directly generated from the MC simulation as well as 25 comparison of the simulated streamflow from behavioural models identified using the coupled MLMs and the time relaxed GLUE LoA against the observed values. Implications of the results in relation to the dataset and models used in this study as well as relevant previous studies are discussed in Section 5 and conclusions are drawn in section 6.

Methodology
Coupling of the MLMs with the GLUE pLoA was realized in two main phases. In the first phase, the response surfaces were 30 generated using limited number of MC simulations with subsequent evaluation of each realization using pLoA and Score as likelihood measures. The MLMs were then built using the parameter sets and the response surfaces. In the second phase, the developed MLMs were applied to predict the response surfaces for new parameter sets and the GLUE pLoA was used to identify the behavioural parameter sets based on the predicted response surfaces. The R software and its package for classification and regression training (CARET) were used for building and application of the MLMs as well as for 35 conducting further analyses. from the prior distribution. Since the parameter distribution was not known a priori, a uniform MC sampling was employed.
The hydrological model was run using the sampled parameter sets and the streamflow predictions of all model realizations were retrieved for further analysis.
The GLUE limits of acceptability approach (GLUE LoA) (Beven, 2006) was used to characterize behavioural and nonbehavioural simulations. This approach relies on an assessment of uncertainty in the observational data and involves setting 10 an observation error bounds (limits) with due consideration to the observation and other sources of uncertainties such as from the input data. Since no streamflow uncertainty data were available in the study site, mean observational uncertainty of 25% was assumed and the streamflow limits were defined using this value. In this study, the time relaxed variant of the GLUE LoA (GLUE pLoA) was employed to characterize behavioural models. In GLUE pLoA, the requirement in the original formulation for the model realizations to satisfy the limits in 100% of the observations is relaxed; with the degree of 15 relaxation constrained as a function of an acceptable modelling uncertainty expressed by the containing ratio index ( ).
This index was also used to quantitatively evaluate the modelling and prediction uncertainties following similar procedure as used in previous studies involving the GLUE methodology (e.g. Xiong et al., 2009). It is expressed as the number of observations falling within their respective prediction bounds to the total number of observation (Eq. 1). When using the GLUE methodology, the observations are not expected to lie within the prediction bands at a percentage that equals the 20 given certainty level. However, the modeller can adopt the certainty level specified for producing the prediction limits as a kind of standard for assessing the efficiency of the prediction limits in enveloping the observations (Beven, 2006 The procedure followed in GLUE pLoA for relaxing the original formulation is detailed in Teweldebrhan et al. (2018). For completeness, we include a summary of the steps herein: Step 1: define an acceptable modelling uncertainty (CR) at a chosen certainty level (e.g. 5-95 %) based on previous 5 experience or literature values (e.g. Teweldebrhan et al., 2018). In this study the CR value obtained for the calibration period using the residual-based GLUE methodology was adopted as an acceptable CR value.
Step 2: relax the acceptable percentage of observations where model predictions fall within the limits. This is done by gradually lowering the requirement for bracketing the observations in 100% of the time steps up to the acceptable pLOA.
Step 3: test whether each model realization prediction falls within the limits at least for the specified percentage of the total 10 observations. If model realizations that satisfy the relaxed acceptability criteria are found, proceed to step 4, otherwise lower the threshold pLOA further and repeat this step.

15
The identified behavioural model realizations were used to predict streamflow weighted by their respective Score values.
When calculating Score, the prediction error, i.e. the deviation between the observed and simulated streamflow ( ) values ( -) was first converted into a normalized criterion. This was accomplished using a triangular membership function with its support representing the uncertainty in streamflow observations and the pointed core representing a perfect match between the observed and predicted values (

= ∑ =1
(4) where the notations n and N respectively refer to the number of streamflow observations and behavioural models.

Machine learning modelling
Three non-linear and non-parametric machine learning methods, i.e. RF, KNN, and NNET from the CARET package of R (Kuhn, 2008) were considered to emulate the MC simulation. In all methods, relevant parameters were optimized and the root mean squared error (RMSE) metric was used to identify the optimal model. This section briefly summarizes these 5 machine learning methods and the reader is referred to the above reference for detailed description of these algorithms.

Random forest
Random forest (RF) is a version of the Bagged (bootstrap-aggregated) trees algorithm (Breiman, 2001). As such, it is an ensemble method whereby a large number of individual trees are grown from random subsets of predictors, providing a weighted ensemble of trees (Bair et al. 2017). Bagging was reported to be a successful approach for combining unstable learners (e.g. Li et al., 2011). Since RF combines bagging with a randomization of the predictor variables used at each node, it yields an ensemble of low correlation trees (Li et al., 2011, Appelhans et al., 2015. The free parameter in this method, i.e. the number of randomly selected predictors at each node, was determined through optimization. RF is also less sensitive to non-important variables, since it implicitly performs variable selection (Okun and Priisalu, 2007).

15
K-nearest neighbors (KNN) approach uses the K-closest samples from the training dataset to predict a new sample. The value of K, i.e. the number of closest samples used in the final model was optimized. KNN is a nonparametric method where the information extracted from the observed datasets is used to predict the variable of interest without defining a predetermined parametric relationship between the predictors and predicted variables (Modaresi et al., 2018). KNN is also a non-linear method whose prediction solely depends on the distance of the predictor variables to the closest training dataset 20 known to the model (Appelhans et al., 2015). In this study, the Euclidean distance was used as a similarity measure for computing the distance between the new and training datasets.

Artificial neural network
An artificial neural network (NNET) constitutes an interconnected and weighted network of several simple processing units called neurons that are analogous to the biological neurons of the human brain (Hsieh, 1993;Tabari et al., 2010). The

25
neurons provide the link between the predictors and the predicted variable and in the case of supervised learning the weights of the neurons, i.e. the unidirectional connection strengths, are iteratively adjusted to minimize the error (Sajikumar and Thandavesware, 1999;Bair et al. 2018). NNET has the capability to detect and learn complex and nonlinear relationships between variables (Yu et al., 2015).
A multilayer perceptron is the most common type of neural network used in supervised learning (Zhao et al., 2005; 30 Marofi et al., 2011) and it consists of an input layer in which input data are fed, one or more hidden layers of neurons in which data are processed, and an output layer that produces output data for the given input (e.g. Senent-Aparicio et al., 2018).
In this study one middle layer was considered, with the number of neurons in the input and output layers being equal to the number of predictors and predicted variable, respectively. The two free parameters of NNET, i.e. the number of neurons in the middle layer and the value of weight decay were optimized. Based on a preliminary assessment on performances of models with a linear and sigmoid activation function, a linear activation function was used in the final model.

5
The procedure followed to build and apply the MLMs as emulators of the MC simulation is similar to the approach used in previous studies (e.g. Yu et al., 2015) with the exception of the parameter identification part. While the previous studies were conducted based on the residual-based GLUE, here we use the time relaxed GLUE LoA approach with two likelihood measures. The coupling procedure involved sampling of 5000 random samples from the dimensions of the uncertain model parameters. The hydrological model was run using these parameter sets with subsequent retrieval of the simulated 10 streamflow values. Each model realization was evaluated both in terms of its capability to generate simulated streamflow close to the observed values (Score) and its persistency in producing acceptable simulated values that fall within the observation error bounds (pLoA). Six MLMs (for the combinations of the two likelihoods, i.e. Score and pLoA and for the three ML methods, i.e. RF, KNN, and NNET) were trained and tested using the randomly selected parameter sets and their corresponding likelihood values directly estimated from the MC simulation. Sample sizes of 80% (S1) and 20% (S2) of the 15 5000 samples were respectively used for training and testing the MLMs (Table 1).  The procedure followed in building and evaluation of MLM-GLUE pLoA can be divided into two main phases as ii. Run the hydrological model using the sampled parameter sets and store the simulated streamflow corresponding to each parameter set.
iii. Calculate the performance of each model realization in terms of the percentage of time steps it is able to reproduce the observed streamflow within the observation error bounds, i.e. pLoA, and the total normalized absolute bias of the predicted streamflow (Score). A streamflow observation error bound of 25% was assumed in this study.

5
iv. Use 80% of the parameter sets, i.e. S1, of the samples generated at step i as covariates; and the performance of each parameter set (pLoA) calculated at the previous step as target variable to train the MLMs i.e. RF, KNN, and NNET (MLMs_pLoA). Similarly, train the three MLMs using same parameter sets (S1) as covariates but with Score as a target variable (MLMs_score).
v. Test the trained MLMs_pLoA using the remaining 20% of the parameter sets generated at step i, i.e. S2, and the corresponding target variable (pLoA) from step iii. Similarly, test the trained MLMs_score using the same samples (S2) but with Score as a target variable.
(b) Response surface estimation and behavioural model identification i. Repeat the step i in MLM training and testing (a) but with a much bigger sample size of 95000 (S3) ii. Predict pLoA and Score using MLMs_pLoA and MLMs_score, respectively and S3 as covariate.

15
iii. Identify behavioural samples (S4) from S3 using the time relaxed limits of acceptability approach (Section 2.1) based on the pLoA predicted by the MLM.
iv. Estimate weighted median streamflow prediction of the behavioral models. The Score predicted by the MLMs_score was first normalized using Eq. 4 and then used to weigh the relative contribution of each model realization.

Figure 2.
Schematic overview of the MLM training and testing as well as response surface prediction using the MLMs and the identification of behavioural models using the coupled MLM and GLUE pLoA.

Model performance measures
The performances of the generated ML models, i.e. RF, KNN, and NNET in terms of their capability to reproduce the 5 response surfaces were evaluated using the following three standard statistical criteria, i.e. root mean square error ( ), coefficient of determination ( 2 ) and the mean absolute bias ( ).
where , and , respectively denote the likelihood values (pLoA or Score) predicted using a given MLM and estimated using the MC simulation for the i th model realization. ̅ and ̅ are the average MLM predicted and MC estimated likelihood values, respectively. N is the total number of model realizations.
The Nash-Sutcliffe efficiency (NSE, Eq. 8) and the NSE with log-transformed data (LnNSE) were used for assessing the streamflow prediction of behavioral models identified using MLM-GLUE pLoA through comparison against the observed 5 values.
where , and , respectively represent simulated and observed streamflow for the i th time step and ̅ represents mean value of the observed streamflow series. institutions (e.g. Nyhus, 2017;Matt et al., 2018). The modelling framework has three main models (method stacks) and in this study, the PT_GS_K model was used for the parameter identification study using machine learning based emulators of the MC simulation. PT_GS_K is a conceptual hydrological model and in this study eight of its parameters are subjected to 15 uncertainty analysis. PT_GS_K uses the Priestley-Taylor (PT) method (Priestley and Taylor, 1972) for estimating potential evaporation; a quasi-physical based method for snow melt, sub-grid snow distribution and mass balance calculations (GS method); and a simple storage-discharge function (Lambert, 1972;Kirchner, 2009) for catchment response calculation (K).
Overall, these three methods constitute the PT_GS_K model in Shyft. The framework establishes a sequence of spatially distributed cells of arbitrary size and shape. As such it can provide lumped (single cell) or discretized (spatially distributed) 20 calculations, as in this study. The modelling framework (shyft) and the PT_GS_K model in particular were described in previous studies (e.g. Burkhart et al., 2016;Teweldebrhan et al., 2018) and the reader is referred to these materials for further details on the underlying methods of this model. The following table shows list of the uncertain model parameters and their parameter range.

Study site and data
The data used for training and validation of the ML emulators was retrieved from the Nea-catchment. This catchment is located in Sør-Trøndelag County, Norway (Fig. 3). Geographical location of the catchment ranges from 11. PT_GS_K model requires temperature, precipitation, radiation, relative humidity, and wind speed as forcing data. In this study, daily time series data of these variables were obtained from Statkraft (2018) with the exception of relative humidity.
Daily gridded relative humidity data was retrieved from ERA-interim (Dee et al., 2011). The model also requires the following physiographic data of each grid cell: average elevation, grid cell total area, and the areal fractions of forest, reservoir, lake, and glacier. Data for these physiographic variables were retrieved from two sources: the land cover data from Copernicus land monitoring service (2016)

Evaluation of behavioural parameter sets using observed streamflow
The behavioural model realizations identified using the coupled ML emulators and the limits of acceptability approach were  inter-annual variability in their performances (based on NSE) as compared to those identified using RF and NNET. A relatively higher inter-annual variability in average CR (0.66 to 0.79) for the validation periods was obtained when using RF.

The number of identified behavioural models has also shown an inter-annual variability for a given MLM and between
MLMs within a given calibration year. The highest and lowest number of behavioural models were respectively obtained in calibration years 2013 and 2014. Similarly, the highest number of behavioural models was obtained when using RF, while 15 the lowest record was obtained when using KNN.     (Table 4).

5
Sensitivity analysis is an important technique to assess the robustness of model based results and it is often performed in tandem with emulation based studies in order to determine which of the input parameters are more important in influencing the uncertainty in the model output (Ratto et al., 2012). Figure 7 shows the sensitivity of streamflow predictions to the model parameters based on the in-built variable importance assessment methods of the three MLMs trained to predict pLoA and Score. The relative measures of importance are scaled to have a maximum value of 100. The RF and KNN MLMs trained to 10 predict pLoA yielded similar relative importance of the model parameters. The catchment response parameters of the hydrological model, viz. c1, c2, and c3 have shown higher relative importance as compared to the snow and water balance parameters. On the other hand, the NNET trained to predict pLoA has yielded higher relative importance for wind scale (ws) and the rain/snow threshold temperature (tx) as compared to the linear (c2) and quadratic (c3) coefficients of the catchment response function. The RF and KNN MLMs trained to predict Score have also shown similar result to their equivalent

15
MLMs trained to predict pLoA with the exception of a swipe in the order of importance between the two least important parameters, fa and cv, when using RF. The result from the NNET trained to predict Score was less consistent with the result obtained from its corresponding MLM trained to predict pLoA. The former result was similar to the one obtained from the KNN trained to predict Score except that c3 was preceded by c1 and ws in the case of NNET. The snow coefficient of variation (cv) as well as the slow (sa) and fast (fa) albedo decay rates were the least important variables as identified using the three MLMs when applied to predict pLoA and Score. The relative importance of the model parameters obtained using the MLMs was generally consistent with the result obtained in previous study focused on parameter uncertainty analysis using the GLUE methodology (Teweldebrhan et al, 2018).

Discussion
The capability of MLMs as emulators of the MC simulation has been demonstrated in this and other similar studies. Machine 5 learning and other data-driven models have been applied as emulators to substitute complex and computationally intensive simulation models. These models have been referred in the literature as surrogate models (e.g. Yu et al., 2015) and metamodels (e.g. El Tabach et al., 2007). Emulators were reported to be particularly useful when a large number of simulations such as the MC simulation are required to be performed, for example, during optimization (Hemker et al., 2008) and sensitivity analysis (e.g. Reichert et al., 2011). The results from this study revealed that the MLMs trained with limited 10 sample size of artificially generated data from the simulation model were computationally efficient and providing reliable approximation of the underlying hydrological system. Similar advantages of MLM based emulators were also reported in previous studies (e.g. Kingston et al., 2008;Razavi et al., 2012).
The performance of the coupled MLMs in response to training sample size, however, varies from one MLM to another. of the identified behavioural models (i.e. a 1-3% decrease in average NSE) as compared to the ones identified using the 4000 samples. Only slight to no improvement was obtained in most of the evaluation years as a result of using behavioural models identified from the 4000 MC simulations as compared to the 95000 simulations when assessed using the available evaluation dataset and the streamflow evaluation metrics used in this study. Like most studies based on the GLUE methodology, the main focus of this study was, however, to get as much behavioural models as possible so as to encapsulate future uncertain conditions. The MLMs applied in this study and in other areas of application have both advantages and limitations. MLMs 5 are able to learn complex nonlinear system from a set of observations and usually yielding a high degree of accuracy as they are not affected by the level of understanding of the underlying processes in the system (Kingston et al., 2008). Furthermore, MLMs with the virtue of their generalization capability are relatively quick to run as simulations over an extended period of time are not required. However, since MLMs do not have any understanding of the modelled physical processes, they operate as black-box models with an accompanying dilemma on whether they would behave as intended under changing 10 future conditions (Olden and Jackson, 2002). Generally, MLMs have limited application in conditions that significantly deviate from historical norms. In this study, adequate size of training samples was used in order to represent different parts of the parameter dimensions. Furthermore, in many MLMs the notion of degrees of freedom is usually ignored when computing performance metrics during model training (Kuhn, 2008). Since these metric do not penalize model complexity (e.g. as in the case of adjusted R 2 ), they tend to favour more complex fits over simpler models. In some MLMs a 15 regularization approach is employed to adjust the cost function in such a way that the model learns slowly and thereby minimize overfitting (Nielsen, 2018). In this study, for example, the L2 regularization was used with the NNET model.
In and year 2014, respectively (Table 4). The Nash-efficiency computed using the row streamflow data (NSE) gives more emphasis to high-flow than low-flow values, while the one computed using the log-transformed data (LnNSE) gives more structure that is transposable in both space and time (e.g. Clark et al., 2011;Kavetski and Fenicia, 2011 Mekanik et al. (2013) observed better performance of NNET as compared to KNN. A similar inconsistent result was also observed in another study focused on monthly streamflow 10 forecasting with a higher cumulative ranking of NNET as compared to KNN under nonlinear conditions (Modaresi et al., 2018). However, the later was better in reproducing the observations under linear condition; and they concluded that the variability in relative performance of the MLMs may be attributed to the differences between study sites, data sets, and structure of the MLMs as well as whether the relationship between the predictor and predicted variables is linear or nonlinear.
The main challenges with KNN appear when data are sparse, although this problem can be partly overcome by choosing the 15 number of neighbours adapted to the concentration of the data (Burba et al., 2009).
In this study, different trials were conducted in order to assess effects of the model structure and hyper-parameter values and thereby to get the optimal MLMs (result not shown). For example, the NNET model with multiple hidden layers resulted to lower performance than the one with single hidden layer. This result is consistent with the general notion, that for many applications a single hidden layer is adequate to model any nonlinear continuous function (e.g. Hsieh,2009;Snauffer, et al., 20 2018). Similarly, use of a linear activation function has yielded NNET models with better accuracy as compared to the commonly used sigmoidal function. Efficiency of the emulators also depends on their respective hyper-parameters values.  In this study, the concept of equifinality was employed for parameter identification and uncertainty analysis, i.e. ensemble of behavioural models were identified with subsequent application for streamflow prediction at different quantile 5 values. In other studies focused on the concept of optimality, machine learning methods were used to directly estimate prediction uncertainty based on MC based uncertainty or historical model residuals from an optimal model. For example, in the MLUE method Shrestha et al., 2014) MLMs were trained using MC-based uncertainty with subsequent application of the trained MLMs to directly predict model output uncertainty associated with new input datasets.
Similarly, clustering and machine learning techniques were used to estimate the prediction uncertainty associated with a 10 process model through analysis of its residuals during uncertainty estimation based on local errors and clustering (UNEEC) . In further study, the UNEEC approach was extended in a way that it can explicitly take into account for parametric uncertainty (Pianosi et al., 2010). Similarly, Wani et al. (2017) have effectively applied instancebased learning using KNN in order to generate error distributions for predictions of an optimal model. Generally, the UNEEC and its variants are computationally more efficient than those based on the equifinality concept since in the former

15
case only a single model run is required during the forecast period. Uncertainty analysis using emulators coupled to the residual-based GLUE is also expected to entail less computational cost as compared to those coupled with GLUE LoA and its variants.
In previous emulator based uncertainty analysis studies, the residual-based GLUE methodology was coupled with the MLMs (e.g. Yu et al., 2015). Here, we used the limits of acceptability concept in order to overcome some of the limitations associated with the residual-based approach. The original formulation of the GLUE LoA is, however, too strict for use in 5 identification of behavioural models and it may result to rejection of useful models and thereby making type II error. In order to minimize such errors, one of the commonly used approaches was to relax the limits (e.g. Blazkova and Beven, 2009).
However, in previous study it was observed that relaxing the limits was not a feasible option in simulations that involve time series data with dynamic observational error characteristics as in the case of continuous rainfall-runoff modelling. Relaxing the limits beyond 25% while keeping the threshold pLoA at 100% have yielded to the inclusion of non-behavioural models, 10 leading to very low performance during the validation period. When it comes to time efficiency of the emulators Accordingly, in an attempt to balancing between type I and type II errors, the time-relaxed limits of acceptability approach was introduced (Teweldebrhan et al., 2018). This approach was employed in this study and it relaxes the strict criterion of the original formulation that demands all model predictions to fall within their respective observation error bounds. When using this approach, the minimum threshold for the percentage of time steps where model predictions are expected to fall 15 within the limits is defined as a function of the level of modelling uncertainty.
A combined likelihood measure based on the persistency of model realizations in reproducing the observations within the observational error bounds (pLoA) and a normalized absolute bias (Score) was used in previous study focused on snow data assimilation (Teweldebrhan et al., 2019). The Score values were rescaled with due consideration to pLoA, whereby the two efficiency measures were given equal importance in estimating the final weight of each model. In this study, the acceptable 20 models were first identified based on pLoA only and the Score was used to weigh the relative importance of the acceptable models in predicting the quantile streamflow values. Another trial that involved selection of the top 100 best performing models using a combined likelihood with equal weights given to pLoA and Score yielded relatively low validation result as compared to using pLoA alone for the identification of behavioural models (result not shown). This can be attributed to the difference in nature of these likelihood measures. pLoA considers only the percentage of time steps where the model 25 predictions have fallen within the observation error bounds. This renders pLoA to be less sensitive to the variability in relative performances of the model between time steps. On the other hand, Score can be highly affected by predictions of few time steps that are very close or too far from the observed value, albeit within the limits. The predictability of independent variables varies from one to another. Thus, the application of emulation methods to predict pLoA in this study provides a further insight on the potential and scope of the standard emulator, i.e. NNET and the additional emulators used in 30 this study, i.e. RF and KNN to predict response surfaces other than the residual-based likelihood measures that were applied in previous studies.

Conclusions
Three machine learning models (MLMs), i.e. Random forest (RF), K-Nearest Neighbours (KNN), and an Artificial Neural-Network (NNET) were constructed to emulate the time consuming MC simulation and thereby overcome its computational 35 burden when identifying behavioural parameter sets for a distributed hydrological model. Two sets of MLMs were trained using the randomly generated uncertain model parameter values as covariates, and two efficiency criteria defined within the realm of the limits of acceptability concept as target variables. One of the efficiency criteria used in this study was a measure of model persistency in reproducing the observations within the observation error bounds (pLoA), while the other one was based on a normalized absolute bias (Score). The coupled MLMs and time-relaxed limits of acceptability approach employed in this study were able to effectively identify behavioural parameter sets for the hydrological model. The MLMs were able to adequately reproduce the response surfaces for the test and validation samples with an R 2 value of 0.7 to 0.92 for the test dataset, although the evaluation metrics have shown variability both between the MLMs and the analysis years. The sensitivity analysis conducted using the in-built algorithms of the three MLMs have yielded comparable order of precedence in relative variable importance when trained using pLoA and Score as target variables. The result was generally consistent with the one obtained from previous study conducted using the residual-based GLUE methodology. The catchment response parameters of the hydrological model, i.e. c1, c2, and c3 have shown higher relative importance as 20 compared to the snow and water balance parameters. Thus, the efficiency of MLM based emulators in doing sensitivity analysis for computationally expensive models was also further proven in this study.
Data availability. The underlying hydrologic observations for this analysis were provided by Statkraft AS and are proprietary within their hydrologic forecasting system. However, the data may be made available upon request.

25
Competing interests. The authors have no conflict of interest.