Modeling and interpreting hydrological responses of sustainable urban drainage systems with explainable machine learning methods

Sustainable urban drainage systems (SuDS) are decentralized stormwater management practices that mimic natural drainage processes. The hydrological processes of SuDS are often modeled using process-based models; however, the application of these models can be challenged by insufficient data regarding the physical properties and drainage processes of the catchment. To address this problem, this study proposes a machine learning method to directly learn the statistical correlations between the hydrological responses of SuDS and the preceding rainfall time series collected at sub-hourly time scales from observation data. A feature engineering method is developed to facilitate the application of machine learning algorithms to model high-dimensional time series data. A local feature attribution method is used to explain the basis of each discharge prediction, i.e., the contribution of rainfall at each time step is identified for each prediction. These local explanations are then combined to investigate the correlations between the input and output variables learned by the models and to infer the modeled hydrological processes. The proposed methods are applied to two SuDS catchments in the U.S. with different sizes, SuDS practice types, and data availabilities for discharge prediction. The resulting models have high prediction accuracies (the Nash–Sutcliffe model efficiency coefficient (NSE) > 0.70 and the coefficient of determination (R²) > 0.70 for all models). The local explanations of the predictions are analyzed and the results show that the discharge predictions are mostly affected by rainfall from the most recent time steps, and that the rainfall–discharge correlations learned by some models are not realistic. These explanations are further used to decompose hydrographs and identify the runoff source.
This study shows that machine learning methods are a useful tool for predicting the hydrological responses of SuDS, and the explanation methods can provide insights on the structures and input–output correlations learned by the models, both of which enhance the understanding of the modeled catchment processes.

It is essential to be able to predict the hydrological performance of SuDS and understand the involved hydrological processes to support their design optimization, planning, and related policy making (Pappalardo and La Rosa, 2020; Yang and Chui, 2018).
A number of numerical modeling methods of various complexities have been developed for SuDS (Elliott and Trowsdale, 2007; Liu et al., 2014). The simplest methods are perhaps those developed based on empirical equations for computing the drainage impact of different land use types in terms of peak flow or runoff volume. For instance, the rational method and SCS runoff curve number method are modified and used in Montalto et al. (2007) and Damodaram et al. (2010) to study the effectiveness of SuDS at catchment scales. Empirical equation-based methods can be useful in preliminary designs to rapidly estimate some key performance metrics of SuDS. However, these methods poorly reflect detailed SuDS design variations. For example, it is often unclear how to vary the curve number in the SCS method for different green roof substrate depths (Fassman-Beck et al., 2016). The relatively coarse temporal and spatial resolutions of empirical equation-based methods can also limit their application.
Process-based models are another approach to modeling SuDS, in which physically based or empirical equations are used to characterize the involved hydrological processes. SuDS are typically represented in process-based models as hydrological functional units, whose properties are defined using a set of parameters. Commonly used models, including SWMM and MUSIC, are reviewed in Eckart et al. (2017) and Elliott and Trowsdale (2007). A process-based model can ideally be set up for a catchment using only its measured physical properties (Refsgaard and Storm, 1990). However, not all parameters are measurable or can be measured at a reasonable cost. Model calibrations are often required and critical for SuDS (Platz et al., 2020). For example, in the SWMM, the initial soil moisture deficit parameter of SuDS is often determined through calibration (Rosa et al., 2015). However, model calibration can be difficult and subject to considerable uncertainty (Schuol and Abbaspour, 2006). The application of process-based models faces several challenges. First, it can be difficult to choose a suitable model to represent SuDS. SuDS designs can substantially vary in terms of their installation location, layer composition, and materials.
Unfortunately, models that are capable of modeling all of these design variants may not exist. Modular-form models, such as GIFMod, represent SuDS as a collection of hydrological functional components arranged in 1D or 2D grids, which allows different design variants to be modeled. However, these models are arguably more complex and do not necessarily produce more accurate predictions despite the higher data requirements for the model setup. Second, the complex hydro-environmental processes of SuDS and surrounding environments are difficult to model using existing models.
For instance, the SWMM does not account for macropore flow in the SuDS soil layer (Niazi et al., 2017), and models that assess the performance of SuDS in shallow groundwater environments (Zhang and Chui, 2019) and cold climates (Johannessen et al., 2017) are limited. Third, the assumptions used in process-based models may be invalid in some cases due to unknown issues related to construction, maintenance, or physical property changes during a SuDS service life. For instance, the permeability of porous pavement changes over time due to factors such as clogging (Yong et al., 2013), but the processes that contribute to clogging and its associated impact on infiltration processes remain poorly understood. Complex environmental processes or human errors, such as changes in ecological processes (Levin and Mehring, 2015) and inferior constructions, are often unknown to modelers and thus difficult to predict. Other model types that rely less on assumptions are therefore preferred under certain circumstances.
Machine learning methods, also referred to as data-driven modeling, predictive modeling, and statistical learning, may be used to learn the statistical correlations between the SuDS state variables of interest from observed data (Solomatine and Ostfeld, 2008). Machine learning methods have been widely used in various fields in hydrology (Maier and Dandy, 2000), such as rainfall-runoff modeling (Worland et al., 2018), evapotranspiration forecasting (Karbasi, 2018), and groundwater modeling (Jang et al., 2020).
Machine learning methods have only been used in a few studies to investigate the hydrological processes of SuDS. For instance, multiple linear regression methods were used in Eric et al. (2015) and Khan et al. (2013) to predict the hydrological effectiveness of SuDS, such as runoff volume reduction, based on factors such as inflow volume, antecedent soil moisture content, and SuDS implementation levels. Eric et al. (2015) concluded that the linear regression approach is useful for predicting the hydrological performance of SuDS in large catchments with thousands of lots and potentially eliminates the need to build complex process-based catchment models. Li et al. (2019) used neural network models to predict the peak flow and runoff volume resulting from a rainfall event on a university campus with SuDS implementations based on rainfall event characteristics, such as rainfall depth and event duration. Hopkins et al. (2020) used linear regression models to predict the timing and volume of runoff resulting from a rainfall event in urban catchments with and without SuDS implementations. The studies mentioned above focused on predicting the long-term or rainfall event-level hydrological performance of SuDS.
However, there is currently insufficient literature on the application of machine learning methods to model the hydrological responses of SuDS at regular time steps, e.g., daily, hourly, or sub-hourly. Yang and Chui (2019) showed that machine learning methods, such as deep learning methods and random forest methods, are useful for predicting the runoff response of SuDS at sub-hourly time scales, provided that the model's input variables are appropriate. However, a method to derive these input variables was not described in their study.
The lack of popularity of machine learning models in SuDS-related studies may be explained by several factors. First, the hydrological responses of SuDS are controlled by relatively long-term hydrometeorological conditions. Thus, modeling the responses of SuDS at fine temporal scales requires a high-dimensional hydrometeorological time series to be used as input, which is difficult for machine learning algorithms that are not specifically designed for modeling data sequences (Nielsen, 2019). Machine learning methods may also not offer clear advantages over equation-based methods when applied to study the performance of SuDS at the rainfall event level. It is therefore desirable to use a machine learning method that can handle high-dimensional time series data to model the hydrological responses of SuDS at sub-hourly time scales.
Machine learning models are also often criticized for a lack of transparency because the basis for each prediction can be difficult to analyze. Such information is important for model diagnosis, understanding the involved processes, and supporting decision making (Lundberg and Lee, 2017). To explain the basis of model predictions in hydrology, previous studies have analyzed the rules used in model trees for making predictions (Solomatine and Dulal, 2003), examined the temporal evolution of hidden values in neural networks (Kratzert et al., 2019), referred to similar past events using instance-based learning (Wani et al., 2017), and used permutation feature importance methods to assess the importance of the predictors in machine learning models (Schmidt et al., 2020). However, many of the explanation methods used in previous studies were ad hoc and therefore not easily transferred to other studies. Most of the explanation methods used in the current hydrological literature are also global interpretation methods, which are useful for understanding general model-learned relationships (Murdoch et al., 2019). For instance, these methods can be applied to compute feature importance scores, identify important interactions between features, and assess the statistical significance of predictors (Murdoch et al., 2019). However, in cases where the hydrological processes of SuDS are modeled at sub-hourly time scales, it may be useful to explain the basis of each prediction to better understand the involved processes and increase the transparency of the machine learning approaches. Local feature attribution methods may be useful for this task because they can provide explanations regarding why a specific prediction is made by assessing the contribution of each feature to the prediction (Sundararajan and Najmi, 2020).
There is a limited amount of publicly available data concerning the hydrological processes of SuDS because they represent relatively new technologies and monitoring is often conducted by local authorities and other interested parties. A lack of data is also common for other data types that are useful in urban hydrological studies, such as the soil moisture content and soil temperature (Schaffitel et al., 2020). It may therefore be useful to demonstrate the application of machine learning methods to predict SuDS discharge based on preceding rainfalls, because rainfall and discharge are the most commonly monitored features at SuDS sites and several rainfall-discharge datasets are available online (e.g., the United States Geological Survey Water Data for the Nation, https://waterdata.usgs.gov).
This study aims to evaluate the usefulness of explainable machine learning methods in modeling and interpreting the hydrological responses of SuDS to rainfall at sub-hourly time scales. Specific model training methods are developed to facilitate the modeling of high-dimensional time series data. The prediction accuracies of the resulting machine learning models are compared with those of commonly used linear regression models and process-based hydrological models. The basis of each discharge prediction is explained using local feature attribution methods and attributed to the rainfalls from different time steps using a method proposed in this study. The outcome is applied to analyze the model structure and the hydrological processes being modeled.

Local and global methods
Let the real-valued function $f: \mathbb{R}^M \to \mathbb{R}$ be the machine learning model to be explained. The input variable of $f$ is $X \coloneqq (X_1, X_2, \ldots, X_M)$, which is an $M$-dimensional vector of numerical features. Let $x \coloneqq (x_1, x_2, \ldots, x_M)$ denote a sample of $X$. For a particular $x$, local feature attribution methods aim to quantify the contribution of each $x_i$ to the model prediction $f(x)$ (Janzing et al., 2019). In contrast, global interpretation methods, also known as dataset-level interpretation methods, aim to understand the overall structure of $f$ that leads to different predictions when $X$ assumes different values (Ahmad et al., 2018; Guidotti et al., 2019). More information on the various interpretation methods can be found in Molnar (2021).

Shapley values and their applications in feature attribution
The problem of distributing payoffs among players has been extensively studied in game theory. In coalitional games with transferable payoffs, coalitions (i.e., cooperative groups) can be formed by individual players and can generate payoffs to be distributed among their members. Shapley (2016) showed that a unique payoff profile exists for every coalitional game that satisfies a set of desirable axioms, which makes the payoff assignment "fair." Desirable axioms are discussed in detail later in the paper. The function that maps a coalitional game to the desired payoff profile is known as the Shapley value. For a coalitional game of $M$ players, the payoff assigned to player $i$ by the Shapley value, $\phi_i$, is computed as

$$\phi_i = \frac{1}{|\mathcal{R}|} \sum_{\pi \in \mathcal{R}} \left[ v\left(S_i^{\pi} \cup \{i\}\right) - v\left(S_i^{\pi}\right) \right], \qquad (1)$$

where $\pi$ is a permutation of the players, which specifies the order in which each player joins the coalition, $\mathcal{R}$ is the space of all player permutations, $S_i^{\pi}$ is the set of players that join the coalition prior to player $i$, and $v$ is a set function that maps every non-empty subset of players (i.e., each member of the power set that is not empty) to a real number, where $v$ gives the total payoff generated by a coalition. Therefore, $\phi_i$ can be interpreted as the expected marginal contribution of player $i$ (i.e., $v(S_i^{\pi} \cup \{i\}) - v(S_i^{\pi})$) to coalitions that are randomly formed by sequentially selecting players. More information on the Shapley value can be found in Osborne and Rubinstein (1994).
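As a concrete illustration of Eq. (1), the Shapley values of a small game can be computed exactly by enumerating all player orderings. The sketch below is not from the original study; the three-player game and the convention $v(\emptyset) = 0$ are illustrative assumptions.

```python
import itertools

def shapley_values(players, payoff):
    """Exact Shapley values: average each player's marginal contribution
    over all player orderings, as in Eq. (1). Assumes payoff(empty set) = 0."""
    phi = {p: 0.0 for p in players}
    perms = list(itertools.permutations(players))
    for perm in perms:
        coalition = set()
        for p in perm:
            before = payoff(frozenset(coalition)) if coalition else 0.0
            coalition.add(p)
            phi[p] += payoff(frozenset(coalition)) - before
    return {p: total / len(perms) for p, total in phi.items()}

# Toy game: players 1 and 2 are perfect substitutes worth 100 together or
# alone; player 3 always adds a fixed payoff of 20.
def v(coalition):
    total = 100.0 if (1 in coalition or 2 in coalition) else 0.0
    if 3 in coalition:
        total += 20.0
    return total

print(shapley_values([1, 2, 3], v))  # {1: 50.0, 2: 50.0, 3: 20.0}
```

By symmetry, players 1 and 2 split the 100 units evenly, and player 3 receives exactly its fixed marginal contribution, consistent with the "fairness" axioms discussed in the text.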
There are considerable similarities between the tasks of attributing $f(x)$ to each $x_i$ and distributing payoffs among players in coalitional games. If a feature attribution task is formulated as a coalitional game, then each $x_i$ is regarded as a player and its contribution to $f(x)$ can be fairly determined using the Shapley value. However, $f$ cannot be used as $v$ in Eq. (1) because it is not defined for the subsets of $\{x_1, x_2, \ldots, x_M\}$. Lundberg and Lee (2017) thus defined $v$ using the observational conditional expectation, which is the expected value of $f(X)$ when the feature values of $S$ in $x$ are known, as in

$$v(S) = E\left[f(X) \mid X_S = x_S\right], \qquad (2)$$

or alternatively using the interventional expectation,

$$v(S) = E\left[f(X) \mid \operatorname{do}(X_S = x_S)\right], \qquad (3)$$

where $\operatorname{do}(X_S = x_S)$ represents an intervention that sets the feature values of $S$ in $X$ to $x_S$. The difference between the two expectations is that the observational expectation is the expected value of $f(X)$ conditioning on $X_S = x_S$, and the interventional expectation is the expected value of $f(X)$ when $X_S$ is set to $x_S$ and the remaining feature values not in $S$ are taken from $X$ without conditioning on $X_S = x_S$.
After $v$ is defined, Eq. (1) can be used to compute $\phi_i(f, x)$, which can be considered as the contribution of $x_i$ to $f(x)$. This feature attribution method was proposed by Lundberg and Lee (2017) and is termed SHAP (SHapley Additive exPlanations). The two definitions of $v$ in Eqs. (2) and (3) result in two versions of $\phi_i(f, x)$, which are termed observational and interventional SHAP values, respectively. Lundberg and Lee (2017) showed that many local feature attribution methods, such as LIME (Ribeiro et al., 2016) and DeepLIFT (Shrikumar et al., 2017), compute feature attributions that are approximations of the SHAP values.
Both the observational and interventional SHAP values satisfy a set of desirable axioms, which is ensured by the properties of the Shapley values. (1) Local accuracy: the sum of $\phi_i(f, x)$ matches $f(x)$, which is given as

$$f(x) = \phi_0(f) + \sum_{i=1}^{M} \phi_i(f, x), \qquad (4)$$

where $\phi_0(f)$ is $v(\emptyset)$, i.e., the expected value of $f(X)$. However, it has been suggested that both observational and interventional SHAP values are useful. The observational SHAP values are claimed to be "true to the data" because they are effective in identifying the true correlations between the features and the modeled outcome, whereas the interventional SHAP values are "true to the model" because they do not credit the features that are unused by the model. The observational SHAP value of $x_i$ generally measures the value of knowing $x_i$ to predict the outcome, and the interventional SHAP value of $x_i$ corresponds to the expected change in the model prediction when the feature $X_i$ is set to $x_i$. Both the observational and interventional SHAP values are used in this study and the results obtained using the two methods are compared.
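The local accuracy axiom can be verified directly for a linear model with independent features, where the SHAP value of feature $i$ reduces to the standard closed form $w_i (x_i - E[x_i])$. The weights, means, and sample below are illustrative values, not from the study:

```python
# Local accuracy check for a linear model f(x) = b + sum_i w_i * x_i with
# independent features, where phi_i = w_i * (x_i - E[x_i]).
weights = [2.0, -1.0, 0.5]
bias = 3.0
feature_means = [1.0, 4.0, 2.0]   # E[x_i] under the background distribution

def f(x):
    return bias + sum(w * xi for w, xi in zip(weights, x))

x = [3.0, 1.0, 6.0]
phi0 = f(feature_means)           # E[f(X)] = f(E[X]) for a linear model
phi = [w * (xi - m) for w, xi, m in zip(weights, x, feature_means)]

# Local accuracy: the base value plus all contributions recovers f(x).
assert abs(phi0 + sum(phi) - f(x)) < 1e-9
print(phi0, phi, f(x))
```

For linear models the observational and interventional definitions coincide when features are independent, which makes this a convenient sanity check.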

Modeling hydrological responses of SuDS using machine learning methods 185
Let $y_t$ denote the hydrological response of a SuDS site at time step $t$ and $x_t$ denote the time series of the hydrometeorological conditions and other factors measured on and before time step $t$. The hydrological responses of SuDS at sub-hourly time scales are of interest in this study because they are consistent with the typical temporal resolutions chosen for SuDS monitoring.
$y_t$ can be represented as a function of $x_t$, given as

$$y_t = F(x_t),$$

where $x_t$ is a random vector of multiple elements, in which $r_{t-i}$ is the rainfall depth recorded at time step $t-i$ for the preceding time step and the remaining elements represent measurements of the other antecedent hydrometeorological factors. The goal of machine learning is to approximate $F$ based on the observed samples of the input-output pairs $(x_t, y_t)$.

Feature engineering methods
The hydrological responses of SuDS can be affected by relatively long-term hydrometeorological conditions that occurred in the past.
$x_t$ can thus be a long time series of hydrometeorological condition measurements. As pointed out by Nielsen (2019), many machine learning algorithms are not designed for modeling time series data. Therefore, long time series are often converted into lower-dimensional features that are then used as the input variables of machine learning models. This input variable transformation process is known as feature engineering and is expected to produce higher-quality models (Kuhn and Johnson, 2019). The task of machine learning then becomes finding a function that predicts $y_t$ based on the features $g(x_t)$, where $g$ is an arbitrarily defined function that transforms $x_t$ into lower-dimensional features or features that are useful for predicting $y_t$. Previous studies often treated $g$ as a function that extracts summary statistics from a rainfall time series, such as the duration and depth of a rainfall event (Yang and Chui, 2019). However, the information contained within the summary statistics may be insufficient to predict $y_t$ at sub-hourly time scales.
A rainfall time series may be represented by aggregated rainfall depths recorded during different intervals, which can be of lower dimensions. The rainfall time series recorded between time steps $t-a$ and $t-b$ can be represented by a rainfall depth $R_{a,b}$, given as

$$R_{a,b} = \sum_{i=b}^{a} r_{t-i}.$$

In this case, $g(x_t) \coloneqq (R_{a_1,b_1}, R_{a_2,b_2}, \ldots)$. However, an approach to optimally define the set of $(a, b)$ pairs to create rainfall depth features is not known a priori; thus, various sets of these values must be tested.
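The aggregation of a rainfall series into depth features can be sketched as follows. The series values and interval pairs are illustrative; `rain[i]` is taken to hold the rainfall depth at lag $i$ (i.e., time step $t-i$):

```python
def rainfall_depth(rain, a, b):
    """Aggregated depth R_{a,b}: sum of the rainfall recorded between
    time steps t-a and t-b (inclusive), with rain[i] = depth at t-i."""
    return sum(rain[i] for i in range(b, a + 1))

def make_features(rain, intervals):
    """Transform a rainfall series into lower-dimensional depth features,
    one feature per (a, b) interval pair."""
    return [rainfall_depth(rain, a, b) for a, b in intervals]

# rain[0] is the current time step t, rain[1] is t-1, and so on.
rain = [0.0, 2.0, 5.0, 1.0, 0.0, 0.0, 3.0, 4.0]
features = make_features(rain, [(0, 0), (2, 1), (7, 3)])
print(features)  # [0.0, 7.0, 8.0]
```

The eight-step series is reduced to three features while preserving the total depth within each interval.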
This study proposes a simple method to systematically select cut points along the time axis, which form a series of intervals for defining $R_{a,b}$. As shown in Fig. 1a, the selection of cut points is controlled by three hyperparameters. The first specifies the maximum lag: a cut point is placed such that the rainfall data recorded prior to this lag are considered irrelevant for predicting $y_t$. The second specifies the most recent interval, in which the rainfall data are considered to be most relevant for predicting $y_t$, so a cut point is placed between every two neighboring time steps within this interval. The third specifies the number of cut points placed between the most recent interval and the maximum lag, such that the neighboring cut points correspond to intervals whose lengths roughly form an arithmetic sequence. Several aggregation options are further considered for handling the rainfall data relative to these cut points; for instance, the rainfall recorded beyond the maximum lag may either be excluded from the features or explicitly set to zero.
Representing a rainfall time series using a set of $R_{a,b}$ can reduce the data dimension at the cost of losing information regarding the temporal rainfall distribution. Note that fewer cut points are selected for rainfalls in the long-term past (e.g., a few days), implying that they play less important roles in predicting $y_t$. This is reasonable considering the relatively fast response time of SuDS (DeBusk et al., 2011). Gauch et al. (2020) also showed that the hydrometeorological time series recorded in the long-term past can be represented using a coarser temporal resolution in machine learning models built for rainfall-runoff modeling without deteriorating their prediction accuracy. In the proposed method, the three hyperparameters and aggregation options control the aggregation level and the approach by which rainfall data recorded at different time steps are aggregated. The optimal hyperparameters and options are determined using the resampling technique described below.
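One possible reading of the cut-point scheme is sketched below. The exact placement rule in the original study is not fully specified here, so the function names, the hyperparameter names (`n_recent`, `n_max`, `n_cuts`), and the rounding scheme are hypothetical; the sketch only reproduces the qualitative behavior of dense cuts near the present and arithmetically growing intervals further back:

```python
def cut_points(n_recent, n_max, n_cuts):
    """Cut-point lags along the time axis: one cut after every time step
    within the recent window [0, n_recent], then n_cuts further cuts out to
    n_max whose spacing grows roughly as an arithmetic sequence."""
    cuts = list(range(n_recent + 1))        # dense cuts in the recent window
    span = n_max - n_recent
    total = n_cuts * (n_cuts + 1) // 2      # 1 + 2 + ... + n_cuts
    pos = n_recent
    for k in range(1, n_cuts + 1):
        pos += max(1, round(span * k / total))
        cuts.append(min(pos, n_max))
    cuts[-1] = n_max                        # the oldest cut sits at n_max
    return sorted(set(cuts))

def intervals_from_cuts(cuts):
    """(a, b) pairs defining the depth features R_{a,b}; a cut at lag c
    closes the interval that ends at lag c."""
    out, prev = [], -1
    for c in cuts:
        out.append((c, prev + 1))
        prev = c
    return out

cuts = cut_points(n_recent=4, n_max=48, n_cuts=4)
print(cuts)                      # [0, 1, 2, 3, 4, 8, 17, 30, 48]
print(intervals_from_cuts(cuts))
```

With these illustrative settings, a 49-step series is reduced to 9 depth features: 5 single-step features for the most recent window and 4 increasingly long intervals (lengths 4, 9, 13, 18) further back.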

Contribution of rainfall at each time step to runoff predictions
The SHAP value of a rainfall depth feature $R_{a,b}$ for a particular input $x$ specifies its contribution to the prediction of $y_t$ in the machine learning model and is denoted as $\phi_{a,b}(x)$. $\phi_{a,b}(x)$ can be distributed among the rainfall data to determine the extent to which the rainfall at each time step contributes to the prediction of $y_t$. However, a rainfall depth datum $r_{t-i}$ can be used in multiple rainfall depth features in the machine learning model. Its overall contribution to predicting $y_t$ is the sum of the contributions associated with each rainfall depth feature. The contribution $c_i(x)$ of the rainfall depth datum recorded at time step $t-i$ can be computed following

$$c_i(x) = \sum_{(a,b) \in P_i} \phi_{a,b}(x) \, \frac{r_{t-i}}{R_{a,b}},$$

where $P$ is the set of $(a, b)$ pairs used to define the rainfall depth features $R_{a,b}$ of the machine learning models, $P_i$ is the subset of $P$ with $b \le i \le a$, $r_{t-i}$ is the rainfall depth recorded at $t-i$ for input sample $x$, and $R_{a,b}$ is the rainfall depth recorded between $t-a$ and $t-b$.
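The proportional distribution of a feature's SHAP value over the time steps it covers can be sketched as follows. The rainfall series and SHAP values are illustrative, and zero-depth intervals are assumed to carry no distributable contribution:

```python
def timestep_contributions(rain, phi):
    """Distribute the SHAP value of each depth feature R_{a,b} among the
    time steps it covers, proportionally to each step's rainfall depth:
    c_i = sum over (a, b) containing i of phi_{a,b} * r_i / R_{a,b}."""
    c = [0.0] * len(rain)
    for (a, b), phi_ab in phi.items():
        depth = sum(rain[i] for i in range(b, a + 1))
        if depth == 0.0:
            continue  # no rainfall in the interval: nothing to distribute
        for i in range(b, a + 1):
            c[i] += phi_ab * rain[i] / depth
    return c

rain = [0.0, 2.0, 5.0, 1.0, 0.0, 3.0]          # rain[i] = depth at t-i
phi = {(0, 0): 0.0, (2, 1): 1.4, (5, 3): 0.8}  # SHAP value per depth feature
c = timestep_contributions(rain, phi)
print(c)  # [0.0, 0.4, 1.0, 0.2, 0.0, 0.6]
assert abs(sum(c) - sum(phi.values())) < 1e-9  # contributions are conserved
```

The final assertion illustrates that the distribution preserves the total attribution, so the per-time-step contributions still sum to the feature-level SHAP values.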
After the contribution of rainfall at each time step is computed, the average rainfall age $T(x)$ that contributes to the runoff prediction, weighted by importance, can be computed as

$$T(x) = \frac{\sum_{i} i \, |c_i(x)|}{\sum_{i} |c_i(x)|}.$$

The absolute value of $c_i(x)$ is used here because this term can sometimes be negative, and it is not logical to assign a negative weight to the rainfall age $i$. Thus, $T(x)$ should only be interpreted as the average age of the rainfalls that control the runoff prediction, but not the average time that the stormwater spends in the catchment prior to draining, i.e., the water age (Walker and Krabbenhoft, 1998).

Combining local explanations to understand model structures and rainfall-runoff correlations

Lundberg et al. (2020) showed that the SHAP values of multiple samples can be combined to achieve a global understanding of the model structure and system processes. In this study, the contribution of rainfall at each time step to the runoff prediction, $c_i(x)$, is computed for each sample. The contributions obtained for various collections of samples can be combined to better understand the model structure and rainfall-runoff correlation for specific regions in the input domain. For instance, the input samples that correspond to high flow can be collectively analyzed if high flow predictions are of interest, and the global model structure can be inspected by analyzing the explanations for all samples. The average contribution associated with a set of samples can be characterized by its arithmetic mean because the contributions are additive (Eq. (4)). The average contribution of $r_{t-i}$ to the runoff prediction among the input samples in set $D$, $\bar{c}_i(D)$, can therefore be computed using

$$\bar{c}_i(D) = \frac{1}{|D|} \sum_{x \in D} c_i(x),$$

where $|D|$ is the number of elements of $D$.

$c_i(x)$ can occasionally be negative in some models; positive and negative $c_i(x)$ values may therefore cancel out when summing the values from multiple samples. The absolute value of $c_i(x)$ thus indicates the importance of $r_{t-i}$ for the runoff prediction and can be used when aggregating the contributions of multiple samples. The average importance of $r_{t-i}$ for the runoff prediction among samples in set $D$, $\bar{m}_i(D)$, can therefore be computed according to

$$\bar{m}_i(D) = \frac{1}{|D|} \sum_{x \in D} |c_i(x)|.$$
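The importance-weighted rainfall age and the sample-averaged importance can be sketched as follows. The per-time-step contribution vectors are illustrative placeholders for values that would be derived from a trained model's SHAP explanations:

```python
def rainfall_age(contribs):
    """Importance-weighted average age (lag) of the rainfall driving one
    prediction: T = sum_i i * |c_i| / sum_i |c_i|."""
    total = sum(abs(c) for c in contribs)
    return sum(i * abs(c) for i, c in enumerate(contribs)) / total

def mean_importance(samples):
    """Average absolute contribution of each time step over a sample set,
    so that positive and negative contributions do not cancel out."""
    n = len(samples)
    return [sum(abs(s[i]) for s in samples) / n for i in range(len(samples[0]))]

# Hypothetical per-time-step contributions c_i(x) for three predictions.
samples = [
    [0.0, 0.4, 1.0, 0.2],
    [0.1, 0.6, 0.3, 0.0],
    [0.0, 0.2, -0.2, 0.4],
]
print([round(rainfall_age(s), 2) for s in samples])  # [1.88, 1.2, 2.25]
print(mean_importance(samples))
```

Note that the third sample contains a negative contribution; taking absolute values before averaging prevents it from masking that time step's influence.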

Other global interpretation methods
There exist some commonly used global feature importance measures for different types of machine learning models. Three global feature importance measures that are specifically designed for decision tree-based models are considered in this study to compare with the SHAP-based importance measure described above. A typical decision tree has a flowchart-like structure with a root node that connects to a number of leaves (i.e., terminal nodes) via internal-node chains. Each leaf holds a prediction and each non-leaf node specifies a test, where an input sample is compared with a threshold value of the feature stored at that node. The test result determines which internal node or leaf is visited next. The prediction made for an input sample is the value of the leaf it ultimately selects. A detailed introduction to tree-based models can be found in Myles et al. (2004). The three measures considered in this study are gain, cover, and frequency. The gain of a feature is the relative value of the improvement evaluated at all of the nodes that split on this feature. In the tree-based models used in this study, the improvement of a split reflects the reduction of the objective function that the machine learning algorithm tries to minimize. The gain of a feature can be roughly regarded as its associated improvement of the prediction accuracy (Chen and He, 2020). The cover of a feature is the relative value of the coverage associated with all of the nodes that split on this feature. For the runoff prediction problems examined in this study, the coverage of a node represents the number of samples from the training dataset that reach this node when passing the entire dataset through the trees. The frequency of a feature is the relative number of times that this feature has been used in the tree nodes to determine the splits. Computation details of these three measures can be found in Chen and Guestrin (2016) and Chen and He (2020).
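Frequency and cover can be illustrated on a hand-built toy ensemble; gain is omitted here because it requires the training-time objective statistics of each split. The tree structure, feature names, and cover counts below are illustrative, not taken from the study:

```python
# Toy ensemble of two decision trees. Each internal node records the split
# feature and its cover (training samples reaching the node); leaves are
# plain prediction values.
trees = [
    {"feat": "R_1_0", "cover": 100,
     "left": {"feat": "R_4_2", "cover": 60, "left": 0.1, "right": 0.7},
     "right": 1.5},
    {"feat": "R_1_0", "cover": 100,
     "left": 0.05,
     "right": {"feat": "R_1_0", "cover": 40, "left": 0.9, "right": 2.0}},
]

def collect(node, counts, covers):
    """Walk a tree, tallying split counts and cover per feature."""
    if isinstance(node, dict):
        counts[node["feat"]] = counts.get(node["feat"], 0) + 1
        covers[node["feat"]] = covers.get(node["feat"], 0) + node["cover"]
        collect(node["left"], counts, covers)
        collect(node["right"], counts, covers)

counts, covers = {}, {}
for t in trees:
    collect(t, counts, covers)

n_splits = sum(counts.values())
frequency = {f: c / n_splits for f, c in counts.items()}
total_cover = sum(covers.values())
cover = {f: c / total_cover for f, c in covers.items()}
print(frequency)  # R_1_0 is used in 3 of the 4 splits
print(cover)
```

In XGBoost itself these measures are reported by the trained booster (e.g., via its feature importance interface); the toy version above only makes the definitions concrete.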

XGBoost algorithm
XGBoost (Chen and He, 2020) is an open-source machine learning software library adopted in this study to train the machine learning models. The resulting models are gradient-boosted trees (Friedman, 2001). XGBoost is selected for its improved regularization methods, high computational efficiency, and ability to achieve state-of-the-art results on many machine learning tasks (Chen and Guestrin, 2016; Chen and He, 2020; Nielsen, 2016). Another reason for using XGBoost is that both the observational and interventional SHAP values of the resulting models can be efficiently computed using the TreeExplainer algorithms proposed by Lundberg et al. (2020) using the model tree structures. The computation of the exact SHAP values can be highly computationally expensive for many other types of machine learning models. The XGBoost algorithm is briefly introduced below and more information can be found in Chen and Guestrin (2016) and Mitchell and Frank (2017).
A gradient boosted trees model built by XGBoost is an ensemble of decision trees, in which $\hat{y}$, the prediction for an input $x$, is the sum of the predictions of the individual trees (Chen and Guestrin, 2016), given as

$$\hat{y} = \sum_{k=1}^{K} f_k(x), \quad f_k \in \mathcal{F},$$

where $f_k$ is a decision tree that maps $x$ to the values assigned to the tree leaves, $K$ is the number of decision trees (also known as the number of boosting iterations), and $\mathcal{F}$ is the functional space of all possible regression trees.
The trees are built to minimize the objective function

$$\mathrm{Obj} = \sum_{j=1}^{n} l\left(y_j, \hat{y}_j\right) + \sum_{k=1}^{K} \Omega(f_k),$$

where $y_j$ is the observed outcome of an input $x_j$, $n$ is the number of samples of $(x, y)$ pairs used as the training dataset, $l$ is the loss function measuring the data fit, and $\Omega$ is a regularization function that penalizes complex trees. A number of hyperparameters used in $\Omega$ specify how much penalization a decision tree should receive. Other hyperparameters, such as $K$ and the maximal tree depth, are also used in XGBoost to control the model structure and learning behavior during training. A complete list of the hyperparameters can be found in the XGBoost software documentation (Chen and He, 2020). The XGBoost hyperparameters and feature engineering hyperparameters are optimized together using the hyperparameter optimization process described below.
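The two equations above can be made concrete with hand-written stumps standing in for learned trees. The stump thresholds, the squared-error loss, and the per-leaf penalty `gamma` are illustrative simplifications of XGBoost's actual regularizer:

```python
# A boosted ensemble predicts by summing its trees' outputs; the training
# objective adds a data-fit loss and a per-tree complexity penalty.
def tree1(x):  # stump splitting on feature 0
    return 0.5 if x[0] < 2.0 else 1.5

def tree2(x):  # stump splitting on feature 1
    return -0.2 if x[1] < 1.0 else 0.3

trees = [tree1, tree2]

def predict(x):
    return sum(t(x) for t in trees)           # y_hat = sum_k f_k(x)

def objective(data, gamma=1.0):
    """Squared-error loss plus a toy complexity penalty of gamma per leaf
    (each stump has two leaves)."""
    loss = sum((y - predict(x)) ** 2 for x, y in data)
    penalty = gamma * 2 * len(trees)
    return loss + penalty

data = [([1.0, 0.5], 0.4), ([3.0, 2.0], 1.7)]
print(predict([1.0, 0.5]))   # 0.5 + (-0.2) = 0.3
print(objective(data))
```

During training, XGBoost adds one tree at a time so that each new tree reduces this objective; the sketch only shows how an already-built ensemble is evaluated.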

Nested cross-validation and Bayesian optimization for training and testing machine learning models
Both the feature engineering hyperparameters (described in Sect. 2.2.2) and XGBoost hyperparameters must be set prior to training a machine learning model. In this study, the optimal hyperparameters that minimize the prediction errors are found using a nested cross-validation (CV) procedure with Bayesian optimization, from which the model prediction accuracy is also assessed.
The nested CV is implemented as follows. (1) The dataset of the observed $(x, y)$ pairs is split into $K$ datasets of roughly equal size, which are termed folds. (2) During an outer CV iteration, the model with the optimal hyperparameters trained on $K-1$ folds is tested on the remaining fold. The outer CV iterations proceed until all of the folds have been used for testing. The procedures to identify the optimal hyperparameters during each outer CV iteration are introduced next. (3) The test errors are used as estimations of the generalization error of the machine learning algorithm, which is the expected prediction error on unseen data.
During each outer CV iteration, an inner CV procedure is used to identify the optimal hyperparameters, where the dataset of $K-1$ folds is further split into $L$ folds. The effectiveness of a set of candidate hyperparameters is then assessed by the validation error of the resulting models, which is the mean prediction error associated with each fold when the remaining $L-1$ folds are used for model training. A function that maps a set of hyperparameters to a validation error can thus be defined.
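The nested CV procedure can be sketched with a deliberately trivial "model" (a shrunken-mean predictor with one hyperparameter) so that the outer/inner loop structure stands out. The fold counts, candidate values, and data are illustrative:

```python
import statistics

def folds(n, k):
    """Split sample indices 0..n-1 into k roughly equal folds."""
    return [list(range(i, n, k)) for i in range(k)]

def fit(train_y, alpha):
    return alpha * statistics.mean(train_y)   # toy "model": shrunken mean

def mse(pred, ys):
    return statistics.mean((y - pred) ** 2 for y in ys)

def nested_cv(y, k_outer=3, k_inner=2, alphas=(0.5, 0.9, 1.0)):
    test_errors = []
    for test_idx in folds(len(y), k_outer):               # outer CV loop
        train = [y[i] for i in range(len(y)) if i not in test_idx]

        # Inner CV: validation error of one candidate hyperparameter.
        def val_error(a):
            errs = []
            for val_idx in folds(len(train), k_inner):
                tr = [train[i] for i in range(len(train)) if i not in val_idx]
                errs.append(mse(fit(tr, a), [train[i] for i in val_idx]))
            return statistics.mean(errs)

        best = min(alphas, key=val_error)                 # hyperparameter search
        # Outer test: refit with the selected value, score the held-out fold.
        test_errors.append(mse(fit(train, best), [y[i] for i in test_idx]))
    return statistics.mean(test_errors)       # generalization-error estimate

y = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.05]
print(nested_cv(y))
```

In the study itself, the exhaustive `min` over candidates is replaced by Bayesian optimization over the joint XGBoost and feature engineering hyperparameter space; the loop structure is otherwise the same.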
The optimal hyperparameters that minimize the prediction error can be found using various function optimization algorithms.
The function optimization problems are solved using Bayesian optimization methods in this study, which are sample-efficient algorithms for solving black box optimization problems (Shahriari et al., 2016). The sample-efficient property of these methods allows near-optimal solutions to be found with a relatively small number of function evaluations. This is particularly useful for hyperparameter optimization, in which the validation errors can be expensive to compute (Snoek et al., 2012). In this study, the decision variables of the optimization problems are six XGBoost hyperparameters and three to five feature engineering hyperparameters depending on the case studies, as described in the following section. The lower and upper bounds of these variables can be found in the source code provided in the Code availability section. Bayesian optimization methods may be treated as black-box methods that automatically find the optimal solution within the specified search space of the decision variables within the given computational budget. The searching behaviors of the methods are determined by a set of parameters. The parameter values used in this study are also provided in the source code given in the Code availability section. A detailed introduction to Bayesian optimization can be found in Frazier (2018).
However, it should be noted that hyperparameter optimization processes face the risk of overfitting the model selection criterion (Cawley and Talbot, 2010). During each inner CV iteration, multiple models are trained on the same dataset with different sets of candidate hyperparameters, and their prediction accuracies are evaluated on the same validation dataset. The model with the highest prediction accuracy may generalize poorly to unseen data because it may simply exploit the randomness of the validation dataset without learning the true correlation between the input and output variables. It is therefore important to test whether the hyperparameters found during the inner CV iterations remain useful for unseen data. Thus, in this study, the validation error associated with each set of candidate hyperparameters evaluated during the optimization processes is compared against its test error computed on the test dataset of the corresponding outer CV iteration. A positive correlation between the validation and test errors indicates that the good hyperparameters identified during the inner CV iterations are likely to also be good for unseen data. More information on resampling techniques for testing machine learning models can be found in Kuhn and Johnson (2013) and Hastie et al. (2009).
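The overfitting check described above reduces to measuring the correlation between the validation errors (inner CV) and the test errors (outer CV) across the candidate hyperparameter sets; a minimal sketch with synthetic error values standing in for real CV results:

```python
import numpy as np

def rank(a):
    """Ranks for a Spearman correlation (no tie handling needed here)."""
    r = np.empty(len(a))
    r[np.argsort(a)] = np.arange(len(a))
    return r

def error_correlation(val_errors, test_errors):
    """Pearson and Spearman correlation between validation errors (inner CV)
    and test errors (outer CV) of the candidate hyperparameter sets."""
    v, t = np.asarray(val_errors, float), np.asarray(test_errors, float)
    pearson = float(np.corrcoef(v, t)[0, 1])
    spearman = float(np.corrcoef(rank(v), rank(t))[0, 1])
    return pearson, spearman
```

A clearly positive correlation supports the conclusion that hyperparameters performing well in validation also perform well on the held-out test fold.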

Case studies
Two SuDS sites with different drainage areas, SuDS practice types, and data availabilities are examined in this study. Machine learning models are built for each site to model the rainfall-runoff correlations. The prediction accuracies of these models are compared against those of linear regression models and, for one site, a process-based model. The basis of the predictions is analyzed using the interpretation methods described above, and the explanations obtained for the two sites are compared to validate the model structures and better understand the rainfall-runoff correlations.

Study sites
Study site 1 is located on Washington Street, Geauga County, Ohio, U.S., hereafter referred to as "WS." Multiple types of SuDS were built in WS to treat stormwater runoff generated by a nearby commercial building and parking lot, as shown in Fig. 2a (Darner et al., 2015). Runoff from approximately half of the commercial building roof (i.e., an impervious area of 316 m²) drains into a rain garden with a surface area of 37 m². The 762 m² parking lot was constructed using porous pavements to allow infiltration. Study site 2 is located in the Shayler Crossing Watershed (SHC) in Clermont County, Ohio, U.S., as shown in Fig. 2b. The SHC is a sub-watershed of the East Fork Little Miami River Watershed. The SHC drainage area is approximately 0.92 km² (Hoghooghi et al., 2018) and the land use type is primarily residential. The SHC drainage system consists of conduits, channels, detention ponds, dry ponds, and wet ponds (Lee et al., 2018a). In the SHC, stormwater runoff indirectly generated by connected impervious areas (e.g., sidewalks) is treated by the nearby pervious areas, which are termed buffering pervious areas and have functions similar to grass filter strips (Lee et al., 2018b).
Rainfall and discharge time series data are available for both sites. In WS, a 10-min-resolution rainfall time series is available from on-site monitoring between 2009 and 2013. The outflow from WS was collected by three flumes, inside which the water levels were measured every 1 min and recorded every 10 min and whenever the water level changed. Flumes 1, 2, and 3 respectively collect the surface runoff from the parking lot, overflow from the surface layer of the rain garden, and underdrain flows from the parking lot. The water levels are converted into discharge measurements using stage-discharge rating curves.
Recorded flow data from 2009-2013 are available. The on-site monitoring was conducted by the United States Geological Survey, as described in more detail in Darner and Dumouchelle (2011) and Darner et al. (2015). In the SHC, 10-min-resolution rainfall time series from 2009-2010 are available, in addition to a 10-min-resolution discharge time series measured at the outlet between July and August 2009. The dataset used in this study is the same as in Lee et al. (2018a) and Lee et al. (2018b), in which more details on the dataset can be found.
Process-based models can be difficult to develop for both sites. In WS, the physical properties and exact design of the different drainage system elements are not precisely known (Darner et al., 2015). The rain garden is also not isolated from the gravel storage layer of the porous pavements, which permits an unknown amount of stormwater to flow from the rain garden into the underdrain system of the porous pavement. Moreover, commonly used process-based models are mostly designed to model SuDS with standard designs and may not be directly applicable to WS. In the SHC, the main challenge lies in the heavy workload and uncertainties in estimating the model parameters that characterize the complex drainage system. The SHC can be divided into multiple subcatchments connected by the drainage network, and a number of parameters must be determined for each subcatchment. In particular, the portions of impervious area that are directly and indirectly connected to the drainage network must be specified to accurately represent the flow paths of each subcatchment. The pervious areas should also be subdivided into areas that do or do not treat runoff (Lee et al., 2018a). The sub-area division requires substantial effort, considering the relatively large number of subcatchments involved. The numerous parameters associated with each sub-area (e.g., surface roughness, depression storage) can only be determined via model calibration, which is subject to considerable uncertainties.
The two study sites differ in several aspects, including size, SuDS practice type, drainage system configuration, and data availability. Both sites also face challenges in the application of process-based models and are thus chosen as typical examples of the various SuDS projects to demonstrate the wide applicability of the proposed methods. The configurations of the two study sites are compared in Table 1. More example applications of other SuDS sites can be found in the link provided in the Code availability section.

Table 1 Comparison of the configurations of the Washington Street site (WS) and the Shayler Crossing Watershed (SHC).

Data availability
WS: Flow data collected at the three flumes between 2009 and 2013 are available. The flow rates are derived from water level measurements using the stage-discharge rating curves. The water level at each flume was measured every minute and recorded at 10-min intervals and whenever the water level changed. A 10-min-resolution rainfall time series recorded between 2009 and 2013 is available.
SHC: Flow data collected at the SHC outlet at 10-min intervals from July to August 2009 are available. A 10-min-resolution rainfall time series recorded between 2009 and 2010 is available.

Challenges to developing process-based models
WS: The physical properties of the catchment and SuDS are not precisely known. Unknown leakage from the rain garden to the storage layer of the porous pavement is difficult to represent in commonly used process-based models.
SHC: Many parameters are required in process-based models to characterize the drainage processes of the complex drainage system. A considerable amount of data on the physical properties of the catchment is required. Some parameters can only be determined through model calibration, which is subject to large uncertainties.

Hydrological models built in previous studies
WS: The authors are not aware of any process-based model built for this site.
SHC: Lee et al. (2018a) built a process-based model for this site using the SWMM software (Rossman, 2015).
Rainfall-runoff models are built for WS and the SHC. In WS, the output variable is the flow rate of the total runoff collected by the three flumes, recorded at regular 10-min intervals. The input variables include the 10-min-resolution rainfall time series recorded prior to the runoff, the month in which the runoff occurs (optional), and the cumulative rainfall depth recorded since the beginning of monitoring (optional). The optional features are considered to account for the possible evolving performance of the SuDS during its service life (Yong et al., 2013) and the potential seasonality of the SuDS hydrological properties (Muthanna et al., 2008). Whether the two sets of optional features should be included is controlled by two binary feature engineering hyperparameters, in addition to the feature engineering hyperparameters described in Sect. 2.2.2. The rainfall-runoff data recorded during 2010-2013 for the warm season (i.e., April to October) (Darner et al., 2015) are used for modeling in this study. Fig. 2c shows the temporal distributions of the available data. Flows were not observed at flume 1 throughout the monitoring period and are therefore not shown. The gaps in the runoff data in Fig. 2c correspond to missing values owing to technical issues or extended dry periods with no records. However, the exact cause of each missing value is unknown. Because overflow from the rain garden occurred infrequently (Darner et al., 2015), the missing flow rates of flume 2 are assumed to be 0. The total discharge is then computed for cases in which the flow rates of flume 3 are available. A large gap in the rainfall data can be observed in the summer of 2013. The rainfall time series recorded before this large gap is used for modeling in this study, and the few missing values are assigned a value of 0.
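The rainfall depth features can be illustrated as sums of the preceding rainfall over windows that widen with lag, so that recent rainfall is resolved finely and older rainfall coarsely. The window layout below, controlled by the hypothetical hyperparameters `n_recent`, `n_agg`, and `growth`, is an assumption for illustration; the study's actual layout is defined by its own feature engineering hyperparameters.

```python
import numpy as np

def window_bounds(n_recent=6, n_agg=4, growth=2):
    """Lag boundaries: n_recent single-step windows, followed by n_agg
    windows whose widths grow geometrically (hypothetical layout)."""
    bounds = list(range(n_recent + 1))  # [0, 1, ..., n_recent]
    width = growth
    for _ in range(n_agg):
        bounds.append(bounds[-1] + width)
        width *= growth
    return bounds

def depth_features(rain, t, bounds):
    """Rainfall depth features for a prediction at time step t:
    feature i is the total rainfall over lags [bounds[i], bounds[i+1])."""
    feats = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        lo, hi = max(0, t - b), max(0, t - a)
        feats.append(float(np.sum(rain[lo:hi])))
    return np.array(feats)
```

With the default layout, a 10-min-resolution series is summarized by six 10-min features followed by progressively coarser windows, reducing the high-dimensional time series to a short feature vector.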
The feature engineering and XGBoost hyperparameters are automatically optimized using the Bayesian optimization methods, and the performance of the resulting models is evaluated using a nested CV procedure, as described in Sect. 2.2.7.
For WS, a 5-fold CV is used for both the inner and outer CV iterations. A total of 142 independent rainfall events are identified, defined as rainfall events separated by at least a 24-h dry spell (Guo and Senior, 2006), with 38,835 discharge records during the wet periods (i.e., periods of rainfall lasting at least 24 h) in which the rainfall time series is available for at least the previous 10 days. CV folds are created for the 38,835 rainfall-runoff records using the rainfall-event-grouped stratified sampling method (Zeng and Martinez, 2000). The data associated with the same rainfall event are grouped and contained only within a single fold during each of the outer or inner CV iterations, and the peak discharges associated with the rainfall events in each fold roughly follow the same distribution. This is to prevent data leakage and to ensure that the data in each fold are representative (Kuhn and Johnson, 2013).

In the SHC, the output variable is the watershed outlet discharge measured at 10-min intervals, and the input variable is the rainfall time series preceding the discharge measurement. Only two months of runoff data are available, with 8,921 discharge records. There are eight missing discharge records during the two-month period, which are excluded from the modeling. The nested CV procedure is not used due to the small dataset size; instead, the dataset is split into training, validation, and test datasets that each contain at least one large runoff event. The Bayesian optimization methods are then used to identify the optimal feature engineering and XGBoost hyperparameters that minimize the prediction error on the validation dataset when the model is trained on the training dataset. The training and validation datasets are then combined and the machine learning methods with the optimal hyperparameters are applied. The resulting models are then tested on the test dataset.
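The event-grouped stratified fold creation can be sketched by sorting events by peak discharge and dealing them into folds in a snake order, so that each fold spans the peak-discharge distribution while all records of an event stay together. This is a simplified stand-in for the method of Zeng and Martinez (2000), not their algorithm:

```python
import numpy as np

def grouped_stratified_folds(event_peaks, k=5):
    """Assign whole rainfall events to k folds so that the peak discharges
    in each fold roughly follow the same distribution: sort the events by
    peak and deal them into folds in a snake order."""
    order = np.argsort(event_peaks)[::-1]  # largest peak first
    fold_of_event = np.empty(len(event_peaks), dtype=int)
    for pos, ev in enumerate(order):
        cycle, slot = divmod(pos, k)
        # Reverse direction on every other pass to balance the folds.
        fold_of_event[ev] = slot if cycle % 2 == 0 else k - 1 - slot
    return fold_of_event
```

The returned array maps each event to a fold; every discharge record inherits the fold of its event, which prevents leakage of same-event data between training and validation sets.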
The prediction accuracy of the XGBoost models is compared against that of linear regression models. For each site, the resampling and hyperparameter optimization methods used to train the XGBoost models are also applied to derive the linear regression models. The only difference in the training processes between the two model types is that only the feature engineering hyperparameters are optimized when fitting linear regression models to the data.
For the SHC, the process-based model developed by Lee et al. (2018a) is also compared with the machine learning models built in this study. Their model was built using the SWMM software, in which the SHC was divided into 191 subcatchments, and the drainage processes in each subcatchment were characterized by further division into sub-areas, such as indirectly connected impervious areas and buffering pervious areas. Lee et al. (2018a) used a considerable amount of data concerning the SHC physical properties to develop this spatially refined model.

After testing the model prediction accuracies, the basis of each runoff prediction is analyzed using the methods described above (i.e., derivation of the rainfall contribution at each time step to the runoff prediction). The explanations of many samples are then combined in different ways to understand the model structure and modeled processes. (1) The explanations of all of the samples are aggregated using Eq. (14) to derive the global feature importance. The feature importance measures are then compared with the gain, cover, and frequency. The feature importance of WS and the SHC is also compared to examine whether the feature importance is associated with the modeled hydrological processes. (2) The explanations of the samples are grouped and aggregated based on the predicted discharge to investigate whether the predictions of low and high flows are determined by the same factors, and whether the model structures that lead to these predictions are logical.

The basis of the individual runoff predictions is also analyzed at the runoff event level. The rainfall contributions at different time steps are available for each runoff prediction. This information can then be used to decompose the hydrograph into sub-hydrographs associated with the rainfall at different time steps.
This resolves some of the hydrograph separation tasks (Pelletier and Andréassian, 2020), such as identifying the runoff volumes associated with a rainfall event and determining the portions of flow contributed by "old water" (i.e., water stored in the catchment prior to the rainfall event) (Cartwright and Morgenstern, 2018). The average age of the rainfall contributing to the runoff prediction at each time step is also computed using Eq. (12).
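Assuming Eq. (12) takes the form of a contribution-weighted mean of the lags, the average rainfall age for a single discharge prediction can be sketched as:

```python
import numpy as np

def average_rainfall_age(contributions, dt_minutes=10.0):
    """Contribution-weighted average age of the rainfall driving one
    discharge prediction; contributions[i] is the (SHAP-derived)
    contribution of the rainfall i+1 time steps in the past."""
    w = np.abs(np.asarray(contributions, dtype=float))
    ages = dt_minutes * np.arange(1, len(w) + 1)
    return float(np.sum(w * ages) / np.sum(w))
```

A prediction dominated by recent rainfall yields a small average age, while contributions concentrated at long lags push the average age toward the older end of the window.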

Prediction accuracy of the machine learning models
The Bayesian optimization methods are implemented to automatically identify the optimal feature engineering and XGBoost hyperparameters that minimize the prediction errors on the validation datasets. It is necessary to check whether the prediction accuracies achieved for the validation datasets generalize to unseen data when the optimal hyperparameters are applied (i.e., whether or not the model selection criterion is overfitted) (Cawley and Talbot, 2010). Fig. 3 shows the prediction errors associated with each set of candidate hyperparameters obtained during the inner and outer CV iterations in WS. Each subplot shows the results obtained during an outer CV iteration when a specific aggregation option was applied for creating the rainfall depth features. The inner and outer CV errors are positively correlated for all of the outer CV iterations and aggregation options, which is desirable. This result confirms that the hyperparameter optimization process indeed found hyperparameters that perform well on unseen data and that the model selection criterion is not overfitted. Similar positive correlations between the validation and test errors are observed in the SHC when different aggregation options are adopted for creating the rainfall depth features.

The prediction accuracies of the various WS and SHC models are compared in Fig. 4. The RMSE, Nash-Sutcliffe model efficiency coefficient (NSE; Nash and Sutcliffe, 1970), and coefficient of determination (R²) of the predictions on the test datasets are compared, except for the SWMM model developed by Lee et al. (2018a), which was tested on part of its training dataset due to insufficient data. The prediction accuracies of the XGBoost models, i.e., NSE > 0.7 and R² > 0.7, can be considered satisfactory, considering that the models were relatively easy to set up and that it was impossible or very difficult to build process-based models for either site. The XGBoost models for both sites consistently outperform the linear regression models (LM), suggesting that more sophisticated machine learning algorithms (e.g., XGBoost) are able to better capture complex rainfall-runoff correlations than simple linear regression methods. The SHC XGBoost models have prediction accuracies comparable to SWMM, although the former were built with considerably less data.
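The three accuracy metrics can be computed directly from the observed and simulated discharge series; a minimal sketch, with R² taken as the squared Pearson correlation:

```python
import numpy as np

def rmse(obs, sim):
    """Root-mean-square error of the simulated discharge."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the error variance relative to
    the variance of the observations (1 is a perfect fit)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))

def r_squared(obs, sim):
    """Coefficient of determination as the squared Pearson correlation."""
    return float(np.corrcoef(obs, sim)[0, 1] ** 2)
```

Note that R² measures only linear association, while the NSE also penalizes bias, which is why both are reported together.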

Figure 4 Prediction accuracies of the various models built for the Washington Street SuDS site (WS) and Shayler Crossing Watershed (SHC). The prediction accuracies are evaluated in terms of the root-mean-square error (RMSE), coefficient of determination (R²), and Nash-Sutcliffe coefficient of efficiency (NSE). The RMSE units are L/s for WS and m³/s for the SHC. Each data point in the figure shows the prediction accuracy of a model tested on the associated test dataset. All of the SHC models were tested on the same dataset, and nested cross-validation was not implemented due to the small sample size. The SWMM model for the SHC was built by Lee et al. (2018a).
Similar prediction accuracies are obtained with the different aggregation options at both sites. Thus, all of the aggregation options may be useful. In practice, an ensemble of models produced using different feature engineering methods may be used.
Each data point in Fig. 4 shows the result obtained for a machine learning model that was trained and tested on a pair of training and test datasets. In WS, considerable variations in the prediction accuracies can be observed when a machine learning method is applied to different training and test datasets, owing to the different model structures that result and the different test datasets used for assessment. The variations indicate that the sample distributions in the different versions of the training and test datasets differ appreciably, even though a stratified sampling method is used to balance the sample distributions in the different folds. The imbalanced sample distribution is associated with the limited number of samples used for model training and evaluation, which implies that the four years of rainfall-runoff data still contained an insufficient number of samples for the different regions of the input space of the machine learning methods examined in this study. For instance, only a few high-flow points were observed each year, which may be insufficient for training machine learning models to provide accurate high-flow predictions. Even fewer samples were available for training and testing the SHC machine learning models; thus, the uncertainties of the prediction accuracies may be even larger.

Given that the uncertainties in the prediction accuracies are not negligible, the model assessment results should be interpreted carefully. It may be useful to examine whether the model structures that lead to each prediction are logical, which is examined in the following sections.

Understanding the global model structures for rainfall-runoff modeling
The predicted discharge for a specific input sample can be decomposed into the rainfall contributions of different time steps using Eq. (9) to Eq. (11) based on the SHAP values assigned to the rainfall depth features. The absolute values of these contributions specify the importance of the rainfall for the runoff prediction. The contributions and importance associated with all of the samples can be averaged using Eq. (13) and Eq. (14) to understand the global model structures for making discharge predictions. Fig. 5a and 5b respectively show the average importance and contributions of the rainfall at each past time step for predicting the discharge at the current time step in the different models of WS. Each subplot displays the results obtained using a specific aggregation option for creating the rainfall depth features and a specific definition of the SHAP values (i.e., Eq. (2) or Eq. (3)). Fig. 5a shows that in all models, the discharge prediction is mostly affected by the more recent rainfalls (note that the x-axis is on a pseudo-logarithmic scale and each time step is 10 min). The rainfalls recorded more than 100 time steps in the past (i.e., 16.7 h) have almost no impact on the discharge prediction. This result is reasonable considering the small size of the WS catchment. However, Fig. 5b shows that, on average, the more recent rainfalls have a negative contribution to the runoff predictions across the samples in the machine learning models. This contradicts conventional process-based models, in which rainfall positively affects the runoff generation processes.

Each subfigure of Fig. 5 shows the results obtained for the different versions of the models derived in the outer CV iterations of each aggregation option. Across the different versions of the models, the rainfall at each time step is generally found to have a similar contribution and importance to the discharge prediction, which indicates a high degree of similarity between the rainfall-runoff correlations learned by these models. However, the rainfall can be assigned notably different importance and contribution values when different aggregation options are adopted, which implies that the rainfall-runoff correlations are modeled differently in these models. Schmidt et al.
(2020) referred to the existence of various possible model structures of machine learning models as equifinality, which is an important concept in hydrological modeling.

Both the observational and interventional SHAP values are used to compute the rainfall contributions and importance values. As mentioned in Sect. 2.1.2, the resulting values should be interpreted differently. The contributions resulting from the observational SHAP values can be interpreted as the expected difference in the predicted discharge when a particular rainfall is observed/not observed, accounting for the presence/absence of the other rainfall measurements. The importance is the absolute value of the contributions, which measures the extent of the effect. The contributions that result from the interventional SHAP values are the expected prediction changes when rainfall is set to a specific value, accounting for the presence/absence of other rainfall measurements. The importance is again the absolute value of the contribution. As shown in Fig. 5, the contributions and importance values resulting from the interventional SHAP values have larger variabilities among the different outer CV iterations, which suggests that the interventional SHAP values are more sensitive to small model structure variations. Nevertheless, the observational and interventional SHAP values produce similar results in terms of the relative importance of the rainfall at different time steps for the runoff prediction.
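Given a matrix of SHAP values (one row per sample, one column per rainfall time step), the averaging in Eq. (13) and Eq. (14) can be sketched as column-wise means; the normalization option mirrors the procedure applied later when comparing importance measures (the function names are illustrative):

```python
import numpy as np

def global_contribution(shap_values):
    """Average contribution of each rainfall time step (Eq. 13 analog):
    the mean SHAP value across all samples, which can be negative."""
    return np.asarray(shap_values, dtype=float).mean(axis=0)

def global_importance(shap_values, normalize=True):
    """Average importance of each feature (Eq. 14 analog): the mean
    absolute SHAP value, optionally normalized to sum to one."""
    imp = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    return imp / imp.sum() if normalize else imp
```

The signed mean reveals the direction of the learned correlation, while the mean absolute value measures only its strength; both views are used in Fig. 5.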
The average rainfall contribution and importance for the discharge prediction are also computed for the SHC, as shown in Fig. 6. The runoff predictions are mostly affected by the rainfalls observed approximately nine time steps in the past (i.e., 1.5 h). In contrast, the discharge predictions of WS are mostly affected by the rainfall from the past 1 time step (i.e., 10 min). Because the SHC is considerably larger than WS, it is expected to have a longer concentration time (i.e., a longer time required for stormwater to travel through the catchment). It is therefore reasonable that the models show that the discharge predictions in the SHC are mostly affected by rainfall from the more distant past than those in WS. It is also worth noting that the contribution and importance assigned to the rainfall do not vary smoothly across the time steps, which indicates that the models find rainfall in some periods to be more important than in other periods, which is not realistic. These undesirable model behaviors, which were not observed in the WS models, might be caused by the insufficient data used in the model training.

The SHAP-based feature importance defined in Eq. (14) is also computed for the rainfall of each time step. A normalization procedure is applied in which the importance of the rainfall of each time step is divided by the sum of the average importance of all features (including both the rainfall depth features and the other features). Fig. 7 compares the various feature importance measures of the rainfall of each time step in the WS models. In general, the feature importance is more unevenly distributed across the time steps when the gain metric is used than when the cover and frequency metrics are used.
This result suggests that although the frequencies with which the rainfall of each time step is used by the models to make predictions are similar, the more recent rainfalls contribute more to improving the prediction accuracy. The results of the two SHAP-value-based feature importance measures are generally similar to those of the other commonly used importance measures. These similarities confirm the validity of the SHAP values in identifying the important features. In addition, the feature importance measures computed using the various methods are also found to be relatively similar in the SHC.

It is interesting to note that the recent rainfalls in Fig. 8b have a negative contribution to the discharge prediction for small to medium discharges (i.e., the 0-80th percentiles). This means that the discharge predictions would be higher if information regarding the recent rainfall depths were unavailable. The implication is that the models have certain threshold values, such that the recent rainfalls have non-negative contributions to the discharge predictions only if they are larger than a given threshold, with the predictions otherwise relying on information from the relatively distant past. These properties correspond to a conceptual rainfall-runoff model that differs from other commonly used models in hydrology, such as the explicit soil moisture accounting models (Beven, 2012), in which rainfall cannot have a negative contribution to runoff. In the machine learning models, however, the catchment is associated with a basic discharge prediction, and the rainfalls are compared to some threshold: a rainfall has a positive contribution only if it is larger than the threshold, and a negative contribution otherwise. This representation can be roughly expressed as Eq. (17), which has a similar form to linear regression models.
Discharge prediction = basic prediction + (rainfall depths − thresholds) × transformation factors.     (17)

This model can be likened to a company acquisition model, in which the company (i.e., the basic prediction) is in the business of acquiring companies of various values (i.e., the rainfall at different time steps), which can be negative. The total value (i.e., the discharge prediction) of the merged company is determined by the size of each company and how the companies interact. Negative rainfall contributions were also found for the low-flow predictions in some of the models built using other rainfall depth feature aggregation options for WS and the SHC.
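The conceptual representation in Eq. (17) can be illustrated numerically; the depths, thresholds, and factors below are arbitrary example values, not fitted quantities:

```python
import numpy as np

def threshold_model(basic, depths, thresholds, factors):
    """Conceptual form of Eq. (17): each rainfall depth contributes
    (depth - threshold) * factor, which is negative below the threshold."""
    contributions = (np.asarray(depths, float) - np.asarray(thresholds, float)) \
        * np.asarray(factors, float)
    return float(basic + contributions.sum()), contributions
```

With a depth below its threshold, the corresponding contribution is negative, reproducing the behavior observed for the low- to medium-flow predictions.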

Hydrograph decomposition according to the contribution of rainfall of each time step to runoff prediction
The contribution of the rainfall at each time step to the discharge prediction can be useful for hydrograph decomposition, as each discharge prediction equals the sum of the SHAP values of all of the input sample features plus a constant bias (i.e., the local accuracy property described in Eq. (4)). As an illustration, Fig. 9 shows the decomposed hydrographs for large, medium, and small runoff events predicted by a machine learning model of WS. The rainfalls collected at 10-min intervals are aggregated into rainfall depths recorded over the past 0-1 h, 1-2 h, and so on. The contribution associated with each rainfall depth record is aggregated accordingly. Fig. 9 shows that the runoff is mostly affected by the rainfall that occurred in the past 1 h regardless of the runoff event magnitude, especially for the peak discharge. However, this hydrograph decomposition method has some limitations. First, negative contributions are assigned to the rainfalls, which are difficult to interpret. Second, there is a constant bias term in the hydrograph, which does not correspond to any element in the commonly used conceptual rainfall-runoff models. Third, a model might use features that are not derived from rainfall (e.g., temperature) as predictors, which will also be assigned contributions to the runoff predictions when SHAP methods are used. It is unclear how to represent the contributions of factors other than rainfall when decomposing a hydrograph. Nevertheless, the hydrograph decomposition results shown in Fig. 9 are still useful for understanding why a given prediction is made by the model and to what extent the rainfall of each time step contributes to the model runoff prediction.

Fig. 10 shows the average age of the rainfall that controls the discharge predictions of a large, medium, and small runoff event in a WS model, computed using Eq. (12).
The average rainfall age generally decreases when rainfall occurs and increases when the rainfall stops. The average age is also affected by the rainfall depth. For instance, in the large runoff event, the average age of the rainfall around the 11th hour only decreases to approximately 2.5 h due to the small size of that rainfall compared with the large rainfalls observed at the beginning of the event. This result implies that large rainfalls generally affect the runoff generation process for longer periods than small rainfalls, which is reasonable. The average age computation method defined in Eq. (12) is similar to the end-member mixing model used in hydrology to determine the runoff sources (Burns et al., 2001). The average ages computed in this study may also be useful for determining the runoff sources. Considering the small size of WS (0.0011 km²), the runoff associated with an average age older than 1 h may be partly contributed by stormwater that slowly passes through the soil layers of the SuDS practices. This study, however, does not compare the hydrograph decomposition and the average age of the rainfalls derived for different runoff events using different models. A systematic multi-event and multi-model comparison is therefore left for future research. It would also be meaningful to compare these quantities derived using the proposed methods against those from other commonly used approaches in process-based modeling and in tracer and isotope hydrology.
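The hydrograph decomposition described in this section can be sketched as a band-wise aggregation of the per-time-step SHAP contributions, using the local accuracy property to verify that the bands and the bias reconstruct each prediction (the band width and the synthetic SHAP matrix are illustrative):

```python
import numpy as np

def decompose_hydrograph(shap_values, bias, steps_per_band=6):
    """Aggregate per-time-step SHAP contributions into coarser bands
    (e.g. 10-min steps into 1-h bands). By the local accuracy property,
    each prediction equals the bias plus the sum of its bands."""
    phi = np.asarray(shap_values, dtype=float)
    n_pred, n_steps = phi.shape
    n_bands = -(-n_steps // steps_per_band)  # ceiling division
    pad = n_bands * steps_per_band - n_steps
    phi = np.pad(phi, ((0, 0), (0, pad)))  # zero-pad the oldest band
    bands = phi.reshape(n_pred, n_bands, steps_per_band).sum(axis=2)
    predictions = bands.sum(axis=1) + bias
    return bands, predictions
```

Plotting the bands as stacked areas over the prediction time axis yields sub-hydrographs attributable to rainfall of the past 0-1 h, 1-2 h, and so on, as in Fig. 9.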

Conclusions
This study proposes an explainable machine learning method to predict and interpret the hydrological responses of sustainable urban drainage systems (SuDS) to rainfall at sub-hourly time scales. The proposed methods are applied to two SuDS catchments of different sizes, SuDS practice types, and data availabilities to predict discharge, and they produce models with good prediction accuracies (NSE > 0.7 and R² > 0.7 for all models). These models consistently outperform linear regression models and have similar prediction accuracies to a process-based model for one study site. The local feature attribution methods, i.e., the SHAP values, are then adopted to explain the basis of each discharge prediction. The contribution of the rainfall of each time step to each discharge prediction is further derived. The average contribution of the rainfall at different past time steps to the discharge prediction is then calculated using all of the input samples and used as a feature importance measure.
The feature importance measure is useful for analyzing the model structure and provides insights into the runoff generation process. In all of the machine learning models, the discharge prediction is found to be substantially affected by the rainfall of the past few time steps, which is reasonable considering the small catchment sizes. In models trained using only one month of data, fluctuations are detected in the importance derived for the rainfall across different time steps; these unrealistic fluctuation patterns reveal insufficiencies in the model structures. This study further combines the explanations of low- and high-flow predictions to understand the model behavior for input samples from different regions of the input domain. A conceptual rainfall–runoff model is identified in these models, in which the final discharge prediction is the sum of a base prediction and the contributions associated with the rainfall of different time steps. However, the rainfall contributions can be negative, which distinguishes this representation from commonly used process-based models. This study briefly shows that a discharge prediction can be decomposed into the contributions of the rainfall of each time step, which can be useful for hydrograph separation and for determining the runoff source.
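This additive representation can be sketched as follows. The function name, the 1 h threshold used to separate fast from slower runoff, and the fixed 15 min time step are illustrative assumptions, not the paper's implementation; only the additive decomposition itself (SHAP's local-accuracy property) is taken from the method.

```python
def split_runoff_sources(base, contributions, dt_hours=0.25, threshold_h=1.0):
    """Decompose one discharge prediction into runoff-source parts.

    base is the model's base (average) prediction; contributions[i]
    is the signed attribution of the rainfall i time steps in the
    past. Rainfall younger than threshold_h is counted as fast
    runoff; older rainfall as slower pathways (e.g., flow through
    green-roof soil layers). Contributions may be negative, unlike
    in most process-based models.
    """
    fast = sum(c for i, c in enumerate(contributions) if i * dt_hours < threshold_h)
    slow = sum(c for i, c in enumerate(contributions) if i * dt_hours >= threshold_h)
    prediction = base + fast + slow  # SHAP local accuracy: parts sum to the prediction
    return prediction, fast, slow
```

The same decomposition, applied to every time step of a simulated event, yields the hydrograph separation discussed above.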
It was difficult to build process-based hydrological models for the two SuDS catchments examined in this study due to insufficient information regarding their physical properties and drainage processes. This study designs a simple feature engineering method to facilitate the application of commonly used machine learning algorithms to model rainfall–runoff correlations at sub-hourly time scales. In this method, rainfall depth features extracted from high-resolution rainfall time series are used as the input variables of the machine learning models instead of the original time series. The depth features are calculated as the rainfall depths measured over specific time intervals, the number and locations of which are controlled by a few hyperparameters. Bayesian optimization methods and a nested cross-validation procedure are adopted to derive the optimal feature engineering hyperparameters along with the hyperparameters required by the specific machine learning algorithms. The model training process is automatic and only requires that the range of possible hyperparameter values be defined; in this study, the hyperparameter ranges are set to be the same for the two study sites. The proposed methods are also tested on a few other SuDS catchments with different configurations, and the prediction accuracies are generally found to be satisfactory. The source code used in this research and the additional case studies can be found in the Code availability section.
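A minimal sketch of the depth-feature extraction step, assuming hand-picked window breakpoints (in the study, their number and locations are hyperparameters tuned by Bayesian optimization, and the function name here is hypothetical):

```python
def depth_features(rainfall, breakpoints):
    """Summarize a high-resolution rainfall series as a few cumulative
    depths over past windows, instead of feeding every time step to
    the model.

    rainfall[0] is the most recent time step; breakpoints are the
    window edges in time steps, e.g. [1, 3, 6] gives depths over
    steps [0,1), [1,3), and [3,6).
    """
    features, start = [], 0
    for end in breakpoints:
        features.append(sum(rainfall[start:end]))
        start = end
    return features

# Six 15 min readings reduced to three depth features.
feats = depth_features([1.0, 2.0, 0.0, 0.0, 3.0, 1.0], [1, 3, 6])
```

Widening the windows for older rainfall keeps the feature count small while retaining fine resolution for the most recent, most influential time steps.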
Local feature attribution methods are used in this study to explain the basis of each prediction. These explanations are found to be useful for checking the validity of the machine learning models, i.e., whether the input–output correlations learned by the models are plausible for the various input samples. The feature importance measure derived from the local explanations is compared with commonly used global feature importance measures and found to generate similar results. The local explanations are also used in applications such as the development of conceptual rainfall–runoff models, hydrograph separation, and the identification of runoff sources. Overall, the application of local feature attribution methods increases the transparency of machine learning predictions. These methods can be useful for detecting errors in model structures and for inferring the modeled hydrological processes from the data.
Future research may proceed in the following directions. First, this study designs a feature engineering method to reduce the dimension of the input variables, i.e., the rainfall time series. However, feature engineering methods can be designed arbitrarily, and identifying the optimal method can be computationally expensive. Future studies can thus explore machine learning methods that are specifically designed for modeling high-dimensional time series data, such as long short-term memory (LSTM) networks in deep learning. Second, the methods proposed in this study are only applied to model the correlations between rainfall time series and discharge at a few U.S. sites. The proposed methods should therefore be tested at more sites worldwide and used to model the correlations between other variables, although this may require the development of new feature engineering methods. Finally, it may be useful to compare the hydrological processes inferred by the machine learning models with those obtained using conventional methods. For instance, the hydrograph decomposition methods presented in this study can be compared with commonly used process-based modeling approaches.

Data availability
The data for the two study sites examined in this study were obtained from the United States Geological Survey (