Large-sample assessment of spatial scaling effects of the distributed wﬂow_sbm hydrological model shows that ﬁner spatial resolution does not necessarily lead to better streamﬂow estimates

. Distributed hydrological modelling moves into the realm of hyper-resolution modelling. This results in a plethora of scaling related challenges that remain unsolved. In light of model result interpretation, ﬁner resolution output might implicate to the user an increase in understanding of the complex interplay of heterogeneity within the hydrological system. Here we investigate spatial scaling in the realm of hyper-resolution by evaluating the streamﬂow estimates of the distributed wﬂow_sbm hydrological model based on 454 basins from the large-sample CAMELS data set. Model instances were derived at 3 spatial 5 resolutions, namely 3 km, 1km, and 200m. The results show that a ﬁner spatial resolution does not necessarily lead to better streamﬂow estimates at the basin outlet. Statistical testing of the objective function distributions (KGE score) of the 3 model instances show only a statistical difference between the 3km and 200m streamﬂow estimates. However, results indicate strong locality in scaling behaviour between model instances expressed by differences in KGE scores of on average 0.22. This demon-strates the presence of scaling behavior throughout the domain and indicates where locality in results is strong. The results of 10 this study open up research paths that can investigate the changes in ﬂux and state partitioning due to spatial scaling. This will help further understand the challenges that need to be resolved for hyper resolution hydrological modelling. 380 This study answered where locality in results are strong due to spatial scaling effects. Future research should conduct an in-depth assessment of basins where differences in streamﬂow estimates and lateral ﬂuxes are large due to spatial scale. This will lead to a better understanding of why and under what conditions locality in spatial scaling related issues occur.

. Basin locations of the CAMELS data set with in green the successful model runs. In blue failed runs due to parameter or run time errors and in red due to missing streamflow observations. Basemap made with Natural Earth.
ues. The precipitation variable is disaggregated to the modelling grid using the second-order conservative method to ensure consistency of the total volume of precipitation across spatial scales. The temperature variable is disaggregated with the environmental lapse rate and the Digital Elevation Model (DEM) used by the hydrological model. The variables required by the De Bruin method are disaggregated using the (bi)linear method and subsequently used to calculate potential evapotranspiration. The code for all these pre-processing steps is included in the Jupyter Notebooks made available with this manuscript 115 (DOI:10.5281/zenodo.5724512).

Model Experiment Setup
2.2.1 The wflow_sbm model (v.2020.1.2) Wflow_sbm is available as part of the wflow open source modeling framework (Schellekens et al., 2020), which is based on PCRaster (Karssenberg et al., 2010) and Python. Figure 2 shows the different processes and fluxes that are part of the 120 5 https://doi.org/10.5194/hess-2021-605 Preprint. Discussion started: 1 December 2021 c Author(s) 2021. CC BY 4.0 License. wflow_sbm hydrological concept. The soil part of wflow_sbm model is largely based on the Topog_SBM model (Vertessy and Elsenbeer, 1999), that considers the soil as a "bucket" with a saturated and unsaturated store. For channel, overland and lateral subsurface flow a kinematic wave approach is used, similar to TOPKAPI (Benning, 1995;Ciarapica and Todini, 2002), G2G (Bell et al., 2007), 1K-DHM (Tanaka and Tachikawa, 2015) and Topog_SBM (Vertessy and Elsenbeer, 1999). Wflow_sbm has a simplified physical basis with parameters that represent physical characteristics, leading to (theoretically) an easy linkage of 125 the parameters to actual physical properties. Topog_SBM is mainly used to simulate fast runoff processes during discrete storm events in small catchments (< 10 km2) (evapotranspiration losses are ignored). Since evapotranspiration losses and capillary rise were added to wflow_sbm, the derived wflow_sbm approach can be applied to a wider variety of catchments. The main differences of wflow_sbm with Topog_SBM are: -The addition of evapotranspiration and interception losses (Gash model (Gash (1979)) on daily time steps or a modified 130 Rutter model on subdaily time steps (Rutter et al., 1971(Rutter et al., , 1975).
-The addition of a root water uptake reduction function (Feddes et al., 1978).
-The addition of capillary rise.
-The addition of glacier and snow build-up and melting processes.
-Wflow_sbm routes water over an eight direction (D8) network, instead of the element network based on contour lines 135 and trajectories, used by Topog_SBM.
-The option to divide the soil column into any number of different layers.
-Vertical transfer of water is controlled by the saturated hydraulic conductivity at the water table or bottom of a layer, the relative saturation of the layer, and a power coefficient depending on the soil texture (Brooks and Corey, 1964).

140
We derived 3 model instances at varying spatial model resolution that cover a 3km, 1km, and 200m grid. The parameter sets were derived using the hydroMT software package (Eilander and Boisgontier, 2021). The data sources for deriving parameter sets are open-source global data sets. These include topography, surface water, landcover & landuse, soil, meteo, and river gauge data. The PTF to estimate soil properties is based on Brakensiek et al. (1984). An overview of the data and references are provided in Table 1.

Model Runs & Calibration
While for most parameters of the wflow_sbm model a priori estimates can be derived from external sources as explained above, a single non-distributed parameter needs to be calibrated for each basin: the saturated horizontal conductivity often expressed as a fraction (KsatHorFrac) of the vertical conductivity. To do this calibration the model simulations (3km, 1km, and 200m) Figure 2. Overview of the different processes and fluxes in the wflow_sbm model (Schellekens et al., 2020). The calibration procedure finds an optimal parameter value based on the KGE objective function of streamflow estimates at the basin outlet. A single non-distributed parameter was calibrated, the saturated horizontal conductivity fraction (KsatHor-Frac). This parameter cannot be derived from external data sources because it compensates for anisotropy, unrepresentative 155 point measurements of the saturated vertical conductivity, and model resolution (Schellekens et al., 2020). Increasing this pa-rameter leads to an increased base flow component and reduces peak flow and flashiness. We calibrated the models to match model setups of those used by the users of the hydrological model.

eWaterCycle platform
This research was conducted within the eWaterCycle platform (Hut et al., 2021). eWaterCycle follows by design the FAIR 160 principles of data science (Wilkinson et al., 2018) and allows high level communication with models regardless of programming language through the Basic Modelling Interface (Hutton et al., 2020

Benchmark Selection
To select basins where model performance is at least reasonably well, we applied a statistical benchmark to beat. The use of a benchmark allows for better interpretation of objective function based results (Garrick et al., 1978;Pappenberger et al., 2015; 170 Schaefli and Gupta, 2007;Seibert, 2001;Seibert et al., 2018;Knoben et al., 2020). We adopt, in part, the same methodology for statistical benchmark creation as Knoben et al. (2020). The benchmark is created by calculating the mean and median of streamflow observations per calendar day of the 10 year evaluation period and the results are compared to streamflow predictions from the model. KGEs are calculated for the observed streamflow versus these mean and median values. The mean is expected to better represent larger basins with more stable flow regimes whilst the median better covers the flashiness of 175 smaller headwater basins. This benchmark serves as a lower boundary for the model predictions: if the model, for any of the three resolutions, has a lower KGE than either the median or mean flow, it is considered not suited for this study and removed from the list of basins.

Comparison of Streamflow Estimates
To provide more context to the results in terms of general model performance, we compared the streamflow estimates from 180 wflow_sbm to those of the study by Knoben et al. (2020). The study by Knoben et al. (2020) ran 36 conceptual models using the Modular Assessment of Rainfall-Runoff Models Toolbox V1.0 (MARRMoT) (Knoben et al., 2019a, b) on the CAMELS data set. First, we calculated the mean of the 36 models for each basin. Next, we ensured a match between the basins under investigation by both studies.
The inter-model (instance) comparison of the streamflow estimates in this study is assessed using a cumulative distribution 185 function (CDF). We applied the Kolmogarov-Smirnoff test (Kolmogorov, 1933;Smirnov, N.V., 1933) to test if the differences between the KGE score distributions of the model instances are statistically relevant. Allowing the acceptance or rejection of the hypothesis stating that the differences in streamflow estimates at various spatial resolutions will be small.

Results
The results in this section are based on the KGE objective function at the basin outlet for the evaluation period. Results based on 190 the Nash-Sutcliffe Efficiency (NSE) (Nash and Sutcliffe, 1970) are available in the repository (DOI:10.5281/zenodo.5724576).The first part describes the effect of spatial scaling on model calibration, model evaluation and terrain characteristics. The second part describes the streamflow analyses results for the CAMELS data set.

The effect of calibration on streamflow estimates
To illustrate how model calibration effects each model instance, we first show the calibration curves of a single basin (ID:14301000).

195
To avoid presentation bias we selected a basin with moderate performance and only show the last year of calibration. for KsatHorFrac values in relation to model resolution or geographic location were found after calibration.

The effect of spatial scale on terrain characteristics
We illustrate the effect of spatial scaling on the parameter set of the 3 model instances by showing the difference in topography and drainage density for 3 basin. To avoid presentation bias, the basins were sampled based on poor streamflow performance (ID:06878000), moderate performance (ID:02231342), and good performance (ID:06043500). Figure 4a shows the probability density function (PDF) of the height distributions of the model instances for each of the 3 basins, Figure 4b shows the slope 210 distribution, and Figure 4c the profile curvature distributions.
The height distribution of the DEM in Figure 4a shows, most clearly for basin ID 02231342, how the representation of the highest altitudes is underestimated by the 3km model instance (orange) compared to the 200m instance (green) and to a lesser extent the 1km instance (red). Essentially, at coarser spatial resolution the terrain is flattened at high altitudes. An opposite effect is shown in this basin for the lower altitudes where the finer resolution instances better capture gentle slopes that are 215 flattened at coarse resolution. This effect is also detectable in the slope and profile curvature PDFs shown in Figures 4bc. As can be expected from the height distributions, the slope of the 200m instance has more gentle and steep sloping topography than the 1km and 3km instances. This is shown by the narrower slope distribution for the coarse spatial resolution that broadens with finer resolution. The differences in the mean slope of the basins between model instances is marginal, e.g. 0.00019 m*m-1 for basin ID 02231342. The profile curvature in Figure 4c indicates whether a slope is linear (values close to 0), concave (values 220 smaller than 0), and convex (values larger than 0). The 3km and 1km instances show similar slope geometries. With the 3km instance having slightly more linear slopes. At the finest resolution (200m) the slopes geometry shifts from linear slopes to either convex or concave curvature profiles.
In addition to topography we calculated the drainage density for each of the model instances defined as total river length divided by basin area. The results in Table 2 show small differences between the model instances for each of the 3 basins.  that equal 0 indicate linear slope geometry, smaller than 0 concave slope geometry, and larger than 0 convex slope geometry.

The effect of spatial scale on streamflow estimates
The same 3 example basins are used to illustrate the differences in streamflow estimates between model instances for the evaluation period. Only the last year of the evaluation period is shown in Figure 5.  In the case of poor performance, Figure 5a, it may occur that the model instances are overestimating streamflow during peak flow. The best performing model instance has the smallest peak flow estimates which in many cases is the coarsest spatial 230 resolution instance (3km). Of the 454 basins, 3km instance has the lowest peakflow estimates in 279 occurrences of which the model is best performing 148 times. The example in Figure 5b shows that when the 1km model instance is best performing this often occurs in conjunction to relatively similar performance of the 200m model instance. In 78.7% of the cases where the 1km instance is best performing the difference in KGE score with the 200m instance is smaller than 0.1. The final example in Figure   5c illustrates when the finest spatial model resolution (200m) is best performing and the coarsest (3km) least performing. The 235 200m model instance best captures peak flow and the receding limbs of the hydrograph.

Benchmark selection
We calculate a statistical benchmark to determine basins where the streamflow estimates of the model instances are deemed adequate for further analyses. The best performing type of climatology of calendar day benchmark, either mean or median, is depicted in Figure 7a. Figure

Streamflow estimates of model instances
The KGE score results for the evaluation period are shown in Figure 5 in the form of an Cumulative Distribution Function (CDF). The KGE score distribution of the mean of 36 hydrological models from Knoben et al. (2020) is included and is 250 referred to as "MARRMoT mean results". Of note is that comparison between studies is not one-on-one due to differences in model run periods, forcing, and numerical solvers. We can, however, obtain information about general model performance between both studies.   When we consider only the wflow_sbm instances, approximately 64 % of the results of the model instances are higher than 0.50 KGE score and of those 18 % are higher than 0.75 KGE score. The distributions cross at multiple points, for example at the bottom 10 % of the distribution the 3km instance has the highest and the 1km the lowest KGE score. At 40 % of the distribution and lower the 200m instance is followed by the 1km and 3km instances in terms of highest KGE score.
Next, we apply the Kolmogorov-Smirnov (KS) statistic to test whether the CDF of the model instances statistically differ 265 from each other, for a given p-value of 0.05. The KS-test results in Table 3 show that the difference between 3km and 200m model instances is statistical relevant at a p-value of 0.01. On a large-sample, this means that increasing the spatial model  The CDF does not provide information at a basin level. To gain insight into the spatial distribution of the KGE scores of

Discussion
We applied an initial benchmark based on streamflow observations for basin selection to identify basins where streamflow estimates are deemed adequate for further analyses. This does not imply that excluded basins are less relevant. Instead it implies that an in-depth model assessment is required to understand why the model is not able to simulate adequate streamflow 285 estimates in these basins. In addition to the benchmark, we added a layer of context by including results from the study by  Knoben et al. (2020). This is an imperfect comparison due to differences in inputs, numerical solvers, and simulation period.
However, the results do provide information on general model performance. The results show that the wflow_sbm streamflow estimates are inline with estimates of the mean of the 36 MARRMoT models. The spread of results is smaller for the 36 MARRMoT models which is due to averaging and likely the more extensive calibration routine of the conceptual models.

290
To improve future comparative work, we advocate for the creation of model output storage guidelines that use the CAMELS data set as case study area. These guidelines should encompass the differences between hydrological models, such as distributed and non-distributed modelling grids. A first step is the inclusion of distributed data sources in the CAMELS data set, e.g. meteorological data. We further propose the use of a model experiment environment, such as eWaterCycle (Hut et al., 2021), to generate model results. This allows for similar pre-processing of inputs, standardization of outputs, and reproducible 295 modelling studies. An example of how to apply these steps using the eWaterCycle platform is provided in Jupyter Notebooks that supplement this publication (DOI:10.5281/zenodo.5724512). The ease of setting up a model experiment and storing output is an incentive for users to store model results while conducting extensive modelling studies even when results are deemed not suitable for publication.
At the start of the study we hypothesized that differences between model instances would be small due to quasi-scale We recognize that large-sample assessments obscure variations in simulations between instances due to the sample size. On 315 a basin level we find that local scaling issues are in effect throughout the spatial domain. This is depicted by the differences between the KGE scores of the model instances ( Figure 8c). On average these is a 0.22 KGE score difference with extremes of more than 0.5 KGE score difference at multiple basins. We find that the 1km instance is best performing in basins where the difference between minimum and maximum KGE score of the 3 instances is small (Figure 8c). We attribute this partly to the calibration routine finding a better optimal parameter value (KsatHorFrac) as results are often close to those of the 200m 320 instance. For the best performing 3km or 200m model instances (Figure 8d) there are only small geographical trends of best model performance in the South and Appalachia.
We conducted a terrain analyses (Figure 4) to identify changes in terrain characteristics due to spatial resolution that might explain the differences between model instance streamflow estimates. Minor effects are depicted by the profile curvatures present at 200m resolution. Slopes are less linear at fine resolution than at coarse resolution. The effect on the hydrological 325 response is, however, expected to be small as stated by Bogaart and Troch (2006). Similarly small changes are found for the differences in drainage density between model instances. This confirms that the drainage network up-scaling method of We applied the same meteorological forcing products and downscaling routine for each model instance. This ensured that for example the total volume of precipitation remained consistent across scales. A coarse grid cell contains a volume of precipitation that is equally redistributed over the equivalent amount in size of finer grid cells. In reality this redistribution 340 of water might not be equal across the finer grid cells and therefore scaling behaviour is introduced due to the locality of precipitation. This has an effect on the streamflow estimates as the locality of precipitation directly influences hydrological processes that are dominant at different locations (e.g. hill slope). It is of interest to investigate this effect on streamflow estimates and to determine the role of native spatial meteorological forcing resolution. research would also benefit from including basin and climate characteristics of the CAMELS data set in order to find statistical relationships explaining locality in results.

Conclusions
The aim of this study was to analyse the effects of spatial scaling on the streamflow estimates of the distributed wflow_sbm hydrological model. Distributions of model instance KGE score results were tested for significant differences. A spatial distri-370 bution assessment was conducted to derive spatial trends from the results. The main findings of the study are the following: -The difference in the distributions of streamflow estimates of the wflow_sbm model derived at multiple spatial grid resolutions (3km, 1km, 200m) is only statistically significantly different between the 3 km and 200 m model instances (P<0.05). This confirms at an aggregated level the hypothesis that differences between model instances are small due to quasi-scale invariant parameter set and process descriptions that remain constant across scale in the hydrological model.

375
-Results show large differences in maximum and minimum KGE scores with an average of 0.22 between model instances throughout the CONUS.
-There is no single best performing model resolution across the domain.
-Changes in terrain characteristics due to varying spatial resolution influence the lateral flux partitioning of the wflow_sbm model and might be an important cause for differences in streamflow estimates between model instances.

380
This study answered where locality in results are strong due to spatial scaling effects. Future research should conduct an in-depth assessment of basins where differences in streamflow estimates and lateral fluxes are large due to spatial scale. This will lead to a better understanding of why and under what conditions locality in spatial scaling related issues occur.
Code and data availability. The software that supplements this study is available at: https://github.com/jeromaerts/eWaterCycle_example_notebooks.