Hyper-resolution PCR-GLOBWB: opportunities and challenges of refining model spatial resolution to 1 km over the European continent

. The quest for hydrological hyper-resolution modelling is already on-going for more than a decade. 10 While global hydrological models (GHMs) have seen a reduction in grid size, thus far they never have been consistently applied at hyper-resolution (<= 1km) at the large scale. Here, we present the first application of the GHM PCR-GLOBWB at 1 km over Europe. We thoroughly evaluated simulated discharge, evaporation, soil moisture, and terrestrial water storage anomalies against long-term observations, and subsequently compared results with the ‘established’ 10 km and 50 km resolutions of PCR-GLOBWB. Subsequently, we could assess the 15 added value of this first hyper-resolution version of PCR-GLOBWB as well as assess scale dependencies of model and forcing resolution. Eventually, these insights can help understanding current challenges and opportunities of hyper-resolution models and formulating model and data requirements for future improvements. We found that for most variables epistemic uncertainty is still large and issues with scale commensurability exist with respect to the long-term yet coarse observations used. Merely for simulated discharge we can confidently 20 state that model output at hyper-resolution improves over coarser resolutions due to better representation of the river network at 1 km. However, currently available observations are not yet widely available at hyper-resolution or lack sufficiently long timeseries, which makes it difficult to assess the performance of the model for other variables at hyper resolution. Here, additional model validation efforts are needed. At the model side, hyper-resolution applications require careful revisiting of model parameterization and possibly implementation of more 25 physical processes to be able to


Introduction
Climate change is projected to impact patterns and intensity of precipitation and temperature world-wide, resulting in increased occurrence of hydrological extremes and, in turn, to rising hazard probability and risk globally (Dottori et al., 2018;Hirabayashi et al., 2013;Winsemius et al., 2016;Ward et al., 2020b;van der Wiel et al., 2019). To explore large-scale climate adaption options and mitigation measures, as well as the efficacy of risk 5 management plans, output from global hydrological models (GHMs) is increasingly employed as a basis for discussion and policy-making. Examples are the Aqueduct Flood Analyzer (Aqueduct Global Flood Analyzer, 2021;Ward et al., 2020a), the Aqueduct Water Risk Atlas (Hofste et al., 2019), the WWF Water Risk Filter (WWF, 2022), and water2invest (Straatsma et al., 2020; water2invest web service, 2021).
One of the main advantages of GHMs is that their outputs are readily available and provide spatially and 10 temporally consistent estimates of hazard probability and risks (Bierkens, 2015). However, global climate change impacts are often evaluated at the level of river basins or at country-scale. Therefore, it is critical that the use of GHMs, which typically employ a relatively coarse spatial resolution (≥ 5 arc-min, equivalent to around 100 km 2 at the Equator), is fit for purpose and the current GHMs are used as intended (Ward et al., 2015). Nevertheless, the actual impact of hydrologic hazards happens at the local level and may vary greatly within countries which 15 pushes the boundaries of applicability of the current generation of GHMs. To capture these important spatial variations and to eventually make GHMs "locally relevant" (Bierkens et al., 2015) and usable by local water managers (Beven and Cloke, 2012), it is hence the current understanding that models need to move towards "hyper-resolution" Bierkens et al., 2015), that is to a spatial resolution below 1 km 2 (30 arcseconds). With hydrologic hazards already having increased and projected to increase further (see e.g. Winsemius 20 et al. (2016), Alfieri et al. (2017), Samaniego et al. (2018), van der Wiel et al. (2019), and the work of the World Weather Attribution initiative (Kew et al., 2021;Philip et al., 2019)), it is paramount to better understand their impact not only for entire economies, but also more granular on the sub-national to community level. Once actionable local model output becomes available, large-scale and target-oriented adaptation and mitigation measures can be devised. 25 Global models have their own raison d'etre as discussed in Ward et al. (2015), for instance analyses over datasparse areas or continuous trans-boundary simulations, thereby complementing bespoke local and country-scale models which may already be able to run at hyper-resolution. Moving towards hyper-resolution may potentially improve these analyses and, as a result, their meaningfulness and applicability. However, due to issues with, inter alia, model parameterization as well as computational demand, running GHMs at a meaningful and actionable 30 spatial resolution over large extents was considered to be a "grand challenge" (Bierkens et al., 2015;Wood et al., 2011).
With computational power increasing and input data sets becoming available at ever finer spatial resolutions, the spatial resolution of current GHMs has increased in past years too. Currently, most GHMs can be run globally at a resolution of 5 to 6 arc-min according to Bierkens et al. (2015). However, a grid cell at a resolution of 5 arc-min 35 still translates to roughly 10 2 km 2 at the Equator. Recently, various modeling attempts were made to further increase spatial resolution or to pave the way towards hyper-resolution. For example, the HydroBlocks model (Chaney et al., 2016) was coupled with remotely sensed data to obtain soil moisture estimates at 30 m spatial resolution over the continental United States of America (CONUS) (Vergopolan et al., 2020(Vergopolan et al., , 2021. While HydroBlocks brings the simulations to an effective 30 m resolution, it still takes advantages of grouping regions 40 based on hydrological similarity to reduce computational demands. O'Neill et al. (2021) applied and evaluated the hydrological model ParFlow (Maxwell et al., 2015) configured over the CONUS at 1 km spatial resolution covering the years 2003 until 2006, constituting the first-of-its-kind studies covering such an extensive area at 'true' hyper-resolution. Aerts et al. (2021) assessed changes of model accuracy of the hydrological model wflow_sbm (Schellekens et al., 2020) at multiple resolutions below 1 km 2 for the CAMELS dataset (Newman et 45 al., 2015;Addor et al., 2017) again solely covering the CONUS. While the spatial resolution applied is 'hyperresolution', the lack of either spatially-continuous simulations over large extents or short simulation periods make it challenging to define a substantial benchmark that is representative to the realm of GHMs. Other studies already proposed scalable routing networks (Thober et al., 2019) and measures how to seamlessly model across scales (Samaniego et al., 2010 which are essential when moving towards hyper-resolution. 50 On a more general note, the call for "hyper-resolution" is strongly driven by an expected increase of model accuracy and consequently applicability. One hypothesis thereby is that particularly locations with a small upstream area will profit as more interactions will be simulated there and that representation of the storage and response times in these regions are improved. The question whether this is really the case depends also on the ability to populate the many fine resolution cells with accurate parameters Beven and Cloke, 5 2012). It is believed that this is not possible without additional locally specific information and as such Beven et al. (2015) concluded that hyper-resolution modelling will not achieve the necessary accuracy. Thus, we first need to establish an understanding to what extent accuracy of GHMs is improved when only their resolution is increased while still employing globally available datasets currently used for parameterizing GHMs. Even though models at different resolutions should ideally use bespoke parameterization and input data when applied at these three 10 resolutions, we kept both as unaltered as possible to be able to better separate the impact of spatial resolution.
A second avenue towards improved model accuracy is the use of more accurate meteorological forcing. In studies using GHMs, it is shown that particularly the quality and spatial resolution of meteorological variables, notably precipitation, can have a pronounced impact on model accuracy Towner et al., 2019;Biemans et al., 2009). Hence, a second question is raised whether it is needed to go all the way to "model hyper-resolution" 15 or if using "forcing hyper-resolution" with current GHMs already suffices. This would reduce the need for updated model parameterization and design as well as greatly shorten run times and reduce model data storage requirements. A locally relevant spatial resolution could then be achieved using post-processing tools, for instance.
The main objective of this paper is therefore to increase our understanding of scale dependencies of both the 20 model and the meteorological input and formulate ways forward towards meeting the above-named "grand challenge". To that end, we developed a 30 arc-seconds (~1 km) version of the existing 10 km PCR-GLOBWB model (Sutanudjaja et al., 2018) for the entire European continent and applied it over the period 1981 until 2019. As such, it constitutes a first-of-its-kind effort to set-up, run, and evaluate a GHM at hyper-resolution over multiple decades. We acknowledge that this constitutes only a continental-scale application of a global model, but study 25 set-up and choice of data sets has been made such that both methods and findings can potentially be transferred to the global scale. The main reason to focus on Europe only is the data richness there improving our ability to better create and evaluate models. The model was used to test how a stepwise increase in spatial resolution (30 arc-min/~50 km, 5 arc-min/~10 km, 30 arc-sec/~1 km) impacts model accuracy. To test the impact of forcing resolution, the model simulations at the different spatial resolutions were forced with ERA5-Land data (Muñoz-30 Sabater et al., 2021) at the same spatial resolutions or coarser. Finally, we combine both experiments to disentangle the relative importance of the model and forcing resolution. Except for the precipitation forcing, all simulations make use of the same input and parameter sets and are downscaled or upscaled only if necessary. In our analysis, we focus not merely on simulated discharge as done by Aerts et al. (2021), for instance, but also on terrestrial water storage anomalies, evaporation, and soil moisture to provide a more holistic picture of model performance 35 than what is typically the norm (see O'Neill et al., 2021). Such a very first thorough analysis of a hyper-resolution GHM is in the first place not intended to claim that hyper-resolution should be the 'new normal' for GHMs, but rather to gather opportunities and challenges of the scale dependencies of refining the spatial resolution of the model and its forcing. These insights may help identifing the most effective model development pathways and advancing current efforts to produce spatially consistent and locally relevant estimates at (sub-)kilometer scale. 40

Study area and input data
In addition to the already existing global schematizations of PCR-GLOBWB at 30 arc-min (roughly 50 km cell length at the Equator) and 5 arc-minutes (roughly 10 km cell length at the Equator), respectively, we developed a novel 30 arc-seconds (roughly 1 km cell length at the Equator) schematization. 45 The study period for which each run was executed is 1981 until 2019 at a daily timestep. However, to reduce storage requirements and due to the timestep of most observations, our evaluations focused on the monthly timestep except for discharge analysis where both daily and monthly timestep were used. Each of the model resolutions 1 was forced with precipitation at the model resolution or coarser, yielding in total six combinations of model and forcing spatial resolutions (see Table 1). For all three spatial resolutions, identical model input data and parameter sets are used to guarantee comparability 5 across set-ups. The 'default' resolution is 1 km, meaning that all model development including runoff, soil, and groundwater processes, was geared towards this resolution and all other resolutions are derivatives. A detailed overview over datasets used for model parameterization and forcing input can be found in A1. Appendix A, including the scaling techniques used to derive them at various resolutions (i.e., 1 km, 30 km, and 50 km). It should be mentioned here that resampling the default 1 km data to coarser resolution affects model schematization 10  although greatest care was taken to minimize this effect. Hence, the impact of using different model resolutions is not assessed in isolation but rather in its bigger context of how the model schematization would appear for a given model resolution. Each run includes water demand and use for irrigation, industry, livestock, and household as well as accounts for reservoir storage based on the GranD database . Note that none of the runs were calibrated to not have differences in parameterization affect the 15 comparability across different model resolutions and forcing resolutions.

Model evaluation
To fully grasp the impact of an increase in spatial resolution on hydrological states and fluxes, various model output variables were compared to observations: simulated discharge, total evaporation, terrestrial water storage anomaly, and upper layer soil moisture. When selecting the observational data sets, we preferred a long record 20 over spatial resolution due to two reasons: i) because a long observational record is needed to establish a benchmark of hyper-resolution as intended here, potentially also capturing interannual variability and hydrological extremes; ii) the coarser-scale simulations greatly reduce the added value of fine-resolution observations. Additionally, we benchmarked the computational demand of the various runs performed. While discharge was evaluated at locations, the other variables were evaluated at the water province level (Straatsma et 25 al., 2020). Hereby water provinces smaller than the coarsest cell size of the observational data used (here: the original 3-degree footprint of the GRACE/GRACE-FO data, see section 2.2.4) were merged with the adjacent province having the longest common border. While the use of these water provinces, which serve as a 'common denominator' across the multiple spatial resolutions used, allows us to make a fair comparison across the different model set-ups, it also has the downside that some level of detail is lost, particularly for the simulations with finer 30 spatial resolution. Nevertheless, there are still sufficient water provinces to make robust conclusions.
It should be noted that by evaluating the 1 km hyper-resolution outputs at the water province level, we can only assess the overall performance of these simulations, rather than site-specific improvements. While it would be preferable to evaluate the model simulations at the native 1 km resolution this is difficult due to the coarse spatial resolution of some of the validation datasets. 35

Simulated discharge
To evaluate simulated monthly discharge, we employed data from the Global Run-off Data Centre (GRDC). To assess the accuracy of monthly simulated discharge, we determined the Kling-Gupta Efficiency (KGE; Gupta et al. (2009)). Based on Knoben et al. (2019), a value of KGE ≥ -0.41 indicates that the model improves upon the mean flow benchmark.
For meaningful analyses, we used only GRDC stations from our dataset which met the following requirements: i) the timeseries of a station extends at least until 1991; ii) the timeseries of a station contains at least 10 years of data; iii) the catchment area of a station is at least 400 km 2 , which is equivalent to four upstream cells for the 10k 5 resolution. This yielded in total 564 stations. Since this large number of stations did not allow manual matching of their locations and the model river network, we determined the corresponding model cell by searching for the smallest deviation between mean observed and simulated discharge in proximity of the location as specified by GRDC. This radius was determined by testing, and we consider it a balanced distance between sufficient search space and allocating discharge stations over non-realistically large distances. 10 For all runs, we evaluated the KGE distribution as is and additionally compared KGE values as function of upstream catchment area per station. This way we expect to obtain a better picture whether finer spatial resolution has a more marked impact on upstream stations compared to downstream stations. To analyse this, we categorized all stations depending on their catchment area with a minimum catchment area of 400 km 2 due to the abovedescribed selection criteria. Stations with a catchment area falling in the 25 % quantile were considered upstream 15 stations, those falling in the 75 % quantile downstream stations, and all remaining ones midstream stations. One hypothesis is that a finer model spatial resolution improves especially discharge simulations in upstream areas due to the refined channel network and thus better representation of upstream processes.
Per station, the change of obtained KGE between two different runs can be compared by computing the KGE skill score (KGEss) based on the subsequent equation (Towner et al., 2019): 20 Here, KGEa is the KGE of the model run under consideration, while KGEref is the KGE of the reference run for which we use the 50k_50k run. Positive KGESS indicates improved skill, whilst a negative score represents a decrease in skill.
Thus far, we focused on monthly discharge results only as it reduces file size and makes the required analyses 25 manageable. Moreover, the runs did use the simpler routing scheme implemented in PCR-GLOBWB, named 'accuTravelTime', which has limitations on accurately simulating discharge since it relies on a characteristic method and fixed flow velocities. For daily discharge simulations, particularly in larger channels, a kinematic or dynamic wave approximation is needed. We did not apply these as it would lead to very large computation times at 1 km resolution under the current PCR-GLOBWB code. Nevertheless, we think it may add to the evaluation if 30 we also assessed how fast model processes, for which a short temporal resolution is crucial, profit from an increased spatial resolution of the model, as argued by Melsen et al. (2016). To that end, we compared KGE values obtained with both daily and monthly output per catchment area category.

Upper soil moisture
For validating the average soil moisture of the upper layer, we compare simulations with remotely sensed ESA-35 CCI soil moisture data v06.1 (Dorigo et al., 2017;Gruber et al., 2019). The native temporal resolution is daily and the spatial resolution is 0.25 degree (roughly 25km at the Equator). It is noteworthy that the observed data is representative for the upper 5 cm of the soil column whereas simulated upper soil moisture from PCR-GLOBWB is an average of the first 30 cm. It is also due to this difference that applying the KGE across all evaluated variables is not feasible. It should additionally be mentioned that an older version of ESA-CCI soil moisture data is one of 40 the input data for GLEAM evaporation data (see section 2.2.3). Nevertheless, we can still use the EDA-CCI data as indicator for actual soil moisture dynamics. By computing the spatial mean per water province per time step for both observation and simulation, the resulting timeseries were used to derive the relative root-square-meanerror (RRMSE) per water with the subsequent equation: Here, RMSE(obs, sim) is the root-square-mean-error between observation and simulation timeseries, and σ(obs) the standard deviation of the observation timeseries.

Total evaporation
Evaporation makes up a significant factor in in the terrestrial water cycle as well as water balance, strongly 5 determining water availability. Accurate simulation of evaporation is therefore paramount. Unfortunately, there is only little independent observation data available with sufficient temporal and spatial variation and extent.
Therefore, to benchmark simulated total evaporation per water province, we used data from the remote sensingbased dataset GLEAM version 3.5a (Global Land Evaporation Amsterdam Model; Gonzalez Miralles et al., , Martens et al., (2017)), covering the period from 1980 to 2020. GLEAM v3.5a has a native monthly 10 temporal resolution and a spatial resolution of 0.25 degree (~25 km). Since the algorithms implemented in GLEAM aim much more on correctly simulated correct evaporation than PCR-GLOBWB, and because GLEAM employs different (remotely sensed) input data sets, the choice to use a partly model-based benchmark is defensible. Similar as for upper soil moisture evaluation, the RRMSE was determined per water province.

Terrestrial water storage anomaly 15
By validating model total water storage against remotely sensed terrestrial water storage, it is possible to evaluate the model's storage dynamics. As such, it provides a more holistic validation including the entire hydrological system state.
Here, simulated terrestrial water storage anomaly was validated against JPL-Tellus GRACE/GRACE-FO (Kornfeld et al., 2019) data for the period 2002 until 2019. While the native spatial resolution of the data is at 3 20 degrees, the data used here is at 0.5 degree. Again, monthly averages were computed for both observation and simulations to obtain the RRMSE per water province.

Simulated discharge
We see that simulated discharge, as expressed by the KGE value averaged over all GRDC stations per water 25 province, improves greatly with finer model spatial resolution (Figure 1, Figure 2; Table 2). While the 50k_50k run yields a negative median KGE values, this picture improves for finer resolutions with the 1k_1k and 1k_10k runs comfortably exceeding the accuracy threshold of -0.41 as defined by Knoben et al. (2019). Higher skill with finer resolution is in line with previous research (Altenau et al., 2017), although this is should not be expected to be the case for all stations due to locality effects in the scaling process (Aerts et al., 2021). As in our case only 30 around 7 % of all stations do not show an improvement at all but roughly 70 % of the stations receive their highest KGE values for the 1k_1k run ( Figure B1), results nevertheless indicate a robust relation between model resolution and accuracy of simulated discharge.  Results furthermore indicate that the impact of forcing spatial resolution is present but limited to the 1k runs, and 5 even there only the distribution of KGE values contains slightly higher values (as expressed by the whisker extents in Figure 2) while mean KGE values are close to identical. Overall, results suggest that model resolution is a more important determinant of the accuracy of simulated discharge. Due to the minor differences found across forcing resolutions, the subsequent discharge analyses focus on results from the 1k_1k, 10k_10k, and 50k_50k runs.
When analysing results per catchment area category, they also show that KGE values increase when moving to 1 10 km hyper-resolution, with the magnitude of improvement depending on upstream area ( Figure 3). In line with previous results and our hypothesis, employing finer resolutions than the reference 50 km has the most pronounced effect at upstream locations where the 50 km model is seemingly less capable of reproducing observed discharge. For midstream and downstream stations, KGEss values show the greatest improvement for the 1k_1k run, and hence a clear beneficial effect of hyper-resolution hydrological modelling. In fact, simulated discharge at 1 km 15 improves over either 10 km or 50 km or both at around 93 % of the GRDC stations assessed (see Figure B1). In addition to our hypothesis that hyper-resolution is beneficial in upstream areas, these results suggest that a better representation of smaller streams contributes to more accurate discharge simulations at large.  Another hypothesis we tested is whether the accuracy of fast processes simulated at daily resolution, such as discharge generation, improve compared to longer aggregation periods, for example months, with finer spatial 10 resolution. This is in line with Melsen et al. (2016) who argues that 'the calibration and validation time interval should keep pace with the increase in spatial resolution'. Based on Figure 4, this hypothesis cannot be confirmed for the PCR-GLOBWB model. We do, however, want to stress that these results do not mean that moving to hyper-resolution does not yield improvements for shorter temporal intervals at all. It is highly likely that the simplistic routing schematization used in this experiment limits the improvement of daily discharge dynamics. 15 Using more advanced routing schemes was not feasible due to the resulting high computation demand. It can thus be assumed that the here presented daily discharge estimates do not fully represent the potential benefits which could be achieved with hydrological simulations at hyper-resolution.

Upper soil moisture
To assess the model skill in reproducing upper soil moisture, we analysed RRMSE values obtained per water province ( Figure 5). Overall, we find areas with consistently low and high RRMSE values. Particularly the UK, Ireland, southern Sweden and Norway, the Rhine-Meuse delta, and a larger area centred around Austria have high RRMSE values. Large parts of Spain, Portugal, and Finland as well as northern Sweden show overall good model 5 performance as indicated by low RRMSE values. Comparing RRMSE values across both model resolutions ( Figure 6A and Table 2) reveals that the 10k runs, when 10 benchmarked against the 1k run, produce overall lower RRMSE values and are thus more accurate, although differences between median values are overall small especially between the 1k_1k and 10k_10k runs. It is in line with our expectations that the 50k run has the highest RRMSE values. Figure 6B and Figure 6C compare RRMSE values across forcing resolutions, indicating that employing coarser forcing resolution decreases model accuracy for the 1k runs while not having significant impact on the 10k run (see also Table 2)a similar pattern to that 15 found in the discharge evaluation.
It should be noted, however, that observed and simulated soil moisture values are based on different soil depths, which may have a not further quantifiable impact on analysis results. The fact that a similar pattern is found when assessing timeseries variability as expressed by the coefficient of determination R 2 indicates that the relation between observation and simulations is well-represented also when assessing RRMSE values ( Figure B2). 20 Ranking the accuracy of the runs by their mean RRMSE value, we see that the 10k runs perform best, followed 5 by the 1k_1k run ( Table 2). The fact that the 10k_10k runs clearly outperform the 1k_10k and 1k_50k runs indicates that a too coarse forcing resolution can eradicate potential benefits of finer model resolution, and that employing a coarser model resolution with a (relatively) fine forcing resolution may lead to similar accuracy.

Total evaporation
Results indicate that model accuracy is lowest for the 1k runs, followed by the 50k run (Figure 7, Figure B3A; 10 Table 2). The 10k runs show overall best performance in resembling evaporation estimates from GLEAM. We furthermore find that RRMSE values are smaller and range between 0.22 and 1.78, indicating overall better agreement between simulation and observation than for surface soil moisture. The deterioration of model accuracy at 1 km scale is visible over the entire European continent except for Great Britain where we see improvement when moving towards 1 km model resolution. 15 In line with the soil moisture evaluation, employing a coarser forcing resolution increases RRMSE values for the 1k runs, and hence decreases model accuracy, but has no impact on the 10k runs ( Figure B3B, Figure B3C; Table  5 2).

Terrestrial water storage anomalies
As with all previous evaluations, evaluating terrestrial water storage (TWS) anomalies indicates that coarser forcing resolutions yield reduced model accuracy, but in this case the impact is smaller than for previously assessed variables ( Figure B4B, Figure B4C; Table 2). 10 Figure 8 shows that there are distinct hotspots (Scandinavia, the UK and Ireland, Italy, and water provinces along the Atlantic and Mediterranean coast) where model accuracy is limited, regardless the model resolution or forcing resolution employed. Furthermore, the same pattern occurs as with total evaporation, namely that overall RMSE values are lowest for model resolutions coarser than 1 km, in that case the 50k run ( Figure B4A; Table 2). These findings, again, may be related to the fact that the large original footprint of the GRACE data of 3 arc-min or 15 roughly 300 km cell length, despite downscaling it to 0.5 degree, favours coarser resolutions. In fact, the match in spatial resolution may be explaining some of the good performance of the 50 km run. Consequently, the (potentially) added value of finer model resolutions may not be captured when using GRACE data for evaluation.

Model parameterization
In the debate of "hyper-resolution", the role of model parameterization is a major topic. Here, we did not attempt to improve model parameterization or replace parameterized processes with empirical scaling relations when moving from coarse to fine resolution. This allows us to make inferences about the importance of scale-consistent 10 model parameterization and where using the thus far default approach (simple parameter upscaling and changing forcing resolutions) may reach its limits.
As an example, we assessed in more detail the parameterization for the groundwater response times, which has a major impact on groundwater storage dynamics. Our hypothesis is that it needs to be adjusted for finer spatial resolutions as it does not (yet) correctly scale with the underlying drainage network and therefore results in the 15 lower accuracy when evaluating TWS anomalies of the 1km runs (section 3.4). Consequently, model response, particularly with respect to regional groundwater dynamics, may be too slow at model resolutions finer than the thus far default 10 km resolution, and hence the groundwater response is out-of-sync with observations. A lower correlation (expressed here as the coefficient of determination R 2 ) may thus result in high RRMSE values. To further analyse this, we categorized each water province according to its R 2 and RRMSE values ( Figure 9). Indeed, results suggest that R 2 is an overall good predictor of RRMSE: water provinces with a high R 2 tend to have low RRMSE values and vice versa, accounting for in total 66 % to 68 % of all provinces across all runs. More importantly is here that for the 1 km run we find the greatest median RRMSE value. Consequently, issues related 5 to correct scaling of the recession parameter may indeed contribute to the not-improvement at finer model resolutions.

Figure 9: Plot of R2 and RRMSE values per water province, categorized in four quartiles based on their median R2 and RRMSE values, respectively, for 1k_1k, 10k_10k, and 50k_50k runs (from left to right). Values in corners of
10 quartiles represent the fraction of data points in the respective quartiles.
Another example how the question whether using a parameter or equation can affect model accuracy is the accumulation of water storage over time, particularly in snow-dominated areas where the downscaled temperature at 1 km is mostly below the freezing point. Again, such accumulation decreases model accuracy of simulated TWS anomalies. In these areas, PCR-GLOBWB does not yet contain enough physical processes to re-distribute 15 snow to other 1 km cells (e.g., glaciers or avalanches). Indeed, correcting for snow cover of the 1k_1k run had a positive effect when evaluating TWS anomalies (Table 3) for two water provinces (not surprisingly located in and around the Alps, see Figure B5), strongly suggesting that the current approach to calculate snow cover and changes thereof may locally limit model accuracy for hyper-resolution runs.

Scale commensurability of simulated and observed data
Scale commensurability is an important aspect when comparing simulated and observed data, when (in)validating a model (Beven et al., 2022). In line with previous research (Gleeson et al., 2021) commensurability errors are 25 reduced for increasingly finer spatial resolutions, which translates to more accurate river networks and in turn less spatial aggregation, when evaluating simulated discharge.
For most of the other evaluated variables, however, results tend to be favourable for coarser model resolutions. A clear answer what causes these ambiguous results is currently, however, not possible as epistemic uncertainty is large: either the observations, the model schematizations derived at coarser resolutions from the default 1 km 30 model data, the mismatch in spatial resolutions between observations and model, or all of them may lead to mixed results and additional efforts need to be taken to reduce this uncertainty.
Currently, the difference between the finest resolution of the model (1 km) and of observational datasets that cover the entire European domain and have a sufficiently long record (25 to 50 km) is undoubtedly great. However, as only recent satellite missions, such as the ESA Sentinel mission or commercially-driven ones, can produce data 5 at the (sub)kilometre-scale, it will still take some time until the observed period is sufficiently long for robust long-term model evaluations. For example, surface soil moisture data collected by Sentinel1B started in 2015 only, whereas the here used ESA-CCI data covers the period 1978 to 2020. Similarly, MODIS16A ET data provides sub-kilometre scale evapotranspiration data starting in 2001. Even though this is already a longer track record, it would mean disregarding 20 years of simulations which may be too short for establishing a baseline 10 understanding of the effect of hyper-resolution.

Computing and data storage demand
Another important aspect of hyper-resolution hydrological modelling is the computational demand as well as the numerical scheme used.  As the underlying numerical scheme of PCR-GLOBWB has not been changed for computationally more demanding hyper-resolution simulations, it is clearly sub-par. Besides using ever more powerful highperformance computers (HPCs), additional efforts should be taken to improve model parallelization capacities as implemented in comparable models (Kollet and Maxwell, 2006;Kollet et al., 2010;Bierkens et al., 2015). 25 Evidently, the challenge of hyper-resolution hydrological modelling is not only one of data and parameters, but also one of model design which needs to be following most recent software and hardware developments, for instance distributed memory parallel modelling (Verkaik et al., 2021(Verkaik et al., , 2022, asynchronous many-tasks (de Jong et al., 2022(de Jong et al., , 2021 or running the simulations on GPUs (Shaw et al., 2021) or XPUs in general.
Once shorter run times are achieved, it can have direct positive effects on model results too, especially discharge. 30 With the thus far default model design, we had to employ the 'accuTravelTime' routing scheme, the largest simplification of the shallow water equations. This renders particularly daily discharge simulations potentially inaccurate and does not make full use of a hyper-resolution river network as relevant physical processes are not included. Once a speed up of the model is realized by considering one of the above-mentioned options, it could be feasible to employ more sophisticated versions of the shallow water equations. But also for other model 35 variables, an improved numerical scheme will be crucial when processes replace parameterization at 1 km resolution.

Conclusion and recommendations
In this study, we assessed challenges and opportunities of refining both model and forcing spatial resolution from 50 km to 10 km to hyper-resolution 1 km, aiming to increase our understanding of scale dependencies of both the 40 model and the meteorological input. Our work shows that too coarse forcing resolution can eradicate benefits of model hyper-resolution. Hence, if forcing is only available at coarse spatial resolution, applying a hyper-resolution model will not provide any performance gain and only increase computational demand. Furthermore, results show that model resolution is greatly defining overall output accuracy: if model skill at a given model resolution and coarse forcing resolution is already low (or high), employing finer forcing resolution will only yield limited improvement. Consequently, future work should be aimed at advancing both model and forcing data. For the 5 latter, the scale gap between currently achievable model resolution and available meteorological data sets needs to be closed.
Even though the here presented hyper-resolution model is not just the same old model with more grid cells but populated with as much high-resolution parameter data as possible, there is still room for model improvement.
Currently, the used parameterization design is not fully scalable from the current default of 10 km to 1 km. Clearly, 10 some parameters or sub-grid approaches should be replaced by actual data at when moving to hyper-resolution. As exemplified in section 4.1, model accuracy at 1km spatial resolution may be lower than necessary due to an unfit model parameterization. Considering the growing number of openly available datasets it is needed to carefully review model parameters and their values currently used. In some instances, it may be better to include additional physical processes to capture the immense natural heterogeneity which can be resembled at the 1 km 15 scale Beven and Cloke, 2012), while making sure that including these processes and fine-scale data result in consistent parameterizations across scales (Samaniego et al., 2010). It is only when model parameterization is improved and more processes are implemented at hyper-resolution, that the full potential of a hyper-resolution model grid can be employed. Then, we expect that model realism will improve and epistemic uncertainty will decrease, leading to overall improved accuracy. 20 While finer model resolution has a clear positive effect on simulated discharge, answering the question of scale commensurability when employing long-term observational datasets with coarse spatial resolutions remains challenging (see section 4.2). Following up the baseline study presented here, a second step in (in)validating the simulated results is recommended in which finer-resolution observational data should be employed to reduce scale-related commensurability uncertainties (Beven, 2018(Beven, , 2016. The availability of 1 km model output over 25 large extents now offers a great opportunity to make full use of the benefit of these novel data. In that context, the need to use water provinces to link across spatial resolutions could be lifted when merely assessing hyperresolution model output, allowing the model evaluation to be performed at the individual grid scale.
As discussed in section 4.3, performing hyper-resolution simulations is still pushing the boundaries of what is currently possible. Despite advances made in both software and hardware acceleration, access to high-30 performance computers with sufficient CPU, memory, and disk space remains vital for hyper-resolution simulations at the continental-scale, let alone for global application. Democratizing this access will therefore be crucial to enable researchers world-wide to advance hyper-resolution hydrological modelling. Additionally, the numerical scheme of any model attempting continental-scale hyper-resolution simulations needs to be on-par with the task at hand. This is not to say that all models need to be refactored entirely, but undoubtedly optimized code 35 and some form of advanced cyberinfrastructure  needs to be in placepossibly supported by research software engineers (Hut et al., 2017) to deal with the high storage and computational demand of not only running the simulations but also storing both input and output data as well as analysing results.
All in all, the here presented results should not be considered the final outcomes of a development towards global hyper-resolution hydrological modelling. For one, they are rather a first effort to map both opportunities and 40 challenges of continuous simulations at 1 km resolution over large extents, with lessons learned for iterative openended model evaluation and subsequent improvement of model parameterization and configuration (Gleeson et al., 2021;Bierkens et al., 2015). Since the study here is at the continental scale -yet with global model configurations and global data sets -, the findings made should be confirmed (or disregarded) in a global-scale study, ideally using high-resolution observation data sets for benchmarking. Now that a modelling and evaluation 45 workflow is established, the threshold for performing such a study is lowered. Since most hyper-resolution studies are done over data-rich regions, it will be revealing to see how hyper-resolution models perform over data-scarce areas. As such, the results here should be seen as first steps in the realm of hyper-resolution with many more to follow.
Despite the currently limited availability of observation dataset at matching spatial resolution and the still very demanding computations, we share the positive outlook of O'Neill et al. (2021) and are convinced that on-going data collection efforts, advanced satellite missions, and ever-powerful computers will eventually result in hyperresolution becoming the default in the foreseeable future, and in turn providing globally continuous yet locally relevant results. 5 Data and code availability. Model output was evaluated using the 'pcrglobwb_utils' package (Hoch, 2021). As the original model output requires a lot of disk space, only the output from the evaluation step is provided under https://doi.org/10.5281/zenodo.6390219. For the original output, please contact the authors.
Author contributions. JMH, EHS, and MB designed and supervised the experiment. EHS prepared the model data and performed the analyses with critical feedback from all co-authors. JMH developed the code for model 10 analysis and evaluated model output. JMH led the manuscript writing process with contributions from all coauthors.
Competing interests. The contact author has declared that neither he nor his co-authors have any competing interests.

A1. Appendix A
This part provides some brief description about the PCR-GLOBWB model input used in this study. For more extensive details, we refer to previous PCR-GLOBWB model studies, particularly the ones at 50 km resolution van Beek, 2008;van Beek and Bierkens, 2008) and 10 km resolution (Sutanudjaja et al., 5 2018) as well at 1 km resolution (Sutanudjaja et al., 2011).

Meteorological forcing
As the meteorological forcing input, PCR-GLOBWB requires daily spatial fields of precipitation and temperature, as well as reference potential evaporation. For this study, reference potential evaporation is calculated using the Penman-Monteith method following the FAO guidelines (Allen et al., 1998) that requires net radiation, wind 10 speed, and vapor pressure deficit (see van Beek (2008) for details).
For all of the aforementioned required meteorological fields, we downloaded the ERA5-Land dataset (Muñoz-Sabater et al., 2021), which are provided at 6 arc-minute and hourly resolution. We then resampled the ERA5-Land precipitation and temperature variables, to daily spatial fields at 5 arc-minute resolution so that they can be used for our 10 km model runs. We also resampled the required variables for the reference potential evaporation 15 calculation at 5 arc-minute resolution and used it as the reference potential evaporation input of our 10 km model runs.
For our 50 km model run, we upscaled or averaged the 5 arc-minute daily precipitation, temperature, and reference potential evaporation variables to their 30 daily arc-minute fields. For 1 km model runs, we downscaled the daily precipitation and temperature fields from 5 arc-minute to 30 arc-second using the monthly lapse rate fields from 20 Sutanudjaja et al. (2018). We refer to Sutanudjaja et al. (2011) for more details about the downscaling procedure. The downscaled 30 arc-second temperature fields were also used to calculate the reference potential evaporation fields based on the Hamon-method (Hamon, 1963) that requires only daily mean temperature as its input. We then used these Hamon 30 arc-second fields to downscale the 5 arc-minute fields of the Penman-Monteith reference potential evaporation to 30 arc-second resolution. 25

Land surface: soil, and cover, and topography
For soil parameterization, the maps from SoilGrids250 (Hengl et al., 2017) were used in this study. This is different than previous PCR-GLOBWB studies that used the Digital Soil Map of the World (DSMW; FAO, 2007). An obvious advantage using SoilGrids250 from is its sufficiently fine resolution at 0.002 arc-degree resolution (0.12 arcmin, about 250 m at the equator), while DSMW may not be suitable for model grids finer than 5 arc-30 minute resolution (see e.g. Batjes, 2012).
The PCR-GLOBWB model cannot directly use the soil information of SoilGrids250, which not only has a finer spatial resolution than 1 km PCR-GLOBWB, but also specifies only some general attributes such as soil texture.
Here we transformed these attributes of SoilGrids into soil hydraulic properties, such as water holding capacity, field capacity and wilting point, using the pedotransfer functions from Balland and Arp (2005) and Balland et al. 35 (2008). These pedotransfer functions allow the estimation of bulk density and related soil-hydraulic properties at any given depth, which is required to link the layer information from SoilGrids to the two-layer schematization in PCR-GLOBWB. We derived these soil properties at the spatial resolution of 0.002 arc-degree and subsequently upscaled and resampled them to the various coarser PCR-GLOBWB model resolutions used in this study, i.e. 30 arc-second (~1 km at the equator), 5 arc-minute (~10 km) and 30 arc-minute (~50 km). 40 For the standard parameterization of the land cover the following data sets were combined: the map of Global Land Cover Characteristics Database (GLCC) version 2.0 (Loveland et al., 2000) with the land cover classification following Olson (1994a, b) and the parameter sets from Hagemann et al. (1999) and Hagemann (2002). For the extent of irrigation areas, the map of Global Food Security-support Analysis Data (GFSAD) version 1.0 (Teluguntla et al., 2016) was used. The GLCC and GFSAD maps are available at 30 arc-second resolution (~1 45 km). Hence, for 1 km model runs, only one land cover type exists per cell. For the resolutions of 10 km and 50 km, land cover types were divided into four land cover types consisting of tall natural vegetation, short natural vegetation, non-paddy irrigated crops, and paddy irrigated crops (i.e. wet rice). Irrigation land cover types (i.e. paddy and non-paddy), including their crop calendars and growing season lengths, were parameterized based on the data set of MIRCA2000 (Portmann et al., 2010) and the Global Crop Water Model . Detailed description can be found in the key PCR-GLOBWB model description literature (Sutanudjaja et al., 5 2011(Sutanudjaja et al., 5 , 2018. For topographical related input, we made use of the state of the art Multi-Error-Removed Improved-Terrain Hydro digital elevation (MERIT Hydro DEM, Yamazaki et al. (2019)) that is available at 3 arc-second resolution (~90 m at the equator). This is different than the previous PCR-GLOBWB studies (Sutanudjaja et al., 2018;) that used previous generation DEM datasets, such as HydroSHEDS (Lehner et al., 2008) or GTOPO30 10 (Gesch et al., 1999). The 3 arc-second MERIT Hydro DEM was upscaled to the resolutions 30 arc-second (1 km), 5 arc-minute (10 km) and 30 arc-minute (50 km). Yet, we also used its original 3 arc-second resolution elevation values to obtain several sub-grid variability parameters that influences various schemes in PCR-GLOBWB, such as for runoff-infiltration partitioning, interflow, groundwater recharge and capillary rise, as well as evaporation processes (van Beek, 2008;van Beek and Bierkens, 2008;Hagemann and Gates, 2003;Todini, 1996). 15

Groundwater
In PCR-GLOBWB, groundwater discharge (also commonly known as groundwater baseflow) depends on a linear storage-outflow relationship, in which a groundwater recession coefficient field is calculated following the drainage theory of Kraijenhoff Van de Leur (1958) based on the drainage network density and aquifer properties. For the drainage density, we used the estimate from van Beek and Bierkens (2008). The aquifer properties were 20 estimated from the GLobal HYdrogeology MaPS (GLHYMPS) dataset of Gleeson et al. (2014). These datasets are reasonably sufficiently fine for modeling at 1 km resolution and can be upscaled to 10 km resolution and 50 km resolution.

Surface water routing: lakes, reservoirs and drainage/river network
PCR-GLOBWB also includes lakes and reservoirs that are taken from the Global Lakes and Wetlands Database 25 (GLWD) of Lehner and Döll (2004) and from the Global Reservoir and Dam Database (GRanD) of Lehner et al. (2011). Reservoirs are processed as follows: for every resolution, we rasterize GranD shapefiles containing information about reservoir surface areas and capacities. During the rasterization process we need account for many factors, such as reservoir locations to the drainage networks (at different resolutions), number of reservoirs within pixels. If a pixel contains more than one reservoir (which is very likely in coarser resolution), we merged 30 their surface areas and capacities and treated them as one reservoir. Consequently, the overall physical properties of reservoirs should not be overly different when moving from finer to coarser resolutions and thus their impact of flow estimates can be considered to be small.
For the drainage network maps, we made use of the existing 30 arc-minute and 5 arc-minute maps from Sutanudjaja et al. (2018), and the 30 arc-second one from the HydroSHEDS (Lehner et al., 2008). Note that the 35 MERIT Hydro, which we used as the source of DEM, provides a drainage/river network at 3 arc-second resolution only and, therefore, cannot be used for this study.

A2. Appendix B
Appendix B provides additional plots and analyses of model results.
In Figure B1, the improvement of KGE values per stations are mapped. To that end, it was assessed whether the KGE value for the 1k_1k run was higher than for either the 10k_10k run or the 50k_50k run, or higher for both runs, or not. Results indicate that KGE values improve at large, with only around 7 % of the stations showing no 5 improvement at all when refining model resolution. 10 Figure B2 until Figure B4 provide scatter plots of R 2 and RRMSE values obtained by evaluating various model variables with observational datasets at the water province level. While panel A compares the impact of different model resolutions (1 km, 10 km, 50 km), panels B and C compare the impact of different forcing resolutions: B) 1 km model resolution with 1 km forcing resolution against 1 km model resolution and 10 km and 50 km forcing resolution, respectively; C) 10 km model resolution with 10 km forcing resolution against 10 km model resolution 5 and 50 km forcing resolution,  5 Figure B5 depicts the categorization of each water province conditional on their R2 and RRMSE values. Four categories were defined: with low and high R2 values and low and high RRMSE values. The median R2 respectively RRMSE value was used to define 'low' and 'high' values. It is shown that there is not really a consistent spatial patterns where which category dominates. Additionally, this figure shows those provinces for which we analysed the impact of removing snow cover in the evaluation of TWS anomalies in section 4.1. 5 Figure B5: Categorization of water provinces according to their R2 and RRMSE values. Grey hatching refers to water provinces for which snow cover correction of TWS anomaly were effective (see section 3.4).