The Great Lakes Runoff Intercomparison Project Phase 4: the Great Lakes (GRIP-GL)

three FIELD_CAPACITY TOPSOIL, SNOW_SWI_MAX, MAX_MELT_FACTOR, using a x , x 14 , x ) SAT_WILT TOPSOIL, SNOW_SWI_MIN, MIN_MELT_FACTOR, respectively, to make sure that one is always than The baseﬂow coefﬁcients, BASEFLOW_COEFF TOPSOIL and PHREATIC, are derived from parameters x 4 and x 11 to allow a logarithmic sampling. This table is taken from Mai et al. (2020) (Appendix C C2 therein) and augmented by the weight generating parameters ( r 1 to r 8 ) that are used to describe the model structure.

S1 Comparison of gridded ERA5-Land SWE estimates and ground-truth SWE observations from CanSWE database The ERA5-Land SWE estimates (Muñoz Sabater, 2019) were compared with the SWE observations available in the version 2 of the CanSWE dataset . CanSWE combines historical manual (snow surveys) and automatic SWE measurements collected across Canada by different provincial and territorial agencies as well as hydro-power companies.
CanSWE is an evolving dataset. Please refer to the following repository to check for version updates of the dataset: https: 5 //10.5281/zenodo.4734371.
In this study, only manual SWE measurements were used to make sure that the snow data would be available over the period from 2000 to 2017. In Ontario and Quebec, manual SWE data are collected on a biweekly to monthly basis. The times series of ERA5-Land SWE estimates were extracted at the grid cells corresponding to the snow observation locations available in CanSWE. Only stations with at least 34 observations (i.e., on average at least two observations per year during the study period  ERA5-Land shows good performances (KGE larger than 0.65) at 57% of the stations (Fig. S1B). These stations are mostly located in the Ottawa River watershed, in the Lake Superior watershed and in the northern part of the Lake Huron watershed (Fig. S1A). These regions are characterized by the largest mean winter SWE in the Great Lakes regions (Fig. 2 in the main manuscript). More contrasted results, including stations with performances less than medium (KGE lower than 0.48), are 15 found in the Canadian portions of the Lake Ontario and Lake Erie watersheds. There regions are characterized by milder winter conditions than the northern of the Great Lakes region with frequent occurrences of winter precipitation near 0°C (Mekis et al., 2020) associated challenging conditions for snowpack models due to complex precipitation phase and midwinter melt resulting from rain-on snow events. Nonetheless, ERA5-Land remains a robust gridded SWE product in this region, especially when compared with other existing gridded SWE datasets such as the US SNODAS analyses characterized 20 by a strong underestimation of SWE in Ontario (King et al., 2020).
The scripts to derive the estimates presented here are available in the data repository (see A9_Scripts/compare_ERA5-swe_CanSWE) (Mai et al., 2022b).

S2 Visual depiction of the multi-objective multi-variable model analysis
The concept of non-dominance/ dominance in multi-objective analyses is used for the multi-objective multi-variate model 25 analysis (Sec. 2.9 main manuscript). This refers to the classic definition of that concept which is independent of the shape of the pareto front. Fig. S2 is provided as a visual explanation of what it means that a model dominates other models (panel A), a model is dominated by another model (panel B), a model that is not dominated by other models (panel C), the entire set of non-dominated models (panel D) forming the Pareto front (panel E). We visualized this in 2D picking two objective functions (here KGE regarding streamflow and AET) for demonstration purposes. In the study itself only 3D or 4D pareto fronts were 30 evaluated and reported on in Fig. 8 of the main manuscript. The 3D and 4D examples would, however, be harder to visualize intuitively. All these concepts can (mathematically) be applied for n-dimensional problems though.

S3 Additional information about participating models
In this section additional information will be provided about the participating models. Besides additional descriptions about model setup and calibration detail, a table of the parameters used for the model calibration is provided for each model. The

S3.1 LSTM-lumped
The final LSTM model is an ensemble of 10 models with different random seeds. All models share the same architecture of 256 hidden states and use an input sequence length of 365. That is, a single day of discharge is predicted from a sequence of the previous 365 input time steps. We optimized the model parameters with an Adam optimizer (Kingma and Ba, 2015) on an loss function that approximates the NSE (Kratzert et al., 2019) for 20 epochs at a learning rate of 5 × 10 −4 and another 10 epochs Figure S2. The concept of non-dominance and dominance of multi-objective problems. The example of a two-dimensional calibration problem is chosen. Both objectives (x-axis and y-axis) are assumed to be minimized and hence the "Utopia" point is located at the origin for simplicity. For demonstration purposes the first objective is chosen to be 1 − KGEQ where KGEQ is the model performance regarding streamflow and the second objective is set to be 1−KGEAET where KGEAET is the model performance regarding actual evapotranspiration. at a learning rate of 1 × 10 −4 To regularize the model, we further applied dropout to the LSTM output with probability of 0.4 (Hinton et al., 2012). We stabilized the training procedure by clipping gradients to a maximum norm of 1 and initializing the LSTM forget gate with a bias of 3 (Gers et al., 1999).
The LSTM ingests the following input variables to predict daily discharge: daily meteorological forcings:

50
total precipitation, minimum and maximum temperature, mean downward solar flux, specific humidity, surface pressure,

55
u/v components of wind, and 4 potential evapotranspiration calculated following the Priestley-Taylor equation from the FAO-56 guidelines (Allen et al., 1998), using minimum and maximum temperature, solar radiation, latitude, elevation, and day of year as inputs and assuming that ground heat flux is zero at daily time steps (see also Newman et al., 2015) static basin characteristics: 60 in total 30 static input variables were used as inputs in the LSTM setup (Tab. S2) climatologic characteristics were calculated based on the calibration period (2000 to 2010) static features are fed into the LSTM at every time step alongside the meteorological forcings (see: Kratzert et al., 2019) To ensure a model configuration that also works well in spatial validation, the LSTM hyperparameters (see Tab. S1) were tuned in a 5-fold cross-validation setting (Hastie et al., 2009) as follows: First, we randomly split the calibration basins into 65 five partitions and trained each hyperparameter configuration on all combinations of four partitions. We repeated this process with three different random seeds and tested each setup on the corresponding remaining fifth partition. Second, we calculated the average KGE across all seeds for each basin in the partition that was not trained on and then took the median across all these basins. Finally, given the best hyperparameter configuration, we trained an ensemble of ten models with different random seeds on the full calibration period of all calibration basins combined and averaged their predictions.

70
The model was set up using the NeuralHydrology Python library in version 0.9.11-beta3 (Kratzert et al., 2022). The configuration files and scripts to reproduce the LSTM setup and training are available in the data repository (see A9_Scripts/MachineLearning) (Mai et al., 2022b). Table S1: The hyperparameters calibrated for the LSTM-lumped model. The first three parameters are tested with the listed discrete values while the fourth parameter (epochs) was tested for a range of integer values in order to determine the best hyperparameter setting for this application. The best hyperparameter values found are highlighted in bold font. The parameter names are also the names of the NeuralHydrology configuration value. A brief description is given for each parameter.

Parameter
Parameter value/range Description  Temperate-or-sub-polar-needleleaf-forest Fraction of land covered by "Temperate-or-sub-polar-needleleaf-forest" Temperate-or-sub-polar-grassland Fraction of land covered by "Temperate-or-sub-polar-grassland" Temperate-or-sub-polar-shrubland Fraction of land covered by "Temperate-or-sub-polar-shrubland" Temperate-or-sub-polar-grassland Fraction of land covered by "Temperate-or-sub-polar-grassland" Mixed-Forest Fraction of land covered by "Mixed-Forest" Wetland Fraction of land covered by "Wetland" Cropland Fraction of land covered by "Cropland" Barren-Lands Fraction of land covered by "Barren-Lands" Urban-and-Built-up Fraction of land covered by "Urban-and-Built-up" Water Fraction of land covered by "Water" Continued on next page

S3.2 LBRM-CC-lumped
The Large Basin Runoff Model (LBRM) (Crowley II, 1983) was developed for applications to historical simulation and seasonal forecasting of the Great Lakes water balance. As such, prior to this study, it has primarily been evaluated for simulating whole-basin runoff into the lakes. It is a lumped conceptual model that simulates the propagation of rainfall to watershed outflow using a series of cascading tanks, with transfers between tanks represented through a mass balance coupled with linear 80 reservoir concepts. Calibrated model parameters include linear and partial linear reservoir coefficients to represent transfer between the cascading tanks, as well as a base temperature parameter that controls the potential evapotranspiration and an upper soil zone capacity parameter. Since the initial development of LBRM, updates have included the integration of an approach to approximate the Clausius-Clapeyron to mitigate the oversensitivity of evapotranspiration to temperature identified by Lofgren et al. (2011). This update, referred to as LBRM-CC-lumped in this manuscript, demonstrated by Lofgren and Rouhana (2016) 85 and implemented by Gronewold et al. (2017), was used for this study. LBRM-CC-lumped is currently in use as one of the models contributing guidance to the U.S. Army Corps of Engineers Detroit District's contribution to the internationally coordinated 6-month water level forecast (Fry et al., 2020). Tab. S3 provides a list of all the LBRM-CC-lumped model parameters which were optimized for this study.

S3.3 HYMOD2-lumped
HYMOD is a conceptual rainfall-runoff model introduced by Boyle et al. (2000). In this study, we use the modified lumped version of the HYMOD model, called HYMOD2. The model is referred to as "HYMOD2-lumped" in this study to follow the same naming convention for all models indicating whether models are lumped or distributed. The HYMOD2-lumped model is working in reliance on the probability distributed storage capacity concept proposed by Moore (1985), representing 95 the vertical soil moisture accounting process. HYMOD is originally designed as a lumped model utilizing Nash cascade for horizontal routing and a leaky linear reservoir to represent baseflow.
The modified HYMOD2-lumped version used in this study contains an improved parameterization for the evaporation process as proposed by Roy et al. (2017). Since snow is an important factor driving the water cycle in the basins considered in this model intercomparison study, we also included the Degree Day Snow model (Martinec, 1975) and rain-snow-partitioning 100 in the lumped version of HYMOD2. Potential evapotranspiration is derived from minimum and maximum temperature as well as the basin latitude using the Hargreaves-Samani method (Hargreaves and Samani, 1985).
The eleven parameters calibrated for this model are given in Tab. S4.  The GR4J model (Perrin et al., 2003) is a lumped water balance model that relates runoff to rainfall and evapotranspiration using daily data while the model contains two stores and originally has four parameters (X 1 to X 4 ). The GR4J parameter X 1 denotes the production store capacity, parameter X 2 is the inter-catchment exchange coefficient, X 3 is the routing store capacity, and X 4 is the unit hydrograph time constant. GR4J is used in concert with CemaNeige (Valéry et al., 2014) for the handling of snow. GR4J and CemaNeige are fully emulated within the Raven modeling framework  and 110 used as such in this study. The parameters, their ranges and where these parameters are located in a Raven setup is listed in Table S5. Please note that parameter x 5 for an estimate of annual average snow is usually not part of the CemaNeige model but was added here as Raven requires an estimate and none was at hand. Two additional parameters for rain-snow partitioning were added as only precipitation forcings instead of rain and snow forcings were available. Continued on next page

S3.5 HMETS-lumped
HMETS is a very simple, yet efficient, lumped conceptual hydrological model simulating all main hydrological processes including snow accumulation and snowmelt (Martel et al., 2017). HMETS us setup through the Raven hydrologic modeling framework which is able to fully emulate the original HMETS implementation. The parameters calibrated for HMETS are listed in Table S6.
120 Table S6: The parameters calibrated for the HMETS-lumped model. The parameters are uniformly distributed in the range given. The Raven table and parameter name can be used to locate the parameter in the Raven setup files. A two-layer soil model was used here. The TOPSOIL is the upper soil layer while PHREATIC is the lower soil layer. The two Raven parameters, SNOW_SWI_MAX and MAX_MELT_FACTOR, are derived using a sampled parameter (x 6 , x 10 ) and SNOW_SWI_MIN and MIN_MELT_FACTOR, respectively, to make sure that one parameter is always larger than the other.

Param. Range Unit Raven table
Parameter name

S3.6 Blended-lumped
The lumped Blended Model is a hydrologic model defined within the Raven hydrologic modeling framework . The Blended Model uses the weighted average of several chosen process implementations for key processes to simulate streamflow instead of using one single parametrization per process.

125
As an example, a (simplified) blended model could be defined as: where D 1 and D 2 could be, for instance, be two options for one process. For example, deriving infiltration could be performed once using the infiltration definition of HMETS (D 1 ) and once derived as defined in the HBV model (D 2 ). The infiltration outputs D 1 and D 2 are then weighted using w d1 and w d2 to derive the infiltration estimate Raven will use for the remainder of the simulation. A detailed description of all processes and process options as well as a model flowchart can be 135 found in the Supplementary Material of Mai et al. (2020) and in the Raven documentation .
The Blended Model was introduced by Mai et al. (2020), analysed in calibration mode by Chlumsky et al. (2021), and deployed across North America by Mai et al. (2022a). The exact same model setup was employed here. It uses three different options M i for the infiltration process, three options N i for quickflow, two options O i for evaporation, two options P i for baseflow, and three options Q i for snow balance. All other processes, i.e., convolution of surface runoff R 1 and delayed runoff 140 S 1 , potential melt T 1 , percolation U 1 , rain-snow partitioning V 1 , and precipitation correction W 1 , are used with one fixed process option. The remaining processes also have only one option, but none of them contains tunable parameters. They are merged to a "remaining" process X 1 , which will never appear in the sensitivity analysis because it is constant. When the first option of each of the processes M 1 , , and X 1 , is chosen and parameter x 35 is set to zero, the Raven setting emulates the HMETS model (Martel et al., 2017) perfectly. All other combinations are unnamed 145 models.
Details of process options and parameters can be found in Appendix C (Tables C1 and C2) in Mai et al. (2020). The list of parameters is added to the Supplementary Material herein (Table S7) for the convenience of the readers. Infiltration: Quickflow:  Baseflow: Snow balance: Potential melt: Percolation: x  The Blended-Raven model is the same as the Blended-lumped model setup using the common subbasin discretization provided through the common routing product. The parameters calibrated are the ones calibrated for the Blended-lumped model (Table S7 augmented by the routing specific parameters listed in Table S8. All parameters x 1 to x 35 and r 1 to r 8 used for Blended-lumped (Tab. S7) Routing: In GRIP-GL, the VIC and Raven routing model are coupled to simulate rainfall-runoff processes in the Great Lakes region.
The VIC Image Driver of version 5.1.0 is used, and Raven is employed as a routing module for VIC. It should be noted that the drainage from the bottom soil layer in VIC is named as "baseflow" (Gao et al., 2010); however, this is a misnomer that implies water directly enters the streams without being delayed in storage (Li et al., 2015). Thus, we use the term "recharge" instead of "baseflow" to represent this flux component in this context, which is consistent with the terminology used in Raven (Craig 160 et al., 2020;Craig, 2022).
In GRIP-GL, VIC model is built at the RDRS-v2 forcing grid-cells with a resolution of 15 km by 15 km. VIC requires DEM, soil, and land cover data for its parameterization. The input data for VIC are the sub-daily meteorological drivers from the RDRS-v2 forcing data set, i.e., precipitation, air temperature, atmospheric pressure, incoming shortwave radiation, incoming longwave radiation, vapor pressure and wind speed. The VIC-generated (quick) runoff and recharge fluxes at the grid-cell scale 165 are firstly aggregated into the sub-watershed scale, and then routed to the catchment outlet in terms of in-catchment routing and river channel routing processes.
Raven routing module consists of in-catchment routing and channel routing processes. VIC-generated runoff is firstly simulated by an in-catchment routing process, which employs a Gamma unit hydrograph to represent the time delay between runoff and reaching the stream. It then employs channel routing using the diffusive wave method to simulate flood wave propagation 170 through the reach. Baseflow is simulated using a linear storage model to delay the recharge component. If catchments contain lakes, the lake release is solved by a two-parameter lake-type reservoir model in Raven. The routing network for the Raven routing module is produced by BasinMaker (Han, 2021;Han et al., 2021), and inputs for routing are subbasin-averaged runoff and baseflow generated by VIC.
VIC has nine parameters for calibration and Raven routing module has five parameters for catchments with lakes and four 175 parameters for catchments without lakes for calibration (Tab. S9). These parameters are calibrated simultaneously using DDS optimization algorithm (i.e., the VIC and Raven are coupled in calibration, and the outputs of VIC are used as inputs for Raven routing). The KGE is employed as the calibration objective. The optimization is repeated for 20 trials, each with 1,000 calibration iterations, and the best results out of 20 are reported.

S3.9 SWAT-Raven
SWAT simulated vertical flux (total runoff) were fed to the lake and river routing product (Han et al., 2020), integrated into the Raven modelling framework , to route the both the vertical flux. The integrated SWAT-Raven model was then calibrated by calibrating SWAT-related and Raven-related parameters simultaneously within the given range as shown in Tab. S10. WATFLOOD simulated gridded vertical fluxes (runoff and recharge to lower zone storage) were fed to the lake and river routing product (Han et al., 2020), integrated into the Raven modelling framework , to route the both vertical fluxes.

185
The integrated WATFLOOD-Raven model was then calibrated by using WATFLOOD-related and Raven-related parameters 190 simultaneously within the given range as shown in Tab. S11.

S3.11 MESH-CLASS-Raven
The integrated MESH-CLASS-Raven model was then calibrated by using MESH-CLASS-related and Raven-related parameters simultaneously within the given range as shown in Tab. S12.

S3.12 MESH-SVS-Raven
The MESH-SVS-Raven model was calibrated by using MESH-SVS-related and Raven-related parameters simultaneously within the given range as shown in Tab. S13. Most of the calibrated parameters correspond to coefficients that are used to multiply the actual corresponding Model parameter values. This is done in order to limit the number of free parameters during actual parameter values. One other advantage of such an approach is that none of the MESH-SVS-Raven free parameters used in calibration are tied to a specific calibration or validation basin, but are applied to the whole region at once. Therefore, no spatial parameter transfer is required for the validation basins, as long as they are included in the region of interest. Among the MESH-SVS-Raven parameters, only the "RICK", "FLZCOEFF", "PWRC", "GASH" and "GASC" parameters (see Tab. S13) correspond to actual model parameter values, but here they correspond in the model to values that are fixed over the whole 210 region where the model is applied.
In order to use different parameter values for the agricultural and more natural areas (like grassland, forests, etc.), two different multipliers were used both for the horizontal and vertical model conductivities, as well as for the evapotranspiration resistance multipliers (see Tab. S13). This was done because of the artificial tile drains that are often installed in agricultural areas of the Great-Lakes basin (see Valayamkunnath et al., 2020). These tile drains allow to prevent the crops from being 215 flooded by quickly removing water that infiltrated into the soil column, but lead to significantly higher peak flows at the outlet of agricultural basins, in comparison to their natural regime. Because these tile drains are not explicitly represented in MESH-SVS-Raven, their effect on simulated flows was represented by allowing hydraulic conductivity values to be significantly increased in the model, as suggested by De Schepper et al. (2015). This is why higher ranges were allowed for the hydraulic conductivity multiplier values associated with agricultural areas, compared to the natural areas (see Tab. S13). However, since 220 MESH-SVS-Raven cannot simulate processes separately for these two types of areas inside a grid-cell (SVS soil parameters are valid over the whole grid cell), the actual multiplier value used to adjust SVS horizontal or vertical hydraulic conductivity consists of a weighted average of the agricultural and "natural" multiplier values, based on the fraction of the grid-cell covered with agricultural cover, and the rest of the pixel that is assumed as being "natural", in the sense that it is assumed not to contain any tile drains. The same was done for the evapotranspiration resistance multipliers, in order to give SVS some flexibility by 225 using different adjustments in agricultural and other (natural) areas, for these parameters.
Finally, because we rely here on multiplicative coefficients to adjust the actual parameter values, some restrictions had to be imposed on the final (after adjustment) actual model parameter values, to ensure that they could not go beyond values with physical meaning. For example, the Albedo final values have a maximum of 1, the 50% root depth values have to stay below the 95% root depth values (minus 5cm to ensure numerical stability), and the three soil moisture thresholds, namely the wilting 230 point, the field capacity and the saturation were respectively constrained below fractions of 0.5, 0.75 and 0.95. Because they were adjusted all together by the same multiplier, their relative order remained unchanged after adjustment and did not need to be constrained.  more CPUs, but probably not to a significant degree, since the six regions used here do not contain a high number of 10 km grid points (the maximum is 4455 grid-points for the Lake Superior region, while the minimum is 2132 for the Ottawa River region). Therefore, it is very challenging to calibrate GEM-Hydro-Watroute directly already because of its surface component.
Regarding the routing component of GEM-Hydro-Watroute (Watroute), it is also much slower than the Raven routing scheme. First, Watroute is a grid-based model that has a 1 km resolution in GEM-Hydro-Watroute, and that is not parallelized.

255
In the ECCC National Surface and River Prediction System (NSRPS; Durnford et al., 2021)  Watroute also implies a pre-processing phase of GEM-Surf outputs, which can be even more time-consuming than the model runtime itself. When including Watroute's pre-processing time, Raven was about 130 times faster than Watroute, for the basin 265 described above. However, when comparing Watroute to Raven runtimes on the Lake Huron region of this GRIP-GL project, Raven was able to perform the 10-y calibration period in 130s, while it took about 55h to do the same for Watroute, which represents a ratio of about 1500. It is true that in this case, Watroute was running over the full Lake Huron watershed, while Raven was only simulating the calibration and validation basins, which in total, represent about half of the Lake Huron region.
Moreover, during these 55h of the total Watroute runtime, about 76% was spent on the pre-processing, highlighting where 270 the emphasis should be put in order to decrease Watroute computation time. It was noticed that Watroute pre-processing time could be improved significantly, following some previous tests. But even if considering only Watroute true model computation time over half of Lake Huron region, Raven would still be about 200 times faster than Watroute in this case. The difference with the ratio of 30 mentioned above probably comes from the fact that the common Raven setup used for GRIP-GL does not correspond to the most possible detailed version of the model, because lakes smaller than 5 km 2 were removed from the setup, are represented with SVS directly, assuming a fixed ratio of 33% imperviousness for urban cover, based on Gaborit et al. (2013). But because of this, and in order to obtain GEM-Surf and MESH-SVS simulations that are as close as possible to favor parameter transfer between the two, the subgrid-scale lakes were neglected in GEM-Surf, which means that any grid-cell having a fraction of water lower than 100% was in fact assumed to contain 100% of land instead. This resulted in almost 100% of the calibration/validation basins studied here to be covered at 100% by land in both MESH-SVS and GEM-Surf. In terms 320 of streamflow impact, neglecting these subgrid-scale lakes in GEM-Surf only has a minor effect on GEM-Hydro-Watroute simulations across the Great Lakes, probably because of the relatively small role that lake evaporation plays in the overall water balance of the terrestrial watersheds of the Great Lakes (i.e., with the exception of the five Great Lakes themselves).
Moreover, since land evapotranspiration rates are generally similar to overlake evaporation in this region, replacing water surfaces with land surfaces does not have a strong impact on the resulting streamflow simulations in this region (not shown 325 here).
However, strong differences were noticed for some flow gauges, between the simulations obtained with the default versions of MESH-SVS-Raven and GEM-Hydro-Watroute. For some gauges, these differences were partly due to the fact that the Raven basin delineation was different than one used for Watroute. However, strong delineation differences between the two models was quite rare and only affected a few of the total 212 calibration/validation basins used in this project. Of course, differences 330 can arise from the different routing models themselves, because they might not represent the same river and lake network, and because they might not represent the same processes: for example, Raven uses a Gamma Unit Hydrograph convolution approach in order to route the surface runoff generated by the LSS, into the closest stream. In Watroute, surface runoff is assumed to enter the stream of the current grid-cell instantaneously, and the fact that Watroute assumes that a stream is present in every grid-cell can, to some degree, represent an approximation of the transport of surface/subsurface runoff between its 335 source and an actual stream. However, these routing model differences should only affect streamflow timing, and not the general streamflow bias, for basins having close delineation boundaries between the two routing schemes. This affirmation is especially true given that subgrid-scale lakes were neglected in the MESH-SVS and GEM-Surf surface simulations of this project (see above). Therefore, evaporation losses from lakes and rivers were not taken into account in both the Raven and Watroute simulations performed here, because such losses, when coupling Watroute or Raven to 340 are included in the aggregated surface runoff of a surface grid cell. Surface runoff over land can however not be negative with SVS, because the evapotranspiration losses are already removed from the SVS vegetation and soil storage compartments. And the main difference that can arise between Raven and Watroute simulations, when forced by the same surface fluxes and in terms of general flow bias, is due to the treatment of evaporation in the routing scheme. Therefore, in this case, because no evaporation loss is provided to either routing scheme, and because they both close their water balance almost perfectly, the 345 routing schemes can not be responsible for strong differences regarding their flow simulations' biases (again, assuming that they have similar delineations for a given watershed, which is generally the case here).
Regarding MESH-SVS and GEM-Surf simulations, differences existing between the two are well known and were already noticed during the GRIP-E project (see Mai et al., 2021). However, these differences were generally small and both models still produced very similar flow simulations in this case. After investigation, it was found that there was an issue regarding the 350 reading of vegetation cover fractions from the geophysical file, in MESH-SVS-Raven. As a consequence, MESH-SVS did not use the correct vegetation cover during GRIP-GL, and seems to have assigned the same vegetation fractions for all points inside a region, based on one of the points of the region. This issue did not occur during GRIP-E, because it is related to the new NetCDF format used in GRIP-GL for the MESH geophysical input file. This is a strong problem for the approach used here, because SVS vegetation classes are associated to specific vegetation parameters. Therefore, the calibrated parameters related to 355 vegetation properties (like LAI, stomatal resistance, root dephts, etc; see Tab. S13) that were obtained with MESH-SVS-Raven were not optimal for GEM-Hydro-Watroute.
This probably explains the strong flow bias differences between the default versions of the two models, for some regions, as well as most of the strong loss of performance when transferring the SVS and some routing parameters that were calibrated with MESH-SVS-Raven, into GEM-Hydro-Watroute. But despite this issue, the approach employed here to calibrate GEM-360 Hydro-Watroute seems promising, in view of the results obtained in this study. The fact that the parameters calibrated with MESH-SVS-Raven, when transferred into GEM-Hydro-Watroute, could still generally improve upon the default version of the model, is probably due to the fact the calibrated SVS parameters related to soil processes, for example hydraulic conductivity, were more appropriate than the default values used in GEM-Hydro, especially for agricultural areas with tile drains, that are not represented explicitly in the model, yet. However, work is under way to fix the issue mentioned above, as well as revising 365 the strategy for some calibration parameters, and restart all MESH-SVS-Raven calibrations, GEM-Hydro simulations, and the auxiliary variables' evaluation, in order to come up with a calibration strategy for GEM-Hydro-Watroute, that would allow to improve or maintain the performances regarding surface fluxes and auxiliary variables, in view of being able to couple a GEM-Hydro-Watroute model that was calibrated by targeting streamflow performances, with an atmospheric model, which is the long-term goal of this work.  Table S15 contains the list of the 212 gauging stations including their area, region they are located in, objectives they were selected for, and whether they were used for calibration or validation.  Ottawa River (OTT), and Lake Superior (SUP). The horizontal black lines separate the machine-learning based global LSTM model from the models that are calibrated locally and the models that are calibrated per region. The performance is quantified using the component r of the Kling-Gupta efficiency (KGEr) which determines the Pearson correlation between the simulations and observations. The median KGEr performance of each model for each of the four evaluation scenarios is added as labels to each panel. The thresholds for medium, good, and excellent performance classifications are added as labels to the colorbar. The performance regarding KGE can be found in Fig. 3 in the main manuscript. For the spatial distribution of these results as well as the simulated and observed hydrographs please refer to the website (www.hydrohub.org/grip-gl/maps_streamflow.html).   Continued on next page Continued on next page

S6 Common set of donor basins
The seven locally calibrated models (LBRM-CC-lumped, HYMOD2-lumped, GR4J-lumped, HMETS-lumped, Blended-lumped, Blended-Raven, and VIC-Raven) needed to specify a so-called donor basin for each of the 71 validation locations. The donor basin is a basin that was calibrated. The calibrated parameter set of the donor basin is then used to run the model deriving 385 streamflow at the validation location. All models used the mapping of donor basins given in Table S16. The mapping approach is explained in the main manuscript (Sect. 2.4).