Signature and sensitivity-based comparison of conceptual and process oriented models, GR4H, MARINE and SMASH, on French Mediterranean flash floods

The improvement of flood forecast ability of models is a key issue in hydrology, particularly in Mediterranean catchments that are subjected to strong convective events. This contribution compared models of different complexities, lumped GR4H, continuous SMASH and process-oriented MARINE. The objective was to understand how they simulate catchment’s hydrological behavior, the differences in terms of their simulated discharge, the soil moisture, and how these can help to improve the relevance of the models. The study was applied on two Mediterranean catchments in the South of France. The 5 methodology involved global sensitivity analysis, investigations of the response surface, calibration and validation, signature comparison at event scale, and comparison of soil moisture simulated with respect to the outputs of the surface model, SIM. The results revealed contrasted and catchment specific parameter sensitivity to the same efficiency measure and equifinality issues are highlighted via response surface plots. Higher sensitivity is found for all models to transfer parameters on the Gardon and for production parameters on the Ardeche. The exchange parameter controlling a non conservative flow component of GR4H 10 is found to be sensitive. All models had good calibration efficiencies, with MARINE having the highest, and GR4H being more robust in validation. At the event scale, indices of discharge showed that, the event based MARINE was better in reproducing the peak and its timing. It is followed by SMASH, while GR4H was the least in this aspect. SMASH performed relatively better in the volume of water exported and is followed by GR4H. Regarding the soil moisture simulated by the three models and using the outputs of the operational surface model SIM as the benchmark, MARINE emerged as the most accurate in 15 terms of both the dynamics and the amplitude. GR4H followed closely while SMASH was the least in comparison. This study paves the way for extended model hypothesis and calibration-regionalization methods testing and intercomparison in the light of multi-sourced signatures in order to assess/discriminate internal model behaviours. It highlights in particular the need for future investigations on combinations of vertical and lateral flow components, including groundwater exchanges, in distributed hydrological models along with new optimization methods for optimally exploiting, at the regional scale, multi-source datasets 20 composed of both physiographic data and hydrological signatures. 1 https://doi.org/10.5194/hess-2021-414 Preprint. Discussion started: 27 October 2021 c © Author(s) 2021. CC BY 4.0 License.


Hydrological Models
Three hydrological models of varying complexities are used for this study: GR4H (lumped and conceptual), SMASH (spatiallydistributed and conceptual), MARINE (process-oriented and spatially distributed). GR4H and SMASH are continuous models whereas MARINE is event based but its state is initialized with the outputs of the SIM operational surface model. All models represent a limited number of hydrological processes and some of their flow operators share similarities as analyzed later in 90 this study. Note that GR4H is the only model in this study with a "non conservative" flow operator. This section presents the formulation of all the models and their flow operators are detailed in appendix A This section also presents the calibration algorithm of each model, used to optimize their parameters in order to reduce the discrepancy between simulated and observed discharges at a catchment outlet. The objective function used for calibration is the classical NSE efficiency (given in section 3.4) that is adequate for the present flood modeling context. For all models, 95 considering J = 1 − N SE, the parameter calibration inverse problem reads: where the cost function J depends on the sought model parameters θ through hydrological model response. For each model, bound constrains are applied on the sought parameters using the same ranges as in sensitivity analysis (cf. section 3).
We consider a 2D-spatial domain Ω (catchment) covered by a regular rectangular grid of resolution ∆x. The unique constrain 100 applied to this lattice is that a unique point has the highest drainage area, that is catchment outlet, given flow directions. The time is denoted t > 0. The spatio-temporal rainfall and evaporation fields are respectively P and E, stepwise approximations over time steps ∆t are assumed.

GR4H model
The GR4H model (Mathevet, 2005) is a lumped continuous model that runs at the hourly time step and based on the GR4J 105 model formulation of Perrin et al. (2003). This model has been used in many studies such as flash flood modeling in four the percolation is routed linearly using a unit hydrograph UH2 of time base 2x 4 , the remaining 90% is initially routed using UH1 of time base x 4 and then using a non linear routing store of reference capacity x 3 . The ordinates of the UH are derived from their respective S hydrographs which also are functions of x 4 . A groundwater flow exchange term F from the reservoir which depends on both the actual level in the routing store R, the reference level of the non-linear routing store x 3 and a water exchange coefficient x 2 is taken into account in both flow components. Finally, the total stream flow Q is obtained as the sum 120 of the resulting flows from the routing reservoir Q r and the output of the UH2 Q d .  Table 2. Prior Information used to define parameter masks for SMASH parameters. The soil classes are defined from the soil texture using the Rawls and Brakensiek (1983) relations, from which the ks,and Sf are obtained. Only the first 4 (resp. three) parameters cp, ctr, v, ks (resp. cp, ctr, v) are calibrated as a result of the sensitivity analysis. The surface runoff is divided into overland flow and drainage flow, in both cases, the kinematic wave model is used assuming 175 a 1-dimensional kinematic wave which is approximated with the Manning friction law while the subsurface flow is based on the Darcy's law. The model formulations are given in Appendix A1.
Input data are sourced from information of surface topology, soil survey, vegetation and land use, and the model is initialized using soil moisture outputs of the SIM model. Finally, the model requires only five parameters to be calibrated for the whole catchment; three correction coefficients applied to the distributed maps of saturated hydraulic conductivity C k , the soil thick-180 ness C z , and the soil lateral transmissions C kss , the other two parameters include Manning-Strickler's friction coefficient for the river bed K D1 and for the flood plain K D2 . These correction coefficients are applied during the calibration process such that the absolute values of the parameter in question is modified while the spatial pattern as sourced is preserved. The use of physiographic maps to define spatial patterns is detailed after.
Calibration of the MARINE model is done by comparing the simulated and observed discharge with NSE as the objective function. The optimization algorithm in the case of this model is based on a gradient-based descent algorithm BFGS  Fletcher-Goldfarb-Shanmo) from multiple starting points (Roux et al., 2011). The gradient is evaluated by finite differences.
The calibration involves estimating for a given event the values of the three correction coefficients applied to the distributed maps of saturated hydraulic conductivity C k , the soil thickness C z , and the soil lateral transmissions C kss . The other two parameters include Manning-Strickler's friction coefficient for the river bed K D1 and for the flood plain K D2 .

195
SIM, acronym for SAFRAN-ISBA -MODCOU (Habets et al., 2008), is an operational modeling chain that simulates both flow of water and energy at the surface, as well as the flow of rivers and the major aquifers. It is forced by the atmospheric reanalysis from SAFRAN, uses ISBA to simulate the exchange of water and energy between the soil and atmosphere; and MODCOU as the hydrological model.
The two versions of the SIM model are used in the present study, that is SIM1 and SIM2. The first version SIM1 is used  2020) ), while the second version, SIM2 is used as the benchmark to compare the simulated soil moisture outputs of the SMASH, GR4H, and the MARINE model.
The first version, SIM1, uses the force-restore version of ISBA, ISBA-3L (Noilhan and Mahfouf, 1996;Noilhan and Planton, 1989) in which the soil is discretized into three layers corresponding to surface, root and deep zone. SIM2 on the other hand uses 205 the diffusive version of ISBA, ISBA-DIF (Decharme et al., 2011), with a vertical soil column discretization into a maximum of 14 layers. In the case of this study, the humidity of the root zone is considered as the sum of the humidities of the layers between 10 cm and 30 cm deep.
The two outputs (SIM1 and SIM2) available for this study are at a daily time step (06 UTC) and a spatial resolution of 8 km square grid.

Catchments
The study catchments (Gardon at Anduze and Ardeche at Vogue) are located in the Cevennes region, prone to flash flood and are influenced by a Mediterranean climate. Data types and sources are described in the next paragraph.
The Ardeche catchment at Vogue drains an area of 622 km 2 , and is exposed to intense precipitation events due to the 215 convection of humid sea air masses over the Cevennes mountain slopes (Eeckman et al., 2020). It presents a mixed geology, with metamorphic rocks and schist on the upper part of the catchment, and sedimentary plains downstream. The land cover is mainly mixed forest, natural grasslands and shrubs. The elevation varies between 1530 m at the upstream to 150 m downstream.
The depth of the soil in the catchment ranges between as low as 5 cm to as deep as 50 cm with an average depth of 28 cm. The soil texture is mainly sandy-loam with silt deposits downstream. The mean saturated hydrological conductivity is around 8.6 mm/hr.
The Gardon with its outlet at Anduze drains an area of 540 km 2 . It is well gauged and has a Mediterranean climate with a lot of intense rainfalls in the autumn and winter. It is characterized by the occurrence of flash floods and the highest rainfall rates in autumn, while the summer is mostly hot and dry (see Roux et al. (2011)). The catchment geology is mainly dominated by fractured metamorphic formation, classically the schistose, however there are some karstic zones around the junction of Saint

225
Jean and Mialet (Le Xuan et al., 2006). It has a highly marked topography consisting of high mountain peaks, narrow valleys and steep hill slopes. The vegetation is dense and composed mainly of beech, chestnut trees, holm oaks and conifers (Moussa, 2010). The elevation varies between 129 m at Anduze to 1202 m at the highest point. The average slope of the basin is about 20%, but can be up to 50% at the upstream. The soil (made of by silty-clay loam and sandy loam) has a mean thickness of around 28 cm and a mean saturated hydraulic conductivity 5 mm/hr.

Data
To provide a fair assessment of the models, the same input of discharge, rainfall and for the specific case of the continuous models (SMASH and GR4H), potential evapotransipiration (PET), are used. The hourly discharge have been extracted from the HYDRO database of the french ministry in charge of environment while the rainfall data from the radar observation reanalysis ANTILOPE J+1, which merges radar and insitu gauge observations, is provided by Météo-France. The interannual temperature 235 data is provided by the SAFRAN reanalysis and then used to calculate the potential evapotranspiration using the Oudin formula  Oudin et al. (2005). The rainfall and PET are at a spatial resolution of 1 km 2 square grid, and processed into hourly time step.
Spatial averages of the rainfall and PET over the catchments are used as input for the lumped GR4H model. The soil thickness and texture maps are derived from the surveys provided by the INRA and BRGM. Soil classes and consequently the suction, porosity and saturated conductivity are derived from the soil texture using the Rawls and Brakensiek (1983)

Numerical Experiments Methodology
Sequel to the motivations and the objectives of the study as outlined in the introduction, the following numerical experiments are designed to help answer the questions raised. The first experiment is designed to investigate the global sensitivity of the three models. Experiments for the calibration and validation of the models within the study period are then followed, and finally the methodology to compare the model performance at the event scale is described. The evaluation criteria used are also briefly 250 described.

Regionalized Sensitivity Analysis
The sensitivity analysis of the model parameters is based on the regionalized sensitivity analysis (RSA) approach (see Appendix 3.1). The aim is to identify the parameters to which the model is most sensitive to. In this section, we describe how the sensitivity analysis methodology is applied to each model. First, in the case of the GR4H model which is lumped, 10,000 simulation runs are done using randomly (uniform) generated parameters of the model within their range (see Table 1). The parameters are x 1 , x 2 , x 3 and x 4 . For the current and subsequent experiments, a threshold of 0.7 NSE (eq. 1) is used for the classification of the runs in to behavorial (runs with NSE ≥ 0.7) and non-behavioral (NSE < 0.7) groups. As noted by Beven (2001), the KS test can be very sensitive to small differences, and will thus report significant differences between the two classes. Hence, the magnitude of the KS statistics D, representing the 260 maximum difference between the cdf of the two classes is used to rank the parameters based on their sensitivity.

255
Secondly, in the case of the SMASH model, which is a fully distributed model at a spatial grid of 1 km 2 , classical reduction of the high dimensional control space is adopted using physiographic masks (as described in section 2.1.2). The sensitivity analysis is then performed on the resulting parameters. The two reduction approaches are described below: 1. In the first instant, the parameters of the model are taken as being spatially uniform, and therefore, the RSA is done 265 assuming one parameter set at a time for the whole catchment considered. The four parameters are: c p , c tr , v, and k s . The sampling of the parameters within their ranges specified in Table 3 and assuming uniform distribution is done for 10,000 Montecarlo simulation runs. The classical RSA described in section B is followed. The same threshold of N SE = 0.7 is used for the classification of the runs into behavioral and non-behavioral groups. The KS test statistics is then calculated to rank the parameters' sensitivity.

270
2. In the second approach, the dimension reduction is through the use of parameter regionalization approach, defined by prior information on the parameters spatial distribution from basin predictors. The predictors include soil texture, thickness, land cover and topography. Pedotransfer functions (eg. Rawl and Brakensiek) are then used to relate the model parameters and the basin predictors, and then an upscaling operator (arithmetic mean) is used to upscale from the predictor scale to the modeling scale (1 km 2 ). The soil thickness map is used for the production reservoir capacity c p , the 275 map of saturated conductivity for the Green and Ampt saturated conductivity k s , the map of the flow accumulation for the routing parameter v and the map of the slopes for the transfer parameter c tr . Afterwards, for each parameter, uniform random values within the specified range (see Table 3) is generated for each class. For example, if for a catchment, the classes of the soil thickness are five, then five random values are generated, each one for a class, yielding a total of five generated values per run for the c p parameter (the soil thickness being the predictor of c p ). The five values will 280 be mapped to their grid location in the parameter map. 10,000 simulations are then run and the corresponding NSE is calculated. In presenting the scatter plots (for identifiability investigations), average of the mask (raster values) weighted by the class percentage is shown since it is not possible to show the whole raster on the plot.
Lastly, the sensitivity of the five MARINE model parameters (see Table 4) is investigated using the NSE criteria. The eight events for Gardon and seven events for Ardeche are used for the analysis (see Table 5). 5000 runs were conducted as 285 previously done by Roux et al. (2011) and Boithias et al. (2017) for each event. The same threshold value of 0.7 is used and a classification is done for each event. Unlike the case of Boithias et al. (2017) and the references reported therein, where the result of the sensitivity analysis was used to choose calibration/validation events, our methodology here is basically to investigate the parameter sensitivity. The method for the choice of the calibration/validation events is described in section 3.2.

290
For each of the three hydrological models GR4H, SMASH and MARINE, parameters calibration is performed with their dedicated methods presented in section 2.1. Those methods enable adequate calibrations for each model as shown after. In order to perform fair comparisons, considering comparable amount of hydrological information learnt by the models in the calibration phase, the calibration and validation is done using a split-sample test procedure by dividing the data into two (Klemeš, 1986). Time series of 13 years at hourly time step is considered and the sub-periods of 7 years each for the calibration 295 and validation are defined as period 1 (1st August 2006 to 1st August 2013) while period 2 (1st August 2012 to 1st August 2019). Calibration is done first using period 1 and then validation on period 2, the reverse is then done in which period 2 is taken as calibration period while period 1 is taken for validation. For each calibration period, 1 year is used as the warm up period to initialize the model which is adequate for hydrological models as reported by Kim et al. (2018). In the case of MARINE, the events (see Table 5) are classified into two periods (similar to the continuous models) and a multi-event calibration and cross 300 validation is done. Similar multi-events calibration of MARINE has been carried out by Garambois et al. (2013). For all the calibrations, we used the NSE as the objective function.

Comparison of models at event scale
These experiments are designed to compare the three models at flash flood modeling. The inter-comparison involves the assessment of key indices of peak flow estimation as well as its timing and the internal fluxes simulated. In addition to the 305 NSE criteria, the percentage peak difference (PPD), peak delay (PD) as well as synchronous percentage of the peak discharge (SPPD) introduced in section 3.4 are used.
The "soil moisture" simulated by the model are also compared with outputs of the SIM model. While SIM outputs are at a spatial resolution of 8 km, those of SMASH and MARINE are at 1 and 0.5 km respectively. In the case of GR4 that is lumped, the resolution is at the scale of the catchment size. We compare the soil moisture by looking at the temporal evolution of the 310 spatially averaged outputs of each model, and how close it is to those of the SIM2 outputs, which in our case is the reference benchmark.
Specific flood events of return period higher than 2 years are chosen within the period of 13 years (2006-2019) for both catchments, They are given in Table 5. These selected events provided distinct characteristics in terms of the flood peak magnitudes, the volume of water exported, the number of peaks, the gradients of the rising and falling limbs as well as the 315 spatial and temporal patterns of the underlying precipitation events.
In order to provide a fair comparison, the same rainfall forcing and discharge data are used for all the models.

Performance Evaluation Criteria
In the course of all the calibration and the validation of the hydrological models used, the objective function used for the calibration is the widely used Nash and Sutcliffe efficiency criterion: which puts more weights on the high flows than on low 320 flows, and is adapted to our objective of assessing the ability of the model to simulate flash floods.
whereQ o is the mean of observed discharges, Q s(i) and Q o(i) are simulated and observed discharges at time step i respectively.
In the case of inter-model performance evaluation between the SMASH, GR4H and MARINE at event scale, other criteria 325 are used. They include: -The Kling-Gupta Efficiency (KGE) (Gupta et al., 2009) which provides an alternative to the NSE and gives balance to the correlation, flow variability and water balance.
, the Pearson correlation coefficient, evaluates the error in shape and timing between observed (Q o ) and 330 simulated (Q s ) flows, cov is the co-variance between observation and simulation and σ is the standard deviation, β = µs µo , evaluates the bias between observed and simulated flows where µ is the mean. α = σs σo , the ratio between the simulated and observed standard deviations, evaluates the flow variability error.
-Percentage Peak Difference: This criteria is given as P P D = Qp;sim Q p;obs and is mainly to judge the percentage of the observed peak predicted by the model, the duo must not coincide in time of occurrence.

335
-Peak Delay (PD): Given as t p;sim − t p;obs and simply computes the difference in time or delay between the simulated and observed peak -A more rigorous criteria in terms of safety is the synchronous percentage of the peak discharge (SPPD) that accounts for the ratio of the estimated discharge and observed discharge at the time of the observed peak discharge. It has been used first by Artigue et al. (2012) and then subsequently by Jay-Allemand et al. (2020)  Finally we also use as a metric, the runoff coefficient (CR).

Results and Discussion
The results obtained after conducting the numerical experiments described in section 3 are presented here, along side relevant discussions.
The calibration and validation efficiencies as well as the event signatures are also presented and discussed. Finally, the 345 comparison of the simulated soil moisture, as compared to the gridded outputs of SIM model are presented and discussed.

Sensitivity Analysis
The results obtained from the regionalized sensitivity analysis of the three models are presented in this section. Table 6. Sensitivity ranks of the SMASH model parameters (left) and GR4 (right) computed according to the Kolmogorov-Smirnov test statistics, D, accounting for the maximum distance between the behavioral and non-behavioral distributions. (1 is the most sensitive, 4 is the least sensitive). In the case of SMASH, the result obtained through dimension reduction using spatially uniform and masked parameters are  for the maximum distance between the behavioral and non-behavioral distributions.
(1 is the most sensitive, 5 is the least sensitive)  Figure 3 gives the results of the sensitivity analysis under spatially uniform parameter sets. In the case of the Gardon catchment, the scatter plot (first row) shows clear identifiability for the transfer parameter c tr . The two production parameters c p and k s shows the least identifiability, while the routing parameter v shows exclusive poor performance for small values. Under our tested methodology, peaky scatter plots for a parameter indicates a good identifiability. The scatter plots in the case of the Ardeche catchment shows a drop in performance for values of c p higher than 1200, below this value, both good and poor 355 performances can be obtained. In the case of the k s parameter, the scatter plot shows clear non-identifiability due to clear randomness throughout the parameter range. The transfer parameter c tr appears to be peaky for this catchment also. Finally, similar to Gardon, the routing parameter v shows significant drop in performance for small values.
The cumulative distribution of the behavioral and non-behavioral classes (second row) are based on the NSE threshold of 0.7. In the case of the Gardon catchment, c p exhibits flat slope for small (< 125) and high (> 1750) values with near uniform 360 distribution in between, while the distribution of the non-behavorial classes is uniform showing that poor NSE can be obtained throughout the parameter range. In the case of the c tr parameters, which is also the most sensitive, the slope is non-zero only within very small range (between 200 and 400), outside this range, all realizations are poor. Relatively flat slope is observed within this range for the non-behavioral realizations confirming the absence of poor realizations within the range. The KS statistics D is largest for c tr confirming that it is the most sensitive. For the case of Ardeche, although the scatter plot shows 365 that c tr is most identifiable due to its peakedness, the test statistics shows v to be the most sensitive and closely followed by c p . However, k s still remains the least sensitive.
NSE which gives more weight to high values. In the SMASH model, c tr controls the amount of the effective rainfall that is transferred for routing and thus affects the magnitude and timing of the peak flows.

Masked Parameters
The second RSA investigated is under the reduction of the control space through the use of masks due to high dimensionality resulting from the fully distributed nature of the model. As described in the methodology under section 3.1, the results are shown for the average of the mask (raster values) weighted by the class percentage as it is not possible to show the whole raster on the plots. The scatter plots and the cdf shown in the first and second rows of Figure 4 respectively have the same 375 interpretation as discussed and presented in the preceding paragraph 4.1.1. Peaky plots indicates identifiability and the larger the difference between the behavioral and non-behavioral classes, the more sensitive the parameter is. However, in this case there are some differences in the sensitivity of the model parameters. In the case of the Gardon catchment, c tr remained the most sensitive, but the routing parameter became the least sensitive. For Ardeche, v which hitherto was the most sensitive according to the KS test under uniform configuration, became the least sensitive, parameter while c p is the most sensitive.

380
The differences between behavioural and non-behavioural distributions are less pronounced than with the uniform strategy, possibly because the results shown are the average of all the raster values.

GR4
In the case of the GR4H model, the RSA results for both catchments are presented in Figure 5. For both catchments, the time base of the unit hydrograph x 4 is the least sensitive, while the ground water coefficient x 2 is the most sensitive. For the Gardon 385 catchment specifically, the size of the production reservoir x 1 is less sensitive compared to the exchange coefficient x 2 and the routing store capacity x 3 , whereas in the case of the Ardeche catchment, the sensitivity of x 1 is very close to that of x 2 , the capacity of the routing store x 3 beeing the third most sensitive.

MARINE
The result of the sensitivity analysis of the MARINE model for both catchments is presented in Figure 6 and the summary 390 of the parameter sensitivity ranks computed according to the KS test statistics D is shown in Table 7. The ranking of the parameters is event dependent for each of the two catchments. In the case of the Gardon, the coefficient applied to the lateral subsurface flow, C kss emerged as the most sensitive for all the events except the Nov 2011 flood. It is then followed by the coefficient applied to the soil thickness, C z . In other words, the three most sensitive parameters are related to the soil storage capacity. The two Manning-Strickler's friction coefficients for the river bed K D1 and the flood plain K D2 emerged as the least 395 sensitive in the ranking. In the case of the Ardeche catchment, different sensitivity ranks of the parameters are obtained. For this catchment, the correction coefficient C k of the hydraulic conductivity (infiltration) emerged as the most sensitive, which is then followed by C z . Unlike the case of the Gardon, C kss , along with K D1 are the least sensitive.
The flood events in the Gardon are all autumn events, however the October 2014 flood appeared entirely different in terms of the distribution of the behavioral realizations, because very few observations above the NSE threshold of 0.7 are obtained for 400 this specific event. Ardeche on the other hand has two events occuring in spring, while the rest are autumnal. There is however no significant observable difference between the distributions of these events.  For each catchment, the first row shows the scatter plot of the NSE efficiency; second row, the NSE cumulative distribution of the behavorial and non behavoiral classes indicating the Kolmogorov-Smirnov statistics D. Note: for each parameter, the point that is shown in the parameter space is the average of the mask (raster values) weighted by the class percentage for that specific run.   The fact that, these points lie within the hills/peaks, shows that the algorithms are able to locate 'sufficient minima', although   The result of the calibration of the SMASH model parameters is given in Table 8 for the two study catchments.    The class by class (mask) calibration efficiencies for the two periods vary for the two catchments but both are more than 0.7.

Calibration and validation
The resulting temporal validation efficiencies are also relatively high. Ardeche presents better calibration/validation efficiencies than the Gardon catchments. The maps resulting from the calibration are given in Figure 10 for both periods (P1 and P2) and pattern between the two periods rather than from the calibration algorithm.

GR4H
In the case of the calibration of the GR4H model on the two catchments, the parameters and efficiencies obtained both in calibration and validation are shown in Table 9. All calibration and validation efficiencies are higher than 0.7. In the case of Ardeche, there is relative stability/robustness in the calibration and validation efficiencies. The groundwater exchange coeffi-440 cient x 2 are positive in both calibration periods for the Ardeche (export), while they are negative (import) in the case of Gardon.
According to this model, positive values show water import, while positive values indicate water export.

MARINE
The calibration of this event-based model followed the procedure described in the methodology, the events were divided into period 1 and period 2 events and each group was calibrated together by minimizing the cost function of 1 − N SE using the 445 events. The resulting global efficiencies are presented in Table 10 for both catchments. Event specific NSE (not shown here) has an average of 0.87 and 0.78 for the Gardon events of period 1 and period 2 respectively. Subsequently cross validation was done using the calibrated parameters from each group. This is to ensure the same calibration validation approach is done for the three models. During the calibration of the model and for each set, at least 10 starting points in the parameter sets were tested in order to ensure that the optimization is not stuck to local optima within the response surface.

450
The period 1 and period 2 events of the Gardon catchment resulted in very similar values, except the C z parameter which is almost twice in period 1 compared to period 2. For the Ardeche, higher calibration efficiencies are obtained compared to the Gardon, although the parameters between the two periods are dissimilar.
Validation efficiencies in terms of the Nash are presented in Table 11 for both catchments. The efficiencies are event dependent. For Gardon, NSE as high as 0.91 is obtained and as low as 0.09 , with the average of 0.58 for the eight (8)

Comparison in calibration and validation
Considering the two continous models, SMASH and GR4H, the global efficiencies in time obtained in calibration are similar.
However, they are slightly lower than SMASH for the Ardeche catchment, while they are higher for the Gardon catchment.
The temporal validation efficiencies are however catchment dependent, they are higher in period 1 for GR4H compared to SMASH, while being lower in period 2. But summing up on both periods and on both catchments, MARINE has its efficiency 465 in validation decreased by around 25%, while SMASH and GR4H have a decrease of 5.2% and 4.8% respectively. With regards to the obtained parameters, the same decrease of the production reservoir capacity x 1 period two for the Gardon is obtained as observed with SMASH, Ardeche however presents a decrease of x 1 in the second period compared to the observed increase with SMASH. Both models however resulted in larger x 1 for the Gardon compared to Ardeche (although the difference is much larger with SMASH). The routing reservoir x 3 is however larger in the case of Ardeche. The ground

Comparison at event scale
In this section, the event scale performance of the models is compared. This is done through the signatures of the simulated discharge and the simulated soil moisture. While the simulated hydrographs are compared with the observed hydrographs through the computed metrics, the soil moisture is compared to the outputs of the SIM2 model.  Figure 12 compares the simulated discharges with the three models against the observed discharges for Gardon (left) and Ardeche (right). The performance of all the models seems to be fair, and the superiority of the models depends on the event.
It is however difficult to judge objectively from the figures of the hydrographs. In order to judge objectively, different metrics have been computed and are shown for both catchments in Figure 13 and 14. The performance of the models is therefore 490 judged and discussed according to these metrics in the following paragraphs.
First, looking at Figure 13 in terms of NSE, the superiority in performance of the models over one another is quite event On average however, the bias is the same for GR4H an MARINE.

510
Other indicators to objectively compare the models are shown in Figure 14 and given in Table C2 and Table C4 for Gardon and Ardeche respectively. In terms of the percentage difference in peak magnitude, PPD, MARINE model approximates the observed peak better than the other models for most of the events in the two catchments. The difference in the timing of the observed and simulated peak is also less observed with MARINE simulations, SMASH on average has less differences compared to GR4H. The percentage difference between the observed and simulated peak at the time of the observed peak 515 measured by the SSPD criteria indicates more accurate simulations with MARINE. SMASH is yet, more accurate than GR4H based on this criteria. This criteria is relevant because, it is important to know not only the difference between the observed and simulated peak, but also what peak is simulated at the time the observed peak occurs. Lastly, the runoff coefficient (CR), measures the ratio of the total flow over the total precipitation. SMASH gives the closest CR to the observations for most of the events in the two catchments compared to the other models, it is also the closest to the observations in terms of the average 520 of the CR for both catchments. GR4H closely follows, while MARINE is the least of the two models for both catchments.
Inferring from the results, the event based MARINE has better performance with regards to the peak simulation and timing, followed by SMASH. However, in terms of the volume of water exported and water balance, SMASH performed relatively better and is followed by GR4H.
Although both SMASH and GR4H models used the same conceptual production reservoir thickness, the production reservoir 525 in SMASH (used in this study) is filled according to the Green and Ampt infiltration function; (infiltration rate equals the rainfall intensity provided ponding doesn't occurs, when it does, the infiltration excess is transferred). GR4H on the other hand is based on the saturation mechanism in which rainfall excess occurs only after saturation. This, in addition to the distributed nature of SMASH, could partly explain why SMASH outperformed GR4H in terms of the indices of peak magnitude and timing.
This is despite the fact that GR4H, by construction, has more complexity in terms of processes represented and formulations 530 used, including a non conservative exchange term (parameter x 2 ) (see A1). MARINE, apart from the physical basis, processes represented, and complexities in the formulations, is simply calibrated over flood events only. The continuous models are however, calibrated on all the flows (both low and high) and would therefore perform better in terms of the volume of the flood.

"Soil moisture" comparison
The spatially averaged time series of the soil moisture predicted by the three models is shown in Figure 15. In case of the two 535 distributed models, SMASH and MARINE, the spatial average over the area of the catchment at the hourly temporal scale is shown. The spatial average of the soil moisture outputs of the two SIM products, SIM1 and SIM2 are also shown. In the case of SIM1, which is used for initialization of the MARINE model, the single value per event (spatial average) corresponding to the beginning of the event is shown, while for SIM2, which is used for comparison, the daily series (available for this study) is shown at 06:00hr of every day for the event duration.

540
First, the soil moisture output of SIM1 (shown at the beginning of every event) is always lower in amplitude compared to the output of SIM2. While the former discretizes the soil into three layers, the middle layer corresponding to the root zone, the later discretizes into 14 layers, the layers between 10-20 cm corresponding to the root zone.
Using the SIM2 series as benchmark for comparing the three models, MARINE performed best in terms of both the dynamics and amplitude of the soil moisture in both catchments. It is closely followed by GR4H model, while SMASH is the least in 545 comparison to the other models. To assess the goodness of fit between soil moisture series of the three models in comparison to that of the SIM2 (shown in Figure 15), Figure 16 summarizes the root mean square error (RMSE) on the eight(seven) events of Gardon(Ardeche), shown on the left and right of the figure respectively. For both catchments, MARINE is the most accurate (lowest RMSE), followed by GR4H (looking at the median). In the case of Ardeche (right), the 0.75 quantile is lower than the 0.25 quantile of the other two models.

550
Looking at the SMASH model, we see that in the case of Gardon catchment, the series remained relatively flat, and the response between rainfall events is very week. Better response are however observed in the case of Ardeche compared to Gardon. This could be possibly explained by the size of the calibrated production reservoir capacity c p of the two catchments.
Relatively large capacities of c p (1500 and 1200 mm for period 1 and 2 respectively ) for Gardon against (164 and 200mm) for Ardeche are obtained. The depletion of the smaller capacity production reservoirs after or between rainfall events would be 555 faster compared to the larger ones. Interestingly, GR4H calibration resulted in much smaller c p for Gardon (480 and 230mm for period 1 and 2 respectively) compared to SMASH.
The difference in performance in the soil moisture outputs could be explained by the complexities and processes represented in each of the models. MARINE, in addition to the surface flows (overland and in the channels), subsurface lateral transfers are represented using an approximation of the Darcy's law. Therefore, although evaporation is deemed negligible at the event 560 scale, thus not represented, the lateral flows contributes to the emptying of the soil reservoir and hence the faster and sharper decline between and after rainfall events. In addition to this, being a physical model, soil surveys are used as the basis for the soil depths (corrected by a multiplicative factor c z ). This makes the process and soil moisture variation potentially closer to the real physical phenomena, unlike in the other two models in which the depths are fully conceptual -and more or less free to vary in space.

565
Although both SMASH and GR4H are emptied by the same evaporation function (see Equation A2), GR4H soil reservoir is also emptied by a percolation leakage. This percolation leakage although weak, given the power law involved, is an added complexity in the model that might have resulted in the faster response between rainfall events compared to SMASH. The process of soil emptying of the SMASH (distributed) model is thus more likely to be weaker than that of GR4H (lumped).
In the case of Gardon, the soil saturation of SMASH is generally lower than GR4H for most of the events. This is likely 570 due to the size of the respective production reservoirs (1500mm for SMASH and 500mm for GR4H). Apparently, for the same rainfall signal, the soil moisture will be higher in the smaller sized reservoir. An emphasis of this can be seen in Ardeche catchment where SMASH soil moisture are higher for all the events. Interestingly, the production reservoir depth for this catchment is 160mm for SMASH and 300mm for GR4. Hence, SMASH saturation are higher (due to smaller capacity). The optimized reservoir depth from the model calibration therefore affects the accuracy of the soil moisture estimation.

575
The controlability of the models also is different, although all the three models use the outlet discharge as the variable of interest in the calibration, MARINE model has constraints on its parameters using field data (soil survey and vegetation and land use) both in terms of their spatial distributions and their magnitude, although the later is corrected using some coefficients during calibration. The production reservoir is constrained by the soil thickness map, the Green and Ampt parameter (porosity, hydraulic conductivity and suction) are all constrained using the soil classes derived from the soil texture. The subsurface trans-580 fer have also been constrained by the soil classes and finally the Manning friction in the kinematic wave routing formulation for overland flow by the land cover. This gives MARINE more constraints in its parameters thereby having parameters with some level of physical meanings rather than being simply artifacts of the calibration algorithm. The fact that SMASH uses the same maps during calibration doesn't offer as much constrains as in MARINE model. In fact, the use of the maps is only to reduce the high dimensionality resulting from fully distributed calibration. The constraints are thus applied only on the spatial 585 pattern rather than on their magnitudes as done with MARINE. Again, even the choice of the field data eg (soil surveys of thickness and texture) to use for the constraint on the spatial pattern of SMASH parameters is not as clear as that of MARINE, since the parameters of the later have some physical meanings compared to more conceptual nature of the SMASH parameters.
The least constrain applied in terms of spatial pattern is thus on the GR4H model which is lumped, and thus rely solely on the outlet discharge in the optimization process. Lastly, MARINE is also constrained using information from the SIM1 soil 590 moisture output for its initialization.
To investigate the temporal evolution of the soil saturation, Figure 17   maximum discharge for that event is considered, for simplicity, two considerations are made. First, the time of the observed peak is assumed to correspond to the time of maximum soil saturation. This becomes necessary as it is difficult to anticipate before hand, or to track the exact time of maximum saturation, given that 10,000 runs are made for a total period of 13 years with different parameters for each run, and the event duration is a very small fraction of that time. Secondly, the evolution of the available storage is considered between two time steps, the beginning of the event and the time of the maximum observed 610 peak. Knowing that the model is spatially distributed, spatial average of the moisture saturation is considered for the analysis.
This is coherent since the model parameters for this experiment are taken as spatially uniform. The events considered for this study are the same events described in Table 5.

625
The aim of the study was to understand how three models of varying complexities simulate the hydrological behavior of two flash flood Mediterranean catchments; Gardon at Anduze and Ardeche at Vogue, both located in the South of France.
The methodology involved the investigation of global parameter sensitivity of the models, their efficiencies in calibration and validation, and the assessment of key hydrological signatures at the event scale. Finally the soil moisutre simulated by the All the three models showed good calibration and validation efficiencies. Their performances were however, generally better 645 on Ardeche compared to Gardon. In calibration, MARINE achieved the highest efficiency and is followed by GR4H. Although all the three models showed decrease in efficiencies at temporal validation, GR4H was more robust. Regarding the parameter stability between the two periods, all the models showed some differences between the calibrated parameters of both periods.
At the event scale, seven events and eight events of contrasted behaviors on Ardeche and Gardon respectively were selected to compare the performance of the three study models on the simulated discharge and the soil moisture pattern. Indices of 650 discharge simulation showed that, the event based MARINE has better performance with regards to the peak simulation and timing, and is followed by SMASH. However, in terms of the volume of water exported and water balance, SMASH performed relatively better and is followed by GR4H.
Using the soil moisture output of the SIM2 model as benchmark for comparing the simulated moisture by the three models at the event scale, MARINE emerged as the most accurate in-terms of both the dynamics and amplitude of the soil moisture 655 in both catchments (recall MARINE soil water content is initialized with SIM1). It is closely followed by GR4H model, while SMASH is the least compared to the other models. The SIM2 product from SIM model revealed to be a valuable information to assess internal dynamics of model states.
The difference in the model performances could stem from differences in the levels of complexity of the models, the processes described and the constrains of the models, and thus highlights the need for future improvements in the models and 660 calibration methods. These include improvements on the components of vertical and lateral flow of the MARINE model, as well as those of the SMASH platform along with its calibration and assimilation algorithms. In addition to this, improved constraints on the patterns and magnitudes of SMASH parameters, including those of the Green Initially proposed for a minimal complexity description of catchment water balance functioning, based on empirical modeling, the "GR loss model" Edijatno and Michel (1989) considers a production reservoir P of maximum depth c p and water level h p and is recalled here for clarity. The neutralized rainfall and evaporation are respectively denoted P n and E n . If 675 P ≥ E then P n = (P − E) and E n = 0 and dh p = 1 − hp cp 2 dP n . If P < E then E n = E − P and P n = 0 and dh p = − hp cp 2 − hp cp dE n . Assuming a stepwise approximation of the inputs P (t) and E(t) the temporal integration of these ordinary differential equations, enabling analytical solutions (calculation given in Edijatno (1991)), as reported by Perrin et al. (2003) gives the infiltrating rainfall P p and the actual evapotranspiration from the reservoir store E p : As remarked in Jay-Allemand et al. (2020), h p is the water level of the production reservoir at the begining of a time step ∆t and P p and E p are the amount of water gained or lost over ∆t and used to update h p before the next time step.
This is the water balance scheme of GR4 where the state h p and parameter c p are respectively denoted S and x 1 .

A1.2 Green and Ampt infiltration
Applying Darcy's law, Green and Ampt (1911) proposed a simplified physical model for water infiltration from a ponded surface into a deep homogeneous soil with uniform water content. The Green and Ampt model approximates the curved soil moisture profiles of the wetting front that result in practice, and from solution to Richard's equations, as a sharp interface with saturation conditions θ = θ s above the wetting front, and initial moisture content θ = θ i below the wetting front. The initial 690 moisture content is assumed to be uniform over the entire depth. The infiltration i(t) writes as: Where r(t) is the rainfall rate (m/s), t p is the time to ponding (s), K s is the saturated hydraulic conductivity (m/s), θ is the change in the volumetric water content (m/m), ψ is the soil suction and I(t) is the cumulative infiltration depth (m).
This model is used in MARINE event-based model (Roux et al., 2011).

695
It is also implemented in SMASH, following the algorithm presented in (Chow et al., 1998) involving a classical Newton-Raphson algortihm to solve θ from non linear Green and Ampt integrated in time (Mein and Larson, 1973), and with parameters explained in 3. Hence, the production reservoir P of maximum capacity c p (and porosity is simply set to η = 1) is filled by the infiltrating rainfall obtained form Equation A3, and is emptied by the actual evaporation E p obtained from Equation A2.

A1.3 Transfer
The Transfer function is represented by a reservoir of capacity c tr and actual level h tr , and models the fast flow, it is supplied by the excess flow after production step (GR evaporation A2, infiltration A3). The time evolution of the actual reservoirs levels thanks to the mass conservation gives the flow rate q r from the fast reservoir at each time step such that: where h tr0 is the reservoir levels at the beginning of the time step.

A1.4 Routing
Given known flow directions, classically obtained from DEM, the cell to cell routing is done with a linear unit Gaussian hydrograph whose delay τ i from node i − 1 to node i is controlled by the routing velocity v i and the distance d i (see details in Jay-Allemand et al. (2020)).

A2.1 Production
The water balance is modeled with a production reservoir as described in section (A1.1) with equations A1 and A2, denoting the state S and parameter x 1 instead of respectively h p and c p .

715
A groundwater flow exchange term F from the routing reservoir which depends on both the actual level in the store R, the reference level x 3 and a water exchange coefficient x 2 is taking into account both flow components

A2.3 Linear Routing
The 10% of the effective rainfall P r resulting from the excess of the production and the percolation is routed linearly using a 720 unit hydrograph UH2 of time base 2x 4 , the remaining 90% is initially routed using UH1 of time base x 4 .The ordinates of the UH are derived from their respective S hydrographs which also are functions of x 4 A2.4 Non-linear routing Total stream flow is given by

A3.3 Surface flow
The surface runoff is divided into overland flow and drainage flow, in both cases, the kinematic wave model is used assuming a 1-dimensional kinematic wave which is approximated with the Manning friction law. The equation is thus: where h is water depth (m), t is time (s), x is space variable (m), r is rainfall rate (ms −1 ), i is infiltration rate (ms −1 ), S 0 stands for bed slope (mm −1 ) and n o is the Manning friction parameter (m 3 /m −3 ) .

Appendix B: Regionalized Sensitivity Analysis
Sensitivity analysis in hydrological modeling is defined as the investigation of the response function that links the variation in the model outputs to changes in the input variables and/or parameters (e.g. review bySong et al. (2015)). It allows the 745 determination of the relative contributions of different uncertainty sources to the variation in outputs using qualitative or quantitative approaches under a given set of assumptions and objectives.
A model can be sensitive to a parameter in two ways: a) the uncertainty in the parameter is propagated throughout the model thereby contributing in the overall model uncertainty. b) small change in the parameter results in significant change in the output because of the high correlation between the output and the parameter.

750
A common method of global sensitivity analysis is the regionalized sensitivity analysis (RSA). This is also called generalized sensitivity analysis and it is based on Monte Carlo simulations. Parameter values are taken here from uniform distributions within chosen ranges and then Monte Carlo simulations are run using the parameter sets and based on a defined goodness of fit criteria (GOF), the results are then classified as behavioral or non-behavioral based on a chosen threshold of the GOF criteria.
For each parameter, the difference between the cumulative distribution of the behavioral and non-behavioral sets is determined 755 using a quantitative Kolmogorov-Smirnov (KS) statistic 3. A significant difference means the parameter is sensitive, and the larger the difference, the more sensitive the parameter is. This method has been widely used in hydrological modeling and is easy to implement (Song et al., 2015), however the choice of the GOF criteria and threshold is highly subjective.
KS (x i ) = max y F y (y) − F y|xi (y) This approach to sensitivity is used in the course of the present study to investigate the sensitivity of the parameters of the 760 three models on the study catchments.
Appendix C: Tables of events comparisons