On constraining a lumped hydrological model with both piezometry and streamflow: results of a large sample evaluation

. The role of aquifers in the seasonal and multiyear dynamics of streamflow is undisputed: in many temperate catchments, aquifers store water during the wet periods and release it all year long, making a major contribution to low flows. The complexity of groundwater modelling has long prevented surface hydrological modellers from including groundwater level data, especially in lumped conceptual rainfall–runoff models. In this article, we investigate whether using groundwater level data in the daily GR6J model, through a composite calibration framework, can improve the performance of streamflow 5 simulation. We tested the new calibration process on 107 French catchments. Our results show that these additional data are superfluous if we look only at model performance for streamflow simulation. However, parameter stability is improved and the model shows a surprising ability to simulate groundwater levels with a satisfying performance, in a wide variety of hydrogeological and hydroclimatic contexts. Finally, we make several recommendations regarding the model calibration process to be used, according to the hydrogeological context of the modelled catchment.

and needs fewer data than a fully distributed one, allowing for many experiments simulating anthropogenic influence, but it is far from straightforward to implement on any catchment with few data.
These modelling schemes have shown noteworthy simulation abilities for both aquifers and streamflow. However, they have not been tested on large sets of catchments in various contexts to value their robustness and generalisation capacity. Moreover, groundwater simulation is, in most hydrological modelling studies, a side product of rainfall-runoff modelling. There is little 100 evidence on how the addition of groundwater data can actually help obtain a better streamflow simulation.
1.3 How are measured data used in hydrological modelling?
Most hydrological models are parametric and their parameters are calibrated using measured streamflow data (Roche et al., 2012). To find the best set of parameters with which to reproduce the streamflow time series, a calibration criterion, which is a function of measured and simulated -or forecasted -streamflow, is optimised, the most common one being the Nashby the modeller is necessary; Jian et al. (2017) used, in a catchment where only few streamflow measurements were available, river level data and added three new parameters to a hydrological model to simulate the rating curve. Whereas in-field mea-surements are not always common, satellite data are broadly available around the world and numerous studies have used them in hydrological models: Immerzeel and Droogers (2008) used satellite evaporation to calibrate the SWAT distributed model through a composite criterion and got a closer representation of actual evaporation and less equifinality in parameter deter-130 mination; Mostafaie et al. (2018) performed a multi-objective calibration using NSE for streamflow and total water storage from GRACE satellite data; Milzow et al. (2011) combined several satellite datasets -surface soil moisture, radar altimetry and total water storage -to calibrate a semi-distributed model in a catchment with few streamflow measurements through a Using other data than streamflow is less straightforward in empirical or conceptual models that do not explicitly simulate physical fluxes or states. A particular state of the model is generally linked to the available physical variable. In catchments 140 affected by snow and/or glaciers, related data -i.e. snow depth or glacier thickness -can be used in model calibration (Riboust et al., 2018;Tiel et al., 2020). Beyond calibration, extra data can be assimilated into the model to correct its trajectory during runtime; several studies showed an improved performance of hydrological models with assimilation of soil moisture (Aubert et al., 2003a, b;Oudin et al., 2003) or snowpack (Thirel et al., 2013).
As far as piezometry is concerned, distributed hydrological models are rarely calibrated using piezometry time series. Most 145 gridded models have a physical parametrisation: parameter values are, directly or indirectly, inferred from local properties measured in situ (Moreda et al., 2006) -for instance, topography, soil types, vegetation or geological properties. At a pinch, the parameter set can be adjusted, with a limited variation margin adapted to the physicalness of parameters, to better represent streamflow; but given the often large number of parameters to be adjusted, distributed models cannot be fully calibrated without suffering from equifinality (Beven, 1993). In these conditions, several studies underlined the possibility of calibrating 150 a distributed hydrological model using both piezometry and streamflow, with semi-automatic (Feyen et al., 2000;El-Nasr et al., 2005;Li et al., 2017) or automatic multi-objective calibration procedures (Khu et al., 2008). Lumped conceptual models, with a reduced number of parameters, are easier to calibrate directly without prior determination of the parameters. When a particular state of the model, in general a groundwater reservoir, can be coerced to a measured piezometry time series, calibration using both piezometry and streamflow is possible, generally through a linear composite objective function combining criteria on 155 streamflow and piezometry (Thiéry, 1988;Moore and Bell, 2002;Széles et al., 2020). Despite significant improvements in piezometry simulation, these studies found that adding piezometric information to the calibration process did not significantly impact streamflow simulation.

Scope of the paper
In view of the undisputed role of aquifers in low-flow dynamics in many catchments, it seems reasonable to try to improve the 160 performance of a hydrological model by adding piezometric data to the calibration process. However, most of the approaches reviewed in the previous section are difficult to implement due to a relatively large number of parameters and because the performance gain offered by the new data has not been assessed on a large set of catchments, which is necessary for model evaluation (Barthel and Banzhaf, 2015).
In this study, we aim to develop a new modelling approach based on a simple structure with an easy parametrisation, 165 assessed on a large sample of catchments to ensure the generality of conclusions. We propose an adaptation of the structure of the conceptual daily rainfall-runoff model GR6J (Pushpalatha et al., 2011) to make it simulate groundwater table levels.
Since no element of the existing model structure was designed to explicitly simulate groundwater level and because of the huge scale gap between point piezometric measurements and aquifer-scale storage volumes, we could not propose a physically explicit solution; thus, we investigated an empirical adaptation of the model structure. Section 2 recounts the process that led 170 to designing this adaptation and the calibration and evaluation schemes of the new model. Section 3 presents the hydroclimatic dataset of 107 catchments over mainland France that was used to evaluate the new calibration with respect to the original one, performed only on streamflow. Section 4 summarises the results and proposes recommendations for model calibration in various contexts.

Context
The French mainland territory hosts a large diversity of climatic, topographic and geological contexts, with catchments representing various hydrological and hydrogeological configurations. Several major aquifers are known to have a significant influence on surface waters, especially on low flows. The Paris basin, with its pile of secondary and tertiary sedimentary formations, hosts several major aquifers for surface hydrology: the Late Cretaceous chalk aquifer is known to govern the multiyear 180 dynamics of the Somme and part of the Seine and Loire basins, with a noteworthy long flood event after the exceptionally wet years of 1999 and 2000 (Pinault et al., 2005;Habets et al., 2010); the Beauce tertiary limestone aquifer controls the hydrology of a key agricultural region astride the Loire and the Seine basin, with a major groundwater contribution to low flows (Lalot et al., 2015); the Cenomanian sand aquifer in the Perche region, which is directly connected to the Eure and Huisne basins, is regarded as an essential groundwater reserve for the region and its declining trend is a major threat for Perche rivers 185 (Lenhardt et al., 2009). The second largest French sedimentary basin, the Aquitaine Basin, has a more complex configuration with thick multi-layer aquifers covered by poorly permeable formations, such as Pyrenean molasses. The outcropping areas of these formations are visible in figure 1.
The large sedimentary basins are not the only geological areas in France which host aquifers that are of interest for surface hydrology modelling. Aquifers located in alluvial plains, such as the international Rhineland aquifer -and its French part in 190 the Alsace plain quaternary alluvium -or the Bresse graben gravels, play a major role in the streamflow dynamics of the Saone and the Rhine basins. As highlighted in the Introduction, even small alluvial aquifers outside plains can have an influence on rivers: for example, several small left-bank tributaries of the Rhone are mostly ruled by the Bièvre moraine aquifer, with visible consequences on water quality (Bel et al., 1999). Regions in which geological formations are composed of metamorphic or igneous rocks, such as Brittany or the Ardennes, can host fractured bedrock aquifers, linked to surface rivers. The wide 195 monitoring network of rivers and groundwater in France, described below, allowed us to select a test dataset of catchments which is representative of this diversity.

Data sources
Climatic data -daily cumulative precipitation, average temperature and fraction of solid precipitation -were taken from the SAFRAN (Système d'Analyse Fournissant des Renseignements Adaptés à la Nivologie) re-analysis (Vidal et al., 2009) by 200 Météo France; they are available at a daily time step for the 1958-2018 period. Daily potential evaporation was computed using the formula by Oudin et al. (2005). Streamflow data were retrieved from the French national database Banque Hydro (Leleu et al., 2014;SCHAPI, 2021). These hydroclimatic data are aggregated at the catchment scale and at a daily time step for mainland France in the HydroSafran database (Delaigue et al., 2021), maintained by INRAE (Institut national de recherche pour l'agriculture, l'alimentation et l'environnement).

205
Groundwater level data are from the French national database ADES (BRGM, 2021) (Accès aux données sur les eaux souterraines), which gathers piezometric data from many providers in the French territory. Selected piezometers were taken from two reference networks, to ensure the quality of data: RNESOUPMOBRGM (national quantitative monitoring network managed by BRGM, the French national geological survey) and RNESP (heritage national network for groundwater monitoring).

Dataset selection 210
Catchments were selected on the basis of data availability criteria, shown below, and an analysis of the hydrogeological context through he French national reference cartography of hydrogeological formations BDLISA (Brugeron et al., 2018) (Base de données des limites des systèmes aquifères). For each catchment, one or several piezometers were chosen, assessing the connection of the monitored aquifers with surface water bodies. First, using the provided metadata, each piezometer extracted from the ADES database was associated with a hydrogeological entity in BDLISA, representing an aquifer. Catchments in 215 which anthropogenic activities -dams, major direct withdrawals or inflows -are known to have a significant influence on streamflow and catchments in which more than 10% of precipitation falls as snow were discarded. Then, for each catchment, piezometers associated with aquifers emerging inside the catchment boundaries were listed and maps -see an example in figure 2 -were produced to assess the importance of each hydrogeological formation for the catchment. Piezometers associated with formations with outcropping or sub-outcropping areas representing less than 5 % of the catchment area were discarded, 220 along with those located on the wrong side of underground watersheds -i.e. where groundwater does not flow to the catchment outlet but to another catchment -as identified by BDLISA. After this spatial selection, the available groundwater level and streamflow data were examined. An initial visual inspection of the time series was performed to eliminate data of too low quality, relying on the expertise of database maintainers. Then, catchments and piezometers were selected according to the following criteria :

225
-At least 20 years of continuously available streamflow data with less than 10 % of missing data; -At least 20 years of continuously available groundwater level data with less than 10 % of missing data; -At least 10 years of continuous contemporaneity between streamflow and groundwater level. Figure 2 shows two situations encountered at this stage: on the left, the Sensée river is connected to one monitored aquifer but three piezometers are available. In this case, the piezometer with the longest time series, with respect to contemporaneity with 230 streamflow data, was selected to represent the aquifer. On the right, the Seudre river is connected to two monitored aquifers with one piezometer for each one; in that case, the two piezometers are kept. The choice of keeping only one piezometer per aquifer in the catchment was made for the sake of simplicity; in most catchments, when visually comparing the dynamics of the groundwater level time series, no major difference was encountered between piezometers monitoring the same aquifer within the same catchment. A correlation study between groundwater level time series led to the same conclusions. Finally, this selection process yielded to a set of 107 catchments and 160 piezometer/catchment pairs. The majority of catchments -73 -are associated with only one piezometer; 22 of them with two; eight of them with three; one of them with four and three of them with five piezometers. Tables 1 and 2 show geographical and hydrogeological characteristics of the set.
The necessity to choose catchments which are not anthropogenically regulated led to a selection mainly composed of small headwater catchments, representative of the climatic diversity of the French territory. Dismissing the mountainous catchments 240 to avoid the influence of snow favoured the selection of lowland catchments, although several Vosges catchments, whose downstream part is linked to the Alsace plain aquifer, reach maximum altitudes above 1,000 m. However, the average altitude remains low enough for the solid precipitation fraction not to overtake 10 % of total precipitation at the catchment scale. The variability of the mean annual potential evaporation is low, since the dataset does not contain high-altitude catchments -in Table 2. Hydrological characteristics of the catchment dataset. The aridity index is defined as the quotient of annual rainfall and annual potential evaporation (PET). Catchment yield is the quotient of annual streamflow and annual rainfall.

Mean annual streamflow (mm)
Mean annual rainfall (

Parametrisation strategy
For each catchment, the model is calibrated to fit measured streamflow: the six parameters are determined through an optimisation process, by minimising an error criterion between measured and simulated streamflow, in a reference period. In this study, the Nash-Sutcliffe efficiency or NSE (Nash and Sutcliffe, 1970) was used for streamflow.
Since the six parameters have very different dimensions and variation ranges, each of them is transformed with a bijective function to fit into the [−9.99; 9.99] interval. Thereby, the optimisation space for optimal parameter research becomes

Study of the model correlation with piezometry
To adapt the existing model structure for groundwater level simulation, we followed an approach similar to other lumped conceptual models, i.e. using a store as a representation of the aquifer, regarding the water content in the store as a proxy for groundwater level. With this aim in mind, a correlation study was performed on the dataset presented in section 2, in order to identify which of the three conceptual reservoirs of the model structure was the most correlated with piezometry. The 280 production store is part of the production function, which computes the balance between rainfall and evaporation to determine the amount of water available for streamflow. The routing and the exponential stores are part of the routing function, which models the time repartition of this available water to simulate streamflow.
GR6J was calibrated for the 107 catchments of the dataset using the Nash-Sutcliffe efficiency criterion computed on the square root of streamflow. The algorithm by Michel (1991), as implemented in airGR R package (Coron et al., 2021; R Core 285 Figure 5. Distributions of correlations between piezometry and several states of the model Team, 2021), was used on the whole period of available climatic data . Then, the time series of model states obtained -the levels of the three conceptual stores and simulated streamflow as control data -and the groundwater level time series were aggregated at a monthly time step, to avoid problems caused by missing piezometry measurements. Afterwards, for each of the 160 catchment/piezometer pairs, Spearman's correlation (Spearman, 1907) between piezometry and each state series was computed; the results are summarised as boxplots in figure 5. The exponential store ) -see 290 figure 4 for a description of the model -is the most correlated with piezometry and, moreover, it is the only store to be more correlated with groundwater level than with streamflow. The median correlation obtained is 0.762 and 80% of pairs reach a value higher than 0.5.
A high Spearman correlation may highlight a non-linear relationship, since it is a rank correlation. However, it does not seem to be the case here: other investigations not detailed here show that the relationship between the exponential store content and 295 the groundwater level can be regarded as linear, all the more so as the correlation is high. Therefore, it was decided to use the exponential store to simulate piezometry, with an adapted scheme presented in the following section.

Adaptation of the model scheme
A built-in module is added to the existing model structure to simulate groundwater level. The streamflow simulation chain is not modified, but a new output is added to the model, through a linear transformation of the exponential store level.

300
Groundwater level absolute values strongly depend on the piezometer location -its altitude, but also its position with respect to the catchment topography. Indeed, two piezometers monitoring the same aquifer and therefore representing the same dynamics can have different mean levels and their fluctuations can have different ranges -for instance, if the first one is located on a plateau while the second one is on a slope. To avoid having to take into account these problems, it was decided to work with normalised groundwater level δz, where z is absolute groundwater level,z is its mean and σ z its standard deviation: To represent the relationship between the exponential store level Exp -in mm -and simulated normalised groundwater level δ z,sim , several polynomial relationships were investigated. It appeared that using a function of degree 2 or more was not useful to improve performance. Therefore, an affine relationship is added to the model, with two additional parameters X 7 and X 8 , using the following equation: Simulated piezometry z sim can be computed by reversing equation 1: X 7 is the groundwater linear coefficient; trials have shown that it generally takes values between 0 and 1 but can reach 4.

Composite calibration strategy
Now that two additional parameters have been added to the model structure to simulate piezometry, it is necessary to determine their value through an adapted parametrisation strategy. A composite objective function is chosen for calibration, using a linear combination of a criterion on streamflow -the Nash-Sutcliffe efficiency computed on the square root of streamflow -and a 320 criterion on piezometry, called ZError and defined as: Computations detailed in appendix D show that this criterion is in fact Nash-Sutcliffe efficiency, expressed for groundwater level instead of streamflow. Since the two criteria on streamflow and piezometry have the same variation ranges -] − ∞; 1]and the same properties, the objective function C for composite calibration can be taken as a linear combination of the two 325 criteria, with a weight α: α can take any value between 0 and 1: α = 0 means that the calibration is performed only on streamflow and α = 1 only on piezometry. In order to find a compromise between these two objectives, 51 values are explored from 0 to 1 by a step of 0.02.
For each value of α, C(α) is maximised as a function of eight parameters. The parameter space transformations described 330 in appendix B are used to convert the optimisation space into the hypercube [−9.99; 9.99] 8 . The differential evolution global optimisation algorithm -implemented in the RcppDE R package (Price et al., 2006;Mullen et al., 2011;Ardia et al., 2011a, b;Eddelbuettel, 2018;Slater et al., 2019;Ardia et al., 2020; R Core Team, 2021) -is then executed to find the global optimal point for the eight parameters.

Split-sample test evaluation scheme 335
To assess the effect on streamflow simulation performance of the new calibration scheme described above, a split-sample test (Klemeš, 1986) is conducted for each catchment/piezometer pair of the assessment dataset described in section 2. For each pair, the available data are divided into two time periods P 1 and P 2 of equal length, defined so as to encompass the same number of data points for which both groundwater level and streamflow are available. Thereby, both periods contain the same amount of information and can be equally used for calibration and validation. The exact duration of periods depends on the pair, since 340 the data availability time periods are diverse: durations spread from 5.6 to 28.5 years by period. Before each period, a warm-up timespan of 5 years is set: the model is run on this period but the resulting simulated values are not used to compute criteria.
After determining these periods, the adapted model structure is calibrated on P 1 using C(α), for each value of α; the parameter set obtained is then used to run the model on P 2 and compute several validation criteria. Then, the periods are switched and the same procedure is executed. The following validation criteria are used:

345
-N SE( √ Q) to evaluate the model performance on the whole streamflow spectrum; -N SE( 3 √ Q) to evaluate the model performance on low-flows. It was preferred to zero-diverging transformations such as 1 Q or log(Q) to avoid numerical problems with very low streamflow values; -ZError to assess the model performance in groundwater level simulation.
Since the evaluation is performed for validation, the results presented in section 4 are, unless otherwise specified, validation 350 results.
To assess the benefit of using groundwater level data in the calibration process, the distributions of evaluation criteria values need to be compared to reference ones. For streamflow, the value α = 0 corresponds to the original calibration framework, only performed on observed streamflow data. Parameters X 7 and X 8 are only used to simulate normalised groundwater level and therefore, when α = 0, the sensitivity of the calibration criterion to their values is zero. Thus, they are randomly determined by 355 the stochastic optimisation algorithm and no relevant normalised groundwater level is simulated, except a random affine transformation of the exponential store level which cannot be compared to observed data. Therefore, another reference distribution than the one obtained for α = 0 is needed to evaluate groundwater level simulation. The value α = 1 is used, since it is the case in which the model is calibrated only with observed groundwater level data and no streamflow; we thus expect the best theoretically possible groundwater level simulation performance for this value of α.

360
The differences between evaluation criteria distributions are evaluated visually and then, in order to objectify them, a Wilcoxon-Mann-Whitney test (Wilcoxon, 1945;Mann and Whitney, 1947;Bauer, 1972) is conducted. The distributions obtained for the values of α are compared with the reference ones: α = 0 for NSE; α = 1 for ZError. Therefore, for each value of α, two tests are conducted: one to assess whether the streamflow simulation performance has significantly deteriorated and one to evaluate whether the performance of groundwater level simulation is significantly lower than the one obtained for α = 1.

365
To assess the influence of the geological context, the test dataset of 160 catchment/piezometer pairs was divided into six groups, detailed in Model performance for streamflow simulation in validation is not improved by the proposed calibration scheme. Figure 6 shows that the distribution of Nash-Sutcliffe efficiency computed on the square root of streamflow does not appear to change significantly for values of α under 0.34; for higher values, the performances deteriorate, but it is surprising to note that they 375 slowly decrease while increasing α and they remain acceptable until α = 0.84 -even though the loss of about 0.2 is significant. Beyond these values, the performances have considerably deteriorated: the calibration cannot be suitably performed on groundwater level time series only; this is an expected result, since the dynamics of groundwater level and streamflow signal  4.2 Is the model able to simulate groundwater levels?
The model appears to be able to simulate groundwater levels with a satisfactory performance. Figure 7 shows the distributions of the ZError criterion for the 51 values of α, compared to the theoretically maximum possible performance which is obtained with α = 1 -i.e. a calibration performed only on groundwater level with no streamflow information. The distribution of ZError for α under 0.1, with very little groundwater level information added to the calibration process, the performance is much lower, but even for α = 0.02, it is acceptable, with a median ZError around 0.5.

Recommended calibration framework and examples
Results of the statistical evaluation of differences between performance criteria distributions are presented as p-values in figure   390 8, with a significance threshold of 5%.
It appears that for values of α greater than 0.22, streamflow simulation performance has significantly deteriorated; for α lower than 0.12, groundwater level simulation performance is significantly below that obtained for higher values of α.
A narrow interval -α between 0.14 and 0.2 -corresponds to values for which the model performance for both outputs is comparable to reference distributions: the model is as good at streamflow simulation as the original model and it cannot be 395 better at groundwater level simulation. Therefore, it was decided to choose the value α = 0.16 as the recommended calibration framework, even though any value in the described interval could be chosen without significantly changing the results.     Another example catchment is presented in figures 11, 12 and 13: the Seudre River in Saint-André-de-Lidon. It is a small coastal river located in Saintonge, linked to two regional aquifers of the Aquitaine basin -see figure 2: the Cenomanian sands 405 and limestones and the Late Cretaceous multi-layer limestones, each one being monitored by one selected piezometer. Figure   11 shows the results on streamflow: adding piezometry to the calibration process did not significantly improve the performance  which piezometry in underestimated, is also unsatisfactory for streamflow simulation, since the model is unable to reproduce the whole variability of the hydrograph during this period.

Is the new parametrisation stable?
The parametrisation stability between periods is another measure of the robustness of the model: if the parameter values depend on the calibration period, it will cast doubt on the model capacity to extrapolate streamflow values outside this period and thus, 415 to be used, for instance, as a forecasting tool. The split-sample test allows us to assess this stability by comparing the parameter values between the two calibration periods P 1 and P 2 . Figure 14 shows the results of this comparison for the six parameters  of GR6J and the original calibration framework -obtained with α = 0; figure 15 does so for the modified calibration, with the two added parameters. For each parameter, the Pearson correlation between the values obtained for the two periods was computed.

420
The original calibration framework leads to a rather stable parametrisation, except for the exchange threshold X 5 with a non-significant correlation between the two periods. The modified calibration, using groundwater level data, yields more stable parameter values, with increased correlations between the two periods, except for the two parameters ruling the inter-catchment exchange function, X 2 and X 5 , for which the correlations have slightly deteriorated. The two added parameters, X 7 and X 8 , are also very stable between periods, with a correlation of 0.73. Since the modified calibration framework is a new constraint 425 on the routing function, it is not surprising to note that the three routing parameters -X 3 , X 4 and X 6 -become significantly more stable between periods, which is a sign of an improved model robustness.
The difficult transferability of the exchange function parameter values, i.e. the relevance of using them while they were calibrated on another period of time, was highlighted by de Lavenne et al. (2016). This function is sometimes regarded as the flux between catchments as they are defined by surface topography, which may not correspond to underground watersheds -for  (2008), in the general performance of the model. The stability issues exposed by the present study highlight the need for further development of this exchange function to take into account the henceforth explicit representation of groundwater level through the exponential reservoir. Other parts of the Paris basin, the North Sea coastal rivers and the Alsace plain have a mixed situation but generally do not reach the extreme points of the NSE distributions.
Catchments in which the performance gain between the two calibration frameworks is significant, i.e. beyond 0.05, are all 445 located either on the Picardy and Normandy chalk or in the Beauce plain. It is interesting to note this significant improvement is observed in catchments in which the initial model performances was low. However, these areas also host catchments for which the composite calibration framework produces a significant deterioration of streamflow simulation performance.  As for the groundwater level simulation, the performance does not follow the same spatial distribution as streamflow. High ZError values are observed in Brittany and in the western part of the Paris basin, along a crescent running from Artois to The sub-group analysis yields clearer results, visible on figure 17. For the absolute streamflow simulation performance, the original calibration framework yields high values of NSE for groups 2, 4 and 6, medium ones for group 1 and lower scores for groups 3 and 5. These patterns are found again for the composite calibration framework, even though the distribution 455 of performances for group 5 is narrower. Regarding the difference between the two calibration frameworks, a significant improvement for a small part of the dataset is observed in groups 3 and 5 too, with no deterioration of the median performance in the group, while in groups 1, 2, 4 and 6, the median performance is reduced. A significant decrease in performance is observed for a quarter of group 2 and more than a decile of group 4. As for groundwater level simulation, groups 3, 4 and 6 have narrow distributions centred around a high median score -around 0.7 -while other groups have much wider distributions 460 including simultaneously high, medium and low scores.
The analysis of results between groups of catchments with similar hydrogeological contexts allows to formulate general recommendations to model users, shown in section 5.2. However, the differences between sub-groups have not been successfully linked to hydrogeological characteristics of aquifers, such as permeability or transmissivity. In fact, these data are difficultly available over the French territory, except in experimental, extensively instrumented catchments. The results presented in the previous sections can be seen as disappointing with reference to the objective of the study. Bringing new information, observed groundwater level data, to the GR6J model yielded no improvement in streamflow simulation performance. Of course, a question arises that we do not wish to avoid in this paper: is this disappointing conclusion modelspecific, i.e. is it due to the conceptual nature of the GR6J model? Would a less conceptual and more descriptive model have 470 yielded more satisfactory results? Would a more heavily parameterised model have yielded more satisfactory results? Let us first answer this second question: equifinality is a plague in all modelling efforts, and we would not claim as a success an operation that would consist in improving marginally the situation of a model that was previously impossible to calibrate.
Thus, we reject the critique on model complexity as unworthy for a modeller. Concerning the physical realism of the model, no a priori conclusion can be drawn on more physically-based models performance without any empirical evaluation on a 475 large set of catchments. However, the fact that the exponential reservoir -introduced in GR6J structure to represent the slow aquifer transfers -represents either well or very well the dynamics of piezometers on a large catchment set cannot be the sheer consequence of luck. If the piezometric measurements are well represented, both on the calibration and the validation period, this means that our mathematical representation is adequate to describe the underlying physical processes, even without having been designed to do so.

480
Similar studies performed on other conceptual models did not result in different conclusions: Thiéry (1988) does not mention an improvement in streamflow simulation when calibrating GARDÉNIA with groundwater level data; the study by Moore and Bell (2002) on the PDM model is not conclusive since the new model structure is not compared to a reference one; finally, the calibration framework proposed by Széles et al. (2020) for HBV model gave a similar conclusion to the one of this study: using groundwater level data for calibration helps representing aquifer storage in the model, but did not result in improving 485 streamflow simulation.
Regarding the catchment dataset, we tried to constitute a set of catchments that is the widest possible with respect to our data sources and the selection criteria exposed in section 2.3. A selection bias in the present study is possible, mainly because aquifers regarded as important for surface water resources are the ones that have been monitored for the longest time, in the largest number of measurement points: the chalk aquifer in Picardy, the Alsace plain alluvium of the Beauce tertiary limestones.

490
However, the catchment dataset used in this study is diverse enough to draw general conclusions, at least in climatic and hydrogeological conditions similar to the ones observed in mainland France. Further evaluation on different contexts, in other countries, would help putting our results into perspective.

495
The study presented here concerns the implementation of a new calibration procedure for an existing streamflow simulation model, GR6J; it is not about the development of a completely new model. For each catchment, among all parameter sets that yield equivalent streamflow simulations, we identified a particular parameter set which is able to simulate, additionally, groundwater level. This new modelling capacity does not induce a significant deterioration in the streamflow simulation performance, neither does it improve it, except in a few particular cases. However, an advantage of the composite calibration framework 500 was highlighted: since we identified a particular parameter set among equivalent sets for streamflow, we probably reduced equifinality in the model calibration, which is suggested by the parameter stability improvement. We may thus expect a more robust model, even if a specific equifinality study would help enforce this conclusion.
The results presented in this paper can be seen as truly encouraging -realistic representation of the piezometric variability as one of the states of the model -but scientific honesty requires us to mention that to us they were -at least initially -505 truly disappointing, because we aimed at improving the overall representation of streamflow through inclusion of piezometric information and not the other way around.

Recommendations to users: which calibration should be used in which context?
The analyses performed in this study lead to the following recommendations for the GR6J model calibration: in most catchments, no improvement in streamflow simulation is expected using a composite calibration framework with 510 groundwater level data; in catchments in which the original model already performs well, adding groundwater level data to the calibration is probably useless to improve streamflow simulation performance; in catchments in which the model reaches lower validation scores, a performance improvement is possible but not probable and it is most likely to happen in a chalk or tertiary limestone context;

515
the model, with composite calibration, is able to simulate groundwater level with satisfactory performance for chalk, tertiary limestones and Cretaceous sand aquifers; groundwater level simulation is more uncertain for other geological contexts (quaternary formations, bedrock, Triassic sandstones or Jurassic limestones). Good results have been observed in the bedrock context of Brittany.

520
Beyond streamflow simulation, being able to simulate groundwater level using such a lumped conceptual model -much simpler and lighter to implement than usual groundwater models -is likely to lead to new uses of GR6J. Thereby, since GR6J is part of the operational low-flow forecasting platform Premhyce (Nicolle et al., 2020;Tilmant et al., 2020), it is conceivable to use it as a groundwater level sub-seasonal forecasting tool in some chosen points in France, which is crucial for an anticipative management of groundwater resources. Further studies are needed to evaluate the framework in forecasting mode; a data 525 assimilation process may be necessary to improve the forecast liability and smoothness. Although this study does not include any modification of the streamflow simulation scheme, it offers an overview of possible modifications: the division coefficient between the routing and the exponential stores remained fixed in the present study and may become a new model parameter to rule the size of the aquifer-river flux; the role of the exchange function needs to be clarified and its formulation has to become more stable and readable.

530
Code and data availability. Streamflow data are available on the Banque HYDRO website (SCHAPI, 2021), their use is limited to particular  (Coron et al., 2021). The national hydrogeological reference map is available on the BD LISA website https://bdlisa.eaufrance.fr (Brugeron et al., 2018).

535
Appendix A: Detailed operation of the GR6J model

A1 Production function
The production function is mainly composed of a production store, whose capacity X 1 is the first parameter of the model.
Inputs are P the daily rainfall depth and E the daily potential evaporation. Rainfall is neutralised by evaporation to compute net rainfall P n and net evaporation E n , through a case disjunction:

540
-If P > E, then P n = P − E and E n = 0; -Otherwise, P n = 0 and E n = E − P .
If P n is positive, a part of it, P s , feeds the production store, which has a level S and a parameter X 1 : Otherwise, a part E s of E n is taken from the production store: The content of the production store is then updated by S = S − E s + P s . Part of the water content of the production store P erc percolates to the routing function: The content of the production store is updated again by S = S − P erc. The quantity of water P r that reaches the routing 550 part of the model is finally P r = P erc + P n − P s .

A2 Unit hydrographs
P r is divided into two components: 90% are routed through the one-sided unit hydrograph U H 1 and the remaining 10%, through a two-sided unit hydrograph U H 2 . The cumulated ordinates of the unit hydrographs SH 1 (t) and SH 2 (t) are determined by the basetime X 4 , for t ∈ N: Ordinates U H 1 (t) and U H 2 (t) are then computed differentiating the cumulated ordinates: Finally, the respective outputs of the first unit hydrograph Q 9 and the second one Q 1 are computed through a convolution of 560 P r :

A3 Routing stores
This part of the model structure is composed of two branches, that of the stores -fed by Q 9 from the first unit hydrograph -565 and the direct branch -fed by Q 1 from the second unit hydrograph. In the stores' branch, Q 9 is partitioned between the two stores, with 60% for the routing store and 40% for the exponential store. A potential exchange Exch is computed from the water content of the routing stores Rout, its capacity X 3 and the exchange parameters X 2 and X 5 : This flux can be negative, zero or positive. Since the routing store cannot have a water content Rout under zero, the actual 570 exchange flux from the routing store AExch 1 is limited by the content of the latter, which gives the following equation: The routing reservoir is then filled with: And the output Q R of the routing reservoir is computed as: The water content of the reservoir is finally updated as Rout = Rout − Q R .
As for the exponential store, it is a bottomless reservoir whose water content Exp can be negative. Therefore, no case disjunction is necessary and the store can be filled with: Its output is computed, using its capacity X 6 , as: Q Rexp = X 6 log 1 + exp Exp X 6 (A14) The exponential store can now be updated using Exp = Exp − Q Rexp .
The second branch, fed by Q 1 , is also subject to exchange AExch 2 with a case disjunction: The output of the second branch Q d can now be computed using Q d = Q 1 − AExch2. The simulated streamflow Q sim is finally computed by adding the components from the three branches: Appendix B: Parameter ranges and transformations used for original and modified GR6J calibration Table B1. Parameter ranges and transformation functions for GR6J model calibration. To make calibration easier, the parameter original search ranges, exposed below, are transformed to [−9.99, 9.99  Appendix D: Equivalence between ZError and Nash-Sutcliffe efficiency The model structure proposed here does not simulate absolute groundwater level, but only its normalised version. Then, by reversing the normalisation equation 1, the equation 3 allows to get the absolute groundwater level. By combining equations 1, 3 and 4, the following expression of ZError is found: Which gives: By definition of standard deviation, we have: By combining the two previous equations, we get: Which is exactly the definition of Nash-Sutcliffe efficiency or NSE (Nash and Sutcliffe, 1970), expressed for groundwater level instead of streamflow. This shows the correspondence between ZError and NSE.
Author contributions. Both authors conceptualised the method. AP performed the tests on the dataset and developed the computing code.
Both authors wrote the paper.
our warmest thanks to Charles PERRIN (INRAE Antony), Paul ASTAGNEAU (INRAE Antony) and François BOURGIN (INRAE Antony) for their thorough proofreading which considerably improved the manuscript. Last but not least, computing codes could not have been developed without the precious expertise and availability of Olivier DELAIGUE (INRAE Antony).