Data analysis and model building for understanding catchment processes: the case study of the Thur catchment

The development of semidistributed hydrological models that reflect the dominant processes controlling streamflow spatial variability is a challenging task. This study illustrates this process through the case of the Thur catchment (Switzerland, 1702 km), an alpine and pre–alpine catchment with large spatial variability in terms of climate, landscape, and streamflow (measured at 10 subcatchments). In order to appraise the dominant processes that control catchment response, and build a 15 model that reflects them, the model development follows a two–stages approach. In a first stage, we use correlation analysis to identify the main influencing factors on the spatial variability of streamflow signatures. Results of this analysis show that precipitation averages control signatures of water balance, snow processes control signatures of seasonality, while landscape characteristics (especially geology) control signatures characterizing the importance of baseflow. In a second stage, the results of the previous analysis are used to develop a set of model experiments aimed at determining an appropriate model 20 representation of the Thur catchment. These experiments confirm that only a hydrological model that accounts for the heterogeneity of precipitation, snow related processes, and landscape features such as geology, produces hydrographs that have signatures similar to the observed ones. This model provides consistent results in space–time validation, which is promising for predictions in ungauged basins. The presented methodology for model building can be transferred to other case studies, since the data used in this work (meteorological variables, streamflow, morphology and geology maps) is 25 available in numerous regions around the globe.


Introduction
Due to the spatial variability of landscape (e.g. topography, land use, etc.) and climate, hydrographs can differ substantially between catchments. Being able to quantify and explain hydrograph spatial variability is important both to improve processes understanding and to make predictions useful for many human activities, such as flood protection, drinking water 30 production, agriculture, energy production, and riverine ecosystems management (e.g., Hurford and Harou, 2014).
Understanding catchment differences and, more specifically, how to transfer hydrological knowledge, methods, and theories between places, is a common objective of many research areas in hydrology, including comparative hydrology (e.g., Falkenmark and Chapman, 1989), model regionalization (e.g., Parajka et al., 2005), catchment classification (e.g., Wagener et al., 2007), and prediction in ungauged basins (e.g., Hrachowitz et al., 2013). In the case of streamflow, the attempt to explain its spatial variability is typically accomplished either using statistical approaches, which are designed to regionalize 5 selected characteristics of the hydrograph (streamflow signatures), or through hydrological models that account for relevant spatial information. In particular, statistical approaches such as regression analysis (e.g., Berger and Entekhabi, 2001;Bloomfield et al., 2009) and correlation analysis (e.g., Trancoso et al., 2017), or machine learning techniques like clustering ( e.g., Sawicz et al., 2011;Toth, 2013;Kuentz et al., 2017) are used to group together catchments that present similar characteristics and to extrapolate the signatures where unknown. Such approaches have been useful to quantify the 10 hydrological variability and identify its principal drivers. However, they are often not designed to discover causality links and can be affected by multicollinearity, that arises when multiple factors are correlated internally and with the target variable (Kroll and Song, 2013).
By incorporating spatial information about meteorological forcing and landscape characteristics, semidistributed hydrological models have the ability to mimic the mechanisms that influence hydrograph spatial variability. However, 15 identifying the relevant mechanisms is challenging. One possibility is to be as inclusive as possible in accounting for all the catchment properties that are, in principle, important in controlling catchment response. However, this approach leads to models that tend to be data demanding and contain many parameters. For example, Gurtz et al. (1999) considered several landscape characteristics (elevation, land use, etc.) in their application of a semidistributed model to the Thur catchment (Switzerland), which resulted into a model with hundreds of hydrological response units (HRUs) that were defined a-priori 20 based on the complexity of the catchment. The other option is to try to identify the most relevant processes and neglect others, in order to control model complexity. For example, Fenicia et al. (2016) compared various model hypotheses to determine an appropriate discretization of the catchment in HRUs and appropriate structures for different HRUs. Antonetti et al. (2016) used a map of dominant runoff processes following Scherrer and Naef (2003) for defining HRUs. However, these approaches require a good experimental understanding of the area, which is not always available. 25 Convincing model calibration-validation strategies are essential to provide confidence that the model ability to fit observations is a reflection of model realism and not a consequence of calibrating an overparameterized model (e.g., Andréassian et al., 2009). A common approach for calibration of semidistributed models is the so called 'sequential' approach, where subcatchments are calibrated sequentially from upstream to downstream (e.g., Verbunt et al., 2006;Feyen et al., 2008;Lerat et al., 2012;De Lavenne et al., 2016). Although this approach may provide good fits and therefore it has 30 its practical utility where data is available, it does not provide understanding into the causes of streamflow spatial variability and results into models that are not spatially transferable. Moreover, such models are prone to contain many parameters, as each subcatchment would be represented by its own set of parameters. Alternative calibration-validation approaches that enable model validation not only in time but also in space are conceptually preferable, particularly when the modeling is used for process understanding or prediction in ungauged locations (e.g., Wagener et al., 2004;Fenicia et al., 2016).
This study combines the strengths of catchment regionalization approaches and semidistributed hydrological models by first using regression analysis to understand the main causes of variability of streamflow signatures, and then using this analysis to inform the structure of a distributed hydrological model. The model objective is to explain the observed spatial diversity 5 of streamflow characteristics with the minimum possible complexity, while maintaining a process based interpretation. In particular, the objectives of the study are to: (1) explore the spatial variability present in the Swiss Thur catchment regarding landscape characteristics, meteorological forcing and streamflow signatures; (2) identify the main drivers that explain the variability of the hydrological response; (3) based on this analysis, build a set of model experiments aimed to test the relative importance of dominant processes and their effect on the hydrograph; (4) appraise model assumptions against competing 10 alternatives using a stringent validation strategy.
The paper is organized as follows: Section 2 presents the study area and gives information about data availability; Section 3 and Sect. 4 are both divided in methods and results and present, respectively, the correlation analysis and the modeling part of this paper; Section 5 puts the results of this work in perspective, comparing them with other studies; Section 6, finally, summarizes the main conclusions. 15

Study area
This study is carried out in the Thur catchment ( Fig. 1), located in north-east of Switzerland, south-west of the Lake Constance. With a total length of 127 km and a catchment area of 1702 km 2 , the Thur is the longest Swiss river without any natural or artificial reservoir along its course. The Thur river is very dynamic, with streamflow values that can change by two orders of magnitude within a few hours (Schirmer et al., 2014). 20 The Thur catchment has been subject of several studies in the past; Gurtz et al. (1999) performed the first modelling study on the entire catchment using a semi-distributed hydrological model; Abbaspour et al. (2007) modelled hydrology and water quality using the SWAT model; Fundel et al. (2013) and Jorg-Hess et al. (2015) focused on low flows and droughts ;Jasper et al. (2004) investigated the impact of climate change on the natural water budget. Other modelling studies include also Melsen et al. (2014) and Melsen et al. (2016), which investigated parameters estimation in data limited scenarios and their 25 transferability across spatial and temporal scales, and Brunner et al. (2019) who studied the spatial dependence of floods.
The Thur includes also a small-size experimental subcatchment (Rietholzbach, called Mosnang in this paper after the name of the gauging station) that was subject of many field studies at the interface between process understanding and hydrological modelling (e.g., Menzel, 1996;Gurtz et al., 2003;Seneviratne et al., 2012;von Freyberg et al., 2014;von Freyberg et al., 2015). 30 The topography of the catchment is presented in Fig. 1b; the elevation ranges between 356 m a.s.l. at the outlet and 2502 m a.s.l. at Mount Säntis. The majority of the catchment lies below 1000 m a.s.l (75 %) and only 0.6 % is above 2000 m a.s.l. (Gurtz et al., 1999). Based on topography (Fig. 1b), the catchment can be visually subdivided into two distinct regions: the northern part, with low elevation and dominated by hills and flat land, and the southern part, which presents a mountainous landscape.
The land use (Fig. 1c) is dominated by pasture and sparse vegetated soil (60 %) and forest (25 %); urbanized and cultivated areas are located mainly in the north and cover 7 % and 4 % of the catchment respectively. 5 Most of the catchment is underlain by conglomerates, marl incrustations and sandstone (Gurtz et al., 1999). For the purpose of this study, the geological formations have been divided into three classes (Fig. 1d): "consolidated", covering mainly the mountainous part of the catchment, "unconsolidated", located in the north, and "alluvial", located in the proximity of the river network, mainly in the plateau area; the latter formation constitutes the main source of groundwater in the region (Schirmer et al., 2014). The soil depth (Fig. 1e) is shallower in the mountainous part of the catchment and deeper in the 10 northern part.
Based on the availability of gauging stations (Table 1), the catchment was divided in 10 subcatchments (Fig. 1a), with a total drained area that ranges between 3.2 km 2 (Mosnang) and 1702 km 2 (Andelfingen). Streamflow time series are obtained from the Federal Office for the Environment FOEN and the data is available from 1974 to 2017 but is used only form 1981 to 2005 to match the precipitation, temperature, and potential evapotranspiration (PET) time series. In the considered range, the 15 streamflow data are relatively continuous, with two gaps, one in St. Gallen, from 31 December 1981 to 01 January 1983, and the other one in Herisau, from 31 December 1982 to 09 May 1983.
The raw maps (topography, land use, geology, and soil) are obtained from the Federal Office of Topography swisstopo. The meteorological data is obtained from the Federal Office of Meteorology and Climatology MeteoSwiss. Precipitation and temperature are interpolated, as done in Melsen et al. (2016), with the pre-processing tool WINMET (Viviroli et al., 2009) 20 using inverse distance weight (IDW) and detrended IDW respectively; while the first method considers only the horizontal variability (related to the distance from the meteorological stations), the second adds a vertical component to the variability related with the elevation (Garen and Marks, 2001). PET data is then obtained, as done in Gurtz et al. (1999), starting from meteorological and land use data, using the Penman-Monteith equation (Monteith, 1975), implemented as part of the hydrological model PREVAH (Viviroli et al., 2009). All these values are calculated at pixel (100 m) scale and then averaged 25 over the subcatchments. All the time series are used at daily resolution in the subsequent analyses, aggregating the available hourly data. This choice was influenced on one hand by the need of limiting the computational demand for the model experiments, for which a coarser temporal resolution is preferable, and on the other hand by the need of representing relevant hydrograph dynamics, for which finer temporal resolution is desirable (e.g., Kavetski et al., 2011). A daily data resolution, although it may obscure subdaily process dynamics, appeared to be a good compromise, and it is a typical choice in 30 distributed model applications at such spatial scales (e.g., Kirchner et al., 2004).
3 Identification of influencing factors on the spatial variability of streamflow signatures

Methodology
The purpose of the analysis presented in this section is to understand the influence of climatic conditions and landscape characteristics on streamflow. Climatic conditions are represented by precipitation, potential evaporation, and temperature time series. Landscape characteristics are represented by maps of topography, land use, geology and soil. 5 Climatic conditions, landscape characteristics and streamflow are represented through a set of statistics. In the following, statistics calculated based on streamflow data will be called streamflow "signatures", as it is often done in catchment classification literature (e.g., Sivapalan, 2006). We will refer to climatic and landscape indices for statistics calculated on climatic data and landscape characteristics. A broad list of signatures and indices is presented in Sect. 3.1.1; Section 3.1.2 presents an approach for reducing such list to a set of meaningful variables; Section 3.1.3 illustrates the approach for 10 determining the indices that mostly control streamflow signatures.

Catchment indices for representing streamflow, climate, and landscape
Streamflow signatures ( ) and climatic indices ( ) were obtained using streamflow, precipitation, PET, and temperature time series. The values were calculated using 24 years of data, between 01 September 1981 and 31 August 2005; we considered the 01 September as the beginning of the hydrological year. The periods with gaps in the data (refer to Sect. 2 for details) 15 were discarded from the analysis of the specific subcatchment. Landscape indices were obtained using the maps described in Section 2. Addor et al. (2017) recently compiled a comprehensive list of streamflow signatures and climatic indices for characterizing catchment behaviour (see Table 3 in Addor et al. (2017)). Here, we adopted their selection. The streamflow signatures here considered are described hereafter, followed by an explanation of their rationale: 20 • average daily streamflow Q = �, where is the streamflow time series and the overbar represents the average over the observation period; • runoff ratio RR = � � , where is the precipitation time series; • streamflow elasticity ( EL ) defined as where Δ� and Δ� represent the streamflow and precipitation jumps between two consecutive years and med is the median function; • slope of the flow duration curve ( FDC ) defined as the slope between the log-transformed 33 rd and 66 th streamflow percentiles; • baseflow index BFI = ( ) ������ � , where ( ) represents the baseflow and was calculated using a low-pass filter as 30 illustrated in Ladson et al. (2013) with the equation with t (f) representing the quick flow. The settings of the filter were taken according to the findings of Ladson et al. (2013) and, in particular, three filter passes were applied (forward, backward, and forward), the parameter b was chosen to be equal to 0.925, and a reflection of 30 time steps at the beginning and at the end of the time series was 5 used; • mean half streamflow date ( HFD ) (Court, 1962), defined as the number of days needed in order to have a cumulated streamflow that reaches the 50 % of the total annual streamflow; • 5 th and 95 th percentiles of the streamflow ( Q5 and Q95 respectively); • frequency ( HQF ) and mean duration ( HQD ) of high-flow events: they are defined as the days when the streamflow 10 is bigger than nine times the median daily streamflow; • frequency ( LQF ) and mean duration ( LQD ) of low-flow events: they are defined as the days when the streamflow is smaller than 0.2 times the mean daily streamflow; The frequency of days with zero streamflow, present in Addor et al. (2017), was not considered in this study because there are no ephemeral subcatchments in the study area. 15 This group of streamflow signatures is capable of capturing various characteristics of the hydrograph: Q measures the overall water flows, RR represents the proportion of precipitation that becomes streamflow, EL measures the sensitivity of the streamflow to precipitation variations, with a value greater than 1 indicating an elastic subcatchment (i.e. sensitive to change of precipitation) (Sawicz et al., 2011), FDC measures the variability of the hydrograph with a steeper flow duration curve indicating a more variable streamflow, BFI measures the magnitude of the baseflow component of the hydrograph, and 20 can be considered as a proxy for the relative amount of groundwater flow in the hydrograph, HFD measures the streamflow seasonality, Q5 , LQF , and LQD measure low-flow dynamics, Q95 , HQF , and HQD measure high-flow dynamics.
Climatology was represented through the following indices (see Addor et al. (2017), Table 2): • average precipitation P = �; • average PET PET = �����, where is the potential evapotranspiration time series; 25 • aridity index AI = ������ � ; • fraction of snow ( FS ), defined as the volumetric fraction of precipitation falling as snow (i.e. on days colder than 0 °C); • frequency ( HPF ) and mean duration ( HPD ) of high precipitation events: they are defined as days when the precipitation is bigger than five times the mean daily precipitation; 30 • season ( HPS ) with most high precipitation events (defined as above); • frequency ( LPF ) and mean duration ( LPD ) of dry days: they defined as days when the precipitation is lower than 1 mm day -1; • season ( LPS ) with most dry days (defined as above).
The seasonality of precipitation used in Addor et al. (2017) was not considered in this study as it relied on fitting a sinusoidal function to the precipitation values, which in our case did not produce reliable results. Nevertheless, these climatological 5 indices are able to comprehensively represent the climatic conditions of the suubcatchment, with P representing average water input, PET representing average evaporative demand, AI measuring the dryness of the climate, FS measuring the relative importance of snow, HPF , HPD , and HPS measuring the importance of intense precipitation events, and LPF , LPD , and LPS measuring the importance of dry days. The landscape characteristics were divided in four categories: topography, land use, soil, and geology. In order to quantify 10 the characteristics of each category, a set of indices ( ) was defined. It is important to notice that all the areas calculated in this analysis were normalized by the respective subcatchment area ( A ) in order to get comparable values between subcatchments of different size.
Topography was represented with the following indices, calculated based on the digital elevation model (DEM): • average elevation ( TE ); 15 • average slope ( TSm ); • fraction of the subcatchment with steep areas ( TSs ) , with slope larger than 10°; • aspect, i.e. fraction of the subcatchment facing north ( TAn ), south ( TAs ), or east and west ( TAew ).
Land use was represented with the following characteristics, obtained by reclassifying the land use map in four categories (from 22 original classes): 20 • fraction of the subcatchment with crops land use ( LC ); • fraction of the subcatchment with pasture land use ( LP ); • fraction of the subcatchment with forest land use ( LF ); • fraction of the subcatchment with urbanized land use ( LU ).
Soil type was represented with the following indices, derived by the soil map: 25 • fraction of the subcatchment with deep soil (soil depth greater than two meters) ( SD ); • average soil depth ( SM ).
Geology was represented by the following indices, obtained by reclassifying the original map in three categories (from 22 original classes): • fraction of the subcatchment with alluvial geology ( GA ); 30 • fraction of the subcatchment with consolidated geology ( GC ); • fraction of the subcatchment with unconsolidated geology ( GU ).
The reclassification of the land use and of the geology maps consisted in aggregating specific classes into general classes (e.g. combining different types of forests into a unique forest class) with the objective of reducing their number, in order to facilitate subsequent analyses.
The consideration of topography, land use, soil, and geology for defining landscape indices was based on their potential influence on hydrological processes, and in turn, on the shape of the hydrograph. These landscape characteristics can all play 5 an important role in controlling hydrological processes: land use can, for example, influence the infiltration of water in the substrate; soil thickness can affect the partitioning between water storage and runoff; vegetation is typically assumed to affect evaporation, and geology can affect groundwater dynamics. Indeed, these characteristics are used by many semidistributed hydrological models, for example for determining parameter values or for dividing the catchment in areas with homogenous hydrological response (e.g., Gurtz et al., 1999). 10

Selection of meaningful streamflow signatures, climatic indices, and catchment indices
The sets of statistics presented in Sect. 3.1.1 were designed to be comprehensive. However, they may also be redundant, for example by containing metrics that express similar characteristics of the underlying data. In order to facilitate subsequent correlation analyses between the various sets of statistics, it is important to reduce each set to a short list of meaningful variables. The reduction of each set of streamflow signatures, climatic indices, and landscape indices was achieved through 15 the following steps: • All the statistics that did not show sufficient variability between the subcatchments were eliminated. We were in fact interested in identifying causes of spatial variability in the streafmow dynamics of the subcatchments of the Thur. Therefore, statistics that had a low variability were not of interest in this analysis. The variability was measured using the coefficient of variation (defined by the ratio between the standard deviation and the average) 20 and statistics with a coefficient of variation less than 5 % were discarded.
• All the catchment indices (e.g. a certain type of land use) that account for a limited part of the subcatchment were discarded. The latter point was motivated by the expectation that landscape characteristics covering a very small fraction of the subcatchment should not have a strong influence on the streamflow signatures considered. Here, landscape indices accounting for less than 5 % of the subcatchment area were discarded. 25 • Within each set of streamflow signatures, climatic indices, and catchment indices we retained only relatively independent metrics. This step was motivated by the need of removing redundant information within each set. The selection of independent metrics was aided by the Spearman's rank score between each pair of metrics, which represents (also non-linear) correlation between variables. Pairs of metrics with high Spearman's rank score are potentially redundant. In eliminating potentially redundant variables, we adopted the following criteria: 30 o Among highly correlated metrics, we preferred those depending on single variables (e.g. only precipitation or only streamflow) to those containing multiple variables (e.g. combining precipitation and streamflow or evaporation, such as the aridity index or the runoff ratio), as this may be a problem when looking for correlations between metrics; o With respect to landscape indices, in many cases the high correlation is due to the fact that they are complementary (the areal fractions sum up to unity). In such cases, we kept one index per class (e.g. a single index for geology). 5 o A high correlation between metrics does not always mean that the metrics represent the same information.
Therefore, the final selection of relevant metrics within each set was guided by expert judgment.
Based on this process, we compiled a reduced list of signatures, climatic indices, and landscape indices, which was used in subsequent analyses.

Identification of climate and landscape controls on streamflow and consequences for model development 10
This analysis aimed to identify climatic and landscape indices that mostly control streamflow signatures. In order to identify causality links between indices ( and ) and signatures ( ) we proceed as follows: • We calculated the correlation between indices and signatures using the Spearman's rank score, and identified pairs of variables with high correlation; • We scrutinized pairs of variables with high correlations using expert judgment to decide if a causality link between 15 variables is justified; • We used the identified causality links to inform the structure of a distributed model.
The distributed model development involved a series of choices regarding the subdivision of the catchment in HRUs, the model structure, and the parameters: all these choices were, in this study, motivated by the results of the correlation analysis, i.e. only catchment characteristics that were found capable of explaining the hydrological response were used. 20

Results and interpretation
This section illustrates the results of the correlation analysis aimed to identify influencing factors that control the spatial variability of streamflow signatures; Section 3.2.1 presents the results of the selection of meaningful statistics; Section 3.2.2 identifies climate and landscape indices controlling streamflow signatures and presents consequences for model development. 25

Selection of meaningful streamflow signatures, climatic indices, and catchment indices
The streamflow signatures defined in Sect. 3.1.1 were calculated for each subcatchment and the values are shown in Table 2 together with the coefficient of variation. All the signatures have a coefficient of variability bigger than the threshold value of 5%, with the most variable signature being LQF (71%) and the least variable HQD (6%). Therefore, none of these signatures was discarded. 30 Figure 2 shows the correlations between the streamflow signatures: the lower triangle contains the Spearman's rank correlation and the upper triangle the p-value associated with the correlations. Based on correlations and on its interpretation, a subset of can be defined as follows: • Q , RR and EL are strongly correlated ( > 0.72). We retained Q and discarded RR and EL because both contain climatic information (precipitation) in their definition; 5 • BFI and FDC are strongly correlated ( = −0.77). We decided to retain BFI as it is of easier interpretation (it is a proxy for the importance of groundwater flow, which is a potentially important process for the subsequent model development); • HFD was kept because it measures the seasonality of the streamflow. Note that HFD is strongly correlated with Q ( = 0.88). However, they reflect different properties of the hydrograph. In particular, HFD can be an useful 10 indicator for the effect of snow-related processes; • Q5 and HQD were retained because they have low correlation ( < 0.71) with the other selected signatures and because the first represents low flows and the second high flows; • Q95 , HQF , LQD , and LQF were discarded because they all show correlations with the selected signatures. In summary, the original set of streamflow signatures was reduced to a set of five meaningful signatures, which will be in the 15 subsequent analyses: average daily streamflow ( Q ), baseflow index ( BFI ), half streamflow period ( HFD ), 5 th percentiles of the streamflow ( Q5 ), and duration of high-flow events ( HQD ).
In terms of climatic indices, Table 3 shows their values together with the coefficient of variation. It can be seen that there are some indices that show very little or no variation at all and, therefore, they could be already excluded from the subsequent correlation analysis; they are: HPD (1 %), HPS (0 %), LPF (4 %), LPD (3 %), and LPS (0 %). 20 Fig. 3 shows the correlation between the remaining indices. It can be observed they all have strong internal correlation ( > 0.71). For this reason it was decided to retain only P and FS , as they have lower correlation. The first represents an important term of the water budget, the latter captures snow dynamics. Table 4 shows the values of the catchment characteristics considered in this study. All of them have a coefficient of variation larger than the minimum threshold of 5%. Therefore, none of them was excluded based on this criterion. The second 25 criterion for the pre-exclusion of the catchments characteristics, consisting in removing occupying less than 5% of the subcatchments, led to the suppression of LC (which occupies 4% of the subcatchment). Figure 4 shows the correlations between catchment characteristics; in many cases the high correlation is due to the fact that many indices are complementary (e.g. different types of geology). The following were selected (one index per class): • A because it is low correlated to the other features; 30 • TE and TAs in representation of the topography; • LF for the land use; • SD representing the soil characteristics; • GC for the geology. In summary, the original set of catchment indices was reduced to a set of 5 indices. The following results can be noted:

Selection of controlling factors on streamflow signatures
• The three statistics average precipitation ( P ), fraction of snow ( ), and average elevation ( TE ) correlate strongly with average streamflow ( Q ) and seasonality ( HFD ) ( > 0.64 and p-value< 0.05). This correlation can be interpreted as follows: subcatchments with high elevation ( TE ) tend to have higher precipitation ( P ) due to 10 orographic effects, which leads to higher streamflow ( Q ). They also tend to have more snow ( FS ) due to lower temperatures, which influences the seasonality ( HFD ).
• There are then some catchment characteristics that have no correlation ( < 0.45) with the streamflow signatures (catchment area ( A ) and land use ( LF )) or limited correlation (aspect ( TAs ) and deep soil ( SD ), with < 0.64).
• The consolidated geology ( GC ) presents a strong correlation ( = −0.87) only with the baseflow index ( BFI ) that it 15 is not captured by the other indices.
• The streamflow signatures of low and high flows ( Q5 and HQD ) cannot be explained by any index, with little correlation only with P and TE ( < 0.60) that is not sufficient to reach a p-value lower than 0.05.
These results are the premise for designing meaningful model experiments.

Hypotheses for model building 20
Our hypothesis is that only a model that accounts for the influencing factors that affect the streamflow signatures will be able to reproduce spatial streamflow variability. In this section, we synthetize the outcomes of previous analyses in the form of testable hypotheses for model building.
1. The precipitation is the first driver of the differences in the water balance of the subcatchments. The effect of topographic variability manifests itself primarily as an influence on precipitation (amount and type). Accounting for 25 variability of precipitation therefore implicitly reflects such effect of topography on the hydrograph, since the inputs were interpolated taking into account the effect of the elevation (Sect. 2).
2. Snow related processes (e.g. amount of snow, timing of snowmelt) control differences in streamflow seasonality between subcatchments.
3. Geology exerts an important control on the partitioning between quick flow and baseflow. 30 4. The other catchment characteristics (e.g. soil, vegetation) show little or no correlations with the streamflow signatures and therefore they should not be considered if the idea is to keep the model as simple as possible.
These hypotheses will be tested through specific model comparisons, described in Sect. 4.1.5. 2. HRUs are defined based on catchment characteristics (e.g. topography, geology or vegetation); they represented parts of the catchment that are supposed to have a similar hydrological response to the meteorological forcing. Each HRU is characterized by its own parameterization. Different definitions of HRUs were tested, as described in 20 Section 4.1.5.
Each HRU has a unique parameterization. However, given the choice of discretizing the inputs per subcatchment, a HRU that spans multiple subcatchments will generally have different states in each subcatchment. Therefore, the same HRU needs its own model representation in each subcatchment where it is present. For more details about our model implementation of "HRUs" refer to Fig. 4 of Fenicia et al. (2016). 25 The model was built using the modelling framework SUPERFLEX . In contrast to Fenicia et al. (2016), for simplicity we chose a unique structure to represent the various HRUs (as said above, this structure will generally have different parameters in order to represent the hydrological behaviour of distinct HRUs). The structure used to represent the HRUs is represented in Fig. 6 with the equations listed in the Appendix A. The structure includes a snow reservoir (WR), with inputs distributed per subcatchments. Snowmelt and rainfall are input to an unsaturated reservoir (UR), which 30 determines the portion of precipitation that produces runoff. This flux is split through a fast reservoir (FR), designed to represent the peaks of the hydrograph, proceeded by a lag function to offset the hydrograph, and a slow reservoir (SR), designed to represent baseflow. This structure was chosen to be parsimonious while general enough to reproduce typical hydrograph behaviour; it was tested in previous applications ( e.g., van Esse et al., 2013;Fenicia et al., 2014;Fenicia et al., 2016) demonstrating its suitability to reproduce a wide range of catchment responses. It also resembles popular conceptual hydrological models such as HBV (Lindstrom et al., 1997) and HyMod (Boyle, 2003), which are shown to have wide 5 applicability.

Error model
where [ ; ] represents the Box-Cox transformation (Box and Cox, 1964) with parameter , which is used to account for heteroscedasticity (stabilize the variance). For ≠ 0: 15 The residual error term is assumed to follow a Gaussian distribution with zero mean and variance 2 t~( 0; 2 ) The error model has, therefore, two parameters ( and 2 ); the first was fixed to 0 where T represents the length of the time series, is the Gaussian probability density function (PDF) and ′( | ) is the derivative of ( , ) with respect to evaluated at the observed data . Specifying Eq. (7) for the case where 25 ( ; ) is defined by Eq. (5), the expression of the likelihood function becomes: Equation (8) represents the likelihood function that is then used, together with an uniform prior distribution, to calibrate the parameters of the model as described in Sect. 4.1.3.

Calibration
Parameter calibration was performed with the objective of maximizing their posterior density. According to Bayes equation, the posterior distribution of model parameters is expressed as the product between the prior distribution and the likelihood function; since an uniform prior was used for the parameters, this is equivalent to maximizing the likelihood function in the defined parameter space; the optimization procedure was performed with a multi-start quasi-Newton method (Kavetski et 5 al., 2007) with 20 independent searchers. We empirically established that with models of our complexity (about 10 parameters), 20 independent searches provide good confidence that a global optimum is found.
The evaluation of the model ability to reproduce streamflow was carried out in space-time validation (see also Fenicia et al., 2016).

Performance assessment 20
Model performance was assessed using the following metrics: 1. Time series metrics, which evaluate the ability of reproducing streamflow time series. The metrics used for this assessment are the following: • Normalized log-likelihood (LL), that is, the logarithm of Eq. (8) normalized by the number of time steps present in the time series. This metrics corresponds to the objective function used for model optimization. It 25 can be observed that, since λ is fixed at 0.5 in the Box-Cox transformation, model calibration is equivalent to maximising the Nash-Sutcliffe efficiency (NS) calculated with the square root of the streamflow. LL is not bounded but a higher value means a better match between two time series since, in this case, the absolute value of the residual is smaller and, thus, their PDF higher.
• Nash-Sutcliffe efficiency 30 Which is often used in hydrological applications, and it provides a sense of general quality of the simulations.
NS is bounded between −∞ and 1, with 1 meaning a perfect match.
2. Signature metrics, which determine the ability of reproducing the selected streamflow signatures ( ) which, as illustrated in Section 3.2.1, are average daily streamflow ( Q ), baseflow index ( BFI ) mean half streamflow date ( HFD ), 5 th percentile of the streamflow ( Q5 ), and duration of high-flow events ( HQD ). The accordance between 5 simulated and observed signatures was assessed both visually and using the Spearman's rank correlation.
The use of multiple metrics for assessing model performance enables a comprehensive assessment of various characteristics of the simulations. Time series metrics were designed to appraise the general quality of the model fit. Signatures, instead, were designed to highlight selected characteristics of the data at the expense of others.

Model experiments for testing the results of the correlation analysis 10
Using the model structure described in Sect. 4.1.1, four model configurations were compared by varying the number and the definition of the HRUs, and changing the structure of the HRUs (Fig 6). The objective of the experiments was to test the hypotheses 1-4 in Sect. 3.3.
The first hypothesis (precipitation controls the water balance) is tested with the model M0, with uniform parameters on the catchment (i.e. a single HRU) and distributed precipitation input. This model does not consider snow processes. We expect 15 that this model will be able reproduce differences in streamflow averages between subcatchments.
The second hypothesis (snow controls seasonality) is tested with the model M1. Relatively to M0, M1 accounts for snow processes, represented by simple degree day snow module (see , with inputs (temperature) distributed per subcatchment.
The third hypothesis (geology controls baseflow) is tested with the model M2. Relatively to M1, M2 considers two HRUs, 20 defined based on geology type. One HRU contains the areas with consolidated geology while the other HRU contains the rest of the catchment (unconsolidated and alluvial geology together).
The fourth hypothesis (other catchment characteristics should not be considered if the idea is to keep the model as simple as possible), is exemplified by the model M3. M3 is analogous to M2 except that HRUs are based on catchment characteristics that did not show correlation with the streamflow signatures. Among those characteristics, we have selected land use, and 25 considered an HRU based forest and crops and the second one that occupies the rest of the catchment.
The total number of the calibrated parameters depends on the number of HRUs and on the structure used to represent them: it was 8 for M0, 9 in M1, and 13 in M2 and M3, where 5 parameters were linked between different HRUs (Table A1); those parameters are: e that governs the evapotranspiration, rise OL and rise IL that control the routing in the river network, WR that regulates the outflow of the snow reservoir, and max UR that determines the behaviour of the unsaturated reservoir. 30

Results and interpretation
This section presents the results of the modelling experiments. Section 4.2.1 illustrates model results in terms of hydrograph metrics. Section 4.2.2 presents model results in terms of signatures. An interpretation of the results, including a comparison with the conclusions of the correlation analysis, is given in Sect. 4.2.3. Figure 7a shows the values of the likelihood function (corresponding to the calibration objective function) for the four models in calibration and validation. It can be observed that M0 is, by far, the worst model, having a low value of likelihood.

Model performance in terms of hydrograph metrics 5
Moving to the other three models, it can be seen that, during calibration, M1, which has the lowest number of calibration parameters, has the lowest performance, whereas M2 and M3 have similar higher likelihood values. This behaviour persists in time validation, with M2 and M3 that outperform M1. In space and space-time validation, however, M3 has the lowest 10 likelihood value of the three, whereas M1 and M2 limit their decrease in performance, ranking, respectively, second and the first in terms of optimal likelihood value. It can also be observed that M2 is generally better than M1, with NS values that are higher or approximately equal except for the subcatchments Andelfingen and Halden, where the NS is slightly worse (however still higher than 0.80). M3 is clearly better than M1 on Andelfingen, Frauenfeld and Wängi, and clearly worse on Herisau and Mosnang. In particular, in Mosnang (the smallest basin), M3 reaches the worst performance of all models on all subcatchments.
Regarding M0, it is interesting to observe that it has the worst performance (among all the subcatchments) in Appenzell, 25 which is the subcatchment that is most affected by snow ( FS = 0.21), while it reaches a performance similar to M1 in Frauenfeld and Wängi, which are two subcatchments with almost no snow. Figure 8 compares the ability of M0 and M1 to capture the signatures representing average streamflow ( Q ) and seasonality ( HFD ). The analysis is presented for space-time validation and, for HFD , it focuses only on the four subcatchments that are 30 most affected by the snow ( FS > 0.10) to emphasize the differences between the results of the two models. Each colour represents a different subcatchment and each dot a year; the red dashed line has a 45 ° slope and represents where the dots should align in case of perfect simulation results. The Spearman's rank score is also reported and gives information about the degree of dependency between the two variables. It is important to stress that the models have not been calibrated using any of the signatures as objective function, which therefore represent an independent evaluation metric.

Model performance in terms of signature metrics
It can be observed that M0 represents Q as well as M1, with almost no difference between the two models. Focusing on the 5 ability of capturing HFD , it can be seen that with M0 the points all lie in the upper-left part of the plot, meaning that this model underestimates the signature values. With respect to M1, instead, the points are more aligned around the diagonal.
This difference in performance is also exemplified by the value of that is 0.66 for M0 and 0.85 for M1. The performance of all the models decreases for Q5 where the models have a similar performance with equal to 0.62, 0.66, and 0.61 for M1, M2, and M3 respectively. The points cloud is still aligned to the diagonal but it is quite dispersed, 15 especially if compared with Q and HFD , meaning that the models capture the general tendency but have deficiencies capturing the inter-annual variability.
In terms of BFI , M2 performs clearly better than the other models. It is the only model that is able to represent this signature, with equal to 0.83 and the points that align to the diagonal. The other two models have a lower performance ( equal to 0.31 and 0.52 for M1 and M3 respectively) with the points cloud that is quite dispersed and the dots align almost 20 vertically, implying that the simulated values have a range of variability that is definitely smaller than the observed data. Figure 10 shows the comparison between observed and simulated HQD ; since this signature requires a long time window to be computed, it is not calculated year by year (as done with the other signatures) but it is available only the aggregated value over the 24 years. The performance of M1 and M2 is overall good, with that is 0.77 and 0.69, while M3 shows some deficiencies ( equal to 0.48); all the models tend to slightly overestimate the duration of high flow events with most of the 25 points that lie on the right side of the diagonal.

Interpretation of hydrological model results
The results of the hydrological model experiments appear to support our hypothesis that only models that account for the influencing factors that affect the streamflow signatures are able to reproduce streamflow spatial variability (see Sect. 3.3).
This provides confidence that those models are a realistic representation of dominant processes in the catchment. 30 In particular, the results of M0 show that accounting for the spatial heterogeneity of the precipitation alone is sufficient to achieve a good accuracy signatures of water balance, with of 0.95 for average streamflow . More complex models with more HRUs and more parameters do not result in any improvement in reproducing the average streamflow signature.
The differences between M1 and M0 show that differences in streamflow seasonality HFD can be largely attributed to the (spatially variable) effect of snow accumulation and melting. More complex models (M2 and M3) do not demonstrate an 5 improvement in this signature.
M2 determines a large improvement in matching signatures of baseflow variability. The ability of fitting BFI goes from 0.31 for M1 to 0.83 for M2. This result confirms that geology influences spatial variability of quickflow vs baseflow partitioning, as indicated by correlation analysis.
M3 reassures that the relatively good results of M2 are not just due to increasing complexity. Although this model performs 10 slightly better than the M1 in terms of matching signatures such as BFI , M2 is still much better (e.g. the Spearman's rank score for BFI is 0.83 for M2 and 0.52 for M3).
All the models do not preform particularly well in reproducing Q5 and HQF . These problems shows that such models may not represent well extreme values (high and low flow), and therefore they are still amenable to improvements.
Overall, distributing the inputs is sufficient to get good performance metrics, water balance, and seasonality, confirming the 15 fact that the precipitation rate and the partitioning between rainfall and snow are the first controllers on these hydrograph characteristics, but, if we want to capture also other important characteristics of the hydrograph, described by signatures like , the discretization of the catchment in HRUs is necessary. This discretization has to be carefully made and a preliminary analysis to understand dominant influencing factors on signatures can help in this decision. As shown in Fig. 9, if we use characteristics that are not strongly correlated with the signatures (e.g. land use) the results are worse than if we choose 20 characteristics that show a correlation with signatures (e.g. geology). M2 is capable of capturing the signatures not just because it is more complex than M1, but because it incorporates the causality link between the geology and the streamflow signatures in its structure.

General discussion
Explaining the spatial variability observed in catchment hydrological behaviour by identifying the most important controls 25 on water fluxes and pathways is a major focus of catchment hydrology and a central theme in classification studies (e.g., McDonnell and Woods, 2004;Wagener et al., 2007). A common approach for interpreting the spatial variability of catchment responses is through correlation based analyses, which seek correlations between climatic or catchment characteristics and streamflow signatures (e.g., Lacey and Grayson, 1998;Bloomfield et al., 2009). One of the issues with this approach is that correlation does not always imply causality, and the presence of multiple correlated variables can 30 obscure process interpretation.
In this study, we combine correlation analysis for identifying dominant influencing factors on streamflow signatures with hydrological modelling, by using the interpretation of the correlation analysis as an inspiration for generating testable model hypotheses. The combination of correlation analysis on streamflow signatures and hydrological modelling is beneficial because on the one hand, the speculations on dominant processes resulting from the correlation analyses can be verified in the modelling process. Specifically, we developed model experiments to test the influence of precipitation spatial 5 distribution on streamflow average and seasonality, and the influence of geology on quickflow vs baseflow partitioning. On the other hand, model building benefits from guidance resulting from preliminary correlation analysis. The construction of a distributed model requires several decisions (e.g., Fenicia et al., 2016), including how to "break-up" the catchment in a meaningful way, and preliminary correlation analysis can motivate some of these decisions. For example, the HRUs defined based on geology, as suggested by correlation analysis resulted in better model performance than HRUs based on land use, 10 particularly in the representation of streamflow signatures.
Although several modelling decisions were guided by data analysis, it should be noted that alternative decisions would have been similarly consistent with the data. For example, both precipitation and elevation are correlated with average streamflow, and both geology, topography and soil type characteristics are correlated between each other and with baseflow index (Section 3.2.2 and Figure 5). The correlation of catchment characteristics (e.g. geology, soil and topography) can be 15 attributed to the fact that they evolved together in the shaping of the catchment morphology (e.g. mountainous regions have impervious topography with shallower soil and, for these reason, are less suitable for human activities, influencing land use).
The decisions on which variables are chosen to reflect a causality link is not always obvious from correlation analysis alone, and it requires expert judgment, which is not always generally shared.
Our results on the Thur catchment with respect to the effect of meteorological inputs on average streamflow and of the 20 geology on baseflow index are in general agreement with previous work. Kuentz et al. (2017) made a classification study over more than 40000 catchments across all Europe (of which almost 2700 are gauged) and found that the rainfall is the first controller of the average streamflow, geology controls the BFI, topography the flashiness index, and, for most of the cases, land use is the second controller of them; Bloomfield et al. (2009) used a linear regression model and linked the lithology of the Thames Basin (UK) with the BFI; Lacey and Grayson (1998) noted that geology controls the BFI in two ways, storing 25 the water and impacting the soil formations; Fenicia et al. (2016) compared different model structures and catchment discretization methods in the Attert Basin (Luxemburg) and discovered that the best model was the one that incorporates a spatial representation of the meteorological inputs and of the geology.
On the other hand, this general tendency should not be generalized to all places. For example, Mazvimavi et al. (2005) found that geology was not important for the BFI, as in their case study the aquifer was deep and disconnected from the river. 30 Bouaziz et al. (2018) found a strong influence of regional groundwater flow in the Meuse catchment which altered the water balance.
One of the main limitations of this work is the restricted number of catchments involved and the limited spatial extension of the study. For this reason, it is difficult to generalize the results to other climatic regions. The subcatchments belong all to the same region and the landscape and climatic characteristics, while varying substantially within the basin, can still be quite different from characteristics found elsewhere.
The limited number of catchments involved in this study (only 10) can also pose some problems in the correlation analysis, where only linear or monotonic correlations have been investigated while other forms of relationship, including the mutual interaction between various influencing factors, have been neglected. This can lead to the exclusion of characteristics that are 5 indirectly related to the streamflow signatures.

Conclusion
In this study, we presented a methodology for the construction of a semi-distributed hydrological model where model hypotheses, instead of being made a-priori, are informed by preliminary correlation analysis on streamflow signatures.
Besides providing guidance to model development, the proposed approach is useful in the fact that modelling can be used to 10 test specific hypotheses on dominant processes resulting from correlation analysis.
Our analysis was applied to the Thur catchment, with the objective of understanding the main controls on streamflow spatial variability. The main findings are summarized in the following points: 1. we found large spatial variability between the subcatchments of the Thur in terms of various streamflow signatures reflecting multiple temporal scales: yearly, seasonal and event scale. In terms of climatic characteristics, indices 15 reflecting fraction of snow, precipitation totals, and aridity varied considerably between catchments. Other precipitation characteristics such as season, frequency and duration of dry and wet days did not vary significantly between catchments. In terms of landscape characteristics, there is large variability of topography (e.g. from upstream mountainous to downstream flat areas), geology (with unconsolidated, more permeable, and consolidated, relatively impermeable formations), and soils (with low depths in the mountains, and large depth in the floodplains) 20 in all catchments; 2. based on correlation analysis and expert judgment, we determined that climatic variables, especially precipitation average, are the main controls on streamflow average yearly values; the fraction of snow is responsible for streamflow seasonality by delaying the release of winter precipitation to the spring season, and geology controls the baseflow index, with a higher fraction of unconsolidated material determining higher baseflow; in space time validation, we confirmed that model decisions based on correlation analysis were appropriate. In particular, we confirmed that M0, in spite of a generally poor performance, is sufficient to capture signatures of streamflow average. M1 improves signatures of streamflow seasonality. M2 enables reproducing signatures such as the baseflow index. Model modifications that are not in line with the results of the signature analysis, such as subdividing the catchment using vegetation based HRUs (M3), despite increasing model complexity, not only do not lead to an improvement, but cause deterioration in space-time validation. Overall, these results suggest that causality relationships explaining the influence of climate and landscape characteristics on streamflow signatures 5 can be constructively used for distributed model building.
The relatively good performance obtained in space-time validation suggests that the proposed approach could be used for the prediction of the streamflow in other ungauged locations within the Thur catchment. The method proposed uses data that is commonly available in many gauged catchments (e.g. meteorological data, streamflow measurements, and maps of 10 elevation, geology, land use, and soil); therefore, it is easily transferable to other locations.