Calibration event selection for green urban drainage modelling

Calibration of urban drainage models is typically performed based on a limited number of observed rainfall-runoff events, which may be selected from a longer time series of measurements in different ways. In this study, 14 single- and two-stage strategies for selecting these events were tested for calibration of a SWMM model of a predominantly green urban area. The event selection was considered in relation to other sources of uncertainty such as measurement uncertainties, objective functions, and catchment discretization. Even though all 14 strategies resulted in successful model calibration, the difference between the best and worst strategies reached 0.2 in Nash-Sutcliffe Efficiency (NSE) and the calibrated parameter values varied notably. Most, but not all, calibration strategies were robust to changes in objective function, perturbations in calibration data and the use of a low spatial resolution model in the calibration phase. The various calibration strategies satisfactorily predicted 7 to 13 out of 19 validation events. The two-stage strategies performed better than the single-stage strategies when measuring performance using the Root Mean Square Error, flow volume error or peak flow error (but not using NSE); when flow data in the calibration period had been perturbed by ±40%; and when using a lower model resolution. The two calibration strategies that performed best in the validation period were two-stage strategies. The findings in this paper show that different strategies for selecting calibration events may in some cases lead to different results for the validation period, and that calibrating impermeable and green area parameters in two separate steps may improve model performance in the validation period, while also reducing the computational demand in the calibration phase.


Introduction
Calibration of generic urban drainage model codes is usually required to obtain a model representing an actual site with sufficient accuracy. In the calibration process, the information contained in records of relevant variables, such as rainfall and flow rates at the catchment outlet, is used for estimating model parameter values that produce results consistent with the data (Mancipe-Munoz et al., 2014). It can be expected that the best parameter estimates will be obtained when they are inferred from the largest amount of information, i.e. by using all data from a long series of measurements. However, the availability of calibration data may be limited, and the nature of the calibration process, by trial and error, requires model iterations for many different parameter sets, which means that the runtime of the model has to be kept short and the length of the simulated periods limited.

Previous assessments have shown the employed precipitation sensor to be a reliable instrument (Duchon, 2002; Lanza et al., 2010). Records were available for individual rain events in 2013-2015 and continuously for 2016 and 2017.
The flow rates in the storm sewer draining the catchment were measured, at 1-minute intervals, by means of an ISCO 2150 AV sensor (a combination of an acoustic Doppler velocimeter and a pressure transducer) installed in the catchment outlet formed by a 400 mm diameter concrete sewer pipe. This type of sensor was assessed in the laboratory by Aguilar et al. (2016) and found to have a combined uncertainty (consisting of bias, precision and benchmark uncertainty) of ±19.0 mm for the water depth measurements (the test range was 10-150 mm) and ±0.0985 m/s for the velocity measurements (test range 0.1-0.6 m/s).
These tests were carried out in a 0.46 m wide square channel, so the stage-discharge relationship was different from the study site described herein. It was also reported that the field performance of this type of sensor can suffer from the presence of too few (Teledyne ISCO, 2010) or too many particles suspended in the water (Nord et al., 2014).
While the difficulties in estimating all the uncertainties at the actual field site prevented a precise determination of the uncertainties' magnitude, the general lab tests of the sensors used confirmed the acceptability of their records for the study purpose. Finally, it was also confirmed by Dotto et al. (2014) that errors in the calibration data can be compensated for in the calibration process.
The available precipitation record was divided into rainfall events with at least six hours without precipitation between them.
Events deemed suitable for use in calibration were selected using the following criteria:

1. A minimum total precipitation of 2 mm (Hernebring, 2006).
2. No or small gaps in rain and flow data, i.e. both have to be available for >90% of the event duration.
3. Sufficient in-pipe water depths for the flow sensor to work reliably: >10 mm during at least 50% of the event and >25 mm at least once in the event, based on recommendations from the manufacturer (Teledyne ISCO, 2010).

The catchment was discretized into single land-use subcatchments based on the available GIS data and site visits. The advantage of these single land-use subcatchments is that their parameter values maintain their physical meaning and can be calibrated (or appropriate values found in the literature) for each land use or cover. The traditional approach of using larger subcatchments with multiple land uses/covers usually necessitates calibration to estimate the values of parameters that then represent a weighted average over multiple land uses/covers. Some spatial characteristics, such as the slope and the width of subcatchments, can also be estimated more easily for smaller, uniform subcatchments. This approach has been used successfully by e.g. Krebs et al. (2014, 2016), Petrucci and Bonhomme (2014) and Sun et al. (2014). Within SWMM, the Green-Ampt infiltration method was selected since it can be calibrated with just two parameters (Rossman, 2016).
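As an illustration, the three event-suitability criteria above could be screened programmatically as follows. This is a minimal sketch: the event record layout (`total_precip_mm`, `data_coverage`, `depths_mm`) is an assumption made for illustration, while the thresholds follow the text.

```python
def is_suitable(event):
    """Screen a rainfall event against the three calibration-suitability criteria.

    `event` is assumed to be a dict with:
      total_precip_mm - event precipitation sum [mm]
      data_coverage   - fraction of the event with both rain and flow data
      depths_mm       - in-pipe water depths [mm], one value per time step
    """
    # Criterion 1: at least 2 mm of total precipitation
    if event["total_precip_mm"] < 2.0:
        return False
    # Criterion 2: rain and flow data available for >90% of the event duration
    if event["data_coverage"] <= 0.90:
        return False
    # Criterion 3: depth >10 mm during at least 50% of the event,
    # and >25 mm at least once, per the sensor manufacturer's guidance
    depths = event["depths_mm"]
    frac_above_10 = sum(d > 10 for d in depths) / len(depths)
    return frac_above_10 >= 0.5 and max(depths) > 25
```

An event passes only if all three criteria hold simultaneously; failing any one of them excludes it from the calibration pool.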
Whenever feasible, parameters for the different subcatchments were set directly from the available GIS data and site visits, i.e. the sizes and slopes of all subcatchments and sewer pipes, as well as the catchment widths of small and disconnected roofs.
For other subcatchments the catchment width was calibrated together with the other model parameters. To reduce the scope of the calibration problem, parameters were grouped based on land cover, yielding a total of thirteen calibration parameters for the hydrodynamic model. Parameter values were limited based on values reported in the literature (see Table 1). The precipitation gauge was situated a few hundred metres outside of the actual catchment and may have provided a biased estimate of the catchment rainfall. Therefore, a rainfall multiplier for each individual rainfall event was included in the calibration. This approach has been used with satisfactory results e.g. by Datta and Bolisetti (2016), Fuentes-Andino et al. (2017) and Vrugt et al. (2008), although it is limited by assuming a simple multiplicative difference between the gauge and catchment-average rainfall, which is not necessarily the case (Del Giudice et al., 2016). The rainfall multipliers create a way of adjusting the rainfall volume in the calibration so that the simulated runoff volume can better match the observed runoff volume. It is, however, not possible to distinguish between deviations between rainfall at the gauge and the catchment-averaged rainfall, errors in the rainfall measurement, and errors in the runoff measurement. A more traditional approach would be to calibrate the percentage of impervious areas, but in view of the availability of high-resolution land-cover information, it was preferred to apply rainfall multipliers instead.

Table 1 (excerpt; only partially recovered): ksat, grass areas (GR) e, range 1-200 (Rawls et al., 1983); initial moisture deficit [-], imd, grass areas (GR) e, range 0.10-0.35.

Notes to Table 1:
a In SWMM, the subcatchment width is an input, but in this group of subcatchments the length (in the flow direction) showed more similarity among the subcatchments, so it was calibrated instead of the width.
b Includes vegetation and trees as well.
c The maximum value was intentionally set high since the swales' outlets are not always located exactly at the lowest points and the swales can be observed with larger ponds after heavy rain events.
d Field experiments on similar swales in the same city.
e Used for both grass areas and swales.
Green surfaces like those in the study area have a long hydrological memory for antecedent rainfall, and this had to be accounted for in the simulations. Neglecting this memory would increase the risk of green areas allowing unrealistically high infiltration in some rainfall events. Since SWMM does not allow for setting the initial values of state variables directly, such adjustments can be done by choosing an appropriate warm-up period for modelling runs. When sufficiently long warm-up periods are used, this approach offers the advantage of treating the first rainfall/runoff peak of an event the same way as any following peaks, i.e., with initial conditions corresponding to a continuous simulation. The required length of this warm-up period was estimated by finding the last time before each rainfall event when the study area was dry. This was calculated for all rainfall events using the actual precipitation data and for various values of the maximum depression storage and infiltration rate. The last antecedent time when the study area was dry was then used as the starting point of the warm-up period. This lookup procedure was applied to every event for each iteration in the calibration process, so that all events were treated the same way as in a continuous simulation. In the calibration process, the Shuffled Complex Evolution - University of Arizona algorithm (SCE-UA; Duan et al., 1994) was used to estimate the optimal values of the parameters. The algorithm was selected because it is commonly used in hydrological studies and allows for parallel computing. The Python library SPOTPY (Houska et al., 2015), which includes this algorithm, was used to carry out the entire calibration process.
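The dry-time lookup described above can be sketched as a single storage bucket that fills with rainfall (up to the maximum depression storage) and drains at a constant rate, with the study area considered dry when the storage is empty. This is an illustrative simplification: the per-step units, storage capacity and drainage rate below are assumed values, not the study's calibrated ones.

```python
def last_dry_time(rain_mm, event_start, max_storage_mm=5.0, drain_mm_per_step=0.1):
    """Return the index of the last time step before `event_start` at which the
    surface storage was empty, i.e. the study area was dry.

    rain_mm: precipitation per time step [mm], from the start of the record.
    max_storage_mm / drain_mm_per_step: assumed depression-storage capacity
    and drainage (infiltration) rate per time step.
    """
    storage = 0.0
    last_dry = 0
    for t in range(event_start):
        # fill with rain, capped at the maximum depression storage
        storage = min(storage + rain_mm[t], max_storage_mm)
        # drain by infiltration, never below zero
        storage = max(storage - drain_mm_per_step, 0.0)
        if storage == 0.0:
            last_dry = t
    return last_dry
```

In the calibration loop, the warm-up period for each event would then start at the returned time step, so that the first runoff peak is simulated with initial conditions equivalent to a continuous simulation.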

Event selection
This paper investigates single-and two-stage calibration scenarios (CS), with each CS using six rainfall events.The single-stage CSs used the six events with the highest values of a certain event characteristic, and calibrated all parameters simultaneously.
Two-stage calibration scenarios first calibrated the parameters related to impervious areas, using a set of three rainfall events, followed by the pervious area parameters, using another set of three rainfall events. Since only 12% of the total catchment surface is impervious and connected directly to storm sewers, it was assumed that the events for which runoff volume was less than 12% of rainfall volume produced runoff only from impervious areas. Therefore, these events were suitable for calibration of impervious area parameters in the first stage of the calibration process. Following this step, events with more than 12% runoff were assumed to also include runoff from green areas and were used to estimate pervious area parameters in the second stage of the calibration. When calibrating the green area parameters, the parameters related to impervious areas were kept fixed at their values from the first stage. This procedure splits the optimization problem into two smaller problems with fewer parameters and shorter run times. The smaller number of parameters (reduced dimensionality) can ease the search for optimal parameter sets, while the shorter run time per iteration allows shortening the total time needed, increasing the number of iterations used, or including more events in the calibration.
Characteristics related to the rainfall, flow depths and flow rates were calculated for each event. For the single-stage calibration scenarios, the six highest-ranking events for each characteristic were selected. For the two-stage calibration scenarios, the three highest-ranking events with less than 12% runoff were selected for the first stage and the three highest-ranking events with more than 12% runoff for the second stage. To avoid making the comparison too large in scope, a limited number of calibration scenarios (eight single-stage and six two-stage) was selected for use in this study. This selection was made so that it included a range of different characteristics and avoided multiple CSs with the exact same set of events. The names of the CSs consist of two or three elements:

- T6 (Top 6) for single-stage or T32S (Top 3 - 2 stages) for two-stage scenarios.
- The duration over which the characteristics were calculated: sum, mean and max refer to the whole event; 30 and 60 min refer to the time interval used to calculate an average rainfall intensity or flow rate (i.e. the highest value found within the event for a 30- or 60-minute moving average). Calculating rainfall intensities and average flow rates over these windows rather than the entire event suppresses the effects of e.g. dry periods within events on such calculations.

The calibration scenario N_T6 consists of the six events that were selected most often in other calibration scenarios, with the goal of obtaining a set of events that score highly on a variety of characteristics.
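The selection logic for the single- and two-stage scenarios can be sketched as follows. This assumes each event record already carries the precomputed ranking characteristic and its runoff/rainfall volume ratio; the data layout and function name are assumptions for illustration.

```python
def select_events(events, key, n_single=6, n_stage=3, runoff_split=0.12):
    """Select calibration events by ranking on one event characteristic.

    events: list of dicts, each with `key` (the ranking characteristic, e.g.
    total rainfall or peak flow) and 'runoff_ratio' (runoff volume divided by
    rainfall volume). Returns the single-stage selection and the two
    stage-wise selections.
    """
    ranked = sorted(events, key=lambda e: e[key], reverse=True)
    # Single-stage: the n_single highest-ranking events, all parameters at once
    single = ranked[:n_single]
    # Two-stage: <12% runoff -> impervious-area stage; >=12% -> green-area stage
    stage1 = [e for e in ranked if e["runoff_ratio"] < runoff_split][:n_stage]
    stage2 = [e for e in ranked if e["runoff_ratio"] >= runoff_split][:n_stage]
    return single, stage1, stage2
```

Running this once per characteristic reproduces the family of T6 and T32S scenarios described above.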
Therefore, this issue is addressed herein just by comparing the parameter values obtained in different calibration scenarios.
Calibration data measurement uncertainties. Measurement uncertainties of flow rates in storm sewer pipes have been described by a number of researchers, e.g., Aguilar et al. (2016), Blake and Packman (2008), Bonakdari and Zinatizadeh (2011), Heiner and Vermeyen (2012), Lepot et al. (2014) and Maheepala et al. (2001). In this paper, structural flow measurement errors are considered by testing calibration after reducing or increasing all flow observations by 40%. This value was chosen on the basis of uncertainties reported by Aguilar et al. (2016) applied to the current outflow measurement location and is slightly higher than the value of 30% used by Dotto et al. (2014) and Kleidorfer et al. (2009a). The flow data from the validation period was not adjusted. Other researchers (e.g. ibid.) also tested the effect of random errors; such effects and their thorough investigation were deemed outside the scope of this paper. However, it should be noted that the use of measured flow rates, implemented in this study, involves the presence of random errors in the calibration data sets used.
Objective functions.The calibration process strives to find the optimal value of the specified objective function, so the choice of such a function can be expected to affect the calibration results.This was addressed here by assessing all calibration scenarios using both Nash-Sutcliffe model efficiency (NSE) and Root Mean Square Error (RMSE) as objective functions (see Sect. 2.5).
Conceptualisation / model discretization. The model code (SWMM) employed in this study has been widely used for many years, with improvements made over time to those parts of its conceptualisation that were deemed unsatisfactory. Therefore, it is safe to assume that the SWMM conceptualisation (Rossman, 2016) is appropriate for urban drainage modelling and there was no need to consider this issue further. However, the discretization of the catchment into subcatchments in the model is done, somewhat subjectively, by the modellers for individual studies; therefore, two levels of discretization were compared: (i) the basic model set-up (the high-resolution model described in Sect. 2.2), and (ii) a simpler, more traditional set-up using five subcatchments. In the latter case, each subcatchment was created by aggregating multiple smaller subcatchments from the high-resolution model. The area and percentage imperviousness of each aggregated subcatchment were calculated from its constituent smaller catchments. The calibration parameters were modified accordingly, as shown in Table 2, with the total number of calibration parameters (including rainfall multipliers) being the same. For two subcatchments the percentage routed was estimated at 0% and 100%, respectively; a single percentage was calibrated and shared by the three remaining subcatchments.
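The aggregation of high-resolution subcatchments into coarser units can be sketched as follows. This is a minimal illustration under the assumption that each subcatchment is represented by an `(area, percent impervious)` pair; the actual model inputs contain more attributes.

```python
def aggregate(subcatchments):
    """Aggregate high-resolution subcatchments into one low-resolution
    subcatchment: total area, and area-weighted percent imperviousness.

    subcatchments: iterable of (area, pct_impervious) pairs.
    """
    total_area = sum(a for a, _ in subcatchments)
    # imperviousness of the aggregate is the area-weighted mean of its parts
    pct_imperv = sum(a * p for a, p in subcatchments) / total_area
    return total_area, pct_imperv
```

For example, merging a fully impervious 100 m² roof with a 300 m² pervious lawn yields a 400 m² subcatchment that is 25% impervious.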
Sources of uncertainty not considered. The calibration algorithm used in this study (SCE-UA) has been widely applied in hydrological applications with great success, so there was no need to subject it to scrutiny in this paper. Similarly, since SWMM is a well-established, mature model, there was no need to examine the equations, numerical methods and boundaries used in the model.

Objective functions
Each calibration scenario was run with two different objective functions, whose values were first calculated for individual events; the average of those values over the whole scenario served as the target for optimization. The objective function used for all but one calibration was the Nash-Sutcliffe model efficiency:

NSE = 1 - \frac{\sum_{i=1}^{n} (O_i - S_i)^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2}

where O denotes observed values, S simulated values and \bar{O} the mean of the observations. The NSE measures the variance of the model errors (the numerator) as a fraction of the variance of the observations (the denominator). This fraction is then scaled so that it extends from minus infinity (i.e., the worst possible fit) via 0 (the score that would be achieved by using the average of the observations) to 1 for a perfect fit. The NSE is dimensionless, so it allows comparing runoff events of different magnitudes. However, when the variance of the observations is small (e.g. for small runoff events), it can become quite sensitive to small changes in the simulated hydrograph. To examine the impact of different objective functions, one calibration used the Root Mean Square Error (RMSE):

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (O_i - S_i)^2}

RMSE has the same units as the observations (in this case L s⁻¹ for the flow rate). For further assessment of the modelled hydrographs, two metrics related to the peak flow and the hydrograph volume were used. The peak flow ratio (PFR) was defined as the ratio of the highest simulated to the highest observed flow rate, regardless of the times when they occurred:

PFR = \frac{\max(S)}{\max(O)}

where values >1 indicate overestimated simulated peak flows and values <1 indicate underestimated simulated peak flows.
Finally, the relative volume error (VE) considers total flow volumes throughout the event:

VE = \frac{\sum_{i=1}^{n} S_i - \sum_{i=1}^{n} O_i}{\sum_{i=1}^{n} O_i}

It is positive when the simulated total flow volume exceeds the observed one and vice versa. Note that the above formula is only valid if the observation interval is constant.
The quick response of the studied catchment means that low flow rates may cover a significant part of the event.Measurements in this range have relatively high uncertainties and may be considered less relevant than periods with higher flows.
Therefore, low flows should not be allowed to dominate the analysis; this was achieved by including only time steps with observed flow rates >1 L s⁻¹ when calculating these metrics.
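The four metrics can be sketched as follows, with the >1 L s⁻¹ threshold applied before evaluation as described above. The function and variable names are assumptions for illustration, and applying the mask uniformly to all four metrics is one possible reading of the text.

```python
import math

def event_metrics(obs, sim, threshold=1.0):
    """Compute NSE, RMSE, peak flow ratio and relative volume error for one
    event, using only time steps where the observed flow exceeds `threshold`
    (here 1 L/s), since low flows carry high measurement uncertainty.

    obs, sim: observed and simulated flow rates [L/s] on a constant interval.
    """
    o = [o_i for o_i in obs if o_i > threshold]
    s = [s_i for o_i, s_i in zip(obs, sim) if o_i > threshold]
    n = len(o)
    o_mean = sum(o) / n
    sq_err = sum((oi - si) ** 2 for oi, si in zip(o, s))
    nse = 1.0 - sq_err / sum((oi - o_mean) ** 2 for oi in o)
    rmse = math.sqrt(sq_err / n)
    pfr = max(s) / max(o)            # >1: peak flow overestimated
    ve = (sum(s) - sum(o)) / sum(o)  # >0: total volume overestimated
    return nse, rmse, pfr, ve
```

Note that VE computed from flow-rate sums is only a valid volume ratio because the observation interval is constant, matching the caveat above.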

Results and discussion
3.1 Calibration performance

Baseline calibration
The baseline calibration (i.e. with NSE as the objective function, using the high-resolution model without flow data perturbations) was successful for all calibration scenarios, with the average NSE for all events ranging from 0.68 to 0.85 (see Table 3). The lowest NSE corresponded to the two CSs based on the percentage runoff (T6_QV_ppP and T32S_QV_ppP). This result can be attributed to one event (see Figure 2), for which both CSs resulted in simulated hydrographs with low NSE, in spite of a visually good fit to the observed data. In this case, the low NSE resulted from a small timing error and from low flow rates in the event, which led to a low variance of the observations and, therefore, an NSE that is more sensitive to small simulation errors. For the two-stage calibration scenarios, the individual stages also produced successful calibrations (stage 1 NSE 0.70-0.87, stage 2 NSE 0.78-0.87), except for the second stage in T32S_QV_ppP for the reasons explained above. The NSE for the individual calibration events in the different calibration scenarios is similar to that reported by Krebs et al. (2013).
Across the different calibration scenarios and events, the most common source of error was flow underestimation, with respect to both the total flow volume (see Figure 3, left panel) and the peak flow (see Figure 3, right panel). Volume errors for individual events were large in some cases (ranging from 35% underestimation to 30% overestimation), but the average VE for each calibration scenario was limited to underestimation by 1-11%. The magnitudes of the peak flow and volume errors are comparable to those found in previous studies on calibration of SWMM (Barco et al., 2008; Krebs et al., 2016).

Sensitivity to objective functions
The differences between calibrations using NSE and RMSE as objective functions were small (see Table 3), with the largest differences being 0.05 (NSE) and 0.4 (RMSE) for T32S_QV_ppP. For three calibration scenarios the NSE calibration found a better RMSE than the RMSE calibration, and for four CSs the RMSE calibration found a better NSE than the NSE calibration. This indicates that the algorithm in some cases finds a local rather than a global optimum, although the differences between the optima are small.

Sensitivity to model discretization
Calibration runs with a model setup consisting of five instead of 140 subcatchments showed NSE similar to that of the baseline run: the change in performance ranged from +0.08 (T32S_D_prec) to -0.06 (T32S_Q_60m), with only T32S_P_sum showing a larger loss of 0.15. The peak flows predicted by the low-resolution models were most often lower than in the high-resolution model and, as a result, peak flow ratios were worse. The overall runoff volume was higher in the low-resolution models, which resulted in a smaller volume error. The changes in peak flow performance were smaller than reported by Krebs et al. (2016).

Sensitivity to structural flow measurement errors
Calibration results (NSE) are shown in Table 3 for the cases of structural flow data errors of -40% and +40%. For most calibration scenarios there was a small loss in NSE, except for T6_QV_ppP, which failed to calibrate (NSE of -0.1) when the flow data was reduced by 40%. Three of the events in that scenario calibrated well (NSE 0.76-0.95), but the other three produced negative NSE values. These latter three events all missed the first runoff peak; for two of these events the quality of fit, judged visually, was the same as in the baseline run, but since the flow rates were low, the NSE values were unsatisfactory (see Figure 4 for an example). T6_PI_mean included one event for which the reduction of flow observations by 40% resulted in a hydrograph where large parts fell below the 1 L s⁻¹ threshold. Except for the events described above, the flow errors could be compensated for in calibration. This issue is influenced by the use of rainfall multipliers, as discussed in Sect. 3.2.2.

Hydrologic model parameters
Figure 5 shows the calibrated parameter values (for the baseline run), normalized with respect to their calibration ranges (see Table 1). There is considerable variation among the calibrated values obtained in different calibration scenarios, demonstrating that even for parameters with a clear physical interpretation, identification of the best (ideal) value is not straightforward. Earlier studies have related the uncertainty of parameter estimates to, among other factors, calibration data errors (Dotto et al., 2009, 2011, 2014; Kleidorfer et al., 2009a; Sun et al., 2014). The variation found here among the optimum parameter values obtained in different calibration scenarios suggests that the selection of calibration events could also affect the uncertainty of parameter estimates, and this influence should be investigated further.

Rainfall multipliers
The values of the rainfall multipliers found in the calibration process ranged from 0.48 to 2.92, showing that there could be significant measurement errors (in precipitation and/or flow) and/or differences between the gauge rainfall and the catchment-average rainfall that fits best with the observed flow rates. For rainfall events that were included in multiple calibration scenarios, the calibrated multipliers from different scenarios were close to each other (see Table 4). This variation is much smaller than that for the hydrological model parameters (see Sect. 3.2.1), which indicates that the rainfall multipliers compensate for discrepancies between the observed and best-fitting rainfall, rather than for other aspects of catchment runoff modelling. The average value of the rainfall multipliers across all events is 1.2.
When all flow data was decreased by 40% prior to calibration, the different CSs remained in agreement with each other, except for T6_QV_ppP, which failed in this run. The average rainfall multiplier across all events was 0.76 (i.e., 37% lower than in the run without any perturbation of flow data). When all flow data was scaled up by 40%, T32S_P_sum and T32S_Q_max produced deviating multipliers (compared to the other calibration scenarios) for three events each, but the quality of fit was the same across all CSs (according to both the NSE and visual comparison). The average value of the multipliers across all events was 1.59 (i.e., 33% higher than in the baseline run). This finding suggests that the rainfall multipliers were responsible for much (if not all) of the model adjustment to the perturbed flow data. In this respect, the average multiplier of 1.2 in the baseline run suggests that there was some structural disagreement between the observed rainfall and flows.
With the low-resolution model, in contrast to the high-resolution model, there was considerable variation in the values of the rainfall multipliers for each event found by the different calibration scenarios (see Figure 6). The values obtained were 25% lower to 50% higher (for the same event in the same calibration scenario) than in the baseline calibration. Three of the low-resolution two-stage calibrations (T32S_D_prec, T32S_Q_60m, T32S_Q_max) found lower multipliers than in the baseline, T32S_QV_ppP had three higher and three lower multipliers, and the other CSs had all higher multipliers. This behaviour indicates that (despite similar resulting performance) the rainfall multipliers in the low-resolution model were used to compensate (within a single event) for the effects of the specific parameter set found in calibration, rather than for a structural discrepancy between the observed rainfall and flow data as in the baseline calibration.

Individual events
The successful calibrations predicted 7-13 out of the 19 validation events satisfactorily (NSE >0.5), see Table 5. The two-stage calibration scenarios were less sensitive to perturbations of the flow data in the calibration period and to switching from the high-resolution to the low-resolution model. T32S_P_sum, T32S_Q_max, and T32S_QV_ppP actually predicted a higher number of events satisfactorily with the low-resolution model than with the calibrated high-resolution model. The events that most often caused failure in validation were four events with peak flow rates of 10 L s⁻¹ or less; therefore, such failures may be attributed to: (i) relatively high measurement uncertainties, and (ii) high sensitivity of the NSE to even small changes in the hydrographs. However, it should be noted that the two smallest events (both with a peak flow rate of 4.6 L s⁻¹) were predicted with NSE >0.5 by some calibration scenarios. For the other CSs, examination of the hydrographs showed that they predicted the magnitude of these events well but got the timing wrong.
Another event that failed in validation for all CSs was that with the highest peak flow rate (53 L s −1 ), which was overestimated by a factor of up to three.This event was dominated by an intense, single-peak burst of rainfall, so it could have suffered from high spatial variation of the rainfall.
The volume errors were similar for all high-resolution calibrated models and showed a general tendency to underestimate flow volumes by 25%. When using the low-resolution model, the single-stage CSs underestimated runoff volume by around 40%, while the two-stage scenarios underestimated it by a maximum of 27%. Across all CSs, the two-stage versions had similar or better performance in terms of total runoff volume. Peak flow ratios were <1 for most events, but for the events that generally did poorly in validation (see above), peak flows (as well as flow volumes) were overpredicted instead. The results for both total volumes and peak flows indicate that for most events flows were underestimated, which may be (at least partially) attributed to the discrepancies between observed rainfall and flow found in the calibration phase (see Sect. 3.2.2).
The peak flow ratios obtained for the 19 validation events using the calibrated models from the baseline are shown in the upper panel of Figure 7. Underestimation of peak flows was most frequent, but the largest errors occurred when the flow was overestimated. The variation among CSs was generally larger when the prediction error was larger. The corresponding figure for volume errors is shown in the middle panel of Figure 7. Again, underestimation was more common, but overestimation did occur for a limited number of events. For both peak flows and total volumes, the variation among events was generally larger than the variation among different calibration scenarios, showing that selecting a limited number of validation events may also influence the results of the model evaluation.
When examining the NSE of the validation events (see the bottom panel of Figure 7), more variation among the different CSs became visible, although the amount of variation was still event-dependent: inter-CS variation for the same event varied from 0.15 to 1.25. This shows that some events can have a much larger impact on the overall validation results than others. Out of the 19 events, 6 were predicted satisfactorily (NSE >0.5) by some CSs but not by others; 5 events failed for all CSs, and 8 were predicted satisfactorily by all CSs.

Overall performance
To assess the overall performance of the different calibration scenarios for the validation period, several ways of combining the individual events were considered (see Table 6). The simplest metric is obtained by using the NSE means, which ranged from 0.13 (T6_PI_30m) to 0.42 (T32S_QV_ppP). There are two problems with this metric. First, since the NSE ranges from negative infinity to plus one, one poorly fitting event can offset multiple well-fitting events. Second, two simulated hydrographs of equally poor fit can have rather different (negative) NSE values, producing different impacts on the overall results, which is not justified by a visual comparison. Therefore, the mean is not considered a reliable metric for comparisons when poorly fitting events are present.
The exclusion of low-flow (<10 L s⁻¹ peak) events avoids this issue, but does not reward calibration scenarios that do manage to predict these events satisfactorily. Another option is to set all NSE values <-1 to -1 before calculating the mean, which results in NSE values ranging from 0.29 to 0.47. Adopting the median NSE (insensitive to outliers) led to a higher range of 0.43 to 0.61, showing that the average or overall validation performance depends more on the outlier events than on typical events.
A more commonly used approach is to combine all the events into a single time series prior to calculating the NSE on the joint time series. This procedure indicated satisfactory performance for all CSs (NSE 0.57-0.70). The discussion of the various metrics shows that caution is needed when averaging performance over multiple events, as metrics may not reflect the fact that a significant number of events are poorly predicted in all CSs (see Table 5).
The considerations in the previous paragraph concern the NSE and are not necessarily applicable to other statistics in the same way. The RMSE is calculated in flow units (L s⁻¹) and tends towards larger values for larger events, even if the fit is visually better. Because of this, taking the mean across events is somewhat conceptually unsatisfactory, but the resulting values differ from the RMSE calculated on a joint time series only by an offset that is almost the same for all CSs; therefore, all CSs show the same relative performance. The volume error (VE) was included in this study to yield some indication of the overall difference between the modelled and observed runoff volumes over longer time periods; therefore, this statistic was summarized over all events using the joint time-series approach.
To obtain an overall ranking of the different CSs in the baseline run, they were ranked by five characteristics (see Table 6) and the individual ranks were summed. This shows that the two-stage CSs performed better in the validation period than the single-stage CSs.
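The rank-sum procedure is simple enough to state precisely. A minimal sketch (illustrative Python; the metric names and scores are hypothetical, and ties are broken arbitrarily rather than by averaging ranks):

```python
from typing import Dict

def rank_sum_ranking(scores: Dict[str, Dict[str, float]],
                     higher_is_better: Dict[str, bool]) -> Dict[str, int]:
    """Rank each scenario per metric, then sum the ranks.

    scores: {scenario: {metric: value}}; a lower rank sum is better overall.
    """
    scenarios = list(scores)
    totals = {cs: 0 for cs in scenarios}
    for metric, hib in higher_is_better.items():
        # Best scenario for this metric gets rank 1.
        ordered = sorted(scenarios, key=lambda cs: scores[cs][metric],
                         reverse=hib)
        for rank, cs in enumerate(ordered, start=1):
            totals[cs] += rank
    return totals
```

Because ranks discard the magnitude of differences, a scenario that is marginally worse on every metric can end up with a worse rank sum than one that is much worse on a single metric; this is a known limitation of rank-based aggregation.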

Sensitivity to the objective function
For most calibration scenarios, the models that were calibrated with different objective functions (NSE in the baseline run, RMSE in the alternative) retained a similar performance in the validation phase. However, there are differences for some of the two-stage CSs; see Table 7 for a description and Figure 8 for an example.

Low-resolution model
The effect of the low-resolution model depended on the calibration scenario considered (see Table 8). Some scenarios scored better in terms of NSE (gains of up to 0.17 and 3 events predicted with NSE >0.5), while others lost performance by the same metrics (up to 0.24 and 5 events). This is a more mixed result than that found by Krebs et al. (2016), who tested high- and low-resolution models of three catchments and found the high-resolution models to perform better in validation for all three. All but one of the two-stage scenarios predicted more events satisfactorily with the low-resolution model than with the high-resolution model.
The volume errors were 12 to 19 percentage points higher for the single-stage calibration scenarios. The two-stage scenarios showed both worsened (T32S_P_sum, T32S_PI_mean) and improved (T32S_Q_60m, T32S_Q_max, T32S_QV_ppP) performance. When comparing the hydrographs from the two model discretizations per event, the high-resolution model usually performed better. For the last three CSs mentioned, however, the low-resolution model performed better than it did for the other CSs; for T32S_Q_60m and T32S_Q_max it predicted the observed hydrographs better for most validation events. These three calibration scenarios were also the only ones in which the low-resolution model resulted in lower values for the calibrated rainfall multipliers. This resulted in lower flows (and therefore better fits) for the five validation events that caused problems for most other CSs (i.e. the four lowest and the single highest peak flow rates, see Sect. 3.3.1).

Overall ranking for validation
For an overall ranking of the different calibration scenarios in the validation period, the baseline runs were ranked by each of the following statistics: mean NSE (limited to -1), number of events with NSE >0.5, RMSE (calculated over the joint time series of all events), volume error (likewise over the joint time series), and mean peak flow ratio. The ranks for each characteristic were then summed to obtain an overall ranking (see Table 9). T32S_PI_mean and T32S_D_prec performed best, with T6_PI_30m and T6_Q_60m bringing up the rear.

Degradation of performance from calibration to validation
In calibration, the NSE for the different calibration scenarios ranged from 0.68 to 0.85, while in validation it ranged from 0.29 to 0.47 (Table 9). The CSs that did better in calibration lost more performance when switching to the validation period (see Figure 9). Considering the change in overall rank from calibration to validation, the two-stage scenarios showed smaller changes than the single-stage scenarios. Several scenarios showed large gains (+10 for T6_QV_ppP, +7 for T6_P_sum, +5 for T32S_PI_mean), while the largest losses were smaller (-7 for T6_Q_60m, -6 for T6_Q_max). The findings in this section demonstrate that good calibration performance is not necessarily indicative of good validation performance and vice versa, and validation should therefore be performed whenever possible.

Single-stage vs. two-stage calibrations
For those selection criteria for which both single- and two-stage calibrations were performed, the results of the two options were compared directly (see Table 10). In terms of NSE and volume error, the two-stage calibrations performed better than the single-stage calibrations, except for Q_max. In terms of peak flow ratio the results were mixed: for D_prec and PI_mean the two-stage variant outperformed the single-stage variant across all metrics, for Q_max the single-stage variant performed better, and for the other CSs the results depended on the metric used. In validation, the differences between single- and two-stage calibration were less pronounced (see Table 11). In terms of NSE, the single-stage calibrations performed better, but they had the same number of satisfactorily predicted events as the two-stage calibrations. In terms of RMSE, VE and PFR the two-stage calibrations performed better, except for QV_ppP. This is also the only criterion for which all metrics agreed, namely that the single-stage calibration had better results in the validation period.

Conclusions
The objective of this study was to compare different strategies for the selection of calibration events for a hydrodynamic model of a predominantly green urban area. The calibration strategies comprised single- and two-stage calibrations and considered a number of different metrics by which calibration events can be selected from a larger group of candidate events. The strategies were tested with two different objective functions, on data sets with structural flow data errors, and with high and low spatial resolution models.
In the baseline run (high-resolution model, Nash-Sutcliffe efficiency as objective function, no structural flow data errors), all calibration scenarios produced successful calibrations, albeit with varying performance: NSE values ranged from 0.68 to 0.85. For the two-stage calibrations, both stages gave satisfactory results (NSE 0.70 to 0.87). The two-stage calibrations performed better than their single-stage counterparts in terms of NSE and runoff volume error. The choice of NSE or RMSE as the objective function had only a small impact on the results.
The robustness of the calibration scenarios to structural flow errors was tested by recalibrating after uniformly reducing or increasing all flow observations by 40%. Most calibration scenarios adjusted to this with only small effects on calibration performance, except for T6_QV_ppP (the six events with the highest percentage runoff), which failed in calibration (NSE -0.1) when the flow data were reduced by 40%. This can be attributed to two low-flow events that produced negative NSE values even though they visually indicated a good fit.
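The low-flow failure mode noted above follows directly from the NSE definition: the denominator is the variance of the observations, which is tiny for a near-constant low-flow event, so even a small absolute offset yields a strongly negative score. A synthetic illustration (the numbers below are invented for demonstration, not the actual event data):

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 - SSE / variance of the observations."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

# Low-flow event: observed flow hovers around 2 L/s and the model is off by
# only 0.5 L/s, which would look like a reasonable fit on a hydrograph plot,
# yet the NSE is strongly negative because the observed variance is tiny.
obs_low = np.array([1.8, 2.0, 2.2, 2.0, 1.8])
sim_low = obs_low + 0.5

# The same 0.5 L/s offset on a large event barely affects the NSE.
obs_high = np.array([10., 40., 80., 40., 10.])
sim_high = obs_high + 0.5
```

This is why the clipped-mean and median aggregations discussed for the validation period are less affected by such events than the plain mean.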
Switching from a high-resolution to a low-resolution model discretization had only a small impact on the calibration performance metrics. However, the values of the rainfall multipliers for each event varied much more than with the high-resolution models. Most calibration scenarios found higher multiplier values with the low-resolution model, but three two-stage CSs found lower values instead. The calibrated scenarios were validated against an independent set of 19 validation events. The calibrated scenarios predicted 7 to 13 of these events satisfactorily (NSE >0.5). A group of four events with peak flow rates below 10 L s-1 caused problems in most calibration scenarios, as did the event with the highest observed peak flow rate. Although most calibration scenarios yielded similar results for the validation events with respect to the overall volume error and the ratio between modelled and observed peak flow rates, there were considerable differences between the CSs when performance for the validation events was measured by NSE. In terms of NSE the single-stage CSs proved more successful in the validation phase, but in terms of RMSE, volume error and peak flow error the two-stage CSs performed better.
In the validation phase, there were again (as in calibration) only small differences between the two considered objective functions. Concerning model discretization, the low-resolution single-stage calibration scenarios showed considerably larger volume errors than their high-resolution counterparts, while most two-stage calibration scenarios showed the same or even improved volume errors. Two two-stage CSs (which also deviated from the others in terms of the calibrated rainfall multipliers) were the only ones to obtain visually better fitting hydrographs with the low-resolution model setup than with the high-resolution setup. The two-stage calibrations also predicted more validation events satisfactorily when the calibration flow data were perturbed.
An overall ranking of the different scenarios across the different influential factors (objective function, flow data errors, model discretization) showed that T6_Q_max, T32S_D_prec and N_T6 performed best in calibration. In the validation phase, however, this order changed considerably, with T32S_PI_mean, T32S_D_prec and T6_P_sum forming the top three. The ranking of the two-stage scenarios was more consistent between calibration and validation than that of the single-stage scenarios.

Figure 1 .
Figure 1. Map of the studied catchment showing elements of the high-resolution rainfall-runoff model and the distance of the catchment to the rain gauge (RG). The diameters of the pipes range from 400 mm for the main trunk, where the flow sensor is located, to 200 mm for the smaller branches.

Table 2 .
Calibration parameters and their ranges for the low-resolution model.

Figure 2 .
Figure 2. Examples of hydrographs for events with high (left) and low (right) objective function (NSE) values.

Figure 3 .
Figure 3. Peak flow ratio.

Figure 4 .
Figure 4. Calibrated hydrographs for T6_QV_ppP in the baseline run (left) and after reducing all flow measurements by 40% (right).
Gupta et al. (1998) also found considerable variation in the parameter values obtained when using different years as calibration periods for a natural catchment model. Nonetheless, the span of parameter values is considerably reduced compared to the range imposed during calibration, showing that the boundaries were not set too tightly and that the calibration procedure does offer benefits over estimating parameter values directly. Calibrated parameter values are always uncertain estimates. This uncertainty has been investigated for urban drainage models and shown to depend on parameter type, study catchment, model structure, catchment discretization and measurement uncertainties.
Hydrol. Earth Syst. Sci. Discuss., https://doi.org/10.5194/hess-2019-67. Manuscript under review for journal Hydrol. Earth Syst. Sci. Discussion started: 8 March 2019. (c) Author(s) 2019. CC BY 4.0 License.

Figure 5 .
Figure 5. Normalized calibrated parameter values for the different calibration scenarios in the baseline run. The highest and lowest values found for each parameter are indicated.

Figure 6 .
Figure 6. Rainfall multipliers in the baseline calibration (horizontal axis) compared to the LR-model calibration (vertical axis). Each dot is a rainfall multiplier calibrated by one calibration scenario for one event. Identical events appearing in multiple calibration scenarios share the same colour.

Figure 7 .
Figure 7. Error statistics for individual validation events for all calibration scenarios in the baseline runs.

Figure 8 .
Figure 8. Examples of hydrographs showing typical (left panel, N_T6) and differing (right panel, T32S_D_prec) behaviour when calibrated for different objective functions.

Table 1 .
Calibration parameters and their ranges.
Rainfall input uncertainty. Since the rain gauge is located outside the catchment and the maintenance of the gauge was carried out by different people, it is possible that there are structural errors in the rainfall measurements. This was investigated by examining the rainfall multipliers that were included for each event in the calibration (see Sect. 2.2).
a CSs were ranked by each column marked with an asterisk (*). The overall ranking is based on the sum of these per-column rankings.

Table 4 .
Baseline run calibrated rainfall multipliers for events that were used in at least three CSs.

Table 5 .
Number of validation events with NSE >0.5 out of 19 total events. Columns: Baseline; RMSE as obj. func.; Cal. flow -40%; Cal. flow +40%; Low-res. model; Total. a Run was unsuccessful in calibration.

Table 7 .
Effects of calibration with RMSE as the objective function instead of NSE.

Table 9 .
Degradation of performance when switching from the calibration to the validation period.

Table 10 .
Comparison of single- and two-stage calibration strategies in the calibration phase. The highest score for each selection criterion is highlighted.

Table 11 .
Comparison of single- and two-stage strategies in the validation phase. The highest score for each selection criterion is highlighted.
a Calculated after setting individual event values <-1 to -1.