Informal uncertainty analysis (GLUE) of continuous flow simulation in a hybrid sewer system with infiltration inflow - Consistency of containment ratios in calibration and validation? - DTU Orbit (02/12/2018)

Monitoring of flows in sewer systems is increasingly applied to calibrate urban drainage models used for long-term simulation. However, models are most often calibrated without considering the uncertainties. The generalized likelihood uncertainty estimation (GLUE) methodology is here applied to assess parameter and flow simulation uncertainty using a simplified lumped sewer model that accounts for three separate flow contributions: wastewater, fast runoff from paved areas, and slow infiltrating water from permeable areas. Recently, the GLUE methodology has been criticised for generating prediction limits without statistical coherence and consistency, and for the subjectivity in the choice of a threshold value to distinguish "behavioural" from "non-behavioural" parameter sets. In this paper we examine how well the GLUE methodology performs when the behavioural parameter sets deduced from a calibration period are applied to generate prediction bounds in validation periods. By retaining an increasing number of parameter sets we aim at obtaining consistency between the GLUE-generated 90 % prediction limits and the actual containment ratio (CR) in calibration. Due to the large uncertainties related to spatio-temporal rain variability during heavy convective rain events, flow measurement errors, possible model deficiencies as well as epistemic uncertainties, it was not possible to obtain an overall CR of more than 80 %. However, the GLUE-generated prediction limits still proved rather consistent, since the overall CRs obtained in calibration corresponded well with the overall CRs obtained in validation periods for all proportions of retained parameter sets evaluated.
When focusing on wet and dry weather periods separately, some inconsistencies were however found between calibration and validation, and we address here some of the reasons why we should not expect the coverage of the prediction limits to be identical in calibration and validation periods in real-world applications. The large uncertainties result in wide posterior parameter limits that cannot be used for interpretation of, for example, the relative size of the paved area vs. the size of the infiltrating area. Future work should therefore seek to learn from the significant discrepancies between model and observations found in this study, possibly by using some form of non-stationary error correction procedure, but it seems crucial to obtain more representative rain inputs and more accurate flow observations to reduce parameter and model simulation uncertainty. © Author(s) 2013.


1 Introduction
Simulation with deterministic urban drainage models is commonly used to assess the performance of sewer systems and the efficacy of upgrading or redesign proposals. Uncertainties are rarely addressed in these investigations, and decisions with large economic consequences are usually taken on a purely deterministic basis, as if model simulations were in full conformity with reality. Sometimes models are calibrated to level or flow data from a few places in a sewer system during some months. However, one does not need much experience with the calibration of urban drainage models to arrive at the conclusion that different parameter sets are optimal for different rain events, even when applying state-of-the-art, physically distributed models in combination with high-resolution rain gauges located close to the catchment in question.

Published by Copernicus Publications on behalf of the European Geosciences Union.
Different parameter sets, sometimes referred to as different models, will obviously have different consequences when applied in the long-term simulation setting typically used as a basis for evaluating upgrade proposals, a fact that is nevertheless mostly ignored in practice. There is thus an urgent need for uncertainty assessment tools that can be used when evaluating upgrade proposals, as well as for associated needs such as flow meter checking and evaluating the magnitude of the unintended infiltration contribution to the sewer flow, which constitutes a major problem in many flat coastal urban catchment areas.
In this paper we present an application of GLUE to a hybrid urban drainage system revealing the full complexity of reality in terms of flow variations (diurnal wastewater variations, fast rainfall runoff from paved areas and slow infiltration inflow from unknown sources), using flow data recorded by the responsible utility over three consecutive years. A state-of-the-art physically distributed model, fed with comprehensive information about the system attributes, is currently used by the local utility to interpret the measurements. We use a lumped, conceptual model to reduce the computational burden; this model nevertheless represents the complex flow contributions mentioned above in a manner similar to the physically distributed model used in practice.
Recently the GLUE methodology was criticized for being statistically incorrect and for generating prediction limits without statistical coherence (Mantovan and Todini, 2006; Mantovan et al., 2007; Stedinger et al., 2008). This is due to the subjectivity in adopting a likelihood measure and in the choice of a threshold value to distinguish "behavioural" from "non-behavioural" parameter sets. In GLUE, modelling errors associated with each acceptable model are usually treated under the assumption that error series associated with a particular parameter set (such as over- or under-prediction of flow peaks) will be similar in prediction to those found in evaluation (Blazkova and Beven, 2009b), and hence GLUE is in many cases a welcome alternative to traditional statistical inference, which requires the error series to conform to a known statistical distribution that is often difficult to justify in real hydrological applications (Beven et al., 2008). It is in this context worth noting that the aforementioned papers criticising the GLUE approach all used synthetic data to illustrate and consolidate their critique; there thus seems to be a lack of research papers clearly demonstrating that the statistical error assumptions conform to the specified likelihood function in real-world hydrological applications. In the synthetic case the benefits of classical statistical inference are evident: trust in the model is built in the model construction phase, and confidence bounds can be generated and used for prediction. In Beven and Freer (2001) and Beven et al. (2011) it is claimed that any effects of model nonlinearity, covariation of parameter values and errors in model structure, input data or observed variables, with which the simulations are compared, are handled implicitly within the GLUE procedure.
The scope of this paper is to examine the GLUE assumption that the error series associated with a particular parameter set will be similar in prediction to those found in evaluation. If true, we would expect the performance of the GLUE-derived uncertainty limits obtained in a calibration period to be similar in a validation period. Aiming at an overall coverage of 90 % of the observations, we investigate how well the GLUE-generated 90 % prediction limits cover the observations in both dry and wet weather periods as the number of behavioural parameter sets increases, and we moreover check the coverage for different flow magnitudes using half a year for calibration. Validation periods are included to test the consistency of the generated prediction limits, that is, we test if the coverage obtained in validation periods corresponds to the coverage obtained in the calibration period. We also show how the limits of the posterior parameter space increase as more parameter sets are retained and use this information to draw conclusions on the physical interpretation of important model parameters, such as the size of the contributing paved area versus the size of the area contributing with slow infiltration inflow. After this brief introduction, we first present the case study area, the calibration and validation data, and the model in Sect. 2. This is followed by an elaboration of the applied uncertainty analysis methodology in Sect. 3, in which the GLUE steps are outlined, the combined likelihood measure used is defined, and some performance indicators are presented. Finally, the results are presented and discussed in Sect. 4 and conclusions are drawn in Sect. 5, both with respect to the urban drainage engineering relevance and the method applicability.
2 Case study and model

2.1 Catchment and drainage system
The case study catchment, with a total area of 1320 ha, is situated in the western part of greater Copenhagen in the Ballerup Municipality, as shown in Fig. 1. Most of the area (93 %) is equipped with a separated sewer system, that is, a system with two parallel pipes for wastewater and stormwater, whereas only 7 % is equipped with a combined system where wastewater and stormwater flow in the same pipe (see Table 1). Such hybrid systems are quite common due to the transition of the prevailing technological regime in urban drainage since the 1950s, from combined to separated systems.
In a recent calibration of a distributed hydrodynamic model with a rainfall-dependent infiltration-inflow module (DHI, 2009), the effectively contributing impermeable area of the combined sewer system was however found to be larger than that of the separated area (see Table 1), probably because of infiltration inflow or unintended connections of drainage water to the wastewater system. A flow meter has been installed downstream from the catchment (Fig. 1) with the aim of detecting these contributions. The flow meter is a semi-mobile ultrasonic Doppler type placed in an intercepting concrete pipe (d = 1.4 m and slope 1.1 ‰, i.e. a potential gravity-driven flow capacity of approx. 2000 L s⁻¹), and logs every 5 min. There are roughly 50 000 inhabitants within the catchment area, which is one of several sub-catchments that divert water to the second largest wastewater treatment plant (WWTP) in Denmark, Avedøre WWTP. There are a couple of small pumping stations and one larger storage basin of approx. 4000 m³ within the catchment. The two closest rain gauges from the national Danish tipping bucket network (0.2 mm resolution; Jørgensen et al., 1998), P316 and P321, indicated in Fig. 1, are located outside the studied catchment area, some 12 km apart.

2.2 Hydrological model
The primary scope of the paper is to test the usability of the GLUE methodology as a tool for uncertainty analysis and estimation in complex urban drainage modelling, that is, by evaluating how well the GLUE methodology performs when the behavioural parameter sets deduced from a calibration period are applied to generate prediction bounds in a validation period. For this test we decided to keep the model simple and compare observed and modelled flow at just one place downstream from the considered catchment. We judged that a state-of-the-art physically distributed model that calculates flows and levels in every pipe of the sewer system would be overly complex for the purpose, considering both the computational requirements and the risk of overparameterization. Instead a simple modelling approach was chosen, yet one complex enough to describe the major flow components (diurnal wastewater variations, fast rainfall runoff from paved areas and slow infiltration inflow from unknown sources). The limitation of such an approach is that the model may be too simplistic, for example, in cases where system components such as weirs, gates, pumping stations and storage tanks have a significant impact on the observed flow, or in cases with heavy backwater effects.

Hydrol. Earth Syst. Sci., 17, 4159-4176, 2013. www.hydrol-earth-syst-sci.net/17/4159/2013/
In a GLUE study of an urban drainage system, Thorndahl et al. (2008) however applied a distributed hydrodynamic model and showed that the hydraulic parameters (Manning number and minor losses) were insensitive when extracting the behavioural parameters of the model, while the surface runoff part of the model (particularly the hydrological reduction factor and time of concentration) was very sensitive. Replacing the full hydrodynamic model normally used in practice with a lumped, conceptual hydrological model as depicted in Fig. 2 therefore seems adequate.
The model consists of two linear reservoirs for modelling the fast rainfall-runoff relationship (representing the paved area of the system), and three linear reservoirs for modelling the slow infiltration inflow to the sewer system. A double-sinusoidal black box model was used for modelling the diurnal wastewater flow. The model equations are displayed in Table 2, while a nomenclature is provided in Table 3.
A time step of 15 min was used during both calibration and simulation, which is sufficient for a catchment of this size, where the concentration time is at least a few hours. The inputs to the model are measured precipitation from the two rain gauges, P316 and P321, and α is a weighting factor governing the percentage of the total area that each rain gauge represents.
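To make the lumped structure concrete, the routing through a cascade of equal linear reservoirs and the double-sinusoidal dry weather term can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names, the unit conversions and the exact multiplicative form of the sinusoidal term around a0 are assumptions; the authoritative equations are those in Table 2.

```python
import numpy as np

def reservoir_cascade(rain_mm, area_ha, K_h, n_res=2, dt_h=0.25):
    """Route rainfall (mm per time step) over a contributing area through a
    cascade of n_res equal linear reservoirs (dS/dt = I - S/K).
    Returns flow in m3/s."""
    # mm per step over area_ha -> inflow rate in m3/s
    inflow = np.asarray(rain_mm, float) * 1e-3 * area_ha * 1e4 / (dt_h * 3600.0)
    K = K_h / n_res                # residence time per reservoir (h)
    S = np.zeros(n_res)            # reservoir storages (m3/s * h)
    out = np.empty(inflow.size)
    for t, q in enumerate(inflow):
        for r in range(n_res):
            S[r] += dt_h * (q - S[r] / K)   # explicit Euler water balance
            q = S[r] / K                    # outflow feeds the next reservoir
        out[t] = q
    return out

def diurnal_wastewater(t_h, a0, s1, s2, c1, c2):
    """Double-sinusoidal dry weather flow: a daily and a half-daily
    harmonic around the mean flow a0 (m3/s)."""
    w = 2.0 * np.pi / 24.0
    return a0 * (1.0 + s1 * np.sin(w * t_h + c1) + s2 * np.sin(2 * w * t_h + c2))
```

With two reservoirs for the fast paved-area component (A_f, K_f), three for the slow infiltration component (A_s, K_s) and the wastewater term added, the total simulated flow at the meter is the sum of the three contributions.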

2.3 Calibration and validation data
Data from half a year (April-October 2007) were used for calibration. This period was selected because summer normally carries the heaviest rains. The length of the calibration period was chosen by considering the typical length of measuring campaigns used for calibration of urban drainage models; such campaigns usually last only 3-4 months. Two subsequent years (2008 and 2009) of the same season (April-October) were included for validation. There have been no significant changes to the sewer system since 2007, and a good basis for validating the GLUE-generated prediction limits thus exists. Some flow data from the calibration period (10 %) and validation periods (1 % and 1.5 %) had to be discarded from the analysis as they were obviously erroneous; the rain data had already been subject to standardised quality control as described by Jørgensen et al. (1998).
The measured precipitation in the studied period was quite different from one year to the other, and large spatial variation was observed. Figure 3 shows the accumulated precipitation measured by each rain gauge plotted against each other on a shifted log scale for each of the years considered. Events plotted at P321 = 0 have only been recorded at P316, whereas events plotted at P316 = 0 have only been recorded at P321; these are probably convective events with limited spatial extent. The rest are events that have been recorded at both gauges with less than 1 h time difference. In 2007, the total precipitation registered at the two rain gauges amounted to 574 mm (P316) and 562 mm (P321). The calibration period was characterized by many heavy rain storms (4 events containing 35 mm or more). In the validation year 2008 the rain gauge P316 was clearly malfunctioning, recording consistently less precipitation than P321 and other rain gauges in the area. The total precipitation for the period amounted to 143 mm at P316 compared with 341 mm at P321. The recordings from rain gauge P316 in August 2008 were classified as "suspicious values" by DMI (2009) but were nevertheless included in the study. The validation year 2008 thus serves as an example of how input errors propagate to model output and affect the model performance. The second validation year, 2009, offered one extreme rain event (> 100 mm recorded at P316; > 70 mm recorded at P321) and a few medium events (see Fig. 3). The total precipitation amounted to 322 mm (P316) and 302 mm (P321), which again was much less precipitation than during the calibration period in 2007.

3 Uncertainty analysis methodology

3.1 Implementation of GLUE
Prediction limits, or quantiles, derived with the GLUE methodology are conditional on the choice of limits of acceptability, the choice of weighting function, the range of models (parameter sets) considered, the exploration of the model space (the number of Monte Carlo runs and the method used for sampling the parameter space), the treatment of input and observation errors, and the assumption that the considered system remains unchanged within the validation period. The GLUE steps implemented in this investigation are detailed below.
1. Once a suitable model, M, and relevant input and observations have been selected for the purpose (Sects. 2.1 and 2.2), determine a reasonably broad prior domain for each model parameter θi based on the available background knowledge (for details see Sect. 3.2 below).
2. Select an estimation period, N. We used half a year of measurements, April-October 2007. Carefully check and leave out faulty input data and observations from the estimation (Sect. 2.3). We omitted raw data which were obviously faulty, that is, when the measured velocity or level was zero. This happens occasionally when objects such as toilet paper clog or attach to the flow meter. Of course, there may be cases when the gauge is only semi-clogged, so that unreliable measurements are sampled and included in the analysis, but such data can be very hard to separate from good data.
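The screening rule in step 2 (discard samples where the measured level or velocity is exactly zero) can be sketched as a simple boolean mask; the function name and array interface are illustrative assumptions, not the authors' code.

```python
import numpy as np

def valid_flow_mask(level_m, velocity_ms):
    """Boolean mask of flow-meter samples considered valid: a level or
    velocity of exactly zero is taken as a clogged/faulty reading."""
    level = np.asarray(level_m, float)
    velocity = np.asarray(velocity_ms, float)
    return (level != 0.0) & (velocity != 0.0)
```

As noted above, such a rule only catches fully clogged readings; semi-clogged, biased readings pass the mask and remain in the data.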
3. Choose a likelihood measure L[M(θ)|u, y] to distinguish the behavioural parameter sets Θ_B from all parameter sets tried, Θ, conditioned on input data u = (u1, ..., uk, uk+1, ..., uN) and observations y = (y1, ..., yk, yk+1, ..., yN). We used two different likelihood measures: the Nash-Sutcliffe model efficiency coefficient was applied to dry weather periods, L_dw, and an exponential likelihood measure, L_ww, was applied to wet weather periods, see Eq. (1). The Nash-Sutcliffe likelihood was chosen for the dry weather case because of the desire to fit the dry weather diurnal flow pattern well, whereas an exponential likelihood was chosen to fit the peaks of the hydrographs better (Freer et al., 1996; Beven and Freer, 2001; Thorndahl et al., 2008). The exponential likelihood accentuates the peaks and weights them higher compared to local minima. The flow peaks are normally an important output in sewer flow modelling for assessing surcharge and flooding.
A flow threshold of 0.15 m 3 s −1 distinguishing dry and wet weather periods was determined from inspection of the flow observations.The likelihood measures are defined as where σ2 is the residual error variance assuming a zero mean bias, σ 2 o is the observation variance and k is the time index.H is a shaping factor that in this application is fixed to 1.A combined likelihood measure inspired by Choi and Beven (2007) was calculated by multiplication of the dry and wet weather likelihoods: weighting coefficients both set to 1, and i refers to each parameter set from the prior parameter domain.Equation ( 2) is effectively a Bayesian updating of likelihoods.The multiplicative form of the overall likelihood was chosen because we wanted to give equal weight to performance in dry and wet weather periods.If we had used one single performance measure for the whole calibration period we would have favored dry weather performance because dry weather periods constitute the majority of the considered calibration period.The more positive the likelihood values the better.Negative likelihood values are not considered because the observed mean in that case would be a better predictor than the model.4. Select a method and a distribution to draw random parameter sets i from.We consistently used uniform (non-informative) prior distributions and Latin Hypercube Monte Carlo sampling (LHS).The disadvantage with LHS is often argued to be the computational burden compared with a Markov chain Monte Carlo approach.A distributed hydrodynamic model would require extensive computational effort, but the lumped conceptual model presented here contains only 10 parameters, and thus the computational burden was not a challenge.
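A minimal numpy sketch of the likelihood measures in Eqs. (1) and (2), assuming the variances are computed over the respective dry and wet weather samples (the function names are illustrative):

```python
import numpy as np

def nash_sutcliffe(obs, sim):
    """Dry weather likelihood L_dw (Eq. 1): Nash-Sutcliffe efficiency."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    res_var = np.mean((obs - sim) ** 2)   # residual variance, zero mean bias
    return 1.0 - res_var / np.var(obs)

def exponential_likelihood(obs, sim, H=1.0):
    """Wet weather likelihood L_ww (Eq. 1) with shaping factor H."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    res_var = np.mean((obs - sim) ** 2)
    return np.exp(-H * res_var / np.var(obs))

def combined_likelihood(L_dw, L_ww, w1=1.0, w2=1.0):
    """Multiplicative combination (Eq. 2); parameter sets with a
    non-positive dry weather likelihood are rejected outright."""
    if L_dw <= 0.0:
        return -1.0
    return (L_dw ** w1) * (L_ww ** w2)
```

A perfect simulation yields L_dw = L_ww = 1 and hence a combined likelihood of 1, while any parameter set performing worse than the observed mean in dry weather is discarded.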
5. Dotty plots as described in Beven (2008) are used to (1) check where in the parameter space the higher likelihoods are located, (2) check that the prior parameter ranges have been chosen adequately broad, and (3) evaluate parameter correlation. Sometimes it is necessary to adjust the prior domain and restart the Monte Carlo runs a couple of times. This could be necessary if the dotty plots show high likelihood values at the lower or upper end of any of the prior parameter ranges. However, parameter ranges might also be constrained by physical considerations.
6. Decide how to extract the behavioural parameter sets, Θ_B. The procedure to derive the behavioural parameter sets can be either of four: (1) pre-define a likelihood threshold, (2) retain a pre-defined number of behavioural parameter sets, (3) retain a sufficient number of parameter sets to bracket a desired proportion of observations, or (4) use a limits of acceptability approach.
In our case we chose the third procedure, aiming at a coverage of 90 % of the observations with the 90 % prediction interval generated from a sufficient number of retained parameter sets. In our search for a sufficient number of parameter sets, we calculated prediction intervals for a gradually increasing number of retained parameter sets K, ranked according to L, that is, dim{Θ_B} = {…; 500; 1000; 3000; 6000; 10 000}. (3) Ideally, we are satisfied if 90 % of the observations fall inside the generated 90 % prediction interval.
7. The following steps are used to determine the prediction intervals (see also Beven and Freer, 2001):

a. At each time step k, rank the simulated flows y_sim,i,k produced by the retained parameter sets θ_B,i, together with their associated likelihood values L[M(θ_B,i)|u, y], with respect to flow magnitude.

b. Rescale the likelihoods to sum to unity,

w_i = L[M(θ_B,i)|u, y] / Σ_j L[M(θ_B,j)|u, y],

where M(θ_B,i) denotes the ith behavioural Monte Carlo sample, so that at any time step k prediction quantiles can be formed using

P(y_k < y_max) = Σ_{i: y_sim,i,k < y_max} w_i,

where y_max is some threshold flow.
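Steps 7a-b can be sketched as follows: at each time step the behavioural simulations are ranked by flow magnitude, their rescaled likelihoods are accumulated into a weighted CDF, and the 5 % and 95 % quantiles are read off. The array layout and function name are illustrative assumptions.

```python
import numpy as np

def glue_prediction_limits(sim, likelihoods, quantiles=(0.05, 0.95)):
    """GLUE prediction quantiles.

    sim         -- array (n_sets, n_steps) of behavioural simulations
    likelihoods -- array (n_sets,) of positive likelihood values
    Returns an array (len(quantiles), n_steps).
    """
    sim = np.asarray(sim, float)
    w = np.asarray(likelihoods, float)
    w = w / w.sum()                          # step 7b: rescale to sum to unity
    limits = np.empty((len(quantiles), sim.shape[1]))
    for k in range(sim.shape[1]):
        order = np.argsort(sim[:, k])        # step 7a: rank by flow magnitude
        flows = sim[order, k]
        cdf = np.cumsum(w[order])            # likelihood-weighted CDF
        for j, q in enumerate(quantiles):
            limits[j, k] = flows[np.searchsorted(cdf, q)]
    return limits
```

The two rows returned for quantiles (0.05, 0.95) are the lower and upper 90 % prediction limits used throughout the results.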

3.2 Choice of prior parameter ranges
The fast runoff from the paved area is defined by the parameters A_f, K_f and α. The choice of a reasonable prior range for A_f was inspired by the calibrated, physically distributed hydrodynamic model of the catchment. A_f represents the impermeable runoff area from both combined and separated catchment areas (the latter in case of illicit connections), which was calibrated to 43 ha (Table 1). To be on the safe side, the prior of A_f was here allowed to range between 10 and 70 ha. To find a reasonable prior range for the fast runoff concentration time K_f of the system, the distributed model was again used. A rain event with a duration of 1 h and with a constant intensity small enough not to exceed the pipe system's flow capacity was imposed on the system at different places in the catchment area, one place at a time, and the resulting hydrographs were inspected. On this basis the prior range of K_f was set to 1-8 h. We expected rain gauge P316 to contribute most to the runoff because it is closer to the paved combined sewer area than P321 (see Fig. 1), but decided to test this assumption by allowing α to range between zero and one. The slow runoff contribution (infiltration inflow) is defined by the parameters A_s, K_s and α. By inspection of the observed hydrographs following rain events, we chose a range for the slow runoff concentration time K_s of 8-80 h (0.33-3.33 days); that is, K_s was differentiated from K_f. The area effectively contributing to infiltration inflow, A_s, was allowed to vary between 0 and 80 ha, because a considerable amount of unintended water was believed to infiltrate the system. A lower limit of zero was chosen to allow for investigation of possible interactions between the runoff components of the model. A reasonable estimate of the average dry weather flow, a_0, could be derived by inspection of flow measurements in dry weather periods (60-90 L s⁻¹). The lack of physical interpretation of the other wastewater parameters s_1, s_2, c_1, c_2 made it difficult to decide on prior ranges, and a trial and error approach was therefore used before the final ranges displayed in Table 4 were selected.
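Drawing the prior sample (steps 1 and 4) can be sketched with a simple Latin hypercube over independent uniform priors. The ranges below mirror the runoff parameters discussed above (the wastewater parameter ranges from Table 4 are not reproduced here), and the function name is an illustrative assumption.

```python
import numpy as np

# Assumed uniform prior ranges (lower, upper) for the runoff parameters,
# taken from the discussion above.
PRIORS = {
    "A_f": (10.0, 70.0),   # contributing paved area (ha)
    "K_f": (1.0, 8.0),     # fast runoff concentration time (h)
    "A_s": (0.0, 80.0),    # area contributing infiltration inflow (ha)
    "K_s": (8.0, 80.0),    # slow runoff concentration time (h)
    "alpha": (0.0, 1.0),   # rain gauge weighting factor
}

def latin_hypercube(priors, n, rng=None):
    """Draw n parameter sets by Latin hypercube sampling from independent
    uniform priors: one stratified draw per equal-probability interval,
    shuffled independently for each parameter."""
    rng = np.random.default_rng(rng)
    names = list(priors)
    sets = np.empty((n, len(names)))
    for j, name in enumerate(names):
        lo, hi = priors[name]
        u = (rng.permutation(n) + rng.random(n)) / n   # stratified uniforms
        sets[:, j] = lo + u * (hi - lo)
    return names, sets
```

Each of the n rows is one candidate parameter set to be run through the model and scored with the combined likelihood of Eq. (2).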

3.3 Performance measures
Ideally we would like to have narrow prediction limits with a high bracketing of observations. This indicates good model performance and provides confidence in the model when it is also applied to a validation set. To evaluate this we introduce some performance measures that have been applied in other GLUE studies (Jin et al., 2010; Li et al., 2010; Xiong et al., 2009). The containment ratio (CR) is the percentage of observations that fall inside the prediction limits, and the average band width (ABW) is the average distance between the lower 5 % and upper 95 % prediction quantiles:

ABW = (1/N) Σ_k (y_k^u - y_k^l),

where N is the total number of time steps and y_k^u and y_k^l are, respectively, the upper and lower prediction quantiles at any given time step k. Finally, the average relative interval length (ARIL) weights the band width with respect to the observed flow magnitude:

ARIL = (1/N) Σ_k (y_k^u - y_k^l) / y_k^obs.

Note that when we refer to CR in the discussion of results we mean containment within the 90 % prediction limits, and when referring to ABW and ARIL these are likewise calculated from the 90 % upper and lower prediction limits.
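The three performance measures follow directly from their definitions above; a minimal numpy sketch (the function name is an assumption):

```python
import numpy as np

def performance_measures(obs, lower, upper):
    """CR (% of observations inside the limits), ABW (average band width)
    and ARIL (average band width relative to the observed flow)."""
    obs = np.asarray(obs, float)
    lower = np.asarray(lower, float)
    upper = np.asarray(upper, float)
    inside = (obs >= lower) & (obs <= upper)
    cr = 100.0 * inside.mean()              # containment ratio (%)
    abw = np.mean(upper - lower)            # average band width
    aril = np.mean((upper - lower) / obs)   # average relative interval length
    return cr, abw, aril
```

Evaluated per year and separately for dry and wet weather periods, these three numbers reproduce the quantities plotted in Figs. 6 and 7.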
Fig. 1. The Ballerup catchment area.

Fig. 2. The conceptual model.

Fig. 6. CR (upper panels) and ARIL (lower panels) vs. the number of retained parameter sets in the calibration year (2007) and the two validation years (2008 and 2009) for the total 6-month period (left panels), the dry weather periods (middle panels) and the wet weather periods (right panels).

Likelihood measure vs. number of retained parameter sets
Out of 200 000 sampled parameter sets, 18 720 returned positive likelihood values (as defined in Eq. 2). Note that we decided to limit the number of behavioural parameter sets to 10 000, although more parameter sets could have been included. The overall peak likelihood was found to be 0.2644. Figure 4 shows how the overall likelihood, L, the dry weather likelihood, L_dw, and the wet weather likelihood, L_ww, generally decrease with an increasing number of retained parameter sets. Note how both L_dw and L_ww vary up and down, in the range of 0.2-0.6 for L_dw and 0.1-0.6 for L_ww, as more parameter sets are included, and that the decrease in overall likelihood can primarily be attributed to a decrease in L_ww.

Dotty plots, correlation structure and posterior parameter sets
Figure 5 shows dotty plots of the wet weather parameters (upper part) and wastewater parameters (lower part), respectively. Dots are marked according to the number of parameter sets retained, but for clarity we decided to limit the classification of the shown dots to dim{Θ_B} = {500; 3000; 10 000}. Thus, the best 500 parameter sets (with the 500 highest likelihoods) have been coloured black, the best 501-3000 parameter sets dark grey, and the best 3001-10 000 parameter sets light grey. White areas reflect the parameter space where the likelihood measure is below that of the best 10 000 parameter sets. Histograms have been generated for each parameter and marked in accordance with the number of retained parameter sets.

Table 4. Choice of prior parameter ranges.

The histograms for the dry weather model parameters (Fig. 5, bottom) are all quite peaky, showing well-defined posterior ranges and no parameter correlation. However, the histograms for the wet weather parameters (Fig. 5, top) … For the validation year 2008, similar overall ARIL values are obtained, while consistently higher values are found for the validation year 2009 at all K values, indicating that something may have changed in the system.

When considering dry weather periods only (Fig. 6, middle top), it proved possible to cover the desired 90 % (91.2 % exactly) of the observations during the calibration period by retaining 10 000 behavioural parameter sets. Note how the difference between the dry weather CR curves in the validation years decreases as the number of parameter sets approaches 10 000. However, the dry weather CRs of the validation years are consistently lower, reaching a maximum of 80 % with 10 000 parameter sets retained. This inconsistency is unexpected, because changes in the dry weather flow level or flow pattern normally occur due to changes in population size or in the water consumption pattern, which could not be confirmed for the studied period. Other explanations could be changes in measurement conditions, such as calibration of the flow meter or flow meter placement in the pipe, or infiltration inflow occurring at a timescale longer than can be accounted for with this model. Does the observed inconsistency suggest an inability of the GLUE methodology to fully describe the uncertainty of the system? We take a closer look at this by considering selected hydrographs in Sect. 4.5 and conclude on this question in Sect. 4.8. The maximum dry weather ARIL (Fig. 6, middle bottom) was found to be 0.65 for the calibration year 2007. A similar pattern was found for the validation year 2008 but not for 2009, which had consistently higher ARIL values and lower coverage, similar to what was found for the overall ARIL.
When considering the wet weather periods only, CR is generally lower than for the dry weather periods and for the simulated periods as a whole (Fig. 6, right top). The CR curves flatten out already after 1000-3000 retained parameter sets, at a level of just above 50 % for the calibration year, and 55 and 50 % for the validation years. This poor coverage may be caused by a misfit between the recorded rainfall and the measured runoff for heavy convective rain events with limited spatial extent, where the two rain gauges do not represent the effective rainfall over the catchment well because they are located several kilometres away. Wider prior parameter ranges could perhaps have increased the coverage. Note also from this plot how the consistency between calibration and validation years increases as the number of retained parameter sets is increased. The ARIL (Fig. 6, right bottom) increases to almost 0.6 with 10 000 retained parameter sets in both calibration and validation periods, which is close to the value obtained overall and in dry weather periods alone. The wet weather ARIL values are quite similar between calibration and validation periods.

Dependency on flow magnitude
Figure 7 shows how the performance measures CR, ABW and ARIL change with the flow magnitude, using prediction limits generated from 10 000 parameter sets. Generally, the ABW (middle panel) increases proportionally with the flow, but the ability of the prediction limits to bracket the observations decreases with the flow magnitude (left panel). In the calibration year the CR drops from 90 % in dry weather to just 30 % for flows above 500 L s⁻¹, supporting the suggestion above about the influence of heavy convective rain events, and although the ABW (middle panel) increases from approx. 50 L s⁻¹ in dry weather to 380 L s⁻¹ for flows higher than 500 L s⁻¹, this is not enough to encompass the desired percentage of observations. Again, a wider prior parameter space could probably increase CR, but a likelihood measure that favours enclosure of the largest events would also increase CR at higher flow rates. Interestingly, the ARIL (right panel) is rather constant in the calibration year 2007; that is, the uncertainty of flow predictions with the model used here is almost proportional to the flow magnitude. The validation years show some deviations from the calibration year, which may be attributed to the small sample sizes used to compute the performance measures, especially for the larger flow intervals, as well as to differences in precipitation recorded at the two gauges and artifacts associated with individual rain events. Whereas Fig. 7 (middle) shows the average band width (ABW) for different flow intervals, calculated as an average over all rain events in each year, Fig. 8 illustrates for each year how the band width evolves from time step to time step. It is seen that the average values actually mask some large fluctuations in the modelled band width. The "traces" of connected data illustrate how the band width evolves during individual rain events, how the band width generally increases with flow magnitude (corresponding to what is seen in Fig. 7, middle), and how less precipitation in 2008 and 2009 than in 2007 led to smaller flows and band widths.
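The three performance measures can be computed directly from the observed flow series and the two prediction-limit series. The sketch below assumes the common definitions (CR: fraction of observations inside the limits; ABW: mean band width; ARIL: mean band width relative to the observed flow); array names are illustrative:

```python
import numpy as np

def performance_measures(obs, lower, upper):
    """CR, ABW and ARIL for observed flows and 90 % prediction limits.

    obs, lower, upper: 1-D arrays (same units, e.g. L/s) giving the
    observation and the lower/upper prediction limit at each time step.
    """
    obs, lower, upper = map(np.asarray, (obs, lower, upper))
    inside = (obs >= lower) & (obs <= upper)
    cr = inside.mean()                     # containment ratio
    abw = (upper - lower).mean()           # average band width
    aril = ((upper - lower) / obs).mean()  # average relative interval length
    return cr, abw, aril

# illustrative values only
obs   = np.array([100.0, 120.0, 300.0, 550.0])
lower = np.array([ 90.0, 100.0, 250.0, 400.0])
upper = np.array([140.0, 160.0, 380.0, 520.0])
cr, abw, aril = performance_measures(obs, lower, upper)
```

Restricting the input arrays to dry weather time steps, wet weather time steps, or a given flow interval yields the stratified curves of Figs. 6 and 7.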

Analysis of hydrographs from calibration and validation periods
Figure 9 shows the rainfall input (accumulated rainfall per event for each rain gauge), flow observations and generated 90 % prediction limits for the whole calibration period (top panel) and an enlargement of a period with the largest events recorded (bottom panel).
The dry weather observations (flows of less than 150 L s⁻¹) are generally well covered by the prediction limits, which was also concluded from the performance measures (Figs. 6 and 7), but they seem to be close to the upper prediction limit in April and to the lower prediction limit in October, indicating that the mean dry weather flow declines gradually during the period. Accounting for this trend in the dry weather model could perhaps have resulted in a smaller ABW and a higher CR for dry weather periods.
The wet weather flows (flows higher than 150 L s⁻¹) are well covered for some events, for example the events shown in the first half of the lower panel of Fig. 9, where 35-40 mm rainfall was recorded, but for the remaining events shown the observed peaks are higher than the upper prediction limit, the hydrograph tails are longer than the model suggests, and the flow furthermore fluctuates in a way that cannot be described with the model used. The fast time constant K_f as well as the impermeable area A_f (or perhaps also A_s and K_s) would need to be much larger for the prediction intervals to cover the last event shown (lower panel). This event, as well as the other events shown, explains why neither the dotty plots nor the histograms in Fig. 5 (top) were able to clearly identify a higher likelihood area for these parameters. There is also the possibility of backwater effects in the system, which are not dealt with in the model; this could perhaps explain the long tail of the last flow hydrograph seen in Fig. 9 (lower panel). However, it cannot be excluded that the flow measurements are erroneous, or that the measured rainfall is non-representative (the two gauges measured about 50 and 65 mm rainfall, i.e. a convective rainfall pattern with large spatial variation is likely).
Figure 10 shows the rainfall input, flow observations and generated 90 % prediction limits for selected periods in the validation year 2008, where rain gauge P 316 was malfunctioning for an extended period. The smallest ABW and ARIL for 2008 (Fig. 7, middle and right) occur for the highest observed flow category (> 500 L s⁻¹), which is due to the high flow observations on 11 July, where only 5 mm rainfall was recorded at the two gauges (Fig. 10, left); this is also visible as the isolated, flat "trace" in Fig. 8 (middle). In this case a large convective rainfall event with limited spatial extent may have passed over the catchment without significantly affecting the rain recordings, or the flow observations are erroneous. Figure 10 (right) shows several significant flow events in August 2008 where gauge P 316 did not record any rainfall at all, probably due to technical malfunctioning, and this causes the flow predictions to be underestimated (the flow observations are consistently close to, or above, the upper prediction limit for all the illustrated rain events).
Figure 11 shows the rainfall input, flow observations and generated 90 % prediction limits in the second validation year 2009, for a selected period where both the dry and wet weather flows were well covered by the prediction limits (left) and for the period where the largest event in 2009 occurred (right). In the latter case the gauges recorded very different rainfall amounts (50 and 100 mm), and the model underestimated the peak and failed to capture the timing and long tail of the observed hydrograph, which explains the S-shaped "trace" visible in Fig. 8 (right). Note also from the left panel that the flow observations in dry weather are very low and close to the lower bound, which is general for 2009. The lower dry weather flow in 2009 explains the higher ARIL values obtained in dry weather periods of 2009 that were observed in Fig. 6.

Interpretation of posterior parameter ranges
In Sect. 4.2 we saw that the posterior ranges approached the priors for many of the wet weather parameters when retaining just 500 parameter sets. With the large uncertainties that originate from inadequate rain inputs (spatial heterogeneity not represented by two rain gauges), as well as flow measurement errors and the possible model structure inadequacies discussed above, it is hardly surprising that the posterior parameter ranges become so wide and the dotty plots look so scattered. It is important to recognize that each parameter set carries with it an implicit (non-stationary) error series, and that models (parameter sets) that underpredict in calibration are expected to underpredict in similar circumstances in prediction. The uncertainties are not transferred to the model parameters; rather, the error series are (implicitly) weighted along with the simulated outputs from each model. The posterior parameters lack physical interpretation because of parameter compensation and thus cannot be used, for example, for inference about the relative size of the infiltration area versus the size of the paved area, which would otherwise be desired knowledge. Such parameter compensations will be apparent in any calibration exercise unless prior knowledge about what is acceptable or not for parameters and their interactions can be specified.
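The weighting of simulated outputs mentioned above is how GLUE prediction limits are usually constructed: at each time step the simulated flows from the retained behavioural sets form a likelihood-weighted cumulative distribution, and the 5 % and 95 % quantiles of that distribution give the 90 % limits. A minimal sketch for one time step (variable names illustrative):

```python
import numpy as np

def glue_limits(sims, likelihoods, lo=0.05, hi=0.95):
    """90 % GLUE prediction limits at a single time step.

    sims: simulated flows from the N retained behavioural parameter sets.
    likelihoods: their non-negative likelihood weights.
    """
    sims = np.asarray(sims, dtype=float)
    w = np.asarray(likelihoods, dtype=float)
    order = np.argsort(sims)          # sort the ensemble by simulated flow
    sims, w = sims[order], w[order]
    cdf = np.cumsum(w) / w.sum()      # likelihood-weighted empirical CDF
    lower = sims[np.searchsorted(cdf, lo)]
    upper = sims[np.searchsorted(cdf, hi)]
    return lower, upper
```

Applying this at every time step yields the prediction-limit series shown in Figs. 9-11; note that the implicit error series of each parameter set enters only through its likelihood weight.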

Consistency in model failure?
Due to the rather consistent coverage of the observations between different periods for different behavioural thresholds, it might be worth investigating whether some form of non-stationary error correction can improve the results. This could be done, for example, by applying a transform or bias correction to the results. One way to check the model for non-stationary errors is to subdivide the whole range of observed runoff into an appropriate number of smaller intervals (windows), for example 0-50 L s⁻¹, 50-75 L s⁻¹, etc., and then calculate the residuals between simulated and observed runoffs in each window. Figure 12 shows such a residual histogram with 10 windows. Clearly, the residuals are not randomly distributed around zero for any of the windows. In each window the need for bias correction is different, which is understood from the appearance of the histograms and the mean of the residuals µ. A negative µ means that the model underestimates the flows, and vice versa. For dry weather flows, that is, flows less than 150 L s⁻¹, the model performs reasonably well, but for larger flows the model tends to underestimate the runoff. This suggests that some form of non-stationary error correction might be possible, for example as outlined in Xiong and O'Connor (2008). Such a bias correction would probably improve the calibration results, but generally there are more bins with positive outliers than negative, which indicates that a bias correction is not trivial. A bias correction implementation is, however, beyond the scope of this paper.
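The windowed residual check described above can be sketched as follows; the window edges and array names are illustrative, not those used in the study:

```python
import numpy as np

def windowed_residual_means(obs, sim, edges):
    """Mean residual (sim - obs) within each observed-flow window.

    A negative mean indicates that the model underestimates the flow
    in that window.  edges: window boundaries for the observed flow,
    e.g. [0, 50, 75, ...] in L/s.
    """
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    residuals = sim - obs
    means = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (obs >= lo) & (obs < hi)   # time steps falling in this window
        means.append(residuals[mask].mean() if mask.any() else np.nan)
    return means
```

Plotting a histogram of the residuals in each window, together with its mean µ, reproduces the kind of diagnostic shown in Fig. 12.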

Epistemic uncertainties
The evidence from the changing nature of the errors in this study, between and within periods (epistemic uncertainties), suggests that it is very difficult to test GLUE's ability to provide uncertainty bounds that bracket observations in both calibration and validation; this would also be the case if a formal likelihood approach had been applied. The experiences from this investigation suggest that calibration of much more complex models (physically distributed, hydrodynamic) used in practical urban drainage engineering, in catchments with insufficient rain gauge coverage and questionable flow measurements from shorter measuring campaigns, is problematic, not least because a calibrated model normally implies a reduction in the safety factor used in modelling of urban drainage systems in Denmark (Hansen et al., 2005).

Conclusions
In this study a simple conceptual hydrological model has been applied to simulate flow in a sewer system that receives water from both combined and separated catchments. The GLUE methodology was applied to assess the uncertainty of flow simulation and parameter estimation. To be able to derive the behavioural parameters, a combined likelihood measure was formulated. For the dry weather flow periods the Nash-Sutcliffe model efficiency coefficient was used, whereas an exponential likelihood measure, which has the property of fitting the peaks better, was used for the wet weather periods. Instead of preselecting the number of behavioural parameter sets, it was decided to retain an increasing proportion of parameter sets (100; 500; 1000; 3000; 6000; 10 000), ideally until the GLUE generated 90 % prediction limits encompassed 90 % of the observations. However, as the overall CR curve was shown to flatten out at 10 000 retained parameter sets, this number was deemed a sufficient maximum. The overall CR increased from approx. 58 % to 80 % as the proportion of behavioural parameter sets included increased from 100 to 10 000, and hence it was not possible to obtain the desired coverage. Considering dry weather periods separately, the prediction limits generated from 10 000 parameter sets enclosed a little more than 90 % of the observations, while in wet weather periods on average only around 55 % was enclosed. Furthermore, the proportion of observations enclosed decreased with increasing flow magnitude, even though the prediction limits expanded proportionally with the flow. Two subsequent half-year summer periods were included for validation to check the consistency of the GLUE generated prediction limits. It was concluded that the overall CRs obtained in the validation periods were similar to those obtained in calibration for all the considered proportions of retained parameter sets, and thus good consistency was found. However, when looking separately at dry weather and wet weather periods, as well as at different flow levels, several inconsistencies were observed between calibration and validation periods. These inconsistencies could in dry weather presumably be attributed to changes in measurement conditions, and in wet weather to inadequate rain input coverage, unreliable flow meter measurements, and/or model deficiencies (e.g. backwater effects not accounted for). Retaining just 500 parameter sets meant that the wet weather posterior parameter ranges approached those of the priors, which is a clear sign of equifinality. Hence the obtained posterior parameter ranges cannot be used for interpretation of, for example, the size of the contributing paved area vs. the size of the slow infiltration-inflow area. The posterior wastewater parameter limits were generally better determined.
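The two likelihood measures recalled above can be sketched as follows. The Nash-Sutcliffe coefficient is standard; the exact exponential measure used in the study is defined earlier in the paper, so the form below (an exponential of the scaled error variance, a common choice in GLUE applications) is illustrative only:

```python
import numpy as np

def nash_sutcliffe(obs, sim):
    """Nash-Sutcliffe efficiency: 1 = perfect fit, <= 0 = no better
    than simply using the mean of the observations."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def exponential_likelihood(obs, sim, shape=1.0):
    """Exponential-type likelihood exp(-shape * var(errors) / var(obs)).

    Large errors are penalised exponentially, so parameter sets that
    miss the flow peaks receive very low weight.  Illustrative form,
    not necessarily the exact measure used in the study.
    """
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.exp(-shape * np.var(obs - sim) / np.var(obs)))
```

Evaluating the first measure on dry weather periods and the second on wet weather periods, and combining them, gives one likelihood value per sampled parameter set.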
The observed inconsistencies between calibration and validation periods indicated by CR and ARIL would most likely also have been observed had a formal approach been chosen, simply because events such as a suddenly lower dry weather flow or malfunctioning rain gauges in a validation period are unexpected (epistemic) events and cannot be predicted from a set of calibration data. Hence we cannot reject the GLUE methodology as a tool for uncertainty analysis on the basis of this study. The evidence from the changing nature of the errors in this study, between and within periods, suggests that it might be very difficult to find a valid error model for use in a formal likelihood approach, and that we should therefore try to learn from the significant discrepancies between model and observations. One way to do so could be to use some non-stationary error correction procedure to improve the predictive capability. This could even be applied separately to wet and dry weather periods: if the model consistently underpredicts for certain types of periods and events, this could be accounted for, and the reasons why such a correction improves predictions could then be analysed and discussed. This was, however, beyond the scope of this paper.
In practical urban drainage engineering applications, it is not uncommon that large hydrodynamic models with many more parameters are calibrated to flow data collected from measuring campaigns of shorter duration than used here, with equally poor rain input representation. Bearing in mind that these models are indispensable tools in redesign and upgrade proposals, and are sometimes used for flow forecasting, this study suggests that it is crucial to (1) obtain more representative rain inputs (perhaps from weather radar), (2) use more reliable flow meters and (3) replace measuring campaigns with online monitoring to ensure higher coherence between model simulations and observations.

Fig. 4. Likelihood vs. number of retained parameter sets. Shown for the overall likelihood (L), dry weather likelihood (L_dw) and wet weather likelihood (L_ww).

Hydrol. Earth Syst. Sci., 17, 4159-4176, 2013. www.hydrol-earth-syst-sci.net/17/4159/2013/

Fig. 8. Variation in band width versus observed flow magnitude. Band width is calculated from 10 000 behavioural parameter sets.

Fig. 9. Rainfall input, flow observations and 90 % flow prediction limits generated from 10 000 parameter sets. Whole calibration period (top) and enlargement for a period with wet weather flow conditions (bottom). Periods without flow observations were discarded from the analysis.

Fig. 10. Rainfall input, flow observations and 90 % flow prediction limits generated from 10 000 parameter sets for selected periods in the validation year 2008.
Fig. 11. Rainfall input, flow observations and 90 % flow prediction limits generated from 10 000 parameter sets for selected periods in the validation year 2009.

Fig. 12. Histogram and mean of residuals µ for each window. The observed runoffs were subdivided into 10 intervals (windows) and the residuals calculated for all 10 000 simulations.


Table 4. Choice of prior parameter ranges.

Table 7. Correlation between parameters based on 10 000 retained parameter sets.