Effects of hydrologic conditions on SWAT model performance and parameter sensitivity for a small, mixed land use catchment in New Zealand

The Soil and Water Assessment Tool (SWAT) was configured for the Puarenga Stream catchment (77 km²), Rotorua, New Zealand. The catchment land use is mostly plantation forest, some of which is spray-irrigated with treated wastewater. A Sequential Uncertainty Fitting (SUFI-2) procedure was used to auto-calibrate unknown parameter values in the SWAT model. Model validation was performed using two datasets: (1) monthly instantaneous measurements of suspended sediment (SS), total phosphorus (TP) and total nitrogen (TN) concentrations; and (2) high-frequency (1-2 h) data measured during rainfall events. Monthly instantaneous TP and TN concentrations were generally not reproduced well (24 % bias for TP, 27 % bias for TN, and R² < 0.1, NSE < 0 for both TP and TN), in contrast to SS concentrations (< 1 % bias; R² and NSE both > 0.75) during model validation. Comparison of simulated daily mean SS, TP and TN concentrations with daily mean discharge-weighted high-frequency measurements during storm events indicated that model predictions during the high rainfall period considerably underestimated concentrations of SS (44 % bias) and TP (70 % bias), while TN concentrations were comparable (< 1 % bias; R² and NSE both ~0.5). This comparison highlighted the potential for model error associated with quick-


Introduction
Catchment models are valuable tools for understanding natural processes occurring at basin scales and for simulating the effects of different management regimes on soil and water resources (e.g. Cao et al., 2006). Model applications may have uncertainties as a result of errors associated with the forcing variables, measurements used for calibration, and conceptualisation of the model itself (Lindenschmidt et al., 2007). The ability of catchment models to simulate hydrological processes and pollutant loads can be assessed through analysis of uncertainty or errors during a calibration process that is specific to the application domain (White and Chaubey, 2005).

Published by Copernicus Publications on behalf of the European Geosciences Union.
The Soil and Water Assessment Tool (SWAT) model is increasingly used to predict discharge, sediment and nutrient loads on a temporally resolved basis and to quantify material fluxes from a catchment to the downstream receiving environment such as a lake (e.g. Nielsen et al., 2013). The SWAT model is physically based and provides distributed descriptions of hydrologic processes at sub-basin scale (Arnold et al., 1998; Neitsch et al., 2011). It has numerous parameters, some of which can be fixed on the basis of pre-existing catchment data (e.g. soil maps) or knowledge gained in other studies. However, values for other parameters need to be assigned during a calibration process as a result of complex spatial and temporal variations that are not readily captured either through measurements or within the model algorithms themselves (Boyle et al., 2000). Such parameter values assigned during calibration are therefore lumped, i.e. they integrate variations in space and/or time and thus provide an approximation for real values which often vary widely within a study catchment.

Model calibration is an iterative process whereby parameters are adjusted to the system of interest by refining model predictions to fit closely with observations under a given set of conditions (Moriasi et al., 2007). Manual calibration depends on the system used for model application, the experience of the modellers, and knowledge of the model algorithms. It tends to be subjective and time-consuming. By contrast, auto-calibration provides a less labour-intensive approach by using optimisation algorithms (Eckhardt and Arnold, 2001). The Sequential Uncertainty Fitting (SUFI-2) procedure has previously been applied to auto-calibrate discharge parameters in a SWAT application for the Thur River, Switzerland (Abbaspour et al., 2007), as well as for groundwater recharge, evapotranspiration and soil storage water considerations in western Africa (Schuol et al., 2008). Model validation is subsequently performed using measured data that are independent of those used for calibration (Moriasi et al., 2007).
Values of hydrological parameters in the SWAT model can vary temporally. Cibin et al. (2010) found that the optimum calibrated values for hydrological parameters varied with different flow regimes (low, medium and high), thus suggesting that SWAT model performance can be optimised by assigning parameter values based on hydrological characteristics. Other work has similarly demonstrated benefits from assigning separate parameter values to low, medium, and high discharge periods (Yilmaz et al., 2008), or based on whether a catchment is in a dry, drying, wet or wetting state (Choi and Beven, 2007). Such temporal dependence of model parameterisation on hydrologic conditions has implications for model performance. Krause et al. (2005) compared different statistical metrics of hydrological model performance separately for base flow periods and storm events. The authors found that the logarithmic form of the Nash-Sutcliffe efficiency (NSE) value provided more information on the sensitivity of model performance for discharge simulations during storm events, while the relative form of NSE was better for base flow periods. Similarly, Guse et al. (2014) investigated temporal dynamics of sensitivity of hydrological parameters and SWAT model performance using a Fourier amplitude sensitivity test (Reusser et al., 2011) and cluster analysis (Reusser et al., 2009). The authors found that three groundwater parameters were highly sensitive during quick flow, while one evaporation parameter was most sensitive during base flow, and model performance was also found to vary significantly for the two flow regimes. Zhang et al. (2011) calibrated SWAT hydrological parameters for periods separated on the basis of six climatic indexes. Model performance improved when different values were assigned to parameters based on six hydroclimatic periods. Similarly, Pfannerstill et al. (2014) found that assessment of model performance was improved by considering an additional performance statistic for very low flow simulations amongst five hydrologically separated regimes.
To date, analysis of temporal dynamics of SWAT parameters has predominantly focussed on simulations of discharge rather than water quality constituents. This partly reflects the paucity of comprehensive water quality data for many catchments; near-continuous discharge data can readily be collected but this is not the case for water quality parameters such as suspended sediment or nutrient concentrations. Data collected in monitoring programmes that involve sampling at regular time intervals (e.g. monthly) are often used to calibrate water quality models, but these are unlikely to fully represent the range of hydrologic conditions in a catchment (Bieroza et al., 2014). In particular, water quality data collected during storm flow periods are rarely available for SWAT calibration, thus prohibiting opportunities to investigate how parameter sensitivity varies under conditions which can contribute disproportionately to nutrient or sediment transport, particularly in lower-order catchments (Chiwa et al., 2010; Abell et al., 2013). Failure to fully consider storm flow processes could therefore result in overestimation of model performance. Thus, further research is required to examine how water quality parameters vary during different flow regimes and to understand how model uncertainty may vary under future climatic conditions that affect discharge regimes (Brigode et al., 2013).
In this study, the SWAT model was configured to a relatively small, mixed land use catchment in New Zealand that has been the subject of an intensive water quality sampling programme designed to target a wide range of hydrologic conditions. A catchment-wide set of parameters was calibrated using the SUFI-2 procedure which is integrated into the SWAT Calibration and Uncertainty Program (SWAT-CUP). The objectives of this study were to (1) quantify the performance of the model in simulating discharge and fluxes of suspended sediments and nutrients at the catchment outlet, (2) rigorously evaluate model performance by comparing daily simulation output with monitoring data collected under a range of hydrologic conditions, and (3) quantify whether parameter sensitivity varies between base flow and quick flow conditions.

Figure caption (fragment): ... 3) used to calibrate the SWAT model were from the Forest Research Institute (FRI) stream gauge and were considered representative of the downstream/outlet conditions of the Puarenga Stream.

Study area
The Puarenga Stream is the second-largest surface inflow to Lake Rotorua (Bay of Plenty, New Zealand) and drains a catchment of 77 km². The catchment is situated in the central North Island of New Zealand, which has a warm temperate climate. Annual mean temperature at Rotorua Airport (Fig. 1a) is 15 ± 4 °C and annual mean evapotranspiration is 714 mm yr⁻¹ (1993-2012; National Climatic Data Centre; available at http://cliflo.niwa.co.nz/). Annual mean precipitation at the Kaituna rain gauge (Fig. 1a) is 1500 mm yr⁻¹ (1993-2012; Bay of Plenty Regional Council). The catchment is relatively steep (mean slope = 9 %; Bay of Plenty Regional Council) with predominantly pumice soils that have high macroporosity, resulting in high infiltration rates and substantial subsurface lateral flow contributions to stream channels. Two cold-water springs (Waipa Spring and Hemo Spring) and one geothermal spring (Fig. 1b) are located in the catchment area. The two cold-water springs have annual mean discharge of ∼ 0.19 m³ s⁻¹ (Rotorua District Council) and the geothermal spring has annual mean discharge of ∼ 0.12 m³ s⁻¹ (White et al., 2004).
The predominant land use (47 %) is exotic forest (Pinus radiata). Approximately 26 % is managed pastoral farmland, 11 % mixed scrub and 9 % indigenous forest. Since 1991, treated wastewater has been pumped from the Rotorua Wastewater Treatment Plant and spray-irrigated over 16 blocks with a total area of 1.93 km² in the Whakarewarewa Forest (Fig. 1a). Following this, it took approximately 4 years before elevated nitrate concentrations were measured in the receiving waters of the Puarenga Stream (Lowe et al., 2007). Prior to 2002, the irrigation schedule entailed applying wastewater to two blocks per day so that each block was irrigated approximately weekly. Since 2002, 10-14 blocks have been irrigated simultaneously at daily frequency. Over the entire period of irrigation, nutrient concentrations in the irrigated water have gradually decreased as improvements in treatment of the wastewater have been made (Lowe et al., 2007).
Measurements from the Forest Research Institute (FRI) stream gauge (1.7 km upstream of Lake Rotorua; Fig. 1b) were considered representative of the downstream/outlet conditions of the Puarenga Stream. The FRI stream gauge was closed in mid-1997, then reopened late in 2004 (Environment Bay of Plenty, 2007). Annual mean discharge at this site is 2.0 m³ s⁻¹ (1994-1997 and 2004-2008; Bay of Plenty Regional Council). The Puarenga Stream receives a high proportion of flow from groundwater stores and has only moderate seasonality in discharge. On average, the lowest mean daily discharge is during summer (December-February; 1.7 m³ s⁻¹) and the highest mean daily discharge is during winter (June-August; 2.4 m³ s⁻¹). Discharge records during 1998-2004 were intermittent and this precluded a detailed comparison of measured and simulated discharge during that period. In July 2010, the gauge was repositioned 720 m downstream to the State Highway 30 (SH 30) bridge (Fig. 1b).

Model configuration
SWAT input data requirements included a digital elevation model (DEM), meteorological records, records of springs and water abstraction, soil characteristics, land use classification, and management schedules for key land uses (pastoral farming, wastewater irrigation, and timber harvesting). The SWAT model (version SWAT2009_rev488) was run on an hourly time step, but daily mean simulation outputs were used for this study.
The DEM was used to delineate boundaries of the whole catchment and individual sub-catchments, with a stream map used to "burn-in" channel locations to create accurate flow routings. Hourly rainfall estimates were used as hydrologic forcing data. The Penman-Monteith method (Monteith, 1965) was used to calculate evapotranspiration (ET) and potential ET. The Green and Ampt (1911) method was used to calculate infiltration, rather than the SCS (Soil Conservation Service) curve number method. Therefore, the hourly rainfall/Green and Ampt infiltration/hourly routing method (Neitsch et al., 2011) was chosen to simulate upland and in-stream processes. Ten sub-catchments were represented in the Puarenga Stream catchment, each comprising numerous hydrologic response units (HRUs). Each HRU aggregates cells with the same combination of land cover, soil, and slope. A total of 404 HRUs was defined in the model. Runoff and nutrient transport were predicted separately within SWAT for each HRU, with predictions summed to obtain the total for each sub-catchment.
Descriptions and sources of the data used to configure the SWAT model are given in Table 1. There were a total of 197 model parameters. Values of SWAT parameters were assigned based on (i) measured data (e.g. some of the soil parameters; Table 1), (ii) literature values from published studies of similar catchments (e.g. parameters for dominant land uses; Table 2), or (iii) by calibration where parameters were not otherwise prescribed.
SWAT simulates loads of "mineral phosphorus" (MINP) and "organic phosphorus" (ORGP), the sum of which is total phosphorus (TP). The MINP fraction represents soluble P either in mineral or in organic form, while ORGP refers to particulate P bound either by algae or by sediment (White et al., 2014). Soluble P may be taken up during algae growth, or released from benthic sediment. This fraction can be transformed to particulate P contained in algae or sediment.
SWAT simulates loads of nitrate-nitrogen (NO₃-N), ammonium-nitrogen (NH₄-N) and organic nitrogen (ORGN), the sum of which is total nitrogen (TN). Nitrogen parameters were auto-calibrated for each N fraction. The SWAT model does not account for the initial nitrate concentration in shallow aquifers, as also noted by Conan et al. (2003). Ekanayake and Davie (2005) indicated that SWAT underestimated N loading from groundwater and suggested a modification by adding a background concentration of nitrate in streamflow to represent groundwater nitrate contributions. Over the first 5 years of wastewater irrigation, nitrate concentrations in shallow groundwater draining the Waipa Stream sub-catchment were estimated to have increased by ∼ 0.44 mg N L⁻¹ (Paku, 2001). SWAT has no capability to dynamically adjust the groundwater concentration during a simulation run. Therefore, we added 0.44 mg N L⁻¹ to all model simulations of TN concentration, assuming that groundwater concentrations had equilibrated with the applied wastewater nitrogen.
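The TN bookkeeping described above (sum of the three simulated N fractions plus the constant groundwater offset) can be illustrated with a minimal sketch. The function name and the example fraction values are hypothetical; only the 0.44 mg N L⁻¹ offset comes from the text.

```python
# Constant background nitrate added to every simulated TN concentration
# (mg N/L), as stated in the text.
GROUNDWATER_NO3_OFFSET = 0.44


def total_nitrogen(no3_n: float, nh4_n: float, org_n: float,
                   offset: float = GROUNDWATER_NO3_OFFSET) -> float:
    """Return TN (mg N/L) as NO3-N + NH4-N + ORGN plus the groundwater offset."""
    return no3_n + nh4_n + org_n + offset


# Hypothetical simulated daily mean fractions (mg N/L), for illustration only.
tn_example = total_nitrogen(no3_n=1.00, nh4_n=0.10, org_n=0.30)
```

Because the offset is constant, it shifts simulated concentrations uniformly and does not respond to changes in irrigation loading during a run.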

Model calibration and validation
Daily mean discharge was first calibrated based on daily mean values of 15 min measurements (Table 3). Water quality variables were then calibrated in the sequence: SS, TP and TN. Modelled mean daily concentrations were compared with concentrations measured during monthly grab sampling, with monthly measurements assumed equal to daily mean concentrations (Table 3). One year (1993) was used for model warmup. The calibration period was from 2004 to 2008 and the validation period was from 1994 to 1997.
A validation period that pre-dated the calibration period was chosen because discharge records were available for two separate periods (1994-1997 and post-2004). In addition, the operational regime for the wastewater irrigation has varied since operations began in 1991, with a marked change occurring in 2002 when operations switched from applying the wastewater load to 2 blocks (rotated daily for a total of 14 blocks in a week; i.e. each block irrigated weekly), to 10-14 blocks each irrigated daily. This operational regime continues today and we therefore decided to assign the most recent (post-2002) period (2004-2008) to calibration to ensure that the model was configured to reflect current operations.
Parameter values that were not derived from measurements or the literature were assigned based on either automated or manual calibration (Table 4). Manual calibration was undertaken for 11 parameters related to TP, while a Sequential Uncertainty Fitting (SUFI-2) procedure was applied to auto-calibrate 21 parameters for discharge simulations, 9 parameters for SS simulations, and 17 parameters related to TN. The SUFI-2 procedure has been integrated into the SWAT Calibration and Uncertainty Program (SWAT-CUP). SUFI-2 is a procedure that efficiently quantifies and constrains parameter uncertainties/ranges from default ranges with the fewest number of iterations (Abbaspour et al., 2004), and has been shown to provide optimal results relative to the use of alternative algorithms (Wu and Chen, 2015). SUFI-2 involves Latin hypercube sampling (LHS), which is a method that generates a sample of plausible parameter values from a multidimensional distribution and ensures that samples cover the entire parameter space, therefore ensuring that the optimum solution is not a local minimum (Marino et al., 2008).

The SUFI-2 procedure analyses relative sensitivities of parameters by randomly generating combinations of values for model parameters (Abbaspour, 2015). A sample size of 1000 was chosen for each iteration of LHS, resulting in 1000 combinations of parameters and 1000 simulations. Model performance was quantified for each simulation based on the Nash-Sutcliffe efficiency (NSE). An objective function was defined as a linear regression of a combination of parameter values generated by each LHS against the NSE value calculated from each simulation. Compartments were not weighted in formulating the objective function because only one variable was considered at a time. A parameter sensitivity matrix was then computed based on the changes in the objective function after 1000 simulations. Parameter sensitivity was quantified based on the p value from a Student t test, which was used to compare the mean of simulated values with the mean value of measurements (Rice, 2006). A parameter was deemed sensitive if p ≤ 0.05 after 1000 simulations (one iteration). Numerous iterations of LHS were conducted. Values of p from numerous iterations were averaged for each parameter, and the frequency of iterations where a parameter was deemed sensitive was summed. Rankings of relative sensitivities of parameters were developed based on how frequently the sensitive parameter was identified and the averaged value of p calculated from several iterations. The most sensitive parameter was determined based on the frequency that the parameter was deemed sensitive and the smallest average p value from all iterations.

Table 3 (partial).
Stream water quality measurements | Calibration (2004-2008); Validation* (1994-1997; 2010-2012) | Monthly grab samples for determination of suspended sediment (SS), total phosphorus (TP) and total nitrogen (TN) concentrations (1994-1997; 2004-2008), and high-frequency event-based samples for concentrations of SS (9 events), TP and TN (both 14 events) at 1-2 h frequency (2010-2012), were also measured at FRI stream gauge (Fig. 1b) within the catchment. | BoPRC; Abell et al. (2013)
* Model validation was undertaken using two different data sets. The monthly measurements (1994-1997) were predominantly collected when base flow was the dominant contributor to stream discharge. Data from high-frequency sampling during rain events (2010-2012) were also used to validate model performance during periods when quick flow was high.

SUFI-2 considers two criteria to constrain uncertainty in each iteration. One is the P factor, the percentage of measured data bracketed by 95 % prediction uncertainty (95PPU). Another is the R factor, the average thickness of the 95PPU band divided by the standard deviation of measured data. A range was first defined for each parameter based on a synthesis of ranges from similar studies or from the SWAT default range. Parameter ranges were updated after each iteration based on the computation of upper and lower 95 % confidence limits. The 95 % confidence interval and the standard deviation of a parameter value were derived from the diagonal elements of the covariance matrix, which was calculated from the sensitivity matrix and the variance of the objective function. Steps and equations used in the SUFI-2 procedure to constrain parameter ranges are outlined by Abbaspour et al. (2004). The total number of iterations performed for each simulated variable (Q, SS, MINP, ORGN, NH₄-N and NO₃-N) reflected the numbers required to ensure that > 90 % of measured data were bracketed by simulated output and the R factor was close to one. The "optimal" parameter value was obtained when the NSE criterion was satisfied (NSE > 0.5; Moriasi et al., 2007). Auto-calibrated parameters for simulations of Q, SS, and TN were changed by absolute values within the given ranges. Some of those given ranges were restricted based on the optimum values calibrated in similar studies. Parameter values for TP simulations were manually calibrated based on the relative percent deviation from the predetermined values of those auto-calibrated parameters for MINP simulations, given by the objective functions (e.g. NSE). Parameters related to the physical characteristics of the catchment were not changed because their values were considered to be representative of the catchment characteristics.

In addition, high-frequency (1-2 h) water quality sampling was undertaken at the FRI stream gauge during 2010-2012 (Table 3) to derive estimates of daily mean contaminant loads during storm events. Samples were analysed for SS (9 events), TP and TN (both 14 events) over sampling periods of 24-73 h. The sampling programme was designed to encompass pre-event base flow, storm-generated quick flow and post-event base flow (Abell et al., 2013). These data permitted calculation of daily discharge-weighted (Q-weighted) mean concentrations to compare with modelled daily mean estimates. We did not use the high-frequency observations to calibrate the model, because of the limited number of high-frequency (1-2 h) samples (9 events for SS and 14 events for TP and TN in 2010-2012). The use of the high-frequency observations for model validation allowed us to examine how the model performed during short (1-3 day) high flow periods. The Q-weighted mean concentrations $C_{QWM}$ were calculated as

$$C_{QWM} = \frac{\sum_{i=1}^{n} C_i Q_i}{\sum_{i=1}^{n} Q_i}, \qquad (1)$$

where $n$ is the number of samples, $C_i$ is the contaminant concentration measured at time $i$, and $Q_i$ is the discharge measured at time $i$.
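The discharge-weighted mean concentration defined above can be sketched in a few lines. This is a minimal illustration: the function name and the sample values are hypothetical, and the code assumes that each concentration is paired with a discharge measured at the same time.

```python
from typing import Sequence


def q_weighted_mean(conc: Sequence[float], discharge: Sequence[float]) -> float:
    """Discharge-weighted mean concentration: sum(C_i * Q_i) / sum(Q_i)."""
    if len(conc) != len(discharge) or len(conc) == 0:
        raise ValueError("conc and discharge must be non-empty and equal in length")
    total_q = sum(discharge)
    if total_q <= 0:
        raise ValueError("total discharge must be positive")
    return sum(c * q for c, q in zip(conc, discharge)) / total_q


# Hypothetical storm samples: concentration (mg/L) and discharge (m3/s).
conc = [0.05, 0.20, 0.10]
discharge = [2.0, 6.0, 2.0]
c_qwm = q_weighted_mean(conc, discharge)

# The unweighted mean of the same samples, for comparison.
c_mean = sum(conc) / len(conc)
```

In this example the weighted mean exceeds the unweighted mean because the highest concentration coincides with the highest discharge, which is the situation the weighting is meant to capture during storm events.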

Hydrograph and contaminant load separation
The Web-based Hydrograph Analysis Tool (Lim et al., 2005) was applied to partition both measured and simulated discharges into base flow ($Q_b$) and quick flow ($Q_q$). An Eckhardt filter parameter of 0.98 and ratio of base flow to total discharge of 0.8 were assumed (cf. Lim et al., 2005). There was a total of 60 days without quick flow during the calibration period (2004-2008). [...] the sensitivity of water quality parameters during base flow and quick flow: $C_b$ for each contaminant was estimated as the average concentration for the 60 days with no quick flow. $C_q$ for each contaminant was calculated by rearranging Eq. (2). To ensure that $C_q$ is positive, $C_b$ is constrained to be the minimum of $C_{sep}$ and $C_{sep}$. Measured and simulated base flow and quick flow contaminant loads were then calculated.
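The hydrograph separation step can be illustrated with the commonly used recursive form of the Eckhardt digital filter, using the two values quoted above (filter parameter 0.98, maximum base flow ratio 0.8). This is a sketch under assumptions: whether the Web-based Hydrograph Analysis Tool applies exactly this form, and how it initialises the first time step, is not stated in the text; the function names and the discharge series are hypothetical.

```python
from typing import List, Sequence


def eckhardt_baseflow(
    discharge: Sequence[float],
    alpha: float = 0.98,    # filter parameter quoted in the text
    bfi_max: float = 0.8,   # ratio of base flow to total discharge quoted in the text
) -> List[float]:
    """Return a base flow series from total discharge (recursive Eckhardt filter)."""
    if not discharge:
        return []
    # Assumption: the first value is treated entirely as base flow.
    base = [discharge[0]]
    for q in discharge[1:]:
        b = ((1.0 - bfi_max) * alpha * base[-1] + (1.0 - alpha) * bfi_max * q) / (
            1.0 - alpha * bfi_max
        )
        # Base flow cannot exceed total discharge.
        base.append(min(b, q))
    return base


def quick_flow(discharge: Sequence[float], base: Sequence[float]) -> List[float]:
    """Quick flow is the remainder of total discharge after removing base flow."""
    return [q - b for q, b in zip(discharge, base)]


# Hypothetical daily mean discharge (m3/s) containing one storm peak.
q_obs = [2.0, 2.0, 6.0, 4.0, 3.0, 2.5, 2.2, 2.0]
q_b = eckhardt_baseflow(q_obs)
q_q = quick_flow(q_obs, q_b)
```

With these settings the filtered base flow responds slowly to the storm peak, so most of the peak is assigned to quick flow; under constant discharge the base flow fraction tends towards the chosen ratio of 0.8.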
A one-at-a-time (OAT) routine proposed by Morris (1991) was applied to investigate how parameter sensitivity varied between the two flow regimes (base flow and quick flow), based on the ranking of relative sensitivities of parameters that were identified by randomly generating combinations of values for model parameters for each individual variable using the SUFI-2 procedure. OAT sensitivity analysis was then employed by varying the parameter of interest among 10 equidistant values within the default range. Following Krause et al. (2005), who used the natural logarithm, the standard deviation (SD) of the ln-transformed NSE was used to indicate parameter sensitivity for the two flow regimes.
Parameters were ranked from most to least sensitive on the basis of the sensitivity metric (SD of ln-transformed NSE), using a value of 0.2 as a threshold above which parameters were deemed particularly "sensitive". The threshold value of 0.2 was chosen in this study, based on the median value derived from the calculations of the SD of ln-transformed NSE. Methods used to separate the two flow constituents and to quantify parameter sensitivity are illustrated in Fig. 2.
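The OAT metric described above can be sketched with a toy model. Several things here are assumptions rather than the study's implementation: "ln-transformed NSE" is interpreted as NSE computed on ln-transformed observed and simulated values (the logarithmic form discussed by Krause et al., 2005), the two toy models stand in for SWAT runs, and all names and numbers are hypothetical. Only the 10 equidistant values and the 0.2 threshold come from the text.

```python
import math
import statistics
from typing import Callable, List, Sequence


def nse(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def ln_nse(obs: Sequence[float], sim: Sequence[float]) -> float:
    """NSE on ln-transformed values (assumed meaning of 'ln-transformed NSE')."""
    return nse([math.log(o) for o in obs], [math.log(s) for s in sim])


def equidistant_values(lower: float, upper: float, n: int = 10) -> List[float]:
    """n equally spaced parameter values spanning [lower, upper]."""
    step = (upper - lower) / (n - 1)
    return [lower + i * step for i in range(n)]


def oat_sensitivity(model: Callable[[float], List[float]], obs: Sequence[float],
                    lower: float, upper: float, n: int = 10) -> float:
    """SD of ln-transformed NSE while one parameter varies and others stay fixed."""
    scores = [ln_nse(obs, model(v)) for v in equidistant_values(lower, upper, n)]
    return statistics.stdev(scores)


# Toy stand-ins for a SWAT run (hypothetical): output depends strongly on one
# parameter and only weakly on the other.
forcing = [1.0, 3.0, 2.0, 5.0, 4.0]
obs = [0.5 * f + 1.0 for f in forcing]


def sensitive_model(v: float) -> List[float]:
    return [v * f + 1.0 for f in forcing]


def insensitive_model(v: float) -> List[float]:
    return [0.5 * f + 1.0 + 0.001 * v for f in forcing]


sd_sensitive = oat_sensitivity(sensitive_model, obs, 0.1, 1.0)
sd_insensitive = oat_sensitivity(insensitive_model, obs, 0.1, 1.0)
SENSITIVITY_THRESHOLD = 0.2  # threshold quoted in the text
```

Applying the same routine separately to base flow and quick flow series would give one SD per parameter per flow regime, which is the quantity ranked in the text.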

Model evaluation
Model goodness of fit was assessed graphically and quantified using the coefficient of determination (R²), NSE and percent bias (PBIAS; Table 5). R² (range from 0 to 1) and NSE (range from −∞ to 1) values are commonly used to evaluate SWAT model performance (Gassman et al., 2007). The PBIAS value indicates the average tendency of simulated outputs to be larger or smaller than observations (Gupta et al., 1999).
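The three goodness-of-fit statistics can be sketched as follows. The formulas are the standard textbook definitions and are an assumption here, since the equations of Table 5 are not reproduced in the text; the PBIAS sign convention (positive values indicate underestimation) is chosen to agree with how PBIAS is interpreted in the results. Function names and data are hypothetical.

```python
from typing import Sequence


def r_squared(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Coefficient of determination as the squared Pearson correlation."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_s = sum(sim) / n
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_s = sum((s - mean_s) ** 2 for s in sim)
    return cov ** 2 / (var_o * var_s)


def nse(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency (1 is a perfect fit; can be negative)."""
    mean_o = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_o) ** 2 for o in obs)
    return 1.0 - num / den


def pbias(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Percent bias; positive when the simulation underestimates on average."""
    return 100.0 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)


# Hypothetical observed discharge (m3/s) and a simulation that is 10 % too low.
obs = [1.8, 2.1, 2.6, 3.4, 2.2]
sim = [0.9 * o for o in obs]

stats = {"R2": r_squared(obs, sim), "NSE": nse(obs, sim), "PBIAS": pbias(obs, sim)}
```

The example shows why the statistics are reported together: a simulation that is uniformly 10 % low is perfectly correlated with the observations, so only NSE and PBIAS reveal the systematic underestimation.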
Model uncertainty was evaluated by two criteria: R factor and P factor (see Sect. 2.3). These were used to constrain parameter ranges during the calibration using measured Q and loads of SS, MINP, ORGN, NH₄-N and NO₃-N in the SUFI-2 procedure. The R software (R Development Core Team) was used to graphically show the 95 % confidence and prediction intervals for measurement data (Neyman, 1937) and model prediction intervals (Geisser, 1993) for Q and concentrations of SS, TP and TN during the calibration period (2004-2008).
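The two uncertainty criteria can be sketched from their verbal definitions (P factor: share of measurements inside the 95PPU band; R factor: mean band width divided by the standard deviation of the measurements). This is an illustration under assumptions, not the SWAT-CUP implementation: the 95PPU band is taken as the 2.5th to 97.5th percentiles of an ensemble at each time step, the P factor is returned as a fraction, and all names and data are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def ppu_band(ensemble: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Lower (2.5 %) and upper (97.5 %) bounds of the ensemble at each time step."""
    lower, upper = [], []
    for values in zip(*ensemble):
        # n=40 gives cut points every 2.5 %; the first and last bound the band.
        cuts = statistics.quantiles(values, n=40, method="inclusive")
        lower.append(cuts[0])
        upper.append(cuts[-1])
    return lower, upper


def p_factor(obs: Sequence[float], lower: Sequence[float], upper: Sequence[float]) -> float:
    """Fraction of observations bracketed by the band."""
    inside = sum(1 for o, lo, hi in zip(obs, lower, upper) if lo <= o <= hi)
    return inside / len(obs)


def r_factor(obs: Sequence[float], lower: Sequence[float], upper: Sequence[float]) -> float:
    """Mean band width divided by the standard deviation of the observations."""
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / len(obs)
    return mean_width / statistics.stdev(obs)


# Hypothetical ensemble: each inner list is one simulation over four time steps.
ensemble = [
    [1.8, 2.4, 3.9, 2.6],
    [2.0, 2.8, 4.6, 2.9],
    [2.2, 3.1, 5.2, 3.1],
    [2.1, 2.6, 4.1, 2.7],
]
obs = [2.05, 2.9, 6.0, 2.8]

lower, upper = ppu_band(ensemble)
p_val = p_factor(obs, lower, upper)
r_val = r_factor(obs, lower, upper)
```

The two criteria pull in opposite directions: widening parameter ranges brackets more observations (higher P factor) at the cost of a thicker band (higher R factor), which is why both are tracked when ranges are updated between iterations.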

Model performance and uncertainty
Numerous rounds (each comprising 1000 simulations) of LHS were conducted for each simulated variable until the performance criteria were satisfied. The total number of rounds of LHS for each simulated variable was as follows (number in parentheses): Q (7), SS (7), MINP (11), ORGN (10), NH₄-N (4) and NO₃-N (4). The parameters that provided the best statistical outcomes (i.e. best match to observed data) are given in Table 4. Two criteria (R factor and P factor) were used to show model uncertainties for simulations of discharge and contaminant loads, with values as follows: Q (0.97, 0.43), SS (0.48, 0.19), MINP (2.64, 0.14), ORGN (0.47, 0.17), NH₄-N (1.16, 0.56) and NO₃-N (1.2, 0.29). Model uncertainties for simulations of Q and SS, TP and TN concentrations are shown in Fig. 6.

Modelled and measured base flow showed high correspondence, although measured daily mean discharge during storm peaks was often underestimated (Fig. 3a, e). Annual mean percentages of lateral flow recharge, shallow aquifer recharge and deep aquifer recharge to total water yield were predicted by SWAT as 30, 10 and 58 %, respectively. Modelled SS concentrations overestimated measurements of monthly grab samples by an average of 18.3 % during calibration and 0.32 % during validation (Fig. 3b, f). Measured TP concentrations in monthly grab samples were underestimated by 23.8 % during calibration (Fig. 3c) and 24.5 % during validation (Fig. 3g). Similarly, measured TP loads were underestimated by 34.5 and 38.4 % during calibration and validation, respectively. Modelled and measured TN concentrations were generally better aligned during base flow (Fig. 3d), apart from a mismatch prior to 1996 when monthly measured TN concentrations were substantially lower than model predictions, although the concentrations gradually increased (Fig. 3h) during the validation period (1994-1997). The average measured TN load increased from 134 kg N day⁻¹ prior to 1996 to 190 kg N day⁻¹ post-1996, and the comparable increase in modelled TN load was from 167 to 205 kg N day⁻¹.

Table 5. Criteria for model performance. Note: $o_n$ is the nth observed datum, $s_n$ is the nth simulated datum, $\bar{o}$ is the observed mean value, $\bar{s}$ is the simulated daily mean value, and $N$ is the total number of observed data. Performance rating criteria are based on Moriasi et al. (2007) for Q: discharge, SS: suspended sediment, TP: total phosphorus and TN: total nitrogen. Moriasi et al. (2007) derived these criteria based on extensive literature review and analysing the reported performance ratings for recommended model evaluation statistics.
Statistic equation | Constituent | Unsatisfactory | Satisfactory | Good | Very good
(3) | All | < 0.5 | 0.5-0.6 | 0.6-0.7 | 0.7-1
Statistical evaluations of goodness of fit are shown in Table 6. The R² values for discharge were 0.77 for calibration and 0.68 for validation, corresponding to model performance ratings (cf. Moriasi et al., 2007) of "very good" and "good" (Table 5). Similarly, the NSE values for discharge were 0.73 (good) for calibration and 0.62 (satisfactory) for validation. Positive PBIAS (7.8 % for calibration and 8.8 % for validation) indicated a tendency for underestimation of daily mean discharge; however, the low magnitude of PBIAS values corresponded to a performance rating of "very good". The R² values for SS were 0.42 (unsatisfactory) for calibration and 0.80 (very good) for validation. Similarly, the NSE values for SS were −0.08 (unsatisfactory) for calibration and 0.76 (very good) for validation. The model did not simulate trends well for monthly measured TP and TN concentrations. The R² values for TP and TN were both < 0.1 (unsatisfactory) during calibration and validation and NSE values were both < 0 (unsatisfactory). Values of PBIAS corresponded to "good" or "very good" performance ratings for TP and TN.
Observed Q-weighted daily mean concentrations derived from hourly measurements and simulated daily mean concentrations of SS, TP and TN during an example 2-day storm event are shown in Fig. 4a-c [...] TP. Comparisons of Q-weighted daily mean concentrations ($C_{QWM}$) during storm events from 2010 to 2012 are shown in Fig. 4d-f for SS (9 events), TP and TN (both 14 events). The $C_{QWM}$ of TP exceeded the simulated daily mean by between 0.02 and 0.2 mg P L⁻¹ and, on average, the model underestimated measurements by 69.4 % (Fig. 4e). Although R² and NSE values for $C_{QWM}$ of TN were unsatisfactory (Table 6), they were both close to the threshold for satisfactory performance (0.5). For $C_{QWM}$ of SS and TP, R [...] 7). Simulations of discharge and constituent loads under quick flow were more closely related to the measurements (i.e. higher values of R² and NSE) than simulations under base flow. Base flow TN load simulations during the validation period showed better model performance than simulations under quick flow. Additionally, measurements under quick flow were better reproduced by the model than the measurements for the whole simulation period. Simulations of contaminant loads matched measurements much better than simulations of contaminant concentrations, as indicated by statistical values for model performance given in Tables 6 and 7.

Separated parameter sensitivity
Based on the ranking of relative sensitivities of hydrological and water quality parameters derived from the SUFI-2 procedure (see Table 8), the OAT sensitivity analysis undertaken separately for base flow and quick flow identified three parameters that most influenced the quick flow estimates, and five parameters that most influenced the base flow estimates (parameters above the dashed line in Fig. 7a). Channel hydraulic conductivity (CH_K2) is used to estimate the peak runoff rate (Lane, 1983). Lateral flow slope length (SLSOIL) and lateral flow travel time (LAT_TIME) have an important controlling effect on the amount of lateral flow entering the stream reach during quick flow. Both slope (HRU_SLP) and soil available water content (SOL_AWC) were particularly sensitive for the base flow simulation because they affect lateral flow within the kinematic storage model in SWAT (Sloan and Moore, 1984). The aquifer percolation coefficient (RCHRG_DP) and the base flow alpha factor (ALPHA_BF) strongly influenced base flow calculations (Sangrey et al., 1984), as did the channel's Manning N value (CH_N2), which is used to estimate channel flow (Chow, 2008).

Figure caption (fragment): ... (2004-2008) and validation (1994-1997) periods (note time discontinuity). Measured instantaneous loads of SS, TP, and TN correspond to monthly grab samples.
For SS loads, 12 and four parameters, respectively, were identified as sensitive in relation to the simulations of base flow and quick flow (parameters above the dashed line in Fig. 7b). Parameters that control main channel processes (e.g. CH_K2 and CH_N2) and subsurface water transport processes (e.g. LAT_TIME and SLSOIL) were found to be much more sensitive for base flow SS load estimations. Parameters exclusive to SS estimation, such as SPCON (linear parameter), PRF (peak rate adjustment factor), SPEXP (exponent parameter), CH_COV1 (channel erodibility factor), and CH_COV2 (channel cover factor), were found to be much more sensitive for base flow SS load, while LAT_SED (SS concentration in lateral flow and groundwater flow) was more sensitive for quick flow SS load. Parameters that control overland processes, e.g. CN2 (the curve number), OV_N (Manning's N value for overland flow) and SLSUBBSN (subbasin slope length), were found to be much more sensitive for quick flow SS load estimations.
Of the sensitive parameters, BC4 (ORGP mineralisation rate) was particularly sensitive for the simulation of base flow MINP load (Fig. 7c). RCN (nitrogen concentration in rainfall) related specifically to the dynamics of the base flow NO3-N load and NPERCO (nitrogen percolation coefficient) significantly affected the quick flow NO3-N load (Fig. 7d). Parameter CH_ONCO (channel ORGN concentration) similarly affected both flow components of ORGN load (Fig. 7e) and SOL_CBN (organic carbon content) was most sensitive for the simulations of quick flow ORGN and NH4-N loads. Parameter BC1 (nitrification rate in reach) was particularly sensitive for the simulation of the base flow NH4-N load (Fig. 7f).
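The classification rule used in this section (a parameter is "sensitive" for a flow component when the standard deviation of ln-transformed NSE across its OAT runs exceeds 0.2; see the caption of Fig. 7) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the sample standard deviation is assumed, NSE values are assumed to be positive so that the logarithm is defined, and the NSE values in the example are hypothetical. Only the parameter names and the 0.2 threshold come from the text.

```python
import math
import statistics

SENSITIVITY_THRESHOLD = 0.2  # threshold quoted in the caption of Fig. 7


def ln_nse_sd(nse_values):
    """Sample SD of ln(NSE) across the OAT runs of a single parameter.

    Assumption: all NSE values are > 0. How non-positive NSE values
    should be handled is not stated in the text, so they are rejected.
    """
    if len(nse_values) < 2:
        raise ValueError("at least two OAT runs are needed")
    if any(v <= 0 for v in nse_values):
        raise ValueError("ln-transform requires NSE > 0")
    return statistics.stdev([math.log(v) for v in nse_values])


def classify(oat_runs, threshold=SENSITIVITY_THRESHOLD):
    """Map each parameter name to True if it is deemed sensitive."""
    return {name: ln_nse_sd(values) > threshold
            for name, values in oat_runs.items()}


if __name__ == "__main__":
    # Hypothetical NSE values obtained by varying one parameter at a time.
    base_flow_runs = {
        "CH_K2": [0.9, 0.6, 0.3],
        "SURLAG": [0.80, 0.79, 0.81],
    }
    print(classify(base_flow_runs))
```

Running the same classification once with base flow NSE values and once with quick flow NSE values gives the two sets of sensitive parameters that are compared in Fig. 7.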

Discussion
This study examined temporal dynamics of model performance and parameter sensitivity in a SWAT model application that was configured for a small, relatively steep and lower-order stream catchment in New Zealand. This country faces increasing pressures on freshwater resources (Parliamentary Commissioner for the Environment, 2013) and models such as SWAT potentially offer valuable tools to inform management of water resources although, to date, the SWAT model has received limited consideration in New Zealand (Cao et al., 2006). Model evaluation on the basis of the data collected during an extended monitoring programme enabled a detailed examination of how model performance varied during different flow regimes. It also permitted the error in daily mean estimates of contaminant loads to be quantified with relative precision. This made it possible to assess the ability of the SWAT model to simulate contaminant loads during storm events, when lower-order streams typically exhibit considerable sub-daily variability in both discharge and contaminant concentrations (Zhang et al., 2010). Separating discharge and loads of sediments and nutrients into those associated with base flow and quick flow for separate OAT sensitivity analyses provided important insights into the varying dependency of parameter sensitivity on hydrologic conditions.
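Separating a discharge record into base flow and quick flow can be done in several ways, and this passage does not state which method was used. As an illustration only, the sketch below applies a single forward pass of the Lyne-Hollick recursive digital filter, a commonly used technique. The filter choice, the parameter value alpha = 0.925, the zero initial quick flow and the example discharge series are all assumptions.

```python
def lyne_hollick(q, alpha=0.925):
    """Split discharge into (base flow, quick flow) with one forward pass.

    Quick flow recursion (assumed technique, not confirmed by the text):
        f[t] = alpha * f[t-1] + 0.5 * (1 + alpha) * (q[t] - q[t-1])
    with f[t] constrained to the range 0..q[t]; base flow is q[t] - f[t].
    """
    if not q:
        return [], []
    quick = [0.0]  # assumption: no quick flow at the first time step
    for t in range(1, len(q)):
        f = alpha * quick[-1] + 0.5 * (1.0 + alpha) * (q[t] - q[t - 1])
        quick.append(min(max(f, 0.0), q[t]))
    base = [qt - ft for qt, ft in zip(q, quick)]
    return base, quick


if __name__ == "__main__":
    # Hypothetical daily mean discharge (m3/s) spanning one storm event.
    discharge = [1.0, 1.0, 5.0, 3.0, 1.5, 1.0]
    base, quick = lyne_hollick(discharge)
    for q_t, b_t, f_t in zip(discharge, base, quick):
        print(f"Q={q_t:4.2f}  base={b_t:4.2f}  quick={f_t:4.2f}")
```

Once each day is split in this way, loads can be partitioned by multiplying concentration by the corresponding flow component, and performance statistics or OAT sensitivities can be computed for each component separately.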

Temporal dynamics of model performance
The modelled estimates of deep aquifer recharge (58 %) and combined lateral flow and shallow aquifer recharge (40 %) were comparable with estimates derived by Rutherford et al. (2011), who used an alternative catchment model to derive respective estimates of 30 and 70 % for these two fluxes. Our decision to deliberately select a validation period (1994-1997) during which the boundary conditions of the system (specifically anthropogenic nutrient loading) differed considerably from the calibration period allowed us to rigorously assess the capability of SWAT to accurately predict water quality under an altered management scenario (i.e. the purpose of most SWAT applications).
Overestimation of TN concentrations prior to 1996 reflects higher NO3-N concentrations in groundwater during the calibration period (2004-2008) due to the wastewater irrigation operation. Nitrate concentrations appeared to reach a new quasi-steady state as wastewater loads and in-stream attenuation came into balance. SWAT may not adequately represent the dynamics of groundwater nutrient concentrations (Bain et al., 2012), particularly in the presence of changes in catchment inputs (e.g. with start-up of wastewater irrigation). The groundwater delay parameter was set to 5 years (cf. Rotorua District Council, 2006), but this did not appear to capture adequately the lag in response to increases in stream nitrate concentrations following wastewater irrigation from 1991.
The poor fit between simulated daily mean TP concentrations and monthly instantaneous measurements may partly reflect a mismatch between the dominant processes affecting phosphorus cycling in the stream and those represented in SWAT. The ORGP fraction that is simulated in SWAT includes both organic and inorganic forms of particulate phosphorus; however, the representation of particulate phosphorus cycling only focusses on organic phosphorus cycling, with limited consideration of interactions between inorganic streambed sediments and dissolved reactive phosphorus in the overlying water (White et al., 2014). This contrasts with phosphorus cycling in the study stream, where it has been shown that dynamic sorption processes between the dissolved and particulate inorganic phosphorus pools exert major control on phosphorus cycling (Abell and Hamilton, 2013).
Our finding that measured Q-weighted mean concentrations (C_QWM) of TP and SS during storm events (2010-2012) were greatly underestimated by the simulated daily mean TP and SS concentrations has important implications for studies that examine effects of altered flow regimes on

Key uncertainties
Model uncertainty in this study may arise from four main factors: (1) model parameters, (2) forcing data, (3) measurements used for evaluation of model fit, and (4) model structure or algorithms (Lindenschmidt et al., 2007). The values of most parameters assigned for model calibration, although specific to different soil types (e.g. soil parameters), were lumped across land uses and slopes in this study. They integrated spatial and temporal variations, thus neglecting any variability throughout the study catchment. In terms of forcing data, the assumption of constant values of spring discharge rate and nutrient concentrations may inadequately reflect the temporal variability and therefore increase model uncertainty, although this should contribute little to the model error term. Most water quality data used for model calibration comprised monthly instantaneous samples taken during base flow conditions. The use of those measurements for model calibration would likely lead to considerable underestimation of constituent concentrations (notably SS and TP) due to failure to account for short-term high flow events. Inadequate representation of groundwater processes in the model structure is another key factor that is likely to affect model uncertainty, particularly for nitrogen simulations. The analysis of model performance based on data sets separated into base flow and quick flow constituents enabled uncertainties in the structure of hydrological models to be identified, denoted by different model performance between these two flow constituents. Furthermore, the disparity in goodness-of-fit statistics between discharge (typically "good" or "very good") and nutrient variables (often "unsatisfactory") highlights the potential for catchment models which inadequately represent contaminant cycling processes (manifest in unsatisfactory concentration estimates) to nevertheless produce satisfactory load predictions (e.g. compare model performance statistics for prediction of nutrient concentrations in Table 6 with statistics for prediction of loads in Table 7). This highlights the potential for model uncertainty to be underestimated in studies which aim to predict the effects of scenarios associated with changes in contaminant cycling, such as increases in fertiliser application rates.
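The point that load statistics can look satisfactory even when concentrations are poorly simulated can be illustrated with a small synthetic example. In the sketch below the simulated concentration is held constant at the observed mean, so it explains none of the observed variation, while discharge is assumed to be simulated perfectly. All numbers are hypothetical, loads are computed as concentration times discharge without unit conversion, and the standard Nash-Sutcliffe formula is used.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - sum((o - s)^2) / sum((o - mean_o)^2)."""
    if len(observed) != len(simulated) or len(observed) == 0:
        raise ValueError("observed and simulated must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - residual / variance


if __name__ == "__main__":
    # Hypothetical discharge, assumed to be reproduced exactly by the model.
    q = [1.0, 2.0, 10.0, 3.0, 1.0]
    # Hypothetical observed concentrations; the simulated value is a constant.
    c_obs = [0.10, 0.12, 0.08, 0.11, 0.09]
    c_sim = [0.10] * len(c_obs)

    load_obs = [c * qi for c, qi in zip(c_obs, q)]
    load_sim = [c * qi for c, qi in zip(c_sim, q)]

    print(f"NSE (concentration) = {nse(c_obs, c_sim):.2f}")
    print(f"NSE (load)          = {nse(load_obs, load_sim):.2f}")
```

Because the variance of the load is dominated by the variance of discharge, the load NSE is high although the concentration NSE is essentially zero, which mirrors the contrast between Tables 6 and 7 described above.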

Temporal dynamics of parameter sensitivity
To date, studies of temporal variability of parameters have focused on hydrological parameters, rather than on water quality parameters. The characteristics of concentration-discharge relationships for SS and TP differ from those for TN (Abell et al., 2013). In quick flow, there is a positive relationship between Q and concentrations of SS and TP, reflecting mobilisation of sediments and associated particulate P. Total nitrogen concentrations declined slightly in quick flow, reflecting the dilution of nitrate from groundwater. Defining separate contaminant concentrations in base flow and quick flow enabled us to examine how the sensitivity of water quality parameters varied depending on hydrologic conditions.
In a study of a lowland catchment (481 km²), Guse et al. (2014) found that three groundwater parameters, RCHRG_DP (aquifer percolation coefficient), GW_DELAY (groundwater delay) and ALPHA_BF (base flow alpha factor), were highly sensitive in relation to simulating discharge during quick flow, while ESCO (soil evaporation compensation factor) was most sensitive during base flow. This is counter to the findings of this study, for which the base flow discharge simulation was sensitive to RCHRG_DP and ALPHA_BF. This result may reflect that, relative to our study catchment, the catchment studied by Guse et al. (2014) had moderate precipitation (884 mm yr⁻¹) with less forest cover and flatter topography. Although the GW_DELAY parameter reflects the time lag for water in the soil profile to enter the shallow aquifers, its lack of sensitivity under both base flow and quick flow conditions in this study is a reflection of higher water infiltration rates and steeper slopes. The ESCO parameter controls the upwards movement of water from lower soil layers to meet evaporative demand (Neitsch et al., 2011). Its lack of sensitivity in our study may reflect relatively high and seasonally consistent rainfall (1500 mm yr⁻¹), in addition to extensive forest cover in the Puarenga Stream catchment, which reduces soil evaporative demand by shading. Soil texture is also likely a contributor to this result. The predominant soil horizon type in the Puarenga Stream catchment was A, indicating high macroporosity, which promotes a high water infiltration rate and inhibits upward transport of water by capillary action (Neitsch et al., 2011). The variability in the sensitivity of the parameter SURLAG (surface runoff lag coefficient) between this study (relatively insensitive) and that of Cibin et al. (2010; relatively sensitive) likely reflects differences in catchment size. The Puarenga Stream catchment (77 km²) is much smaller than the study catchment (St Joseph River; 2800 km²) of Cibin et al. (2010) and, consequently, distances to the main channel are much shorter, with less potential for attenuation of surface runoff in off-channel storage sites. The curve number (CN2) parameter was found to be insensitive in both this study and Shen et al. (2012), because surface runoff was simulated based on the Green and Ampt (1911) method, which requires hourly rainfall inputs, rather than the curve number equation, which is an empirical model. By contrast, the most sensitive parameters in our study are those that determine the extent of lateral flow, an important contributor to streamflow in the catchment, due to a general lack of ground cover under plantation trees and formation of gully networks on steep terrain.
Parameters that control surface water transport processes (e.g. LAT_TIME and SLSOIL) were found to be much more sensitive for base flow SS load estimation than parameters that control groundwater processes (e.g. ALPHA_BF and RCHRG_DP), reflecting the importance of surface flow processes for sediment transport. Sensitive parameters for quick flow SS load estimation related to overland flow processes (e.g. OV_N and SLSUBBSN), thus reflecting the fact that sediment transport is largely dependent on rainfall-driven processes, as is typical of steep and lower-order catchments. Modelled base flow NO3-N loads were most sensitive to RCN because rainfall is a predominant contributor to base flow recharge. NPERCO was more influential for quick flow NO3-N load estimation, probably indicating that the quick flow NO3-N load is more influenced by the mobilisation of concentrated nitrogen sources associated with agriculture or treated wastewater distribution. High sensitivity of the organic carbon content (SOL_CBN) for quick flow ORGN load estimates likely reflects mobilisation of N associated with organic material following rainfall. The finding that base flow NH4-N load was more sensitive to the nitrification rate in reach (BC1) likely reflects that base flow provides more favourable conditions to complete this oxidation reaction, as NH4-N is less readily leached and transported. Similarly, the ORGP mineralisation rate (BC4) strongly influenced base flow MINP load estimation, reflecting that base flow phosphorus transport is relatively more influenced by cycling from channel bed stores, whereas quick flow phosphorus transport predominantly reflects the transport of phosphorus that originated from sources distant from the channel.

Conclusions
The performance of a SWAT model was quantified for different hydrologic conditions in a small catchment with mixed land use. Discharge-weighted mean concentrations of TP and SS measured during storm events were greatly underestimated by SWAT, highlighting the potential for uncertainty to be greatly underestimated in catchment model applications that are validated using a sample of contaminant load measurements that is over-represented by measurements made during base flow conditions. Monitoring programmes which collect high-frequency and event-based data should be considered further to support more robust calibration and validation of SWAT model applications. Accurate simulation of nitrogen concentrations was constrained by the non-steady state of groundwater nitrogen concentrations due to historic variability in anthropogenic nitrogen applications to land. Improved representation of groundwater processes in the model structure would reduce this aspect of model uncertainty. The sensitivity of many parameters varied depending on the relative dominance of base flow and quick flow, while curve number, soil evaporation compensation factor, surface runoff lag coefficient, and groundwater delay were largely invariant to the two flow regimes. Parameters relating to main channel processes were more sensitive when estimating variables (particularly Q and SS) during base flow, while those relating to overland processes were more sensitive for simulating variables associated with quick flow. Temporal dynamics of both parameter sensitivity and model performance due to dependence on hydrologic conditions should be considered in further model applications. This study has important implications for modelling studies of similar catchments that exhibit short-term temporal fluctuations in stream flow. In particular, these include small catchments with relatively steep terrain and lower-order streams with moderate to high rainfall.

Figure 1. (a) Location of Puarenga Stream surface catchment in New Zealand, Kaituna rain gauge, climate station and managed land areas for which management schedules were prescribed in SWAT. (b) Location of the Puarenga Stream, major tributaries, monitoring stream gauges, two cold-water springs and the Whakarewarewa geothermal contribution. Measurement data (Table 3) used to calibrate the SWAT model were from the Forest Research Institute (FRI) stream gauge and were considered representative of the downstream/outlet conditions of the Puarenga Stream.

Figure 3. Measurements and daily mean simulated values of discharge, suspended sediment (SS), total phosphorus (TP) and total nitrogen (TN) during calibration (a-d) and validation (e-h). Measured daily mean discharge was calculated from 15 min observations and measured concentrations of SS, TP and TN correspond to monthly grab samples.

Figure 4. Example of a storm event showing derivation of discharge (Q)-weighted daily mean concentrations (dashed horizontal line) based on hourly measured concentrations (black dots) of suspended sediment (SS), total phosphorus (TP) and total nitrogen (TN) over 2 days (a-c). Comparisons of Q-weighted daily mean concentrations with simulated daily mean estimates of SS, TP and TN (scatter plot, d-f). The horizontal bars show the ranges in hourly measurements during each storm event in 2010-2012.

Figure 5. Measurements and simulations derived using the calibrated set of parameter values. Data are shown separately for base flow and quick flow. (a) Daily mean base flow and quick flow; (b) suspended sediment (SS) load; (c) total phosphorus (TP) load; (d) total nitrogen (TN) load. Vertical lines in (b)-(d) show the contaminant load in quick flow. Time series relate to calibration (2004-2008) and validation (1994-1997) periods (note time discontinuity). Measured instantaneous loads of SS, TP, and TN correspond to monthly grab samples.

Figure 6. Regression of measured and simulated (a) discharge (Q), concentrations of (b) suspended sediment (SS), (c) total phosphorus (TP), and (d) total nitrogen (TN) including lower and upper 95 % confidence limits (LCL and UCL) and lower and upper 95 % prediction limits (LPL and UPL). Note that the "indistinct" shape of confidence limits shown in (b)-(d) resulted from the few data points (< 50) in the regressions of measured and simulated SS, TP and TN concentrations.

Figure 7. The standard deviation (SD) of the ln-transformed Nash-Sutcliffe efficiency (NSE) used to indicate parameter sensitivity based on one-at-a-time (OAT) sensitivity analysis for separate base flow and quick flow components: (a) Q (discharge); (b) SS (suspended sediment); (c) MINP (mineral phosphorus); (d) NO3-N (nitrate-nitrogen); (e) ORGN (organic nitrogen); (f) NH4-N (ammonium-nitrogen). A median value (0.2) derived from the SD of ln-transformed NSE was chosen as a threshold above which parameters were deemed to be "sensitive". Definitions of each parameter are shown in Table 4.

Table 2. Previously estimated parameter values for three dominant types of land cover in the Puarenga Stream catchment. Values of other land use parameters were based on the default values in the SWAT database.

Table 3. Description of data used to calibrate the SWAT model. Data were measured at the Forest Research Institute (FRI) stream gauge and were considered representative of the downstream/outlet conditions of the Puarenga Stream.

Table 4. Summary of calibrated SWAT parameters. Discharge (Q), suspended sediment (SS) and total nitrogen (TN) parameter values were assigned using auto-calibration, while total phosphorus (TP) parameters were manually calibrated. SWAT default ranges and input file extensions are shown for each parameter.

Table 6. Model performance ratings for simulations of discharge (Q), concentrations of suspended sediment (SS), total phosphorus (TP) and total nitrogen (TN). n indicates the number of measurements. Q-weighted mean concentrations were calculated using Eq. (1).

Table 7. Model performance statistics for simulations of discharge (Q), and loads of suspended sediment (SS), total phosphorus (TP) and total nitrogen (TN). Statistics were calculated for both overall and separated simulations. Q_all and L_all indicate the overall simulations; Q_b and L_b indicate the base flow simulations; Q_q and L_q indicate the quick flow simulations.

Table 8. Rankings of relative sensitivities of parameters (from most to least) for variables (header row) of Q (discharge), SS (suspended sediment), MINP (mineral phosphorus), ORGN (organic nitrogen), NH4-N (ammonium-nitrogen), and NO3-N (nitrate-nitrogen). Relative sensitivities were identified by randomly generating combinations of values for model parameters and comparing modelled and measured data with a Student's t test (p ≤ 0.05). Bold text denotes that a parameter was deemed sensitive relative to more than one simulated variable. Italic text denotes that a parameter was deemed insensitive for both flow components (base flow and quick flow; see Fig. 7) using one-at-a-time sensitivity analysis. Definitions and units for each parameter are shown in Table 4.