Interactive comment on “Technical note: A novel technique to improve the hydrological estimates at ungauged basins by swapping workspaces”

We would like to thank the reviewer for his highly constructive criticism. A review this detailed was needed to improve our work. We also thank him for giving us the option of resubmission. We assure the reviewer that his criticism is well taken and will be addressed to his satisfaction in our revised draft. For the time being, we will wait for the comments of the second reviewer and will then incorporate all the suggested changes together.

drawbacks, we propose a novel technique which assists in identifying a better individual regional model for the prediction of hydrological data at each ungauged basin. The new procedure treats each flow regime as a complete hydrological object. The variability in regime shape is determined from dissimilarity values arranged in a distance matrix, computed from the normalized values of three types of dissimilarity: point-to-point dissimilarity, vertical dissimilarity and lateral dissimilarity. On the basis of defined statistical routines, the flow distance matrix is linked with the distance matrices of basin characteristics, obtained by simple comparison of descriptor values, to select the most suitable descriptors from a pool of 74 and form regionalized models.
The dissimilarity-based regionalization model thus obtained is first coupled with the nearest neighbor algorithm to constitute a model space for the initial predictions of the monthly flow regimes. Afterwards, based on the orientation of the nearest neighbors of the ungauged basin in descriptor space, the prediction is improved by swapping the model space with one of the other available models, provided the set criteria are fulfilled. The study is conducted in northwestern Italy and the proposed method is tested on a dataset of 124 basins. For the basins where the set criteria of model swapping are met, the results obtained are statistically better than the initial estimates.

Introduction
The prediction of flow regimes is important for flood mitigation, hydropower generation, dam storage management and irrigation water management. The topic has been widely studied over the last two decades and a number of methods have been proposed for the prediction of hydrological data (Blöschl et al., 2013; Viglione et al., 2013; Qamar et al., 2015, 2016). Among the available methods, dissimilarity-based methods have been used extensively in recent times owing to their better predictive ability and simplicity of application (Ganora et al., 2009; Qamar et al., 2016). Theoretically, these methods define the hydrological properties of basins as a function of their climatic, geomorphological and land-use characteristics (also known as descriptors). The descriptors are arranged in a multi-dimensional space to form a workspace in which the prediction of hydrological data is made. The predictive ability of a model is generally defined for the selected study area (or cluster), which contains a variable number of basins with homogeneous descriptive properties. With the availability of GIS procedures, several descriptors can be computed to investigate the complex basin dynamics; however, the process of model constitution results in a large number of models with almost similar global performance (models exhibiting a very small difference in performance parameters). The predictive model with the best global performance is then selected from the constituted models by making restrictive assumptions. However, the model selection criteria are not strictly defined but are merely trade-offs between various statistical parameters (Hall, 2001). Moreover, the selection of the predictive model is based on the information provided by the average predictive performance of the model over the selected study area instead of its performance for the localized ungauged basin. Therefore, the predictive model, selected from a very competitive domain of models with almost similar predictive abilities, can have the largest prediction uncertainty for a particular ungauged basin in the study area. It is therefore pertinent, for the sake of predictive efficiency, to devise a mechanism that can indicate the better model for the considered ungauged basin among the competing models.
We argue that instead of using a single model for the overall workspace, there should be a mechanism to define a basin-specific model which could yield statistically better predictions for the ungauged basin. To this end, we merge the distance-based approach with the nearest neighbor (NN) method to make initial estimates of the hydrological data of the ungauged basin. The estimates are then improved by swapping the originally selected model with another model, provided the predefined conditions are satisfied.
Unlike other hydrologic entities (e.g., the flow duration curve), where flow values are deliberately arranged in order of magnitude, flow regimes are complex in shape owing to the dependence of flow values on time. Therefore, the prediction of flow regimes requires not only that the predicted flow values be close to the actual values but also that the pattern of occurrence (with respect to time) be similar to the actual regime. To reflect this difference between flow duration curves and flow regimes in the process of predictive model selection, we used three normalized modes of dissimilarity to comprehensively define the dissimilarity between the flow regimes. The hydrological dissimilarities thus computed are related to descriptive dissimilarities, both arranged in the form of distance matrices, to select a so-called original model (OM) for the initial estimates. The initial estimates are then potentially improved by swapping the OM with another model having almost similar global performance. The statistical results of the swapped model (SM) are accepted or rejected by scrutinizing: 1) the extent to which the space around the ungauged basin is covered by its NNs; and 2) the error generated by the SM in predicting the hydrological data of the NNs of the ungauged basin. We hypothesize that the results of the SM can be considered favorable if and only if this error is lower, and the coverage higher, for the SM than for the OM.

Study Area
The technique is tested in the northwestern part of Italy. A dataset representing the hydrological and descriptive characteristics of 124 basins is used in this study (see Figure 1).

Figure 1
The time span of the hydrological data varies from 5 years to 52 years, with a mean length of 12 years. The runoff data are extracted from previous publications of the former Italian Hydrographic Service, updated with recent measurements provided by the Regional Environmental Agency (ARPA) of the Piemonte Region. The flow data are normalized by using the global average monthly runoff values at each station. The entire hydrological dataset is summarized in Ganora et al. (2013).
The hydrological data are further complemented with a comprehensive compilation of geomorphological and climatic descriptors obtained for all the selected basins of the study area (Gallo et al., 2013; Farr et al., 2007). The maximum, minimum and average values of some of the 74 descriptors used in our research work are given in Table 1. The annual flow regimes are computed by summing the daily data for each month to extract an average monthly representative value through $\bar{q}_m = \frac{1}{n}\sum_{d=1}^{n} q_{d,m}$, where $m$ is the index of the month under consideration, $d$ represents the particular day of the month, and $n$ is the number of days in the month. The monthly runoff regime at any station is ultimately computed by averaging the yearly regimes, thus obtaining a single representative flow regime for each station. The representative regime describes the within-year streamflow variability. This pre-processing forms a normalized set of data that allows an easier comparison of the flow regimes within the given framework of dissimilarity. In this work, our primary focus is on the accurate prediction of average monthly runoff magnitudes and of the yearly peak flow with respect to time. We are, therefore, interested in a model that is able to predict not only the correct annual flow volume but also the peak pattern.
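The pre-processing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the normalization is assumed to be a division by the mean of the twelve monthly values, which is one reading of "global average monthly runoff".

```python
from statistics import mean

def monthly_means(daily_by_month):
    """Average the daily runoff values of each month of one year.

    daily_by_month: list of 12 lists, one list of daily values per month.
    """
    return [mean(days) for days in daily_by_month]

def representative_regime(yearly_regimes):
    """Average several 12-value yearly regimes month by month."""
    return [mean(year[m] for year in yearly_regimes) for m in range(12)]

def normalize_regime(regime):
    """Divide each monthly value by the regime mean (assumed normalization)."""
    overall = mean(regime)
    return [q / overall for q in regime]
```

With these helpers, a station with several years of record is reduced to one normalized 12-value regime, which is the object compared in the dissimilarity framework.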

Dissimilarity between Regimes
The dissimilarity between flow regimes is computed from three types of dissimilarity, namely point-to-point distance, lateral separation and vertical distance. In equation (1), i is the index for monthly values starting from January (i = 1), and the point-to-point distance is the difference between the flow regimes of stations R and S. It is important to note that equation (1) separates flow regimes only on the basis of the difference in monthly values; it does not consider the difference in time between the occurrence of the peak flow values at the two stations, which is the main characteristic of a flow regime (Fig. 2). To account for the position of the peak flow in the regime, we introduce a lateral distance measure which describes the time difference between the peaks of two regimes by considering the initial and shifted states of the regimes. The evaluation of the lateral distance requires the identification of the peaks in the flow regimes being compared. In our work, the peak is taken to be the maximum value of a regime. A circular procedure is then used to compute the lateral separation, in which one of the two regimes is shifted towards the other along the path with the fewest time-steps until both peaks coincide. For example, Figure (3) shows the lateral distance between the flow regimes of two stations whose peak flows occur at the 4th and 6th time-steps, respectively. Shifting the latter regime towards the former through the 5th and 4th time-steps takes the least number of time-steps (only 2) to match the peaks, whereas the alternative path requires 10 time-steps (through the 7th, 8th, 9th, 10th, 11th, 12th, 1st, 2nd, 3rd and 4th). Each step of peak shifting is followed by the application of eq. (1), which computes the dissimilarity between the initial and shifted states. It should be noted that the shifted state becomes the initial state once the regime is shifted to the next time-step. The dissimilarities obtained during each step are ultimately summed to find the total lateral distance.
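The circular shifting procedure can be illustrated with the following sketch. It is not the authors' implementation: the helper names are hypothetical, and equation (1) is assumed here to be the sum of absolute monthly differences, since its exact form is not reproduced in the text.

```python
def point_to_point(a, b):
    # Assumed form of eq. (1): sum of absolute monthly differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def peak_index(regime):
    # The peak is the maximum value of the regime (first one if tied).
    return max(range(len(regime)), key=lambda i: regime[i])

def signed_peak_shift(moving, target):
    """Fewest circular steps (signed) that bring moving's peak onto target's."""
    n = len(moving)
    forward = (peak_index(target) - peak_index(moving)) % n
    return forward if forward <= n - forward else forward - n

def rotate(regime, step):
    """Shift a regime circularly by one position: +1 later, -1 earlier."""
    step %= len(regime)
    return regime[-step:] + regime[:-step] if step else list(regime)

def lateral_distance(moving, target):
    """Sum eq. (1) between successive states while the peak is shifted."""
    shift = signed_peak_shift(moving, target)
    direction = 1 if shift > 0 else -1
    state, total = list(moving), 0.0
    for _ in range(abs(shift)):
        shifted = rotate(state, direction)
        total += point_to_point(state, shifted)  # initial vs. shifted state
        state = shifted                          # shifted becomes initial
    return total
```

For two regimes with peaks at the 6th and 4th time-steps, `signed_peak_shift` returns -2, that is, two steps towards earlier months rather than ten steps the other way round.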

Figure 3
To ensure that the estimated peaks are not only correct with respect to time but also close in terms of magnitude, a vertical distance measure, which quantifies the difference between the peaks, is added to the total distance. Finally, the three dissimilarities (point-to-point, lateral and vertical) are normalized and added to calculate a single representative total dissimilarity value between the two flow regimes, in which a superscript denotes the normalized dissimilarities. The total dissimilarity is compared between the 124 stations used in our work to construct a comprehensive dissimilarity matrix of hydrological data.
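A possible way of combining the three dissimilarities into one matrix is sketched below. Two points are assumptions rather than statements of the method: the vertical distance is taken as the absolute difference between the two peak values, and each component matrix is normalized by its largest entry, because the normalization formula is not recoverable from the text. All names are hypothetical.

```python
def vertical_distance(a, b):
    # Assumed form: absolute difference between the two peak magnitudes.
    return abs(max(a) - max(b))

def normalize_matrix(matrix):
    """Scale a distance matrix by its largest entry (assumed scheme)."""
    largest = max(max(row) for row in matrix)
    if largest == 0:
        return [list(row) for row in matrix]
    return [[value / largest for value in row] for row in matrix]

def total_dissimilarity(*components):
    """Add the normalized component matrices entry by entry."""
    normalized = [normalize_matrix(m) for m in components]
    n = len(normalized[0])
    return [[sum(m[i][j] for m in normalized) for j in range(n)]
            for i in range(n)]
```

Normalizing before adding keeps any one component from dominating the total merely because of its units or range.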
Unlike the hydrological data, the descriptive data vary in nature (geomorphological, climatic, etc.). The types of descriptors used in our work include: (1) single-value descriptors (e.g., basin elevation, basin area); (2) monotonic functions, such as the hypsographic curve; and (3) complex descriptors such as rainfall regimes. The dissimilarity between descriptors is computed according to the type of descriptor. For single-value descriptors, the absolute difference between their values is taken. For monotonic descriptors, eq. (1) is used. For regime descriptors, the dissimilarity is computed in the same way as for flow regimes (as a total dissimilarity).
The hydrological and descriptor dissimilarity matrices are expected to assist in the identification of predictive regional models having efficient temporal and magnitudinal prediction abilities for peak and monthly flow values, respectively.

Regional Model
The predictive models are identified by linking the descriptor distance matrices with the discharge distance matrices through linear regression to identify the dominating descriptors. In the linear model, p represents the number of descriptors, β_i a generic regression coefficient, ε the residual element and M_D the descriptor distance matrix transformed into a vector by following the procedure outlined by Lichstein (2007), which describes in detail a methodology for multiple regression on distance matrices (MRM). The significance of the regression is quantified through a modified Mantel test at the 0.05 significance level. The models passing the defined criteria are listed in decreasing order of adjusted R², determined by equation (6), $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, in which $R^2$ stands for the coefficient of determination, $p$ is the number of descriptors and $n$ is the total number of basins.
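Two small steps of this procedure lend themselves to a short illustration: unfolding a symmetric distance matrix into a vector before regression, and computing the adjusted coefficient of determination. The sketch below uses the standard adjusted R² formula and hypothetical function names; it does not reproduce the MRM fit or the Mantel test.

```python
def lower_triangle(matrix):
    """Unfold a symmetric distance matrix into a vector of pairwise distances."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i)]

def adjusted_r2(r2, n, p):
    """Standard adjusted R^2 for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
```

For instance, with n = 11 and p = 2, an R² of 0.5 is reduced to an adjusted value of 0.375, which shows how the penalty grows when few observations support several descriptors.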
Owing to the large number of descriptors used in our analysis, there is always a possibility of mutual correlation between descriptors. To identify this mutual correlation, the variance inflation factor (VIF) test is applied. A cutoff value of 5 is used, above which a selected model is classified as unusable (Ganora et al., 2009; Gallice et al., 2015).
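For a model with two descriptors, the VIF reduces to 1 / (1 - r²), where r is the Pearson correlation between the two descriptors. The following sketch shows this special case only; the function names are hypothetical and the general multi-descriptor case (one auxiliary regression per descriptor) is not covered.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_descriptors(x, y):
    """VIF of either descriptor in a two-descriptor model: 1 / (1 - r^2)."""
    r = pearson(x, y)
    return float("inf") if abs(r) >= 1.0 else 1.0 / (1.0 - r * r)
```

Uncorrelated descriptors give a VIF of 1, while strongly correlated descriptors push the value far above the cutoff of 5.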
The selected models are further tested for average error generation (∆) in the overall workspace framed by the descriptors constituting the models. The error test is carried out by treating one station at a time as ungauged and removing its descriptor and hydrological data from the database. The models are then recalibrated to estimate the unknown flow regimes by using the k-nearest neighbors (NN) algorithm, which relies on the selection of an optimum number of NNs. In equation (7), the total dissimilarity is computed between the actual and simulated regimes, and the index k denotes the station number.
The application of equations (6) and (7) to compute the adjusted R² and ∆ values, respectively, is trivial in the selection of the OM.
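The leave-one-out error test can be sketched as follows. This is a simplified illustration, not the authors' code: the names are hypothetical, the distance in descriptor space is assumed to be Euclidean on descriptors that have already been brought to comparable scales, and the dissimilarity between actual and simulated regimes is passed in as a function.

```python
from math import dist  # available from Python 3.8

def knn_regime(target, descriptors, regimes, k=5):
    """Average the regimes of the k basins closest to target in descriptor space."""
    order = sorted(descriptors, key=lambda name: dist(descriptors[name], target))
    neighbours = order[:k]
    months = len(regimes[neighbours[0]])
    return [sum(regimes[name][m] for name in neighbours) / len(neighbours)
            for m in range(months)]

def leave_one_out_error(descriptors, regimes, dissimilarity, k=5):
    """Mean dissimilarity between actual and predicted regimes, one basin left out at a time."""
    errors = []
    for name in descriptors:
        others = {n: d for n, d in descriptors.items() if n != name}
        predicted = knn_regime(descriptors[name], others, regimes, k)
        errors.append(dissimilarity(regimes[name], predicted))
    return sum(errors) / len(errors)
```

Here `descriptors` maps a station name to its descriptor tuple and `regimes` maps the same name to its monthly regime; the value returned by `leave_one_out_error` plays the role of the overall error ∆ for one candidate model.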
The model with a comparatively higher adjusted R² and the least ∆ value is selected to make the initial estimate. However, applying the OM to the entire study area is often argued to be problematic owing to the dynamic hydrological response of basins to changing descriptors. Despite extensive research in the field of predictive hydrology, the hydrological response of basins has never been precisely quantified against basin characteristics. The primary advantage of using a distance-based model workspace is that it can suggest an alternative workspace to counter the issue of generalization caused by extending the OM to the overall study area, thus suggesting an appropriate workspace for the prediction of hydrological data even at the localized level (for an individual basin) […] and ∆ values of the OM and SM (Qamar et al., 2016). The criteria are not strict in an intrinsic sense.
However, allowing a higher variation increases the risk of a larger localized error, whereas allowing a lower variation further complicates the selection of the SM.

Model swapping: logic, assumption, and implementation
The alternative space is selected under the hypothesis that the ungauged basin and its NNs form a unique region of influence (ROI) (Korn and Muthukrishnan, 2000). Inside the ROI, the orientation of the ungauged basin among its NNs and the average error
The hydrological data of the NNs of the ungauged basin in descriptor space are averaged to obtain its flow regime. By definition, the mean computed for the ungauged basin will always be located in the middle of its NNs. The transformation of descriptive data into hydrological data is more meaningful if the same location pattern is actually shown by the descriptor values of the ungauged basin and its NNs. Broadly speaking, the actual location of the ungauged basin in descriptor space should, ideally, overlap or align closely with the center formed by the mean of the descriptor values of its NNs (see Figure 4).

Figure 4
For example, referring to Figure 4, the mean of the hydrological data of the NNs of the ungauged basin in the workspaces of the two models shown is always located at the respective center. However, the actual position of the ungauged basin is closer to the virtual center formed by the descriptor values of its NNs in one workspace than in the other, and that workspace therefore better satisfies the condition of meaningful transformation. The ungauged basin is ideally located in a workspace when its hypothetical and actual positions in that workspace overlap. The selected workspace is further tested for localized error generation by estimating the hydrological data of the NNs of the ungauged basin and computing the average error with equation (7) in the ROI of the ungauged basin.
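The two swapping checks, coverage of the space around the ungauged basin and localized error, can be illustrated with the sketch below. It assumes a two-descriptor workspace split into six equal sectors around the ungauged basin, as described elsewhere in the manuscript. The starting angle of the first sector (the positive axis of the first descriptor) and all function names are assumptions, and the descriptors are assumed to have been scaled so that angles are meaningful.

```python
from math import atan2, pi

def coverage_factor(target, neighbours, sectors=6):
    """Count the equal angular sectors around target that contain a neighbour."""
    width = 2 * pi / sectors
    occupied = set()
    for x, y in neighbours:
        angle = atan2(y - target[1], x - target[0]) % (2 * pi)
        # min() guards against rounding that could yield an index of `sectors`.
        occupied.add(min(int(angle / width), sectors - 1))
    return len(occupied)

def accept_swap(cf_om, err_om, cf_sm, err_sm):
    """Accept the SM only if it covers more sectors and has a lower local error."""
    return cf_sm > cf_om and err_sm < err_om
```

A basin whose neighbours all lie on one side receives a coverage factor of 1, whereas a basin surrounded on all sides reaches 6, which corresponds to the situation in which the averaged neighbour regime is most representative.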
It should be noted that, with almost similar error magnitude in the overall workspace (∆), a lower localized error ensures a better prediction ability of the model in the localized area containing the ungauged basin. Although the application of the NN method is straightforward, it has been severely criticized for not taking into account the descriptive dissimilarity (or distance) between the selected NNs and the ungauged basin, since it allocates equal weight to all selected neighbors. To address this problem, Hechenbichler and Schliep (2004) proposed a weighting coefficient that increases the weight of closer neighbors in estimating the hydrological data of the ungauged basin. However, since the effect of descriptors on river flows varies unpredictably over short distances, and no standard method exists in the literature for quantifying the error magnitude per unit increase in distance (or dissimilarity) between basins, that weighting is not applicable to the proposed methodology. Instead, the location of the ungauged basin in the middle of its NNs ensures that each NN is at a comparable distance from it, which legitimizes equal weight for each NN.
The proposed methodology is implemented in the R statistical environment. The technique is useful because non-monotonic functions such as rainfall regimes can be combined with a scalar descriptor to define a suitable workspace for model selection.

Results and Discussion
Following the procedure outlined for the selection of the most appropriate model, Table (2) lists the models which fulfilled the set criteria. The model with the lower ∆ value and higher adjusted R² value, nominated as the OM, is used for the assessment of hydrological data at an ungauged basin. Within the workspace of the OM, the flow regimes of a predefined number of NNs of the ungauged basin are averaged to predict its hydrological regime.

Table (2)
The models in Table (2) are each constituted by two descriptors. Previous research has shown that increasing the number of descriptors in the predictive model increases the efficiency of the model output (Kjeldsen and Jones, 2009; Kjeldsen et al., 2014). However, due to computational limitations, we opted to use models with two descriptors.
Out of the numerous diverse descriptors used in our work, the climatic and geomorphological descriptors constituted the most suitable models for the prediction. More specifically, the model constituted by (_, 70) is used for the initial estimates of hydrological data at the ungauged basin. The model evaluation parameters, adjusted R² and ∆, equaled 0.291 and 0.660, respectively. The formation of better predictive models by climatic and geomorphological descriptors is in line with the typology of the study area containing the selected basins. For example, the descriptor (70), which is one of the constituent descriptors of the selected models, is relevant because of its strong influence on the basin response in the mountainous study area, whereas the dominating geographical descriptor (_) maintains its significance by providing a synthetic explanation of the flow pattern. The methodology thus not only allows complicated flow regimes to be simulated with few descriptors while maintaining the significance of the peak discharge, but also explains a logical connection between flow magnitudes and the selected descriptors.
The values of the localized error and the coverage factor for the selected OM and SM for the 124 stations are summed up in

Figure (6)
The above figure shows that, apart from descriptors ( 3 and _), the desired degree of scatter is obtained for the remaining descriptors. Therefore, the listed models containing one of ( 3 and _) are sieved out owing to the difficulty in nominating unique NNs of the ungauged basin.
Finally, once all the requirements are satisfied, the selected SMs are applied for the statistical improvement of the prediction. The results for 45 stations are compared in Table (3) by using performance indices such as the Root Mean Square Error (RMSE), Nash-Sutcliffe Efficiency (NSE), and Mean Absolute Error (MAE). On average, the SM produced a lower error than the OM.
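For reference, the three performance indices can be computed as in the following sketch, which uses their standard definitions; the function names are illustrative and the inputs are the observed and simulated monthly values of one station.

```python
from math import sqrt

def rmse(observed, simulated):
    """Root mean square error."""
    n = len(observed)
    return sqrt(sum((o - s) ** 2 for o, s in zip(observed, simulated)) / n)

def mae(observed, simulated):
    """Mean absolute error."""
    return sum(abs(o - s) for o, s in zip(observed, simulated)) / len(observed)

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / spread
```

An NSE of 1 indicates a perfect match, while an NSE of 0 means the prediction is no better than the mean of the observed regime; RMSE and MAE are both zero for a perfect match.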

Table (3)
The results in Table (3
Although the prediction is more efficient with the newly developed technique, the results obtained for station 82 are weaker than those of the OM despite the fulfillment of the swapping criteria. The most likely reason for this deviation from the expected output is the simplified approach followed to compute the error magnitude in the overall workspace and in the cluster (constituted by the ungauged basin, its NNs, and the NNs of those NNs).
However, the issue can be effectively addressed by studying the change in error magnitude per unit change in distance between the stations, which is ignored in our work. Moreover, it can be argued that the criteria defined for model swapping are strict, since only 36% of the total basins could satisfy them. Nevertheless, with the increasing availability of meaningful descriptors around the globe, the proposed technique will become more effective. The methodology has a wide range of applications in water management, flow trend analysis, reconstitution of hydrological regimes, and prediction of the timing and magnitude of peak discharge.

Conclusion
In this study, the distance matrices of descriptors and hydrological data are estimated and linked through regression modelling to identify the most effective descriptive models.
distance, lateral separation, and vertical distance, which comprehensively define the difference in hydrological behavior of the compared basins. An illustration of the three dissimilarities is provided below in Figure (2).

Figure (2)
Assuming two sets of twelve monthly values to be the hydrological data belonging to two gauged basins R and S, respectively, the point-to-point distance between the monthly values can be computed by equation (1).

Hydrol. Earth Syst. Sci. Discuss., https://doi.org/10.5194/hess-2018-418. Manuscript under review for journal Hydrol. Earth Syst. Sci. Discussion started: 29 August 2018. © Author(s) 2018. CC BY 4.0 License.

numbers of NNs of the ungauged basin. The selection of an appropriate number of unique NNs is an important step in the procedure, because too small a number of neighbors can result in oversimplification of the results, while too many neighbors may introduce error in the final results. Following the procedure proposed by Samaniego et al. (2010), we opted for 5 NNs after scrutinizing values from 1 to 9 (for details please refer to Samaniego et al., 2010). The unique NNs in the distance-based workspace are defined as the ones having distinct descriptor values. With a workspace formed by multiple descriptors, duplication in any of the descriptor values, especially for the basins positioned near the ungauged basin, adds an extraneous (or junk) variable to the predictive model, resulting in inflated standard errors. Distinct descriptor values ensure that the dissimilarity between the basins is evenly shared by the descriptors forming the predictive model. Furthermore, many basins having the same descriptor values make it difficult to nominate the predefined number of NNs of the ungauged basin. The obtained results are compared with the original flow regimes to obtain the total dissimilarity. The test, in its entirety, requires considerable computational power owing to the number of statistical operations involved. To minimize the computational burden, only a limited number of regression models with comparatively good adjusted R² values are used to compute the regional regimes. The overall error (∆) for each model (classified as having a better adjusted R² value) is deduced by equation (7).

Figure (5)
) are the best examples to interpret the effectiveness of the underlying assumptions, namely that the prediction of hydrological data improves statistically with better spatial coverage and lower neighboring error around the ungauged basin. For example, the outputs for stations 90 and 95 improve significantly after swapping the OM with the SM owing to the comprehensive fulfillment of the set criteria for model swapping, whereas for stations 9 and 15 the results improve only marginally because the swapping criteria are barely met. It can further be noted that the present methodology provides comparatively better results with models based on climatic-geomorphologic descriptors, while the land-use descriptors give the least accurate results. The reason is that the flow magnitudes depend directly on the climatic-geomorphologic descriptors, while land-use descriptors have a comparatively smaller effect on the magnitude of flow and the occurrence of peak flows in the study area (Confortola et al., 2013). During the dissimilarity measurement between flow regimes, the peak flow position and magnitude are given specific importance by introducing the lateral and vertical distances. Therefore, the prediction abilities are further explored to measure the accuracy of the peak flow position with respect to time, as elaborated in Table (4). A monthly difference of zero represents an exact temporal estimate of the peak flow, whereas values greater than zero indicate the temporal difference in months between the predicted and actual peaks. For example, a monthly difference of 2 indicates that the peak flow is estimated two months before or after the occurrence of the peak flow in the actual regime. It can clearly be noted that the SM predicts the timing of the peak flow better than the OM, which misses it more frequently. It should be noted that the proposed methodology only provides a comparative performance signature for the prediction of flow regimes at the ungauged basin. The procedure defines the comparative performance of the two models (OM and SM) beforehand by investigating the coverage factor and the localized error. It should also be borne in mind that the procedure does not give any numeric value of the model performance indices (RMSE, NSE and MAE) in advance; however, it does identify the statistically better predictive model. This ability makes it a useful tool for hydrological data prediction.
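The monthly difference between predicted and actual peaks can be computed with a short helper such as the one below. The function name is hypothetical, and the difference is assumed to be measured circularly, so that a peak predicted in December for an actual peak in January counts as one month.

```python
def peak_month_difference(actual, predicted):
    """Circular distance, in months, between the peaks of two regimes."""
    n = len(actual)
    gap = abs(actual.index(max(actual)) - predicted.index(max(predicted)))
    return min(gap, n - gap)
```

A value of 0 corresponds to an exact temporal estimate of the peak, and the largest possible value for a 12-month regime is 6.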

Figure 3: Stepwise shifting of peak R towards S.

Figure 5: Analyzing the coverage factor and localized error values against the set criteria of model swapping. The black dots above the bar plots represent the stations where the set criteria of swapping are fulfilled.

Figure 6: Analyzing the frequency of occurrence of descriptor values. The plots with a green background represent the descriptors with a better degree of scatter, while the ones with a red background do not show uniqueness in descriptor values.
generated in the estimation of the hydrological data of the NNs of the ungauged basin can act as comparative performance indicators of the alternative model space against the originally selected model space. The application of model swapping for the improvement of the predicted hydrological regime at the ungauged basin begins by splitting the workspace of the OM around the ungauged basin into six equal sectors (see Figure 4). The number of sectors occupied by the NNs of the ungauged basin is counted to define a so-called coverage factor. Afterwards, the hydrological data of each NN of the ungauged basin are predicted to estimate the average localized error, as defined by equation (7), in the ROI of the ungauged basin. The localized error is useful in the sense that it reveals the model performance in the localized area containing the ungauged basin. The same parameters (coverage factor and localized error) are estimated for the workspace of the SM. The statistical results of the SM are accepted if and only if the SM has a lower localized error and a higher coverage factor than the OM.

Afterwards, based on the values of adjusted R² and ∆, the statistically most feasible model is selected. The dissimilarity-based regionalization model is then coupled with the NN method to constitute the model space for the initial predictions of flow regimes. The predicted results are then improved by swapping the model with another model having similar global performance. The aims of changing the workspace of the ungauged basin are to obtain a better orientation of the ungauged basin among its NNs, increasing the coverage factor, and to reduce the localized error in the cluster formed by the ungauged basin, its NNs and the NNs of those NNs. Once the defined criteria are fulfilled, the SM is used to produce the flow regimes. The statistical performance parameters, in terms of RMSE, NSE and MAE, evaluated for the SM are better than those of the OM. It is, however, not easy to fulfill the set requirements of model swapping, owing to the difficulty of positioning the ungauged basin in the middle of its NNs while ensuring a lower localized error for the SM than for the OM. Nevertheless, with

Table 2: List of selected models with adjusted R² and ∆ values.