Deep, Wide, or Shallow? Artificial Neural Network Topologies for Predicting Intermittent Flows
 TTU Water Resources Center, Department of Civil, Environmental and Construction Engineering, Texas Tech University
Abstract. Intermittent Rivers and Ephemeral Streams (IRES) comprise 60 % of all streams in the US and about 50 % of the streams worldwide. Furthermore, climate-driven changes are expected to force a shift towards intermittency in currently perennial streams. Most modeling studies have treated intermittent streamflows as a continuum. However, it is better to envision flow data of IRES as a “mixture-type”, comprised of both flow and no-flow regimes. It is therefore hypothesized that data-driven models with both classification and regression cells can improve the streamflow forecasting abilities in these streams. Deep and wide Artificial Neural Networks (ANNs) comprising classification and regression cells were developed here by stacking them in series and parallel configurations. These deep and wide network architectures were compared against the commonly used single hidden layer ANNs (shallow), as a baseline, for modeling IRES flow series under the continuum assumption. New metrics focused on no-flow persistence and transitions between flow and no-flow states were formulated using contingency tables and Markov chain analysis. Nine IRES across the state of Texas, US, were used as a wide range of testbeds with different hydroclimatic characteristics. Model overfitting and the curse of dimensionality were reduced using extreme learning machines (ELM), balancing of the training data with the synthetic minority oversampling technique (SMOTE), greedy learning, and the Least Absolute Shrinkage and Selection Operator (LASSO). The addition of classifier cells greatly improved the ability to distinguish between no-flow and flow states, in turn improving the ability to capture no-flow persistence (dryness) and transitions to and from flow states (dryness initiation and cessation). The wide network topology provided better results when the focus was on capturing low flows, and the deep topology did well in capturing extreme flows (zero and > 75th percentile).
Farhang Forghanparast et al.
Status: closed

RC1: 'Comment on hess-2021-176', Anonymous Referee #1, 07 May 2021
The manuscript suggests the combination of classification and regression models (deep and wide topology) to increase the accuracy of the current data-driven models available for streamflow forecasting in intermittent rivers. Overall, the topic is very interesting, and the manuscript was written well. The suggested models are new, and the results are well discussed. My comments are listed below:
 line 98: update the references of the current regression-based models regarding the following paper:
 Mehr, A. D., & Gandomi, A. H. (2021). MSGP-LASSO: An improved multi-stage genetic programming model for streamflow prediction. Information Sciences, 561, 181-195.
 line 118-119: In the hydrological modeling community ANNs are known as regressors; however, the authors claimed ANNs as high-performance classifiers. The given references in line 119 are out of the hydrological forecasting community. It is better to remove lines 118-119. Furthermore, please justify why you don’t select a well-known classifier such as SVM or random forest?
 Section 3 is a part of the methodology of this paper. It could be combined with section 2. The authors must avoid providing literature review in this section and section 4 as well. For example, lines 203-209 must be removed, or lines 216-236 must be substantially shortened. Regarding the organization of the manuscript, I prefer to see Figure A1, Table A1, and Table A2 within the main text. The manuscript does not need an appendix.
 Line 149: remove the full expression of ANNs as you already provided in line 118.
 Line 181-187: redundancy in the citation is seen in this paragraph. Remove some of them.
 Remove capitalization of each word in section 4.4.
 Flow rate or flowrate? Use a fixed one in the whole text.
 In section 5, lines 341-342 are irrelevant. Please remove.
 At the end of Section 5, list the selected inputs clearly. Statistical features of inputs must be given.
 Section 7.1. Calibration must be replaced with training.

AC1: 'Reply on RC1', Elma Annette Hernandez, 08 Jul 2021
Reviewer 1
Comment: The manuscript suggests the combination of classification and regression models (deep and wide topology) to increase the accuracy of the current data-driven models available for streamflow forecasting in intermittent rivers. Overall, the topic is very interesting, and the manuscript was written well. The suggested models are new, and the results are well discussed.
Response: We thank the reviewer for this comment and for finding our work to be interesting, innovative, and well-written. We also appreciate the reviewer’s other detailed and helpful comments. We have made the necessary modifications as described below.
 Comment: line 98: update the references of the current regression-based models regarding the following paper: Mehr, A. D., & Gandomi, A. H. (2021). MSGP-LASSO: An improved multi-stage genetic programming model for streamflow prediction. Information Sciences, 561, 181-195.
Response: Thank you for your suggestion and for pointing us to this relevant reference. The references in line 98 have been updated and the suggested reference has been added. The paragraph at line 97 is updated as:
“Conventionally used models for intermittent streamflow forecasting only include the regression cell of the wide network (Cigizoglu, 2005; Kisi, 2009; Makwana and Tiwari, 2014; Rahmani-Rezaeieh et al., 2020; Mehr and Gandomi, 2021). This configuration typically has a single input layer, a hidden layer and an output layer and is referred to as shallow topology (or shallow model) in this study.”
 Comment: line 118-119: In the hydrological modeling community ANNs are known as regressors; however, the authors claimed ANNs as high-performance classifiers. The given references in line 119 are out of the hydrological forecasting community. It is better to remove lines 118-119.
Response: Thank you for your comment. We agree that, unlike regression, classification is not a common approach in hydrological modeling. We, therefore, provided citations from other fields where ANN classifiers have been used over a wide range of datasets in an effort to justify the testing of this approach in this application.
The paragraph at line 118 is rewritten as:
“Unlike the regression approach, which is widely used for streamflow forecasting, classification is not a common methodology in hydrological modeling. Artificial Neural Networks (ANN) are however known to provide high performance in both regression and classification over a wide range of datasets and applications in other fields (e.g., Arulampalam and Bouzerdoum, 2003; Rocha et al., 2007; Landi et al., 2010; Al-Shayea, 2011; Amato et al., 2013; Wang et al., 2017; Bektas et al., 2017) and thus provide a strong basis for testing their use here. While the developed topologies in this study are independent of the algorithm, for the sake of brevity, the same family of ANN models was used for the regression and classification cells in this study.”
 Comments: Furthermore, please justify why you don’t select a well-known classifier such as SVM or random forest?
Response: Thank you for this important comment.
As we state in our paper, any classification and regression modeling scheme can be used with our proposed approach. As our focus was on the presentation of an integrated classification + regression methodology for modeling intermittent flows, a detailed comparison of suitable algorithms for classification and regression cells was outside the scope (but we plan to pursue this important question in a separate paper). Secondly, ANNs were chosen because they are known to perform well in both classification and regression tasks, and picking a single approach helps maintain the brevity of the paper and keeps the focus on the presentation of the coupling framework. We have added a comment in this regard to the manuscript:
“The choice of ANNs was made here as they are known to perform both regression and classification tasks with a high degree of accuracy (e.g., Arulampalam and Bouzerdoum, 2003; Rocha et al., 2007; Landi et al., 2010; Al-Shayea, 2011; Amato et al., 2013; Wang et al., 2017; Bektas et al., 2017) and selecting a similar architecture helps with brevity and keeps the focus of the presentation on the proposed modeling frameworks. However, the proposed approach is model agnostic and any other suitable classification and regression scheme can be used instead of the ANN schemes used here for illustrative purposes.”
 Comment: Section 3 is a part of the methodology of this paper. It could be combined with section 2.
Response: Thank you for your suggestion. We agree, and the manuscript has been updated so that “Parameter estimation for deep and wide artificial neural network architectures” is now Section 2.6. All the following sections and subsections have been updated accordingly (please refer to the attached PDF, the additional information for Comment 4, Reviewer 1).
 Comment: The authors must avoid providing literature review in this section and section 4 as well. For example, lines 203209 must be removed, or lines 216236 must be substantially shortened.
Response: Thank you for your comment. We have revised the manuscript, reduced some citations, and shortened the parts on “Greedy learning”, “Extreme Learning Machine configuration”, and “Regularization for robust estimation for hidden node selection” from lines 180 to 244. However, several choices can be made while training ANN architectures and we retained some references here to justify our choices and provide readers with suitable context and additional references to look at while replicating or extending this work.
The paragraphs from line 180 to 244 are revised as: (The parts in brackets are removed)
2.6.1.1 Greedy learning
Greedy learning is a widely used strategy in machine learning for training sequential models such as regression trees, random forests, and deep neural networks (Friedman, 2001; Hinton et al., 2006; [Bengio et al., 2006;] Larochelle et al., 2009; [Johnson and Zhang, 2013; Liu et al., 2017;] Naghizadeh et al., 2021). In this approach, parameter estimation is not carried out on a global objective function but conducted in a piecewise manner. This simplification reduces the number of parameters to be estimated and therefore makes the optimization problem mathematically tractable. [Despite the lack of a global objective function,] Greedy learning algorithms are known to produce useful machine learning models that exhibit a high degree of accuracy (Knoblock et al., 2003; Su et al., 2018; [Wu et al., 2018;] Belilovsky et al., 2019).
Adopting the greedy learning approach here essentially decouples the global objective function (Eq. (9)) into two separate optimization problems whose objective functions are given by Eq. (7) and Eq. (8). In other words, the models in the classification and regression cells are fit separately to estimate the unknowns within each cell. Generally, the increased computation burden of solving two optimization problems is offset by the gains obtained by separating the overall search space of the global objective function. Therefore, the greedy optimization approach was adopted here to solve Eq. (9).
2.6.1.2 Extreme Learning Machine configuration
An Extreme Learning Machine (ELM) is a special form of MLP wherein the weights for the input-hidden node connections and the associated bias terms are randomly assigned, rather than being estimated via optimization. This strategy greatly reduces the complexity of the parameter estimation process as [the weights connecting the inputs to hidden nodes and the associated bias terms need not be estimated and] only the weights and bias associated with the output node need to be estimated.
From a conceptual standpoint, as the input-output computations (Eq. (2) and Eq. (3)) are not part of the parameter estimation process, they only need to be performed once. This is tantamount to applying a randomized nonlinear transformation to the original inputs to create a transformed set of variables (i.e., the outputs of the hidden nodes). As the hidden node-output sub-model is a logistic regression formulation in the case of a classification problem and a linear regression formulation in the case of a continuous output, the optimization can be performed with relative ease using analytical approaches.
[Despite the random nature of the input-hidden node transformation, ELMs have been shown to have universal approximation capabilities (Huang et al., 2006; Cocco Mariani et al., 2019). From a practical standpoint, they are noted to perform well and provide results that are comparable to other machine learning methods, especially MLPs that have been fitted using nonlinear gradient descent approaches (Zeng et al., 2015; Yaseen et al., 2019; Adnan et al., 2019).]
ELMs are increasingly being used in hydrology for a wide range of problems (Deo and Şahin, 2015; Atiquzzaman and Kandasamy, 2015; Deo et al., 2016; Mouatadid and Adamowski, 2017; Seo et al., 2018; Afkhamifar and Sarraf, 2020), especially streamflow forecasting (Lima et al., 2016; Rezaie-Balf and Kisi, 2017; Yaseen et al., 2019; Niu et al., 2020).
The use of greedy learning and the ELM configuration greatly reduces the mathematical complexity of the parameter estimation process for the proposed deep and wide topologies for predicting intermittent flow time series. However, the problem of overfitting (Uddameri, 2007) cannot be ruled out, especially when the hidden layer contains a large number of nodes. Overfitting must be addressed to ensure the proposed deep and wide topologies learn the insights in the training dataset and are able to generalize to other inputs that are presented to the model during the calibration phase.
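The ELM idea described above (random, untrained input-hidden weights followed by an analytical fit of the output weights) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ToyELM`, the tanh activation, the Gaussian initialization, the sine-wave toy data, and the tiny ridge term `jitter` (added only for numerical stability) are all hypothetical choices made here.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


class ToyELM:
    """Single-hidden-layer ELM regressor (illustrative sketch only)."""

    def __init__(self, n_inputs, n_hidden, seed=0, jitter=1e-8):
        rng = random.Random(seed)
        # Input-hidden weights and biases are drawn once and never trained.
        self.w = [[rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
        self.b = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
        self.jitter = jitter  # assumed tiny ridge term for numerical stability
        self.beta = None      # hidden-output weights (last entry is the bias)

    def hidden(self, x):
        """Randomized nonlinear transform of one input vector, plus a bias 1."""
        h = [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
             for w, b in zip(self.w, self.b)]
        return h + [1.0]

    def fit(self, xs, ys):
        hs = [self.hidden(x) for x in xs]  # computed once, no iteration
        k = len(hs[0])
        # Normal equations: (H^T H + jitter * I) beta = H^T y
        hth = [[sum(h[i] * h[j] for h in hs) + (self.jitter if i == j else 0.0)
                for j in range(k)] for i in range(k)]
        hty = [sum(h[i] * y for h, y in zip(hs, ys)) for i in range(k)]
        self.beta = solve_linear_system(hth, hty)
        return self

    def predict(self, x):
        return sum(bi * hi for bi, hi in zip(self.beta, self.hidden(x)))


def rmse(pred, obs):
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


# Toy data: a smooth nonlinear target on [-3, 3].
xs = [[-3.0 + 6.0 * i / 99] for i in range(100)]
ys = [math.sin(x[0]) for x in xs]
model = ToyELM(n_inputs=1, n_hidden=8).fit(xs, ys)
elm_rmse = rmse([model.predict(x) for x in xs], ys)
mean_rmse = rmse([sum(ys) / len(ys)] * len(ys), ys)
print(f"ELM RMSE = {elm_rmse:.4f}, mean-only RMSE = {mean_rmse:.4f}")
```

Only the output weights `beta` are estimated, and they come from a single linear solve; a classification cell would replace that solve with a logistic regression on the same hidden-node outputs.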
2.6.1.3 Regularization for robust estimation for hidden node selection
[While the ELM greatly reduces the computational complexity, the randomization of input-hidden node weights implies that the overall model fits are subject to chance.] The number of hidden nodes is an important hyperparameter that critically controls the performance of ANNs, in general, and ELMs, in particular (Huang and Chen, 2007; [Wrong et al., 2008;] Feng et al., 2009; Lan et al., 2010; [Zhang et al., 2012;] Ding et al., 2014). If the number of hidden nodes is set too low, then the improper specification of hidden node weights due to random selection is hard to correct. Having a large number of hidden nodes improves the chances of at least some of them having high weights. However, the nodes with the smaller weights tend to learn the noise in the data, resulting in poor generalizing capabilities. Reducing overfitting while maintaining a sufficient number of hidden nodes to capture nonlinear input-output relationships using ELM has received a significant amount of attention in recent years (Yu et al., 2014; [Shukla et al., 2016; Feng et al., 2017;] Zhou et al., 2018; [Duan et al., 2018;] Lai et al., 2020).
[The second part of the ELM develops a linear least-squares relationship between the output of the hidden nodes and the ultimate output (predictand).] When there are a large number of hidden nodes, correlations between them are to be expected. The presence of correlated inputs results in multicollinearity issues when performing ordinary least squares regression (Hamilton, 1992). Regularization approaches are commonly used to reduce the impacts of correlated inputs and have been used with ELMs to minimize the overfitting problem (Inaba et al., 2018; Zhang et al., 2020). In this approach, an additional term, which is a function of the weights connecting the hidden node and output weights, is added to the loss function (and is referred to as L-norm). The revised objective function (see Eq. (10)) not only minimizes the sum of squares of residuals but also the number of hidden nodes.
The L2-norm, also referred to as Ridge norm or Tikhonov regularization, is a function of squares of the weights (see Eq. (11)). This approach typically forces weights with small singular values to be small numbers (as close to zero as possible), which can be ignored during predictions. The L1-norm, also referred to as LASSO norm (Eq. (12)), minimizes the absolute value of the weights and actually sets the insignificant weights to a value of zero. The loss function with L1-norm results in a convex optimization problem that can be solved via linear programming and is, therefore, commonly adopted (Zhang and Xu, 2016). Furthermore, the L1-norm is shown to induce a greater degree of sparseness [than the L2-norm without sacrificing prediction accuracy (Fakhr et al., ) The L1-norm is also] and to be more robust to outliers in comparison to the L2-norm (Zhang and Luo, 2015). Outliers are of particular concern when dealing with highly variable intermittent flow. The value in Equation 10 is a weighting factor that denotes the relative importance of the regularization term vis-à-vis the error minimization term and can be obtained via a cross-validation procedure (Martínez-Martínez et al., 2011).
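To illustrate why the L1-norm yields exact zeros while the L2-norm only shrinks weights, consider the special case in which the hidden-node outputs are orthonormal. This is an assumption made purely for illustration and does not hold for a real ELM hidden layer. In that case the penalized least-squares solutions have closed forms in terms of the ordinary least-squares weights. The function names, the penalty scaling (0.5 * RSS + lam * sum|w| for L1, 0.5 * RSS + 0.5 * lam * sum w^2 for L2), and the example weights are hypothetical and are not taken from Eqs. (10) to (12).

```python
def ridge_shrink(ols_weights, lam):
    """L2 penalty, orthonormal design: scale every weight by 1 / (1 + lam)."""
    return [w / (1.0 + lam) for w in ols_weights]


def soft_threshold(w, lam):
    """Move w towards zero by lam; weights with |w| <= lam become exactly 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def lasso_shrink(ols_weights, lam):
    """L1 penalty, orthonormal design: soft-threshold every weight."""
    return [soft_threshold(w, lam) for w in ols_weights]


# Hypothetical least-squares weights for five hidden nodes.
ols = [2.0, -0.5, 0.1, -1.5, 0.05]
lam = 0.25

l2_weights = ridge_shrink(ols, lam)
l1_weights = lasso_shrink(ols, lam)

print("L2 (ridge):", l2_weights)  # all weights shrunk, none exactly zero
print("L1 (LASSO):", l1_weights)  # the two smallest weights are exactly zero
print("nodes kept by L1:", sum(1 for w in l1_weights if w != 0.0))
```

In this toy case the L1 penalty removes the two hidden nodes with the smallest weights, whereas the L2 penalty retains all five with reduced magnitude, which is the sparsity behavior described above.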
 Comment: Regarding the organization of the manuscript, I prefer to see Figure A1, Table A1, and Table A2 within the main text. The manuscript does not need an appendix.
Response: Thank you for your comment. Based on your recommendation and in the interest of brevity, Figure A1, Table A1, and Table A2 have been moved to the supplementary material. The updated manuscript has no appendix. Please refer to the attached PDF, the additional information on Comment 6, Reviewer 1.
 Comment: Line 149: remove the full expression of ANNs as you already provided in line 118.
Response: Thank you for your comment. The full expression of ANNs has been removed from line 149.
 Comment: Line 181-187: redundancy in the citation is seen in this paragraph. Remove some of them.
Response: Thank you for your comment. Some of the citations have been removed. The paragraph of line 181 is revised as:
"Greedy learning is a widely used strategy in machine learning for training sequential models such as regression trees, random forests, and deep neural networks (Friedman, 2001; Hinton et al., 2006; Larochelle et al., 2009; Naghizadeh et al., 2021). In this approach, parameter estimation is not carried out on a global objective function but conducted in a piecewise manner. This simplification reduces the number of parameters to be estimated and therefore makes the optimization problem mathematically tractable. Greedy learning algorithms are known to produce useful machine learning models that exhibit a high degree of accuracy (Knoblock et al., 2003; Su et al., 2018; Belilovsky et al., 2019)."
 Comment: Remove capitalization of each word in section 4.4.
Response: Thank you for your comment. The capitalization of each word is removed from this title.
 Comment: Flow rate or flowrate? Use a fixed one in the whole text.
Response: Thank you for your comment. We have made modifications and used “flowrate” throughout the updated manuscript.
 Comment: In section 5, lines 341-342 are irrelevant. Please remove.
Response: Thank you for your comment. The sentence beginning in line 341 is revised as:
“Coarse-scale runoff estimates generated from an ensemble of regional Variable Infiltration Capacity (VIC) models were obtained and used as an input to condition model predictions (i.e., inform the model of the best initial guess of the likely streamflow).”
 Comment: At the end of Section 5, list the selected inputs clearly. Statistical features of inputs must be given.
Response: Thank you for your comment. These lines were added to the end of the “Input specification for deep and wide ANNs for predicting intermittent streamflows” part:
“Ultimately, precipitation, potential evapotranspiration, soil moisture index, and their lags as well as the VIC-estimated runoffs formed the final set of inputs used for each stream. Table S3 (in Supplementary Materials) provides a summary of these inputs and their statistical features at each station.”
Please refer to the attached PDF, the additional information on Comment 11, Reviewer 1, for Table S3.
 Comment: Section 7.1. Calibration must be replaced with training.
Response: Thank you for your comment. In the subtitle of Section 7.1. “calibration” is now replaced with “training”.

RC2: 'Comment on hess-2021-176', Anonymous Referee #2, 06 Jun 2021
Artificial Neural Network (ANN) can be used as a regression model to simulate streamflow as a continuous variable. This paper added a classification model on top of the regression model to simulate the flow status of intermittent streams. If the classification model outputs a zero-flow status, the flow status of the stream is decided without further running the regression model; if the classification model outputs a flowing status, the regression model will be run to predict a flowrate. Based on this idea, the authors developed two separate ANN models with different structures (wide vs. deep) to simulate streamflow for nine intermittent streams in Texas, US, and compared the results with that from a solely regression model.
Although the authors argued that the wide and deep models are different in their structures, I disagree and would say that the only difference is the input data to the regressor of the models: the regressor in the wide model takes all input data including both flowing and non-flowing values, while the regressor in the deep model only takes flowing values as input data. Therefore, the wide and deep models are essentially the same and the differences in results are only due to different input data. The study was actually testing the impact of different input data (a full dataset or a partial dataset) on simulation outputs. This fact compromises the whole structure of the manuscript, and the finding that the wide model that takes the full dataset as input showed better performance in simulating flowrates than the deep model that was built only on part of the input data is not surprising.
In addition, more justification should be added to the Introduction (probably to the paragraph beginning from Line 30) to explain why a data-driven method is chosen to simulate streamflow, rather than a hydrological model. By the way, have the authors thought of combining the classification ANN model with a hydrological model to better simulate streamflow in an intermittent stream?
The structure of the Methodology needs improvement as well. Probably starting with an ANN regression model that is conventionally used to simulate streamflow, followed by the introduction of a classification model on top of the regression model. Instead of proposing a deep and wide model, only develop one of them, since they are the same (see previous argument). More descriptive information should be provided for the model evaluation testbeds, such as what is the calibration period / testing period, why choose that, etc.
The caption of Figures and Tables in this study should be standing alone, with more information added.
As there are many comparisons made in the results, log transformed/no transformation, with/without SMOTE, continuous/wide/deep, it is very easy to confuse readers about the main point of the study. I would suggest the authors only focus on the comparison of regression vs. regression + classification, taking the pathway of SMOTE and log transformation, since they are shown to provide better results, and other comparisons can be included as supporting information.

AC2: 'Reply on RC2', Elma Annette Hernandez, 08 Jul 2021
Reviewer 2
1. Comment: Artificial Neural Network (ANN) can be used as a regression model to simulate streamflow as a continuous variable. This paper added a classification model on top of the regression model to simulate the flow status of intermittent streams. If the classification model outputs a zero-flow status, the flow status of the stream is decided without further running the regression model; if the classification model outputs a flowing status, the regression model will be run to predict a flowrate. Based on this idea, the authors developed two separate ANN models with different structures (wide vs. deep) to simulate streamflow for nine intermittent streams in Texas, US, and compared the results with that from a solely regression model.
Response: Thank you for broadly summarizing our study. However, the statement “If the classification model outputs a zero-flow status, the flow status of the stream is decided without further running the regression model” is only correct for the proposed deep configuration. In the wide configuration, the continuous (regression) model is run irrespective of the flow status decided by the classification model, even when the classifier predicts no flow. Therefore, the information flow (i.e., how the input data propagate through the architectures) is different in these two configurations.
2. Comment: Although the authors argued that the wide and deep models are different in their structures, I disagree and would say that the only difference is the input data to the regressor of the models: the regressor in the wide model takes all input data including both flowing and non-flowing values, while the regressor in the deep model only takes flowing values as input data. Therefore, the wide and deep models are essentially the same and the differences in results are only due to different input data. The study was actually testing the impact of different input data (a full dataset or a partial dataset) on simulation outputs. This fact compromises the whole structure of the manuscript, and the finding that the wide model that takes the full dataset as input showed better performance in simulating flowrates than the deep model that was built only on part of the input data is not surprising.
Response: Thank you for this comment. We very much appreciate it as it has helped clarify some differences between the two architectures so that the presentation is less confusing.
The structure of machine learning models is based on how the data and information flow through them (e.g., Schmidhuber, 2012). In a deep network, the information flows through a horizontally stacked (sequential) architecture, while in the wide network it flows through a vertically stacked (or parallel) architecture. Our categorization follows this convention in the machine learning literature, and the differences in information flow were the basis for suggesting that these models have different structures.
We could have also developed another deep (horizontally stacked) model wherein the information would flow sequentially from a classifier and only invoke a regressor that would be trained on the entire dataset (both flow and no-flow) if the classifier cell resulted in a no-flow estimate. This model would be in line with the approach you are proposing here, as information would flow sequentially from the classification to regression cells (with the regression cell having an ANN trained on the entire dataset). In this case, the structure of the two models would be exactly the same, with only a difference in the training dataset.
Part of the confusion could also be stemming from the fact that the same set of inputs was adopted to evaluate deep and wide architectures. We adopted this approach here to ensure as much similarity as possible between the two models and to block other factors (e.g., different model architectures and algorithms) during our comparison. Even then, the differences between the two models can be seen. The two models essentially perform the same when the classifier correctly classifies no-flow (i.e., classifier-controlled cases). However, they do provide different results when there is flow or there is a misclassification of flow state (regression-controlled cases). The deep model, in its present configuration, does well when there are large flows (which are often seen in intermittent streams), while the wide model does well when the system is dominated by low flows.
In a deep architecture, the classification cell and regression cell need to be trained on the same dataset (as information flow is sequential). On the other hand, this need not be the case with a wide network. The wide network, therefore, has the advantage of coupling with existing data-driven and physically-based modeling tools that may be available for the intermittent stream of interest. We have modified Figure 1 (please refer to the attached PDF, the additional information for Comment 2, Reviewer 2, for the updated version of Figure 1) and added text covering the above points to draw a clear distinction between the two configurations.
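The difference in information flow between the two configurations can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names, the toy threshold classifier, the linear toy regressors, and in particular the rule used to merge the two outputs in the wide case (multiplying the regressor output by the 0/1 classifier state) are assumptions made here for clarity.

```python
from typing import Callable, Dict, Sequence

Features = Sequence[float]
Classifier = Callable[[Features], int]    # 1 = flow, 0 = no-flow
Regressor = Callable[[Features], float]   # flowrate estimate


def predict_deep(classifier: Classifier, regressor: Regressor,
                 x: Features) -> float:
    """Sequential (deep) stacking: the regressor runs only after a 'flow' call."""
    if classifier(x) == 0:
        return 0.0            # no-flow decided; the regressor is never invoked
    return regressor(x)       # regressor assumed trained on flowing records only


def predict_wide(classifier: Classifier, regressor: Regressor,
                 x: Features) -> float:
    """Parallel (wide) stacking: both cells always run, then outputs are merged."""
    state = classifier(x)
    flow = regressor(x)       # evaluated irrespective of the classifier state
    return state * flow       # assumed merge rule: gate the flowrate by the state


# Hypothetical stand-ins for trained cells; x[0] plays the role of precipitation.
calls: Dict[str, int] = {"deep": 0, "wide": 0}


def toy_classifier(x: Features) -> int:
    return 1 if x[0] > 1.0 else 0


def make_regressor(name: str, slope: float) -> Regressor:
    def regressor(x: Features) -> float:
        calls[name] += 1      # count how often each regression cell is evaluated
        return slope * x[0]
    return regressor


deep_regressor = make_regressor("deep", 2.0)   # "trained" on flowing records
wide_regressor = make_regressor("wide", 1.5)   # "trained" on all records

dry, wet = [0.2], [4.0]
deep_dry = predict_deep(toy_classifier, deep_regressor, dry)
wide_dry = predict_wide(toy_classifier, wide_regressor, dry)
deep_wet = predict_deep(toy_classifier, deep_regressor, wet)
wide_wet = predict_wide(toy_classifier, wide_regressor, wet)

print("dry input:", deep_dry, wide_dry)  # both zero when no-flow is classified
print("wet input:", deep_wet, wide_wet)  # differ because the regressors differ
print("regressor evaluations:", calls)
```

Because the wide regressor is only an argument to `predict_wide`, it could in principle be replaced by any existing continuum model, which mirrors the coupling advantage noted in the response.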
3. Comment: In addition, more justification should be added to the Introduction (probably to the paragraph beginning from Line 30) to explain why a data-driven method is chosen to simulate streamflow, rather than a hydrological model.
Response: Thank you for your comment. Data-driven models are largely preferred in intermittent streams because the continuum assumption is not strictly valid, especially when the stream dries up. The streamflow in intermittent streams exhibits sharp discontinuities which, in turn, cause significant nonlinearity in the datasets. Being empirical in nature, data-driven models are not based on the continuum assumption and generally exhibit a greater ability to capture nonlinearities. Both these factors suggest better suitability of data-driven models for modeling intermittent flows and were the primary reasons for using them in this study. We have added relevant statements to the revised manuscript to better clarify this issue.
4. Comment: By the way, have the authors thought of combining the classification ANN model with a hydrological model to better simulate streamflow in an intermittent stream?
Response: Thank you for your suggestion. The integration of a classification ANN model with a physically-based hydrological model is certainly possible in the case of a wide ANN architecture but not with a deep ANN architecture. The coupling of a hydrologic model (based on the continuum assumption) with a discrete classifier would help address the issue of modeling ‘mixture data’ types arising in intermittent streams with a combination of continuum and discrete classifier approaches. This is clearly one of the advantages of the wide formulation. The deep architecture exhibits greater fidelity to the mixture-type data in that discrete and continuous portions of the data are modeled separately (but within the same model). The wide architecture does not exhibit complete fidelity in the sense that both the discrete and continuous portions of data are used in the regression cell instead of just the continuous portion. However, the wide architecture is more practical in that existing models (both hydrologic and data-driven) based on the continuum assumption can be integrated with a classifier and improved upon. We have added this discussion to the revised version.
As our focus here is on comparing deep and wide architectures, the comparison with a hydrological model is clearly out of the scope here. But we certainly envision a future study that is built along these lines. We thank the reviewer for this question as it certainly helps clarify the differences between deep and wide architectures.
5. Comment: The structure of the Methodology needs improvement as well. Probably starting with an ANN regression model that is conventionally used to simulate streamflow, followed by the introduction of a classification model on top of the regression model.
Response: Thank you for your comment. As single-layer ANNs are well-known and widely applied in the field of hydrology, we did not feel an introduction to them was needed, in the interest of brevity. However, we agree with the reviewer that a short discussion on MLPs for classification and regression would make the paper complete. Therefore, we have added a section in the supplementary material explaining the workings of ANNs for both regression and classification. Please refer to the attached PDF, the additional information on Comment 5, Reviewer 2, for the added introductory section on ANNs.
6. Comment: Instead of proposing a deep and wide model, only develop one of them, since they are the same (see previous argument).
Response: Thank you for your suggestion. We have clarified why the two architectures are different and as such retained the presentation on both models. However, we thank you for your ideas and suggestions as a comparison of different types of wide models (e.g., datadriven and hydrologic) coupled with a classifier would be of interest to the hydrologic modeling community and expect to continue our research along the lines suggested by you.
 Comment: More descriptive information should be provided for the model evaluation testbeds, such as what is the calibration period / testing period, why choose that, etc.
Response: Thank you for your comments. We have added a statement indicating that the first 75% of the records were used for training and the remaining 25% were used for testing. This split was chosen because our goal is to evaluate the proposed architectures for short-term (a few months ahead) to medium-term (a few years ahead) forecasts, which are necessary for water resources management in these streams.
In addition to the above clarifications, we have also modified Table 1 (Please refer to the attached PDF, the additional information on Comment7, Reviewer2, for the updated version of Table 1) to provide additional details pertinent to the calibration period and validation period.
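The 75%/25% split described above can be illustrated with a short sketch. It assumes the split is chronological (first 75% of the record for training, the remainder for testing, with no shuffling); the function name `chronological_split` and the toy series are hypothetical and are not taken from the manuscript.

```python
import numpy as np

def chronological_split(records, train_fraction=0.75):
    """Split a time-ordered record into training and testing parts without shuffling."""
    records = np.asarray(records)
    n_train = int(len(records) * train_fraction)  # number of leading records used for training
    return records[:n_train], records[n_train:]

# Hypothetical record of 20 time steps (stand-in for a flow series)
flows = np.arange(20.0)
train, test = chronological_split(flows)
```

Keeping the order intact means the testing period always follows the training period in time, which matches the forecasting use case described in the response.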
 Comments: The caption of Figures and Tables in this study should be standing alone, with more information added.
Response: Thank you for your comments. All the captions of figures and tables were reviewed, and the captions were updated with more information added (Please refer to the attached PDF, the additional information on Comment 8, Reviewer2, for a list of updated captions.)
 Comment: As there are many comparisons made in the results, log transformed/no transformation, with/without SMOTE, continuous/wide/deep, it is very easy to confuse readers of what the main point of the study. I would suggest the authors only focus on the comparison of regression vs. regression + classification, taking the pathway of SMOTE and log transformation, since they are shown to provide better results, and other comparisons can be included as supporting information.
Response: Thank you for your suggestion. All results associated with the “notransformation” mode have been removed from the main body and Table 2. That information has been moved to the supplementary material (Table S4). Please refer to the attached PDF, the additional information on Comment9, Reviewer2, for Table S4. Also, after the positive impact of SMOTEbalancing was depicted in Figure 6, only the results of SMOTE and log transformation pathway are presented, per your recommendation. Furthermore, the captions of the figures and tables were updated to clarify the results and various comparisons that are being made in each section.

AC2: 'Reply on RC2', Elma Annette Hernandez, 08 Jul 2021
Status: closed

RC1: 'Comment on hess-2021-176', Anonymous Referee #1, 07 May 2021
The manuscript suggests the combination of classification and regression models (deep and wide topology) to increase the accuracy of the current data-driven models available for streamflow forecasting in intermittent rivers. Overall, the topic is very interesting, and the manuscript was written well. The suggested models are new, and the results are well discussed. My comments are listed below:
 line 98: update the references of the current regression-based models regarding the following paper:
 Mehr, A. D., & Gandomi, A. H. (2021). MSGP-LASSO: An improved multi-stage genetic programming model for streamflow prediction. Information Sciences, 561, 181-195.
 lines 118-119: In the hydrological modeling community ANNs are known as regressors; however, the authors claimed ANNs as high-performance classifiers. The given references in line 119 are out of the hydrological forecasting community. It is better to remove lines 118-119. Furthermore, please justify why you don’t select a well-known classifier such as SVM or random forest?
 Section 3 is a part of the methodology of this paper. It could be combined with section 2. The authors must avoid providing literature review in this section and section 4 as well. For example, lines 203-209 must be removed, or lines 216-236 must be substantially shortened. Regarding the organization of the manuscript, I prefer to see Figure A1, Table A1, and Table A2 within the main text. The manuscript does not need an appendix.
 Line 149: remove the full expression of ANNs as you already provided in line 118.
 Lines 181-187: redundancy in the citation is seen in this paragraph. Remove some of them.
 Remove capitalization of each word in section 4.4.
 Flow rate or flowrate? Use a fixed one in the whole text.
 In section 5, lines 341-342 are irrelevant. Please remove.
 At the end of Section 5, list the selected inputs clearly. Statistical features of inputs must be given.
 Section 7.1. Calibration must be replaced with training.

AC1: 'Reply on RC1', Elma Annette Hernandez, 08 Jul 2021
Reviewer 1
Comment: The manuscript suggests the combination of classification and regression models (deep and wide topology) to increase the accuracy of the current data-driven models available for streamflow forecasting in intermittent rivers. Overall, the topic is very interesting, and the manuscript was written well. The suggested models are new, and the results are well discussed.
Response: We thank the reviewer for this comment and for finding our work to be interesting, innovative, and well-written. We also appreciate the reviewer’s other detailed and helpful comments. We have made the necessary modifications as described below.
 Comment: line 98: update the references of the current regression-based models regarding the following paper: Mehr, A. D., & Gandomi, A. H. (2021). MSGP-LASSO: An improved multi-stage genetic programming model for streamflow prediction. Information Sciences, 561, 181-195.
Response: Thank you for your suggestion and for pointing us to this relevant reference. The references in line 98 have been updated and the mentioned reference has been added. The paragraph at line 97 is updated as:
“Conventionally used models for intermittent streamflow forecasting only include the regression cell of the wide network (Cigizoglu, 2005; Kisi, 2009; Makwana and Tiwari, 2014; Rahmani-Rezaeieh et al., 2020; Mehr and Gandomi, 2021). This configuration typically has a single input layer, a hidden layer and an output layer and is referred to as shallow topology (or shallow model) in this study.”
 Comment: lines 118-119: In the hydrological modeling community ANNs are known as regressors; however, the authors claimed ANNs as high-performance classifiers. The given references in line 119 are out of the hydrological forecasting community. It is better to remove lines 118-119.
Response: Thank you for your comment. We agree that, unlike regression, classification is not a common approach in hydrological modeling. We, therefore, provided citations from other fields where ANN classifiers have been used over a wide range of datasets in an effort to justify the testing of this approach in this application.
The paragraph at line 118 is rewritten as:
“Unlike the regression approach, which is widely used for streamflow forecasting, classification is not a common methodology in hydrological modeling. Artificial Neural Networks (ANN) are however known to provide high performance in both regression and classification over a wide range of datasets and applications in other fields (e.g., Araulampalam and Bouzerdoum, 2003; Rocha et al., 2007; Landi et al., 2010; AlShayea, 2011; Amato et al., 2013; Wang et al., 2017; Bektas et al., 2017) and thus provide a strong basis for testing their use here. While the developed topologies in this study are independent of the algorithm, for the sake of brevity, the same family of ANN models was used for the regression and classification cells in this study.”
 Comments: Furthermore, please justify why you don’t select a wellknown classifier such as SVM or random forest?
Response: Thank you for this important comment.
As we state in our paper, any classification and regression modeling scheme can be used with our proposed approach. As our focus was on the presentation of an integrated classification + regression methodology for modeling intermittent flows, a detailed comparison of suitable algorithms for classification and regression cells was outside the scope (but we plan to pursue this important question in a separate paper). Secondly, ANNs were chosen because they are known to perform well in both classification and regression tasks and picking a single approach helps maintain the brevity of the paper and keep the focus on the presentation of the coupling framework. We have added a comment to this regard in the manuscript.
The choice of ANNs was made here as they are known to perform both regression and classification tasks with a high degree of accuracy (e.g., Araulampalam and Bouzerdoum, 2003; Rocha et al., 2007; Landi et al., 2010; AlShayea, 2011; Amato et al., 2013; Wang et al., 2017; Bektas et al., 2017), and selecting a similar architecture helps with brevity and keeps the focus of the presentation on the proposed modeling frameworks. However, the proposed approach is model agnostic and any other suitable classification and regression scheme can be used instead of the ANN schemes used here for illustrative purposes.
 Comment: Section 3 is a part of the methodology of this paper. It could be combined with section 2.
Response: Thank you for your suggestion. We agree and the manuscript has been updated with “Parameter estimation for deep and wide artificial neural network architectures” as in section 2.6. All the following sections and subsections are updated subsequently (Please refer to the attached PDF, the additional information for Comment 4, Reviewer 1).
 Comment: The authors must avoid providing literature review in this section and section 4 as well. For example, lines 203-209 must be removed, or lines 216-236 must be substantially shortened.
Response: Thank you for your comment. We have revised the manuscript, reduced some citations, and shortened the parts on “Greedy learning”, “Extreme Learning Machine configuration”, and “Regularization for robust estimation for hidden node selection” from lines 180 to 244. However, several choices can be made while training ANN architectures and we retained some references here to justify our choices and provide readers with suitable context and additional references to look at while replicating or extending this work.
The paragraphs from line 180 to 244 are revised as: (The parts in brackets are removed)
2.6.1.1 Greedy learning
Greedy learning is a widely used strategy in machine learning for training sequential models such as regression trees, random forests, and deep neural networks (Friedman, 2001; Hinton et al., 2006; [Bengio et al., 2006;] Larochelle et al., 2009; [Johnson and Zhang, 2013; Liu et al., 2017;] Naghizadeh et al., 2021). In this approach, parameter estimation is not carried out on a global objective function but conducted in a piecewise manner. This simplification reduces the number of parameters to be estimated and therefore makes the optimization problem mathematically tractable. [Despite the lack of a global objective function,] Greedy learning algorithms are known to produce useful machine learning models that exhibit a high degree of accuracy (Knoblock et al., 2003; Su et al., 2018; [Wu et al., 2018;] Belilovsky et al., 2019).
Adopting the greedy learning approach here essentially decouples the global objective function (Eq. (9)) into two separate optimization problems whose objective functions are given by Eq. (7) and Eq. (8). In other words, the models in the classification and regression cells are fit separately to estimate the unknowns within each cell. Generally, the increased computation burden of solving two optimization problems is offset by the gains obtained by separating the overall search space of the global objective function. Therefore, the greedy optimization approach was adopted here to solve Eq. (9).
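The decoupling described above can be sketched as two independent fits. This is only an illustration under stated assumptions: Eqs. (7) to (9) are not reproduced here, so a plain logistic regression stands in for the classification cell and ordinary least squares for the regression cell; the regression cell is fit on flowing records only (as in the deep configuration); and the function names and synthetic data are hypothetical.

```python
import numpy as np

def fit_classifier(X, is_flow, lr=0.1, n_iter=2000):
    """Classification cell (stand-in): logistic regression fit by gradient descent."""
    Xb = np.column_stack([np.ones(len(X)), X])      # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))           # predicted probability of flow
        w -= lr * Xb.T @ (p - is_flow) / len(X)     # gradient step on the log-loss
    return w

def fit_regressor(X, y):
    """Regression cell (stand-in): ordinary least squares."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

# Synthetic mixture-type data: zero flow when the first input is negative
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
is_flow = (X[:, 0] > 0).astype(float)
flow = np.where(is_flow == 1, 1.0 + 2.0 * X[:, 1], 0.0)

# Greedy (piecewise) estimation: each cell has its own objective and is fit separately
w_cls = fit_classifier(X, is_flow)
w_reg = fit_regressor(X[is_flow == 1], flow[is_flow == 1])
```

Neither fit needs the other's parameters, which is the practical meaning of replacing one global search with two smaller ones.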
2.6.1.2 Extreme Learning Machine configuration
An Extreme Learning Machine (ELM) is a special form of MLP wherein the weights for the input-hidden node connections and the associated bias terms are randomly assigned, rather than being estimated via optimization. This strategy greatly reduces the complexity of the parameter estimation process as [the weights connecting the inputs to hidden nodes and the associated bias terms need not be estimated and] only the weights and bias associated with the output node need to be estimated.
From a conceptual standpoint, as the input-hidden node computations (Eq. (2) and Eq. (3)) are not part of the parameter estimation process, they only need to be performed once. This is tantamount to applying a randomized nonlinear transformation to the original inputs to create a transformed set of variables (i.e., the outputs of the hidden nodes). As the hidden node-output submodel is a logistic regression formulation in the case of a classification problem and a linear regression formulation in the case of a continuous output, the optimization can be performed with relative ease using analytical approaches.
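A minimal sketch of the ELM idea for the regression case follows. It is an illustration, not the configuration used in the manuscript: the tanh activation, the standard-normal random weights, the number of hidden nodes, and the names `elm_fit` and `elm_predict` are all assumptions made for the example.

```python
import numpy as np

def elm_fit(X, y, n_hidden=50, seed=0):
    """Fit an ELM regressor: random hidden layer, least-squares output layer."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input-hidden weights (never trained)
    b = rng.normal(size=n_hidden)                 # random hidden biases (never trained)
    H = np.tanh(X @ W + b)                        # hidden-node outputs, computed once
    Hb = np.column_stack([np.ones(len(H)), H])    # add output bias column
    beta, *_ = np.linalg.lstsq(Hb, y, rcond=None) # only the output weights are estimated
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Apply the same random transformation, then the fitted linear output layer."""
    H = np.tanh(X @ W + b)
    return beta[0] + H @ beta[1:]

# Toy example: approximate a smooth nonlinear function of one input
X = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
y = np.sin(X[:, 0])
W, b, beta = elm_fit(X, y)
rmse = np.sqrt(np.mean((elm_predict(X, W, b, beta) - y) ** 2))
```

For a classification cell, the least-squares step would be replaced by a logistic regression on the same hidden-node outputs.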
[Despite the random nature of inputhidden node transformation, ELMs have been shown to have universal approximation capabilities (Huang et al., 2006; Cocco Mariani et al., 2019). From a practical standpoint, they are noted to perform well and provide results that are comparable to other machine learning methods, especially MLPs that have been fitted using nonlinear gradient descent approaches (Zeng et al., 2015; Yaseen et al., 2019; Adnan et al., 2019).]
ELMs are increasingly being used in hydrology for a wide range of problems (Deo and Şahin, 2015; Atiquzzaman and Kandasamy, 2015; Deo et al., 2016; Mouatadid and Adamowski, 2017; Seo et al., 2018; Afkhamifar and Sarraf, 2020), especially streamflow forecasting (Lima et al., 2016; Rezaie-Balf and Kisi, 2017; Yaseen et al., 2019; Niu et al., 2020).
The use of greedy learning and the ELM configuration greatly reduces the mathematical complexity of the parameter estimation process for the proposed deep and wide topologies for predicting intermittent flow time series. However, the problem of overfitting (Uddameri, 2007) cannot be ruled out, especially when the hidden layer contains a large number of nodes. Overfitting must be addressed to ensure the proposed deep and wide topologies learn the insights in the training dataset and are able to generalize to other inputs that are presented to the model during the calibration phase.
2.6.1.3 Regularization for robust estimation for hidden node selection
[While the ELM greatly reduces the computational complexity, the randomization of inputhidden node weights implies that the overall model fits are subject to chance.] The number of hidden nodes is an important hyperparameter that critically controls the performance of ANNs, in general, and ELMs, in particular (Huang and Chen, 2007; [Wrong et al., 2008;] Feng et al., 2009; Lan et al., 2010; [Zhang et al., 2012;] Ding et al., 2014). If the number of hidden nodes is set too low, then the improper specification of hidden node weights due to random selection is hard to correct. Having a large number of hidden nodes improves the chances of at least some of them having high weights. However, the nodes with the smaller weights tend to learn the noise in the data resulting in poor generalizing capabilities. Reducing overfitting while maintaining a sufficient number of hidden nodes to capture nonlinear inputoutput relationships using ELM has received a significant amount of attention in recent years (Yu et al., 2014; [Shukla et al., 2016; Feng et al., 2017;] Zhou et al., 2018; [Duan et al., 2018;] Lai et al., 2020).
[The second part of the ELM develops a linear least-squares relationship between the output of the hidden nodes and the ultimate output (predictand).] When there are a large number of hidden nodes, correlations between them are to be expected. The presence of correlated inputs results in multicollinearity issues when performing ordinary least squares regression (Hamilton, 1992). Regularization approaches are commonly used to reduce the impacts of correlated inputs and have been used with ELMs to minimize the overfitting problem (Inaba et al., 2018; Zhang et al., 2020). In this approach, an additional term, which is a function of the weights connecting the hidden nodes to the output, is added to the loss function (and is referred to as the L-norm). The revised objective function (see Eq. (10)) not only minimizes the sum of squares of residuals but also the number of hidden nodes.
The L2-norm, also referred to as Ridge norm or Tikhonov regularization, is a function of squares of the weights (see Eq. (11)). This approach typically forces weights with small singular values to be small numbers (as close to zero as possible), which can be ignored during predictions. The L1-norm, also referred to as LASSO norm (Eq. (12)), minimizes the absolute value of the weights and actually sets the insignificant weights to a value of zero. The loss function with L1-norm results in a convex optimization problem that can be solved via linear programming and is, therefore, commonly adopted (Zhang and Xu, 2016). Furthermore, the L1-norm is shown to induce a greater degree of sparseness [than the L2-norm without sacrificing prediction accuracy (Fakhr et al., ) The L1-norm is also] and to be more robust to outliers in comparison to the L2-norm (Zhang and Luo, 2015). Outliers are of particular concern when dealing with highly variable intermittent flow. The value in Equation 10 is a weighting factor that denotes the relative importance of the regularization term vis-à-vis the error minimization term and can be obtained via a cross-validation procedure (Martínez-Martínez et al., 2011).
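The contrast between the two penalties can be seen in a small numerical sketch. Eqs. (10) to (12) are not reproduced in this reply, so the objective scaling used below (mean squared error plus a penalty weighted by `lam`) is an assumption, as are the function names and the synthetic data. The L1 fit uses a generic proximal-gradient (ISTA) solver chosen for brevity; it is not claimed to be the solver used in the manuscript.

```python
import numpy as np

def ridge(H, y, lam):
    """L2-penalized least squares (closed form): shrinks weights but keeps them nonzero."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

def lasso_ista(H, y, lam, n_iter=2000):
    """L1-penalized least squares via proximal gradient with soft-thresholding."""
    n = len(y)
    step = n / np.linalg.norm(H, 2) ** 2           # 1 / Lipschitz constant of the gradient
    beta = np.zeros(H.shape[1])
    for _ in range(n_iter):
        z = beta - step * (H.T @ (H @ beta - y) / n)                   # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)    # soft-threshold
    return beta

# Stand-in for hidden-node outputs: 10 nodes, only the first two matter
rng = np.random.default_rng(0)
H = rng.normal(size=(100, 10))
y = 3.0 * H[:, 0] - 2.0 * H[:, 1] + 0.1 * rng.normal(size=100)

beta_l2 = ridge(H, y, lam=1.0)
beta_l1 = lasso_ista(H, y, lam=0.1)
```

In this toy setting the L1 solution sets the weights of the uninformative nodes exactly to zero, whereas the L2 solution only makes them small, which is the node-selection behavior the paragraph above relies on.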
 Comment: Regarding the organization of the manuscript, I prefer to see Figure A1, Table A1, and Table A2 within the main text. The manuscript does not need an appendix.
Response: Thank you for your comment. Based on your recommendation and in the interest of brevity, Figure A1, Table A1, and Table A2 have been moved to the supplementary material. The updated manuscript has no appendix. Please refer to the attached PDF, the additional information on Comment6, Reviewer1.
 Comment: Line 149: remove the full expression of ANNs as you already provided in line 118.
Response: Thank you for your comment. The full expression of ANNs has been removed from line 149.
 Comment: Lines 181-187: redundancy in the citation is seen in this paragraph. Remove some of them.
Response: Thank you for your comment. Some of the citations have been removed. The paragraph of line 181 is revised as:
"Greedy learning is a widely used strategy in machine learning for training sequential models such as regression trees, random forests, and deep neural networks (Friedman, 2001; Hinton et al., 2006; Larochelle et al., 2009; Naghizadeh et al., 2021). In this approach, parameter estimation is not carried out on a global objective function but conducted in a piecewise manner. This simplification reduces the number of parameters to be estimated and therefore makes the optimization problem mathematically tractable. Greedy learning algorithms are known to produce useful machine learning models that exhibit a high degree of accuracy (Knoblock et al., 2003; Su et al., 2018; Belilovsky et al., 2019)."
 Comment: Remove capitalization of each word in section 4.4.
Response: Thank you for your comment. The capitalization of each word is removed from this title.
 Comment: Flow rate or flowrate? Use a fixed one in the whole text.
Response: Thank you for your comment. We have made modifications and used “flowrate” throughout the updated manuscript.
 Comment: In section 5, lines 341-342 are irrelevant. Please remove.
Response: Thank you for your comment. The sentence beginning in line 341 is revised as:
“Coarse-scale runoff estimates generated from an ensemble of regional Variable Infiltration Capacity (VIC) models were obtained and used as an input to condition model predictions (i.e., inform the model of the best initial guess of the likely streamflow).”
 Comment: At the end of Section 5, list the selected inputs clearly. Statistical features of inputs must be given.
Response: Thank you for your comment. These lines were added to the end of the “Input specification for deep and wide ANNs for predicting intermittent streamflows” part:
“Ultimately, precipitation, potential evapotranspiration, soil moisture index, and their lags as well as the VICestimated runoffs formed the final set of inputs used for each stream. Table S3 (in Supplementary Materials) provides a summary of these inputs and their statistical features at each station.”
Please refer to the attached PDF, the additional information on Comment 11, Reviewer1, for TableS3.
 Comment: Section 7.1. Calibration must be replaced with training.
Response: Thank you for your comment. In the subtitle of Section 7.1. “calibration” is now replaced with “training”.

RC2: 'Comment on hess-2021-176', Anonymous Referee #2, 06 Jun 2021
Artificial Neural Network (ANN) can be used as a regression model to simulate streamflow as a continuous variable. This paper added a classification model on top of the regression model to simulate the flow status of intermittent streams. If the classification model outputs a zero-flow status, the flow status of the stream is decided without further running the regression model; if the classification model outputs a flowing status, the regression model will be run to predict a flowrate. Based on this idea, the authors developed two separate ANN models with different structures (wide vs. deep) to simulate streamflow for nine intermittent streams in Texas, US, and compared the results with that from a solely regression model.
Although the authors argued that the wide and deep models are different in their structures, I disagree and would say that the only difference is the input data to the regressor of the models: the regressor in the wide model takes all input data including both flowing and nonflowing values, while the regressor in the deep model only takes flowing values as input data. Therefore, the wide and deep models are essentially the same and the difference in results are only due to different input data. The study was actually testing the impact of different input data (a full dataset or a partial dataset) on simulation outputs. This fact compromises the whole structure of the manuscript, and the finding that the wide model that takes the full dataset as input showed better performance in simulating flowrates than the deep model that was built only on part of the input data is not surprising.
In addition, more justification should be added to the Introduction (probably to the paragraph beginning from Line 30) to explain why a data-driven method is chosen to simulate streamflow, rather than a hydrological model. By the way, have the authors thought of combining the classification ANN model with a hydrological model to better simulate streamflow in an intermittent stream?
The structure of the Methodology needs improvement as well. Probably starting with an ANN regression model that is conventionally used to simulate streamflow, followed by the introduction of a classification model on top of the regression model. Instead of proposing a deep and wide model, only develop one of them, since they are the same (see previous argument). More descriptive information should be provided for the model evaluation testbeds, such as what is the calibration period / testing period, why choose that, etc.
The caption of Figures and Tables in this study should be standing alone, with more information added.
As there are many comparisons made in the results, log transformed/no transformation, with/without SMOTE, continuous/wide/deep, it is very easy to confuse readers of what the main point of the study. I would suggest the authors only focus on the comparison of regression vs. regression + classification, taking the pathway of SMOTE and log transformation, since they are shown to provide better results, and other comparisons can be included as supporting information.

AC2: 'Reply on RC2', Elma Annette Hernandez, 08 Jul 2021
Reviewer 2
1. Comment: Artificial Neural Network (ANN) can be used as a regression model to simulate streamflow as a continuous variable. This paper added a classification model on top of the regression model to simulate the flow status of intermittent streams. If the classification model outputs a zero-flow status, the flow status of the stream is decided without further running the regression model; if the classification model outputs a flowing status, the regression model will be run to predict a flowrate. Based on this idea, the authors developed two separate ANN models with different structures (wide vs. deep) to simulate streamflow for nine intermittent streams in Texas, US, and compared the results with that from a solely regression model.
Response: Thank you for broadly summarizing our study. However, the statement “If the classification model outputs a zero-flow status, the flow status of the stream is decided without further running the regression model.” is only correct for the proposed deep configuration. In the wide configuration, the continuous model is run irrespective of the flow status decided by the classification model, even when the classifier predicts no flow. Therefore, the information flow (i.e., how the input data propagate through the architectures) is different in these two configurations.
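The difference in information flow can be sketched as follows. This is an illustration under assumptions: the response states only that the regression cell always runs in the wide configuration, so combining the two outputs by multiplication is an assumption made for this example, and the function names and stand-in cells are hypothetical.

```python
import numpy as np

def predict_deep(X, classify, regress):
    """Series (deep): the regression cell runs only on records classified as flowing."""
    is_flow = classify(X).astype(bool)
    out = np.zeros(len(X))
    if is_flow.any():
        out[is_flow] = regress(X[is_flow])
    return out

def predict_wide(X, classify, regress):
    """Parallel (wide): both cells see every record; outputs are combined afterwards."""
    return classify(X) * regress(X)   # multiplicative combination is an assumption

calls = []  # number of records the regression cell receives on each invocation

def classify(X):
    """Stand-in classification cell: 1 for flow, 0 for no flow."""
    return (X[:, 0] > 0).astype(float)

def regress(X):
    """Stand-in regression cell that records how many rows it was given."""
    calls.append(len(X))
    return 2.0 * X[:, 0]

X = np.array([[-1.0], [0.5], [2.0], [-3.0]])
deep = predict_deep(X, classify, regress)
wide = predict_wide(X, classify, regress)
```

With identical cells the two paths return the same values here; in practice the regression cells differ because of the records each one is trained on and receives, and the parallel path lets any existing continuum model take the place of `regress`.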
 Comment: Although the authors argued that the wide and deep models are different in their structures, I disagree and would say that the only difference is the input data to the regressor of the models: the regressor in the wide model takes all input data including both flowing and nonflowing values, while the regressor in the deep model only takes flowing values as input data. Therefore, the wide and deep models are essentially the same and the difference in results are only due to different input data. The study was actually testing the impact of different input data (a full dataset or a partial dataset) on simulation outputs. This fact compromises the whole structure of the manuscript, and the finding that the wide model that takes the full dataset as input showed better performance in simulating flowrates than the deep model that was built only on part of the input data is not surprising.
Response: Thank you for this comment. We very much appreciate it as it has helped clarify some differences between the two architectures so that the presentation is less confusing.
The structure of machine learning models is based on how the data and information flow through them (e.g., Schmidhuber, 2012). In a deep network, the information flows through a horizontally stacked (sequential) architecture, while in the wide network it flows through a vertically stacked (or parallel) architecture. Our categorization follows this standard of the machine learning literature, and the differences in information flow were the basis for suggesting that these models have different structures.
We could have also developed another deep (horizontally stacked model) wherein the information would flow sequentially from a classifier and only invoke a regressor that would be trained on the entire dataset (both flow and noflow) if the classifier cell resulted in a noflow estimate. This model would be in line with the approach you are proposing here as information would flow sequentially from the classification to regression cells (With the regression cell having an ANN trained on the entire dataset). In this case, the structure of the two models would be exactly the same with only a difference in the training dataset.
Part of the confusion could also be stemming from the fact that the same set of inputs were adopted to evaluate deep and wide architectures. We adopted this approach here to ensure as much similarity between the two models and block other factors (e.g., different model architectures and algorithms) during our comparison. Even then, the differences between the two models can be seen. The two models essentially perform the same when the classifier correctly classifies noflow (i.e., classifier controlled cases). However, they do provide different results when there is flow or there is a misclassification of flow state (regression controlled cases). The deep model, in its present configuration, does well when there are large flows (which is often seen in intermittent streams) while the wide model does well when the system is dominated by low flows.
In a deep architecture, the classification cell and regression cell need to be trained on the same dataset (as information flow is sequential). On the other hand, this need not be the case with a wide network. The wide network, therefore, has the advantage of coupling with existing datadriven and physicallybased modeling tools that may be available for the intermittent stream of interest. We have modified Figure 1 (Please refer to the attached PDF, the additional information for Comment2, Reviewer2, for the updated version of Figure 1) to make this distinction clear and also mentioned the above points to draw a clear distinction between the two configurations and added verbiage to the text to make this clearer.
 Comment: In addition, more justification should be added to the Introduction (probably to the paragraph beginning from Line 30) to explain why a data-driven method is chosen to simulate streamflow, rather than a hydrological model.
Response: Thank you for your comment. Data-driven models are largely preferred in intermittent streams because the assumption of continuum is strictly not valid, especially when the stream dries up. The streamflow in intermittent streams exhibits sharp discontinuities which, in turn, cause significant nonlinearity in the datasets. Being empirical in nature, data-driven models are not based on the continuum assumption and generally exhibit a greater ability to capture nonlinearities. Both these factors suggest better suitability of data-driven models for modeling intermittent flows and were the primary reasons for using them in this study. We have added relevant statements to the revised manuscript to better clarify this issue.
 Comment: By the way, have the authors thought of combining the classification ANN model with a hydrological model to better simulate streamflow in an intermittent stream?
Response: Thank you for your suggestion. The integration of a classification ANN model with a physically-based hydrological model is certainly possible in the case of a wide ANN architecture but not with a deep ANN architecture. The coupling of a hydrologic model (based on the continuum assumption) with a discrete classifier would help address the issue of modeling ‘mixture data’ types arising in intermittent streams with a combination of continuum and discrete classifier approaches. This is clearly one of the advantages of the wide formulation. The deep architecture exhibits greater fidelity to the mixture-type data in that discrete and continuous portions of the data are modeled separately (but within the same model). The wide architecture does not exhibit complete fidelity in the sense that the discrete and continuous portions of data are used in the regression cell instead of just the continuous portion. However, the wide architecture is more practical in that existing models (both hydrologic and data-driven) based on the continuum assumption can be integrated with a classifier and improved upon. We have added this discussion to the revised version.
As our focus here is on comparing deep and wide architectures, the comparison with a hydrological model is clearly out of the scope here. But we certainly envision a future study that is built along these lines. We thank the reviewer for this question as it certainly helps clarify the differences between deep and wide architectures.
5. Comment: The structure of the Methodology needs improvement as well. Probably starting with an ANN regression model that is conventionally used to simulate streamflow, followed by the introduction of a classification model on top of the regression model.
Response: Thank you for your comment. As single-layer ANNs are well known and widely applied in the field of hydrology, we omitted an introduction to them in the interest of brevity. However, we agree with the reviewer that a short discussion of MLPs for classification and regression would make the paper complete. Therefore, we have added a section to the supplementary material explaining the workings of ANNs for both regression and classification. Please refer to the attached PDF, the additional information on Comment 5, Reviewer 2, for the added introductory section on ANNs.
6. Comment: Instead of proposing a deep and wide model, only develop one of them, since they are the same (see previous argument).
Response: Thank you for your suggestion. We have clarified why the two architectures are different and have therefore retained the presentation of both models. We thank you for your ideas and suggestions: a comparison of different types of wide models (e.g., data-driven and hydrologic) coupled with a classifier would be of interest to the hydrologic modeling community, and we expect to continue our research along the lines you suggested.
7. Comment: More descriptive information should be provided for the model evaluation testbeds, such as what is the calibration period / testing period, why choose that, etc.
Response: Thank you for your comments. We have added a statement to indicate that the first 75% of the records were used for training and the remaining 25% were used for testing. The choice of this split was based on our goal of evaluating the proposed architectures for making the short-term (a few months ahead) to medium-term (a few years ahead) forecasts necessary for water resources management in these streams.
In addition to the above clarifications, we have also modified Table 1 to provide additional details on the calibration and validation periods. (Please refer to the attached PDF, the additional information on Comment 7, Reviewer 2, for the updated version of Table 1.)
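For readers who wish to reproduce this kind of split, a minimal sketch is given below. The function name and the rounding rule are assumptions made for illustration; the response only states that the first 75% of the records were used for training and the remainder for testing.

```python
def chronological_split(records, train_fraction=0.75):
    """Split a time-ordered sequence without shuffling.

    The earliest `train_fraction` of the records form the training
    set and the remaining records form the testing set. Rounding
    down to a whole number of records is an assumption.
    """
    n_train = int(len(records) * train_fraction)
    return records[:n_train], records[n_train:]
```

Because the records are not shuffled, the testing period always follows the training period in time, which matches the forecasting use case described above.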
8. Comment: The captions of figures and tables in this study should stand alone, with more information added.
Response: Thank you for your comments. All figure and table captions were reviewed and updated with additional information. (Please refer to the attached PDF, the additional information on Comment 8, Reviewer 2, for a list of updated captions.)
9. Comment: As there are many comparisons made in the results (log transformed/no transformation, with/without SMOTE, continuous/wide/deep), it is very easy to confuse readers about what the main point of the study is. I would suggest the authors only focus on the comparison of regression vs. regression + classification, taking the pathway of SMOTE and log transformation, since they are shown to provide better results, and other comparisons can be included as supporting information.
Response: Thank you for your suggestion. All results associated with the "no-transformation" mode have been removed from the main body and Table 2 and moved to the supplementary material (Table S4). Please refer to the attached PDF, the additional information on Comment 9, Reviewer 2, for Table S4. Also, once the positive impact of SMOTE balancing has been shown in Figure 6, only the results of the SMOTE and log-transformation pathway are presented, per your recommendation. Furthermore, the captions of the figures and tables were updated to clarify the results and the comparisons being made in each section.

AC2: 'Reply on RC2', Elma Annette Hernandez, 08 Jul 2021
Farhang Forghanparast et al.