Deep, Wide, or Shallow? Artificial Neural Network Topologies for Predicting Intermittent Flows

Abstract. Intermittent Rivers and Ephemeral Streams (IRES) comprise 60% of all streams in the US and about 50% of the streams worldwide. Furthermore, climate-driven changes are expected to force a shift towards intermittency in currently perennial streams. Most modeling studies have treated intermittent streamflows as a continuum. However, it is better to envision flow data of IRES as a "mixture-type", comprised of both flow and no-flow regimes. It is therefore hypothesized that data-driven models with both classification and regression cells can improve the streamflow forecasting abilities in these streams. Deep and wide Artificial Neural Networks (ANNs) comprising classification and regression cells were developed here by stacking the cells in series and parallel configurations. These deep and wide network architectures were compared against the commonly used single hidden layer ANNs (shallow), as a baseline, for modeling IRES flow series under the continuum assumption. New metrics focused on no-flow persistence and transitions between flow and no-flow states were formulated using contingency tables and Markov chain analysis. Nine IRES across the state of Texas, US, were used as testbeds spanning a wide range of hydro-climatic characteristics. Model overfitting and the curse-of-dimensionality were reduced using extreme learning machines (ELM), balancing of training data with the synthetic minority oversampling technique (SMOTE), greedy learning, and the Least Absolute Shrinkage and Selection Operator (LASSO). The addition of classifier cells greatly improved the ability to distinguish between no-flow and flow states and, in turn, the ability to capture no-flow persistence (dryness) and transitions to and from flow states (dryness initiation and cessation). The wide network topology provided better results when the focus was on capturing low flows, and the deep topology performed better for extreme high flows.

Conventionally used models for streamflow forecasting only include the regression cell of the wide network (Cigizoglu, 2005; Kişi, 2009; Makwana and Tiwari, 2014; Rahmani-Rezaeieh et al., 2020). This configuration typically has a single input layer, a hidden layer and an output layer and is referred to as the shallow topology (or shallow model) in this study.

Imbalanced datasets (i.e., having an unequal number of zero and non-zero values) are likely to be the norm in IRES datasets rather than an exception. Data imbalance is a major constraint when performing discrete state classification and must be properly dealt with to obtain reliable classifications (Thabtah et al., 2020). Over-sampling of the minority class, under-sampling of the majority class and/or random (equal) sampling of both classes are three techniques that have been used to induce balance in classification datasets (Haixiang et al., 2017).

In this study, the Synthetic Minority Oversampling TEchnique (SMOTE) proposed by Chawla et al. (2002) and since extended to higher-dimensional datasets (Blagus and Lusa, 2013) was used to minimize the effects of data imbalance, as this approach has been widely used in many applications and is known to provide excellent compensation even under high levels of imbalance (Fernández et al., 2018; Raghuwanshi and Shukla, 2020). SMOTE reduces the imbalance by creating synthetic samples for the minority class. These new samples are obtained by combining two similar samples (x and x_r) via weighted addition (see Eq. (1)):

x_{new} = x + \mu \, (x_r - x)   (1)

where \mu is a random weight (0 ≤ \mu ≤ 1) and x_r is randomly chosen from a set of nearest neighbors of x. As SMOTE creates synthetic data, it can only be applied to the training dataset used to obtain the model coefficients and not to the independent testing dataset. Once the SMOTE operation is performed to balance the training dataset, a suitable classification scheme can be adopted.
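As a concrete illustration, the minimal sketch below generates one synthetic minority-class record following Eq. (1). The study itself used the DMwR implementation of SMOTE; this stripped-down version and its variable names are purely illustrative.

```r
# Minimal sketch of the SMOTE interpolation in Eq. (1); the study used the
# DMwR implementation, this stand-alone version is for illustration only.
smote_sample <- function(x, minority, k = 5) {
  # Euclidean distances from x to all minority-class records
  d <- sqrt(rowSums(sweep(minority, 2, x)^2))
  # indices of the k nearest neighbours (excluding x itself, distance 0)
  nn <- order(d)[2:(k + 1)]
  x_r <- minority[sample(nn, 1), ]          # randomly chosen neighbour
  mu  <- runif(1)                           # random weight, 0 <= mu <= 1
  x + mu * (x_r - x)                        # Eq. (1): weighted combination
}

# Example with a toy set of minority-class (no-flow) predictor records
set.seed(1)
minority <- matrix(rnorm(20), ncol = 2)     # hypothetical minority records
new_rec  <- smote_sample(minority[1, ], minority)
```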

Algorithms for classification cell
Artificial Neural Networks (ANN) are known to provide high classification performance over a wide range of datasets and applications (e.g., Al-Shayea, 2011; Amato et al., 2013; Wang et al., 2017; Bektas et al., 2017). In a fully connected feed-forward architecture, all input nodes are connected to all hidden nodes, and all hidden nodes are connected to the output node. Therefore, the outputs of the hidden nodes serve as the inputs of the output node. The connection weight, w_{i,j}, between any two nodes defines the strength of the relationship between them.
The weighted inputs are first aggregated (Eq. (2)) and the sum is then passed through a (nonlinear) activation function (Eq. (3)):

Z_j = \sum_{i=1}^{M} w_{i,j} \, I_i + b_j, \qquad j = 1, \ldots, H   (2)

where Z_j is the aggregated sum at the j-th hidden node, I_i corresponds to the i-th input, M is the number of inputs, and H is the total number of hidden nodes within the hidden layer. The bias term (intercept) of the j-th node is denoted by b_j.
The output of the hidden node is obtained by passing the aggregated sum from Eq. (2) through an activation function:

O_j = f(Z_j)   (3)

where f(·) is the activation function; for the Rectified Linear Unit (ReLU), for example, f(Z_j) = max(0, Z_j). In recent years, non-saturation type activation functions have gained popularity as they overcome some of the deficiencies associated with saturation activation functions (e.g., tanh), such as the development of vanishing gradients during the parameter estimation process (Ramachandran et al., 2018; Eger et al., 2018). The Rectified Linear Unit (ReLU) and the Leaky Rectified Linear Unit (LReLU) are two widely used non-saturation activation functions. As model results are sensitive to the choice of the activation function, and the ideal activation function for a given application is not known a priori, it must be ascertained empirically.
The output node also performs a weighted aggregation of the inputs it receives (i.e., the outputs of the hidden nodes) and, in the case of a binary classifier, passes the summation value through a sigmoid activation function to obtain a value between 0 and 1:

Z_o = \sum_{j=1}^{H} w_{j,o} \, O_j + b_o   (4)

The subscript o refers to the output node in the above equation. Using Eq. (4), the output (O) can be computed as:

O = \frac{1}{1 + e^{-Z_o}}   (5)
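To illustrate Eqs. (2)-(5), the sketch below carries out one forward pass of the classification cell. The weights are drawn at random purely for illustration; in the study they are either fixed randomly (ELM) or estimated, as described in the parameter estimation section.

```r
# Forward pass of the single-hidden-layer classification cell (Eqs. 2-5).
# Weights are random here purely for illustration.
relu    <- function(z) pmax(0, z)                 # ReLU activation (Eq. 3)
sigmoid <- function(z) 1 / (1 + exp(-z))          # output activation (Eq. 5)

classify_cell <- function(I, W, b, w_o, b_o) {
  Z   <- as.vector(W %*% I + b)    # Eq. (2): weighted sums at hidden nodes
  O_h <- relu(Z)                   # Eq. (3): hidden-node outputs
  Z_o <- sum(w_o * O_h) + b_o      # Eq. (4): aggregation at output node
  sigmoid(Z_o)                     # Eq. (5): probability of the flow state (assumed coded as 1)
}

# Toy example with M = 3 inputs and H = 4 hidden nodes
set.seed(1)
M <- 3; H <- 4
p_flow <- classify_cell(I = rnorm(M), W = matrix(rnorm(H * M), H, M),
                        b = rnorm(H), w_o = rnorm(H), b_o = rnorm(1))
```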

Algorithm selection for the regression cell
There is a long history of using artificial neural networks (ANNs) for modeling streamflows, with applications dating back to the early 1990s (e.g., Kang et al., 1993). Based on this large body of literature, the MLP ANN model was again selected as the algorithm for the regression cell. The MLP ANN for a regression problem is very similar to the MLP ANN for classification in that it also has one input layer, one hidden layer, and one output layer, with weights defining the strength of the connection between the nodes.
The operations at the hidden nodes are also represented by Eq. (2) and Eq. (3). However, as the output of a regression model is continuous and not dichotomous, a linear activation function is used in the output node:

Q_p = \sum_{j=1}^{H} w_{j,o} \, O_j + b_o   (6)

where Q_p is the predicted flowrate within the regression cell, and all other variables have the same meaning as before. Notice that Eq. (6) has the same mathematical form as ordinary linear regression.
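For completeness, the only change relative to the classifier sketch above is the linear output node of Eq. (6); again, the weights below are random and purely illustrative.

```r
# The regression cell differs only in the output node: the same hidden-layer
# operations (Eqs. 2-3) are followed by the linear aggregation of Eq. (6).
relu <- function(z) pmax(0, z)
regress_cell <- function(I, W, b, w_o, b_o) {
  O_h <- relu(as.vector(W %*% I + b))   # hidden-node outputs (Eqs. 2-3)
  sum(w_o * O_h) + b_o                  # Eq. (6): predicted flowrate Q_p
}

set.seed(2)
M <- 3; H <- 4
Q_p <- regress_cell(I = rnorm(M), W = matrix(rnorm(H * M), H, M),
                    b = rnorm(H), w_o = rnorm(H), b_o = rnorm(1))
```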

3 Parameter estimation for deep and wide artificial neural network architectures
The presented classification model contains several unknown model coefficients (w_{i,j}, b_j, w_{j,o} and b_o for all i = 1, 2, ..., M; j = 1, ..., H) that have to be estimated from a synthetically balanced training dataset. The binary cross-entropy (BCE) loss function is commonly minimized to obtain the unknown model parameters of the classification cell. The minimization of the BCE loss function (Eq. (7)) is similar to the maximization of the log-likelihood function:

L_c = -\frac{1}{N} \sum_{r=1}^{N} \left[ (1 - Q_{co,r}) \, \ln P(Q_c = 0 \mid I_r) + Q_{co,r} \, \ln P(Q_c = 1 \mid I_r) \right]   (7)

where Q_{co,r} is the state of the r-th record of the observed streamflow (0 or 1) and I_r is the corresponding input record. N is the total number of records in the training dataset.

Using the minimization of squared errors as the basis, the loss function for estimating the parameters of the model within the regression cell can be written as:

L_r = \frac{1}{N^+} \sum_{r=1}^{N^+} \left( Q_{p,r} - Q_{o,r} \right)^2   (8)

where Q_{p,r} and Q_{o,r} are the predicted and observed flowrates of the r-th record. For the wide network topology, N^+ = N, the length of the training dataset, while N^+ denotes the number of records with positive streamflow values in the case of the deep network topology (N^+ ≤ N). The loss function presented in Eq. (8) seeks to minimize the variance and is analogous to the objective function of least-squares regression.
The combined loss function (L_{cr}) can be written by combining Eq. (7) and Eq. (8):

L_{cr} = L_c + L_r   (9)

The parameter estimation problem entails minimization of the combined loss function (Eq. (9)) to obtain a total of (I_c·H_c + 2·H_c + I_r·H_r + 2·H_r + 2) unknowns. Optimization algorithms such as gradient descent and its variants often have difficulties as the number of parameters to be estimated increases. The estimation of parameters in layers farther from the output suffers from vanishing (or exploding) gradient problems, which leads to weights that are close to zero and to models with poor predictive strength. Therefore, strategies that help mitigate this "curse of dimensionality" become important.
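A minimal sketch of the three loss functions follows; Eq. (9) is assumed here to be a simple (unweighted) sum of the classification and regression losses, and the argument names are illustrative.

```r
# Loss functions of Eqs. (7)-(9) written out for given predictions. The
# combined loss is assumed to be an unweighted sum; names are illustrative.
bce_loss <- function(obs_state, p_flow) {            # Eq. (7), classification cell
  -mean((1 - obs_state) * log(1 - p_flow) + obs_state * log(p_flow))
}
sq_loss <- function(q_obs, q_pred) {                  # Eq. (8), regression cell
  mean((q_pred - q_obs)^2)
}
combined_loss <- function(obs_state, p_flow, q_obs, q_pred) {
  bce_loss(obs_state, p_flow) + sq_loss(q_obs, q_pred)   # Eq. (9)
}

# Toy check
combined_loss(obs_state = c(0, 1, 1), p_flow = c(0.2, 0.8, 0.6),
              q_obs = c(0, 2.3, 1.1), q_pred = c(0.1, 2.0, 1.4))
```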
3.1 Strategies for improving the tractability of parameter estimation

3.1.1 Greedy learning

Greedy learning is a widely used strategy in machine learning for training sequential models such as regression trees, random forests, and deep neural networks (Friedman, 2001; Hinton et al., 2006; Bengio et al., 2006; Larochelle et al., 2009; Johnson and Zhang, 2013; Liu et al., 2017; Naghizadeh et al., 2021). In this approach, parameter estimation is not carried out on a global objective function but conducted in a piece-wise manner. This simplification reduces the number of parameters to be estimated at once and therefore makes the optimization problem mathematically tractable. Despite the lack of a global objective function, greedy learning algorithms are known to produce useful machine learning models that exhibit a high degree of accuracy (Knoblock et al., 2003; Su et al., 2018; Wu et al., 2018; Belilovsky et al., 2019).
Adopting the greedy learning approach here essentially decouples the global objective function (Eq. (9)) into two separate optimization problems whose objective functions are given by Eq. (7) and Eq. (8). In other words, the models in the classification and regression cells are fit separately to estimate the unknowns within each cell. Generally, the increased computational burden of solving two optimization problems is offset by the gains obtained by separating the overall search space of the global objective function. Therefore, the greedy optimization approach was adopted here to solve Eq. (9).

3.1.2 Extreme Learning Machine configuration
An Extreme Learning Machine (ELM) is a special form of MLP wherein the weights for the input-to-hidden node connections and the associated bias terms are randomly assigned, rather than being estimated via optimization. This strategy greatly reduces the complexity of the parameter estimation process, as the weights connecting the inputs to hidden nodes and the associated bias terms need not be estimated, and only the weights and bias associated with the output node need to be estimated.
From a conceptual standpoint, as the input-output computations (Eq. (2) and Eq. (3)) are not part of the parameter estimation process, they only need to be performed once. This is tantamount to applying a randomized nonlinear transformation to the original inputs to create a transformed set of variables (i.e., the outputs of the hidden nodes). As the hidden node-output sub-model is a logistic regression formulation in the case of a classification problem and a linear regression formulation in the case of a continuous output, the optimization can be performed with relative ease using analytical approaches.
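To make the ELM configuration concrete, the sketch below fixes random input-to-hidden weights and fits only the hidden-to-output weights. Plain least squares is used here for brevity, whereas the study applied the L1-regularized fits described further below; all names and dimensions are illustrative.

```r
# Minimal ELM sketch for the regression cell: random input-to-hidden weights
# and biases are fixed, and only the hidden-to-output weights are estimated.
set.seed(1)
hidden_transform <- function(X, W, b) {
  pmax(0, X %*% t(W) + matrix(b, nrow(X), length(b), byrow = TRUE))  # ReLU
}
elm_fit <- function(X, y, H = 10) {
  W <- matrix(runif(H * ncol(X), -1, 1), H, ncol(X))  # random hidden weights
  b <- runif(H, -1, 1)                                # random hidden biases
  Hmat <- hidden_transform(X, W, b)                   # one-time transformation
  fit  <- lm(y ~ Hmat)                                # analytical output fit
  list(W = W, b = b, fit = fit)
}
elm_predict <- function(model, X) {
  Hmat <- hidden_transform(X, model$W, model$b)
  as.vector(cbind(1, Hmat) %*% coef(model$fit))
}

# Toy demonstration with four hypothetical inputs
X <- matrix(rnorm(200 * 4), 200, 4)
y <- sin(X[, 1]) + 0.1 * rnorm(200)
m <- elm_fit(X, y)
head(elm_predict(m, X))
```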
The use of greedy learning and the ELM configuration greatly reduces the mathematical complexity of the parameter estimation process for the proposed deep and wide topologies for predicting intermittent flow time-series. However, the problem of overfitting (Uddameri, 2007) cannot be ruled out, especially when the hidden layer contains a large number of nodes. Overfitting must be addressed to ensure the proposed deep and wide topologies learn the insights in the training dataset and are able to generalize to inputs that were not presented to the model during the calibration phase.

While the ELM greatly reduces the computational complexity, the randomization of input-hidden node weights implies that the overall model fits are subject to chance. The number of hidden nodes is an important hyper-parameter that critically controls the performance of ANNs, in general, and ELMs, in particular (Huang and Chen, 2007; Rong et al., 2008; Feng et al., 2009; Lan et al., 2010; Zhang et al., 2012; Ding et al., 2014). If the number of hidden nodes is set too low, then the improper specification of hidden node weights due to random selection is hard to correct.

Having a large number of hidden nodes improves the chances of at least some of them having high weights. However, the nodes with the smaller weights tend to learn the noise in the data, resulting in poor generalization capabilities. Reducing overfitting while maintaining a sufficient number of hidden nodes to capture nonlinear input-output relationships using ELM has received a significant amount of attention in recent years (Yu et al., 2014; Shukla et al.; Feng et al., 2017; Zhou et al., 2018; Duan et al., 2018; Lai et al., 2020).

The second part of the ELM develops a linear least-squares relationship between the output of the hidden nodes and the ultimate output (predictand). When there are a large number of hidden nodes, correlations between them are to be expected. The presence of correlated inputs results in multicollinearity issues when performing ordinary least squares regression (Hamilton, 1992).
Regularization approaches are commonly used to reduce the impacts of correlated inputs and have been used with ELMs to minimize the overfitting problem (Inaba et al., 2018; Zhang et al., 2020). In this approach, an additional term, which is a function of the weights connecting the hidden nodes to the output node, is added to the loss function (and is referred to as an L-norm). The revised objective function (see Eq. (10)) not only minimizes the sum of squares of residuals but also penalizes the magnitude of the hidden-output weights, effectively limiting the number of hidden nodes that influence the predictions:

L = \sum_{r=1}^{N} \left( Q_{p,r} - Q_{o,r} \right)^2 + \lambda \, \| w \|   (10)

The L2-norm (Eq. (11)), also referred to as the Ridge norm or Tikhonov regularization, is a function of the squares of the weights; it shrinks the insignificant weights towards values close to zero (but not exactly zero, making them as small as possible), which can be ignored during predictions.

\| w \|_2^2 = \sum_{j=1}^{H} w_{j,o}^2   (11)
The L1-norm, also referred to as the LASSO norm (Eq. (12)), is a special case of the sparsity norm, which minimizes the absolute values of the weights and actually sets the insignificant weights to a value of zero:

\| w \|_1 = \sum_{j=1}^{H} | w_{j,o} |   (12)

The loss function with the L1-norm results in a convex optimization problem that can be solved via linear programming and is, therefore, commonly adopted (Zhang and Xu, 2016). Furthermore, the L1-norm is shown to induce a greater degree of sparseness than the L2-norm without sacrificing prediction accuracy (Fakhr et al.). The L1-norm is also more robust to outliers than the L2-norm (Zhang and Luo, 2015).
Outliers are of particular concern when dealing with highly variable intermittent flows. The λ value in Eq. (10) is a weighting factor that denotes the relative importance of the regularization term vis-à-vis the error minimization term and can be obtained via a cross-validation procedure (Martínez-Martínez et al., 2011).
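For illustration, the regularized fit of the hidden-to-output weights can be carried out with the glmnet package used in this study, selecting λ by cross-validation. The objects Hmat and y below stand in for the hidden-node output matrix and the (transformed) flow target and are illustrative.

```r
# Selecting the regularization weight lambda in Eq. (10) by cross-validation
# and fitting the L1 (LASSO) penalized hidden-to-output weights with glmnet.
library(glmnet)
set.seed(1)
Hmat <- matrix(rnorm(200 * 25), 200, 25)       # stand-in hidden-node outputs
y    <- rnorm(200)                             # stand-in log(Q + 1) flows

cv_fit <- cv.glmnet(x = Hmat, y = y, alpha = 1)    # alpha = 1 -> L1 (LASSO) norm
coef(cv_fit, s = "lambda.min")                     # sparse output weights
y_hat <- predict(cv_fit, newx = Hmat, s = "lambda.min")

# For the classification cell the analogous call uses family = "binomial",
# giving an L1-regularized logistic regression of the flow / no-flow state.
```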
Custom scripts were developed in R (R Core Team, 2019) to implement the deep and wide network models presented above. In particular, the L1-logistic regression for binary flow and no-flow categorization and the L1-linear regression necessary for solving the ELM models within the classification and regression cells were implemented using procedures outlined by Hastie et al. (2021). The "glmnet" package (Friedman et al., 2010) and the "caret" package (Kuhn, 2008) were used in the model development phase, the "DMwR" package (Torgo, 2010) was used to apply SMOTE, the "FSelectorRcpp" package (Zawadzki and Kosinski, 2020) was used to assess the variable importance via the information gain algorithm in the pre-processing phase, and the "pROC" package (Robin et al., 2011) and the "hydroGOF" package (Zambrano-Bigiarini, 2017) were utilized to assess the discrete and continuous performance of the models, respectively. The developed R scripts are presented in the Supplementary Information.

4 Metrics for evaluation of intermittent flow predictions
It is important to keep the purpose of the model in mind when evaluating its performance (Oreskes, 1998). The primary goal of this modeling effort is to improve the prediction capabilities of zero flow as it denotes the flow intermittency in a stream.
Current evaluation metrics largely focus on the assessment of continuous flows and as such are not readily suitable for appraising intermittent flows. Therefore, a new set of metrics was developed in this study.
Several questions related to flow intermittency that are of interest in practical applications form the basis for the model evaluation scheme presented here. They are based on the contingency table presented in Table A1 and the first-order Markov chain transition matrix presented in Table A2.
4.1 How well does the model predict zero-flow states?

Information from the contingency table can be used to estimate how well the model predicts the zero flow states (AZF). A perfect score for AZF is unity. A value of AZF < 1 indicates the model underpredicts the zero flow occurrences. The AZF metric is not defined for perennial streams, as the denominator will contain a zero value in that case.
There are two types of errors associated with flow and no-flow state prediction which can also provide valuable insights related to the model's utility. The over-prediction error (OPE) occurs when the model predicts that there is flow when there is no flow observed. This error is particularly problematic in applications where the existence of a flow is beneficial, for example, when the water in the intermittent stream is needed for sustaining instream aquatic habitats or when the water can be used for other applications (e.g., ranching or agriculture). OPE can be computed from the contingency table (Table A1).

An under-prediction error (UPE) occurs when the model predicts a no-flow when there is actually an observed flow. From a water availability standpoint, this assumption is conservative. However, when the magnitude of the observed flow is large, this under-prediction can actually be problematic as it does not forecast a non-zero flow and, as such, limits the ability to react to potential flooding or waterlogging conditions in riparian areas.

The contingency table can also be used to construct other receiver operating characteristic (ROC) based metrics. In particular, the area under the curve (AUC) of the ROC is an important metric that evaluates whether a classifier's predictions are better than random guessing (AUC > 0.5) and can be used for initial screening of models.
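For illustration, the sketch below builds contingency-table counts and computes the AUC with the pROC package used in this study. The AZF, OPE and UPE expressions shown are one plausible reading of the definitions above (Table A1 is not reproduced here) and should be treated as assumptions rather than the paper's exact formulas.

```r
# Contingency-table counts and ROC AUC for the flow / no-flow classifier.
# AZF, OPE and UPE below are assumed forms, not the paper's exact formulas.
library(pROC)
obs  <- c(0, 0, 1, 1, 0, 1, 0, 1)                   # observed state (0 = no flow)
prob <- c(0.2, 0.6, 0.8, 0.7, 0.1, 0.9, 0.4, 0.3)   # predicted P(flow)
pred <- as.integer(prob >= 0.5)

hits       <- sum(obs == 0 & pred == 0)   # zero flow correctly predicted
over_pred  <- sum(obs == 0 & pred == 1)   # flow predicted, no flow observed
under_pred <- sum(obs == 1 & pred == 0)   # no flow predicted, flow observed

AZF <- hits / sum(obs == 0)               # assumed: fraction of observed zeros captured
OPE <- over_pred / length(obs)            # assumed: over-prediction error rate
UPE <- under_pred / length(obs)           # assumed: under-prediction error rate

auc_value <- auc(roc(obs, prob))          # AUC of the ROC curve (pROC)
```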

4.2 How well does the model predict no-flow persistence?
In most intermittent streams, once zero-flow conditions are set, they often tend to persist for some period of time. While the AZF metric measures paired occurrences of zero flows in observed and predicted data, it does not measure the persistence of the zero flow state. No-flow persistence restricts transport of mass and energy across habitats, alters habitat and biodiversity makeup, and affects the trophic structure and carrying capacity of the stream. Therefore, this no-flow persistence is an important ecological indicator (Rolls et al., 2012).
No-flow persistence (NFP) can be defined as the typical duration the stream spends in the no-flow state. However, NFP (i.e., the duration of zero flow events) is a random variable. For a confirmatory analysis, the null hypothesis that both observed and predicted no-flow durations arise from the same distribution can be tested against the alternative hypothesis that they are from different distributions using, for example, the two-sample Kolmogorov-Smirnov (KS) test.
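As an illustration, no-flow spell durations can be extracted from a monthly series with run-length encoding and compared using the two-sample KS test and the mean (expected) persistence; the series below are illustrative.

```r
# Extracting no-flow spell durations from a monthly series and comparing
# observed vs. predicted persistence with the two-sample K-S test.
noflow_durations <- function(q, cutoff = 0) {
  r <- rle(q <= cutoff)                      # runs of no-flow (TRUE) states
  r$lengths[r$values]                        # durations of the no-flow spells
}

q_obs  <- c(0, 0, 0, 2, 5, 0, 0, 1, 0, 0, 0, 0, 3)   # illustrative monthly flows
q_pred <- c(0, 0, 1, 2, 4, 0, 0, 0, 0, 0, 2, 0, 1)

d_obs  <- noflow_durations(q_obs)
d_pred <- noflow_durations(q_pred)

ks.test(d_obs, d_pred)                       # H0: same distribution of NFP
mean(d_obs); mean(d_pred)                    # expected no-flow persistence
```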
For a more qualitative comparison, the expected value of no-flow persistence for observed and predicted data can be separately obtained and compared. The qualitative comparison is useful for comparing the performance of multiple modeling schemes. The expected value can be computed from a theoretical or empirical CDF of the no-flow durations.

5 Selection of model inputs

The selection of appropriate inputs should be guided by the physics of the system to the maximum extent possible, but statistical and information-theoretic methods are useful to select among competing input variables (Ghaseminejad and Uddameri, 2020). Measurements of antecedent soil moisture were not available; therefore, the soil moisture index (SMI), which represents the interaction of temperature and rainfall, was chosen as the surrogate variable following Augustin et al. (2008).
Baseflow is largely negligible in intermittent streams and the observed runoff is largely a function of the quickflow component of the streamflow (Jothiprakash et al., 2009). In the drier arid and semi-arid regions, most of the precipitation that falls within a watershed converts into runoff immediately (Sazib et al., 2020). Coarse-scale runoff estimates generated from regional Variable Infiltration Capacity (VIC) models using 21 different GCM model forcings (see Table S1 in the Supplementary Information for a list of models) were obtained (Cherkauer et al., 2003; Brekke et al., 2013) and used as an input to condition model predictions (i.e., to inform the model of the best initial guess of the likely streamflow). From a Bayesian perspective, the runoff from the regional-scale VIC model serves as prior information on the runoff, which is then refined (downscaled) by point-scale hydrological fluxes (P, PET) and antecedent soil moisture conditions (SMI). Given the monthly time-step, the first lag of all the variables was also considered to account for conditions that may have initiated in the latter parts of the month and persisted into a significant portion of the next month.
Lagged values of observed streamflows have also been used as inputs to explicitly capture streamflow persistence in many streamflow forecasting studies (Badrzadeh et al., 2017; Hadi and Tombul, 2018; Danandeh Mehr, 2018; Rahmani-Rezaeieh et al., 2020). However, the use of lagged streamflows (which are an output of the model) is not suitable when multi-step-ahead forecasts are needed, as forecasts often deteriorate at longer lead times (Uddameri et al., 2019; Li et al., 2020). Furthermore, as intermittent outflows contain both zero and non-zero flows, the persistence of flow regimes tends to be complex in intermittent rivers and ephemeral streams. Therefore, lagged values of streamflows were not used in this study. Limiting the model inputs to other hydroclimatic variables allows the models developed here to be used in long-term climate studies, as projections of various hydrological fluxes (inputs) are available from downscaling of global climate model inputs.
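For illustration, the monthly input table with first lags of the forcings (and deliberately without lagged streamflow) can be assembled as sketched below; the data frame and column names are hypothetical.

```r
# Building lag-1 predictors for the hydroclimatic forcings; lagged streamflow
# is excluded, as discussed above. Data and column names are hypothetical.
lag1 <- function(x) c(NA, head(x, -1))

set.seed(1)
monthly_data <- data.frame(P = runif(24), PET = runif(24),
                           SMI = runif(24), VIC_runoff = runif(24))

inputs <- transform(monthly_data,
                    P_lag1   = lag1(P),
                    PET_lag1 = lag1(PET),
                    SMI_lag1 = lag1(SMI),
                    VIC_lag1 = lag1(VIC_runoff))
inputs <- na.omit(inputs)   # drop the first month, which has no lagged values
```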

6 Model evaluation testbeds
The performance of the proposed models was evaluated at 9 intermittent streamflow sites (gaging stations) in Texas. Table 1 presents salient characteristics of the selected stations, and the locations of these sites are shown in Fig. 2. The selected sites exhibited a wide range of flow intermittency and are also scattered across different land uses (Fig. 2(a)) and soil types (Fig. 2).

7 Results and discussion

Model calibrations and testing
The time-series predictions of the models over the training and testing periods are shown in Figs. 3-5. In addition to dichotomous flow states, IRES also exhibit very high flow variability, often ranging over two orders of magnitude or more. Capturing such large variability is a common problem while forecasting streamflows. The log transformation with a unit shift parameter, i.e., log(Q+1), is commonly used to shrink the range of variability while preserving zero flows (Curran-Everett, 2018; Gao and Martos, 2019). This strategy was also adopted here to reduce the range of streamflows prior to calibration. Generally speaking, the testing period exhibited greater variability, comprising both sequences of no flows and higher than normal rainfalls, especially in some parts of the state (see stations ST2, ST3, and ST9).

The predictive performance of the shallow, deep, and wide models was compared across the nine IRES of study over the entire range of flowrates (Table 2). In general, the shallow and wide models had lower Mean Absolute Errors (MAE), while the deep model had a smaller Root Mean Square Error (RMSE). As errors in estimating higher flowrates lead to greater RMSE, the deep model did better in capturing these extreme high-end flows. This result is to be expected because the regression cell of the deep model is trained only with non-zero flow data, while the shallow and wide models are trained using both zero and non-zero flows, which also explains their relative ability to capture lower flows well. Consequently, misclassification of the zero flow state in the classification cell leads to over-estimation of flows by the deep model. A separate comparison of misclassified flows indicated that the over-estimation was no more than 0.15 m3/s (typically < 0.05 m3/s) in comparison to the predictions of the shallow and wide models.
The correlation coefficients (R and ρ) were generally similar for all models at most stations; a few exceptions were noted when the testing dataset was influenced by flow abnormalities (e.g., ST6, ST7, ST9). The Spearman rank correlation coefficients were generally higher at several stations than the parametric Pearson product-moment correlation coefficients, indicating the models did a better job capturing monotonic trends at these stations. Comparing goodness-of-fit metrics in the log(Q+1) transform was useful to better capture observed flows at stations dominated by low flows. However, as most stations exhibited high flow variability, the sensitivity of the transformation to high flows could not be fully ruled out (Krause et al., 2005). The use of raw (non-transformed) flowrates helped better capture these high flows, as even small prediction errors in the log-transformed data at high flowrates are magnified when the predictions are transformed back to the raw (original) scale.
Overall, the performance of all three models was comparable to the reported streamflow forecasting models in the literature (Adnan et al., 2017; Danandeh Mehr, 2018; Adnan et al., 2019; Yaseen et al., 2019; Cheng et al., 2020; Kisi et al., 2021) when viewed over the entire flow regime. As the focus of the study is on improving zero flow predictions, additional tests were carried out to evaluate this aspect further.

Deep and wide topologies for predicting zero flow states
If the same set of inputs is used in the wide and deep formulations, then the classification cell essentially has the same architecture in both these topologies. Therefore, the comparison here is limited to a single classification cell. As seen from Fig. 6, the area under the curve (AUC) of the receiver operating characteristic (ROC) curve was greater than 0.5 for all models, indicating the classifiers yielded better results than random guessing. The application of SMOTE to balance the training dataset generally improved the prediction accuracy on the testing data (which was not subject to SMOTE). At sites where intermittency ratios (# zeros / total # of records) were less than 50%, SMOTE was instrumental in reducing OPE and thus improving zero flow predictions by over 50% on average over all sites. The reduction in OPE was generally accompanied by a small increase in UPE (∼15% overall). When the intermittency ratio is high, the non-zero flows represent the minority class that SMOTE tries to balance. Here, the application of SMOTE caused a reduction in the accuracy of predicting zero flows but improved the accuracy of predicting non-zero flows. The overall accuracy of predictions as well as the AUC of the ROC curve generally improved with the application of SMOTE, except under very high intermittency conditions (e.g., ST9, which has an intermittency of 78%). Thus, SMOTE is a useful data balancing method to improve the predictions of zero flow states in streams with low intermittency and to enhance the accuracy of non-zero flows in streams with high intermittency.

Comparison of zero flow predictions with shallow model
The shallow model, treating the flow time-series of IRES as a continuum, failed to predict any absolute zero flowrates. While a pattern of over-prediction of low flows was observed in the results of the shallow model, in 45% of cases physically unrealistic, negative flowrates were also predicted by the shallow networks (Fig. A1). The addition of a classification cell effectively addressed this issue, as the deep and wide models were capable of predicting absolute zero flowrates with no negative predictions.

Comparison of no-flow state persistence and transitions
As the continuous model did not provide absolute zero flow predictions, a nominal flow of 0.0003 m3/s (0.01 cfs) was used as a cut-off, and flows below this threshold were assumed to be no-flow events predicted by each model. Then, the NFP, FP, NF2FT and F2NFT metrics were computed to compare the performance of the shallow, deep, and wide models in capturing the persistence of the no-flow state and transitions between the wet and dry states (Figs. 7-10).
The results presented in Fig. 7 highlight the superiority of the deep and wide topologies in capturing no-flow persistence, especially when the intermittency ratio was less than 65%. As the deep and wide models utilize the same classifier, they provide similar results. The extent of over-prediction of non-zero flows (Q > 0.0003 m3/s) was so high for the shallow models that in 55% of cases no zero-flow states were predicted despite applying the cut-off.
NFP, observed to be less than 2 months in 88% of cases, was closely approximated by the deep and wide models (within half a month) for most streams. The shallow models either failed to capture any dry states or exhibited under-prediction of NFP and hence were outperformed by the deep and wide models. The deep and wide architectures were also superior in capturing the flow persistence, as the shallow models were prone to over-prediction (up to 5 months). Therefore, the inclusion of a classification cell was essential to capture no-flow persistence. The two-sample Kolmogorov-Smirnov test was used to test the null hypothesis that the observed and modeled zero-flow persistence came from the same distribution against the alternative hypothesis that they came from different distributions. The null hypothesis could not be rejected for any of the models, indicating that the empirical CDF of the predicted NFP was similar to the observed NFP. However, the K-S statistic, which is a measure of the distance (D) between two CDFs, was shorter (D ∼ 0.18) in the case of the deep and wide topologies as compared to the shallow topologies (D ∼ 0.3), indicating a better fit with the observations.

The shallow architecture over-predicted flow persistence (FP) in all cases, which is to be expected because of its inability to correctly predict no-flow states (Fig. 8). The deep and wide models performed much better in predicting flow persistence, except when the intermittency ratio was too low (ST2 and ST5 with 12% and 17% intermittency, respectively). However, there was no general trend between the extent of intermittency and flow persistence otherwise. The testing period was generally wetter than normal and also more erratic at these locations compared to the training periods. These microclimatic fluctuations plausibly explain the noted discrepancies at these stations, as ANNs are universal interpolators with minimal extrapolation abilities, if any.

The shallow model, as before, could not predict flow persistence in five of the nine stations. In the four stations where the shallow model was able to discern zero flow and flow states, the null hypothesis that the observed data and modeled results were from the same distribution could not be rejected at three of them (p-value > 0.05 at ST4, ST5 and ST7). Unlike NFP, the flow persistence varied much more widely across the stations (1 month - 16 months), and the deep and wide topologies were able to capture these transitions reasonably well and certainly better than the continuous (shallow) model.
Wetter than normal conditions during the testing phase did affect some results locally, but overall the results were reasonable for the most part, indicating the utility of treating intermittent flow data as a mixture and thus warranting a separate treatment of flow and no-flow states both for the prediction of no-flow and flow persistence.

The F2NFT metric indicates the expected time it takes to move to a no-flow state once the system enters a flow state and is presented in Fig. 9. For highly intermittent streams this value tends to be small (e.g., ST1 and ST9), while for less intermittent streams longer transition times are to be expected (e.g., ST2 and ST5). The shallow model either over-predicts the F2NFT or does not provide any information due to the lack of transitions. The over-prediction is mainly caused by the over-estimation of flows during observed no-flow periods.
The predictions of the deep and wide topologies tend to be fairly close in almost all cases, and they definitely capture the F2NFT better than the shallow model. The predictions of the deep and wide topologies are also close to the observed values in most cases, except for ST2 and ST5 which, as explained previously, had significantly different climatic conditions over the testing period as compared to their training. The F2NFT shows similar trends as FP (see Fig. 8) but tends to be of shorter duration. This is to be expected: for example, if a no-flow event is followed by a 3-month flow event and then another no-flow event, the FP is 3 months but the F2NFT equals the average of 1, 2 and 3 months, i.e., 2 months.

Unlike the F2NFT discussed above, the NF2FT represents the expected time necessary for transitioning from a no-flow to a flow state. The NF2FT values were much shorter than the F2NFT (compare the Y-axes of Figs. 9 and 10). While the deep and wide models were unable to correctly predict the NF2FT transitions, the errors tended to be within a month (the time-scale of prediction) at 8 of the 9 stations. ST9, with the highest level of intermittency, has an expected NF2FT of 3 months, while the deep and wide models underpredict this value by two months. This result largely arises because the models tend to have a higher OPE (over-prediction error) at this station (see Fig. 6d). The performance of the shallow model is not too far off from the deep and wide topologies at the 4 stations where the model is able to discern between flow and no-flow states after application of the cut-off. However, the deep and wide topologies consistently provide estimates of NF2FT at all stations, while the shallow model is unable to do so at 55% of the sites.
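As an illustration of the transition metrics, the sketch below computes the F2NFT directly from a binary flow series following the worked example above (for every month in a flow state, count the months until the next no-flow month and average). The paper derives these quantities from a first-order Markov chain (Table A2), so this direct calculation should be read as an empirical approximation.

```r
# Empirical flow-to-no-flow transition time (F2NFT), following the worked
# example in the text; an approximation of the Markov chain based estimate.
f2nft <- function(q, cutoff = 0) {
  state <- q > cutoff                        # TRUE = flow, FALSE = no flow
  waits <- c()
  for (t in which(state)) {
    nxt <- which(!state & seq_along(state) > t)   # later no-flow months
    if (length(nxt) > 0) waits <- c(waits, min(nxt) - t)
  }
  mean(waits)
}

q <- c(0, 3, 5, 2, 0, 0, 1, 0)               # illustrative monthly flows
f2nft(q)                                     # expected months to reach no flow
# NF2FT is obtained analogously by swapping the roles of the two states.
```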
The predictions of the non-zero flow states were compared to observed values by constructing empirical CDFs using Gringorten plotting positions (Gringorten, 1963) and generating P-P plots (Gan et al., 1991), depicted in Fig. 11 for the log(Q+1) transformation and in Fig. 12 for the non-transformed data. As seen in Figs. 11 and 12, the application of the log(Q+1) transformation is recommended for low flow analysis of IRES.
The models trained on the transformed datasets provided closer predictions with lower errors. Also, for 55% of the studied IRES, the models trained on transformed flowrates provided better predictions over the entire flow range, with notably lower errors (see Table 2). For the other four streams (ST4, ST5, ST6, and ST7), using the transformed data resulted in large errors (by orders of magnitude) in the prediction of extreme high flows. These latter stations were the ones which had one or more record-breaking flood events during the testing period, several times greater than the maximum peak flow of the training period. Extrapolation beyond the training set, as discussed before, limits the performance of the neural network models, and the risk of predicting with large errors is higher when using the log(Q+1) transformation.
Three distinct flow regimes were observed in the observed and predicted intermittent flow time-series: the zero and extreme low flows, the mid-flow pulses, and the extreme high flows. The wide models dominated the prediction of the no- and low-flow events for all IRES of study. For predicting the lower mid-flow pulses, the wide topologies were still reasonable, and the deep models provided closer approximations for higher flowrates. Following the same predictive behavior, for the extreme high flows that were several orders of magnitude greater than the mid-flow pulses and very limited in number (less than 5 events during the testing period), the deep models had the least errors and the superior performance among the three models. Capturing the extreme large peak flows was a challenging task, even for the deep models, and there were multiple cases of under-prediction and a few cases of over-prediction. Overall, the deep network topology is recommended as a more appropriate tool when zero and extremely high flows are of simultaneous interest, such as maintaining dry and flooding cycles to ensure sustainable spawning of both small and large fauna over the course of a year (Postel and Richter, 2012).

8 Summary and conclusions
In this paper, three Artificial Neural Network-type architectures were investigated for streamflow forecasting in IRES: the shallow model, which treated the IRES flow time-series as a continuum, and the deep and wide topologies, which treated the IRES flow data as a mixture-type consisting of flow and no-flow records using a separate classification cell that is stacked either in series or in parallel with a regression cell for the deep and wide models, respectively. The performance of the models was evaluated with a suite of continuous and discrete goodness-of-fit metrics and was explored across nine IRES with varying hydro-climatic features.
The addition of a separate classifier is recommended here, as without one the shallow model was unable to predict absolute zero-flow events and sometimes even predicted physically unrealistic negative flowrates. Effective capturing of no-flow states by the deep and wide topologies provided useful insights into the IRES flow time-series (e.g., persistence and transitions of states) that otherwise would not have been possible with a shallow network.
For low flow estimation, the wide models were found to be superior, as the errors in the estimates of the deep model were higher in the case of a misclassification of no-flow states (OPE). However, for high flow analysis, the deep model outperformed its counterparts and provided closer predictions.

The application of two pre-processing techniques, namely SMOTE and the log(Q+1) transformation, was also investigated. As IRES flowrates are imbalanced datasets, SMOTE was useful in reducing the extent of imbalance, providing a better trade-off between under-prediction error (UPE) and over-prediction error (OPE), and thereby achieving better classification performance at most stations, except under very high levels of flow intermittency. Applying a log transformation with a one-unit shift was found beneficial, specifically for low flow analysis. However, for several stations, the models using the transformed data showed large errors in predicting extreme high flows, limiting the utility of this transformation technique for high flow analysis.
It is recommended that intermittent flow time-series not be treated as a continuum but rather be modeled as a mixture containing zero (or non-zero) inflated data. A stacked-topology neural network comprised of an initial classifier cell provides better zero flow predictions and helps maintain the Markovian properties of the series (no-flow and flow persistence, and no-flow to flow and flow to no-flow transitions). A parallel coupling (wide model) of the classification cell with a regression cell is beneficial when the focus is on estimating low flows and having lower errors in the case of zero-flow misclassification. On the other hand, a series coupling (deep model) with the regression cell is beneficial when modeling extreme (i.e., zero and very high) flows. Extreme learning machines (ELMs) calibrated using greedy algorithms and LASSO regularization enabled computationally efficient calibration while avoiding issues related to overfitting and the curse-of-dimensionality, and can be adopted to train stacked ANNs for improved forecasting of intermittent flows.

Code and data availability. The data set and custom scripts (in R) developed for this study are available at https://doi.org/10.17605/OSF.IO/UKAGQ (Forghanparast et al., 2021).

Author contributions. VU and EAH conceptualized the study. FF carried out data curation and investigation and visualized the project. FF and EAH validated the data. FF and VU conducted the formal analysis and developed the model software. FF, EAH, and VU developed the methodology and wrote, reviewed, and edited the manuscript. EAH acquired the funding, provided resources, and administered the project.

VU supervised the project.
Competing interests. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.