Quantifying continuous discharge can be difficult, especially for nascent monitoring efforts, due to the challenges of establishing gauging locations, sensor protocols, and installations. Some continuous discharge series generated by the National Ecological Observatory Network (NEON) during its pre- and early-operational phases (2015–present) are marked by anomalies related to sensor drift, gauge movement, and incomplete rating curves. Here, we investigate the potential to estimate continuous discharge when discrete streamflow measurements are available at the site of interest. Using field-measured discharge as truth, we reconstructed continuous discharge for all 27 NEON stream gauges via linear regression on nearby donor gauges and/or prediction from neural networks trained on a large corpus of established gauge data. Reconstructions achieved median efficiencies of 0.83 (Nash–Sutcliffe, or NSE) and 0.81 (Kling–Gupta, or KGE) across all sites and improved KGE at 11 sites versus published data, with linear regression generally outperforming deep learning approaches due to the use of target site data for model fitting rather than evaluation only. Estimates from this analysis inform

Discharge, or streamflow, is a fundamental measure in hydrology, biogeochemistry, and river science more broadly. A measure of water volume over time, discharge is used to infer the theoretical watershed runoff (the depth of water “blanketing” the land surface, or depth over time), which in turn is integral to understanding watershed processes such as chemical weathering (White and Blum, 1995). Accurate discharge estimates, at daily or finer resolution, are essential components of nearly any quantitative study of physical or chemical watershed or river processes at the ecosystem scale. Determinations of solute fluxes (Bukaveckas et al., 1998), gas exchange rates (Hall, 2016), ecosystem metabolism (Odum, 1956), and sediment transport (Graf, 1984) all require well-constrained estimates of discharge.

Despite its centrality to so many fields of study, discharge is a notoriously difficult metric to capture on a regular basis, especially in free-flowing systems, as it may vary greatly with annual cycles and weather events (Turnipseed and Sauer, 2010). Established institutions like the United States Geological Survey (USGS), Environment and Climate Change Canada (ECCC), and the National Water and Sanitation Agency (ANA) in Brazil have honed their instrumentation, methods, and monitoring locations over decades to generate reasonable discharge estimates, even under extreme conditions (Benson and Dalrymple, 1967; Hirsch and Costa, 2004); however, nascent and/or small-budget monitoring efforts face several challenges. Critically, hundreds of such efforts are underway at any given time within academic research groups, municipalities, counties, and other entities building smaller gauge networks with far less expertise, support, and funding than gauging programs backed by dedicated national agencies.

Not including purely model-based methods for discharge prediction (Manning, 1891; Hsu et al., 1995; Durand et al., 2023), automated discharge estimation requires the careful construction of an empirical “rating curve” by which discharge can be continuously inferred from the water level or “stage” (but see Shen, 1981). To build such a relationship, technicians must sample discharge and stage at points covering the range of observable flow, ideally including flood stage. In dynamic systems, this rating curve must be regularly updated. Point estimates of discharge can be collected using acoustic Doppler current profiling (Moore et al., 2017), manual flow meter profiling, or light-based methods (Wang, 1988) to determine the average cross-sectional velocity, or via conservative tracer injections (Tazioli, 2011). In many streams, two or more of these methods must be employed, depending on the conditions (Turnipseed and Sauer, 2010). During 10-year or 100-year floods, no method may be viable or safe. Even under regular storm conditions, a technician may be unable to mount a sampling effort quickly enough to capture peak flow, or they may produce an inaccurate measurement. As a result, rating curves may remain in a state of insufficiency for years, during which time high discharge estimates are unreliable, especially where they are made by extrapolating beyond the observed maximum flow.
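To make the rating-curve concept concrete, here is a minimal sketch of fitting the common power-law form Q = a(h − h0)^b by least squares in log-log space. This is illustrative only, not any agency's procedure; the function name is hypothetical, and h0 (the stage of zero flow) is assumed known.

```python
import math

def fit_rating_curve(stage, discharge, h0=0.0):
    """Fit Q = a * (h - h0)**b by ordinary least squares in log-log space."""
    xs = [math.log(h - h0) for h in stage]
    ys = [math.log(q) for q in discharge]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope of the log-log line is the exponent b; the intercept gives log(a).
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

# Synthetic check: data generated from a known curve should be recovered.
stage = [0.2, 0.5, 1.0, 1.5, 2.0]
discharge = [3.0 * h ** 1.7 for h in stage]
a, b = fit_rating_curve(stage, discharge)
```

In practice the fit would use the field measurements described above, and the curve would be re-fitted as new high-flow observations accumulate.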

Gauge placement presents another obstacle to the rapid deployment of discharge monitoring stations (Isaacson and Coonrod, 2011). Stage measured via pressure transduction is susceptible to bias and nonlinearity under turbulent flow conditions (Horner et al., 2018). Sensors placed in a depositional area may be buried by sediment, and installations in forested watersheds or debris flow regions may be destroyed during floods. Often, equipment must be relocated at least once before a new gauge site can be properly established. Even an established stage–discharge rating curve must be regularly updated and maintained, because the riverbed can change as sediment is deposited or scoured, altering the relationship between stage and flow.

For some studies aiming to quantify stream or watershed processes that require continuous discharge time series, the establishment of a high-quality monitoring station may be infeasible. Where co-location of the site of interest with an existing stream gauge is also infeasible, record-extension (Hirsch, 1982; Nalley et al., 2020) and gap-filling (Harvey et al., 2012; Arriagada et al., 2021) techniques cannot be employed, as these rely on prior knowledge of the statistical properties of the discharge time series being augmented. In such scenarios, streamflow reconstruction or prediction techniques are suitable, as these may proceed a priori or from minimal observation. Reconstruction typically involves methods that leverage the correlation between a partially measured target site and nearby “donor” (predictor) gauges. Discharge may also be quantified in the absence of direct measurements at the target location via statistical (Chokmani and Ouarda, 2004), mechanistic (Regan et al., 2019), or machine learning (Kratzert et al., 2022) modeling techniques.

Here, we use both linear regression (ordinary least squares (OLS), L2/ridge, segmented) and deep learning (long short-term memory recurrent neural network, or LSTM-RNN) approaches to reconstruct discharge from the early operational phase (2015–2022) of the National Ecological Observatory Network (NEON), a time during which site selection issues and rating curve development rendered many site-months of discharge estimates potentially unreliable (Rhea et al., 2023a). Our goal was to achieve Kling–Gupta efficiency (KGE) scores greater than those of the official NEON continuous discharge product at as many sites as possible. A secondary goal was to improve temporal coverage of the official record where it contains gaps. For researchers intending to use NEON continuous discharge data between 2015 and 2022, the results of this effort, as well as efforts by Rhea et al. (2023a), can ensure that data gaps and questionable periods in the official record are replaced by high-quality estimates wherever possible. We provide composite discharge series for all 27 NEON stream gauge locations, built from the best NEON-published estimates and the best estimates generated by this study (

The success of this effort demonstrates the viability of “virtual gauges” (sensu Philip and McLaughlin, 2018; not to be confused with the “virtual staff gauges” of Seibert et al., 2019). In this study, we use the term to describe sites at which discrete discharge observations can be used to fit or evaluate models that generate continuous flow. For accurate results, field measurement campaigns should prioritize characterizing the distribution of possible flow conditions, rather than achieving any particular threshold number of observations. Methods like those presented could be used to reduce the cost and simplify the process of establishing streamflow monitoring sites, especially in river networks that are already partially gauged.

We used the “neonUtilities” package (Lunch et al., 2022) in R to retrieve NEON discharge data. Officially released (NEON, 2023c) and provisional (NEON, 2023b) field measurements were used to fit linear regression models and evaluate all models, as these data were collected directly by NEON technicians using a combination of state-of-the-art methods, including acoustic Doppler current profiling (ADCP; Moore et al., 2017), conservative salt tracer releases (Tazioli, 2011), and flow meter measurements (Pantelakis et al., 2022). We used quality-controlled “finalQ” values where available, or “totalQ” values (taken directly from the flowmeter) in their absence. We refer to NEON's discharge field measurements hereafter as, e.g., “the response variable” or “response discharge time series” in the context of linear regression or as the “target” variable in the context of machine learning. In either context, we refer to the 27 NEON sites for which discharge predictions were generated as “target sites” or “target gauges” (Table 1).

Continuous discharge data (NEON, 2023a) were also retrieved via neonUtilities. We used RELEASE-2023 and

Donor gauge data for linear regression analysis were acquired primarily from the US Geological Survey's National Water Information System (NWIS), using the “dataRetrieval” package (DeCicco et al., 2022) in R. NWIS gauge ID numbers are provided in cfg/donor_gauges.yml at the GitHub and Zenodo links below. Additional donor gauge data from Niwot Ridge LTER and Andrews Forest LTER were retrieved from the MacroSheds dataset (Vlah et al., 2023a) via the package “macrosheds” (Rhea et al., 2023b) and from the EDI data portal (Johnson et al., 2020), respectively.

We used the original CAMELS dataset (Newman et al., 2014; Addor et al., 2017), the USGS National Hydrologic Model with Precipitation-Runoff Modeling System (NHM-PRMS; hereafter NHM; Regan et al., 2019) and the MacroSheds dataset as training data for neural network simulations of discharge data at each target site. CAMELS watershed attributes were generated for MacroSheds and NHM sites using the code provided at

Candidate donor gauges were identified by visually examining an interactive map of NEON gauges, USGS gauges, and MacroSheds gauges (

Barring gauges on reaches that are subject to overt human influence, the exact methods used to choose donor gauges are of little consequence so long as informative donor gauges are not overlooked. In practice, there will usually be just a few, if any, potential donor gauges available for a given location. If multiple donor gauges are included in a regression, L2 regularization (ridge regression) should be used to account for their covariance (see Sect. 2.4).

Map of target sites (NEON) and donor gauge candidates for three target sites: MCRA (McRae Creek, state of Oregon), REDB (Red Butte Creek, state of Utah), and GUIL (Rio Guilarte, Puerto Rico). © OpenStreetMap contributors 2023. Distributed under the Open Data Commons Open Database License (ODbL) v1.0.

All 27 lotic (flowing) aquatic sites associated with NEON were included as target sites for discharge prediction in this study (Fig. 1). The sites TOMB, BLWA, and FLNT are installed on major rivers, downstream of hydropower dams. All other sites have been free of any dam influence since 2012 at the latest, and are designated “wadeable streams” by NEON. In addition to the three sites above, hydrology at BLUE, GUIL, KING, MCDI, and ARIK may be influenced by agricultural activity, especially in the relatively arid Midwest (i.e., the states KS, CO, and OK). Continuous discharge data for TOMB are provided by a nearby gauge of the US Geological Survey's National Water Information System, and are given at hourly intervals, rather than at NEON's customary 1 min intervals.

Target sites for discharge prediction. See

All donor and response discharge time series were neglog transformed (Eq. 1; Whittaker et al., 2005) before fitting linear regression models.

Series were scaled by 1000 before transformation, in order to reduce the disproportionate impact of adding one to every value. Response observations were synchronized to the interval of the predictor series by approximate datetime join, allowing forward or backward timeshifts of up to 12 h if necessary.
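Assuming the neglog transform of Whittaker et al. (2005), sign(x) ln(|x| + 1), the scale-then-transform step can be sketched as follows (function names hypothetical):

```python
import math

def neglog(x):
    """Neglog transform of Whittaker et al. (2005): sign(x) * ln(|x| + 1).
    Unlike a plain log, it is defined at zero and for negative values."""
    return math.copysign(math.log(abs(x) + 1), x)

def neglog_inverse(y):
    """Invert the neglog transform."""
    return math.copysign(math.exp(abs(y)) - 1, y)

def transform_series(q_series, scale=1000):
    """Scale discharge by 1000 before transforming, as in the text, so that
    the +1 inside the log is negligible relative to typical values."""
    return [neglog(q * scale) for q in q_series]
```

Predictions on the transformed scale would be back-transformed with `neglog_inverse` (and un-scaled) before evaluation.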

One of three forms of linear regression was employed at each site, depending on the number and location of donor gauges and the donor–target gauge relationships. For sites with a single donor gauge (REDB, HOPB, BLUE, SYCA, LECO), the considered predictors were discharge from the donor gauge, a four-season categorical variable, and their interaction. Additionally, an intercept parameter could be estimated, or not, for each specification. Thus, up to six models were fitted using OLS regression (Galton, 1886), ensuring at least 15 observations per model parameter. At LECO, an additional dummy variable was included to address an intercept change due to a wildfire in November 2016. The best model was selected via 10-fold cross-validation, minimizing the mean squared error (MSE). MSE, being a squared-error term, disproportionately penalizes the inaccurate prediction of high discharge values and helps to balance against the relative rarity of high discharge measurements in the field data. At site SYCA, the log-log relationship between discharge at the target gauge and a single donor gauge exhibited a distinct breakpoint, and segmented least-squares regression was used (R package “segmented”; Muggeo, 2008). At all other sites (19 in total), predictors included discharge series from 2–4 donor gauges, season, and all interactions. To control overfitting and shrink covarying coefficients toward zero, we used L2 regularization (ridge regression; Gruber, 2017) via the R package “glmnet” (Friedman et al., 2010). As with the other regression approaches, 10-fold cross-validation and MSE loss were used for model parameter selection – in this case for the value of the penalty hyperparameter

For each site, we fitted two sets of models as described above, one with discharge scaled by watershed area (i.e., “specific discharge” in the surface water hydrology sense) prior to transformation and one without areal scaling. Only one model from each set was ultimately selected for each target site; this was done on the basis of the Kling–Gupta efficiency (KGE; Gupta et al., 2009), a composite model efficiency metric that incorporates measures of correlation, variance, and bias. We also report the percent bias and Nash–Sutcliffe efficiency (NSE; Nash and Sutcliffe, 1970), a measure of predictive accuracy that implicitly compares predictions to a mean-only reference model.
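Both efficiency metrics follow directly from their published definitions; this is a sketch using the KGE of Gupta et al. (2009) and the NSE of Nash and Sutcliffe (1970), with hypothetical function names:

```python
import math

def kge(sim, obs):
    """Kling-Gupta efficiency: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2),
    where r is the Pearson correlation, alpha the ratio of standard
    deviations (sim/obs), and beta the ratio of means (sim/obs)."""
    n = len(obs)
    mo, ms = sum(obs) / n, sum(sim) / n
    so = math.sqrt(sum((o - mo) ** 2 for o in obs) / n)
    ss = math.sqrt(sum((s - ms) ** 2 for s in sim) / n)
    r = sum((s - ms) * (o - mo) for s, o in zip(sim, obs)) / (n * ss * so)
    return 1 - math.sqrt((r - 1) ** 2 + (ss / so - 1) ** 2 + (ms / mo - 1) ** 2)

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 - SSE / SS about the observed mean,
    so a mean-only reference model scores exactly 0."""
    mo = sum(obs) / len(obs)
    sse = sum((s - o) ** 2 for s, o in zip(sim, obs))
    return 1 - sse / sum((o - mo) ** 2 for o in obs)
```

A perfect prediction scores 1 under both metrics; the mean-only reference scores 0 under NSE.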

Predictions were generated for all time points during which data were available at the selected donor gauges. At target site COMO, a secondary model omitting one donor gauge was able to produce 36 % more predictions than the selected model, so our predicted discharge at COMO is a composite of both models, with the better model's predictions preferred where available. We were unable to locate sub-daily donor gauge data near COMO, so regression predictions for this site were generated at daily intervals. Regression predictions for all other sites were generated at sub-daily intervals matching the coarsest interval across predictor gauges – generally 15 min, though it should be noted that in most cases these predictions were interpolated to 5 min for our composite discharge product.
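The L2-regularized fitting and 10-fold cross-validation used for most sites can be sketched as follows. This closed-form Python version (numpy assumed) stands in for the glmnet implementation actually used in the study, and all names are illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: beta = (X'X + lam*I)^-1 X'y, intercept unpenalized."""
    Xi = np.column_stack([np.ones(len(X)), X])
    penalty = lam * np.eye(Xi.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(Xi.T @ Xi + penalty, Xi.T @ y)

def ridge_predict(X, beta):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def cv_mse(X, y, lam, k=10, seed=0):
    """k-fold cross-validated MSE, the criterion used to select the penalty."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((ridge_predict(X[f], beta) - y[f]) ** 2))
    return float(np.mean(errs))

# Choose the penalty that minimizes cross-validated MSE over a small grid.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
best_lam = min([0.01, 0.1, 1.0, 10.0], key=lambda lam: cv_mse(X, y, lam))
```

In the study the columns of X would be the (transformed) donor gauge series plus season and interaction terms, rather than synthetic data.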

Supplementing the linear regression methods described above, we simulated discharge data at all 27 target sites using long short-term memory recurrent neural networks (LSTM-RNNs; hereafter “LSTMs”; Hochreiter and Schmidhuber, 1997). Four LSTM strategies were employed, all of which involved training on a large and diverse corpus of stream discharge data (Table 3). Two of these strategies included further finetuning to the time-series dynamics of each target site in turn. Due to the relative scarcity of field-measured discharge observations (between 39 and 213 per site; mean 122), none were used in LSTM training. Instead, these measurements were used only to evaluate predictions. LSTMs trained in this study are intended only for discharge prediction within the temporal and spatial bounds of NEON's early operational phase, not for forecasting or application to other sites. Therefore, all available daily training data were used as such; no validation set was kept for hyperparameter tuning, and no holdout set of daily estimates was kept for evaluation (note that split-sample designs may be undesirable more generally: Arsenault et al., 2018; Guo et al., 2018; Shen et al., 2022). See Kratzert et al. (2019b) and Read et al. (2019) for split-sample considerations in the context of a generalist and process-guided generalist LSTM, respectively.

After a hyperparameter search routine, described below, potentially skilled models were identified as those achieving at least 0.5 KGE and 0.4 NSE. The best-performing potentially skilled LSTM for each site (if applicable) was then re-trained 30 times, forming an ensemble. Ensembles were trained for 18 of 27 sites. LSTM predictions included in our composite discharge product are means taken across the distributions of ensemble point predictions. Uncertainty bounds were computed as the 2.5 % and 97.5 % quantiles of these distributions. LSTM skill was evaluated on the basis of mean ensemble efficiency (KGE) with respect to field-measured discharge (Table A1).
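Collapsing each 30-member ensemble into a mean prediction with quantile uncertainty bounds is straightforward; a sketch assuming numpy, with a hypothetical `ensemble_summary` helper:

```python
import numpy as np

def ensemble_summary(member_predictions):
    """Collapse an ensemble of per-member prediction series (members x time)
    into a mean series plus 2.5% / 97.5% quantile uncertainty bounds."""
    arr = np.asarray(member_predictions, dtype=float)
    return {
        "mean": arr.mean(axis=0),
        "lower": np.quantile(arr, 0.025, axis=0),
        "upper": np.quantile(arr, 0.975, axis=0),
    }

# e.g. 30 re-trained members, each predicting the same 4 days
rng = np.random.default_rng(0)
members = rng.normal(loc=[1.0, 2.0, 3.0, 4.0], scale=0.1, size=(30, 4))
summary = ensemble_summary(members)
```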

Daily discharge time series (training data) and field-measured discharge were scaled by watershed area. For each predicted day, LSTMs received five dynamic Daymet meteorological forcing variables and 11 static watershed attribute summary statistics (Table 2). Multitask learning (Caruana, 1998; Sadler et al., 2022) was found to improve discharge prediction broadly in a preliminary analysis, so Daymet minimum air temperature was used as a secondary target variable. Kratzert et al. (2019a) found that a maximum of about 150 preceding days were able to influence the LSTM output in a similar prediction problem, so we set the input sequence length to 200 d to ensure full utilization of available information. In other words, for each day of prediction, the model was able to leverage information from the preceding 200 d.
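Building the 200 d input sequences can be sketched with a sliding window (numpy assumed; variable and function names hypothetical):

```python
import numpy as np

def make_sequences(forcings, seq_len=200):
    """Build LSTM input sequences: for each prediction day t, the window of
    the preceding seq_len days of forcings (inclusive of day t).
    forcings: (n_days, n_features) -> (n_days - seq_len + 1, seq_len, n_features)."""
    windows = np.lib.stride_tricks.sliding_window_view(forcings, seq_len, axis=0)
    return np.moveaxis(windows, -1, 1)

n_days, n_feat = 365, 5  # e.g. one year of the five Daymet forcing variables
forcings = np.arange(n_days * n_feat, dtype=float).reshape(n_days, n_feat)
seqs = make_sequences(forcings)
```

The first prediction thus falls on day 200 of the record; earlier days serve only as context.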

We employed the four different training pipelines described in Table 3. Of the 671 CAMELS watersheds (i.e., basins), we used a subset of 531 with undisputed areas of less than 2000 km

For the process-guided strategies, we used NHM estimates for all reaches coinciding with a CAMELS or MacroSheds gauge, for a total of 551 reaches. Only nine target sites on relatively high-order streams were amenable to the process-guided specialist approach, as these sites are on reaches large enough to be modeled by the NHM. The most recent version of the NHM at the time of this writing provides discharge estimates beginning in 1980 and ending in 2016, just before the installation of most NEON target sites.

LSTM input data.

LSTM model training pipelines used in the simulation of discharge at target sites. Here, “NEON” refers to NEON's continuous discharge product, RELEASE-2023, with quality-flagged estimates and

LSTMs were configured in R and trained using v1.3.0 of the NeuralHydrology library in Python (Kratzert et al., 2022; Van Rossum and Drake, 2009) on the Duke Compute Cluster at Duke University, Durham, NC, USA. All trained models used the Adam optimizer (Kingma and Ba, 2014) and NeuralHydrology's “NSE loss” function after an initial evaluation in which we compared it to the MSE and root mean squared error (Table 4). Learning was annealed using a series of three fixed rates for pretraining and for round one of finetuning according to Eq. (2):
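The basin-normalized "NSE loss" can be sketched as follows; this illustrates the idea described by Kratzert et al. (2019b), scaling each squared error by the basin's discharge variability, rather than NeuralHydrology's exact implementation:

```python
def nse_loss(sim, obs, basin_std, eps=0.1):
    """Basin-normalized squared-error loss: each squared error is scaled by
    the (training-period) standard deviation of that basin's discharge, so
    large, flashy rivers do not dominate the batch loss. eps guards against
    near-zero variances in very stable basins."""
    return sum((s - o) ** 2 / (sd + eps) ** 2
               for s, o, sd in zip(sim, obs, basin_std)) / len(obs)
```

With this weighting, the same absolute error contributes less to the loss at a high-variance basin than at a low-variance one.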

LSTM hyperparameter search space for all model types, and the selected values (bold) used for pretraining. These were observed to allow for both malleability and high performance of subsequent finetuning iterations over nearly 2000 exploratory LSTM trials. The relationship of

All LSTM models were outfitted with fully connected, single-layer embedding networks to efficiently encode inputs as fixed-length numerical vectors (Arsov and Mirceva, 2019). Separate embedding networks were used for static and dynamic inputs, with 20 neurons for static inputs and 200 neurons for dynamic inputs. All embedding neurons used the hyperbolic tangent activation function. Another advantage of embedding networks in the context of the NeuralHydrology library is that they provide one of few opportunities to introduce dropout, which can improve training efficiency and reduce overfitting (Srivastava et al., 2014).
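A sketch of such embedding layers (numpy assumed; layer sizes as stated above, weights randomly initialized for illustration; the LSTM itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_embedding(n_in, n_out):
    """Single fully connected layer with tanh activation, encoding inputs
    as fixed-length vectors before they reach the LSTM."""
    W = rng.normal(scale=1 / np.sqrt(n_in), size=(n_in, n_out))
    b = np.zeros(n_out)
    return lambda x: np.tanh(x @ W + b)

embed_static = make_embedding(11, 20)    # 11 watershed attributes -> 20 neurons
embed_dynamic = make_embedding(5, 200)   # 5 Daymet forcings -> 200 neurons

static_vec = embed_static(rng.normal(size=11))
dynamic_vec = embed_dynamic(rng.normal(size=(200, 5)))  # one 200-day sequence
```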

This study generated time-series predictions of discharge for each lotic NEON site using up to three distinct processes: linear regression on absolute discharge, linear regression on specific discharge, and one of four LSTM strategies. We provide regression predictions wherever applicable (24 of 27 sites). LSTM predictions are provided only for sites that had promising model performance after a hyperparameter search and for which ensemble models were therefore trained (18 of 27). All model outputs and results from this study are archived at

In addition to predictions from individual modeling strategies, we provide an analysis-ready discharge dataset for all 27 sites that splices the best available predictions across methods – including published NEON estimates (NEON, 2023a) – into composite series (

To construct composite series, we first distinguished “good” site-months of NEON discharge estimates as those categorized as Tier 1 or Tier 2 by Rhea et al. (2023a). To qualify for at least Tier 2, a NEON site-month must meet four criteria. The linear relationship between stage, determined from pressure transducer readings, and field-measured gauge height must score at least 0.9 NSE. The transducer-derived stage series must also pass a drift test relative to gauge height, but only if sufficient data exist to perform such a test. The rating curve used to relate stage to discharge must score at least 0.75 NSE, and fewer than 30 % of predicted discharge values may exceed the range of measured discharge used to build the curve. See Rhea et al. (2023a) for further details.
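The four criteria can be expressed as a small predicate; a sketch with hypothetical names, using the thresholds of Rhea et al. (2023a) as summarized above:

```python
def meets_tier2(stage_height_nse, rating_curve_nse, frac_beyond_range,
                drift_test_passed=True):
    """Check the four Tier-2 criteria as summarized in the text:
      1. stage vs. field gauge height regression scores >= 0.9 NSE
      2. transducer stage passes the drift test (where testable)
      3. rating curve scores >= 0.75 NSE
      4. < 30% of predicted discharge beyond the measured range"""
    return (stage_height_nse >= 0.9
            and drift_test_passed
            and rating_curve_nse >= 0.75
            and frac_beyond_range < 0.30)
```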

Although only 50 % of NEON's RELEASE-2023 estimates are classified as Tier 1 or Tier 2, the remainder may still be of high analytical value if NEON's quality control indicators and uncertainty bounds are observed. We also stress that NEON rating curves and protocols improved over the course of its early operational phase and continue to do so.

We then ranked the available predictions for each site, assigning a rank of 1 either to predictions from linear regression or to NEON's continuous data product, depending on the overall KGE and NSE against the field-measured discharge. KGE was considered first and used to determine preference, except in cases where the difference between NSE scores was greater than that between KGE scores and opposite in sign. Rank 2 predictions were then used to fill gaps of 12 or more hours in the rank 1 series; where NEON data served as the rank 2 source, only “good” site-months were included. Only after this first round of gap-filling were the remaining NEON data incorporated, with site-years achieving at least 0.5 KGE and 0.5 NSE against the field-measured discharge being used to fill still-remaining gaps. Finally, daily LSTM predictions (placed at 12:00:00 UTC on the day of prediction) were used to fill any recalcitrant gaps, but only if produced by an ensemble model achieving at least 0.5 KGE and 0.5 NSE across all field discharge observations. Note that while such benchmarks are in common use (Moriasi et al., 2015), the efficiency that any model can or should achieve varies substantially with the hydroclimate and watershed characteristics of a given site (Seibert et al., 2018). We provide all data and code for modifying the composite discharge product in accordance with alternative benchmarks as users see fit. After visual examination of composite series plots, we chose to prefer NEON predictions to linear regression predictions at site ARIK, “good” or not, due to frequent sharp discontinuities between the two predicted series. See Table A1 for an account of the linear regression and LSTM methods used in the construction of composite series.
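The gap-filling step can be sketched as follows; `fill_long_gaps` is a hypothetical helper operating on a regularly spaced series, not the study's actual compositing code:

```python
import math

def fill_long_gaps(primary, secondary, min_gap=144):
    """Fill runs of missing values (NaN) in the rank-1 series with rank-2
    values, but only where the gap spans at least min_gap consecutive
    steps (e.g. 144 five-minute steps = 12 h). Shorter gaps are left alone."""
    out = list(primary)
    i, n = 0, len(out)
    while i < n:
        if math.isnan(out[i]):
            j = i
            while j < n and math.isnan(out[j]):
                j += 1            # j now marks the end of the NaN run
            if j - i >= min_gap:
                for k in range(i, j):
                    out[k] = secondary[k]
            i = j
        else:
            i += 1
    return out
```

Applying the same routine repeatedly with successively lower-ranked sources yields a composite series in which the best available estimate fills each long gap.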

The prevailing interval varies across data sources used to assemble our composite discharge product from 1 min (NEON) to 1 d (LSTM predictions; regression predictions at site COMO). Regression predictions were primarily generated at 15 min intervals, and their timestamps are always divisible by 15 min. Around the prevailing NEON interval there is considerable variation due to data gaps and sensor reconfigurations, both across sites and across the temporal ranges of each site's record. To reduce the complexity associated with irregular time-series analysis, we synchronized the interval across data sources to 5 min. Regression estimates were linearly interpolated to 5 min, though gaps larger than 15 min were not interpolated. NEON estimates were first smoothed with a triangular moving average window of 15 min to remove unrealistic minute-to-minute noise associated with Bayesian error propagation. They were then interpolated the same way as the regression estimates and finally downsampled to 5 min, with some timestamps being shifted by up to 2 min. For example, in a stretch of record sampled at 30 min intervals, a sample taken at 00:03:00 would be shifted by 2 min, as each timestamp is rounded up to the nearest minute divisible by 5.
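The smoothing-and-downsampling step might look like the following sketch (numpy assumed; NEON's exact window construction and resampling may differ):

```python
import numpy as np

def triangular_smooth(x, width=15):
    """Centered triangular moving average: weights rise linearly to the
    middle of the window and sum to 1, so a constant series is unchanged
    (away from the zero-padded edges)."""
    half = (width + 1) // 2
    w = np.concatenate([np.arange(1, half + 1), np.arange(half - 1, 0, -1)])
    w = w / w.sum()
    return np.convolve(x, w, mode="same")

def downsample(x, step=5):
    """Keep every step-th value (e.g. 1 min -> 5 min)."""
    return x[::step]

minute_series = np.full(60, 2.5)  # a constant 1-min series for illustration
five_min = downsample(triangular_smooth(minute_series))
```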

A performance comparison of linear regression on discharge from donor gauges and four LSTM strategies is shown in Figs. 2 and A1 and detailed in Table A1. Via linear regression, we were able to produce 15 min discharge estimates at 11 sites with overall KGE scores higher than those of published series (Fig. 2). At four of the same sites, we achieved a higher KGE via LSTM methods, which generated daily discharge series. Of the 10 sites at which the published discharge KGE was less than 0.8, we improved five sites to above that mark (mean 0.932,

Efficiency of five stream discharge prediction methods and NEON's published continuous discharge product at 27 NEON gauge locations versus field-measured discharge. Small, white triangles represent the max/min KGE of the published discharge by water year (1 October through 30 September) with at least five field measurements (or two for site OKSR). KGE was computed on all available observation–estimate pairs except those with quality flags (dischargeFinalQF or dischargeFinalQFSciRvw of 1). For the best-performing LSTM method at all sites except TECR, FLNT, REDB, WALK, POSE, and KING, the displayed KGE is averaged over 30 ensemble runs with identical hyperparameters. For the sites just named, the performance of a chosen method after ensembling dropped below that of at least one other method's optimal KGE from the parameter search. For all other LSTM site–method pairs, which were not ensembled, the displayed performance is that of the best model trained during the parameter search phase. Sites are ordered by the KGE of the NEON continuous discharge. See Table 3 for LSTM model definitions. A KGE of 1 is a perfect prediction, while a KGE of

For 12 of 27 sites, linear regression on specific discharge (i.e., scaled by watershed area) provided the most accurate discharge predictions, while linear regression on absolute discharge performed better at the other 12 sites with donor gauges. LSTM models (as proper ensembles) outperformed linear regression at only two sites. In general, linear regression provided more accurate predictions than all LSTM methods. Linear regression on absolute discharge produced estimates with a median NSE of 0.848 and a median KGE of 0.806 across sites (

Performance of five stream discharge prediction methods, and the official continuous discharge time-series data, across

Linear regression was not applicable at sites TECR, BIGC, or WLOU due to the lack of donor gauges contemporaneous with target gauge data. Donor gauges associated with the Kings River Experimental Watersheds exist in close proximity to TECR and BIGC, but we were unable to access up-to-date discharge records for these gauges.

The process-guided specialist LSTM yielded predictions on par with those of the other LSTM strategies in terms of KGE (median 0.652;

In addition to improvements in accuracy, estimates from this study inform

Durations of missing values (gaps) in NEON's 2023 release of continuous discharge time series, illustrating gaps filled or informed by estimates from this analysis. All officially published values are shown, including those with quality control flags. Sites are ordered as in Fig. 2. Gaps smaller than 6 h are not indicated. Figure A10 is the same, but with a fixed and labeled

Estimated discharge time series from this study are of practical value for any researcher using NEON continuous discharge data, especially for those sites and site-months at which published data from NEON's early operational phase may be unreliable (Rhea et al., 2023a). Figure 4 shows that official records at sites REDB and LEWI are compromised by disagreements (erratic sections of gray lines) between pressure transducer stage readings and manual gauge height recordings, as discussed in Rhea et al. (2023a). Red lines show improved estimates via linear regression on discharge from donor gauges. Sites FLNT and WALK show generally close agreement between NEON discharge and our regression estimates, but the uncertainty associated with high discharge values should be noted.

Best linear regression predictions of continuous discharge for four NEON gauge-years compared with official NEON discharge data. All officially published values are shown, including those with quality control flags, indicated by black marks on the lower border. Light red bands represent 95 % prediction intervals. NEON uncertainty is not shown.

This study was designed to produce high-quality estimates of continuous discharge for NEON stream gauges, especially at 10 gauges for which the KGE of published continuous discharge was lower than 0.8, over the full record, when compared to field-measured discharge. A secondary goal was to improve temporal coverage of the official discharge record where possible.

We treat NEON field-measured discharge as truth, of which there are 39–213 observations per target site. Although these numbers represent a tremendous investment of time and technical effort, they do not meet the high data volume requirements of most machine learning approaches, so we used field discharge only to evaluate, rather than train, LSTM models. By contrast, in linear regression, regardless of the details of any particular method, we ultimately fit a line to the relationship between donor gauge data and field measurements at each target site. Because the linear regression models are allowed to “see” all of the target site data (after a model is selected via cross-validation), they have a powerful advantage over the LSTM approaches, which in this context must essentially treat target watersheds as if they are ungauged. Furthermore, whereas the LSTM models must parameterize each day of prediction individually, the regression models need only parameterize relationships between flow regimes. Still, if given enough training data, including examples of watersheds and streams similar to each of those modeled in this study, the LSTM approaches would eventually close the performance gap. See Figs. A2, A3, A4, A5, A7, and A8 for linear regression diagnostics.

In this study, discharge estimates produced by linear regression were more accurate than those generated by LSTM models in 21 of 23 comparisons (Fig. 2). This demonstrates the value of existing gauge networks in advancing discharge estimation at newly or partially gauged locations; however, there is a limit to the predictive potential of linear regression methods, as they depend on a strong correlation between streamflow at target and donor gauges. In principle, there is no such limit for machine learning approaches, which are instead limited by the quality and quantity of training data.

The process-guided specialist LSTM yielded predictions on par with those of the other LSTM strategies in terms of KGE, but performed worst of the four in terms of NSE, possibly indicating that information gleaned from NHM estimates helped this strategy accurately capture discharge variance and reduce prediction bias without ultimately improving the correlation between predictions and observations. Unlike KGE, which weighs correlation, variance, and bias explicitly, NSE explicitly captures only the correlation component (Nash and Sutcliffe, 1970; Gupta et al., 2009). Conversely, the specialist performed better than the generalist in terms of NSE but not KGE, suggesting that the information contained in NEON's continuous discharge product was of uneven predictive value across correlation, variance, and bias, favoring correlation.

The specialist may have been affected by data filtering choices. After filtering NEON continuous discharge for rating curve issues, drift, and quality flags, relatively few daily estimates were available for some sites (47–1642). Annual and seasonal variation in meteorological forcings and discharge in NEON sites' generally small, often mountainous watersheds may be large enough that fine-tuning a pretrained LSTM on a few hundred days of site-specific data reduces its ability to generalize at that site. Our specialist LSTM strategy in particular might be improved with a broader hyperparameter search, especially one that explores smaller learning rates. Ideally, site-specific fine-tuning should enable better prediction by allowing the network to assimilate information unique to the target site without corrupting previously learned generalities. For validation plots of all ensembled LSTMs, see Fig. A6.

The process-guided specialist LSTM strategy was viable at nine sites for which discharge estimates were available from the National Hydrologic Model. By using a mechanistic (i.e., process-based) model with higher spatial resolution than the NHM, it should be possible to apply this process-guided approach at more of the NEON sites. A potentially stronger process-guided approach would use mechanistic model predictions as features (predictors), rather than training targets, but that would require mechanistic model predictions concurrent with discharge series at target sites, whereas NHM predictions at the time of this writing are available only through the year 2016. For a summary of process-guided deep learning strategies, see the “Integrating Design” subsection of Appling et al. (2022).

We caution that evaluation scores for both NEON's published estimates and ours are computed on a small fraction of each series for which both an estimate and a direct field measurement are available (39–213 per site), and that measurements tend to be collected disproportionately at low flow. This often occurs for practical reasons such as site access and technician safety, but may also reflect a need to characterize the low-flow variability of the stage–discharge relationship in streams with unstable low-flow hydrologic controls, such as unconsolidated bed material.

Whatever the reason for less sampling at high flow, any model attempting to use field measurements to reconstruct continuous discharge will estimate with greater uncertainty at high flow than at low, and users of our composite discharge product should observe uncertainties associated with estimates from all methods. Mechanistic models that proceed from physical principles, or data-driven approaches that can generalize from prior observations, do not in principle suffer this disadvantage, as they do not depend on observations from a target site. However, these approaches may not reliably generate strong predictions at all sites or under all conditions (Razavi and Coulibaly, 2013; Kratzert et al., 2019b), and may produce erratic point estimates where conditions diverge from past observations. Hybrid approaches that successfully leverage field measurements, as well as physical principles or learned relationships, are likely to yield well-constrained predictions where our efforts did not.

This study demonstrates that, in proximity to established streamflow gauges, even simple statistical methods can be used to generate accurate, continuous discharge at “virtual gauges” where discrete discharge has been measured. The number of field measurements across sites in this study varies from 39 to 213, but the number required for virtual gauging may be substantially smaller than even the minimum of this range. If the discharge relationships between a target site and all donor gauges were perfectly linear or log-linear, they could in principle be established with only two precise measurements at the target site. More important than the quantity is the distribution of measurements across flow conditions, which should be sufficient to fully characterize all modeled discharge relationships and their linearity or lack thereof (Sauer, 2002; Zakwan et al., 2017). Concretely, we advocate for “storm chasing”, or disproportionately seeking to sample discharge under high-flow conditions and during both rising and falling limbs of storm events, rather than routine sampling. Observed NEON flow conditions relative to predicted discharge can be seen in Fig. A9. See Philip and McLaughlin (2018) for further commentary on establishing a virtual gauge network, and see Seibert and Beven (2009) and Pool and Seibert (2021) for information on the number and statistical properties of discharge samples required to establish strong stage–discharge or discharge–discharge relationships.
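The claim that two precise measurements could, in principle, pin down a perfectly log-linear donor–target relationship follows directly from the fact that two points determine a line in log space. A toy sketch with hypothetical numbers (one low-flow and one high-flow measurement, echoing the "storm chasing" advice above):

```python
import numpy as np

# Hypothetical: two precise field measurements at the target site,
# taken at contrasting donor-gauge flows (units: L/s)
donor_low, target_low = 10.0, 4.0
donor_high, target_high = 1000.0, 640.0

# Two points determine the line log(Q_target) = b0 + b1 * log(Q_donor)
b1 = (np.log(target_high) - np.log(target_low)) / (
    np.log(donor_high) - np.log(donor_low))
b0 = np.log(target_low) - b1 * np.log(donor_low)

def virtual_gauge(donor_q):
    """Estimate target-site discharge from continuous donor discharge."""
    return np.exp(b0 + b1 * np.log(donor_q))
```

In practice, of course, discharge relationships are noisy and often only piecewise log-linear, which is why the distribution of measurements across flow conditions, not just their count, governs how well such a relationship can be characterized.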

Using linear regression on donor gauge data and LSTM-RNNs, we reconstructed continuous discharge at 5 min and/or daily frequency for the 27 stream and river monitoring locations of the National Ecological Observatory Network (NEON) over the water years 2015–2022. Relative to field-measured discharge used as ground truth, our estimates achieve higher Kling–Gupta efficiency than NEON's official continuous discharge at 11 sites. We also provide continuous discharge estimates for

In general, linear regression methods produced more accurate discharge estimates (median KGE: 0.79; median NSE: 0.81;

Improvements to our design could be made in several ways. LSTM models could be exposed to additional training data, such as the recently published Caravan compendium of CAMELS offshoots (Kratzert et al., 2023) or future expansions of the MacroSheds dataset (Vlah et al., 2023a). Neural networks trained on sub-daily inputs might be better equipped to exploit atmospheric–hydrological dynamics that respond to both daily and annual cycles. Linear regression methods too might be improved with the use of additional predictors, such as continuous water level or precipitation.

The success of simple statistical methods in generating high-quality continuous discharge time series demonstrates the viability of “virtual gauges”, or locations at which a small number of field discharge measurements in proximity to one or more established gauges provide a basis for continuous discharge estimation in lieu of a gauging station. Virtual gauges have the potential to greatly expand the spatial coverage of continuous discharge data throughout the USA and any richly gauged region of the world.

All project code is on GitHub at

The code repository is archived on Zenodo:

All model input, output, and diagnostics are archived on Figshare:

MRVR, ESB, and MJV originated the project and identified its goals and methods. MJV carried out all analyses and drafted the manuscript. SR assisted in data collection. All authors took part in steering the project and editing the manuscript.

The contact author has declared that none of the authors has any competing interests.

Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.

The authors are grateful to the NeuralHydrology team for their efforts in democratizing deep learning for the hydrology community. We thank NEON, NCAR, NWIS, Niwot Ridge LTER, Andrews Forest LTER, and the USGS for generating the data that made this analysis possible. Special thanks to Parker Norton of the USGS for extracting all NHM-PRMS outputs used in this study.

The National Ecological Observatory Network is a program sponsored by the National Science Foundation and operated under cooperative agreement by Battelle. This material is based in part upon work supported by the National Science Foundation through the NEON Program.

This research has been supported by the National Science Foundation (grant no. 1926420).

This paper was edited by Jan Seibert and reviewed by Roy Sando and one anonymous referee.