Leveraging gauge networks and strategic discharge measurements to aid the development of continuous streamflow records

Vlah, Michael J.; Ross, Matthew R. V.; Rhea, Spencer; Bernhardt, Emily S.

doi:https://doi.org/10.5194/hess-28-545-2024

Articles | Volume 28, issue 3

© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article | 08 Feb 2024

Final revised paper (published on 08 Feb 2024)
Preprint (discussion started on 18 Jul 2023)

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2023-1178', Roy Sando, 16 Aug 2023
General comments –
This manuscript documents work to use a “donor-gauge” concept to predict continuous daily streamflow discharge time series by developing OLS equations using a limited number of discrete, manual discharge measurements coupled with overlapping timeseries of streamflow observed at one or more nearby donor gauges. Additionally, the manuscript describes results from training multiple neural network models to fill gaps or improve inaccurate data; these models were trained on basin characteristics, or on basin characteristics and a process-based model, and fine-tuned on partial streamflow records meeting certain quality criteria.
Overall, this work is important in that it shows the feasibility of creating synthetic time series of streamflow discharge using only discrete manual measurements as long as there is at least one nearby continuous streamgage. The paper is very well written, and the analyses are thorough and well described. I believe the paper could benefit from some reorganization and clarifying text.
I suggest that the results and discussion be separated into two separate sections. There is much to discuss from this work. I added comments for specific text that I thought could go into the discussion section.
The LSTMs, while thoroughly and expertly developed, are limited by a lack of transferability. This is ok, given that the primary contribution of this paper is to show that continuous streamflow discharge can be predicted at high temporal resolution by developing simple models using a limited number of concurrent measurements. That said, a lot of value could be added by holding out data at the NEON sites and showing the performance of the models on the holdout data. I understand the authors’ point that this approach is not always optimal for producing the best predictions, but it is quite useful for understanding the performance of the model at ungauged sites. I don’t think this analysis is required for the work to be published, but I would suggest at least adding this into a subsection of the discussion that discusses future work or ideas.
In my opinion, the paper is a bit misleading in that both the simple (OLS) and complex (LSTMs) models are put in the context of leveraging the donor gauge information to predict at “virtual gage” sites (for example, line 26 of the abstract). However, the LSTMs are really only being used to essentially correct the NEON time series and fill gaps. I don’t think this is a big deal, as the work and results are still important, but I do think that there are some points in the manuscript that could be written more clearly. I address these in the specific comments below.

Specific comments –
Line 26-27. If by “simple and complex” you mean OLS and LSTM-RNN, I don’t think this statement is entirely true. If I understood the paper correctly, weren’t the discrete manual measurements only used in the development of the OLS models? It was my understanding that the NNs were trained on the continuous timeseries data published as part of the NEON data that met quality assurance criteria.
Line 89. I’m having a hard time following the sentence that begins with “Where co-location…”. I assume the author is referring to co-location of the study (or site of interest), but because the previous paragraph is discussing gauge placement, I could also see this referring to co-location of low-cost monitoring equipment. Depending on which meaning the authors intend, I think the implications and direction of this paragraph are changed. I suggest adding a bit of text to clarify. Possible rewrite:
“Where co-location of the site of interest with an existing stream gauge…” or “Where co-location of low-cost, rapid-deployment monitoring gauges with an existing stream gauge…”
Line 110. “wide range of flows” is subject to interpretation. I think it would be helpful if the authors provided more information on what is meant by “wide range of flows”. I think this information would also provide valuable insight into how transferrable this method is to other locations. If laid out thoroughly, I think this could allow readers to have a greater understanding of how well sampled a location needs to be to use the methods described in this manuscript to reconstruct or predict streamflows. For example, how important is it that the flows used to develop the equations or train the models capture the annual extremes? How does predictive performance change as a function of the relative distribution of flows represented by the training data? At the very least, I think some text is required to discuss the implications of the representativeness of the training data.
Line 150. Figure 1. The resolution of this figure is poor. I suggest replacing this figure with a higher resolution figure. I also think that this figure would benefit greatly from having the NEON sites labeled to give the reader some geographical context.
Lines 164-173. Suggest moving this text up into a new section titled “2.3 Donor-gauge Selection” as it is a crucial component of the analysis and risks being overlooked by including it in the section on model selection.
Lines 169-170.
Are the three criteria provided in parentheses the only criteria used to determine donor-gauge candidacy? If so, suggest removing the “e.g.”, as you are not providing examples of criteria, but the exhaustive list of criteria. If not, I suggest listing all the criteria used and not just providing examples.

For the criteria used in determining geographic similarity: what data and methods were used to determine whether a site was in an urban area or downstream from a reservoir?

Additionally, see the technical comment for lines 167-170 provided below.
Line 268. Authors state that the models were “trained, validated, and tested”, but the methods only describe training and validation. Were the LSTMs tested on holdout data? If not, suggest removing “tested”.
Lines 362-365. Great that the authors noted the limitations of the evaluation metrics due to potentially limited representation of flow in the evaluation data. If possible, providing more information (beyond the statement on disproportionate representation of low flows) would be very helpful to the reader in understanding where there might be more or less uncertainty in the model predictions.
Line 374. Figure 2. Do the bars show the mean KGE of the 27 sites? The figure caption does not state what the value represented by the bars is. Please include that in the caption.
Line 385. Figure 3. This is a great figure! A great presentation of a lot of info. Adding an x-axis showing time, even if the interval is in years, would add a lot of value. Additionally, if the stations could be ordered in the same manner as they are on the x-axis of Figure 2, it would help in the assessment of performance at individual sites and add to the flow of the paper.
Lines 419-426. This text could go into a discussion section.
Lines 436-441. This text could go into a discussion section.
Lines 455-474. This text could go into a discussion section.
Line 515. Figure A1. Do the bars show the mean KGE of the 27 sites? The figure caption does not state what the value represented by the bars is. Please include that in the caption.

Technical comments –
Line 99. “site location selection” is redundant. Suggest changing to “site-selection issues”.
Line 100. Suggest changing “high-quality” to “higher quality”.
Line 105. Suggest removing the comma after “series,”
Line 108. Change “virtual gauges.” to “virtual gauges” by removing the period.
Lines 167-170. For conciseness and clarity, suggest the following rewrite: “For each target site, up to four donor gauge candidates were selected based on spatial proximity (within 50km) and geographic similarity (not in an urban area; not downstream of a reservoir). Generally, no more than four gauges met these criteria, but for one target site (MCRA) there were 10 nearby candidate gauges to select from, all of which were associated with the Andrews Experimental Forest in western Oregon, USA.”
Lines 172-173. Suggested rewrite: “…chose three candidate sites representing A) a catchment upstream of the target site (GSWS08), B) downstream of the target site on the MCRA mainstem (GSLOOK), and C) downstream on a tributary of MCRA (GSWS01).”
Line 198. Change “glmnet” to “Ridge-regression”.
Line 455. Change “low flow conditions” to “low-flow conditions”
Citation: https://doi.org/10.5194/egusphere-2023-1178-RC1
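
Several comments above ask what the KGE bars in Figures 2 and A1 represent. As a reference point, the sketch below computes the Kling-Gupta efficiency in its commonly cited form, KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2), on paired observations and estimates. The function name `kge` and its input layout are illustrative assumptions, and the manuscript's own code may use a different KGE variant.

```python
import math
from statistics import fmean, pstdev


def kge(observed, estimated):
    """Kling-Gupta efficiency on paired observation/estimate values.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if len(observed) != len(estimated) or len(observed) < 2:
        raise ValueError("need two equal-length series with at least two pairs")
    mu_o, mu_e = fmean(observed), fmean(estimated)
    sd_o, sd_e = pstdev(observed), pstdev(estimated)
    if mu_o == 0 or sd_o == 0 or sd_e == 0:
        raise ValueError("KGE is undefined for zero mean or zero variance")
    # Pearson correlation between observations and estimates
    cov = fmean((o - mu_o) * (e - mu_e) for o, e in zip(observed, estimated))
    r = cov / (sd_o * sd_e)
    alpha = sd_e / sd_o  # variability ratio
    beta = mu_e / mu_o   # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

A perfect match scores 1, and lower values indicate worse agreement. Where a score is averaged over ensemble runs, this function would be called once per run and the results averaged.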
- AC1: 'Reply to RC1', Michael Vlah, 06 Oct 2023
  
  RC1 (responses in bold)
  General comments –
  This manuscript documents work to use a “donor-gauge” concept to predict continuous daily streamflow discharge time series by developing OLS equations using a limited number of discrete, manual discharge measurements coupled with overlapping timeseries of streamflow observed at one or more nearby donor gauges. Additionally, the manuscript describes results from training multiple neural network models to fill gaps or improve inaccurate data; these models were trained on basin characteristics, or on basin characteristics and a process-based model, and fine-tuned on partial streamflow records meeting certain quality criteria.
  Overall, this work is important in that it shows the feasibility of creating synthetic time series of streamflow discharge using only discrete manual measurements as long as there is at least one nearby continuous streamgage. The paper is very well written, and the analyses are thorough and well described. I believe the paper could benefit from some reorganization and clarifying text.
  Thank you for your comments.
  I suggest that the results and discussion be separated into two separate sections. There is much to discuss from this work. I added comments for specific text that I thought could go into the discussion section.
  Our editor shares this preference. The two sections will be separated in the next revision.
  The LSTMs, while thoroughly and expertly developed, are limited by a lack of transferability. This is ok, given that the primary contribution of this paper is to show that continuous streamflow discharge can be predicted at high temporal resolution by developing simple models using a limited number of concurrent measurements. That said, a lot of value could be added by holding out data at the NEON sites and showing the performance of the models on the holdout data. I understand the authors’ point that this approach is not always optimal for producing the best predictions, but it is quite useful for understanding the performance of the model at ungauged sites. I don’t think this analysis is required for the work to be published, but I would suggest at least adding this into a subsection of the discussion that discusses future work or ideas.
  Split-sample analogues for a generalist LSTM (Kratzert et al. 2019b) and a process-guided generalist (Read et al. 2019, referenced below) already feature prominently in the literature, so we will highlight these other works in the discussion. In both those cases, the goal was robust out-of-sample prediction.
  Read, J. S. et al. (2019) ‘Process-Guided Deep Learning Predictions of Lake Water Temperature’, Water Resources Research, 55(11), pp. 9173–9190. doi: 10.1029/2019WR024922.
  In my opinion, the paper is a bit misleading in that both the simple (OLS) and complex (LSTMs) models are put in the context of leveraging the donor gauge information to predict at “virtual gage” sites (for example, line 26 of the abstract). However, the LSTMs are really only being used to essentially correct the NEON time series and fill gaps. I don’t think this is a big deal, as the work and results are still important, but I do think that there are some points in the manuscript that could be written more clearly. I address these in the specific comments below.
  We’d like to clarify that the linear regression and LSTM methods accomplish the same goal of correcting and gap-filling the NEON time series. The two approaches supplement each other.
  Another minor point is that OLS is only used at five sites for which a single donor gauge is available. Ridge regression is used at 19 sites, and segmented regression at one site. “Linear regression” is the term that we use in the paper to encompass these methods.
  We address your concern about conflating the two approaches, including on line 26, in responses to the comments below.
  Specific comments –
  Line 26-27. If by “simple and complex” you mean OLS and LSTM-RNN, I don’t think this statement is entirely true. If I understood the paper correctly, weren’t the discrete manual measurements only used in the development of the OLS models? It was my understanding that the NNs were trained on the continuous timeseries data published as part of the NEON data that met quality assurance criteria.
  The discrete discharge measurements were leveraged differently by the two broad model categories. For the linear regressions, they were used for fitting. For the LSTMs, they were used for evaluation: after training LSTMs on diverse, continuous discharge data, including at the target site(s), we needed the manually measured discharge data to determine whether the models were any good.
  Lines 26-27 are “Here, we investigate the potential for both simple and complex models to accurately estimate continuous discharge (at least daily estimates), using only discrete manual measurements of streamflow.” This sentence could be improved with “...for both linear regression and deep learning models to accurately estimate continuous discharge (at least daily estimates), when discrete streamflow measurements are available at the site of interest.”
  Line 89. I’m having a hard time following the sentence that begins with “Where co-location…”. I assume the author is referring to co-location of the study (or site of interest), but because the previous paragraph is discussing gauge placement, I could also see this referring to co-location of low-cost monitoring equipment. Depending on which meaning the authors intend, I think the implications and direction of this paragraph are changed. I suggest adding a bit of text to clarify. Possible rewrite:
  “Where co-location of the site of interest with an existing stream gauge…” or “Where co-location of low-cost, rapid-deployment monitoring gauges with an existing stream gauge…”
  The first meaning is what we intended, and we will gladly borrow your clarifying text. Thanks!
  Line 110. “wide range of flows” is subject to interpretation. I think it would be helpful if the authors provided more information on what is meant by “wide range of flows”. I think this information would also provide valuable insight into how transferrable this method is to other locations. If laid out thoroughly, I think this could allow readers to have a greater understanding of how well sampled a location needs to be to use the methods described in this manuscript to reconstruct or predict streamflows. For example, how important is it that the flows used to develop the equations or train the models capture the annual extremes? How does predictive performance change as a function of the relative distribution of flows represented by the training data? At the very least, I think some text is required to discuss the implications of the representativeness of the training data.
  Lines 463-469 of the discussion address this more thoroughly, but we will provide more detail in the introduction by saying “In this study, we use the term [virtual gauge] to describe sites at which discrete discharge observations can be used to fit or evaluate models that generate continuous flow. For accurate results, field measurement campaigns should prioritize characterizing the distribution of possible flow conditions, rather than achieving any particular threshold number of observations.”
  Line 150. Figure 1. The resolution of this figure is poor. I suggest replacing this figure with a higher resolution figure. I also think that this figure would benefit greatly from having the NEON sites labeled to give the reader some geographical context.
  This figure resolution will be improved, and NEON sites labeled, in the final version.
  Lines 164-173. Suggest moving this text up into a new section titled “2.3 Donor-gauge Selection” as it is a crucial component of the analysis and risks being overlooked by including it in the section on model selection.
  Thank you for this suggestion. Those lines will be moved to a new section in the final version.
  Lines 169-170.
  Are the three criteria provided in parentheses the only criteria used to determine donor-gauge candidacy? If so, suggest removing the “e.g.”, as you are not providing examples of criteria, but the exhaustive list of criteria. If not, I suggest listing all the criteria used and not just providing examples.
  For the criteria used in determining geographic similarity: what data and methods were used to determine whether a site was in an urban area or downstream from a reservoir?
  There are other conceivable disqualifiers, such as tidal influence (not applicable in our case) or water withdrawal (harder to identify), but these were the three criteria we used, so that “e.g.” would serve better as an “i.e.” We will make clearer that visual inspection of interactive maps was the only method used.
  Additionally, see the technical comment for lines 167-170 provided below.
  Line 268. Authors state that the models were “trained, validated, and tested”, but the methods only describe training and validation. Were the LSTMs tested on holdout data? If not, suggest removing “tested”.
  Thanks. This wording was a holdover from a previous draft and will be amended.
  Lines 362-365. Great that the authors noted the limitations of the evaluation metrics due to potentially limited representation of flow in the evaluation data. If possible, providing more information (beyond the statement on disproportionate representation of low flows) would be very helpful to the reader in understanding where there might be more or less uncertainty in the model predictions.
  We will clarify that uncertainty is generally higher for high-discharge estimates.
  Line 374. Figure 2. Do the bars show the mean KGE of the 27 sites? The figure caption does not state what the value represented by the bars is. Please include that in the caption.
  Some of these bars represent means, which we do mention: “...displayed KGE is averaged over 30 ensemble runs....” The rest are simply KGE computed on all available observation-estimate pairs. We will append “, computed on all available observation-estimate pairs” to the first sentence of the caption.
  Line 385. Figure 3. This is a great figure! A great presentation of a lot of info. Adding an x-axis showing time, even if the interval is in years, would add a lot of value. Additionally, if the stations could be ordered in the same manner as they are on the x-axis of Figure 2, it would help in the assessment of performance at individual sites and add to the flow of the paper.
  Thanks! We appreciate these suggestions. A re-ordering of sites (and explanation of said ordering in the caption) will be useful. As for including the x-axis, we’d like to identify a tradeoff: observations begin anywhere from March 2016 to Oct 2018, depending on the site. That means there will be a lot of white space on the left side of this figure, and that the already dense detailing will be squished into as little as 3/5 of the available horizontal space. You’ve identified a clear value-add (visualizing when sampling begins and when gaps occur), but one that goes beyond the intent of the figure, which is to illustrate the relative contribution of NEON/reconstruction estimates to the full picture, and to allow potential data users to visually assess (in)congruity between the black and gray lines. So, we will provide the plot you suggest, with x-axis included, as a supplementary figure.
  Lines 419-426. This text could go into a discussion section.
  Accepted
  Lines 436-441. This text could go into a discussion section.
  Accepted
  Lines 455-474. This text could go into a discussion section.
  Accepted
  Line 515. Figure A1. Do the bars show the mean KGE of the 27 sites? The figure caption does not state what the value represented by the bars is. Please include that in the caption.
  See proposed solution to line 374 comment.
  Technical comments –
  Line 99. “site location selection” is redundant. Suggest changing to “site-selection issues”.
  Accepted
  Line 100. Suggest changing “high-quality” to “higher quality”.
  The quality of data gaps is indeterminate, so the comparative wouldn’t work here (assuming this refers to line 104, rather than 100).
  Line 105. Suggest removing the comma after “series,”
  Accepted
  Line 108. Change “virtual gauges.” to “virtual gauges” by removing the period.
  Good catch!
  Lines 167-170. For conciseness and clarity, suggest the following rewrite: “For each target site, up to four donor gauge candidates were selected based on spatial proximity (within 50km) and geographic similarity (not in an urban area; not downstream of a reservoir). Generally, no more than four gauges met these criteria, but for one target site (MCRA) there were 10 nearby candidate gauges to select from, all of which were associated with the Andrews Experimental Forest in western Oregon, USA.”
  This rewrite does make the text more concise, though geographic similarity encompasses terrain and landcover broadly, and is a continuum, whereas artificial flowpaths and control structures grossly alter flow relationships, and do so in idiosyncratic ways. We will focus on clarifying here that, barring gauges with overt human influence, the exact methods used to choose donor gauges are of little consequence, so long as informative donor gauges are not overlooked. In practice, there will usually be just a few, if any, potential donor gauges available. If multiple donor gauges are included in a regression, L2 regularization (Ridge regression) should be used to account for their covariance.
  Lines 172-173. Suggested rewrite: “…chose three candidate sites representing A) a catchment upstream of the target site (GSWS08), B) downstream of the target site on the MCRA mainstem (GSLOOK), and C) downstream on a tributary of MCRA (GSWS01).”
  This becomes a bit of a parenthesis soup, especially confusing when some of them are unpaired.
  Line 198. Change “glmnet” to “Ridge-regression”.
  Accepted
  Line 455. Change “low flow conditions” to “low-flow conditions”
  Accepted
  
  Citation: https://doi.org/10.5194/egusphere-2023-1178-AC1
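
The reply above recommends L2 regularization (Ridge regression) when several donor gauges enter one regression, to account for their covariance. The following is a minimal pure-Python sketch of that idea, not the authors' implementation. The log-log model form, the function names (`fit_ridge`, `predict`, `solve_linear`), and the penalty value are all assumptions made for illustration.

```python
import math


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; try a larger penalty")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(donor_flows, target_flow, penalty):
    """Fit log(target) = intercept + sum_j coef_j * log(donor_j).

    donor_flows: one row per measurement date, one column per donor gauge.
    target_flow: discrete discharge measurements at the target site.
    The L2 penalty applies to the coefficients only, not the intercept.
    """
    x = [[math.log(q) for q in row] for row in donor_flows]
    y = [math.log(q) for q in target_flow]
    n, p = len(x), len(x[0])
    x_mean = [sum(row[j] for row in x) / n for j in range(p)]
    y_mean = sum(y) / n
    # Centre predictors and response so the intercept stays unpenalized
    xc = [[row[j] - x_mean[j] for j in range(p)] for row in x]
    yc = [v - y_mean for v in y]
    # Normal equations with the penalty added to the diagonal
    gram = [[sum(r[j] * r[k] for r in xc) + (penalty if j == k else 0.0)
             for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * v for r, v in zip(xc, yc)) for j in range(p)]
    coefs = solve_linear(gram, xty)
    intercept = y_mean - sum(c * mu for c, mu in zip(coefs, x_mean))
    return intercept, coefs


def predict(intercept, coefs, donor_row):
    """Estimate target-site discharge from one row of donor-gauge flows."""
    log_q = intercept + sum(c * math.log(q) for c, q in zip(coefs, donor_row))
    return math.exp(log_q)
```

With two perfectly correlated donors, an unpenalized fit has no unique solution (here `solve_linear` would raise an error), whereas a small positive `penalty` splits the weight evenly between them. In practice a maintained library would normally replace the hand-written solver.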

RC2: 'Comment on egusphere-2023-1178', Anonymous Referee #2, 26 Aug 2023

This paper describes an effort to reconstruct and correct streamflow data series from NEON (National Ecological Observatory Network) stream gaging sites, based on limited field measurements at the NEON sites and information from nearby stream gages.

For anyone who has attempted to use NEON streamflow data, this contribution is a gift. The methods are clearly described, code is provided, and the resulting datasets are made available. Thanks to the authors for their efforts.

Comments:

Abstract: please describe how the linear models performed compared to the LSTM models, and why.
A number of datasets and models are used in the analysis. It would be helpful to provide a table of these and their hyperlinks with a description of each, in the Appendix. For example, CAMELS, NHM, etc. And in addition, can you provide links to the “donor” datasets that you used? Perhaps this is provided and I missed it.
Results/Discussion. A number of products including code, model results, and datasets are produced. It would be helpful to provide a table of the results created by the analysis and their hyperlinks with a description of each, in the text.
Results/Discussion. I could not find metadata in the materials provided at the hyperlinks. Please provide measurement units (are these unit area flows in mm?) and other metadata for the composite series linked at https://figshare.com/articles/dataset/Composite_discharge_series_for_each_NEON_stream_river_site_/23206592
Results/discussion (l. 390-400) and conclusions (l. 495-500). Please provide a slightly expanded explanation for why the LSTM models performed considerably less well than the linear models. It appears that it was not the model structures themselves but their data requirements that limited the performance of the LSTM models relative to the regression models. Can you help readers better understand these limitations more generally? Were watershed characteristics or other data shown in Table 2 relevant to the performance of the linear vs. LSTM models? How did the differences in drainage areas of the CAMELS data set relative to these watersheds (l. 25-251) influence model performance?
Results/discussion. For each site, could you summarize the record lengths over which you were able to reconstruct the 5-minute data vs. developing a daily estimate? And if the daily estimates were based on the LSTM models, which have a median NSE of 0.32 to 0.47, why are the confidence intervals for daily discharge in the resulting composite data series so narrow?
Conclusions. Do you have any recommendations for NEON to improve their stream gaging procedures, data collection, and information management processes? Will it be necessary for users of NEON data to use your code in the future?

Citation: https://doi.org/10.5194/egusphere-2023-1178-RC2

AC2: 'Reply on RC2', Michael Vlah, 06 Oct 2023

RC2 (responses in bold)

This paper describes an effort to reconstruct and correct streamflow data series from NEON (National Ecological Observatory Network) stream gaging sites, based on limited field measurements at the NEON sites and information from nearby stream gages.

For anyone who has attempted to use NEON streamflow data, this contribution is a gift. The methods are clearly described, code is provided, and the resulting datasets are made available. Thanks to the authors for their efforts.

Thank you!

Comments:

Abstract: please describe how the linear models performed compared to the LSTM models, and why.

We will append “...with linear regression generally outperforming deep learning approaches due to the use of target site data for model fitting, rather than evaluation only.” to the third-to-last sentence.

A number of datasets and models are used in the analysis. It would be helpful to provide a table of these and their hyperlinks with a description of each, in the Appendix. For example, CAMELS, NHM, etc. And in addition, can you provide links to the “donor” datasets that you used? Perhaps this is provided and I missed it.

We will include the following table in the appendix. Note that links for NEON data are instead given as citations, to ensure that users find their way to the complete link, doi, and timestamp reference corresponding to the datasets we used. Instead of links to the donor gauge datasets, we provide a link to the landing page of the National Water Information System, and an example of a page for a single donor gauge. There are 54 such gauges involved in this study, and we do not recommend that anyone download these individually, but instead use the cited dataRetrieval package. We will also insert the sentence “NWIS gauge ID numbers are provided in cfg/donor_gauges.yml at the Zenodo archive link below.” on line 135.

Table Ax. Model input data used in this study.

| Resource | Description | Source/Link |
| --- | --- | --- |
| NEON discharge field collection | Discharge measurements from field-based surveys | NEON 2023b, NEON 2023c |
| NEON continuous discharge | Discharge calculated from a rating curve and sensor measurements of water level | NEON 2023a |
| User-focused evaluation of NEON streamflow estimates | 3-tier classification of the reliability of NEON continuous discharge by site-month | https://www.nature.com/articles/s41597-023-01983-w |
| CAMELS dataset | Catchment Attributes, Meteorology, (and streamflow) for Large-sample Studies | https://ral.ucar.edu/solutions/products/camels |
| National Hydrologic Model (NHM) | USGS infrastructure that, when coupled with the Precipitation-Runoff Modeling System, can produce streamflow simulations at local to national scale | https://www.usgs.gov/mission-areas/water-resources/science/national-hydrologic-model-infrastructure |
| MacroSheds | A synthesis of long-term biogeochemical, hydroclimatic, and geospatial data from small watershed ecosystem studies | https://portal.edirepository.org/nis/mapbrowse?scope=edi&identifier=1262 |
| Daymet | Gridded estimates of daily weather parameters | https://developers.google.com/earth-engine/datasets/catalog/NASA_ORNL_DAYMET_V4 |
| HJ Andrews Experimental Forest stream discharge | Stream discharge in gaged watersheds, 1949 to present | https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-and.4341.33 |
| USGS National Water Information System | Streamflow and associated data for thousands of gauged streams and rivers within the USA | https://waterdata.usgs.gov/nwis, e.g. https://waterdata.usgs.gov/monitoring-location/06879100/ |

Results/Discussion. A number of products including code, model results, and datasets are produced. It would be helpful to provide a table of the results created by the analysis and their hyperlinks with a description of each, in the text.

The following table could be inserted at the end of section 2, or in the appendix.

Table Ax. Products of this study.

| Product | Description | Link |
| --- | --- | --- |
| Data archive landing page | Figshare page linking to each of four archives described below | https://doi.org/10.6084/m9.figshare.c.6488065 |
| Composite discharge timeseries | Analysis-ready CSVs combining the best available discharge estimates across linear regression and LSTM approaches from this study, and NEON’s published data | https://doi.org/10.6084/m9.figshare.23206592.v1 |
| Composite discharge plots | Interactive plots of our composite discharge product | https://macrosheds.org/data/vlah_etal_2023_composites |
| All model outputs and results | Complete predictions from all linear regression and LSTM models, run results, and diagnostics | https://doi.org/10.6084/m9.figshare.22344589.v1 |
| All model input data | Donor gauge streamflow, training data for LSTMs, model configurations, etc. | https://doi.org/10.6084/m9.figshare.22349377.v1 |
| All code associated with this paper | Zenodo archive of GitHub repository | https://doi.org/10.5281/zenodo.7976251 |
| All figures associated with this paper | High-resolution images of all figures from the main body and appendix | https://doi.org/10.6084/m9.figshare.23169362.v1 |

Results/Discussion. I could not find metadata in the materials provided at the hyperlinks. Please provide measurement units (are these unit area flows in mm?) and other metadata for the composite series linked at https://figshare.com/articles/dataset/Composite_discharge_series_for_each_NEON_stream_river_site_/23206592

Metadata and units are given at the bottom of that page, in a section called “CSV column definitions.” Similar structural diagrams are given for all model outputs at https://figshare.com/articles/dataset/Model_outputs_and_results/22344589.

Results/discussion (l. 390-400) and conclusions (l. 495-500). Please provide a slightly expanded explanation for why the LSTM models performed considerably less well than the linear models. It appears that it was not the model structures themselves but their data requirements that limited the performance of the LSTM models relative to the regression models. Can you help readers better understand these limitations more generally? Were watershed characteristics or other data shown in Table 2 relevant to the performance of the linear vs. LSTM models? How did the differences in drainage areas of the CAMELS data set relative to these watersheds (l. 25-251) influence model performance?

The crux of the difference in performance is given on lines 395-99. We will elaborate by inserting this text between the sentences of line 397:

“Whereas the LSTM models must parameterize each day of prediction individually, the regression models need only parameterize relationships between flow regimes.”

And this text between the sentences of line 399:

“Still, if given more training data, including numerous examples of watersheds and streams similar to each of those modeled in this study, the LSTM approaches would eventually close the performance gap.”
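
To make the contrast concrete, the following is a minimal sketch of a donor-gauge regression, not the authors' implementation. It assumes a single donor gauge and a log-log ordinary least squares fit; the function names and the synthetic numbers are hypothetical. The point it illustrates is that only two parameters are estimated from a handful of discrete measurements, after which the donor's continuous record supplies the day-to-day dynamics.

```python
import numpy as np

def fit_donor_regression(q_target_discrete, q_donor_matched):
    """Fit log(Q_target) = intercept + slope * log(Q_donor) by OLS.

    q_target_discrete: discrete field discharge measurements at the target site.
    q_donor_matched:   donor-gauge discharge at the same times.
    The log-log form is an assumption made for this sketch.
    """
    x = np.log(np.asarray(q_donor_matched, dtype=float))
    y = np.log(np.asarray(q_target_discrete, dtype=float))
    slope, intercept = np.polyfit(x, y, deg=1)  # highest-degree coefficient first
    return intercept, slope

def predict_continuous(q_donor_series, intercept, slope):
    """Map a continuous donor record onto the target site."""
    q_donor = np.asarray(q_donor_series, dtype=float)
    return np.exp(intercept + slope * np.log(q_donor))

# Hypothetical example: 12 discrete measurements, one year of daily donor flow.
rng = np.random.default_rng(0)
donor_daily = np.exp(rng.normal(1.0, 0.8, size=365))
sample_days = rng.choice(365, size=12, replace=False)
target_field = 0.5 * donor_daily[sample_days] ** 1.2 * np.exp(rng.normal(0, 0.05, 12))

intercept, slope = fit_donor_regression(target_field, donor_daily[sample_days])
target_daily_estimate = predict_continuous(donor_daily, intercept, slope)
```

An LSTM, by comparison, has to learn the rainfall-runoff response itself from forcing data, which is why its training-data requirements are so much larger.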

Results/discussion. For each site, could you summarize the record lengths over which you were able to reconstruct the 5-minute data vs. developing a daily estimate? And if the daily estimates were based on the LSTM models, which have a median NSE of 0.32 to 0.47, why are the confidence intervals for daily discharge in the resulting composite data series so narrow?

Figure 3 provides just such a summary. The hashed pink and purple sections have only daily reconstruction estimates.

As for narrow confidence intervals on inaccurate estimates, this implies that the corresponding models were confidently incorrect! Confidence intervals are quite variable for the LSTM models, and are often similar in relative magnitude to those of the linear regressions. For the LSTMs, these intervals were generated by building ensembles of 30 models and computing the 95% quantiles of each point estimate. The optimization of a neural network is a stochastic process, so a different solution is reached for every iteration. Optimization problems with a diversity of solutions will naturally produce a wide range of answers (resulting in wide confidence intervals), whereas other problems might be more constrained. Here, for example, inadequate training data for a particular type of stream might result in a model that consistently predicts off the mark for that stream type, but does so with a consistent bias (low accuracy, high precision, in a sense). Users of the composite discharge product must therefore be mindful of both uncertainty and accuracy when evaluating our estimates for a particular site.
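
The distinction between precision and accuracy described above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the authors' code: "95% quantiles" is interpreted here as the 2.5th and 97.5th percentiles across ensemble members, and the discharge values, bias, and spread are invented.

```python
import numpy as np

def ensemble_interval(predictions, lower=0.025, upper=0.975):
    """Per-time-step median and quantile interval across ensemble members.

    predictions: array of shape (n_models, n_timesteps).
    The 2.5%/97.5% bounds are an assumed reading of "95% quantiles".
    """
    preds = np.asarray(predictions, dtype=float)
    lo, med, hi = np.quantile(preds, [lower, 0.5, upper], axis=0)
    return med, lo, hi

rng = np.random.default_rng(42)
truth = np.full(100, 10.0)  # hypothetical true discharge

# 30 ensemble members that agree closely with each other (small spread)
# but share a consistent bias of +3 units.
members = truth + 3.0 + rng.normal(0.0, 0.2, size=(30, 100))

med, lo, hi = ensemble_interval(members)
mean_width = np.mean(hi - lo)                      # narrow interval
coverage = np.mean((truth >= lo) & (truth <= hi))  # fraction of truth inside it
```

Here the interval is narrow because the members agree, yet it never contains the true value, which is the "low accuracy, high precision" case. Interval width therefore needs to be read alongside a skill metric such as NSE.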

Conclusions. Do you have any recommendations for NEON to improve their stream gaging procedures, data collection, and information management processes? Will it be necessary for users of NEON data to use your code in the future?

We have communicated with NEON regarding this and similar efforts. As lines 23-25 and 62-69 indicate, measuring discharge, especially in new systems, is hard, and NEON products are showing clear improvement over time, especially as their gauge placements and rating curves mature. Primarily, we think a thoughtfully engaged user community can help improve the use and reuse of NEON data. Ideally, it will not be necessary for NEON data users to rerun this code on future data releases. Still, we’ve taken great pains to make it reusable with minimal effort.

Citation: https://doi.org/10.5194/egusphere-2023-1178-AC2

CC1:
'Comment on egusphere-2023-1178', Nick Harrison, 30 Aug 2023

The manuscript “Leveraging gauge networks and strategic discharge measurements to aid development of continuous streamflow records” uses data from the National Ecological Observatory Network (NEON) to develop models for predicting continuous streamflow from discrete measurements. The authors also provide a dataset that combines published NEON data with gap-filled and corrected data derived from their models. Both the manuscript and dataset will be extremely useful for NEON data users and their process may be useful for other researchers attempting to build continuous streamflow records at other sites. We have some substantive and editorial comments outlined below that will strengthen the manuscript and increase the reproducibility of their analyses.
Substantive comments:
Abstract:
The authors do an excellent job of summarizing the challenges to developing continuous streamflow records and highlighting the importance of overcoming these challenges for the research community. The abstract is a bit narrowly focused on the implications of their analyses for the NEON dataset when they may be useful for a broader group of researchers.
Methods Section:
Section 2.1: Based on the text in the Methods section, it appears that the authors applied the tier classifications from the outputs produced by the analysis in Rhea et al. (2023). That manuscript used a combination of RELEASE-2022 and Provisional NEON data available at that time. For best alignment, we suggest that the authors apply the classification framework from Rhea et al. (2023) to the RELEASE-2023 and provisional NEON data used in this manuscript. These new outputs should be included in the archived dataset associated with this manuscript for completeness. If this is already done, the authors should update the Methods section with additional detail to make this clear. Also, any Provisional data used in the analysis should be archived, if possible, by the authors to maintain reproducibility.
Section 2.5: The authors should provide citations and/or a rationale for the NSE and KGE benchmarks that they apply to the various datasets since those are different for each model. Some potential examples include Knoben et al. (2019), Moriasi et al. (2015), Golmohammadi et al. (2014), and Chiew and McMahon (1993).
Results and Discussion Section:
Line 455: While practical considerations such as site access, timing, and safety do indeed limit the collection of higher-flow discharge measurements, low-to-baseflow conditions are typically present throughout the majority of the annual water year, particularly in smaller stream systems. Stream monitoring networks that include small streams with unstable low-flow hydraulic controls must balance the accurate characterization of the low to mid-flow regime with measuring discharge during high-flow events, which are relatively infrequent. Streams with unstable low-flow hydraulic controls (e.g., those with unconsolidated bed material) exhibit highly variable stage-discharge relationships, both within a given water year and across longer timescales. As such, multiple gaugings are required each year in order to accurately characterize the stage-discharge relationship in the low-to-middle segments of the rating curve. In contrast, high-flow gaugings can often be used in rating curves over longer periods of time due to the relative stability of high-flow controls (e.g., banks, valley walls, etc.).
The Discussion section focused a lot on NEON, without much discussion of how the methods could be applied more generally. This was an important point brought up in the Introduction, but it was not extensively revisited in the Discussion. Confining most of the NEON-specific implications of the results to the Results section, and discussing how the methodology could be applied to un-gaged watersheds or watersheds with incomplete records more generally, would likely increase the number of interested readers beyond just those who are using NEON data.
Editorial comments (primarily about citing NEON data to improve reproducibility):
Throughout the text of the paper: it is not clear from the citations whether NEON provisional data (see definition at https://www.neonscience.org/data-samples/data-management/data-revisions-releases) were used in assessment of the linear regression and LSTM models (e.g. lines 116-117). Please clarify in the text and citations in terms of which data were used in which part of the analysis.
Line 106: The readme for the macrosheds repository linked here reports an incorrect data product ID (reports DP4.00230.001, should be DP4.00130.001).
Line 117: Separate citations are needed for released and provisional datasets of the Discharge field collection data product (DP1.20048.001). These are two separate datasets. See https://www.neonscience.org/data-samples/guidelines-policies/citing for NEON citation guidelines.
Lines 118, 126: The date the data were accessed should be in the references entry, not the inline citation, unless specifically requested by the journal. Regardless, the accessed date should be included in the references on lines 671 and 669, respectively. See https://www.neonscience.org/data-samples/guidelines-policies/citing for NEON citation guidelines.
Lines 669-671: NEON data citations need to be updated to match the NEON citation guidelines detailed in https://www.neonscience.org/data-samples/guidelines-policies/citing.
Line 326: Please change ‘2023-release’ to ‘RELEASE-2023’ to be consistent with NEON terminology.
Line 509: Please change ‘DP4.00230.001’ to ‘DP4.00130.001’ as the incorrect data product ID is currently being used.
Line 579: In the acknowledgements section, please include the standard NEON data acknowledgement (see https://www.neonscience.org/data-samples/guidelines-policies/citing).

Citation: https://doi.org/10.5194/egusphere-2023-1178-CC1
- AC3: 'Reply on CC1', Michael Vlah, 06 Oct 2023
  
  CC1 (author responses follow each comment)
  
  The manuscript “Leveraging gauge networks and strategic discharge measurements to aid development of continuous streamflow records” uses data from the National Ecological Observatory Network (NEON) to develop models for predicting continuous streamflow from discrete measurements. The authors also provide a dataset that combines published NEON data with gap-filled and corrected data derived from their models. Both the manuscript and dataset will be extremely useful for NEON data users and their process may be useful for other researchers attempting to build continuous streamflow records at other sites. We have some substantive and editorial comments outlined below that will strengthen the manuscript and increase the reproducibility of their analyses.
  Thank you so much for taking the time to provide this feedback!
  Substantive comments:
  Abstract:
  The authors do an excellent job of summarizing the challenges to developing continuous streamflow records and highlighting the importance of overcoming these challenges for the research community. The abstract is a bit narrowly focused on the implications of their analyses for the NEON dataset when they may be useful for a broader group of researchers.
  Good point. We zoom out at the end of the intro and discussion, but not here in the abstract. We will add a sentence like, “The success of this effort demonstrates the potential to establish “virtual gauges,” or sites at which continuous streamflow can be accurately estimated from discrete measurements, by transferring information from nearby donor gauges and/or large collections of training data.”
  Methods Section:
  Section 2.1: Based on the text in the Methods section, it appears that the authors applied the tier classifications from the outputs produced by the analysis in Rhea et al. (2023). That manuscript used a combination of RELEASE-2022 and Provisional NEON data available at that time. For best alignment, we suggest that the authors apply the classification framework from Rhea et al. (2023) to the RELEASE-2023 and provisional NEON data used in this manuscript. These new outputs should be included in the archived dataset associated with this manuscript for completeness. If this is already done, the authors should update the Methods section with additional detail to make this clear. Also, any Provisional data used in the analysis should be archived, if possible, by the authors to maintain reproducibility.
  We used the updated version of Rhea 2023 (the dataset accompanying Rhea et al. 2023), published 4/11/23. This update repeats the original analysis using RELEASE-2023 data. We did not use provisional continuous discharge data, but did use provisional field Q, which is included in the archived dataset associated with this paper. All of this will be made clearer in our methods.
  Section 2.5: The authors should provide citations and/or a rationale for the NSE and KGE benchmarks that they apply to the various datasets since those are different for each model. Some potential examples include Knoben et al. (2019), Moriasi et al. (2015), Golmohammadi et al. (2014), and Chiew and McMahon (1993).
  Thank you for these citations. We will cite the 0.5 NSE threshold recommended by Moriasi et al. (2015) and acknowledge that one site-year was used for gap-filling despite not quite reaching that threshold (KING 2017 at NSE 0.499).
  Results and Discussion Section:
  Line 455: While practical considerations such as site access, timing, and safety do indeed limit the collection of higher-flow discharge measurements, low-to-baseflow conditions are typically present throughout the majority of the annual water year, particularly in smaller stream systems. Stream monitoring networks that include small streams with unstable low-flow hydraulic controls must balance the accurate characterization of the low to mid-flow regime with measuring discharge during high-flow events, which are relatively infrequent. Streams with unstable low-flow hydraulic controls (e.g., those with unconsolidated bed material) exhibit highly variable stage-discharge relationships, both within a given water year and across longer timescales. As such, multiple gaugings are required each year in order to accurately characterize the stage-discharge relationship in the low-to-middle segments of the rating curve. In contrast, high-flow gaugings can often be used in rating curves over longer periods of time due to the relative stability of high-flow controls (e.g., banks, valley walls, etc.).
  This is a crucial point that we have overlooked. We will ensure that the asymmetrical importance of characterizing low- and mid-flow regimes, in streams with unconsolidated bed material, is addressed in the final version.
  The Discussion section focused a lot on NEON, without much discussion of how the methods could be applied more generally. This was an important point brought up in the Introduction, but it was not extensively revisited in the Discussion. Confining most of the NEON-specific implications of the results to the Results section, and discussing how the methodology could be applied to un-gaged watersheds or watersheds with incomplete records more generally, would likely increase the number of interested readers beyond just those who are using NEON data.
  The next iteration of this paper will have separate Results and Discussion sections. We will take that opportunity to expand on how this methodology could be applied in other settings.
  Editorial comments (primarily about citing NEON data to improve reproducibility):
  Throughout the text of the paper: it is not clear from the citations whether NEON provisional data (see definition at https://www.neonscience.org/data-samples/data-management/data-revisions-releases) were used in assessment of the linear regression and LSTM models (e.g. lines 116-117). Please clarify in the text and citations in terms of which data were used in which part of the analysis.
  We will clarify the text by explicitly stating that provisional data were not used for the continuous discharge data product. The citations will be clarified by addressing your comment on line 117 below.
  Line 106: The readme for the macrosheds repository linked here reports an incorrect data product ID (reports DP4.00230.001, should be DP4.00130.001).
  Good catch. We’ll revise this file in the data archive.
  Line 117: Separate citations are needed for released and provisional datasets of the Discharge field collection data product (DP1.20048.001). These are two separate datasets. See https://www.neonscience.org/data-samples/guidelines-policies/citing for NEON citation guidelines.
  Thank you for pointing this out. We will include the provisional data citation for field Q in section 2.1, and correct our in-line citations and references as per the following comments.
  Lines 118, 126: The date the data were accessed should be in the references entry, not the inline citation, unless specifically requested by the journal. Regardless, the accessed date should be included in the references on lines 671 and 669, respectively. See https://www.neonscience.org/data-samples/guidelines-policies/citing for NEON citation guidelines.
  This correction will be made in the final version of the paper. Thank you.
  Lines 669-671: NEON data citations need to be updated to match the NEON citation guidelines detailed in https://www.neonscience.org/data-samples/guidelines-policies/citing.
  These references will be corrected to match the official guidelines.
  Line 326: Please change ‘2023-release’ to ‘RELEASE-2023’ to be consistent with NEON terminology.
  Accepted.
  Line 509: Please change ‘DP4.00230.001’ to ‘DP4.00130.001’ as the incorrect data product ID is currently being used.
  Egad! This mistake will be corrected in the final version.
  Line 579: In the acknowledgements section, please include the standard NEON data acknowledgement (see https://www.neonscience.org/data-samples/guidelines-policies/citing).
  The standard acknowledgement will be included in the final version.
  
  Citation: https://doi.org/10.5194/egusphere-2023-1178-AC3
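
For readers unfamiliar with the two skill metrics discussed in the Section 2.5 exchange above, the following is a minimal sketch of their standard definitions: Nash-Sutcliffe efficiency (NSE) and the 2009 formulation of Kling-Gupta efficiency (KGE). The function names are hypothetical, and the sketch assumes paired observed and simulated series with no missing values; it is not taken from the authors' code.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(obs, sim):
    """Kling-Gupta efficiency (2009 form): 1 is a perfect fit.

    Combines linear correlation (r), variability ratio (alpha),
    and bias ratio (beta) into a single score.
    """
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)

def meets_threshold(score, threshold=0.5):
    """Apply a fixed acceptance threshold, such as the 0.5 NSE value cited above."""
    return score >= threshold
```

Under a strict 0.5 cutoff, `meets_threshold(0.499)` is false, which is why the KING 2017 site-year is acknowledged as an exception rather than silently included.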

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

ED: Publish subject to minor revisions (further review by editor) (07 Oct 2023) by Jan Seibert

AR by Michael Vlah on behalf of the Authors (02 Nov 2023)

ED: Publish as is (16 Dec 2023) by Jan Seibert

AR by Michael Vlah on behalf of the Authors (22 Dec 2023)

Short summary

Virtual stream gauging enables continuous streamflow estimation where a gauge might be difficult or impractical to install. We reconstructed flow at 27 gauges of the National Ecological Observatory Network (NEON), informing ~199 site-months of missing data in the official record and improving the accuracy of official estimates at 11 sites. This study shows that machine learning, but also routine regression methods, can be used to supplement existing gauge networks and reduce monitoring costs.