the Creative Commons Attribution 4.0 License.
CAMELS-Chem: Augmenting CAMELS (Catchment Attributes and Meteorology for Large-sample Studies) with Atmospheric and Stream Water Chemistry Data
Gary Sterle
Julia Perdrial
Thomas Adler
Kristen Underwood
Donna Rizzo
Abstract. Large sample datasets are transforming hypothesis testing and model fidelity in the catchment sciences, but few large stream water chemistry datasets exist with complementary streamflow, meteorology, and catchment physiographic attributes. Here, we pair atmospheric deposition and water chemistry related information with the existing CAMELS (Catchment Attributes and Meteorology for Large-sample Studies) dataset. The newly developed dataset, CAMELS-Chem, comprises U.S. Geological Survey water chemistry data and instantaneous discharge over the period from 1980 through 2014 in 506 minimally impacted headwater catchments. The CAMELS-Chem dataset includes 18 common stream water chemistry constituents: Al, Ca, Cl, Dissolved Organic Carbon, Total Organic Carbon, HCO3, K, Mg, Na, Total Dissolved Nitrogen [nitrate + nitrite + ammonia + organic-N], Total Organic Nitrogen, NO3, Dissolved Oxygen, pH (field and lab), Si, SO4, and water temperature. We also provide annual wet deposition loads from the National Atmospheric Deposition Program over the same catchments that include: Ca, Cl, H, K, Mg, and Total Nitrogen from deposition [precipitation NO3 + NH4, dry deposition of particulate NH4, + NO3, and gaseous NH3], Na, NH4, NO3, SO4. We release a paired instantaneous discharge (and mean daily discharge) measurement for all chemistry samples. To motivate wider use by the larger scientific community, we develop three example analyses: 1. Atmospheric-aquatic linkages using atmospheric and stream SO4 trends, 2. Hydrologic-biogeochemical linkages using concentration-discharge relations, and 3. Geological-biogeochemical linkages using weathering relations. The retrieval scripts and final dataset of > 412,801 individual stream water chemistry measurements are available to the wider scientific community for continued investigation.
Status: final response (author comments only)
RC1: 'Comment on hess-2022-81', Anonymous Referee #1, 08 Apr 2022
Evaluating the overall quality of the preprint ("general comments"),
Sterle et al. present a novel compiled dataset of water quality solutes and atmospheric deposition inputs for the CAMELS catchments. Their work augmenting the existing and widely used CAMELS datasets is needed for further research analyzing spatial and temporal water quality trends in minimally impacted watersheds. Existing papers have done large-scale water quality analyses, but few have provided open-access datasets and this breadth of solutes.
At its core, this is a data paper. As such, I think the methods need to be expanded. My comments primarily pertain to the data and methods, as I see that this is the paper's novelty.
The CAMELS dataset is widely used, and the addition of water chemistry provides the opportunity for analysis. However, I think the dataset could be improved substantially. This paper by Sterle et al. provides an excellent resource for the community.
Methods and Results
Length of dataset
In the paper, the dataset is stated to end in 2014 (see Data Comments below because I'm not sure if this is accurate). However, in many cases, solute and discharge data are available until at least the end of the NADP reporting period. Soon 2014 will be 10 years ago, and I worry the data will quickly become obsolete and not be used to its fullest capacity. The value of this dataset would increase substantially if the authors were able to harmonize data from various agencies.
The USGS has developed methods to generate the longest timeseries possible for watersheds. However, I do not see these methodologies applied to these CAMELS watersheds.
Three different approaches can be used alone or in concert to expand the dataset.
- The following USGS report looked at the statistical differences between similar solutes. They were then able to merge solutes that were registered under different parameter codes but were statistically similar, allowing for a richer dataset.
- Secondly, the USGS report also has a methodology for pairing gauges and sampling locations that are not co-located.
- With the dataRetrieval package in R you can pull USGS, state, tribal, and NGO water quality measurements. This can potentially expand the dataset. However, a lot of manual data cleaning is required because of the poor metadata for non-USGS agencies (e.g., nitrate can be reported as N or NO3, and this is not explicitly stated in the metadata) (Sprague et al. 2017). With the data harmonization from multiple sources, I would expect a section on the methodology of data cleaning.
- Oelsner, G. P., Sprague, L. A., Murphy, J. C., Zuellig, R. E., Johnson, H. M., Ryberg, K. R., Falcone, J. A., Stets, E. G., Vecchia, A. V., Riskin, M. L., De Cicco, L. A., Mills, T. J., & Farmer, W. H. (2017). Water-quality trends in the nation's rivers and streams, 1972–2012: Data preparation, statistical methods, and trend results. U.S. Geological Survey Scientific Investigations Report 2017–5006. https://doi.org/10.3133/sir20175006
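To illustrate the nitrate reporting ambiguity mentioned above, here is a minimal sketch of converting between the two reporting bases. The function name and the `reported_as` argument are hypothetical and do not come from the CAMELS-Chem scripts or from dataRetrieval; the conversion itself is only the ratio of molar masses.

```python
# Molar masses in g/mol (standard atomic weights, rounded).
N_MOLAR_MASS = 14.007
NO3_MOLAR_MASS = 14.007 + 3 * 15.999  # about 62.004


def nitrate_to_mg_n_per_l(value, reported_as):
    """Return a nitrate concentration expressed as mg-N/L.

    reported_as is a hypothetical metadata flag:
      "N"   -> value is already in mg-N/L
      "NO3" -> value is in mg-NO3/L and must be rescaled
    """
    if reported_as == "N":
        return value
    if reported_as == "NO3":
        return value * N_MOLAR_MASS / NO3_MOLAR_MASS
    raise ValueError(f"unknown reporting basis: {reported_as!r}")
```

Because the two bases differ by a factor of roughly 4.4, a record with an unstated basis cannot be safely merged until the basis is confirmed from the source agency.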
Figures capture the various hydrological/biogeochemical metrics for select solutes in section 4. However, as a user, I find it challenging to evaluate whether the data would be sufficient for my use case. The authors have included some analysis of data coverage in section 3.1; however, since the strength of this paper lies in the data, there could be more information to help users evaluate whether this dataset suits their needs.
There should be summary figures for all solutes, so users can adequately assess whether the dataset is appropriate for their use. I think the paper would benefit tremendously if the dataset had more metadata and signatures/summary statistics. Specifically, in Addor et al. (2017), the authors summarized many indices and described the indices in great detail (see Table 2 and Table 3 in Addor et al.). I suggest summarizing information about (1) missing data/data gaps and years of continuous data, (2) low/high flow distribution, (3) the portion of the flow duration curve (FDC) that the water quality (WQ) samples span (Figure 7), (4) seasonality of hydrology and solutes, and other metrics that the authors deem useful.
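As one possible form of the FDC coverage metric in item (3), the sketch below computes the exceedance probability of each sampled flow against a daily flow record and reports the range the samples span. The function names and inputs are hypothetical, and this is not necessarily how the authors built their Figure 7.

```python
from bisect import bisect_left


def exceedance_probability(daily_flows, flow):
    """Fraction of daily flows that are greater than or equal to `flow`."""
    ordered = sorted(daily_flows)
    n_below = bisect_left(ordered, flow)  # daily flows strictly below `flow`
    return (len(ordered) - n_below) / len(ordered)


def sampled_fdc_range(daily_flows, sampled_flows):
    """Lowest and highest exceedance probability among the sampled flows."""
    probs = [exceedance_probability(daily_flows, q) for q in sampled_flows]
    return min(probs), max(probs)
```

A narrow range, for example samples only between 0.3 and 0.8, would tell a user that neither floods nor the lowest flows were sampled at that site.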
Data Comments
These comments pertain to the files on the Google Drive.
Data provided has inconsistencies in the way the date is reported. Example from camel_chem_v3.
- Gauge ID 14309500: sample_timestamp reads 8/15/67 15:00 whereas other dates are in 2010-12-11 format.
- Gauge ID 6447000: sample_start_dt appears to indicate that the data are from the 1950s. If so, this falls outside of the time period listed in the paper.
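A minimal sketch of normalizing the two timestamp styles noted above follows. The list of formats and the `latest_year` cutoff of 2018 are assumptions based only on the examples in these comments, and the function name is hypothetical. Note that Python's `%y` maps two-digit years 00 to 68 onto 2000 to 2068, so "67" would otherwise be read as 2067.

```python
from datetime import datetime

# Formats suggested by the examples above; others may exist in the files.
FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M", "%m/%d/%y %H:%M", "%m/%d/%y")


def parse_timestamp(text, latest_year=2018):
    """Parse a timestamp in any of FORMATS and return a datetime."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
        # A two-digit year that lands after the assumed end of the record
        # is taken to belong to the previous century.
        if parsed.year > latest_year:
            parsed = parsed.replace(year=parsed.year - 100)
        return parsed
    raise ValueError(f"unrecognized timestamp: {text!r}")
```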
It appears that the data available in camel_chem_q_v1 ends in 2018. Please update the manuscript with the correct dates if this data is available.
The original CAMELS dataset provides shapefiles. To allow for seamless merging of data, the header of the column that holds the watershed IDs should be the same as the name used by the original CAMELS dataset.
The dataset provided should be able to stand on its own without needing the other CAMELS dataset. The watershed metadata should thus be included (area, outlet latitude and longitude, USGS gage number with leading 0s).
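For the leading-zero issue, a short sketch of restoring zero-padded gauge IDs is given below. It assumes the gauge numbers are 8 digits, which should be checked against the original CAMELS files; the function name is hypothetical.

```python
def normalize_gauge_id(raw, width=8):
    """Return a gauge ID as a zero-padded string of `width` digits."""
    text = str(raw).strip()
    if not text.isdigit():
        raise ValueError(f"gauge ID is not numeric: {raw!r}")
    return text.zfill(width)
```

With this, the ID written as 6447000 above becomes "06447000". Storing the IDs as text rather than integers keeps the zeros from being dropped again when the files are read back in.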
Individual scientific questions/issues ("specific comments"),
Line 84-87: Authors state that they have WQ data from 1980-2014. Later the authors state that the data is "for the same time period" as NADP data (1985-2019). Dates should be consistent.
Line 103: Section 2.2 should be written in a high-level abstract way. I find it unclear how these frameworks are applied to this data in the way it is currently written. It would be better to see more specificity and the names of the databases (e.g., NADP, NWIS).
Line 125: It is unclear whether "daily average discharge" means a continuous dataset or just discharge measurements for the dates on which there are solute samples. The dataset provided suggests the latter, but I think there would be value in providing the daily discharge timeseries for the same timespan as the solute data.
Line 135: In Table 2, deposition units are reported as mg/L. However, NADP reports their deposition in both concentration and kg per hectare. Are the units in Table 2 a mistake? If not, can you add some detail on the methods used to convert concentration to an area normalized load?
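For reference, the generic relation between a precipitation-weighted concentration and an area-normalized wet deposition load can be sketched as follows. This is a unit conversion only, not a description of the authors' or NADP's processing, and the function name is hypothetical. One millimetre of precipitation over one hectare is 10,000 L, so 1 mg/L delivers 0.01 kg/ha per millimetre.

```python
def wet_deposition_kg_per_ha(conc_mg_per_l, precip_mm):
    """Wet deposition load in kg/ha from concentration and precipitation depth.

    1 mm over 1 ha = 10 m^3 = 10,000 L, and 10,000 mg = 0.01 kg,
    so load = concentration (mg/L) * precipitation (mm) / 100.
    """
    return conc_mg_per_l * precip_mm / 100.0
```

For example, 1 mg/L in 1000 mm of annual precipitation corresponds to 10 kg/ha per year.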
Table 1: Consider adding the NWIS parameter code. For example, is "Nitrate, water filtered" the nitrate plus nitrite (00631) or just nitrate (00618)? There are many parameters for slightly similar solutes and it would help with reproducibility if the parameter codes were included. Also, consider listing the difference between pH in the field and pH in the lab for users.
Table 1: Also consider adding more detail to units. For example, is nitrate mg-NO3/L or mg-N/L.
Line 204: EPA link is broken. I have had many issues with direct links where they are archived and become a dead end. I highly encourage the authors to find a paper with a DOI to support this sentence. As a starting point, you can consider:
- Baumgardner, R. E., Lavery, T. F., Rogers, C. M., & Isil, S. S. (2002). Estimates of the Atmospheric Deposition of Sulfur and Nitrogen Species: Clean Air Status and Trends Network, 1990−2000. In Environmental Science & Technology (Vol. 36, Issue 12, pp. 2614–2629). https://doi.org/10.1021/es011146g
- Lloret, J., & Valiela, I. (2016). Unprecedented decrease in deposition of nitrogen oxides over North America: the relative effects of emission controls and prevailing air-mass trajectories. Biogeochemistry, 129(1-2), 165–180.
Technical corrections ("technical corrections": typing errors, etc.).
- Subscripts for solutes should be consistent throughout the manuscript.
- Table 3 formatting caused solutes to be cut off.
- Line 18 and 149-151: 18 solutes listed in the abstract, 17 listed in Table 1 and in Line 149, and 16 listed in the text. Make them all consistent.
- Line 57: Remove (?).
- Line 60: Remove “CITE”
- Line 168, 174 and others: When referencing figure (ex Figure 2), please add the panel letters (a,b,c, etc.).
- Figure 2: Panel a, why does daily average discharge only have 393 watersheds when the original CAMELS dataset uses USGS discharge?
- Figure 6: Regarding NO3, if arid and humid sites are a subset of all sites, I am unsure how the slope for all sites can be larger than both arid and humid.
Citation
Sprague, L. A., Oelsner, G. P., & Argue, D. M. (2017). Challenges with secondary use of multi-source water-quality data in the United States. Water Research, 110, 252–261.
Citation: https://doi.org/10.5194/hess-2022-81-RC1
AC1: 'Reply on RC1', Adrian Harpold, 21 Jul 2022
We greatly appreciate the review comments of reviewer 1, particularly about tangible ways we can improve the dataset. We agree with points 1 and 2 raised about improving USGS datasets, but we do not wish to expand the dataset beyond USGS and initiate additional harmonizing efforts. We appreciate you going through the posted dataset and will address inconsistencies. We are unclear about the comment on summary figures for all solutes, as this was done between the main document (Figure 2 and 3) and the supplemental (Figure S1 and S2). We plan to incorporate the suggestions into our final manuscript.
Citation: https://doi.org/10.5194/hess-2022-81-AC1
RC2: 'Comment on hess-2022-81', Anonymous Referee #2, 02 Jun 2022
Overall comment:
This paper describes efforts to amend the known and quite successful US-wide CAMELS dataset with hydrochemical and deposition data. This effort is overdue and would greatly enhance the usage of the original CAMELS data as well. Having stated this, I have to admit that the manuscript is not convincing to me. It fails to state what data (e.g., constituent codes in the USGS data) was used in what exact way (missing description of data evaluation, conversion, filtering). It visualizes data coverage but does not state the number of observations and number of stations in a consistent way. So overall the reader is left rather unclear about the whole data handling process and the outcome. While I like the idea of three examples of what this data could be used for, this leaves me rather puzzled. This is submitted as a research paper but does not really come up with research. For me this manuscript would rather fit the purpose of ESSD (Earth System Science Data) as a dataset description than HESS as a research paper. For the latter it would have an appropriate structure and depth but still would need to acknowledge the detailed comments below.
Specific comments:
Abstract
The abstract needs to transport the content and motivation in a better way. It misses quite some information e.g. resolution of provided data. It is not clear what is really provided in this new data set and what is already available from CAMELS.
Line 18: The linkage to the original US CAMELS dataset remains unclear in the abstract. Is this an addition to the original one, as implied by the title and text here, or is this something completely new?
Line 24: This is odd. Either give the exact number or use ">". Rounding to something easier to read would probably be best.
Introduction
From my point of view the introduction should state some examples of existing water quality databases. There are recent advances here such as: GRQA: Global River Water Quality Archive (Virro et al., ESSD), GLORICH (Hartmann, J., Lauerwald, R., and Moosdorf, N.: A Brief Overview of the GLObal RIver Chemistry Database, GLORICH, Proced. Earth Plan. Sc., 10, 23–27, https://doi.org/10.1016/J.PROEPS.2014.08.005, 2014) or QUADICA (https://doi.org/10.5194/essd-2022-6).
Line 31: There are quite a number of examples of nation-wide to continental scale water quality studies using more than a single catchment. I disagree here that availability of datasets does not go hand in hand with usage of this.
Line 35: Check citation formatting here.
Line 35: Can you be more specific on the "issues" mentioned here?
Line 42: Year in citation missing.
Line 43: Use the complete name of CAMELS here.
Line 56: Check the question mark here.
Line 60: Check citation here.
Line 83: I think you should better introduce the idea to provide atmospheric deposition data.
Line 86: Can you specify why to stop in 2014? Because the original CAMELS data also have that time frame?
Material and Methods
Line 94: I would expect a citation to the NWIS data sources mentioned here.
Line 97: What is meant with the "two datasets"?
Line 97: "for each observation" seems unnecessary here.
Line 97: The match with the CAMELS station does not explain the promised "geographical coverage". Please adjust this sentence.
An option could be a reference to a figure showing the distribution of stations, maybe including the ones from the CAMELS data that are not amended with chemical data.
Line 102: From my point of view the 2.2 chapter on the methods is not informative and needs some revision. The reader does learn some fancy names but not what exactly was done with the data. The linkage of methods to the provided text-files remains unclear. This does not fit the previously stated wish to make everything reproducible by providing scripts. See also details below.
Line 103: I cannot follow this argumentation.
Line 109: Is ETL an established method? If so, please reference this. If not: Is there really a need for a fancy name for a standard data transformation method that uses a scripting language in a reproducible way?
Line 112: I do not see the link from this method to the resulting data provided, which are easy-to-use but simple text-files. This paragraph seems to be overly complicated.
Line 116: Similar as before. Either reference these as established methods or leave them out. These abbreviations are not used further in the manuscript, so why introduce them here?
Line 124: Is type of discharge match (paired daily, instantaneous...) indicated in the dataset?
Line 131: "once per day" at the same time? Is this what is meant here?
Line 135: Again a reference to the dataset would make sense for me. It would be, moreover, helpful for the readers to know about the absolute number of stations the interpolation is based on.
Line 137: I do not understand what "align" means in this context. Please be more specific.
Line 139: Somehow strange to mention Table 2 first in the text while Table 2 comes second.
Table 1: It would be helpful, if not done elsewhere, to state the USGS parameter codes as this often causes confusion. The caption is not fully matching the column names, e.g. is abbreviation=attribute? Also it would be good if this is also consistent with or clearly linked to the column names in the provided data files, which seems to be not the case.
Line 145: The "+" seems unnecessary.
Dataset description
Line 151: This description says all what is needed. No further need for a table here.
Line 151: Please use subscript in the constituent's abbreviations similar to the table where the constituents are introduced. This applies to the entire manuscript.
Table 3: I do not get the meaning and idea of Table 3.
Line 164: This chapter would profit from a table (maybe combined with table 1) that lists number of stations and number of observations and maybe median number of observations per station for each of the constituents.
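The counts requested for such a table could be derived from the released files with something like the sketch below. The record layout, one (gauge ID, constituent) pair per observation, is an assumption and not the actual column structure of the CAMELS-Chem files.

```python
from collections import defaultdict
from statistics import median


def summarize(records):
    """Summarize observations per constituent.

    records: iterable of (gauge_id, constituent) pairs, one per observation.
    Returns a dict keyed by constituent with the number of stations, the
    number of observations, and the median number of observations per station.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for gauge_id, constituent in records:
        counts[constituent][gauge_id] += 1

    summary = {}
    for constituent, per_station in counts.items():
        summary[constituent] = {
            "stations": len(per_station),
            "observations": sum(per_station.values()),
            "median_per_station": median(per_station.values()),
        }
    return summary
```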
Line 171: I was not aware of these varying foci. Is there a reference for that?
Line 175: This sentence needs to be checked - the logic is not clear.
Line 186: The previous chapter states reasons for different data density and distribution. Here, however, a spatial pattern in the deposition is suddenly described. This belongs elsewhere. The chapter should describe the meta-information only.
Trends in...
Line 226: "not" missing?
Line 234: This reads quite strange. You compute cq relationships to check if there are cq relationships? Consider revising.
Line 250: I do not get the meaning of "overlapping solutes" here.
Line 251: Isn't this part of the A (attributes) in the CAMELS dataset?
Line 127: I very much like the analysis of FDC coverage. For me this is, however, not just an addition to the cq analysis but rather part of the dataset description in the former chapters.
Line 269: "response of the relationship"? Consider revising.
Line 270: What are uneven collection dates?
Line 299: This statement seems to be unrelated to stream chemistry and thus not directly a result of CAMELS-Chem.
Line 302: This sentence is very long. What exactly is meant by large-scale response? The observed response is always local, isn't it?
Line 404: This sentence is redundant, please revise.
Code availability:
I think the authors may already give an idea which repository they are aiming at.
Citation: https://doi.org/10.5194/hess-2022-81-RC2
AC2: 'Reply on RC2', Adrian Harpold, 21 Jul 2022
We greatly appreciate the review comments of reviewer 2, particularly about better ways to frame the study and increase consistency and clarity. We understand your point about HESS versus ESSD, but would contend that the CAMELS database development has a track record in HESS (https://hess.copernicus.org/articles/19/209/2015/ and https://hess.copernicus.org/articles/21/5293/2017/), making it a better fit. Moreover, we demonstrate several new analyses that illustrate continental-scale patterns and link stream and atmospheric chemistry data, which are more appropriate for HESS. We can address this based on comments from the handling editor. We plan to address the detailed comments of the reviewer in our resubmission.
Citation: https://doi.org/10.5194/hess-2022-81-AC2
Viewed
- HTML: 782
- PDF: 394
- XML: 43
- Total: 1,219
- Supplement: 93
- BibTeX: 18
- EndNote: 23