This work is distributed under the Creative Commons Attribution 4.0 License.
Technical note: Overview and comparison of three quality control algorithms for rainfall data from personal weather stations
Abstract. The number of rainfall observations from personal weather stations (PWSs) has increased significantly in recent years; however, persistent questions remain about data quality. In this paper, an examination and comparison of three quality control algorithms (PWSQC, PWS-pyQC, and GSDR-QC) designed for the quality control of rainfall data is presented. The focus is on a series of rainfall events occurring in the Amsterdam area between May 2017 and May 2018. Quality issues observed include faulty zeros (i.e., the underreporting of rainfall), significant gaps in the dataset, and systematic bias, often caused by incorrect setup and installation of the PWS. The analysis shows that all three algorithms improve PWS data quality when cross-referenced against rain radar. The considered algorithms have different strengths and weaknesses depending on PWS and official data availability, making it inadvisable to recommend one over another without carefully considering the specific setting. The need for further objective, quantitative benchmarking of QC algorithms, which requires freely available test datasets representing a range of environments, gauge densities, and weather patterns, is highlighted.
Status: final response (author comments only)
RC1: 'Comment on hess-2023-195', Anonymous Referee #1, 24 Sep 2023
- AC1: 'Reply on RC1', Abbas El Hachem, 11 Dec 2023
RC2: 'Comment on hess-2023-195', Anonymous Referee #2, 13 Nov 2023
The article discusses three recently proposed algorithms for the quality control (QC) of rainfall data from personal weather stations (PWSs), namely PWSQC, PWS-pyQC, and GSDR-QC, and presents a case study application of the three approaches. The performances of the three methods are compared in terms of the resulting maps of interpolated rainfall depths and areal average precipitation over time, using some correlation and error metrics and some error and scatter plots. A radar rainfall product is used to obtain reference precipitation values. There is not a clear winner among the three tested QC algorithms, and the main findings of the study involve:
- all tested QC algorithms reduce the errors in the raw data from PWSs;
- all tested algorithms may fail to detect some faulty zero observations (i.e., unduly dry observations in the time series); GSDR-QC seems to be the most prone to this issue;
- local peaks in precipitation depth tend to be underestimated, when considering the spatial interpolation of precipitation depths from the corrected station values.
I personally find the overall contribution of this work a bit poor at this stage. The algorithms are well introduced, but I would like to read more detailed descriptions of their functionalities. The comparison of the performance of the three algorithms is based on a limited number of precipitation events from only one case study area, which does not help generalize the findings presented in the article. The findings themselves are not particularly compelling and well supported, as it cannot be excluded that they are determined by conditions that are specific to the selected case study area. Specifically, both the issues with detecting faulty zeros and the underestimation of the peak in precipitation depth apparently occur at locations of the study area which are affected by lack of stations and therefore may not be limitations intrinsic to the QC algorithms themselves.
Please find below a list of more detailed comments and suggestions to improve the work.
- For each map of interpolated rainfall depths (such as Fig. 2), please use markers (small, maybe points) to indicate the PWS stations (using a legend to distinguish between stations used and discarded by each QC algorithm) and include (on a separate sub-plot) the interpolation of the raw data not passed through QC. This would help visualize how the tested QC algorithms change the raw data and the impact of the stations that were not filtered out.
- Please include latitude and longitude in all maps. Use the same grid both in Fig. 1 (case study area with stations) and all maps of interpolated precipitation depths, to facilitate any comparison.
- In the discussion of the results, you consider the radar rainfall product (referred to as radar reference) as aprioristically correct, since the interpolated, quality-controlled results from the PWS network are directly compared to the former (Fig. 2 and 3), and the performance of the QC algorithms is measured in terms of how closely the derived rainfall depth maps match the radar product. However, I do not think that this is the best approach, conceptually. While data from PWSs may contain errors because of the different reasons listed in your Introduction, the dramatically higher spatial resolution of the PWS network, compared to the official station network (KNMI) used for the correction of the radar rainfall product (134 vs. 1 station, in your case study area), implies that the information obtained from the PWS network may result in a potentially more detailed representation of the spatial distribution of precipitation depths, compared to the radar rainfall product, at least in some locations and time steps. While there may be errors in the PWS time series due to the potentially non-optimal usage of some stations by non-professionals, the radar product is also likely affected by other types of errors and uncertainties, inherent to the procedures for deriving that product. As such, these two sources of rainfall information should not be used as if one were an axiomatic truth and the other had to closely match the former, but rather as complementary sources, where each one has its own advantages and limitations.
- When considering the radar rainfall product (referred to as radar reference) aprioristically correct (see previous comment), you risk introducing some bias in the comparison of QC algorithms, as one of them (PWS-pyQC) expressly requires a reference observation network (primary network) to perform most of the bias correction procedures, and your provided reference observation network is a random subsample (20 pixels) of the radar rainfall product itself. As the output from PWS-pyQC is then more likely to closely match the radar reference by construction (as it happens in Fig. 2), you might end up always considering PWS-pyQC as superior over the other QC algorithms in obtaining values of precipitation depth close to truth. The next point suggests performing a sensitivity analysis that might also help assess if there is indeed such a bias, and to what extent.
- Show on a map the random 20 pixels of the radar rainfall product used as the primary network for PWS-pyQC (line 83). Perform a sensitivity analysis trying with other random subsamples, also playing with the number of pixels considered. This may help shed more light on the sensitivity of the performance of PWS-pyQC to the size and configuration of the primary network. Specifically, the stability of the performance of PWS-pyQC for a range of different subsamples of fixed size (i.e., fixed number of pixels) will be an indicator of the robustness of the methodology, while any observed changes in the average performance with the number of pixels in the primary network will help get an idea of how much information from the primary network is required in general for applications on other case study areas.
- PWSQC and GSDR-QC both include some procedures that are based on neighbor-checks (lines 58, 61, and 95; Tab. 1). E.g., PWSQC detects faulty zeros at those stations with 0 mm observations but concurrent precipitation at some neighboring stations (line 58); PWSQC also detects outliers based on the correlations of concurrent records at neighboring stations (line 61). I was left wondering what the criteria are to identify neighboring stations, e.g., what is the maximum distance beyond which two stations would not be considered neighbors of each other? I would assume that there are some control parameters that the modeler can set when using the algorithms. Have you tried playing with them? It would be interesting to see a sensitivity analysis to these and any other control parameters that may be available to the modeler (if applicable). This would ensure that the comparison of the three QC algorithms is made when the optimal parameter settings are adopted for each of the tested approaches.
- Table 1 effectively summarizes the characteristics of the three QC algorithms. However, I feel that some of the concepts mentioned there (e.g., “Level of QC-allocation”, “WMO QC classification types”) also need to be briefly introduced in the body of the manuscript. I would also describe in a little more detail how the routines in each QC method work, e.g., are the neighbor-checks iterative processes? In general, I feel that a deeper understanding of alternative algorithms is important to guide informed decisions on which is the most suitable algorithm for the specific case at hand.
- One of the two main findings of the work involves the issue with the presence of faulty zeros, and in particular the lower performance of the GSDR-QC algorithm in detecting faulty zero observations, compared to the other two algorithms, as remarked, e.g., while commenting on Fig. 2 (line 143), or in the Discussion (line 180) and Conclusions (line 200) sections. These conclusions are based on the observation of “dry spots” (Fig. 2) in the interpolation maps obtained from the quality-controlled PWS data, as compared to the radar reference, which are most noticeable for GSDR-QC. However, it seems that many of those spots occur at locations with limited station availability, compared to the rest of the PWS network in the study area. E.g., focusing on Fig. 2b, c, and d, the dry spots are located in a big portion of the SW (south-west) quadrant and two smaller portions in the NW and SE quadrants of the case study area, where there are noticeable gaps in the station network (see Fig. 1). Hence, the lack of sufficient data may be an alternative explanation for the presence of dry spots, as the lack of stations may cause problems with the QC algorithms (e.g., with the neighbor-checks). In turn, this means that the observations of dry spots may not be connected a priori to the presence of any large number of faulty zeros in the original data (in this respect, the suggestion given in point 1 would be very helpful to give the right interpretation), but rather be a consequence of the specific experimental settings. The lack of a univocal interpretation of why dry spots are observed in the maps of interpolated precipitation depths prevents an objective assessment of the performance of the QC algorithms. What you can say at this stage is that PWSQC and PWS-pyQC are apparently more effective than GSDR-QC in avoiding underestimation of precipitation depths at locations with limited presence of stations.
I would be confident in concluding that GSDR-QC has a lower sensitivity to faulty zero observations only if those dry spots were observed in portions of the study area with plenty of stations, and where the raw data actually present some faulty zeros (but the raw data are not shown in this current version of the manuscript), so as to avoid any external source of bias and focus only on the intrinsic limitations of the algorithms (if any). It may be a good idea to consider more than one case study area and more events per area, to derive more robust, less biased conclusions about the performance of each QC algorithm. When selecting a larger number of events, I would suggest not limiting the search to those with significant rainfall all over the study area and for a long duration (as you state in line 108), as it may be more useful to test the QC algorithms under a wide range of heterogeneous conditions.
- The other main finding of the work involves the tendency to underestimate the peak in precipitation depth in the maps of interpolated values after performing the QC, as highlighted in the Discussion (line 175) and Conclusion (line 199) sections. However, the robustness of this conclusion is also, in my opinion, undermined by the presence of spots of low PWS network density in the case study area, as in many cases (e.g., Fig. 2, A1, A2) the underestimation of the precipitation depth occurs at locations affected by the lack of stations, as you also admit in line 176. Because of the specific conditions of the case study considered, the underestimation of the peak may be attributable to the lack of a sufficient number of stations in the location of the peak, and not to limits in the way the QC algorithms are designed (at least in principle). This article is about comparing three QC algorithms, but how can you assess the pros and cons of these if the precipitation event that you consider (event 4) displays its peak in a location where there seems to be no sufficient data coverage in the first place? To draw robust conclusions about how these algorithms work and compare their performances, you should try to have almost ideal conditions in terms of data coverage, so as to focus solely on any problems with the algorithms.
- How did you choose the rectangle for the case study? E.g., it seems that the distance between the edges of the study area and the most peripheral PWSs of the network is not the same on each side. Please describe how the boundary of the study area was outlined and provide the coordinates of its corners. It is worth noting that the choice of the rectangle may have some effects on the computation of the areal averages, and in turn on the plots in Figures 4 and C1.
- Why did you choose 15 validation locations from the radar reference (line 164)? For each of these locations, did you obtain one Pearson correlation metric by Eq. (1), and do the boxplots in Fig. 5 show the distribution of the values of this metric across the different validation locations? Or is it the other way around, i.e., the Pearson coefficient is computed across the different validation locations, and the boxplot shows the variability of the Pearson coefficient with time? Please clarify this in the manuscript.
- I personally find it a bit unclear how the presented results lead to some of the concluding remarks in the Conclusion. E.g., line 206: “the PWSQC algorithm is most useful where there is a dense PWS network, and the GSDR-QC is most appropriate in locations where the PWS network is sparse and comprises rain gauges from a range of manufacturers (resulting in a range of potential errors)”. Please elaborate more on that in the article.
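The sensitivity analysis suggested above for the PWS-pyQC primary network could be organised along the following lines. This is only a sketch: `evaluate` is a hypothetical stand-in for running PWS-pyQC with a given primary network and returning a scalar performance score, and the network sizes and repeat count are illustrative, not values used in the manuscript.

```python
import numpy as np

def subsample_pixels(n_pixels_total, n_primary, rng):
    """Draw a random primary network as indices of radar pixels."""
    return rng.choice(n_pixels_total, size=n_primary, replace=False)

def sensitivity_analysis(evaluate, n_pixels_total, sizes, n_repeats, rng):
    """Repeat the QC evaluation for several random primary networks and
    several primary-network sizes; `evaluate(primary_idx)` stands in for
    running PWS-pyQC with that primary network and returning a score."""
    results = {}
    for n_primary in sizes:
        scores = [evaluate(subsample_pixels(n_pixels_total, n_primary, rng))
                  for _ in range(n_repeats)]
        # The mean indicates average performance; the spread indicates
        # robustness to the random configuration of the primary network.
        results[n_primary] = (np.mean(scores), np.std(scores))
    return results
```

The stability of the score across repeats of a fixed size would speak to robustness, while the trend across sizes would indicate how much primary-network information the method needs.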
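As an illustration of the kind of neighbor-check discussed in the comments above, a minimal faulty-zero detector might look as follows. The 10 km radius, the wet threshold, and the minimum wet fraction are hypothetical placeholders, not the actual PWSQC parameter values, which is precisely why their documentation and a sensitivity analysis would be valuable.

```python
import numpy as np

def flag_faulty_zero(values, coords, station_idx, radius_km=10.0,
                     wet_threshold=0.1, min_wet_fraction=0.7):
    """Flag a 0 mm observation as a suspected faulty zero when most
    neighbouring stations within `radius_km` concurrently report rain.

    values : 1-D array of concurrent rainfall depths (mm), one per station
    coords : (n, 2) array of projected station coordinates (km)
    """
    if values[station_idx] > 0:
        return False  # only zero observations can be faulty zeros
    # Distances from the candidate station to all other stations
    d = np.linalg.norm(coords - coords[station_idx], axis=1)
    neighbours = (d > 0) & (d <= radius_km)
    if neighbours.sum() == 0:
        return False  # no neighbours: the check cannot be applied
    wet_fraction = np.mean(values[neighbours] >= wet_threshold)
    return wet_fraction >= min_wet_fraction
```

Note that in sparse parts of the network the check silently becomes inapplicable, which supports the point made above that dry spots may reflect station availability rather than algorithm design.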
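Regarding the question on Eq. (1) and the boxplots in Fig. 5: if the intended reading is one Pearson coefficient per validation location, computed over the time axis, it could be sketched as below, assuming arrays of shape (n_timesteps, n_locations).

```python
import numpy as np

def pearson_per_location(x, y):
    """Pearson correlation between two rainfall series, computed
    separately for each validation location over the time axis.

    x, y : arrays of shape (n_timesteps, n_locations)
    Returns n_locations correlation coefficients, whose distribution
    could then be shown as one boxplot per QC algorithm."""
    xm = x - x.mean(axis=0)
    ym = y - y.mean(axis=0)
    num = (xm * ym).sum(axis=0)
    den = np.sqrt((xm**2).sum(axis=0) * (ym**2).sum(axis=0))
    return num / den
```

Under the alternative reading (one coefficient per time step, computed across locations), the same function would be applied with the axes swapped; stating which axis is reduced would resolve the ambiguity.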
Minor comments:
- In Eq. (2) and Eq. (3), x and y without any hat or subscript are used but not defined.
- The CV in Eq. (3) is different from the typical definition for a single variable, as in the numerator there is the standard deviation of the difference between x and y (you use sigma as the standard deviation operator, correct?), while you consider the average y in the denominator. Please provide some more information.
- Line 149: “GSDR-QC shows the most remaining data after QC…”. I would rephrase it as “GSDR-QC retains more PWS stations, as compared to …”, or something like that.
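As a sketch of the CV definition queried above (the standard deviation of the residuals x − y, normalised by the mean of y), assuming that is indeed what Eq. (3) intends:

```python
import numpy as np

def cv_as_described(x, y):
    """CV as apparently defined in Eq. (3): the standard deviation of
    the residuals (x - y) normalised by the mean reference value.
    Note this differs from the usual single-variable definition,
    std(y) / mean(y), so an explicit statement would help readers."""
    return np.std(x - y) / np.mean(y)
```

A consequence of this definition is that a constant offset between x and y yields CV = 0, so the metric measures scatter, not bias; if that behaviour is intended, it would be worth saying so in the manuscript.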
Citation: https://doi.org/10.5194/hess-2023-195-RC2

- AC2: 'Reply on RC2', Abbas El Hachem, 11 Dec 2023
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote
---|---|---|---|---|---
595 | 209 | 20 | 824 | 18 | 15