Comment on hess-2021-350

In this manuscript, Hou et al. estimated water storage dynamics for more than 6,000 reservoirs worldwide from 1984 to 2015 using a combination of Landsat imagery, radar satellite altimetry, and geostatistical modeling. They also analyzed the patterns of increasing and decreasing trends globally. Finally, they attributed reservoir storage changes to climate and human variables and found that precipitation and river inflows largely dominated reservoir storage changes.

I find this a very interesting study. Previous studies provided long-term storage changes for only dozens of reservoirs, so it is great to see a global dataset of more than 6,000 reservoirs compiled here. The attribution of reservoir storage changes can potentially inform local to regional water resources management. However, I have major concerns about the quality of the global dataset and the methodology applied to attribute the storage changes.

While this study produces storage changes for a greater number of reservoirs globally, I do not think the authors fully addressed the limitations that prevented previous studies from documenting reservoir storage dynamics with better spatial coverage. The authors estimated storage changes for the 132 large reservoirs with both water areas and levels without assessing their consistency. Without a high correlation between water areas and levels, it makes no sense to me to combine the two to deduce storage changes. The authors need to refer to existing studies (e.g., Busker et al.) on quality control before simply combining satellite observations.

The authors used a geostatistical method to estimate the storage changes in the vast majority of reservoirs, about which I have an even greater concern. The authors need to be aware that the mean depth archived in the HydroLAKES dataset is the ratio of total volume to maximum lake area. The mean depth does not provide any meaningful information about the actual water depth. Additionally, the geostatistical model adopted by Messager et al. is a spatial model measuring the relationship between total storage and maximum area for a large group of water bodies. The authors tried to use its outcome (e.g., mean depth) to estimate storage changes in each individual reservoir, which differs from the purpose of Messager et al. Unless the authors provide a comprehensive validation, I am not convinced the proposed method is feasible for estimating storage changes for the majority of the reservoirs studied here.

The presented attribution of reservoir storage changes seems so simplified that I have several concerns. First, the authors simply compared the direction of the trend in reservoir storage with that in potential drivers, but this analysis demonstrates only coincidence, not causation. Second, the authors conclude that evaporation did not significantly impact reservoir storage, but the evaporation calculation is too simplistic. The authors may need to use more advanced approaches (e.g., Zhao and Gao) in order to draw a confident conclusion. Third, reservoirs, particularly large ones as documented in the GRanD dataset, are highly regulated by humans. The authors depend on the outputs of global models to estimate human water release from reservoirs. Are these data really reliable enough to produce a trend in human release for each reservoir? In sum, the authors need to pay careful attention to these limitations, which potentially affect their conclusions.

Line 89: It is hard to understand "coefficient of determination" here. Could you define or explain it?
Line 120: I do not quite understand the purpose of showing the correlation between A0 and the calculated V0 (based on A0). It makes more sense to me to show the correlation between A0 and h0, as these two are independent estimates. The authors may need to consider only cases with a Pearson's R greater than 0.8 (or R2 higher than 0.6), as the correlation between A-L or A-V should be pretty high; otherwise it indicates substantial uncertainty in the data sets.
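To make the suggested screening concrete, here is a minimal sketch of an area-level consistency check. The function names, the sample values, and the use of 0.8 as a hard cutoff are illustrative assumptions on my part, not the authors' procedure or data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def passes_screening(areas, levels, r_min=0.8):
    """True if paired area/level observations are correlated at r >= r_min.

    r_min=0.8 mirrors the threshold suggested in the comment above; it is
    a judgment call, not a universal standard.
    """
    return pearson_r(areas, levels) >= r_min

# Hypothetical paired observations (illustrative values only).
consistent_area = [100.0, 110.0, 120.0, 130.0, 140.0]   # km^2
consistent_level = [50.0, 51.1, 51.9, 53.0, 54.1]       # m
noisy_area = [100.0, 110.0, 120.0, 130.0, 140.0]        # km^2
noisy_level = [52.0, 50.0, 53.5, 49.5, 52.5]            # m

for name, a, h in [("consistent", consistent_area, consistent_level),
                   ("noisy", noisy_area, noisy_level)]:
    verdict = "keep" if passes_screening(a, h) else "flag for review"
    print(f"{name}: r = {pearson_r(a, h):.2f} -> {verdict}")
```

A reservoir flagged by such a check would warrant a closer look at the area or level series before the two are combined into a storage estimate.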
Line 135: The equation does not make sense to me. The authors need to provide more detail on its rationale.
Line 150: "Only 132 reservoirs with both area and level observations….". Are the conclusions based on the 132 reservoirs or on all reservoirs, the majority of which do not have both observations?
Line 164: MSWEP v1.1 does not appear to be the latest version of the dataset.
Line 192: The authors validated only 1% of the studied reservoirs, and the validation samples are located in the U.S. only, which could be a concern.
Line 194: What do you mean by "published"? The authors use Pearson's R (correlation) for validation, which does not give insight into the accuracy of the estimated values.
Line 215: "this was almost entirely explained by positive trends for the two largest reservoirs in the world, Lake Kariba (+0.8 km3 yr-1) on the Zambezi River and Lake Aswan". This statement is confusing. I know of some completed megadam projects in China and Brazil, such as the Three Gorges Dam.
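As a small illustration of the Line 194 point that correlation alone says little about accuracy, the sketch below builds a hypothetical estimate that is perfectly correlated with the observations yet badly biased. All values and function names are invented for illustration and are not taken from the manuscript.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_bias(est, obs):
    """Mean of (estimate - observation)."""
    return sum(e - o for e, o in zip(est, obs)) / len(obs)

def rmse(est, obs):
    """Root-mean-square error of the estimates."""
    return math.sqrt(sum((e - o) ** 2 for e, o in zip(est, obs)) / len(obs))

# Hypothetical in situ storage (km^3) and a linearly distorted estimate.
observed = [10.0, 12.0, 15.0, 11.0, 9.0]
estimated = [2.0 * v + 5.0 for v in observed]

print(f"Pearson r = {pearson_r(estimated, observed):.3f}")
print(f"mean bias = {mean_bias(estimated, observed):.2f} km^3")
print(f"RMSE      = {rmse(estimated, observed):.2f} km^3")
```

Reporting bias and RMSE (or a relative error) alongside R would show whether the estimated magnitudes, and not only their temporal pattern, agree with the in situ records.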
Line 219: "while 948 reservoirs showed increasing trends, distributed in northern North America and southern Africa". The reported hotspots of increasing reservoir storage are inconsistent with the patterns of recent dam booms.
Figure 4: This map is confusing to me. For example, China may have led the world in dam construction during the study period. Why did its reservoir storage decrease? Are the data shown correctly in this map?
Line 245: "We summed storage for individual reservoirs to calculate combined storage in 134 river basins worldwide". Do reservoirs in the same river basin show a similar pattern of storage change? Would it be more meaningful to analyze each of them individually?
Line 268: "In summary, we did not find evidence for widespread reductions in reservoir water storage due to increased releases". Reservoir storage increase could be a result of increased impoundments. Did you consider that?
Line 339: As Zhao and Gao used contaminated Landsat imagery to increase the monthly coverage of reservoir areas by 80%, do the estimates from poor-quality images affect your storage analysis? I know some studies (e.g., Busker et al.) adopted only good-quality images because of this issue.