Metamorphic Testing of Machine Learning and Conceptual Hydrologic Models
Abstract. Predicting the response of hydrologic systems to modified driving forces, beyond patterns that have occurred in the past, is of high importance for estimating climate change impacts or the effect of management measures. Such predictions require a model, but the impossibility of testing them against observed data makes it difficult to estimate their reliability. Metamorphic testing offers a methodology for assessing models beyond validation with real data. It consists of defining input changes for which the expected responses are assumed to be known, at least qualitatively, and of testing model behavior for consistency with these expectations. To increase the information gained and reduce the subjectivity of this approach, we extend the methodology to a multi-model approach and include a sensitivity analysis of the predictions to training or calibration options. This allows us to quantitatively analyse differences in predictions between model structures and calibration options, in addition to the qualitative test against the expectations. In our case study, we apply this approach to selected conceptual and machine learning hydrologic models calibrated to basins from the CAMELS data set. Our results confirm the superiority of the machine learning models over the conceptual hydrologic models regarding the quality of fit during calibration and validation periods. However, we also find that the response of machine learning models to modified inputs can deviate from the expectations, and that the magnitude and even the sign of the response can depend on the training data. In addition, even when all models pass a metamorphic test, the quantitative response can differ between model structures. This demonstrates the importance of this kind of testing, beyond the usual calibration-validation analysis, for identifying potential problems and stimulating the development of improved models.
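In essence, a metamorphic test couples an input perturbation with a qualitative expectation about the output, such as lower mean discharge under uniform warming due to increased evapotranspiration. A minimal sketch of such a test, assuming a hypothetical `model.simulate(precip, temp)` interface that returns a daily discharge series (an illustration, not the authors' implementation):

```python
import numpy as np

def metamorphic_test_warming(model, precip, temp, delta_t=2.0):
    """Metamorphic test: with precipitation unchanged, a uniform warming of
    delta_t degrees is expected not to increase long-term mean discharge.
    `model.simulate(precip, temp)` is a hypothetical interface returning a
    simulated discharge series for the given daily forcing."""
    temp = np.asarray(temp, dtype=float)
    q_ref = np.asarray(model.simulate(precip, temp))             # reference run
    q_warm = np.asarray(model.simulate(precip, temp + delta_t))  # perturbed run
    rel_change = (q_warm.mean() - q_ref.mean()) / q_ref.mean()
    return rel_change <= 0.0, rel_change  # qualitative pass/fail and magnitude
```

Returning the magnitude alongside the pass/fail flag is what enables the quantitative multi-model comparison described in the abstract: two models can both pass the test while responding with very different sensitivities.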
Status: closed
CC1: 'Comment on hess-2023-168', Scott Steinschneider, 29 Jul 2023
I read this paper with interest, as I agree that such metamorphic tests on ML hydrologic models are needed to assess their appropriateness for certain hydrologic modeling applications like projections under climate change.
My main comment is that I think the authors could contextualize their study with past work that has conducted a similar exploration. The first paper that I am aware of which attempted a metamorphic test on an LSTM was Razavi (2021) (see their Figure 11). They only considered an LSTM fit to one site, so there are limitations to that work, but I think it's important to recognize it. Afterwards, Wi and Steinschneider (2022) conducted a metamorphic test similar to the one in the present study, using 1) an LSTM and physics-informed LSTMs fit to 15 sites across California, as well as 2) an LSTM fit across the entire CAMELS dataset. They found challenges with LSTM projections under warming related to those found in this work.
Therefore, I recommend that the authors adjust their Introduction to recognize these past studies, and then articulate how their work provides a contribution beyond them. I believe this is very straightforward, as the present study 1) considers changes in precipitation as well; 2) explores responses separately by basin elevation and temperature; and 3) explores sensitivity to calibration choices (this latter one was particularly helpful to see). In addition, I might adjust the Summary and Conclusion to discuss the results of the present study in comparison to the metamorphic results seen in Wi and Steinschneider (2022), in order to help synthesize related results in the literature.
References
Razavi, S. (2021). Deep learning, explained: Fundamentals, explainability, and bridgeability to process-based modelling, Environmental Modelling and Software, 105159, https://doi.org/10.1016/j.envsoft.2021.105159.
Wi, S., & Steinschneider, S. (2022). Assessing the physical realism of deep learning hydrologic model projections under climate change. Water Resources Research, 58, e2022WR032123. https://doi.org/10.1029/2022WR032123
Citation: https://doi.org/10.5194/hess-2023-168-CC1
CC2: 'Reply on CC1', Peter Reichert, 03 Aug 2023
Thank you very much for these hints. We will expand our introduction and conclusions as recommended.
Citation: https://doi.org/10.5194/hess-2023-168-CC2
AC1: 'Reply on CC1', Peter Reichert, 12 Feb 2024
RC1: 'Comment on hess-2023-168', Anonymous Referee #1, 13 Dec 2023
Reichert and co-authors describe a study in which they test the qualitative response (metamorphic testing) of two conceptual hydrologic models and a deep learning model trained on the CAMELS-US dataset to perturbed temperature and precipitation, aimed at mimicking the qualitative performance of these models under climate change scenarios. The deep learning model (LSTM) outperforms the conceptual models during calibration and validation but exhibits unexpected hydrologic responses in low-elevation basins when temperature is perturbed. Training solely on the low-elevation basins from CAMELS-US improved this qualitative response to the perturbed temperature, suggesting that fine-tuning, or limiting the training data to the prediction task, may help improve out-of-bounds predictions. I provide some comments below that I think will improve the manuscript.
- Generally, I encourage the authors to summarize results across all the basins used for metamorphic testing rather than providing individual plots for every basin in the supplement. There does not appear to be a reason why certain basins are displayed in the main text vs. the supplement. Summarizing across all basins will help the reader understand whether the pattern is common without having to look through dozens of individual plots in the supplement.
- Line 40 and elsewhere: Instead of ‘modified driving forces’, could you use something more generic like ‘out-of-bounds predictions’?
- Line 114-115: the initial clause seems a bit clunky. Would it be clearer to either remove the first clause entirely or consider placing it as a separate sentence?
- Lines 120-139: is this a flat 10% increase in precipitation for every precipitation data point in the dataset? What if there is no rain on a given day? I assume it will still be a 0 precipitation increase given equation 1, but clarification would be helpful (see the sketch after this list).
- Line 160: can you give examples of what these precipitation or temperature related attributes would be?
- Line 168: what was the validation NSE range for these catchments?
- Line 213: I encourage the authors to include a link to the working repository at the moment, or a draft code release.
- Figure 2: describe in the legend what the numbers next to the points represent.
- Line 253: clarify that the sensitivities are in relation to the outlet discharge and not the overall model performance – ‘sensitivities of the models are essentially negative’ makes it sound like the models had a poor/unexpected outcome, but this is quite the opposite. I suggest changing to something like, “The predicted outlet discharge for the GR4neige and HBV models was lower with increased temperature, which is expected … “
- Figure 3: Please change the colors of the third panel to be different from the colors used to indicate the different types of models in the other panels. It is confusing to switch the meaning of the colors within the same figure.
- Figure 3: for the top panel, why does the y-axis extend to -4 when there are no negative values? It also seems that the maximum differences are cut off at the top of the y-axis; please extend it higher.
- Figure 3: for the fourth panel, indicate what the black circles are – I assume they are observations.
- Figure 4: please include the full legend here so the reader doesn’t have to refer to a separate figure
- Line 276: change “rainfall” to “precipitation” as a lot of this precipitation is falling as snow in this basin and other basins.
- Figure 4: it is hard to distinguish the different model traces here, and I cannot tell what I'm supposed to take away from the bottom panel. I suggest splitting the LSTM traces into a separate panel and/or showing the deviation of the LSTM_x compared to the base LSTM results.
- Lines 395-402: I'm curious whether the authors tried pre-training on the entire dataset with an early stopping criterion so as not to overfit, and then fine-tuning on the reduced dataset with only low-elevation basins. That seems like it might be the best of both worlds – providing better fits by using more data but also passing the metamorphic test.
- Lines 398-399: by how much did the quality of fit deteriorate? By 1%, 30%? I suggest adding in a quantitative measure so the readers can evaluate how much the tradeoff is for passing the metamorphic test with less data.
- Lines 412-414: See Topp et al. 2023 https://doi.org/10.1029/2022WR033880 for comparison of ML architectures to prediction in unseen conditions. They suggest ML process/physics guidance helps improve predictions in unseen conditions. Likewise, see Read et al. 2019 https://doi.org/10.1029/2019WR024922 for out-of-bounds predictions using process-guidance for ML models.
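Regarding the precipitation question above (Lines 120-139): if the perturbation is the multiplicative scaling that equation 1 suggests, dry days remain dry, since a 10% increase of zero is still zero. A minimal sketch of such a forcing perturbation, as an illustration of that reading rather than the manuscript's exact implementation:

```python
import numpy as np

def perturb_forcing(precip, temp, delta_p=0.10, delta_t=0.0):
    """Multiplicative precipitation scaling and additive temperature shift:
    P' = (1 + delta_p) * P, so days with P = 0 stay at exactly 0;
    T' = T + delta_t (a uniform shift in degrees)."""
    precip_pert = (1.0 + delta_p) * np.asarray(precip, dtype=float)
    temp_pert = np.asarray(temp, dtype=float) + delta_t
    return precip_pert, temp_pert
```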
Citation: https://doi.org/10.5194/hess-2023-168-RC1
AC3: 'Reply on RC1', Peter Reichert, 12 Feb 2024
RC2: 'Comment on hess-2023-168', Joel Harms, 16 Jan 2024
I believe this is a significant paper and I am looking forward to its publication; however, I do believe some improvements can and should be made first. The paper lacks a bit of specificity when it comes to implementation details of the metamorphic testing approach. In particular, I find the choice of model types in the multi-model approach important, and its merits are not really covered in the paper until the conclusion section. I would suggest this be addressed in a more sophisticated way around lines 88-89. Other considerations should also be specified, and guidance provided as to what correct versus incorrect implementation entails; at the moment the paper is a bit too broad on this front.
Please see more details below:
Line 19: Odd source placement.
Lines 24 and 30: please specify whether you are talking about deep learning, and if you are, clarify that right away. Additionally, at its current location the Shen (2018) citation seems out of place.
Line 27: Good that you point this out, please also provide some examples.
Lines 33 to 36: Seems out of place, maybe move or integrate into discussion meaningfully.
Lines 82 and 85-88: I am glad you point these out!
Line 88-89: I was hoping for more detail.
Line 137: Re-cite in the bracket
Line 150: Reduce the overall size of this paragraph and the one around line 130 by focusing only on the basins that will appear in the study. Perhaps a more concise mention near Line 170 and 175 of the expected effects would be best. A schematic can also be useful in showing the expected effects for the basins in your study.
Line 195: Please specify the choice of optimizer.
Line 220: Remain consistent throughout with your word-choice of training versus calibration. (However, you should keep the statement equating these terms for researchers from different backgrounds that may be used to one or the other.)
Figure 2: Use symbols so that the basin classification remains distinguishable in a black-and-white version.
Figures 3/4/5: Particularly the top two panels are difficult to read in black and white; double-check with the editor whether this needs to be adjusted.
Line 323-324: Perhaps use a different loss function with guaranteed convexity to avoid this problem. You can of course still use NSE for evaluation to stay consistent with the other models. (I would recommend you do this at least for the reanalysis; see the sketch after this list.)
Line 363: Instead of "was not investigated" saying "this was not the main aim" would be better.
Line 412-414: Make these claims and future research directions more specific. Why ML and which research will be necessary to achieve this?
Last paragraph: New ideas should not appear for the first time in the conclusion, as is done here. Citations should also not be in the conclusion section for this reason. Please mention these points earlier in the discussion and just summarize them in the conclusion.
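One way to read the loss-function suggestion above (Line 323-324 item): calibrate with a plain quadratic loss, which is convex in the simulated discharge, and report NSE only for evaluation. For a fixed observation series the two are directly related, NSE = 1 - MSE / Var(obs), so a lower MSE on a basin always corresponds to a higher NSE there. A minimal sketch, as an illustration rather than the manuscript's setup:

```python
import numpy as np

def mse(sim, obs):
    """Mean squared error: convex in the simulated values, usable as a
    training/calibration loss."""
    sim, obs = np.asarray(sim, dtype=float), np.asarray(obs, dtype=float)
    return float(np.mean((sim - obs) ** 2))

def nse(sim, obs):
    """Nash-Sutcliffe efficiency, kept for evaluation only:
    NSE = 1 - SSE / sum((obs - mean(obs))^2); 1 is a perfect fit,
    0 is no better than predicting the observed mean."""
    sim, obs = np.asarray(sim, dtype=float), np.asarray(obs, dtype=float)
    return float(1.0 - np.sum((sim - obs) ** 2)
                 / np.sum((obs - obs.mean()) ** 2))
```

Because NSE is a monotone function of MSE for a single basin, the difference between the two choices only matters when the loss is aggregated across basins with different flow variances.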
Thank you for your submission! I am looking forward to reviewing the revised version of this manuscript!
Citation: https://doi.org/10.5194/hess-2023-168-RC2
AC2: 'Reply on RC2', Peter Reichert, 12 Feb 2024
Viewed

| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 806 | 272 | 33 | 1,111 | 53 | 25 | 22 |