Evaluating different machine learning methods to simulate runoff from extensive green roofs

Abdalla, Elhadi Mohsen Hassan; Pons, Vincent; Stovin, Virginia; De-Ville, Simon; Fassman-Beck, Elizabeth; Alfredsen, Knut; Muthanna, Tone Merete

doi:https://doi.org/10.5194/hess-25-5917-2021

Articles | Volume 25, issue 11

© Author(s) 2021. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article

15 Nov 2021

Final revised paper (published on 15 Nov 2021)
Preprint (discussion started on 10 Mar 2021)

Interactive discussion

Status: closed

RC1:
'Comment on hess-2021-124', Anonymous Referee #1, 20 Apr 2021

Summary

This paper compares the performance of four machine learning algorithms (including a deep learning one) in simulating runoff from green roofs, and benchmarks them against a conceptual model. The comparison uses data from sixteen green roofs located in four Norwegian cities, and the compared algorithms are the Artificial Neural Network (ANN), M5 model tree, Long Short-Term Memory (LSTM) and k-Nearest Neighbour (kNN). Additional investigations focus on the transferability of the algorithms between different green roofs. The results show that the performance of the investigated algorithms is acceptable; however, the conceptual model should be preferred over the transferred machine and deep learning algorithms.

General comments

Overall, I believe that the paper is meaningful, interesting and mostly well-written with room for improvements.

Although my comments are few, I recommend major revisions, as the suggested improvements (mainly those in specific comment #1) are, in my view, both important and necessary for the model comparison (and the entire paper) to reach the best possible shape.

Specific comments

1) In line 246, it is written that the “methods were evaluated based on the performance on the validation data sets”. However, in line 221 it is written that “to avoid overfitting, the performance of changing hyperparameters was observed in the validation periods”. As the validation set has been used for hyperparameter selection (i.e., for identifying the best version of each machine learning algorithm), the addition of an extra independent set (i.e., a test set that is not used for model selection) is necessary here. This extra set will serve the independent comparison between machine learning algorithms, as well as the independent comparison between machine learning algorithms and the conceptual model. Therefore, the datasets should be divided into (at least) three independent sets (containing different data points), i.e., the training, validation and test sets.

2) Moreover, it would be better (but not strictly necessary, to my view) that the datasets are divided into four independent sets (i.e., the training, validation 1, validation 2 and test sets), as time lag selection also takes place according to the following lines: “Secondly, the structural parameters were fixed, and different lag values ranging from 1 hour to 200 hours were tested to identify the optimal lag value” (lines 219−220).

3) In lines 217−216, it is written that “BERG1, OSL1, SAN1 and TRD1 roofs were selected to test different hyperparameters to find the optimal parameters for each city”. Would it be better to select different hyperparameters for each roof?

4) In lines 209−211, it is written that “data were aggregated into one-hour resolution, and snow accumulation periods were excluded (1 Oct. – 31 Mar.). One year was used for training and one year for validation. The selection of the training year was based on the sum of precipitation as the wettest year between 2015 to 2017 for each roof, and the second wettest year for validation. The rationale for the selection is that the wettest year covers a broader span of precipitation events which improves the generalization performance of the models”. In my view, it would be better if the training and validation periods for all green roofs were presented in a new table.

5) Also, I think that –at least in the supplement− it would be interesting to show what happens when one uses the entire datasets (i.e., without excluding the snow accumulation periods or other periods), and not selected parts of these datasets.

6) I find that some important literature pieces on data-driven hydrological modelling (e.g., some of the oldest works in the field) are currently missing from the manuscript’s reference list.

7) Lastly, since the manuscript is not typo-free at the moment, a careful reading and typo correction are required. For instance, something is currently wrong with the section numbering (“2 Data”, “2.1 Machine learning models”, “3 Results and Discussion”). Also, there are typos in the units, symbols and equations, which should be written according to the following conventions:

• Single-letter variables should be written in italics.

• Multi-letter variables should not be written in italics.
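
The three-way split requested in comment 1 can be combined with the precipitation-based year selection quoted in comment 4. The sketch below is only an illustration of that idea, not the procedure used in the manuscript: the precipitation totals are invented, and the names `season_precip_mm`, `in_snow_free_season` and `assign_periods` are hypothetical. It assumes three candidate years (2015 to 2017), keeps only timestamps outside the excluded period of 1 Oct. to 31 Mar., and assigns the wettest year to training, the second wettest to validation and the remaining year to testing.

```python
from datetime import datetime

# Hypothetical seasonal precipitation totals (mm) for one roof.
# The numbers are invented for illustration only.
season_precip_mm = {2015: 980.0, 2016: 1120.0, 2017: 1040.0}


def in_snow_free_season(timestamp):
    """Keep April to September, i.e. drop the excluded 1 Oct. to 31 Mar. period."""
    return 4 <= timestamp.month <= 9


def assign_periods(precip_by_year):
    """Rank years from wettest to driest and give each data set its own year."""
    ranked = sorted(precip_by_year, key=precip_by_year.get, reverse=True)
    if len(ranked) < 3:
        raise ValueError("need at least three years for train/validation/test")
    return {"train": ranked[0], "validation": ranked[1], "test": ranked[2]}


periods = assign_periods(season_precip_mm)
print(periods)
print(in_snow_free_season(datetime(2016, 10, 1)))  # inside the excluded period
```

Because each set is a different year, the test year is never seen during training or hyperparameter selection, which is the independence the comment asks for.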

Citation: https://doi.org/10.5194/hess-2021-124-RC1
- AC1: 'Reply on RC1', Elhadi Abdalla, 06 Jun 2021
  
  We would like to thank the reviewer for the thoughtful and constructive comments. Our responses can be found in the attached letter.
  
  Citation: https://doi.org/10.5194/hess-2021-124-AC1
RC2:
'Comment on hess-2021-124', Anonymous Referee #2, 14 May 2021
General Comments:

First, I want to apologize to the authors for my late review, which was due to unexpected issues. The present study presents a numerical analysis comparing the performance of multiple machine learning techniques against conceptual models for the hydrological analysis and forecasting of green roof (GR) behavior. The aim of the paper is interesting and of relevance for HESS readers. However, I find that the paper has multiple weaknesses:

There are multiple bold statements against the use of physically-based models for GR analysis, which are not supported by evidence and not needed in the manuscript, which should simply stick to its aim: assessing the performance of ML techniques for GR analysis. Instead of reinforcing the paper, these statements draw attention to other aspects, which are highly debatable. No numerical tool is perfect for everything, or inherently better than another. It is up to the modeler to choose the right model for the specific modeling task.

The emulators training is performed by using the trial-and-error technique, which is an outdated and inefficient methodology. This is especially true for this task since the response surface in the hyperparameters’ space can be multimodal, thus making it easy to get trapped in local minima. Furthermore, the uncertainty of the estimated hyperparameters should be properly assessed and eventually propagated in the validation step. The way it is handled in the paper (manually changing hyperparameters) is weak.

Since the authors calibrate (manually, but still calibrate) the emulators and compare them with a conceptual model, the latter should be calibrated as well to conduct a fair comparison. This was not done.

Specific Comments:

L2-5 In my opinion, there is a general misunderstanding in this field, which is reiterated in multiple manuscripts: the idea that conceptual models are always computationally cheaper than physically-based models for the hydrological analysis of GRs. Except in particular circumstances, the computational cost is comparable. For instance, the authors can verify for themselves that HYDRUS-1D, a mechanistic hydrological model frequently used in GR analysis, takes less than a few seconds for a long-term hydrological simulation. Conversely, for the same task, some conceptual models can be even more computationally expensive if the code is developed in Excel or in high-level programming languages. Therefore, I would not build the premise of the work on this.

L2-5 Regarding the complexity, we should first define what is complexity (number of parameters, number of processes, etc). This is again questionable.

Measurements: This is true and implies that conceptual models are not easily generalizable.

L20-25 “Improving quality” is a bold statement. There is an extensive literature about nutrients leaching from GRs.

L30 Why bold font?

L35-40 I don’t agree with these statements. Mechanistic models actually rely on a large body of literature, which can be used to set the model parameters. For instance, the van Genuchten parameters can be obtained with pedotransfer functions (using particle size distribution and other information from the producer) or set according to the several studies that have already been performed. The unsaturated conductivity is needed in the Richards equation just as the soil water retention curve is; there is no difference. The acceptable level of uncertainty depends on the analysis (in dry conditions the magnitude of fluxes is low, so K is not prominent).

L55 Computational cost: As I stated before, I don’t agree with this.

L75-80 MLs are not uncertainty-free.

L115-120 “Green Roof runoff” should be “Green Roof subsurface runoff” to avoid misunderstandings.

I would just say “when observations are not available”

L168. “Trial-and-error” This is not true. A correct ANN training should use numerical optimization to identify the right set of hyperparameters since

Section 2.2 I’m not sure that you can simply neglect the physical properties of GRs. This might be borderline acceptable for extensive GRs, but morphological and hydraulic characteristics will play an important role as the soil substrate depth increases. This is also acknowledged in one of the latest papers from the same authors (Peng et al., 2020), and it is rather intuitive. I would be curious to see how the emulators behave when splitting the sample between thin and thick roofs. This would certainly deliver more meaningful information to the community.

L210 The validation should be performed on a drier year to really assess the generalizability of emulators.

L210-215 The optimal hyperparameters should be calibrated numerically, since you can easily end up trapped in a local minimum (10.1016/j.jhydrol.2005.03.013). This is true for all emulators.

The use of Latin hypercube sampling doesn’t solve the problem. You have a better coverage of the parameter space but, unless you use a global optimization strategy, you can still be trapped in local minima.
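
The local-minimum concern can be illustrated on a toy problem. The sketch below is only an illustration under stated assumptions: the double-well objective `toy_loss`, the step-halving `local_search` and the stratified starting points from `stratified_starts` are all hypothetical, and none of them represents the manuscript's tuning procedure or a real hyperparameter response surface. A single local search started at x = 1.5 settles in the shallower well, whereas repeating the same search from one start per stratum (a one-dimensional Latin hypercube) also reaches the deeper well.

```python
import random


def toy_loss(x):
    """Hypothetical double-well objective: local minimum near x = 0.96,
    lower (global) minimum near x = -1.04."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def local_search(f, x0, step=0.5, tol=1e-6):
    """Step-halving descent: move while the loss improves, else shrink the step."""
    x = x0
    while step > tol:
        if f(x + step) < f(x):
            x += step
        elif f(x - step) < f(x):
            x -= step
        else:
            step /= 2.0
    return x


def stratified_starts(lower, upper, n, seed=0):
    """Draw one random start per equal-width stratum (1-D Latin hypercube)."""
    rng = random.Random(seed)
    width = (upper - lower) / n
    return [lower + (i + rng.random()) * width for i in range(n)]


# A single start on the right-hand side gets trapped in the shallower well.
single = local_search(toy_loss, 1.5)

# Multi-start: run the same local search from every stratified start, keep the best.
candidates = [local_search(toy_loss, s) for s in stratified_starts(-2.0, 2.0, 8)]
best = min(candidates, key=toy_loss)

print(single, toy_loss(single))
print(best, toy_loss(best))
```

Multi-start sampling lowers the risk of being trapped but gives no guarantee in higher dimensions, which is consistent with the referee's point that space-filling sampling alone is not a global optimization strategy.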

L220 What are the structural parameters?

L221 What you attempt to do is to investigate how small changes in hyperparameters affect the response of the emulator, i.e., how the uncertainty in the estimated hyperparameters propagates (you see that ML techniques are not uncertainty-free). This should have been done more rigorously by numerically optimizing the ML parameters and estimating (at least) their confidence intervals. Even better would have been to use Bayesian inference to estimate the posterior uncertainty (e.g., 10.1016/j.jhydrol.2011.09.002).

L2.3 Why report all these equations, which are already given in other studies from the same authors? Cite them and move forward.

L228 “Without the need of prior calibration…” This sounds puzzling to me. In the Introduction you write “calibration is needed to find their optimal values, unlike physically-based models”, which is true since conceptual models generally need site-specific calibration. If the conceptual model parameters were not previously calibrated in other studies for the same site, then they should be calibrated here to conduct a fair comparison with the trial-and-error optimized MLs.

L3.1 For the reasons that I mentioned above, I consider this way of training emulators not formally correct and scientifically outdated.

L331 This can be said only when you perform a scientifically sound calibration and uncertainty assessment of both models. Neither was carried out; furthermore, the conceptual model was not calibrated, so the comparison is not fair.

L333-335 Not sure what you refer to with “…accommodate complex, multi-layered systems”. These are bold statements not supported by evidence, which should be avoided since they don’t contribute to the discussion unless they are proven.
Citation: https://doi.org/10.5194/hess-2021-124-RC2
- AC2: 'Reply on RC2', Elhadi Abdalla, 06 Jun 2021
  
  We would like to thank the reviewer for the thoughtful and constructive comments. Our responses can be found in the attached letter.
  
  Citation: https://doi.org/10.5194/hess-2021-124-AC2
- AC3: 'Reply on RC2', Elhadi Abdalla, 06 Jun 2021
  
  We would like to thank the reviewer for the thoughtful and constructive comments. Our responses can be found in the attached letter.
  
  Citation: https://doi.org/10.5194/hess-2021-124-AC3

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

ED: Reconsider after major revisions (further review by editor and referees) (16 Jun 2021) by Christa Kelleher

AR by Elhadi Abdalla on behalf of the Authors (28 Jul 2021) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (10 Aug 2021) by Christa Kelleher

RR by Anonymous Referee #1 (24 Aug 2021)

RR by Anonymous Referee #3 (06 Oct 2021)

Suggestions for revision or reasons for rejection

In the manuscript, the authors evaluated the performance of 4 machine learning (ML) approaches in simulating runoff from green roofs. The experiments include 16 green roofs in 4 cities with different environmental conditions. For modeling the retention of green roofs, the authors highlighted the advantage of the ML methods in comparison with a conceptual model. For modeling runoff, most models achieved a promising performance (NSE>0.5). The authors also examined the transferability of ML models between green roofs and concluded that those models could be transferred between cities with similar rainfall event characteristics.

As an extra reviewer who accidentally reviewed the out-of-date version, I am impressed by the substantial amount of work that the authors have done to improve the manuscript. The current version of the manuscript is clearly written and resolves most of the concerns I had with the previous version. I found several technical issues that I detail below, but most of them are easy to fix in my opinion.
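
For readers less familiar with the NSE threshold quoted above, the Nash-Sutcliffe efficiency is one minus the ratio of the squared simulation error to the squared deviation of the observations from their mean, so 1 is a perfect fit and 0 means the model is no better than predicting the observed mean. The function below is a minimal sketch of that standard definition; the name `nse` and the sample runoff values are hypothetical and are not taken from the manuscript.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - sum((obs - sim)^2) / sum((obs - mean_obs)^2)."""
    if not observed or len(observed) != len(simulated):
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - error / spread


# Hypothetical hourly runoff values (mm/h), for illustration only.
obs = [1.0, 2.0, 3.0, 4.0]
print(nse(obs, obs))        # perfect simulation
print(nse(obs, [2.5] * 4))  # simulation equal to the observed mean
```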

L25: I like the highlight of retention and detention, as the first round manuscript did.
L190: The authors should state clearly how the distance is calculated in this work, instead of using "such as".
L210: The description does not agree with the precipitation amounts in Table 2. For example, at Bergen and Oslo the drier year is chosen as the training set.
L234: The dropout citation is not correct here, as dropout was already widely used before the cited work.
L331: Does Table 5 present the testing error?
L335: Typo. Reads BERG2 here but BERG1 in Figure 6?
L344: The authors only explained why the models for Bergen perform better than those for Oslo, but did not give reasons for the "other roofs in the study" as claimed.
L351 and Figure 7: The authors imply that the TRD site is heavily impacted by snow melt here, which is supported by the calibration results in Table 4 (a longer lag time is preferred). However, Figure 7 does not state which season/month/date is plotted, and readers who are not familiar with the climate of Norway (like me) may have a hard time understanding the effect of snow storage.
L362: I am wondering whether the conclusion is true for all sites. The three sites presented in Figure 8 happen to have positive and negative biases for the two training years, and maybe that is why the sum of the two models outperforms either one of them. Are there any sites where the two models from the two years result in biases of the same direction?
L383: In the manuscript, the authors did not include a process-based or conceptual model of the detention process, and as a result, the work only proved that ML models are better at simulating the retention process than conventional methods, rather than at simulating the runoff. The reason seems to be that the detention models are not "convincing" according to the manuscript. However, the authors reviewed detention models in three paragraphs (L39-57) of the introduction. Are none of those reviewed models "convincing"?
Table 1: There are two BERG2.
Figure 4: I would suggest labeling train, validation, and test for each plot.
Figure 6: I think the authors presented the testing period rather than the validation period? And which are the three months?
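
The question raised for L362 can be made concrete with a small numerical sketch. It assumes, purely for illustration, that the combined model behaves like a simple average of the two single-year models, which the manuscript excerpted here does not confirm; the runoff values and the names `mean_bias` and `average_models` are hypothetical. Under that assumption the bias of the average is the average of the biases, so opposite-sign biases cancel while same-sign biases do not.

```python
def mean_bias(observed, simulated):
    """Mean of (simulated - observed); positive means over-prediction."""
    return sum(s - o for o, s in zip(observed, simulated)) / len(observed)


def average_models(sim_a, sim_b):
    """Element-wise average of two simulated series (assumed combination rule)."""
    return [(a + b) / 2.0 for a, b in zip(sim_a, sim_b)]


# Hypothetical runoff values, for illustration only.
obs = [1.0, 2.0, 3.0]
over = [o + 0.5 for o in obs]          # model trained on year A over-predicts
under = [o - 0.5 for o in obs]         # model trained on year B under-predicts
slightly_over = [o + 0.25 for o in obs]

# Opposite-sign biases cancel in the average.
print(mean_bias(obs, average_models(over, under)))
# Same-sign biases do not cancel: the result lies between the two single-model biases.
print(mean_bias(obs, average_models(over, slightly_over)))
```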

In general, I find the manuscript is well-written and could be a novel contribution to the community. I have to acknowledge that I am not an expert in green roof modeling, and I may not fully understand the background and contribution of this manuscript, so please consider my review accordingly.

ED: Publish subject to minor revisions (review by editor) (06 Oct 2021) by Christa Kelleher

AR by Elhadi Abdalla on behalf of the Authors (08 Oct 2021) Author's response Manuscript

EF by Manal Becker (11 Oct 2021) Author's tracked changes

ED: Publish as is (18 Oct 2021) by Christa Kelleher

AR by Elhadi Abdalla on behalf of the Authors (19 Oct 2021)

Short summary

This study investigated the potential of using machine learning algorithms as hydrological models of green roofs across different climatic conditions. The study provides a comparison between conceptual and machine learning models. Machine learning models were found to be accurate in simulating runoff from extensive green roofs.