Rebuttal Letter for the Reviews on ’Machine learning methods to assess the effects of a non-linear damage spectrum taking into account soil moisture on winter wheat yields in Germany’

We thank the reviewers for their detailed comments, which helped us to improve our manuscript substantially. In particular, the main comment from reviewer 1 was was very helpful because the corresponding analysis of why the clusters are identical to the state boundaries helped us identify an issue within the data provided to the clustering algorithms that was not valid. Unfortunately, this analysis is the basis of the statistical analysis and therefore has profound implications for all of the results 15 presented in this manuscript.

. (a) 20-year winter wheat yield average (1999-2018, left) and standard deviation (right) of yields for the counties over Germany.
(b) Box-and-whisker plots of winter wheat anomalies for each year and both linear and nonlinear model fits to identify potential trends in the anomaly data. Exceptional years of interest are marked with a light beige box. Data source: Federal Statistical Office DESTATIS Table 1. Table of  The crop yield potential varies regionally in Germany due to differences in climate and soils among other factors. To take account of these differences, a spatial clustering was implemented to identify different subregions within Germany. The clustering methods used are representatives of centroid-based ones, such as k-means (KMEANS, (MacQueen, 1967;Hartigan and 140 Wong, 1979)) and partitioning around medoids (PAM, (Kaufman and Rousseeuw, 1990)), which is less sensitive to outliers, as well as the connectivity-based hierarchical clustering (HIERARCHICAL, (Murtagh, 1985)). Standard internal validation such as connectivity, average silhouette width, Dunn index for cluster numbers between 2 and 16 were tested for the evaluation.
However, the results show no clear outcome on which algorithm and size combination to use ( figure A3). Instead, we fit the random forests individually for each region defined by one of the cluster algorithms and cluster size. We then selected the 145 combination that maximizes the average prediction capacity (test R-square) across all regions. For each of these cluster configurations the model is trained on 80 percent of the data in that subset and predicted for the rest. The data used for clustering are monthly averages and daily observations of the meteorological data for the entire year. SMI is included for both the upper layer and the entire soil column. Average yields are also taken into account in the data for cluster formation. This is based on the intuition of taking into account time-invariant factors of each cluster that affect yields such as soil quality and average farm 150 size. These factors are not considered in the random forest due to use of yield anomalies. This approach is inspired by fixed effect econometric models. There, the group means are fixed, thus taking into account the time-invariant heterogeneity of these groups (for econometric literature see for instance Wooldridge (2012).
The random forest algorithm allows to study the relationships between hydro-metrological extremes and yield anomalies by assessing the relative importance of the variables and the functional relationship between each predictor and the response 155 variable (Jeong et al., 2016;Vogel et al., 2019;Beillouin et al., 2020). Furthermore, we use model agnostics, which includes various flexible methods that allow the interpretation of black box models that separate the explanation of the model from the model itself. Accordingly, the same method can be used for any kind of machine learning algorithm, different types of explanations and different types of features can be presented (Ribeiro et al., 2016). The particular method considered here is Accumulated Local Effects (ALE), which is a visualization of the average marginal effect of features on target variables for 160 supervised learning models (Apley and Zhu, 2016;Molnar, 2020). ALE plots predict the effect of an explanatory variable across their realisations, taking into account only a subset of the sample with observed values adjacent to the respective realisation (Apley and Zhu, 2016). It is a faster alternative to the popular approach of the Partial Dependence Plots (Friedman, 2001), which have already proved to be suitable in the context of yield prediction (Jeong et al., 2016;Vogel et al., 2019). However, ALE plots are more suitable to visualize marginal effects by plotting explanatory variables against the predicted outcomes if 165 the features are highly correlated (Storm et al., 2020). One limitation is that uncertainty estimates are not provided for ALE plots, which is a substantial limitation of the approach and is an area of active research (Storm et al., 2020). The ALE plots are shown for the most important features of each cluster. To evaluate the cluster algorithm and the number of clusters the test R-squared for each cluster and number of cluster combination is generated. Table 2 shows results for three different soil moisture configurations, i.e. one each for the upper layer as well as the entire soil column and one that takes both into account. For each of these soil moisture configurations the three combinations of algorithms and cluster sizes with the highest R-square are shown. The validation criterion for non-cluster formation is also shown as a reference. Overall, the best results can be achieved if only SMI for the uppermost 25 cm is considered. The best 175 results explain 70% of the wheat yield anomaly variation. The average variance explained exceeds the variance explained by RFs applied at the global level (Vogel et al., 2019) or for the Northern Hemisphere (Vogel et al., 2020) and is comparable to the highest explained variability for RFs applied to European regions , with an average for winter wheat of 43% (Beillouin et al., 2020). A comparable regression model approach is able to explain a maximum of 32% of the variation (Gömann et al., 2015).
A large fraction of the variability is usually explained by time-invariant factors, which are largely not considered here due to  We choose to further explore the results of PAM with two clusters because it provides a compromise between a high predictive power and reduced complexity. The clusters are divided along the former border between western and eastern Germany (see figure 2b). This division along administrative borders is also supported by other cluster (figure 2a). In general, a higher variability of yields but lower average yields can be observed in most parts of eastern Germany (figure 1). This indicates structural differences between western and eastern Germany (Albers et al., 2017). In addition to environmental conditions, these may 190 include, for example, different data collection practices and different agribusiness structures. The R-square of cluster 1 is 0.64 and of cluster 2 is 0.73. For both clusters, a good fit can be observed in the scatter plots for most of the data ( figure 3 (a)), while the tails are slightly underestimated ( figure 3 (b)). The higher explanatory power for cluster 2 might be related to the higher variation in yield anomaly there (see figure 1). Furthermore, different impact mechanisms might work within each cluster.
Those are disentangled in the following.

Marginal effects of the most important features
The main variables for each sub-cluster and the corresponding average marginal effects are presented below in order to understand the range of adverse effects on yield variation in winter wheat. To generate variable importance and ALE plots, no split is made between test and training data. The non-cluster results are compared with the spatial clusters generated with the PAM clustering algorithm for a cluster size of 2 (PAM (2)). The detailed ALE plots for the overall best algorithm cluster size  The ALE plots in figure 4 are ranked in accordance to their variable importance (for further information see the variable importance section in the appendix). In general, soil moisture supports best the performance of the model. This is valid in 205 particular for the non-cluster approach and cluster 1, since in cluster 2 more meteorological variables are critical. Figure 4a shows the ALE plots for the non-cluster approach. Cluster 1 (figure 4b) shows almost similar sensitivities to those observed for the non-cluster approach. Besides the importance ranking, basically only the amplitude of the effects change. Only the least important variable is PS5 instead of PS4 in cluster 1. Topsoil SMI in February and March as well as in August show a positive signal for shortage in soil moisture. This indicates a preference for drier than normal conditions during those months.

210
For December and January, no negative impact of soil water scarcity can be found. SMI shows a negative drought signal for the months April to July with the one in June being the largest. For July a drop in yield can be observed for small values of SMI. However, the negative effects of soil water abundance are much larger in this case.
For cluster 1, in May and June, the drought signal by soil moisture is comparatively smaller than the one found in the noncluster setting. For April the signal is stronger but still ambiguous. Overall, the signals associated with water deficit stress are threshold for negative heat effects to higher temperatures. Furthermore, the need for explicit control of soil moisture is evident.

Conclusions
Here we show that random forests are very suitable for assessing the non-linear damaging effects of different environmental conditions on winter wheat yield anomalies. Explicit consideration is given to soil moisture at various depths. In addition, the 300 crop potential and other spatially related environmental conditions are taken into account, which helps to improve predictive of the year before are not considered as well as February soil moisture. Also, PS5 is not represented in the data there. Instead, precipitation scarcity of April, July, and August is considered now. This indicates, that the meteorological variables are more 380 important in the regions considered here. Also, the late spring and summer seasons are more pronounced as precipitation scarcity in April as well as soil moisture from May to July are amongst the most important variables. However, soil moisture in March is still the most relevant variable. In general soil moisture supports the performance of the model for all three considerations the most. This is particular true for the non-cluster approach and cluster 1 as in cluster 2 more meteorological variables are critical. The most important variable for all three cluster considerations is the same. The only meteorological 385 variable listed for all three clusters is Heat8. It can be observed that the non-cluster approach particularly reflects cluster 1 whereas cluster 2 is underrepresented.