Machine-learning approach to crop yield prediction with the spatial extent of drought

Diaz, Vitali; Osman, Ahmed A. A.; Corzo Perez, Gerald A.; Van Lanen, Henny A. J.; Maskey, Shreedhar; Solomatine, Dimitri

doi:10.5194/hess-2021-600

Preprints

https://doi.org/10.5194/hess-2021-600

24 Nov 2021

Status: this preprint has been withdrawn by the authors.

Abstract. Crop yield is one of the variables used to assess the impact of droughts on agriculture. Crop growth models calculate yield and variables related to plant development, which makes them well suited to crop yield estimation. However, these models are limited in that they need specific data for computation. Given this limitation, machine learning (ML) models are widely used instead, but their use with the spatial characteristics of droughts as input data is limited. This research explored the spatial extent of drought (area) as input data for building an approach to predict seasonal crop yield. This ML approach is made up of two components. The first includes polynomial regression (PR) models, and the second considers artificial neural network (ANN) models. The purpose was to evaluate both types of ML models (PR and ANN) and integrate them into one operational tool. The logic is as follows: ANN models give the most accurate predictions, but in practice, issues regarding data retrieval and processing can make the use of equations, i.e. PR, preferable. The proposed approach provides these PR equations to perform such calculations with early and preliminary input. The estimates can be further improved when the ANN models are run with the final input data. The results indicated that the empirical equations (PR) produced good predictions when using drought area as the input. In general, ANN provides better estimates. This research will improve drought monitoring systems for assessing drought effects. Since it is currently possible to calculate drought areas within these systems, predictions of drought effects can be integrated directly by following approaches such as the one presented here.
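
As a rough illustration of the PR component described in the abstract, the sketch below fits a polynomial to yield as a function of drought area by ordinary least squares. It is not the authors' implementation: the polynomial degree, the function names `fit_polynomial` and `predict`, and the data values are all assumptions made up for this example, and the preprint's own PR equations are not reproduced here.

```python
"""Least-squares polynomial fit of yield against drought area (illustrative only)."""


def fit_polynomial(x, y, degree):
    """Return coefficients [c0, c1, ..., c_degree] minimising the squared error."""
    n = degree + 1
    # Normal equations (V^T V) c = V^T y, where V[i][j] = x_i ** j.
    a = [[sum(xi ** (r + c) for xi in x) for c in range(n)] for r in range(n)]
    b = [sum(yi * xi ** r for xi, yi in zip(x, y)) for r in range(n)]

    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]

    # Back substitution.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** j for j, c in enumerate(coeffs))


# Made-up data (assumption): drought area as a fraction of the region (0..1)
# and a de-trended yield anomaly. These values are NOT from the preprint.
drought_area = [0.0, 0.1, 0.2, 0.4, 0.6, 0.8]
yield_anomaly = [0.30, 0.22, 0.10, -0.12, -0.35, -0.60]

coeffs = fit_polynomial(drought_area, yield_anomaly, degree=2)
print("predicted anomaly at 50% drought area:", predict(coeffs, 0.5))
```

An equation of this kind can be evaluated by hand or in a spreadsheet once the coefficients are known, which is the practical advantage the abstract attributes to PR when only early, preliminary input is available.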

Received: 23 Nov 2021 – Discussion started: 24 Nov 2021

Competing interests: Dimitri Solomatine is a member of the HESS editorial board.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Interactive discussion

Status: closed

RC1:
'Comment on hess-2021-600', Anonymous Referee #1, 26 Jan 2022

General comments:

The paper is very interesting and I think it is related to the main objectives of the journal. The authors use statistical models to try to adjust the annual rice yield with the area affected by droughts. They use SPEI at different scales and with a threshold of -1 to define droughts.

However, I have observations. Among them is the difficulty to follow the reading due to the order in which the information is presented. On the other hand, I think there are many assumptions in the methodology and in the way results are shown that should be further discussed. The methodology used is not 100% justified, as it leaves many options and questions open.

I elaborate on these issues below.

Specific comments:

I recommend reordering the information to make it easier to read. For example, section 2.1 explains a lot of information that is formally presented in section 3. Information that could be presented in the introduction (for example) or by arranging the text in a more coherent way.

The title and abstract I think should emphasise the area of study, because while the methodology is appropriate and interesting, a broader study at a global level is needed to generalise it. Therefore, I believe that specifying the region of the research is necessary.

The paragraph starting on line 101 explains the different sources of agricultural data in India. I recommend without extending the text too much to discuss with bibliographical sources (if available, otherwise make it clear) the different sources with their pros, cons and quantitative differences at different spatio-temporal levels.

The use of the area affected by droughts could be discussed further in the introduction, so as to justify it. Some studies discuss this issue, for example:

Araneda-Cabrera, R.J., Bermudez, M., Puertas, J., 2021c. Assessment of the performance of drought indices for explaining crop yield variability at the national scale: Methodological framework and application to Mozambique. Agric. Water Manag. 246. https://doi.org/10.1016/j.agwat.2020.106692

Use at least 1 or 2 model fit indicators to get a better picture of the results (e.g. R2).

In line 270, where the ANN model is introduced, the 85% and 15% data split is arbitrary. Although it is valid in principle, I believe it should be discussed that different model fit values can be obtained each time the split is made, due to the randomness that exists in the selection of these variables. Several methods can be used to decrease this randomness (e.g. running the model 10 times and averaging the results or the 10 RMSEs).

Rice is a crop that may depend on water from regulation or other sources. Is it the best example to fit with SPEI? Since this is a meteorological index, which, although it correlates very well with agricultural droughts when using 3–12-month scales, it does not detect the possible regulations that rice may have. Are we sure that it is a rainfed production in the whole area or where there is irrigation? These issues are not even mentioned in the results, which confuses the reading.

To this last fact I also have two doubts. One is an observation of lack of clarity in figure 3, it could be further improved to appreciate the time series of rice crops. And the second is to discuss the use of a scale of 0.5°. Is it in accordance with the agricultural data available and for the purposes of using it to calculate the area affected by droughts?

Figure 5 could be improved or summarised and shown separately (including as supplementary material and/or tables).

The limitations of the work are mentioned, but I think they could be expanded, so that the application of the methodology is justified despite the limitations.

Are the presented results (equations) suitable for operational use in the region?

Citation: https://doi.org/10.5194/hess-2021-600-RC1
- AC1: 'Reply on RC1', Vitali Diaz, 04 Apr 2022
  
  Response to the reviewers
  Thank you very much for the review and valuable suggestions. We describe below how we addressed each comment. Our responses begin with the word '[Reply]'. We have numbered each reviewer’s comment for ease of reading, and this numbering is indicated between brackets.
  Reviewer 1
  
  General comments:
  
  The paper is very interesting and I think it is related to the main objectives of the journal. The authors use statistical models to try to adjust the annual rice yield with the area affected by droughts. They use SPEI at different scales and with a threshold of -1 to define droughts.
  
  However, I have observations. Among them is the difficulty to follow the reading due to the order in which the information is presented. On the other hand, I think there are many assumptions in the methodology and in the way results are shown that should be further discussed. The methodology used is not 100% justified, as it leaves many options and questions open.
  
  I elaborate on these issues below.
  
  Specific comments:
  [1] I recommend reordering the information to make it easier to read. For example, section 2.1 explains a lot of information that is formally presented in section 3. Information that could be presented in the introduction (for example) or by arranging the text in a more coherent way.
  [Reply] Thank you very much. We have swapped Sections 2 and 3 to make the paper easier to read and understand. They are now Sect. 2 (Data) and Sect. 3 (Methodology). We believe that in this order, a reader approaching Sect. 3 (Methodology) will already have the details of the case study and data presented in Sect. 2.
  
  [2] The title and abstract I think should emphasise the area of study, because while the methodology is appropriate and interesting, a broader study at a global level is needed to generalise it. Therefore, I believe that specifying the region of the research is necessary.
  [Reply] Thanks for your suggestion. Although the results are indeed of great interest for the analysed regions, we consider that the paper's main contribution is the methodology and especially the assessment of the input used for predicting crop yield, i.e., drought area. Drought area has not been investigated fully in previous studies. Therefore, we consider that the title and the abstract are suitable.
  
  [3] The paragraph starting on line 101 explains the different sources of agricultural data in India. I recommend without extending the text too much to discuss with bibliographical sources (if available, otherwise make it clear) the different sources with their pros, cons and quantitative differences at different spatio-temporal levels.
  [Reply] Thank you very much for your comment. Of the two systems indicated in that paragraph, i.e., ground-field visits-based and satellite-based systems, the main advantage of the latter is that it provides forecasting, that is, information before the crop is harvested. There are differences in the spatial and temporal scales due to the configuration of both systems; we have included a paragraph describing such characteristics based on your suggestion. We have also emphasised that this paper aims to introduce a methodology for crop prediction that may well serve for comparison with the other two systems.
  
  [4] The use of the area affected by droughts could be discussed further in the introduction, so as to justify it. Some studies discuss this issue, for example:
  
  Araneda-Cabrera, R.J., Bermudez, M., Puertas, J., 2021c. Assessment of the performance of drought indices for explaining crop yield variability at the national scale: Methodological framework and application to Mozambique. Agric. Water Manag. 246. https://doi.org/10.1016/j.agwat.2020.106692
  [Reply] Thanks. We have read the reference you mentioned and included it in the Introduction. We have now added text indicating that one of the motivations for using the drought area as an input variable is its high correlation with crop yield, as noted in some previous studies, including Araneda-Cabrera et al. (2021).
  
  [5] Use at least 1 or 2 model fit indicators to get a better picture of the results (e.g. R2).
  [Reply] Thanks. We do not show the results, but we also used R2 in our analysis. Using R2 leads to the same results. Although other error metrics can be used, we do not intend to evaluate how the choice of metric influences the results. We believe that this type of analysis could be investigated in future studies. In this paper, we have set a precedent for the effective use of drought area for crop yield prediction.
  
  [6] In line 270, where the ANN model is introduced, the 85% and 15% data split is arbitrary. Although it is valid in principle, I believe it should be discussed that different model fit values can be obtained each time the split is made, due to the randomness that exists in the selection of these variables. Several methods can be used to decrease this randomness (e.g. running the model 10 times and averaging the results or the 10 RMSEs).
  [Reply] Thanks. The 85% and 15% rule for training-validation and verification (testing) is a common practice in model building, as well as the use of iteration-based approaches to calculating the best realisation (model). This splitting rule was only applied to the ANN models. In the case of PR (polynomial regression), Eqs. 5 to 8 were used for the four types of PR, for each set (combination) of inputs (Table 2). Regarding the construction of the ANN models, there is no randomness in the selection of the input variables. The complete procedure was applied to each set of inputs (Table 2), and the errors were also calculated for each set of inputs. The iterations served to evaluate and select the best parameters of the ANN models for each set of inputs, varying the number of hidden layer nodes (from 1 to 10); for this, we applied 100 iterations for each number of nodes, giving a total of 1000 iterations for each set of inputs. We have improved our text to indicate that the iterative procedure, including the calculation of RMSE, was carried out for each set of input variables in the ANN models. We have also included an additional figure in our manuscript to show the general scheme of how the input and output variables are tied to facilitate the reading of our methodology.
  
  [7] Rice is a crop that may depend on water from regulation or other sources. Is it the best example to fit with SPEI? Since this is a meteorological index, which, although it correlates very well with agricultural droughts when using 3–12-month scales, it does not detect the possible regulations that rice may have. Are we sure that it is a rainfed production in the whole area or where there is irrigation? These issues are not even mentioned in the results, which confuses the reading.
  [Reply] Thanks. The three regions in which the methodology was tested show various configurations of rain-fed and irrigated production, with different percentages in each case, as shown in Figure 4.d (original manuscript). In regions 1 and 2, the rain-fed and irrigated production percentages are each around 50%. In region 3, irrigated production reaches only 35%. The correlation analysis shows that for region 3, the R values between the drought areas and crop yield are high. The high correlation arises because in region 3 the lack of water resources is less mitigated by irrigation, so the water anomalies are better detected by a drought index that considers precipitation. Our original manuscript pointed this out (lines 408-413). In the case of regions 1 and 2, although only half of the territory is rain-fed, the drought areas correlate well in the months of the growing season, although to a lesser extent than in the case of region 3.
  
  [8] To this last fact I also have two doubts. One is an observation of lack of clarity in figure 3, it could be further improved to appreciate the time series of rice crops. And the second is to discuss the use of a scale of 0.5°. Is it in accordance with the agricultural data available and for the purposes of using it to calculate the area affected by droughts?
  
  Figure 5 could be improved or summarised and shown separately (including as supplementary material and/or tables).
  [Reply] Thanks. Figures 3 and 5 have been improved. In the case of Figure 3, the figure shows the de-trended crop yield data. We have also added in Figure 1 the time series of crop yield for each region.
  
  On the other hand, although the 0.5-degree resolution of the drought indicator might seem coarse, it was found to be a suitable resolution for the drought indicator to calculate drought areas according to the results. Finer resolutions could improve results and even build models for smaller regions like the crop district level. We use drought data from the SPEI global monitor to motivate the use of our methodology in other study areas. If there is a lack of data, the SPEI data from this drought monitor could be used as the initial implementation.
  
  [9] The limitations of the work are mentioned, but I think they could be expanded, so that the application of the methodology is justified despite the limitations.
  [Reply] Thank you very much for the issues raised. We have extended the discussion in our section on the methodology's limitations (Sect. 4.5).
  
  [10] Are the presented results (equations) suitable for operational use in the region?
  [Reply] Yes, the constructed PR equations are suitable for operational use. We have also enhanced the text to indicate so. We also include the description of an example of the PR equations application.
  
  Kind regards,
  
  Vitali Diaz
  
  on behalf of the authors
  
  Citation: https://doi.org/10.5194/hess-2021-600-AC1
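
The exchange on comments [5] and [6] turns on two points: which fit indicators to report (RMSE, R2) and how much a single random 85%/15% split can move them. The sketch below shows one way to repeat the split and average the test RMSE, as the referee suggests. It is a minimal illustration under stated assumptions: a closed-form straight-line fit stands in for the ANN, the data are synthetic, and all function names (`rmse`, `r_squared`, `fit_line`, `repeated_split_rmse`) are hypothetical rather than taken from the manuscript.

```python
import math
import random
import statistics


def rmse(obs, pred):
    """Root-mean-square error between observations and predictions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r_squared(obs, pred):
    """Coefficient of determination; undefined if obs is constant."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Closed-form simple linear regression (a stand-in for the ANN)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def repeated_split_rmse(x, y, n_repeats=10, test_fraction=0.15, seed=0):
    """Mean and spread of test RMSE over repeated random train/test splits."""
    rng = random.Random(seed)
    n_test = max(1, round(test_fraction * len(x)))
    scores = []
    for _ in range(n_repeats):
        idx = list(range(len(x)))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        pred = [slope * x[i] + intercept for i in test]
        scores.append(rmse([y[i] for i in test], pred))
    return statistics.mean(scores), statistics.pstdev(scores)


# Synthetic data (assumption): yield anomaly falling with drought-area fraction.
noise = random.Random(42)
area = [i / 20 for i in range(20)]
anomaly = [0.3 - a + noise.gauss(0.0, 0.05) for a in area]

mean_rmse, spread = repeated_split_rmse(area, anomaly)
print("mean test RMSE:", mean_rmse, "spread across splits:", spread)
```

Reporting the spread alongside the mean makes the sensitivity to the split visible; with few yearly observations, a 15% test set holds only a handful of points, so the spread can be large.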
RC2:
'Comment on hess-2021-600', Anonymous Referee #2, 07 Mar 2022

Thank you for this proposal.

Major comment 1: "Machine-learning approach to crop yield prediction with the spatial extent of drought" doesn't exactly reflect the object of your work, as drought is mainly considered as agricultural drought through the SPEI index (i.e. the difference between precipitation and evapotranspiration) used as "drought indicator" in your study. Please modify the paper title accordingly: "...with the spatial extent of agricultural drought" and make your definition of drought precise in the Introduction section. Please also explain whether soil characteristics (structure, nature) and farmers' operational practices could be taken into account in the ML approach in the future and how they could influence the "agricultural drought area" calculation. Please discuss it.

Major comment 2: the correlation analysis seems very tricky and is hard to read. In general the figures are not pleasant and easy to read (Figures 6-7-8-9). There is too much text in the figures and tables.

Minor comment 1: The proposed approach is a large-scale approach that is disconnected from the drivers at the local scales of agricultural management (plot, farm or irrigated schemes) and masks the local variability of soil moisture. How does this complex approach provide added value for these local managers? Please discuss it.

Minor comment 2: Up to line 308 there is mention of "drought indicator" without really knowing what it is. Could you move the data section at the beginning of the paper?

Additional comment: The impact of a drought on agriculture cannot be measured by assessing crop yields because it depends on farming practices and cropping systems. In this approach, the joint result of the agricultural practices currently practiced during a period of plant water stress is evaluated.

Citation: https://doi.org/10.5194/hess-2021-600-RC2
- AC2: 'Reply on RC2', Vitali Diaz, 04 Apr 2022
  
  Response to the reviewers
  Thank you very much for the review and valuable suggestions. We describe below how we addressed each comment. Our responses begin with the word '[Reply]'. We have numbered each reviewer’s comment for ease of reading, and this numbering is indicated between brackets.
  Reviewer 2
  
  Thank you for this proposal.
  
  [1] Major comment 1: "Machine-learning approach to crop yield prediction with the spatial extent of drought" doesn't exactly reflect the object of your work, as drought is mainly considered as agricultural drought through the SPEI index (i.e. the difference between precipitation and evapotranspiration) used as "drought indicator" in your study. Please modify the paper title accordingly: "...with the spatial extent of agricultural drought" and make your definition of drought precise in the Introduction section. Please also explain whether soil characteristics (structure, nature) and farmers' operational practices could be taken into account in the ML approach in the future and how they could influence the "agricultural drought area" calculation. Please discuss it.
  [Reply] Thank you very much. Although it is true that drought indices are originally created to analyse a specific type of drought, by considering different aggregation periods it is also possible to identify types other than those for which they were originally created. This is the case in our study. For this reason, we do not emphasise agricultural drought in the title, because we are not using only the aggregation periods usually associated with agricultural drought.
  
  Our correlation analysis between crop yield and drought areas derived from different aggregation periods indicates that different types of drought (i.e. meteorological, agricultural, and hydrological) affect (impact) the crop yield to varying degrees throughout the months of the crop period. To build the ML models, this level of affectation could be taken into account by using the different hydro-meteorological variables or selecting different aggregation periods of the meteorological variables, as in this case.
  
  Due to the above, we trust that the title is correct because our analysis is not focused on a type of drought, although it is focused on a specific type of dependent variable, that is, the crop yield.
  
  Regarding the characteristics of the soil, they can of course be considered within the ML approach. Whether to include them depends largely on the availability and accuracy of this information and the need to do so. In our case, largely because cultivated cropland covers a high proportion of the region's area, it was not necessary to include this type of information in our ML approach.
  
  In future studies, we envision different topics in this regard. Here are some examples.
  
  The degree of influence of anthropogenic factors, such as farmer operational practices, other types including soil conditions, or the impact of natural phenomena such as drought, could be included in the ML approach presented.
  
  One way to implement the above is as follows. Three or more types of input could be classified: anthropogenic, natural, and different types of combinations. For each set of inputs, the variable selection analysis could be carried out to identify which variables mostly govern agricultural production. Subsequently, the ML models could be built following our proposed approach, i.e. ANN models and equations.
  
  An optimisation analysis could also be carried out to find the best practices (combination). This type of analysis can be done at different spatial and temporal scales, if the data allow it.
  
  Another line in which we foresee much future development is the construction of ML models with the study area fully discretised into cells. For each cell, the whole approach can be carried out. The availability of spatial data is crucial in this type of analysis; advances in remote sensing and the different earth monitors developed in the last decades could facilitate the implementation of this type of methodology with more advanced ML approaches.
  
  This research can also be extended into the future to analyse the climate change scenario, either to elucidate the consequences or to find the best crop management practices to face it.
  
  An analysis that is not very complicated to implement and that would, however, include anthropogenic factors or other variables such as the type and conditions of the soil, would be, for example, to weight the drought areas with factors calculated with the additional variables. In this way, the drought areas would be modified to a greater or lesser extent, increasing or attenuating the effects of the drought.
  [2] Major comment 2: the correlation analysis seems very tricky and is hard to read. In general the figures are not pleasant and easy to read (Figures 6-7-8-9). There is too much text in the figures and tables.
  [Reply] Thank you. We have modified and improved the description of the correlation analysis methodology as well as the presentation of the results. We have also included an additional figure in our manuscript to show the general scheme of how the input and output variables are tied.
  
  [3] Minor comment 1: The proposed approach is a large-scale approach that is disconnected from the drivers at the local scales of agricultural management (plot, farm or irrigated schemes) and masks the local variability of soil moisture. How does this complex approach provide added value for these local managers? Please discuss it.
  [Reply] Thank you. Methodologies that consider the variables you mention (agricultural practices, soil properties and condition, among others) are ideal; however, their use is not always possible. Our study presents a methodological alternative for predicting crop yield. In the study area, there are two current approaches for crop yield calculation: one based on field visits and a monitoring system based on remote sensing inputs. Their drawbacks and advantages are indicated in the Introduction. Our methodology complements these two tools and provides crop yield predictions that can be compared with them, with the difference that our ML approach produces results before the harvest (i.e. prediction).
  
  Our analysis could be extended further. In subsequent studies we consider that an analysis of irrigation practices could be made, where the best practices could be identified. Our results indicate that the increase in drought area is highly correlated with the decrease in crop yield. A more detailed analysis will make it possible to identify the best agricultural management practices, identify sub-regions more/less vulnerable to the effects of the different types of drought, and detect various demands on water resources throughout the different farming systems. See also the answer to the comment [1].
  
  [4] Minor comment 2: Up to line 308 there is mention of "drought indicator" without really knowing what it is. Could you move the data section at the beginning of the paper?
  [Reply] Thanks. For a better understanding and reading, we have reorganised sections 2 and 3, now Data is presented first in Sect. 2 and then Methodology in Sect. 3. The text in both sections has also been adjusted accordingly.
  
  [5] Additional comment: The impact of a drought on agriculture cannot be measured by assessing crop yields because it depends on farming practices and cropping systems. In this approach, the joint result of the agricultural practices currently practiced during a period of plant water stress is evaluated.
  [Reply] Thank you. We have updated our text.
  Kind regards,
  
  Vitali Diaz
  
  on behalf of the authors
  
  Citation: https://doi.org/10.5194/hess-2021-600-AC2
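
Both discussions refer to drought area derived from gridded SPEI (threshold of -1, 0.5° cells), and the reply to comment [1] of Referee #2 suggests weighting drought areas with factors computed from additional variables. The sketch below shows one possible way to compute such an area fraction. It is an assumption-laden illustration, not the manuscript's procedure: the inclusive `<=` comparison, the cosine-of-latitude area weighting, the `factors` grid, and the function name `drought_area_fraction` are all choices made for this example.

```python
import math


def drought_area_fraction(spei, lats, threshold=-1.0, factors=None):
    """Fraction of the region's area in drought.

    spei:    list of rows, one per latitude band; None marks cells outside the region.
    lats:    centre latitude in degrees for each row.
    factors: optional grid of per-cell weights (hypothetical), applied only to
             cells in drought, e.g. to reflect soil or management conditions.
    """
    drought = 0.0
    total = 0.0
    for r, row in enumerate(spei):
        # Relative area of a regular lat-lon cell shrinks with cos(latitude).
        cell_area = math.cos(math.radians(lats[r]))
        for c, value in enumerate(row):
            if value is None:
                continue
            total += cell_area
            if value <= threshold:
                weight = 1.0 if factors is None else factors[r][c]
                drought += cell_area * weight
    return drought / total if total else 0.0


# Made-up 0.5-degree grid (assumption): two latitude bands, three cells each.
spei_grid = [
    [-1.6, -0.4, None],
    [-1.1, 0.3, -2.0],
]
row_lats = [20.25, 20.75]

print("drought area fraction:", drought_area_fraction(spei_grid, row_lats))
```

With `factors` above 1 the weighted value can exceed the plain area fraction, which matches the reply's description of increasing or attenuating the effect of drought; whether the result should then be capped at 1 is left open here.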

Interactive discussion

Status: closed

RC1:
'Comment on hess-2021-600', Anonymous Referee #1, 26 Jan 2022

General comments:

The paper is very interesting and I think it is related to the main objectives of the journal. The authors use statistical models to try to adjust the annual rice yield with the area affected by droughts. They use SPEI at different scales and with a threshold of -1 to define droughts.

However, I have observations. Among them is the difficulty to follow the reading due to the order in which the information is presented. On the other hand, I think there are many assumptions in the methodology and in the way results are shown that should be further discussed. The methodology used is not 100% justified, as it leaves many options and questions open.

I elaborate on these issues below.

Specific comments:

I recommend reordering the information to make it easier to read. For example, section 2.1 explains a lot of information that is formally presented in section 3. Information that could be presented in the introduction (for example) or by arranging the text in a more coherent way.

The title and abstract I think should emphasise the area of study, because while the methodology is appropriate and interesting, a broader study at a global level is needed to generalise it. Therefore, I believe that specifying the region of the research is necessary.

The paragraph starting on line 101 explains the different sources of agricultural data in India. I recommend without extending the text too much to discuss with bibliographical sources (if available, otherwise make it clear) the different sources with their pros, cons and quantitative differences at different spatio-temporal levels.

Saying that using the area affected by droughts could be discussed more in the introductory line, so as to justify its use. Some studies discuss this issue, for example:

Araneda-Cabrera, R.J., Bermudez, M., Puertas, J., 2021c. Assessment of the performance of drought indices for explaining crop yield variability at the national scale: Methodological framework and application to Mozambique. Agric. Water Manag. 246. https://doi.org/10.1016/j.agwat.2020.106692

Use at least 1 or 2 model fit indicators to get a better picture of the results (e.g. R2).

In line 270 in the ANN model incorporation, the 85% and 15% data split is arbitrary. Although it is valid in principle, I believe that the possibility of obtaining the fact that different model fit values can be obtained each time it is used should be discussed, due to the randomness that exists in the selection of these variables. Several methods can be used to decrease this randomness (e.g. running the model 10 times, averaging the results or averaging the 10 RMSEs, etc.).

Rice is a crop that may depend on water from regulation or other sources. Is it the best example to fit with SPEI? Since this is a meteorological index, which, although it correlates very well with agricultural droughts when using 3–12-month scales, it does not detect the possible regulations that rice may have. Are we sure that it is a rainfed production in the whole area or where there is irrigation? These issues are not even mentioned in the results, which confuses the reading.

To this last fact I also have two doubts. One is an observation of lack of clarity in figure 3, it could be further improved to appreciate the time series of rice crops. And the second is to discuss the use of a scale of 0.5Ë. Is it in accordance with the agricultural data available and for the purposes of using it to calculate the area affected by droughts?

Figure 5 could be improved or summarised and shown separately (including as supplementary material and/or tables).

The limitations of the work are mentioned, but I think they could be expanded, so that the application of the methodology is justified despite the limitations.

Are the presented results (equations) suitable for operational use in the region?

Citation: https://doi.org/10.5194/hess-2021-600-RC1
- AC1: 'Reply on RC1', Vitali Diaz, 04 Apr 2022
  
  Response to the reviewers
  Thank you very much for the review and valuable suggestions. We describe below how we addressed each comment. Our responses begin with the word '[Reply]'. We have numbered each reviewer’s comment for ease of reading, and this numbering is indicated between brackets.
  Reviewer 1
  
  General comments:
  
  The paper is very interesting and I think it is related to the main objectives of the journal. The authors use statistical models to try to adjust the annual rice yield with the area affected by droughts. They use SPEI at different scales and with a threshold of -1 to define droughts.
  
  However, I have observations. Among them is the difficulty to follow the reading due to the order in which the information is presented. On the other hand, I think there are many assumptions in the methodology and in the way results are shown that should be further discussed. The methodology used is not 100% justified, as it leaves many options and questions open.
  
  I elaborate on these issues below.
  
  Specific comments:
  [1] I recommend reordering the information to make it easier to read. For example, section 2.1 explains a lot of information that is formally presented in section 3. This information could instead be presented in the introduction (for example), or the text could be arranged in a more coherent way.
  [Reply] Thank you very much. We have swapped Sections 2 and 3 to facilitate reading and understanding. They are now Sect. 2 (Data) and Sect. 3 (Methodology). We believe that in this order the reader approaching Sect. 3 will already have the information and details on the case study and data presented in Sect. 2.
  
  [2] I think the title and abstract should emphasise the study area because, while the methodology is appropriate and interesting, a broader study at a global level is needed to generalise it. Therefore, I believe that specifying the region of the research is necessary.
  [Reply] Thanks for your suggestion. Although the results are indeed of great interest for the analysed regions, we consider that the paper's main contribution is the methodology and especially the assessment of the input used for predicting crop yield, i.e., drought area. Drought area has not been fully investigated in previous studies. Therefore, we consider the title and the abstract to be suitable.
  
  [3] The paragraph starting on line 101 explains the different sources of agricultural data in India. Without extending the text too much, I recommend discussing, with bibliographical sources (if available; otherwise make this clear), the different sources with their pros, cons and quantitative differences at different spatio-temporal levels.
  [Reply] Thank you very much for your comment. Of the two systems indicated in that paragraph, i.e., ground-field visits-based and satellite-based systems, the main advantage of the latter is that it provides forecasts, that is, information before the crop is harvested. There are differences in the spatial and temporal scales due to the configuration of both systems; following your suggestion, we have included a paragraph describing these characteristics. We have also emphasised that this paper aims to introduce a methodology for crop prediction that may well serve for comparison with the other two systems.
  
  [4] The use of the area affected by droughts could be discussed further in the introduction, so as to justify it. Some studies discuss this issue, for example:
  
  Araneda-Cabrera, R.J., Bermudez, M., Puertas, J., 2021c. Assessment of the performance of drought indices for explaining crop yield variability at the national scale: Methodological framework and application to Mozambique. Agric. Water Manag. 246. https://doi.org/10.1016/j.agwat.2020.106692
  [Reply] Thanks. We have read the reference you mentioned and included it in the Introduction. We have now added text indicating that one of the motivations for using the drought area as an input variable is its high correlation with crop yield, as noted in some previous studies, including Araneda-Cabrera et al. (2021).
  
  [5] Use at least 1 or 2 model fit indicators to get a better picture of the results (e.g. R2).
  [Reply] Thanks. Although we do not show those results, we also used R2 in our analysis, and it leads to the same results. Although other error metrics can be used, we do not intend to evaluate how the choice of metric influences the results. We believe that this type of analysis could be investigated in future studies. In this paper, we set a precedent for the effective use of drought area for crop yield prediction.
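
The statement that R2 leads to the same results can be made precise under one assumption that the reply does not state: that R2 denotes the coefficient of determination computed against the same observed yields as the RMSE. The symbols below (observed yield y_i, predicted yield \hat{y}_i, mean observed yield \bar{y}, sample size n) are introduced here for illustration only.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}
    = 1 - \frac{n\,\mathrm{RMSE}^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}
```

Because the denominator depends only on the observations, R^2 decreases monotonically as RMSE increases, so models compared on the same data are ranked identically by either metric. If R2 instead means the squared Pearson correlation, this equivalence need not hold.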
  
  [6] In line 270, in the incorporation of the ANN model, the 85% and 15% data split is arbitrary. Although it is valid in principle, I believe the possibility that different model fit values are obtained each time it is used should be discussed, owing to the randomness in the selection of these variables. Several methods can be used to decrease this randomness (e.g. running the model 10 times and averaging the results, or averaging the 10 RMSEs).
  [Reply] Thanks. The 85% and 15% rule for training-validation and verification (testing) is a common practice in model building, as well as the use of iteration-based approaches to calculating the best realisation (model). This splitting rule was only applied to the ANN models. In the case of PR (polynomial regression), Eqs. 5 to 8 were used for the four types of PR, for each set (combination) of inputs (Table 2). Regarding the construction of the ANN models, there is no randomness in the selection of the input variables. The complete procedure was applied to each set of inputs (Table 2), and the errors were also calculated for each set of inputs. The iterations served to evaluate and select the best parameters of the ANN models for each set of inputs, varying the number of hidden layer nodes (from 1 to 10); for this, we applied 100 iterations for each number of nodes, giving a total of 1000 iterations for each set of inputs. We have improved our text to indicate that the iterative procedure, including the calculation of RMSE, was carried out for each set of input variables in the ANN models. We have also included an additional figure in our manuscript to show the general scheme of how the input and output variables are tied to facilitate the reading of our methodology.
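
The iterative procedure described in this reply (1 to 10 hidden nodes, 100 random initialisations per node count, lowest RMSE retained) can be sketched as follows. This is only an illustration under stated assumptions: the small SGD-trained network stands in for whatever ANN implementation the authors used, the function names `train_ann` and `select_best_ann` are hypothetical, the data are synthetic, and selecting on the held-out 15% is an assumption, since the reply does not say which subset the selection was based on.

```python
import math
import random

def rmse(obs, pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def train_ann(xs, ys, n_hidden, rng, epochs=100, lr=0.05):
    """Fit a one-hidden-layer tanh network by plain SGD (illustrative stand-in)."""
    n_in = len(xs[0])
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = [0.0]  # one-element list so forward() always reads the trained value

    def forward(x):
        h = [math.tanh(sum(w * v for w, v in zip(w1[j], x)) + b1[j])
             for j in range(n_hidden)]
        return h, sum(w * hj for w, hj in zip(w2, h)) + b2[0]

    for _ in range(epochs):
        for x, y in zip(xs, ys):
            h, out = forward(x)
            err = out - y
            for j in range(n_hidden):
                grad_h = err * w2[j] * (1.0 - h[j] ** 2)  # uses w2[j] before its update
                w2[j] -= lr * err * h[j]
                b1[j] -= lr * grad_h
                for i in range(n_in):
                    w1[j][i] -= lr * grad_h * x[i]
            b2[0] -= lr * err
    return lambda x: forward(x)[1]

def select_best_ann(xs, ys, max_nodes=10, restarts=100, test_frac=0.15, seed=0):
    """Try 1..max_nodes hidden nodes with `restarts` random initialisations each."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    n_test = max(1, round(test_frac * len(xs)))  # 15% held out by default
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    x_tr, y_tr = [xs[i] for i in train_idx], [ys[i] for i in train_idx]
    x_te, y_te = [xs[i] for i in test_idx], [ys[i] for i in test_idx]

    best, runs = None, 0
    for n_hidden in range(1, max_nodes + 1):
        for _ in range(restarts):
            predict = train_ann(x_tr, y_tr, n_hidden, rng)
            score = rmse(y_te, [predict(x) for x in x_te])
            runs += 1
            if best is None or score < best["rmse"]:
                best = {"rmse": score, "nodes": n_hidden, "predict": predict}
    return best, runs

# Synthetic demo (not data from the study): yield anomaly falls as drought area grows
xs = [[i / 19] for i in range(20)]
ys = [0.5 - x[0] for x in xs]
best, runs = select_best_ann(xs, ys, restarts=3)  # restarts=100 would give 1000 runs
print(runs, best["nodes"], round(best["rmse"], 3))
```

Fixing `seed` makes the split and the initialisations reproducible, and repeating the call with several seeds and averaging the RMSEs is one way to address the variability the reviewer raises.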
  
  [7] Rice is a crop that may depend on regulated water or other sources. Is it the best example to fit with SPEI? SPEI is a meteorological index which, although it correlates very well with agricultural droughts at 3–12-month scales, does not detect the possible regulation that rice may have. Are we sure that production is rainfed over the whole area, or is there irrigation? These issues are not even mentioned in the results, which makes the text confusing to read.
  [Reply] Thanks. The three regions in which the methodology was tested show various configurations of rain-fed and irrigated production, with different percentages in each case, as shown in Figure 4.d (original manuscript). In regions 1 and 2, rain-fed and irrigated production each account for roughly 50%. In region 3, irrigated production reaches only 35%. The correlation analysis shows that in region 3 the R values between drought areas and crop yield are high. The high correlation arises because in region 3 water shortages are less mitigated by irrigation, so water anomalies are better detected by a drought index that considers precipitation. Our original document pointed this out (lines 408-413). In regions 1 and 2, although only half of the territory is rain-fed, the drought areas correlate well with crop yield in the months of the growing season, although to a lesser extent than in region 3.
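
As an illustration of the kind of correlation analysis referred to here, the sketch below computes a correlation coefficient between yearly drought-area fractions for a given month and the de-trended yield series. It assumes that the "R values" are Pearson coefficients, which the reply does not state explicitly; the function name `pearson_r` and all numbers are made up for the example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up yearly drought-area fractions for two months of the growing season
area_by_month = {
    "Jul": [0.10, 0.45, 0.30, 0.05, 0.60],
    "Aug": [0.20, 0.25, 0.35, 0.15, 0.30],
}
yield_anomaly = [0.3, -0.4, -0.1, 0.5, -0.6]  # de-trended yield, made up

for month, areas in area_by_month.items():
    print(month, round(pearson_r(areas, yield_anomaly), 2))
```

A strongly negative coefficient for a month would indicate that larger drought areas in that month go with lower yields, which is the pattern the reply describes for region 3.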
  
  [8] Following on from this, I have two further doubts. The first is a lack of clarity in Figure 3, which could be improved so that the time series of rice crops can be appreciated. The second concerns the use of a 0.5° scale. Is it consistent with the agricultural data available and suitable for calculating the area affected by droughts?
  
  Figure 5 could be improved or summarised and shown separately (including as supplementary material and/or tables).
  [Reply] Thanks. Figures 3 and 5 have been improved. In the case of Figure 3, the figure shows the de-trended crop yield data. We have also added in Figure 1 the time series of crop yield for each region.
  
  On the other hand, although the 0.5-degree resolution of the drought indicator might seem coarse, the results show it to be suitable for calculating drought areas. Finer resolutions could improve the results and even allow models to be built for smaller regions, such as the crop-district level. We use drought data from the SPEI global monitor to encourage the use of our methodology in other study areas. Where data are lacking, the SPEI data from this drought monitor could be used for an initial implementation.
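
To make the drought-area calculation concrete, a minimal sketch follows. It assumes that a cell counts as in drought when SPEI is at or below the threshold of -1 mentioned in the discussion (whether the comparison is inclusive is an assumption), and that cells on a regular latitude-longitude grid are weighted by the cosine of their latitude. The function name `drought_area_fraction` and the sample values are hypothetical.

```python
import math

def drought_area_fraction(spei_grid, lats, threshold=-1.0):
    """Area-weighted fraction of valid cells with SPEI <= threshold.

    spei_grid holds one row per latitude in `lats`; None marks cells that lie
    outside the region or have no data.
    """
    drought = total = 0.0
    for lat, row in zip(lats, spei_grid):
        weight = math.cos(math.radians(lat))  # relative area of a lat-lon cell
        for value in row:
            if value is None:
                continue
            total += weight
            if value <= threshold:
                drought += weight
    return drought / total if total else float("nan")

# Made-up SPEI values on a 0.5-degree grid for one month (not data from the study)
lats = [20.25, 20.75]
spei = [[-1.4, -0.2, None],
        [-1.1, 0.3, -1.6]]
print(round(drought_area_fraction(spei, lats), 3))
```

At 0.5 degrees the cosine weighting changes little within a small region, but it keeps the fraction comparable across regions at different latitudes.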
  
  [9] The limitations of the work are mentioned, but I think they could be expanded, so that the application of the methodology is justified despite the limitations.
  [Reply] Thank you very much for the issues raised. We have extended the discussion in our section on the methodology’s limitations (Sect. 4.5).
  
  [10] Are the presented results (equations) suitable for operational use in the region?
  [Reply] Yes, the constructed PR equations are suitable for operational use. We have also enhanced the text to indicate so. We also describe an example application of the PR equations.
  
  Kind regards,
  
  Vitali Diaz
  
  on behalf of the authors
  
  Citation: https://doi.org/10.5194/hess-2021-600-AC1
RC2: 'Comment on hess-2021-600', Anonymous Referee #2, 07 Mar 2022

Thank you for this proposal.

Major comment 1: "Machine-learning approach to crop yield prediction with the spatial extent of drought" does not exactly reflect the object of your work, as drought is mainly considered as agricultural drought through the SPEI index (i.e. the difference between precipitation and evapotranspiration) used as the "drought indicator" in your study. Please modify the paper title accordingly ("...with the spatial extent of agricultural drought") and state your definition of drought precisely in the Introduction. Please also explain whether soil characteristics (structure, nature) and farmers' operational practices could be taken into account in the ML approach in the future, and how they could influence the "agricultural drought area" calculation. Please discuss this.

Major comment 2: the correlation analysis seems very tricky and is hard to read. In general the figures are not pleasant and easy to read (Figures 6-7-8-9). There is too much text in the figures and tables.

Minor comment 1: The proposed approach is a large-scale approach that is disconnected from the drivers at the local scales of agricultural management (plot, farm or irrigation schemes) and masks the local variability of soil moisture. How does this complex approach provide added value for these local managers? Please discuss this.

Minor comment 2: Up to line 308 there is mention of "drought indicator" without really knowing what it is. Could you move the data section to the beginning of the paper?

Additional comment: The impact of a drought on agriculture cannot be measured by assessing crop yields because it depends on farming practices and cropping systems. In this approach, the joint result of the agricultural practices currently practiced during a period of plant water stress is evaluated.

Citation: https://doi.org/10.5194/hess-2021-600-RC2
- AC2: 'Reply on RC2', Vitali Diaz, 04 Apr 2022
  
  Response to the reviewers
  Thank you very much for the review and valuable suggestions. We describe below how we addressed each comment. Our responses begin with the word '[Reply]'. We have numbered each reviewer’s comment for ease of reading, and this numbering is indicated in brackets.
  Reviewer 2
  
  Thank you for this proposal.
  
  [1] Major comment 1: "Machine-learning approach to crop yield prediction with the spatial extent of drought" does not exactly reflect the object of your work, as drought is mainly considered as agricultural drought through the SPEI index (i.e. the difference between precipitation and evapotranspiration) used as the "drought indicator" in your study. Please modify the paper title accordingly ("...with the spatial extent of agricultural drought") and state your definition of drought precisely in the Introduction. Please also explain whether soil characteristics (structure, nature) and farmers' operational practices could be taken into account in the ML approach in the future, and how they could influence the "agricultural drought area" calculation. Please discuss this.
  [Reply] Thank you very much. Although it is true that drought indices were originally created to analyse a specific type of drought, by considering different aggregation periods it is also possible to identify types other than those for which they were originally created. This is the case in our study. For this reason, we do not emphasise agricultural drought in the title, because we are not using only the aggregation periods usually associated with agricultural drought.

  Our correlation analysis between crop yield and drought areas derived from different aggregation periods indicates that different types of drought (i.e. meteorological, agricultural, and hydrological) affect the crop yield to varying degrees throughout the months of the crop period. To build the ML models, this degree of impact could be taken into account by using different hydro-meteorological variables or by selecting different aggregation periods of the meteorological variables, as in this case.

  Given the above, we trust that the title is correct because our analysis is not focused on one type of drought, although it is focused on a specific dependent variable, namely crop yield.

  Regarding the soil characteristics, they can of course be considered within the ML approach. Whether to include them depends largely on the availability and accuracy of this information and on the need to do so. In our case, largely because cultivated cropland covers a high share of each region's area, it was not necessary to include this type of information in our ML approach.
  
  In future studies, we envision different topics in this regard. Here are some examples.
  
  The degree of influence of anthropogenic factors such as farmers' operational practices, of other factors such as soil conditions, and of natural phenomena such as drought could be included in the ML approach presented.

  One way to implement this is as follows. Inputs could be classified into three or more types: anthropogenic, natural, and different combinations of the two. For each set of inputs, a variable selection analysis could be carried out to identify the inputs that most strongly govern agricultural production. Subsequently, the ML models could be built following our proposed approach, i.e. ANN models and equations.

  An optimisation analysis could also be carried out to find the best combination of practices. This type of analysis can be done at different spatial and temporal scales, if the data allow it.

  Another line in which we foresee considerable development is the construction of ML models with the study area fully discretised into cells. The whole approach can be carried out for each cell. The availability of spatial data is crucial in this type of analysis; advances in remote sensing and the various Earth monitors developed in recent decades could facilitate the implementation of such methodologies with more advanced ML approaches.

  This research can also be extended to analyse climate change scenarios, either to elucidate the consequences or to find the best crop management practices to face them.
  
  An analysis that is not very complicated to implement and that would, however, include anthropogenic factors or other variables such as the type and conditions of the soil, would be, for example, to weight the drought areas with factors calculated with the additional variables. In this way, the drought areas would be modified to a greater or lesser extent, increasing or attenuating the effects of the drought.
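
One possible way to write the weighting idea in this paragraph is given below. The notation is introduced here for illustration and does not come from the manuscript: a_i is the area of cell i, f_i is a non-negative factor derived from the additional variables (soil, management), and the drought condition uses the SPEI threshold of -1 mentioned in the discussion.

```latex
A_w = \frac{\sum_{i} f_i \, a_i \, \mathbb{1}\!\left[\mathrm{SPEI}_i \le -1\right]}{\sum_{i} a_i}
```

With f_i = 1 for every cell this reduces to the unweighted drought-area fraction; factors above or below 1 would amplify or attenuate the contribution of a cell.
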
  [2] Major comment 2: the correlation analysis seems very tricky and is hard to read. In general the figures are not pleasant and easy to read (Figures 6-7-8-9). There is too much text in the figures and tables.
  [Reply] Thank you. We have modified and improved the description of the correlation analysis methodology as well as the presentation of the results. We have also included an additional figure in our manuscript to show the general scheme of how the input and output variables are tied.
  
  [3] Minor comment 1: The proposed approach is a large-scale approach that is disconnected from the drivers at the local scales of agricultural management (plot, farm or irrigation schemes) and masks the local variability of soil moisture. How does this complex approach provide added value for these local managers? Please discuss this.
  [Reply] Thank you. Methodologies that consider the variables you mention (agricultural practices, soil properties and condition, among others) are ideal; however, using them is not always possible. Our study presents a methodological alternative for predicting crop yield. In the study area, there are two current approaches for crop yield calculation: one based on field visits and a monitoring system based on remote sensing inputs. Their drawbacks and advantages are indicated in the Introduction. Our methodology complements these two tools and provides crop yield predictions that can be compared with them, with the difference that our ML approach produces results before the harvest (i.e. prediction).

  Our analysis could be extended further. In subsequent studies, an analysis of irrigation practices could be made to identify the best practices. Our results indicate that the increase in drought area is highly correlated with the decrease in crop yield. A more detailed analysis will make it possible to identify the best agricultural management practices, identify sub-regions more or less vulnerable to the effects of the different types of drought, and detect various demands on water resources throughout the different farming systems. See also the answer to comment [1].
  
  [4] Minor comment 2: Up to line 308 there is mention of "drought indicator" without really knowing what it is. Could you move the data section to the beginning of the paper?
  [Reply] Thanks. For better readability, we have reorganised Sections 2 and 3: Data is now presented first in Sect. 2, followed by Methodology in Sect. 3. The text in both sections has also been adjusted accordingly.
  
  [5] Additional comment: The impact of a drought on agriculture cannot be measured by assessing crop yields because it depends on farming practices and cropping systems. In this approach, the joint result of the agricultural practices currently practiced during a period of plant water stress is evaluated.
  [Reply] Thank you. We have updated our text.
  Kind regards,
  
  Vitali Diaz
  
  on behalf of the authors
  
  Citation: https://doi.org/10.5194/hess-2021-600-AC2

Viewed

Total article views: 2,621 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,929	610	82	2,621	109	158

Views and downloads (calculated since 24 Nov 2021)

Month	HTML	PDF	XML	Total
Nov 2021	262	63	1	326
Dec 2021	111	19	0	130
Jan 2022	71	13	1	85
Feb 2022	48	9	2	59
Mar 2022	48	19	3	70
Apr 2022	58	23	5	86
May 2022	32	13	1	46
Jun 2022	23	11	1	35
Jul 2022	25	6	0	31
Aug 2022	14	13	0	27
Sep 2022	12	13	0	25
Oct 2022	12	7	1	20
Nov 2022	11	9	0	20
Dec 2022	9	8	0	17
Jan 2023	9	12	0	21
Feb 2023	9	6	0	15
Mar 2023	16	5	1	22
Apr 2023	4	3	0	7
May 2023	9	5	1	15
Jun 2023	11	8	1	20
Jul 2023	20	7	1	28
Aug 2023	9	8	1	18
Sep 2023	19	11	2	32
Oct 2023	31	7	0	38
Nov 2023	15	0	0	15
Dec 2023	19	4	0	23
Jan 2024	27	3	1	31
Feb 2024	21	9	1	31
Mar 2024	31	13	4	48
Apr 2024	24	3	7	34
May 2024	16	4	3	23
Jun 2024	37	9	3	49
Jul 2024	11	5	2	18
Aug 2024	13	3	2	18
Sep 2024	20	4	0	24
Oct 2024	7	4	0	11
Nov 2024	11	4	1	16
Dec 2024	5	6	0	11
Jan 2025	7	4	0	11
Feb 2025	13	6	0	19
Mar 2025	9	6	7	22
Apr 2025	6	9	2	17
May 2025	15	5	0	20
Jun 2025	31	42	0	73
Jul 2025	27	9	1	37
Aug 2025	91	10	0	101
Sep 2025	318	9	2	329
Oct 2025	30	16	1	47
Nov 2025	29	25	3	57
Dec 2025	21	45	2	68
Jan 2026	30	8	5	43
Feb 2026	37	3	3	43
Mar 2026	37	10	1	48
Apr 2026	68	14	6	88
May 2026	22	11	2	35
Jun 2026	5	2	0	7
Jul 2026	3	7	1	11

Viewed (geographical distribution)

Total article views: 2,526 (including HTML, PDF, and XML) Thereof 2,526 with geography defined and 0 with unknown origin.

Cited

Latest update: 25 Jul 2026

Short summary

Drought effects on crops are usually evaluated through crop yield (CY). The hypothesis is that the drought spatial extent is a good input to predict CY. A machine learning approach to predict crop yield is introduced. The use of drought area was found suitable. Since it is currently possible to calculate drought areas within drought monitoring systems, the direct application to predict drought effects can be integrated into them by following approaches such as the one presented or similar.


1 citation as recorded by Crossref.