the Creative Commons Attribution 4.0 License.
Spatiotemporal changes of drought area as input for a machine-learning approach for crop yield prediction
Abstract. Climate change has increased the possibility of more severe and prolonged droughts worldwide, which requires innovative methods to predict their impacts on different sectors such as agriculture. Crop growth models calculate yield and variables related to plant development and are used for crop yield estimation, a useful variable for monitoring drought impacts. Although used for prediction, these crop models are not explicit forecasting models; they are limited to the physical assumptions reflected in their conceptual model. In addition, the input data availability, the spatial and temporal aggregation, and different sources of uncertainty make the crop yield prediction challenging. Given these limitations, machine learning (ML) models are often utilised following a multivariable forecasting approach, but their use with the spatial characteristics of droughts as input data is limited. This research explored the spatial extent of drought as input data for building an approach for predicting seasonal crop yield. This ML approach is made up of two components. The first includes polynomial regression (PR) models, and the second considers artificial neural network (ANN) models. This approach aimed to evaluate both types of ML models (PR and ANN) and integrate them into one operational tool. The logic is as follows: ANN models determine the most accurate predictions, but in practice, issues regarding data retrieval and processing can make the use of equations, i.e. PR, preferable. The proposed approach provides these PR equations with early and preliminary input to perform such calculations. The estimates can be further improved when the ANN models are run with the final input data. The results indicated that the empirical equations (PR) produced good predictions when using drought area as the input. ANN provides better estimates, in general. 
Research results show that the spatiotemporal changes of drought area and its temporal aggregation provide an important pre-processing alternative to implement ML models for drought impact prediction.
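As a rough illustration of the PR component described in the abstract, the sketch below fits a quadratic polynomial that maps a drought-area fraction to a de-trended yield anomaly. Everything in it is an assumption made for illustration: the data are synthetic, the polynomial degree is arbitrary, and the names `fit_pr_model`, `drought_area` and `yield_anomaly` are hypothetical rather than taken from the preprint.

```python
import numpy as np


def fit_pr_model(drought_area, yield_anomaly, degree=2):
    """Least-squares polynomial fit; returns coefficients, highest power first."""
    return np.polyfit(drought_area, yield_anomaly, deg=degree)


def rmse(observed, predicted):
    """Root mean square error between two equally long series."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


# Synthetic data for illustration only; none of these numbers come from the preprint.
rng = np.random.default_rng(0)
n_years = 49  # same length as a 1967-2015 record
drought_area = rng.uniform(0.0, 1.0, size=n_years)  # hypothetical fraction of the region in drought
assumed_coeffs = [-0.8, -0.3, 0.2]  # assumed "true" quadratic, highest power first
noise = rng.normal(0.0, 0.05, size=n_years)
yield_anomaly = np.polyval(assumed_coeffs, drought_area) + noise

# Fit on the first 39 years and evaluate on the last 10.
coeffs = fit_pr_model(drought_area[:39], yield_anomaly[:39], degree=2)
predicted = np.polyval(coeffs, drought_area[39:])
test_rmse = rmse(yield_anomaly[39:], predicted)

print("fitted coefficients:", coeffs)
print("hold-out RMSE:", test_rmse)
```

The appeal of such an equation is that, once fitted, it can be evaluated by hand or in a spreadsheet with a preliminary drought-area value. This matches the operational argument made in the abstract for keeping PR alongside ANN.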
This preprint has been withdrawn.
Interactive discussion
Status: closed
-
RC1: 'Comment on hess-2022-252', Anonymous Referee #1, 08 Jan 2023
General comments
The manuscript (MS) “Spatiotemporal changes of drought area as input for a machine-learning approach for crop yield prediction” by Diaz et al. argues that dynamic crop models are limited in predicting crop yield and therefore introduces a machine learning (ML) method for yield forecasting in three main rice-growing regions in India (1967-2015). Two ML approaches, polynomial regression (PR) and artificial neural network (ANN), were applied separately or in combination, using drought area as the single input for grain yield prediction. Since ML is now a practical and helpful tool in many applications, especially in agriculture (e.g. yield prediction, remote sensing), this study could provide meaningful approaches for yield forecasting that complement other existing approaches, especially in India.
The figures and visual features are informative and easy to follow. The English is well written. The 1967-2015 data record is also a strong point of this MS. However, there are some major issues: (i) the objectives of the work and MS are not well defined or clearly stated; (ii) the structure of the MS is not well designed or organised around concrete objectives; (iii) there is a lot of repetition and redundant information among sections, and the figures and tables are not well aligned with the main text; (iv) there is a lack of detailed discussion of how other work and other approaches (crop models + ML) have been applied elsewhere (in the introduction and discussion); (v) a critical issue is the use of drought area as model input without clarification of other factors or drought intensity. Given these issues, the MS cannot be accepted in its current state. Please see the detailed comments and suggestions below.
Abstract
Line 20-28: the description of the approach is a bit too long, while concrete (overall) statistical numbers for the results are lacking
Line 26: explicitly mentioned to PR, only two approaches here
Line 33: space after “implement”
Introduction
There is redundant information in the first paragraph (line 38-51) that needs to be rewritten.
The MS emphasizes the limitations of crop modeling, which has long been well established in crop yield simulation, yield prediction, and climate change impact assessment, as well as in understanding crop responses to different abiotic or biotic stresses. Both crop models and ML have uncertainties with regard to spatiotemporal input data when applied at larger scales and over long periods. The comparison between ML and crop models should be further elaborated in the text to convince the reader of the case for ML (line 52-59).
Similarly, the MS focuses on the spatial extent of drought and presents it as an issue that ML models could cover, but no detailed literature or references on this are given in the MS (line 68). Why is it important?
Line 78: what are the specific objectives: the impact of spatial extent on grain yield prediction in ML, determining which ML approach is best, or temporal aggregation effects? Please state them clearly.
Line 89-123 (paragraph “Crop yield prediction in India”): this section should be rewritten or merged with the section above to make the introduction more streamlined, with clear issues and associated objectives. The information in this section is repeated in section 2.
Line 99-109: the writing needs to be improved
Line 119: which are “other solutions”?
Has any previous study used drought area for yield prediction?
Materials and Methods
Sections 2 and 3 need to be restructured to be more concise and easier to read. It would be better to merge them into one section, e.g. “Materials and methods”, with further subheadings.
Line 131-135 repeats lines 99-102
Line 131: accessed when? Also, the name DAC does not match the name in line 95
Line 143, separately for each state?
Line 145: it is not clear; is it the spatial aggregation of two states with the average yield?
Figure 1: Why are the colors of the left and right figures so different? Is the same color scale used? What is the spatial resolution of the grid in the legend?
Line 156: the reference is missing from the reference list
Line 160: accessed when?
Line 162-163: this information is really important for the whole MS, so the explanation should not need to be repeated. Please state the aggregation clearly: how are DI and DA obtained? What is DA1 aggregated from, and over which period? The same applies to DA3, 6, 9 and 12, because it is confusing whether 12 months or 24 months are meant (line 245, 246).
Line 185-203 and section 2.2 largely repeat each other.
It is really important to explain further how SPEI is estimated, in terms of equations and variables, since this is the only input for the model. The MS mentions the limitations of different drought types many times; with further explanation, could this SPEI determine or clearly show drought? Which ET approach and climatic variables were used? Information on irrigation (if available) should be mentioned and described for all years.
Using a single input variable like DA might not be sufficient for yield prediction, and the soundness of the approach is rather weak. What about other climatic factors like temperature? What are the uncertainties of SPEI at global scales?
Figure 2 should come right after line 203
Line 207: what about pests and diseases, heat stress, ozone?
Line 229-237 repeats earlier text, for instance line 160-163 or 199-203
Section 3.2: it is too long and needs to be sharpened because of a lot of repeated information
Line 280: Table 2 should be mentioned right away. Lines 280 to 289 should be in the results and discussion section, i.e. line 457
Section 3.3 needs to be restructured to follow the sequence of the equations
Section 3.4 is also too long and overlaps with the introduction. Did the work choose the FFNN?
Line 346: is that a common threshold for different objects? Is there any justification for using this threshold for a single-input-variable model?
Line 350: is that “period” or whole dataset?
Section 3.5: various approaches are mentioned, but which one did you choose and what criteria were used?
Results and Discussion
It is too lengthy and contains repeated information. Substantial improvement in the writing is required to make the MS well structured, following the objectives, with good discussion and reflection on previous studies
Line 362-366: legend does the job.
Line 368: “theree” -> “three”
Line 394: the decrease and maximum of what?
Line 394: where is Figure 4? It should be shown directly.
Is there any explanation for why the de-trended yield of region 1 fluctuated much more in 2003-2015 than that of regions 2 and 3 in the same period?
Line 403: why are the three regions so different although only the Kharif yield was presented? Are there any previous studies?
Line 407: what is SPEI6?
Line 411-416 about figure 5: peak of what and in which figure? 5a, 5b or 5c? Please be more precise
Figure 5: how many samples (n) does each point in 5a, b and c come from?
Line 440: “rein” -> “rain”
Line 441: data for 2014, or for which years? Or the average of which years? This is very important information that, together with SPEI and DA, should be used to interpret the input data and yield prediction results.
Line 447 (figure 6): “, respectively”. Is that the correlation coefficient at a significance level of 95%?
Line 466-470 is redundant since it was already mentioned in the materials and methods.
Section 4.3: too much information is shown at the same time (fig. 7, 8 and 9 at once), with little discussion or comparison with other literature for this section. Has any similar study been done elsewhere?
Is there any explanation for why both models are less accurate from around 2000-2015 than for 1967-2000, for instance for region 1 and region 3? The authors mention the “spatial extent” considered in the models, but this is not well discussed.
Section 4.4: Tables 4, 5 and 6 could be moved to the Supplementary material, if possible, since they are not discussed much or are not informative.
Line 539, 547, and 556: “moth” -> “month”
Section 4.5: the limitations are listed, but the discussion of the results does not show how they affected model performance, and they are not clearly discussed or compared with other studies.
Point 6 (line 580-581) is not clear. In fact, India could provide 3 sets of yield data per year (three growing seasons). Three sets of yield could correspond to at least three periods of temporal aggregation. Why did the work not use three sets of yield data, which would give more grain yield data to pair with monthly DA?
Section 4.6: Repetition of the introduction and too general, without literature comparison and discussion.
Line 596-598: is similar to point 2 of Section 4.5
Section 4.7: a lot of information repeats the previous sections 4.5 and 4.6
Citation: https://doi.org/10.5194/hess-2022-252-RC1 -
AC2: 'Reply on RC1', Vitali Diaz, 01 Oct 2023
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2022-252/hess-2022-252-AC2-supplement.pdf
-
CC1: 'Comment on hess-2022-252', Xiaofeng Li, 08 Jun 2023
The manuscript uses spatial range as an input variable and utilizes machine learning algorithms for crop yield prediction. This is an interesting and innovative study. This study can provide a new approach and method for crop yield prediction, while also reducing the dependence of crop models on input data.
Although this study proposes a novel method, the final results have good accuracy and have been compared and analyzed with site observation data, confirming the reliability of the results. However, I still have some questions about some of the content of the manuscript, and the author still needs to revise it and add some explanations. At the same time, there are some format problems in the manuscript, and there are also some citation format problems in the references. The following are detailed comments and suggestions:
Specific modification suggestions:
Data:
- In this study, rice is the research object, but the characteristics of rice cultivation in the study area are not introduced. In addition, is the rice in this study area significantly affected by drought? Relevant content should be supplemented.
- Although SPEI has a wide range of applications for drought monitoring, this study should also supplement some literature on this indicator in similar research areas and similar research objectives.
- Should the author supplement the sources of land use type data?
Results and discussions:
- From Figure 5, it can be observed that the correlation between yield after trend removal and drought area changes over time, but overall, the correlation coefficient is relatively small. Can this result support subsequent analytical applications?
- From Figure 7 to Figure 9, it can be found that the root mean square error of simulated yield in the three study areas has very high accuracy. Should the applicability and differences between the two methods be appropriately supplemented?
- In section 4.4, a large number of models are listed. Can the author discuss the universality of these models? In addition to accuracy, the applicability and ease of application of the model are key considerations for its future construction.
- In section 4.5, the threshold of SPEI indicators should be supplemented with relevant basis.
Citation: https://doi.org/10.5194/hess-2022-252-CC1 -
AC1: 'Reply on CC1', Vitali Diaz, 01 Oct 2023
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2022-252/hess-2022-252-AC1-supplement.pdf
-
RC2: 'Comment on hess-2022-252', Anonymous Referee #2, 22 Jul 2023
In this manuscript, the authors employ data-driven techniques to predict rice crop yields in India. The paper's objective is clear; however, the methodology is not rigorously employed, the novelty is limited, and the document's structure could be enhanced. In order to improve the study, the authors could consider the following points:
- In lines 64-68 you mention that ML techniques have already been tested to predict crop yield but that “the use of spatial characteristics of drought such as its spatial extent has not been fully explored to crop yield prediction”. Does this mean that the only conceptual novelty of this work is that it considers a new variable?
- The authors write in the Modeling Limitations section that insufficient crop yield data is an issue; however, the last year for which crop yield data is available is 2015. Is it possible to extend the dataset? Much more importantly, the basis of data-driven techniques (of which ML algorithms are part) is that a lot of information is available, and the algorithm can learn from the data. If you don’t have enough information, how can you justify the application of an ML algorithm?
- Some of the plots presented in Figure 7 show a serious problem. Your predictions present a lag of one year (the red curve is shifted one year to the right). This usually indicates that an auto-regressive algorithm (like the one that you are using) is not capable of learning and that the prediction of year t+1 is strongly influenced by the crop yield of year t.
- Go through the entire document and check English usage and typos.
- I suggest that the authors revisit the document and avoid repeating information (unless strictly necessary) and avoid presenting graphs with excessive information.
- You need to improve the description of your work in the introduction. As it is right now, it is unclear. What do you mean by “the crop yield calculation is clear”? What do you mean by “is not as clear”? What does “The ANN is expected to be used with the final input data” mean?
- Did you evaluate the cross-correlation between input variables? Is it possible that you provide redundant information to the algorithm?
- In the results section you write sentences using terms like “perhaps” and “may”. However, the results should be able to prove or reject a hypothesis. I strongly recommend that you avoid that type of sentence in the work.
Citation: https://doi.org/10.5194/hess-2022-252-RC2 -
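The one-year lag raised by Referee #2 can be checked numerically. The sketch below is an assumed diagnostic, not something taken from the manuscript or the review: it compares the correlation of predictions with same-year observations against their correlation with observations from the previous year. The function name `lagged_correlation` and the synthetic series are hypothetical.

```python
import numpy as np


def lagged_correlation(observed, predicted, lag):
    """Pearson correlation between predicted[t] and observed[t - lag] (lag >= 0)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if lag > 0:
        # Align each prediction with the observation made `lag` years earlier.
        observed, predicted = observed[:-lag], predicted[lag:]
    return float(np.corrcoef(observed, predicted)[0, 1])


# Synthetic example, not data from the preprint: a pure persistence forecast,
# i.e. the prediction for year t is simply the observation of year t - 1.
rng = np.random.default_rng(1)
observed = rng.normal(size=49)
persistence = np.concatenate(([observed[0]], observed[:-1]))

r_same_year = lagged_correlation(observed, persistence, lag=0)
r_previous_year = lagged_correlation(observed, persistence, lag=1)

print("correlation with same-year observations:", r_same_year)
print("correlation with previous-year observations:", r_previous_year)
```

If the lag-1 correlation is clearly the higher of the two for a real model, its predictions behave much like a persistence forecast, which is the pattern the referee describes for Figure 7.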
AC3: 'Reply on RC2', Vitali Diaz, 01 Oct 2023
The comment was uploaded in the form of a supplement: https://hess.copernicus.org/preprints/hess-2022-252/hess-2022-252-AC3-supplement.pdf
Viewed
- HTML: 1,019
- PDF: 276
- XML: 63
- Total: 1,358
- BibTeX: 40
- EndNote: 50