When ancient numerical demons meet physics-informed machine learning: adjoint-based gradients for implicit differentiable modeling

Song, Yalan; Knoben, Wouter J. M.; Clark, Martyn P.; Feng, Dapeng; Lawson, Kathryn; Sawadekar, Kamlesh; Shen, Chaopeng

doi:10.5194/hess-28-3051-2024

Articles | Volume 28, issue 13

https://doi.org/10.5194/hess-28-3051-2024

© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article | 15 Jul 2024

Final revised paper (published on 15 Jul 2024)
Preprint (discussion started on 09 Nov 2023)

Interactive discussion

Status: closed

RC1: 'Comment on hess-2023-258', Ilhan Özgen-Xian, 09 Dec 2023

Dear authors, Dear editor,
Here is my review of the submitted work. I recommend accepting the manuscript after minor revisions.
Kindly
Ilhan Özgen-Xian
Summary
The authors explore the use of the adjoint method to replace automatic differentiation for differentiable models, with application to lumped hydrological modelling with implicit time integration. This is proposed to overcome some of the limitations of the automatic differentiation when using implicit time integration, namely excessive memory usage when using iterative linear system solvers and vanishing gradients.
The authors couple the hydrological model HBV to an LSTM and train the coupled hybrid model (NN-HBV) using CONUS and CAMELS data. The improvement in model performance due to the implicit time integration is demonstrated convincingly.
In addition, the authors add the process of capillary rise to the model and show that this improves the model performance in all model variants.
Overall, the presented work is of interest to the readers of Hydrology and Earth System Sciences. The manuscript is well written.
General comments and questions
1. The authors convincingly make an argument for implicit time integration. The forward Euler time stepping used in this work is indeed at a disadvantage if fixed time steps are used. However, it is not clear to me how higher order explicit time integration methods such as schemes from the explicit Runge-Kutta family (RK) would perform in comparison to the implicit one. If I understood correctly, some of the numerical issues mentioned in the manuscript might also be addressed by (adaptive) multistep schemes of this type. The advantage of RK-type schemes in this context is that the number of computations per time step is known a priori. In contrast, the Newton-Raphson iterative solver may require any number of steps until convergence. High order RK schemes, for example the standard RK45 or the adaptive RK-Fehlberg method, could also potentially benefit from the adjoint method presented in this paper to avoid excessive memory usage. Perhaps the authors can comment on this.

2. The authors mention that the Newton-Raphson solver introduces some overhead to the computation. On average, in the results shown in this paper, how many iteration steps were necessary for the solver to converge?
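The contrast raised in the two general comments above can be illustrated with a minimal sketch. It assumes a toy storage equation dS/dt = p - k*S^2 with made-up values for p, k, the initial storage, and the time step; none of these come from the manuscript, and the functions `forward_euler_step` and `implicit_euler_step` are hypothetical names. The sketch takes one large step with forward Euler and one with backward Euler solved by Newton-Raphson, and reports how many Newton updates were needed.

```python
# Toy storage ODE, dS/dt = P - K*S**2. The equation and all values are
# illustrative assumptions, not the HBV equations from the paper.
P, K = 0.1, 1.0


def forward_euler_step(s0, dt):
    """Explicit step: evaluates the flux at the old state only."""
    return s0 + dt * (P - K * s0 * s0)


def implicit_euler_step(s0, dt, tol=1e-10, max_iter=50):
    """Backward Euler: solve g(s) = s - s0 - dt*(P - K*s**2) = 0 by Newton-Raphson.

    Returns the new state and the number of Newton updates performed.
    """
    s = s0
    for n_updates in range(max_iter + 1):
        g = s - s0 - dt * (P - K * s * s)
        if abs(g) < tol:
            return s, n_updates
        dg = 1.0 + 2.0 * dt * K * s  # derivative dg/ds
        s -= g / dg
    raise RuntimeError("Newton-Raphson did not converge")


# One large step from a full store: the explicit update overshoots below zero,
# while the implicit update stays at a physically meaningful storage.
s_explicit = forward_euler_step(5.0, 1.0)
s_implicit, n_newton = implicit_euler_step(5.0, 1.0)
print(s_explicit, s_implicit, n_newton)
```

In this toy setting the number of Newton updates depends on the state and the tolerance, whereas an explicit Runge-Kutta step would use a number of stages fixed in advance, which is the trade-off the first comment asks about.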
Minor comments
1. P.2, L.70: "graphical processing units" should be "graphics processing units"

2. P.3, LL.105ff.: Does "elliptic operator" in this context correspond to the Laplacian? If so, some of the examples might require some annotation. The Saint-Venant equation only contains Laplacian operators if molecular/turbulent diffusion is accounted for. Many forms of the Saint-Venant equation omit these terms, for example (García-Navarro et al., 2019, doi:10.1007/s10652-018-09657-7; LeVeque et al., 2011, doi:10.1017/S0962492911000043).

3. P.3, LL.105ff. (continued) When I looked at the paper by Aboelyazeed et al. (2023) (cited by the authors), I couldn't see Laplacians in the Farquhar model equations.

4. P6, L.209: "The same forcings ... was used" should be "The same forcings ... were used"

5. P.12, L.335: Should it be Eq. (28) instead of Eq. (27)? Maybe I am misunderstanding something.

6. P.14, L.398: The authors state that the mass balance preservation of the adjoint-driven NN-HBV model might be the reason behind the improved model performance. I don't understand why the mass conservation should significantly differ from the explicit sequential NN-HBV model if the hydrological process representation remains untouched. Is this related to the use of thresholds to avoid negative storages? Can the authors elaborate a bit more?

7. P.24, L.580: The additional computational cost introduced by the implicit solver is quite substantial (18 h vs. 133 h), suggesting either poor convergence or large communication overhead in the implicit scheme.

Citation: https://doi.org/10.5194/hess-2023-258-RC1
- AC1: 'Reply on RC1', Chaopeng Shen, 07 Jan 2024
  
  Thank you for your constructive comments. Please see the attached file for our responses
  
  Citation: https://doi.org/10.5194/hess-2023-258-AC1
RC2: 'Comment on hess-2023-258', Uwe Ehret, 18 Dec 2023

Dear Editor, dear Authors,
Please find my review in the attachment.
Yours sincerely, Uwe Ehret

Citation: https://doi.org/10.5194/hess-2023-258-RC2
- AC2: 'Reply on RC2', Chaopeng Shen, 07 Jan 2024
  
  We thank Dr. Ehret for the constructive suggestions. Unfortunately, the interactive discussion ended over the AGU and winter break time frame; we wonder if it could be extended so that we can have more discussion.
  Please see the attached file for our replies. Briefly, we respectfully disagree with the "reject" recommendation and with the suggestion to run on very small time steps, for several reasons:
  
  1. The main point of the paper is to enable implicit schemes, not to say that explicit schemes are unusable. We will revise to clarify this. It is well known that the two kinds of scheme suit different problems.
  In practice, using only explicit schemes can indeed run into many issues in the context of differentiable modeling on large datasets:
  2. Running very small time steps with automatic differentiation can incur huge GPU memory use, limiting the window length allowable for training.
  3. Minibatch parallelism is super important for learning. However, adaptive time-stepping schemes that adapt to the numerical characteristics of each basin are not friendly to minibatching on the GPU, which prefers more uniform operations across the batch.
  4. There are numerical reasons, explored in previous studies (Clark et al., 2010), why implicit schemes are preferred. This paper's main purpose is to enable implicit schemes. Also, matching input forcing functions/dynamic parameters to those tiny time steps requires interpolation and can add much to the complexity.
  Considering all of these challenges, we argue that it is a bit unfair to ask us to perform the explicit simulations on very small time steps. Our current opinion is that running such a scheme at small time steps at large scale is actually quite difficult with current computing constraints. Rather, we welcome the community to show such comparisons. We also came to the current solution not by chance but through a long exploration process. Please see more detail in the attached PDF.
  
  Citation: https://doi.org/10.5194/hess-2023-258-AC2
- AC3: 'Reply on RC2', Chaopeng Shen, 12 Jan 2024
  Please allow us to revise our response here.
  The main point of the paper is to enable implicit methods, not to discourage explicit ones. Both are useful, and implicit solvers have long been known to be necessary for many problems.
  
  If the editor insists, we can add some hourly model (or RK, but not both) results, but this should not be a reject decision because we can already run this and can show some results rather easily. Matching inputs to the accurate hourly time steps is harder and is out of the scope. Adaptive time stepping is also out of scope.
  
  We have in fact already run some initial tests. Changing to an hourly model also raised the computational cost: at least 50% more computational time and 30% more RAM than the implicit daily model. We are attempting to make the implicit code more efficient, as it should have considerable room for improvement. The room for improvement seems smaller with the explicit model.
  
  Minibatch parallelism is super important for learning and we always need to keep this in mind.
  
  There are numerical reasons, explored in previous studies (Clark et al., 2010), why implicit schemes are preferred.
  
  We thank the reviewer and editor for your considerations.
  
  Citation: https://doi.org/10.5194/hess-2023-258-AC3
  - RC3: 'Reply on AC3', Uwe Ehret, 17 Jan 2024
    
    Dear Authors, dear Editor,
    
    Please see my reply in the attachment.
    Yours sincerely, Uwe Ehret
    
    Citation: https://doi.org/10.5194/hess-2023-258-RC3
    
    - AC4: 'Reply on RC3', Chaopeng Shen, 03 Feb 2024
    
    Dear Dr. Ehret,
    Please see the attached PDF for our response.
    
    Citation: https://doi.org/10.5194/hess-2023-258-AC4

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

ED: Reconsider after major revisions (further review by editor and referees) (10 Feb 2024) by Ralf Loritz

Dear Song et al.,
First of all, I would like to commend the authors for this interesting work, which aligns well with the scope of HESS. Furthermore, I would like to underscore the great discussion with both reviewers. In your last response to Dr. Ehret, you outlined a promising strategy for enhancing your manuscript, and I eagerly anticipate a revised and more streamlined version.

Two minor comments to augment the reviewers' feedback:

1. The term “outperform” and the pronounced emphasis on model comparison appear somewhat misaligned with your model's results and also seem to occupy excessive space in this manuscript. How much does your model's performance vary by merely re-training with an alternative weight initialization method, adjusting your hyperparameters, or by interchanging the training and testing datasets? My intention is constructive; I am not suggesting further tests but rather expressing that the approach and concept itself are intriguing and novel. I would have appreciated a deeper explanation and discussion of how this work could be extended to problems beyond bucket models and/or a nice schematic figure with some extra text explaining how this is implemented in the network, particularly because your code is not shared yet, which seems unnecessary because the preprint in HESS is open.

2. "Since daily forcing and streamflow data are readily available and accessible, while hourly input data is often more challenging to acquire, our study focuses exclusively on daily hydrological modeling using conceptual models." A quick Google search (10.5281/zenodo.4072700) revealed the existence of datasets (as well as an API of the USGS), and it's likely there are more I am unaware of. This is merely a suggestion should you wish to conduct your experiments at an hourly resolution. I do not think these experiments are necessary here (although interesting); however, I found the argument regarding the lack of data to be somewhat tenuous.

Again congratulations on the nice work and I look forward to the revised manuscript.

Sincerely,

Ralf Loritz

AR by Chaopeng Shen on behalf of the Authors (23 Mar 2024)

ED: Referee Nomination & Report Request started (28 Mar 2024) by Ralf Loritz

RR by Ilhan Özgen-Xian (05 Apr 2024)

RR by Uwe Ehret (26 Apr 2024)

ED: Publish as is (06 May 2024) by Ralf Loritz

AR by Chaopeng Shen on behalf of the Authors (16 May 2024)

Short summary

Differentiable models (DMs) integrate neural networks and physical equations for accuracy, interpretability, and knowledge discovery. We developed an adjoint-based DM for ordinary differential equations (ODEs) for hydrological modeling, reducing distorted fluxes and physical parameters from errors in models that use explicit and operation-splitting schemes. With a better numerical scheme and improved structure, the adjoint-based DM matches or surpasses long short-term memory (LSTM) performance.
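As a rough illustration of what an adjoint-based gradient for an implicit step can look like, the following minimal sketch uses a single scalar state. It is not the authors' implementation: the toy equation dS/dt = p - k*S^2, the loss (S1 - obs)^2, all numerical values, and the function names are assumptions made for the example. The backward Euler step defines S1 implicitly through g(S1, k) = 0; the adjoint variable lambda solves (dg/dS1) * lambda = dL/dS1, and the parameter gradient is dL/dk = -lambda * dg/dk, so no derivative information from the Newton iterations needs to be stored. A finite-difference estimate is computed alongside as a check.

```python
DT = 1.0  # fixed time step (assumed)


def residual(s1, s0, p, k):
    """Backward Euler residual g(s1; k) for dS/dt = p - k*S**2."""
    return s1 - s0 - DT * (p - k * s1 * s1)


def solve_step(s0, p, k, tol=1e-12, max_iter=50):
    """Newton-Raphson solve of residual(s1) = 0; no iteration history is kept."""
    s1 = s0
    for _ in range(max_iter):
        g = residual(s1, s0, p, k)
        if abs(g) < tol:
            return s1
        s1 -= g / (1.0 + 2.0 * DT * k * s1)
    raise RuntimeError("Newton-Raphson did not converge")


def loss_and_adjoint_grad(s0, p, k, obs):
    """Return the loss (s1 - obs)**2 and dL/dk from the adjoint relation."""
    s1 = solve_step(s0, p, k)
    loss = (s1 - obs) ** 2
    dL_ds1 = 2.0 * (s1 - obs)
    dg_ds1 = 1.0 + 2.0 * DT * k * s1
    dg_dk = DT * s1 * s1
    lam = dL_ds1 / dg_ds1      # adjoint equation: dg_ds1 * lam = dL_ds1
    return loss, -lam * dg_dk  # dL/dk = -lam * dg/dk


def finite_difference_grad(s0, p, k, obs, eps=1e-6):
    """Central-difference estimate of dL/dk, used only as a check."""
    hi = (solve_step(s0, p, k + eps) - obs) ** 2
    lo = (solve_step(s0, p, k - eps) - obs) ** 2
    return (hi - lo) / (2.0 * eps)


loss, grad_adjoint = loss_and_adjoint_grad(5.0, 0.1, 1.0, 1.5)
grad_fd = finite_difference_grad(5.0, 0.1, 1.0, 1.5)
print(grad_adjoint, grad_fd)
```

With several coupled states, the scalar division would become a linear solve with the transposed Jacobian of the residual; the scalar case is used here only to keep the sketch short.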