Articles | Volume 29, issue 21
https://doi.org/10.5194/hess-29-6257-2025
© Author(s) 2025. This work is distributed under the Creative Commons Attribution 4.0 License.
Fully differentiable, fully distributed rainfall-runoff modeling
Download
- Final revised paper (published on 13 Nov 2025)
- Preprint (discussion started on 07 Mar 2025)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2024-4119', Shijie Jiang, 23 Mar 2025
  - AC1: 'Reply on RC1', Fedor Scholz, 17 Apr 2025
- CC1: 'Comment on egusphere-2024-4119', Benedikt Heudorfer, 01 Apr 2025
  - AC2: 'Reply on CC1', Fedor Scholz, 17 Apr 2025
- RC2: 'Comment on egusphere-2024-4119', Peter Nelemans, 08 Apr 2025
  - AC3: 'Reply on RC2', Fedor Scholz, 17 Apr 2025
    - RC3: 'Reply on AC3', Peter Nelemans, 18 Apr 2025
      - AC5: 'Reply on RC3', Fedor Scholz, 22 Apr 2025
- CC2: 'Comment on egusphere-2024-4119', Tianfang Xu, 15 Apr 2025
  - AC4: 'Reply on CC2', Fedor Scholz, 17 Apr 2025
Peer review completion
AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload
ED: Publish subject to minor revisions (further review by editor) (06 May 2025) by Daniel Klotz
AR by Fedor Scholz on behalf of the Authors (26 May 2025)
Author's response
Author's tracked changes
Manuscript
ED: Publish subject to revisions (further review by editor and referees) (02 Jun 2025) by Daniel Klotz
ED: Publish subject to revisions (further review by editor and referees) (19 Jul 2025) by Daniel Klotz
AR by Fedor Scholz on behalf of the Authors (21 Jul 2025)
Author's response
Author's tracked changes
Manuscript
ED: Referee Nomination & Report Request started (22 Jul 2025) by Daniel Klotz
RR by Peter Nelemans (20 Aug 2025)
RR by Shijie Jiang (27 Aug 2025)
ED: Publish subject to minor revisions (review by editor) (11 Sep 2025) by Daniel Klotz
AR by Fedor Scholz on behalf of the Authors (23 Sep 2025)
Author's response
Author's tracked changes
Manuscript
ED: Publish as is (28 Sep 2025) by Daniel Klotz
AR by Fedor Scholz on behalf of the Authors (06 Oct 2025)
This is a thoughtful and well-executed study. The authors propose DRRAiNN, a fully differentiable and fully distributed neural architecture for rainfall-runoff modeling. The model design is novel and ambitious. In general, the model is interesting and many components are well motivated; however, several parts of the paper would benefit from clearer framing, stronger justification, and more focused discussion. Below are my detailed comments and suggestions.
1. In the Introduction and Related Work, the authors devote a large portion of the text (up to line 67) to reviewing process-based and data-driven hydrological models. However, much of this content is already well-established and could be condensed. More importantly, the link between the challenges described in the background and the specific research goal of DRRAiNN is not clearly established. If the main goal is to improve the interpretability of ML models or to incorporate physical constraints into neural architectures, then the extended discussion of PBM limitations seems unnecessarily long. The core research motivation (around line 74) is somewhat buried between the general background and the introduction of differentiable modeling.
If the intended contribution is to implement a distributed hydrological model using NNs, then the literature review does not sufficiently acknowledge recent progress in this direction. In the Related Work section, the authors list several NN-based rainfall-runoff models, but the discussion is somewhat narrow. For instance, the paper states that "not many" fully distributed data-driven models exist (without citation), but several relevant models with explicit routing modules have been proposed in the last two years (e.g., https://doi.org/10.1029/2023WR036170, https://doi.org/10.1029/2023WR035337, https://doi.org/10.1016/j.jhydrol.2024.132165). These are not acknowledged or compared.
I suggest shortening the background section, clearly identifying the research gap, and referencing recent distributed or physically guided models to more accurately position DRRAiNN in the current landscape. Importantly, to clarify the contribution, it would be helpful to include a concise statement explaining why a fully differentiable, fully distributed rainfall-runoff model is needed and what specific challenge it addresses, as motivated by prior work.
2. I like the general idea behind the model -- using separate neural networks to handle spatial and temporal dynamics makes a lot of sense. I do have a few suggestions:
I) One thing I found a bit confusing is the use of the term “runoff.” Even though the authors explain it is not the physical runoff, it still might be misleading. The variable comes from the LSTM’s hidden state and goes through several layers, including a CNN that does not consider flow direction or water accumulation. So it is more like a learned feature than an actual variable. Maybe calling it something like “runoff embedding” or “runoff representation” would make things clearer.
II) The model uses solar radiation to model local ET. I get the idea, but I am not fully convinced it is enough. Radiation tends to be spatially smooth, especially at the catchment scale. I would expect that including vegetation-related variables (like LAI, NDVI, or GPP) could help better capture spatial variability in ET.
III) A few model choices could use more explanation. For example, why use hidden size 4 for the LSTM and 8 for the GRUs? These seem small for a model working across space and time.
IV) The link between the rainfall-runoff module and the discharge model (although I think alternative terms might help avoid confusion with traditional concepts) seems functionally effective but conceptually weak. Right now, it is unclear what information the embedding actually carries for downstream discharge estimation, and whether it supports realistic routing behavior. Some clarification on what this embedding is supposed to represent in hydrological terms would be helpful.
V) Since the model emphasizes interpretability, it might be useful to consider whether the internal states could reflect more structured hydrological components. For example, separating fast and slow flow signals, or introducing latent variables that relate to soil moisture or baseflow. I understand the authors may not follow a process-based philosophy, but some explanation of why those processes were not treated as explicit model components, while lateral propagation was, would be helpful.
3. I understand the idea behind using symmetry-based data augmentation for generalization in purely statistical terms, but I am not sure if it makes hydrological sense. Rotating or flipping the DEM and precipitation might result in flow directions that are not physically meaningful, especially since the river network and station layout stay fixed. Some clarification or discussion around this would be useful.
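To illustrate the concern with a toy D8 example (my own sketch, not the authors' pipeline): rotating the DEM alone changes the compass direction of steepest descent, so the rotated forcing becomes inconsistent with the fixed river network and station layout unless those are transformed as well.

```python
import numpy as np

def d8_direction(dem, r, c):
    """Index (0=N, 1=NE, ..., 7=NW) of the steepest-descent neighbour of (r, c)."""
    offsets = [(-1, 0), (-1, 1), (0, 1), (1, 1),
               (1, 0), (1, -1), (0, -1), (-1, -1)]
    drops = []
    for dr, dc in offsets:
        rr, cc = r + dr, c + dc
        if 0 <= rr < dem.shape[0] and 0 <= cc < dem.shape[1]:
            drops.append(dem[r, c] - dem[rr, cc])
        else:
            drops.append(-np.inf)
    return int(np.argmax(drops))

# Toy DEM with a single steepest descent to the east at the centre cell.
dem = np.array([[3., 2., 3.],
                [3., 2., 0.],
                [3., 2., 3.]])
print(d8_direction(dem, 1, 1))            # 2: the centre cell drains east

# After a 90° counter-clockwise rotation of the DEM alone, the same cell
# drains north -- but the river network and stations have not rotated.
print(d8_direction(np.rot90(dem), 1, 1))  # 0: now drains north
```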
4. In model evaluation, one thing that could make the analysis stronger is to also report metrics for specific types of hydrological conditions — for instance, during rising limbs, low flows, or flood events. This would help clarify whether the model is merely capturing average behavior or actually learning the dynamics that matter most.
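Conditional metrics of this kind can be computed with simple boolean masks over the observed hydrograph; a minimal sketch with synthetic data (the percentile thresholds are illustrative choices on my part):

```python
import numpy as np

def nse(obs, sim, mask=None):
    """Nash-Sutcliffe efficiency, optionally restricted to a boolean mask.

    Note: the benchmark mean is taken over the masked subset, which makes
    low-flow NSE a deliberately harder target than the full-record score.
    """
    if mask is not None:
        obs, sim = obs[mask], sim[mask]
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

rng = np.random.default_rng(0)
obs = rng.gamma(2.0, 10.0, size=365)          # synthetic daily discharge
sim = obs + rng.normal(0.0, 2.0, size=365)    # synthetic model output

low  = obs < np.quantile(obs, 0.3)   # low-flow days
high = obs > np.quantile(obs, 0.9)   # flood-like days
print(nse(obs, sim), nse(obs, sim, low), nse(obs, sim, high))
```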
5. The current attribution analysis focuses entirely on where the rainfall matters spatially. However, hydrological response is also highly dependent on timing (e.g., time to peak). It might be worth considering how the model distributes attention over past precipitation steps, or whether it systematically over- or under-reacts to delayed signals. Even a simple attention plot or error histogram over lag times could be insightful.
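The lag summary I have in mind could be as simple as collapsing a (lag, y, x) attribution tensor over space; the tensor below is synthetic and its shape is an assumption on my part, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical gradient attribution for one forecast, shape (lag, y, x);
# lag 0 is the most recent precipitation step.
attr = np.abs(rng.normal(size=(30, 20, 20)))
attr *= np.exp(-np.arange(30.0) / 5.0)[:, None, None]  # fake recency decay

mass = attr.reshape(30, -1).sum(axis=1)
mass /= mass.sum()                         # fraction of attribution per lag
mean_lag = (np.arange(30) * mass).sum()    # attribution-weighted mean lag
print(mean_lag)  # compare against the observed time to peak per station
```

Plotting `mass` per station against the station's observed time to peak would directly show systematic over- or under-reaction to delayed signals.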
6. The comparison of hydrographs in Fig. 4 only provides a qualitative perspective. I would suggest including some quantitative metrics to support the statements made in Sect. 4.1. For example, it is mentioned that the large peak on day 80 in Lauffen and Rockenau is underestimated by both models -- is this meant to suggest that the error comes from the input data (e.g., underestimation in the precipitation forcing)? If so, it would be helpful to state this explicitly. Otherwise, it could still be related to limitations in the model architecture or the autoregressive setup. Also, only one model instance is shown in the figure. I wonder how consistent the five trained seeds are -- do they show similar hydrographs, especially in the later part of the prediction window, or is there high variability? Some indication of uncertainty or model spread would help.
7. There are a couple of claims in Section 4.2 that could benefit from clarification or more substantive support.
I) The authors mention that different seeds lead to different behaviors, with some instances performing better on short lead times and others on long lead times. It is not clear how weight initialization alone would systematically bias a model toward short- or long-term prediction. From a modeling perspective, are there specific components (e.g., LSTM, GRU) that are more sensitive to initialization in this regard?
II) The explanation that some stations are more difficult to model due to unobservable underground flows or pipes feels vague and speculative (lines 401-407). Since the authors already emphasize that DRRAiNN is distributed and physically interpretable, it would be more convincing to check whether these “hard-to-predict” stations differ in observable properties in their controlling catchments, such as elevation range, forest cover, drainage density, geology (e.g., using map overlay analysis with the HydroATLAS dataset).
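Concretely, one could test whether a candidate attribute (karst fraction, drainage density, ...) separates the hard-to-predict stations from the rest. A minimal permutation test on invented numbers (the station counts and attribute values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical catchment attribute (e.g., karst fraction) per station group.
easy = rng.normal(0.10, 0.05, size=15)   # well-predicted stations
hard = rng.normal(0.30, 0.10, size=5)    # hard-to-predict stations

obs_diff = hard.mean() - easy.mean()
pooled = np.concatenate([easy, hard])

# One-sided permutation test: how often does a random split of the pooled
# values yield a group difference at least as large as the observed one?
n_perm = 2000
diffs = np.empty(n_perm)
for i in range(n_perm):
    p = rng.permutation(pooled)
    diffs[i] = p[:5].mean() - p[5:].mean()
p_value = (diffs >= obs_diff).mean()
print(obs_diff, p_value)
```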
8. I find the idea of reconstructing catchment areas from saliency maps very interesting. However, I have a few concerns about the framing and the strength of the conclusions in Section 4.3:
I) The attribution map shown is only for a single seed, with the justification that all seeds were temporally validated. However, attribution is often highly sensitive to parameter noise, especially for gradient-based methods. It would strengthen the argument to show whether the attributions are consistent across seeds, or to provide a measure of saliency variance.
II) While it is interesting that a known discrepancy exists, there is no real evidence that DRRAiNN “discovered” the underground flow. A more conservative interpretation might be that the model failed to align with the delineated catchment, and this could be due to unmodeled processes. If the authors want to keep this discussion, it would help to at least show whether the model consistently de-emphasizes that region across multiple sequences.
III) In Figure 7, I noticed that most of the examples shown are for stations in smaller headwater catchments. It would be helpful to also evaluate downstream stations, where we would expect the model to aggregate signals over a broader upstream area. If the attributions remain very local in those cases, it might suggest that the model is not truly learning large-scale accumulation, but rather reacting to recent local precipitation.
IV) More generally, it is unclear whether the attribution reflects actual learned flow dynamics, or just highlights locations where recent rainfall occurred. A simple test might be to check whether attribution strength correlates with rainfall intensity rather than hydrologically relevant pathways. This would help clarify whether the model is truly learning how water moves through the network, or just where it rains.
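To make points I and IV concrete, here is a minimal sketch of the two checks (all arrays are synthetic and the shapes are my assumptions): per-pixel variability of the saliency maps across seeds, and a rank correlation between attribution and rainfall intensity.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Check 1 (point I): seed consistency of the saliency maps ---
# Hypothetical attribution maps from five seeds for one station/sequence.
maps = np.abs(rng.normal(size=(5, 64, 64)))
maps /= maps.reshape(5, -1).sum(axis=1)[:, None, None]   # normalise each map

cv = maps.std(axis=0) / (maps.mean(axis=0) + 1e-12)      # per-pixel CV
print("mean saliency CV across seeds:", cv.mean())

# --- Check 2 (point IV): attribution vs. rainfall intensity ---
def spearman(a, b):
    """Spearman rank correlation via ranked Pearson (no tie correction)."""
    return np.corrcoef(a.argsort().argsort(), b.argsort().argsort())[0, 1]

rain = rng.gamma(0.5, 5.0, size=64 * 64)            # hypothetical rainfall
attr = 0.8 * rain + rng.normal(0.0, 1.0, 64 * 64)   # hypothetical attribution
print("rank corr(attribution, rainfall):", spearman(rain, attr))
```

A high coefficient of variation would indicate seed-dependent saliency; a rank correlation near 1 would suggest the attribution mostly tracks where it rains rather than how water moves.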
9. The discussion is rich and touches on many aspects of the model’s behavior and potential. However, it currently reads more like a collection of loosely connected observations and future directions (e.g., abrupt shifts from ablation to data choices to flood forecasting), rather than a focused and structured analysis. Some points are speculative without clear support, while others deserve more detailed treatment.
I) The authors mention that DRRAiNN is “not designed for scalability,” but it remains unclear what limits its scalability: is it due to computational costs, architectural complexity (e.g., the combined grid and graph operations), or something else? It would help to provide a clearer picture of the computational resources and time required for training and inference.
II) The discussion includes many potential future directions (e.g., hourly discharge, new inputs, removal of the warm-up phase, etc.). While these are all interesting, I would recommend narrowing the focus to 1-2 directions that are most promising or most directly tied to the current model's limitations.
III) The observation that the best attribution map does not correspond to the best predictive performance is interesting. However, this claim may be based on a small number of seeds, and attribution can be sensitive to initialization. Could this divergence be due to noise or model variance?