<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing with OASIS Tables v3.0 20080202//EN" "https://jats.nlm.nih.gov/nlm-dtd/publishing/3.0/journalpub-oasis3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://docs.oasis-open.org/ns/oasis-exchange/table" xml:lang="en" dtd-version="3.0" article-type="research-article">
  <front>
    <journal-meta><journal-id journal-id-type="publisher">HESS</journal-id><journal-title-group>
    <journal-title>Hydrology and Earth System Sciences</journal-title>
    <abbrev-journal-title abbrev-type="publisher">HESS</abbrev-journal-title><abbrev-journal-title abbrev-type="nlm-ta">Hydrol. Earth Syst. Sci.</abbrev-journal-title>
  </journal-title-group><issn pub-type="epub">1607-7938</issn><publisher>
    <publisher-name>Copernicus Publications</publisher-name>
    <publisher-loc>Göttingen, Germany</publisher-loc>
  </publisher></journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.5194/hess-29-6221-2025</article-id><title-group><article-title>How to deal with missing input data</article-title><alt-title>How to deal with missing input data</alt-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" corresp="yes" rid="aff1 aff5">
          <name><surname>Gauch</surname><given-names>Martin</given-names></name>
          <email>gauch@google.com</email>
        <ext-link>https://orcid.org/0000-0002-4587-898X</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff2">
          <name><surname>Kratzert</surname><given-names>Frederik</given-names></name>
          
        <ext-link>https://orcid.org/0000-0002-8897-7689</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff2 aff3">
          <name><surname>Klotz</surname><given-names>Daniel</given-names></name>
          
        <ext-link>https://orcid.org/0000-0002-9843-6798</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Nearing</surname><given-names>Grey</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff4">
          <name><surname>Cohen</surname><given-names>Deborah</given-names></name>
          
        <ext-link>https://orcid.org/0000-0002-0153-8537</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff4">
          <name><surname>Gilon</surname><given-names>Oren</given-names></name>
          
        </contrib>
        <aff id="aff1"><label>1</label><institution>Google Research, Zurich, Switzerland</institution>
        </aff>
        <aff id="aff2"><label>2</label><institution>Google Research, Vienna, Austria</institution>
        </aff>
        <aff id="aff3"><label>3</label><institution>IT:U Interdisciplinary Transformation University, Linz, Austria</institution>
        </aff>
        <aff id="aff4"><label>4</label><institution>Google Research, Tel Aviv, Israel</institution>
        </aff>
        <aff id="aff5"><label>🏅</label><institution>Invited contribution by Martin Gauch, recipient of the EGU Hydrological Sciences Virtual Outstanding Student and PhD candidate Presentation Award 2021.</institution>
        </aff>
      </contrib-group>
      <author-notes><corresp id="corr1">Martin Gauch (gauch@google.com)</corresp></author-notes><pub-date><day>13</day><month>November</month><year>2025</year></pub-date>
      
      <volume>29</volume>
      <issue>21</issue>
      <fpage>6221</fpage><lpage>6235</lpage>
      <history>
        <date date-type="received"><day>14</day><month>March</month><year>2025</year></date>
           <date date-type="rev-request"><day>7</day><month>April</month><year>2025</year></date>
           <date date-type="rev-recd"><day>14</day><month>August</month><year>2025</year></date>
           <date date-type="accepted"><day>15</day><month>October</month><year>2025</year></date>
      </history>
      <permissions>
        <copyright-statement>Copyright: © 2025 Martin Gauch et al.</copyright-statement>
        <copyright-year>2025</copyright-year>
      <license license-type="open-access"><license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p></license></permissions><self-uri xlink:href="https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025.html">This article is available from https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025.html</self-uri><self-uri xlink:href="https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025.pdf">The full text article is available as a PDF file from https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025.pdf</self-uri>
      <abstract><title>Abstract</title>

      <p id="d2e154">Deep learning hydrologic models have made their way from research to applications. More and more national hydrometeorological agencies, hydro power operators, and engineering consulting companies are building Long Short-Term Memory (LSTM) models for operational use cases. All of these efforts come across similar sets of challenges – challenges that are different from those in controlled scientific studies. In this paper, we tackle one of these issues: how to deal with missing input data? Operational systems depend on the real-time availability of various data products – most notably, meteorological forcings. The more external dependencies a model has, however, the more likely it is to experience an outage in one of them. We introduce and compare three different solutions that can generate predictions even when some of the meteorological input data do not arrive in time, or do not arrive at all: first, <italic>input replacing</italic>, which imputes missing values with a fixed number; second, <italic>masked mean</italic>, which averages embeddings of the forcings that are available at a given time step; third, <italic>attention</italic>, a generalization of the masked mean mechanism that dynamically weights the embeddings. We compare the approaches in different missing data scenarios and find that, by a small margin, the masked mean approach tends to perform best.</p>
  </abstract>
    </article-meta>
  </front>
<body>
      

      
<sec id="Ch1.S1" sec-type="intro">
  <label>1</label><title>Introduction</title>
      <p id="d2e177">Deep learning approaches for hydrologic modeling are now making their way from research settings into real-world operational deployments <xref ref-type="bibr" rid="bib1.bibx33 bib1.bibx14 bib1.bibx39 bib1.bibx15" id="paren.1"><named-content content-type="pre">e.g.,</named-content></xref>. Unfortunately, the real world is messy and in many ways does not conform to the controlled settings we can assume in research studies <xref ref-type="bibr" rid="bib1.bibx31" id="paren.2"/>. One prime example of such complications is the occurrence of outages in input data products: state-of-the-art operational hydrologic models rely on the real-time availability of several externally provided meteorological forcing products. As an example, the hydrologic model in Google's flood forecasting system uses four different weather data products from four different data providers as inputs <xref ref-type="bibr" rid="bib1.bibx9" id="paren.3"/>. At any point in time, one or more of these providers might experience an outage and not deliver the data in time to make the next prediction. While the timely arrival of data is usually not an issue in research contexts, not producing forecasts for days or even weeks is not an option for operational systems that are needed for flood forecasts or water management.</p>
      <p id="d2e191">Moreover, models that can cope with missing input data are useful in other settings, such as training on data products that are available for different time periods or different spatial extents: the observation that larger and more diverse training sets generally benefit the prediction quality <xref ref-type="bibr" rid="bib1.bibx29" id="paren.4"/> appears at odds with the fact that local meteorological forcings tend to have higher resolution and be more accurate than global ones <xref ref-type="bibr" rid="bib1.bibx8" id="paren.5"/>. Our proposed methods can mitigate this tension, as they allow us to train a single global model that incorporates local forcings where they are available (Fig. <xref ref-type="fig" rid="F1"/>). Orthogonally to spatial coverage, our methods further allow us to train models with forcings that have different temporal coverage. This is especially useful for more recent data products based on remote sensing information.</p>

      <fig id="F1"><label>Figure 1</label><caption><p id="d2e204">Different scenarios for missing input data (gray bars): outages at individual time steps (top), data products starting at different points in time (middle), and local data products that are not available for all basins (bottom). All of these scenarios reduce the number of training samples for models that are <italic>not</italic> robust, i.e., that cannot cope with missing data (yellow, small box), while the models presented in this paper <italic>are</italic> robust, i.e., they can be trained on all samples with valid targets (purple, large box).</p></caption>
        <graphic xlink:href="https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025-f01.png"/>

      </fig>

      <p id="d2e220">Inevitably, the quality of predictions degrades as fewer input data products are available <xref ref-type="bibr" rid="bib1.bibx26" id="paren.6"/>. Fortunately, deep learning methods are flexible enough to offer solutions that limit this decay while remaining competitively accurate when all data are available. In the following sections, we present three strategies to accomplish this goal: <list list-type="bullet"><list-item>
      <p id="d2e228">First, <italic>input replacing</italic> replaces missing forcing data with a fixed value and adds binary flags to indicate outages.</p></list-item><list-item>
      <p id="d2e235">Second, <italic>masked mean</italic> embeds each forcing product separately and averages the embeddings of all products that are available at a given time.</p></list-item><list-item>
      <p id="d2e242">Third, we show how the masked mean strategy is a special case of a theoretically more expressive but practically equally accurate <italic>attention</italic> mechanism <xref ref-type="bibr" rid="bib1.bibx5" id="paren.7"/> that can dynamically adjust the weighting of each forcing product, e.g., depending on the static attributes of a basin.</p></list-item></list></p>
      <p id="d2e251">We evaluate these strategies in three settings: <list list-type="bullet"><list-item>
      <p id="d2e256">First, random <italic>time step dropout</italic>. We investigate how accuracy deteriorates as forcings are missing at more and more time steps during training and inference (corresponding to the top row in Fig. <xref ref-type="fig" rid="F1"/>).</p></list-item><list-item>
      <p id="d2e265">Second, <italic>sequence dropout</italic>. We investigate how accuracy deteriorates as certain forcings become entirely unavailable during inference (corresponding to the middle row in Fig. <xref ref-type="fig" rid="F1"/>).</p></list-item><list-item>
      <p id="d2e274">Third, <italic>regional forcing products</italic>. We investigate how the proposed strategies allow training global models that leverage regional forcing data (corresponding to the bottom row in Fig. <xref ref-type="fig" rid="F1"/>).</p></list-item></list></p>
      <p id="d2e282">We are not the first to study deep learning models that are robust to missing input data <xref ref-type="bibr" rid="bib1.bibx2" id="paren.8"/>. In fact, today's large language models rely heavily on learning schemes that train the model to predict words given incomplete and masked-out input sentences <xref ref-type="bibr" rid="bib1.bibx10 bib1.bibx36 bib1.bibx7" id="paren.9"><named-content content-type="pre">e.g.,</named-content></xref>. These masked language models use special mask tokens to indicate dropped-out data, which – at a high level – are similar to the binary indicators we use in the input replacing strategy. Similar techniques are used in computer vision models, such as Masked Autoencoders <xref ref-type="bibr" rid="bib1.bibx22" id="paren.10"/>. <xref ref-type="bibr" rid="bib1.bibx43" id="text.11"/> highlight an additional benefit of dropping out inputs (or hidden activations) during training: dropout has a regularizing effect on training and therefore reduces overfitting and leads to models that generalize better.</p>
      <p id="d2e299">Data-driven methods are also used to explicitly impute missing data <xref ref-type="bibr" rid="bib1.bibx41 bib1.bibx47" id="paren.12"><named-content content-type="pre">e.g.,</named-content></xref>, including in hydrological and meteorological applications <xref ref-type="bibr" rid="bib1.bibx16 bib1.bibx49" id="paren.13"><named-content content-type="pre">e.g.,</named-content></xref>. Imputation subsequently allows using models that cannot cope with missing data. However, this strategy requires an additional imputation model that needs to be trained separately or jointly with the downstream model, making the setup and training more complex. As we are less focused on the reconstruction of missing data and more focused on maintaining prediction accuracy, we do not consider such approaches in this study.</p>
      <p id="d2e312">Our masked mean and attention mechanisms also bear similarity to deep learning approaches that merge multi-modal input data, such as LANISTR <xref ref-type="bibr" rid="bib1.bibx13" id="paren.14"/>. Their approach merges inputs from different modalities (such as images, text, or structured data) into a joint embedding space, while allowing individual modalities to be missing at training or inference time. Further, the attention mechanism's dynamic weighting of forcing embeddings can be seen as a variant of the conditioning operation described by <xref ref-type="bibr" rid="bib1.bibx35" id="text.15"/> and at a higher level by <xref ref-type="bibr" rid="bib1.bibx12" id="text.16"/>.</p>
</sec>
<sec id="Ch1.S2">
  <label>2</label><title>Data and methods</title>
<sec id="Ch1.S2.SS1">
  <label>2.1</label><title>Data</title>
      <p id="d2e339">We ran all experiments on the 531 basins of the CAMELS dataset <xref ref-type="bibr" rid="bib1.bibx34 bib1.bibx1" id="paren.17"/> that previous studies used, e.g., <xref ref-type="bibr" rid="bib1.bibx26" id="text.18"/>. The CAMELS dataset comes with three sets of daily meteorological forcings: Daymet <xref ref-type="bibr" rid="bib1.bibx44" id="paren.19"/>, Maurer <xref ref-type="bibr" rid="bib1.bibx30" id="paren.20"/>, and NLDAS <xref ref-type="bibr" rid="bib1.bibx48" id="paren.21"/>. We consider these forcings the “external dependencies” in this study. All models use all 15 forcing variables (precipitation, solar radiation, min/max temperature, and vapor pressure for each of the three forcing products) and the same set of 26 static attributes as <xref ref-type="bibr" rid="bib1.bibx26" id="text.22"/><fn id="Ch1.Footn1"><p id="d2e360">Unlike what is mentioned in <xref ref-type="bibr" rid="bib1.bibx26" id="text.23"/>, p_seasonality was actually not used as a static input, as the experiment configuration files show.</p></fn>. All models are trained with streamflow as the target variable.</p>
      <p id="d2e367">Again following <xref ref-type="bibr" rid="bib1.bibx26" id="text.24"/>, we trained our models on the period 1 October 1999 to 30 September 2008, validated on 1 October 1980 to 30 September 1989, and tested them on 1 October 1989 to 30 September 1999. All results in this paper refer to the test period.</p>
</sec>
<sec id="Ch1.S2.SS2">
  <label>2.2</label><title>Methods</title>
      <p id="d2e381">The models we train in this paper closely follow the architecture that was used in <xref ref-type="bibr" rid="bib1.bibx26" id="text.25"/>, except that we employ different mechanisms to feed the input data into the LSTM itself. The following paragraphs describe these approaches in more detail.</p>
<sec id="Ch1.S2.SS2.SSS1">
  <label>2.2.1</label><title>Input replacing</title>
      <p id="d2e394">The first mechanism to cope with missing input data sets any missing values to a fixed value and adds a binary flag to indicate these replacements, before concatenating all input data and flags <xref ref-type="bibr" rid="bib1.bibx33" id="paren.26"><named-content content-type="pre">Fig. <xref ref-type="fig" rid="F2"/>; see also</named-content></xref>. Optionally, we can embed the concatenated vector (in our case, through a small fully-connected network). This reduces the feature dimensions before the vector is finally used as input to the LSTM. Further, we can concatenate a positional encoding vector to the forcings before the embedding, making the model aware of the current input's position relative to the overall sequence length (not shown in Fig. <xref ref-type="fig" rid="F2"/>). In initial experiments, we also tried to make the replacement value a learned parameter instead of setting it to a fixed value, but we did not see meaningful improvements when doing so. Hence, in all subsequent experiments, we used zero as the fixed value.</p>

      <fig id="F2" specific-use="star"><label>Figure 2</label><caption><p id="d2e408">Illustration of the input replacing strategy. Each box represents an input variable (like precipitation, temperature) from one of the forcing groups. NaNs in the input data for a given time step are replaced by zeros (gray boxes for forcing group 2), all forcings are concatenated, together with one binary flag for each forcing group which indicates whether that group was NaN or not. The resulting vector is passed through an embedding network to the LSTM.</p></caption>
            <graphic xlink:href="https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025-f02.png"/>

          </fig>
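      <p>The input replacing step for a single time step can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in our experiments: the function name <monospace>replace_inputs</monospace>, the nested-list data layout, and the flag polarity (1 marks a missing group) are assumptions made for this example, and the optional embedding network and positional encoding are omitted.</p>
      <preformat><![CDATA[
```python
import math

def replace_inputs(groups, fill_value=0.0):
    """Build the concatenated model input for one time step.

    groups: list of forcing groups, each a list of floats. A group counts as
    missing when any of its values is NaN (assumption of this sketch).
    Returns the (replaced) values of all groups followed by one binary flag
    per group, where 1.0 marks a group that was missing.
    """
    values, flags = [], []
    for group in groups:
        missing = any(math.isnan(x) for x in group)
        # Replace NaNs by the fixed value; valid numbers stay untouched.
        values.extend(fill_value if math.isnan(x) else x for x in group)
        flags.append(1.0 if missing else 0.0)
    return values + flags

# Three forcing groups with two variables each; the second group is missing.
step = [[1.2, 0.5], [float("nan"), float("nan")], [0.9, 0.4]]
features = replace_inputs(step)
```
]]></preformat>
      <p>The flags let the model distinguish a replaced zero from a genuinely observed zero, such as a day without precipitation.</p>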

</sec>
<sec id="Ch1.S2.SS2.SSS2">
  <label>2.2.2</label><title>Masked mean</title>
      <p id="d2e425">This approach embeds the forcings of each provider through individual embedding networks, each of them yielding an embedding vector of the same size. At every input time step, we average the non-NaN embeddings of that time step (i.e., the embeddings that correspond to providers that were available at that time step; hence the name “masked mean”) and pass the resulting joint embedding on to the LSTM (Fig. <xref ref-type="fig" rid="F3"/>). The inputs to the embedding networks could be extended by additional features, such as the static catchment attributes. However, in our experiments we found that this deteriorated the performance. The flood forecasting system described by <xref ref-type="bibr" rid="bib1.bibx9" id="text.27"/> uses a masked mean approach in the current operational model.</p>

      <fig id="F3" specific-use="star"><label>Figure 3</label><caption><p id="d2e435">Illustration of the masked mean strategy. Each forcing provider is projected to the same size through its own embedding network. The resulting embeddings of valid providers are averaged and passed on to the LSTM.</p></caption>
            <graphic xlink:href="https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025-f03.png"/>

          </fig>
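      <p>The averaging step of the masked mean strategy can be sketched as follows. The example is a simplified illustration under stated assumptions: <monospace>embed</monospace> is a toy linear map standing in for the per-provider embedding networks, and returning a zero vector when all providers are missing is a choice made for this sketch only.</p>
      <preformat><![CDATA[
```python
import math

def embed(forcings, weights):
    """Toy linear embedding: one output value per weight row (no bias, no activation)."""
    return [sum(w * x for w, x in zip(row, forcings)) for row in weights]

def masked_mean(groups, weight_sets):
    """Average the embeddings of all forcing groups that contain no NaN at this time step."""
    valid = [embed(group, weights)
             for group, weights in zip(groups, weight_sets)
             if not any(math.isnan(x) for x in group)]
    if not valid:
        # All providers missing: fall back to a zero vector (assumption of this sketch).
        return [0.0] * len(weight_sets[0])
    # Element-wise mean over the valid embeddings.
    return [sum(column) / len(valid) for column in zip(*valid)]

identity = [[1.0, 0.0], [0.0, 1.0]]
weight_sets = [identity, identity, identity]   # one (toy) embedding per provider
groups = [[1.0, 2.0], [float("nan"), float("nan")], [3.0, 4.0]]
joint = masked_mean(groups, weight_sets)       # mean of providers 1 and 3 only
```
]]></preformat>
      <p>Because the divisor is the number of valid providers rather than the total number of providers, the scale of the joint embedding does not shrink when a provider drops out.</p>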

</sec>
<sec id="Ch1.S2.SS2.SSS3">
  <label>2.2.3</label><title>Attention</title>
      <p id="d2e452">Readers who are familiar with deep learning might recognize the masked mean architecture as the simplification of a more general attention mechanism <xref ref-type="bibr" rid="bib1.bibx5" id="paren.28"/>. Attention mechanisms have become ubiquitous in deep learning, as they are the core component of the popular Transformer architecture <xref ref-type="bibr" rid="bib1.bibx45" id="paren.29"/>. The most common realizations of attention allow the model to dynamically adjust its focus on different input time steps. Appendix <xref ref-type="sec" rid="App1.Ch1.S4"/> provides a brief introduction to the concept of attention for readers who are not familiar with the topic.</p>

      <fig id="F4" specific-use="star"><label>Figure 4</label><caption><p id="d2e465">Illustration of the attention embedding strategy. Each forcing provider is projected to the same size through its own embedding network. The resulting embedding vectors become the keys (<inline-formula><mml:math id="M1" display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula>) and values (<inline-formula><mml:math id="M2" display="inline"><mml:mi>v</mml:mi></mml:math></inline-formula>). The static attributes, together with a binary flag for each provider, serve as the query. The attention-weighted average of embeddings is passed on to the LSTM.</p></caption>
            <graphic xlink:href="https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025-f04.png"/>

          </fig>

      <p id="d2e488">In our case, we apply the attention mechanism over the different available providers at each time step. Figure <xref ref-type="fig" rid="F4"/> illustrates the process. Similar to the masked mean approach, we embed each forcing with its own embedding network, resulting in vectors that we use as both the keys and values of the attention mechanism. Additionally, we concatenate the static attributes with a positional embedding of the input time step (not shown in Fig. <xref ref-type="fig" rid="F4"/> for brevity) and three binary flags that indicate the availability of each forcing product at the given time step. A separate embedding of the resulting concatenated vector acts as the query. Based on the similarity of the query and each of the key vectors, we obtain a weighting by which we average the values, i.e., the embedding vectors of the forcing products. This weighted average is the input to the LSTM. Hence, the attention mechanism could – at least in theory – learn to dynamically adjust its focus on each forcing product based on the basin it is asked to predict.</p>
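      <p>The weighting over providers can be sketched as follows. The sketch assumes scaled dot-product similarity and a softmax restricted to the available providers; both are common choices but are assumptions here, as are the function name <monospace>attention_pool</monospace> and the toy vectors. It expects at least one available provider.</p>
      <preformat><![CDATA[
```python
import math

def attention_pool(query, embeddings, available):
    """Attention-weighted average of provider embeddings (used as keys and values).

    query: list of floats; embeddings: one vector per provider;
    available: one bool per provider. Unavailable providers are excluded.
    Returns the pooled vector and the weights of the available providers.
    """
    kept = [emb for emb, ok in zip(embeddings, available) if ok]
    scale = math.sqrt(len(query))
    # Similarity between the query and each key (scaled dot product).
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in kept]
    # Softmax over the available providers, shifted by the maximum for stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * v for w, v in zip(weights, column)) for column in zip(*kept)]
    return pooled, weights

embeddings = [[1.0, 0.0], [5.0, 5.0], [0.0, 1.0]]
available = [True, False, True]

# A zero query scores all keys equally, which reproduces the masked mean.
uniform_pooled, uniform_weights = attention_pool([0.0, 0.0], embeddings, available)

# A query aligned with the first key shifts weight toward the first provider.
focused_pooled, focused_weights = attention_pool([2.0, 0.0], embeddings, available)
```
]]></preformat>
      <p>The first call illustrates why the masked mean is a special case of attention: whenever all scores are equal, the softmax assigns the same weight to every available provider.</p>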
</sec>
</sec>
<sec id="Ch1.S2.SS3">
  <label>2.3</label><title>Experiments</title>
      <p id="d2e504">We conducted three experiments to test how well each architecture can cope with different scenarios where input data are missing in certain temporal periods or spatial regions. To save computational resources, we performed one hyperparameter tuning and used the resulting best hyperparameters for all further experiments. Appendix <xref ref-type="sec" rid="App1.Ch1.S1"/> covers our tuning procedure in more detail. In all experiments, we trained each model with three different random seeds.</p>
<sec id="Ch1.S2.SS3.SSS1">
  <label>2.3.1</label><title>Experiment 1: Forcings missing at individual time steps</title>
      <p id="d2e516">This experiment simulates short-term outages of certain input products. Because the LSTMs used in hydrologic applications typically ingest input data with one year (365 d) of lookback, even an outage for a single time step can cause problems for the next year to come: for the next 365 d, there will be a NaN input time step, which breaks models that cannot deal with missing input data. We trained and evaluated the different models with an increasing probability of randomly missing input time steps. The time step dropout is sampled independently at random, i.e., at each input time step, each forcing is missing with probability <inline-formula><mml:math id="M3" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mtext>time</mml:mtext></mml:msub></mml:mrow></mml:math></inline-formula>. This means that all products can be missing at once for certain time steps. We sweep <inline-formula><mml:math id="M4" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mtext>time</mml:mtext></mml:msub></mml:mrow></mml:math></inline-formula> from <inline-formula><mml:math id="M5" display="inline"><mml:mn mathvariant="normal">0</mml:mn></mml:math></inline-formula> to <inline-formula><mml:math id="M6" display="inline"><mml:mn mathvariant="normal">0.6</mml:mn></mml:math></inline-formula> in increments of <inline-formula><mml:math id="M7" display="inline"><mml:mn mathvariant="normal">0.1</mml:mn></mml:math></inline-formula>.</p>
      <p id="d2e562">As our main baseline, we used the three-forcing model from <xref ref-type="bibr" rid="bib1.bibx26" id="text.30"/>. This model shows the upper bound of performance we can expect when no data are missing. We also included the worst of the three single-forcing models (based solely on NLDAS) from the same source as a point of reference.</p>
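      <p>The independent time step dropout used in this experiment can be sketched as follows. The nested-list layout (<monospace>sequence[t][g]</monospace> holds the variables of forcing group <monospace>g</monospace> at time step <monospace>t</monospace>) and the function name are assumptions made for this illustration.</p>
      <preformat><![CDATA[
```python
import random

def time_step_dropout(sequence, p_time, rng):
    """Set each (time step, forcing group) pair to NaN independently with probability p_time.

    sequence[t][g] is the list of forcing values of group g at time step t.
    Returns a new nested list; the input is left unchanged.
    """
    dropped = []
    for step in sequence:
        new_step = []
        for group in step:
            if rng.random() < p_time:
                # The whole group is missing at this time step.
                new_step.append([float("nan")] * len(group))
            else:
                new_step.append(list(group))
        dropped.append(new_step)
    return dropped

rng = random.Random(0)  # fixed seed for reproducibility
sequence = [[[1.0, 2.0], [3.0]], [[4.0, 5.0], [6.0]]]
noisy = time_step_dropout(sequence, p_time=0.3, rng=rng)
```
]]></preformat>
      <p>Since every group is sampled separately, nothing prevents all groups from being dropped at the same time step, which matches the behavior described above.</p>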
</sec>
<sec id="Ch1.S2.SS3.SSS2">
  <label>2.3.2</label><title>Experiment 2: Forcings missing for the entire time sequence</title>
      <p id="d2e576">This experiment simulates extended time periods with missing input data. In practical applications, this may happen when an input product has limited temporal coverage, either because it became available later than other products, or because it went out of service or had an extended outage while the model was still in use. We evaluated this scenario by running inference with samples where all time steps of one or two providers were set to NaN, and we report the results for each combination of one or two missing providers.</p>
      <p id="d2e579">To make sure the models can cope with this scenario, we trained the models with samples that contained NaNs of two types: (1) dropout of individual time steps (as in the previous experiment) with <inline-formula><mml:math id="M8" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mtext>time</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.1</mml:mn></mml:mrow></mml:math></inline-formula>, and (2) dropout of entire input sequences with <inline-formula><mml:math id="M9" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mtext>sequence</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.1</mml:mn></mml:mrow></mml:math></inline-formula><fn id="Ch1.Footn2"><p id="d2e611">We also performed some preliminary experiments with <inline-formula><mml:math id="M10" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mtext>time</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.0</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M11" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mtext>sequence</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.1</mml:mn></mml:mrow></mml:math></inline-formula> since this more closely matches the evaluation setup, but saw no meaningful differences in the results.</p></fn>. We made sure to never drop all three sequences entirely, but allowed the case where all three products are missing at individual time steps.</p>
      <p id="d2e645">The natural baselines in these experiments are the corresponding one- and two-forcing models from <xref ref-type="bibr" rid="bib1.bibx26" id="text.31"/>. These baselines are not robust to missing input data, and they simply ingest the concatenated forcing variables from one or two forcing groups.</p>
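      <p>The sequence dropout during training and the evaluation scenarios of this experiment can be sketched as follows. Resampling until at least one product survives is only one possible way to avoid dropping all three sequences and is an assumption of this sketch, as are the function name and the lowercase product identifiers.</p>
      <preformat><![CDATA[
```python
import random
from itertools import combinations

PRODUCTS = ("daymet", "maurer", "nldas")

def sequence_dropout(n_groups, p_sequence, rng):
    """Return one bool per forcing group; True means the entire input sequence is dropped.

    Draws are repeated until at least one group is kept (rejection sampling).
    """
    if not 0.0 <= p_sequence < 1.0:
        raise ValueError("p_sequence must be in [0, 1)")
    while True:
        drop = [rng.random() < p_sequence for _ in range(n_groups)]
        if not all(drop):
            return drop

# Training: sample which products are dropped for one training sample.
rng = random.Random(0)
drop_mask = sequence_dropout(len(PRODUCTS), p_sequence=0.1, rng=rng)

# Evaluation: every combination of one or two missing products.
scenarios = [combo for k in (1, 2) for combo in combinations(PRODUCTS, k)]
```
]]></preformat>
      <p>With three products, this yields six evaluation scenarios: three with a single missing product and three with two missing products.</p>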
</sec>
<sec id="Ch1.S2.SS3.SSS3">
  <label>2.3.3</label><title>Experiment 3: Forcings missing for certain spatial regions</title>
      <p id="d2e659">Finally, we explored how the different approaches to missing input data fare in settings where an input product is missing for certain regions in space. This is relevant because for many regions there exist local meteorological data products that are of higher quality than globally available ones. At the same time, training on diverse sets of basins benefits performance <xref ref-type="bibr" rid="bib1.bibx29" id="paren.32"><named-content content-type="pre">see</named-content></xref>. Hence, being able to merge local high-quality forcing data with global streamflow could – at least in theory – combine the best of two worlds.</p>

      <fig id="F5"><label>Figure 5</label><caption><p id="d2e669">Map of the 531 CAMELS basins used in this study. For the 51 basins in the Ohio, Cumberland, and Tennessee River basins (purple), we assumed all three forcings to be available. For all other basins (blue), we assumed only Daymet and Maurer forcings to be available.</p></caption>
            <graphic xlink:href="https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025-f05.png"/>

          </fig>

      <p id="d2e678">We simulated this setting on the CAMELS dataset by training models that received Daymet and Maurer forcings everywhere, but NLDAS forcings only for the 51 basins in the Ohio, Cumberland, and Tennessee River basins <xref ref-type="bibr" rid="bib1.bibx46" id="paren.33"><named-content content-type="pre">USGS site numbers starting in 03, cf.</named-content><named-content content-type="post">depicted in Fig. <xref ref-type="fig" rid="F5"/></named-content></xref>. As baselines, we trained one model on all three forcings but only the 51 basins, and one model on all 531 basins but only the two forcings that we assumed to be available everywhere (Daymet and Maurer).</p>
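      <p>The regional availability rule of this experiment can be sketched as follows. The sketch assumes that gauge identifiers are handled as zero-padded strings; the function name and the two example identifiers are hypothetical and serve only to illustrate the prefix rule.</p>
      <preformat><![CDATA[
```python
def available_products(gauge_id):
    """Forcing products assumed to be available for a basin in Experiment 3.

    gauge_id: USGS site number as a zero-padded string (assumption of this sketch).
    """
    products = ["daymet", "maurer"]       # assumed to be available everywhere
    if gauge_id.startswith("03"):
        products.append("nldas")          # regional product for site numbers starting in 03
    return products

# Hypothetical identifiers, chosen only to illustrate the prefix rule.
regional = available_products("03000000")
elsewhere = available_products("01000000")
```
]]></preformat>
      <p>Keeping identifiers as strings matters here, because converting them to integers would strip the leading zero that the rule relies on.</p>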
</sec>
</sec>
</sec>
<sec id="Ch1.S3">
  <label>3</label><title>Results</title>
<sec id="Ch1.S3.SS1">
  <label>3.1</label><title>Experiment 1: Forcings missing at individual time steps</title>
      <p id="d2e707">In the first experiment, we trained models at different probabilities <inline-formula><mml:math id="M12" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mtext>time</mml:mtext></mml:msub></mml:mrow></mml:math></inline-formula> of input products being NaN at individual time steps. Figure <xref ref-type="fig" rid="F6"/> shows the resulting Nash–Sutcliffe efficiency <xref ref-type="bibr" rid="bib1.bibx32" id="paren.34"><named-content content-type="pre">NSE;</named-content></xref> and Kling–Gupta efficiency <xref ref-type="bibr" rid="bib1.bibx21" id="paren.35"><named-content content-type="pre">KGE;</named-content></xref> values at <inline-formula><mml:math id="M13" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mtext>time</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.0</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">0.1</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:mn mathvariant="normal">0.6</mml:mn></mml:mrow></mml:math></inline-formula>, and Appendix <xref ref-type="sec" rid="App1.Ch1.S3"/> contains plots with additional metrics. As expected, the accuracy of all methods drops with increasing amounts of NaNs. At 0 % NaNs, all methods perform roughly as well as the three-forcing baseline from <xref ref-type="bibr" rid="bib1.bibx26" id="text.36"/>, which cannot cope with missing input data. The new models exhibit slightly worse NSE values than the baseline, while masked mean and input replacing are slightly better in KGE. These minor differences arise because our newly trained models were tuned for a setting with moderate amounts of missing input data and therefore use slightly different hyperparameters than the three-forcing baseline.</p>

      <fig id="F6" specific-use="star"><label>Figure 6</label><caption><p id="d2e768">Median NSE and KGE across 531 basins at different amounts of missing input time steps. The dotted horizontal line provides the baseline of a model that cannot deal with missing data but is trained to ingest all three forcing groups at every time step. The dashed line represents the baseline of a model that uses the worst individual set of forcings (NLDAS). Both baselines stem from <xref ref-type="bibr" rid="bib1.bibx26" id="text.37"/>. The shaded areas indicate the spread between minimum and maximum values across three seeds; the solid lines represent the median.</p></caption>
          <graphic xlink:href="https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025-f06.png"/>

        </fig>

      <p id="d2e780">As <inline-formula><mml:math id="M14" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mtext>time</mml:mtext></mml:msub></mml:mrow></mml:math></inline-formula> increases, we see no clear winner in terms of NSE; all methods decay by roughly equal amounts in this metric. For KGE, the masked mean architecture tends to perform better than input replacing and attention: except for <inline-formula><mml:math id="M15" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mtext>time</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.2</mml:mn></mml:mrow></mml:math></inline-formula>, the masked mean results are significantly better than those of input replacing (one-sided Wilcoxon signed-rank test at <inline-formula><mml:math id="M16" display="inline"><mml:mrow><mml:mi mathvariant="italic">α</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.05</mml:mn></mml:mrow></mml:math></inline-formula>). The attention mechanism generally performs significantly worse than masked mean and input replacing, except at the highest missing data probabilities. To investigate why attention under-performs at low <inline-formula><mml:math id="M17" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mtext>time</mml:mtext></mml:msub></mml:mrow></mml:math></inline-formula>, we plotted the attention weights placed by the model on each set of forcings (Fig. <xref ref-type="fig" rid="FC3"/> in Appendix C), and found that, apart from a select few basins, the weights fluctuate closely around <inline-formula><mml:math id="M18" display="inline"><mml:mrow><mml:mn mathvariant="normal">1</mml:mn><mml:mo>/</mml:mo><mml:mn mathvariant="normal">3</mml:mn></mml:mrow></mml:math></inline-formula>. Hence, the model merely attempted to recover the solution that is hard-coded in the masked mean strategy.</p>
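      <p>The kind of time-step masking used in this experiment can be sketched as follows. This is an illustration under stated assumptions, not the training code: the array layout, the function name mask_time_steps, and the choice to draw every product and time step independently are all hypothetical. In particular, the sketch does not guarantee that at least one product remains available at each time step.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def mask_time_steps(forcings: np.ndarray, p_time: float, rng: np.random.Generator) -> np.ndarray:
    """Set each product to NaN at each time step independently with probability p_time.

    forcings has the (assumed) shape (n_products, n_time_steps, n_features).
    """
    drop = rng.random(forcings.shape[:2]) < p_time  # one draw per product and time step
    masked = forcings.astype(float)  # astype returns a copy, so the input stays untouched
    masked[drop] = np.nan  # all features of a dropped product vanish together
    return masked
```
]]></preformat>
      <p>With p_time = 0 the function returns an unmodified copy, which corresponds to the 0 % NaN setting in the leftmost points of the figure.</p>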
</sec>
<sec id="Ch1.S3.SS2">
  <label>3.2</label><title>Experiment 2: Forcings missing for the entire time sequence</title>
      <p id="d2e855">In this experiment, we evaluated to what extent the different architectures can maintain their accuracy when one or two sets of forcings are missing entirely at inference time. Figure <xref ref-type="fig" rid="F7"/> shows the resulting empirical cumulative distribution functions (CDFs) of NSE values. <xref ref-type="bibr" rid="bib1.bibx26" id="text.38"/> already provide results which indicate that the availability of fewer forcing products implies worse model performance.</p>

      <fig id="F7" specific-use="star"><label>Figure 7</label><caption><p id="d2e865">Empirical cumulative distribution functions of NSE values across all 531 basins when two (first column) or one (second column) forcing groups are continuously missing. The subplot titles denote which products we passed to the model during inference. The dotted line represents the upper bound baseline, a model that is trained and evaluated with all three forcings; the dashed line represents the performance of a model trained specifically for the available combination of forcings. All results show the mean performance across three seeds; curves further to the right are better.</p></caption>
          <graphic xlink:href="https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025-f07.png"/>

        </fig>

      <p id="d2e874">The results from experiment 2 corroborate this finding. In the experiments where one set of forcings is available at inference time (first column in Fig. <xref ref-type="fig" rid="F7"/>), the baseline trained on that one set of forcings (dashed line) performs significantly better than the missing-inputs architectures, but the effect sizes in the comparison to masked mean and attention are small (Cohen's <inline-formula><mml:math id="M19" display="inline"><mml:mrow><mml:mi>d</mml:mi><mml:mo>&lt;</mml:mo><mml:mn mathvariant="normal">0.1</mml:mn></mml:mrow></mml:math></inline-formula>). The only exception to this is the NLDAS-only experiment, where the baseline does not perform significantly better than masked mean and attention. Input replacing tends to perform the worst across all evaluations.</p>
      <p id="d2e892">In the experiments where two sets of forcings are available at inference time (second column in Fig. <xref ref-type="fig" rid="F7"/>), we find results similar to those of the experiments with only one available set of forcings. However, the margins in accuracy between the different methods are even smaller and likely not relevant for most practical applications.</p>
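      <p>As a reminder of what the reported effect sizes measure, the sketch below computes Cohen's d for two samples of per-basin scores using the pooled standard deviation. Which variant of Cohen's d underlies the values reported above is not specified here, so the pooled two-sample form is an assumption, as is the function name.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def cohens_d(a, b) -> float:
    """Difference of means divided by the pooled sample standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_var = ((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)) / (a.size + b.size - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))
```
]]></preformat>
      <p>A value below 0.1 means that the two means differ by less than a tenth of the typical spread across basins, which is why a difference can be statistically significant and still be of little practical relevance.</p>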
</sec>
<sec id="Ch1.S3.SS3">
  <label>3.3</label><title>Experiment 3: Forcings missing for certain spatial regions</title>
      <p id="d2e905">The last experiment investigated how well the missing-input architectures can incorporate regional input data, i.e., forcings that are available only in a subset of the training basins. Figure <xref ref-type="fig" rid="F8"/> shows the resulting empirical CDF curves of NSE and KGE values, and Appendix <xref ref-type="sec" rid="App1.Ch1.S3"/> provides figures with additional metrics.</p>

      <fig id="F8" specific-use="star"><label>Figure 8</label><caption><p id="d2e914">Empirical CDFs of NSE values across the 51 basins of the Ohio, Cumberland, and Tennessee River basins. The dashed line represents the baseline model trained only on those basins but with all forcings. The dotted line is the baseline two-forcing model trained on all 531 basins. The other models are trained on all 531 basins with NLDAS set to NaN outside of the 51 basins. All results show the mean performance across three seeds.</p></caption>
          <graphic xlink:href="https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025-f08.png"/>

        </fig>

      <p id="d2e923">Masked mean, attention, and input replacing all improve the predictions when compared to the globally trained two-forcing model. The three-forcing regional model trained only on the 51 basins in the Ohio, Cumberland, and Tennessee River basins is significantly better than input replacing and attention, but not significantly better than masked mean (one-sided Wilcoxon signed-rank test, <inline-formula><mml:math id="M20" display="inline"><mml:mrow><mml:mi mathvariant="italic">α</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.05</mml:mn></mml:mrow></mml:math></inline-formula>). This pattern is similar for the additional metrics from Appendix <xref ref-type="sec" rid="App1.Ch1.S3"/>. However, from a practical hydrological perspective, all approaches perform quite similarly, despite the statistically significant differences.</p>
</sec>
</sec>
<sec id="Ch1.S4" sec-type="conclusions">
  <label>4</label><title>Discussion and conclusions</title>
      <p id="d2e950">In this study, we presented three different strategies to build models that can provide streamflow predictions when parts of the meteorological input data are missing. <italic>Input replacing</italic> replaces NaNs with a fixed value, concatenates all forcings, and adds binary flags to indicate the missing data. <italic>Masked mean</italic> embeds each forcing product separately and averages the embeddings of available forcings. Finally, <italic>attention</italic> generalizes the masked mean approach and dynamically calculates a weighting of the different embeddings. Across all experiments (missing individual time steps, missing sequences, regional forcings), the masked mean strategy tends to perform best, although the differences are often small and depend on the metric. The fact that the models are unable to outperform the baseline trained on all three forcings but only 51 local basins (experiment 3) leads us to conjecture that the high-quality CAMELS forcings may not be the ideal testbed for an evaluation of regional forcings. All three forcings are of similar quality, and the basins in the chosen region are comparatively similar and easy to predict. Hence, a rather small set of training gauges appears to already yield satisfactory predictions, and it becomes difficult to discern meaningful differences. We therefore hypothesize that evaluations on larger datasets and with forcings of more varied quality would yield clearer conclusions. Unfortunately, these larger datasets still lack the kind of widely accepted baseline models and state-of-the-art LSTM configurations that exist for CAMELS. Hence, for this study we chose to stick with the CAMELS dataset in order to maintain consistency with <xref ref-type="bibr" rid="bib1.bibx26" id="text.39"/> and to allow for easy reproduction of experiments with limited resources. We see great potential for future work that extends the experiment to such settings.</p>
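      <p>To make the difference between the first two strategies concrete, the sketch below shows the data manipulation at the core of input replacing and masked mean. It is a simplified illustration and not the NeuralHydrology implementation: the learned embedding networks are omitted, and the array layout, the function names, the default fill value of 0, and the behaviour when no product is available are assumptions made for this example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def input_replacing(forcings: np.ndarray, fill_value: float = 0.0) -> np.ndarray:
    """Concatenate all products, replace NaNs, and append one missing-flag per product.

    forcings: (n_products, n_time_steps, n_features)
    returns:  (n_time_steps, n_products * n_features + n_products)
    """
    n_products, n_steps, n_features = forcings.shape
    missing = np.isnan(forcings).any(axis=-1)  # (n_products, n_time_steps)
    filled = np.where(np.isnan(forcings), fill_value, forcings)
    stacked = filled.transpose(1, 0, 2).reshape(n_steps, n_products * n_features)
    return np.concatenate([stacked, missing.T.astype(float)], axis=1)


def masked_mean(embeddings: np.ndarray, available: np.ndarray) -> np.ndarray:
    """Average per-product embeddings over the products that are available.

    embeddings: (n_products, n_time_steps, embedding_size)
    available:  (n_products, n_time_steps), boolean
    """
    summed = np.where(available[..., None], embeddings, 0.0).sum(axis=0)
    count = available.sum(axis=0)[..., None]
    # Assumption: if no product is available, return zeros instead of dividing by zero.
    return summed / np.maximum(count, 1)
```
]]></preformat>
      <p>Input replacing leaves it to the downstream network to interpret the flags, whereas masked mean hard-codes an equal weighting of whatever is available. Attention replaces this equal weighting with learned, context-dependent weights.</p>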
      <p id="d2e965">Notably, the attention mechanism – despite being strictly more expressive than the masked mean strategy – does not improve upon these results and largely learns to recover the masked mean solution. We also experimented with analyzing the attention weights grouped by time steps with falling/rising streamflow or by the forcing whose precipitation deviated the furthest from the mean, but could not identify any patterns (results not shown). Therefore, in its current form, attention appears unnecessary. Nevertheless, we do encourage further work in this direction as our experiments do not fully exhaust the space of possible attention configurations, and we hypothesize that attention might play to its strengths especially in settings where the quality of inputs varies significantly across forcings, space, or time. Extending the scope beyond established baselines, future work could evaluate this, for example, with the new Caravan MultiMet dataset <xref ref-type="bibr" rid="bib1.bibx42" id="paren.40"/>. Caravan MultiMet provides forcings from seven different providers for all basins in the Caravan dataset and its extensions <xref ref-type="bibr" rid="bib1.bibx28" id="paren.41"/>. There are also many alternative approaches to calculating query, keys, and values: e.g., incorporating the forcing information also into the query vector or incorporating static information into the keys and values.</p>
      <p id="d2e974">Lastly, we would like to look at the presented strategies from a different perspective: we can view them as a means to <italic>inject</italic> additional data into a model. Such injections can happen already during training (the multiple forcings we use in our experiments are an example of this), but they could also happen after training: for example, hydromet agencies could download a publicly available global model and inject locally available forcings or even lagged observations into the model. We encourage exploring such approaches further, as they could alleviate current trade-offs between training set size and input data resolution.</p>
</sec>

      
      </body>
    <back><app-group>

<app id="App1.Ch1.S1">
  <label>Appendix A</label><title>Hyperparameter tuning</title>
      <p id="d2e992">All hyperparameter tuning experiments used <inline-formula><mml:math id="M21" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mtext>time</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mtext>sequence</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.1</mml:mn></mml:mrow></mml:math></inline-formula>. We chose these values as an intermediate level of missing data to avoid the computational expense of tuning each architecture for each experiment setup separately. As we built upon the established baselines from <xref ref-type="bibr" rid="bib1.bibx26" id="text.42"/>, we did not tune the LSTM architecture itself for the experiments in this paper. Hence, all LSTMs are trained with 365 daily input time steps, a hidden size of 256, batch size 256, dropout fraction of 0.4 on the output head, and an Adam optimizer with initial learning rate of <inline-formula><mml:math id="M22" display="inline"><mml:mrow><mml:mn mathvariant="normal">1</mml:mn><mml:mo>×</mml:mo><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">3</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>, which we lowered to <inline-formula><mml:math id="M23" display="inline"><mml:mrow><mml:mn mathvariant="normal">5</mml:mn><mml:mo>×</mml:mo><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">4</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> in epoch 10 and to <inline-formula><mml:math id="M24" display="inline"><mml:mrow><mml:mn mathvariant="normal">1</mml:mn><mml:mo>×</mml:mo><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">4</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> in epoch 25. 
We used the <inline-formula><mml:math id="M25" display="inline"><mml:mrow><mml:msup><mml:mtext>NSE</mml:mtext><mml:mo>*</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula> loss function from <xref ref-type="bibr" rid="bib1.bibx25" id="text.43"/>. For a more in-depth description of these settings, we refer to <xref ref-type="bibr" rid="bib1.bibx25" id="text.44"/>.</p>
      <p id="d2e1092">We did, however, tune the hyperparameters of the missing-inputs mechanisms as well as the number of training epochs. For input replacing configurations, we chose slightly larger embedding sizes, such that the total parameter count in input replacing configurations is roughly equal to the parameter count in masked mean configurations. Attention configurations are marginally larger as they have an additional query embedding network, but we consider this difference irrelevant for the results in our comparisons – especially given that the optimal attention configuration was not the largest one in the hyperparameter grid.</p>
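      <p>The fixed learning-rate schedule described above is piecewise constant and can be written as a small lookup. The function below is only an illustration; its name is hypothetical, and it assumes that a rate "lowered in epoch 10" applies from epoch 10 onwards.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def learning_rate(epoch: int) -> float:
    """Piecewise-constant schedule: 1e-3, then 5e-4 from epoch 10, then 1e-4 from epoch 25."""
    schedule = {0: 1e-3, 10: 5e-4, 25: 1e-4}  # first epoch at which each rate applies
    start = max(e for e in schedule if e <= epoch)
    return schedule[start]
```
]]></preformat>
      <p>Because the best configurations were selected at epochs 30 and 35, all final models spend their last epochs at the lowest rate.</p>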
      <p id="d2e1095">We performed a grid search of the hyperparameter combinations listed in Table <xref ref-type="table" rid="TA1"/>. As for the main experiments, we trained each combination with three different random seeds. Finally, we chose the best configuration for each architecture as the one with the best median NSE value across all basins in the validation period, averaged across seeds. Table <xref ref-type="table" rid="TA2"/> lists the best configuration for each architecture.</p><table-wrap id="TA1"><label>Table A1</label><caption><p id="d2e1107">Hyperparameter tuning grid.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="3">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="left"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Hyperparameter</oasis:entry>
         <oasis:entry colname="col2"/>
         <oasis:entry colname="col3">Values</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">Embedding hidden layer sizes</oasis:entry>
         <oasis:entry colname="col2">Input replacing</oasis:entry>
         <oasis:entry colname="col3">[5], [7, 5], [17, 10], [17, 17, 10], [17, 17, 17, 10]</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">(ReLU-activated)</oasis:entry>
         <oasis:entry colname="col2">Masked mean</oasis:entry>
         <oasis:entry colname="col3">[5], [5, 5], [10, 10], [10, 10, 10], [10, 10, 10, 10]</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1"/>
         <oasis:entry colname="col2">Attention<sup>*</sup></oasis:entry>
         <oasis:entry colname="col3">[10, 10], [10, 10, 10], [10, 10, 10, 10]</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Positional encoding size</oasis:entry>
         <oasis:entry colname="col2"/>
         <oasis:entry colname="col3">0, 5</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Number of attention heads</oasis:entry>
         <oasis:entry colname="col2"/>
         <oasis:entry colname="col3">1, 2, 5</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Evaluated epochs</oasis:entry>
         <oasis:entry colname="col2"/>
         <oasis:entry colname="col3">5, 10, 15, 20, 25, 30, 35, 40</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table><table-wrap-foot><p id="d2e1110"><sup>*</sup> We excluded configurations with hidden size 5, because the final embedding size must be divisible by the number of attention heads.</p></table-wrap-foot></table-wrap>

<table-wrap id="TA2"><label>Table A2</label><caption><p id="d2e1230">Best hyperparameter configurations based on validation period results.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="5">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:thead>
       <oasis:row>
         <oasis:entry colname="col1">Architecture</oasis:entry>
         <oasis:entry colname="col2">Embedding hidden</oasis:entry>
         <oasis:entry colname="col3">Positional</oasis:entry>
         <oasis:entry colname="col4">Number of</oasis:entry>
         <oasis:entry colname="col5">Epoch</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1"/>
         <oasis:entry colname="col2">layer sizes</oasis:entry>
         <oasis:entry colname="col3">encoding size</oasis:entry>
         <oasis:entry colname="col4">attention heads</oasis:entry>
         <oasis:entry colname="col5"/>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">Input replacing</oasis:entry>
         <oasis:entry colname="col2">[17, 10]</oasis:entry>
         <oasis:entry colname="col3">5</oasis:entry>
         <oasis:entry colname="col4">–</oasis:entry>
         <oasis:entry colname="col5">30</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Masked mean</oasis:entry>
         <oasis:entry colname="col2">[10, 10, 10, 10]</oasis:entry>
         <oasis:entry colname="col3">0</oasis:entry>
         <oasis:entry colname="col4">–</oasis:entry>
         <oasis:entry colname="col5">35</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Attention</oasis:entry>
         <oasis:entry colname="col2">[10, 10, 10]</oasis:entry>
         <oasis:entry colname="col3">5</oasis:entry>
         <oasis:entry colname="col4">1</oasis:entry>
         <oasis:entry colname="col5">30</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>
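
      <p>The selection rule used for the grid search can be sketched as follows. We read "median NSE across basins, averaged across seeds" as taking the median over basins for each seed and then averaging these medians; this reading, the data layout, and the function name are assumptions of the sketch.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def select_best(val_nse: dict[str, np.ndarray]) -> str:
    """Return the configuration with the highest seed-averaged median validation NSE.

    val_nse maps a configuration name to an array of shape (n_seeds, n_basins).
    """
    scores = {name: float(np.median(nse, axis=1).mean()) for name, nse in val_nse.items()}
    return max(scores, key=scores.get)
```
]]></preformat>
      <p>Using the median rather than the mean across basins keeps a handful of basins with very low NSE values from dominating the choice of configuration.</p>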

</app>

<app id="App1.Ch1.S2">
  <label>Appendix B</label><title>Computational resources</title>
      <p id="d2e1351">We conducted all experiments on Nvidia P100 GPU machines running Python 3.11 and NeuralHydrology 1.11.0 (with local modifications that are part of the 1.12.0 release). In total, including preliminary experiments, hyperparameter tuning, and final experiments, we trained approximately 800 models. This amounts to approximately 286 wall-time computation days (measuring the time from writing the configuration to disk to the last Tensorboard update). We did not spend any effort optimizing the runtime of these jobs; many runs could have been sped up significantly, e.g., through increased parallelism in data loading.</p>
</app>

<app id="App1.Ch1.S3">
  <label>Appendix C</label><title>Additional figures</title>
      <p id="d2e1364">Because no single metric adequately captures the quality of a model <xref ref-type="bibr" rid="bib1.bibx19" id="paren.45"/>, we provide Fig. <xref ref-type="fig" rid="FC1"/> as an extended version of Fig. <xref ref-type="fig" rid="F6"/> (showing the performance with an increasing number of NaN inputs for a variety of additional metrics). Further, Fig. <xref ref-type="fig" rid="FC2"/> extends Fig. <xref ref-type="fig" rid="F8"/> and shows empirical CDFs of the experiment with regional forcings for additional metrics. We refer to <xref ref-type="bibr" rid="bib1.bibx19" id="text.46"/> for the definitions of these measures.</p>
      <p id="d2e1382">Lastly, Fig. <xref ref-type="fig" rid="FC3"/> shows the fractional attention to each forcing product for three models trained with different random seeds (see experiment 1).</p><fig id="FC1"><label>Figure C1</label><caption><p id="d2e1389">Extended version of Fig. <xref ref-type="fig" rid="F6"/>, showing additional metrics (see <xref ref-type="bibr" rid="bib1.bibx19" id="altparen.47"/> for the definitions of these metrics).</p></caption>
        
        <graphic xlink:href="https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025-f09.png"/>

      </fig>

<fig id="FC2"><label>Figure C2</label><caption><p id="d2e1409">Extended version of Fig. <xref ref-type="fig" rid="F8"/>, showing additional metrics (see <xref ref-type="bibr" rid="bib1.bibx19" id="altparen.48"/> for the definitions of these metrics).</p></caption>
        
        <graphic xlink:href="https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025-f10.png"/>

      </fig>

<fig id="FC3"><label>Figure C3</label><caption><p id="d2e1428">Fraction of attention each product received at each basin, averaged over time. The pie slices are scaled by their fraction to (overly) emphasize differences. Each subplot shows the results for a model trained with a different seed. For better overview, we only plot a random sample of 100 gauges.</p></caption>
        <graphic xlink:href="https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025-f11.png"/>

      </fig>

</app>

<app id="App1.Ch1.S4">
  <label>Appendix D</label><title>A very brief introduction to attention</title>
      <p id="d2e1445">This section gives a brief high-level introduction to attention, since, as of now, attention is not a widely used concept in hydrologic deep learning applications. As the name suggests, the main idea of “attention” is to provide neural networks with a way to focus on specific parts of their inputs, depending on the current context. Early attention mechanisms come from language applications <xref ref-type="bibr" rid="bib1.bibx20 bib1.bibx5" id="paren.49"/>, where models would focus on relevant words in the source language to produce the corresponding translated words in the target language. With the introduction of the Transformer architecture, attention became one of the most widely used concepts in deep learning <xref ref-type="bibr" rid="bib1.bibx45" id="paren.50"/>. By now, attention and similar approaches have made their way into applications in various fields, including hydrology <xref ref-type="bibr" rid="bib1.bibx4 bib1.bibx38" id="paren.51"><named-content content-type="pre">e.g.,</named-content></xref>.</p>

      <fig id="FD1"><label>Figure D1</label><caption><p id="d2e1461">High-level illustration of attention. The query vector (left) is compared to each key vector (middle), and the corresponding value vectors are merged in a weighted average according to the similarity measure, producing the attention output (right).</p></caption>
        <graphic xlink:href="https://hess.copernicus.org/articles/29/6221/2025/hess-29-6221-2025-f12.png"/>

      </fig>

      <p id="d2e1470">One way to think about attention – and the origin of today's query/key/value nomenclature – is as a learned similarity-based soft database retrieval (Fig. <xref ref-type="fig" rid="FD1"/>). Let us deconstruct this: by “database”, we refer to pairs of so-called <italic>values</italic> and <italic>keys</italic>. That is, each value is an entry in the database that we can retrieve with its associated key. Given a <italic>query</italic>, we calculate a similarity score between the query and each key (this constitutes the “similarity-based” component). All three elements (query/keys/values) are network embeddings, i.e., vectors. For example, one could embed a time series of runoff observations as keys, create a one-to-one mapping to the values, and then use a given event as the query to search for similar occurrences. The output of the attention operation is a weighted mean of all values, where the weight is higher for values whose keys are more similar to the query (hence “soft” lookup; we do not return a specific value from the database but a weighted average across all values). If we use attention for a translation task, for instance, the query would be a learned embedding of the word currently being processed, and keys and values would be embeddings of all source language words<fn id="App1.Ch1.Footn1"><p id="d2e1484">We ignore some specifics of language modeling here (e.g., positional encoding or tokenization), because they are not immediately relevant to the attention mechanism at the high level of our explanation.</p></fn>. By adjusting the embedding networks, the model can now learn to achieve higher similarity between the query and words that are relevant for translating the current word and lower similarity between the query and irrelevant words. Finally, we can apply masking (setting the resulting attention weights to zero) to disallow attention to certain words.</p>
      <p id="d2e1489">While the most common application of attention is retrieval along a temporal axis (such as the progression of a sentence), the concept generalizes to retrieval of values from arbitrary sets <xref ref-type="bibr" rid="bib1.bibx11 bib1.bibx37" id="paren.52"/>. In this paper, we consider the embeddings of meteorological forcings as our key–value database (the embeddings act both as keys and as values), and the static attributes of a basin as our query. Hence, the model can learn to retrieve different forcing combinations in different places.</p>
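      <p>The retrieval described in this appendix can be condensed into a few lines. The sketch below is a single-head illustration for one time step and assumes a scaled dot product as the similarity measure; the similarity function, the function name, and the array shapes are assumptions and do not reproduce the exact configuration of our models. As in our setup, the same forcing embeddings can be passed as both keys and values.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def masked_attention(query, keys, values, available):
    """Single-head dot-product attention over forcing products, ignoring missing ones.

    query:     (d,) embedding of the static attributes
    keys:      (n_products, d) embeddings of the forcing products
    values:    (n_products, d_v) embeddings to be averaged
    available: (n_products,) boolean; at least one entry must be True
    """
    scores = keys @ query / np.sqrt(query.size)  # similarity of the query to each key
    scores = np.where(available, scores, -np.inf)  # masked products end up with weight 0
    weights = np.exp(scores - scores[available].max())  # shift by the max for stability
    weights = weights / weights.sum()  # softmax over the available products
    output = weights @ np.where(available[:, None], values, 0.0)
    return output, weights
```
]]></preformat>
      <p>If all similarity scores are equal, the weights become uniform over the available products and the output coincides with the masked mean. This is the solution that the trained attention models largely recover, with weights close to 1/3 when all three products are present.</p>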
      <p id="d2e1495">We conclude this short introduction with the caveat that deep learning is an active field, and at this point there are thousands of publications leveraging, improving, or analyzing attention mechanisms. Therefore, this introduction is by far not exhaustive, nor does it cover any of the formal and mathematical aspects. For a deeper introduction, including the actual equations, we refer to <xref ref-type="bibr" rid="bib1.bibx3" id="text.53"/>, <xref ref-type="bibr" rid="bib1.bibx40" id="text.54"/>, and <xref ref-type="bibr" rid="bib1.bibx6" id="text.55"/>.</p>
</app>
  </app-group><notes notes-type="codedataavailability"><title>Code and data availability</title>

      <p id="d2e1511">We conducted all experiments with the NeuralHydrology library <xref ref-type="bibr" rid="bib1.bibx27" id="paren.56"/>. The CAMELS dataset necessary to run the experiments is available at <uri>https://ral.ucar.edu/solutions/products/camels</uri> (last access: 10 November 2025; <xref ref-type="bibr" rid="bib1.bibx34 bib1.bibx1" id="altparen.57"/>). The extended Maurer and NLDAS forcings (which include daily minimum and maximum temperature) are available at <ext-link xlink:href="https://doi.org/10.4211/hs.17c896843cf940339c3c3496d0c1c077" ext-link-type="DOI">10.4211/hs.17c896843cf940339c3c3496d0c1c077</ext-link> <xref ref-type="bibr" rid="bib1.bibx23" id="paren.58"/> and <uri>https://doi.org/10.4211/hs.0a68bfd7ddf642a8be9041d60f40868c</uri> <xref ref-type="bibr" rid="bib1.bibx24" id="paren.59"/>. The additional code for the analyses and figures presented in this paper is available at <uri>https://github.com/gauchm/missing-inputs</uri> <xref ref-type="bibr" rid="bib1.bibx18" id="paren.60"><named-content content-type="pre"><ext-link xlink:href="https://doi.org/10.5281/zenodo.17362593" ext-link-type="DOI">10.5281/zenodo.17362593</ext-link>,</named-content></xref>. Finally, all trained models and results files are available at <ext-link xlink:href="https://doi.org/10.5281/zenodo.15008460" ext-link-type="DOI">10.5281/zenodo.15008460</ext-link> <xref ref-type="bibr" rid="bib1.bibx17" id="paren.61"/>.</p>
  </notes><notes notes-type="authorcontribution"><title>Author contributions</title>

      <p id="d2e1555">MG, FK, and DK developed the idea, conceptualization, and methods of the paper. MG wrote the code and ran the experiments. All authors were involved in the writing of the paper.</p>
  </notes><notes notes-type="competinginterests"><title>Competing interests</title>

      <p id="d2e1561">At least one of the (co-)authors is a member of the editorial board of <italic>Hydrology and Earth System Sciences</italic>. The peer-review process was guided by an independent editor, and the authors also have no other competing interests to declare.</p>
  </notes><notes notes-type="disclaimer"><title>Disclaimer</title>

      <p id="d2e1570">Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.</p>
  </notes><notes notes-type="reviewstatement"><title>Review statement</title>

      <p id="d2e1576">This paper was edited by Albrecht Weerts and reviewed by Juliane Mai and one anonymous referee.</p>
  </notes><ref-list>
    <title>References</title>

      <ref id="bib1.bibx1"><label>Addor et al.(2017)Addor, Newman, Mizukami, and Clark</label><mixed-citation>Addor, N., Newman, A. J., Mizukami, N., and Clark, M. P.: The CAMELS data set: catchment attributes and meteorology for large-sample studies, Hydrol. Earth Syst. Sci., 21, 5293–5313, <ext-link xlink:href="https://doi.org/10.5194/hess-21-5293-2017" ext-link-type="DOI">10.5194/hess-21-5293-2017</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bibx2"><label>Afifi and Elashoff(1966)</label><mixed-citation> Afifi, A. A. and Elashoff, R. M.: Missing Observations in Multivariate Statistics I. Review of the Literature, Journal of the American Statistical Association, 61, 595–604, 1966.</mixed-citation></ref>
      <ref id="bib1.bibx3"><label>Alammar(2018)</label><mixed-citation>Alammar, J.: Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention), Google Research Blog, <ext-link xlink:href="https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/">https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/</ext-link> (last access: 10 November 2025), 2018.</mixed-citation></ref>
      <ref id="bib1.bibx4"><label>Auer et al.(2024)Auer, Gauch, Kratzert, Nearing, Hochreiter, and Klotz</label><mixed-citation>Auer, A., Gauch, M., Kratzert, F., Nearing, G., Hochreiter, S., and Klotz, D.: A data-centric perspective on the information needed for hydrological uncertainty predictions, Hydrol. Earth Syst. Sci., 28, 4099–4126, <ext-link xlink:href="https://doi.org/10.5194/hess-28-4099-2024" ext-link-type="DOI">10.5194/hess-28-4099-2024</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx5"><label>Bahdanau et al.(2015)Bahdanau, Cho, and Bengio</label><mixed-citation>Bahdanau, D., Cho, K., and Bengio, Y.: Neural Machine Translation by Jointly Learning to Align and Translate, in: 3rd International Conference on Learning Representations (ICLR), arXiv, <ext-link xlink:href="https://doi.org/10.48550/arXiv.1409.0473" ext-link-type="DOI">10.48550/arXiv.1409.0473</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bibx6"><label>Bishop and Bishop(2023)</label><mixed-citation>Bishop, C. M. and Bishop, H.: Deep learning: Foundations and concepts, Springer Nature, <ext-link xlink:href="https://doi.org/10.1007/978-3-031-45468-4" ext-link-type="DOI">10.1007/978-3-031-45468-4</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx7"><label>Brown et al.(2020)Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Wu, Winter, Hesse, Chen, Sigler, Litwin, Gray, Chess, Clark, Berner, McCandlish, Radford, Sutskever, and Amodei</label><mixed-citation>Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D.: Language Models are Few-Shot Learners, in: Advances in Neural Information Processing Systems, vol. 33, 1877–1901, Curran Associates, <uri>https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html</uri> (last access: 10 November 2025), 2020.</mixed-citation></ref>
      <ref id="bib1.bibx8"><label>Clerc-Schwarzenbach et al.(2024)Clerc-Schwarzenbach, Selleri, Neri, Toth, van Meerveld, and Seibert</label><mixed-citation>Clerc-Schwarzenbach, F., Selleri, G., Neri, M., Toth, E., van Meerveld, I., and Seibert, J.: Large-sample hydrology – a few camels or a whole caravan?, Hydrol. Earth Syst. Sci., 28, 4219–4237, <ext-link xlink:href="https://doi.org/10.5194/hess-28-4219-2024" ext-link-type="DOI">10.5194/hess-28-4219-2024</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx9"><label>Cohen(2024)</label><mixed-citation>Cohen, D.: An improved flood forecasting AI model, trained and evaluated globally, Google Research Blog, <uri>https://research.google/blog/a-flood-forecasting-ai-model-trained-and-evaluated-globally/</uri> (last access: 10 November 2025), 2024.</mixed-citation></ref>
      <ref id="bib1.bibx10"><label>Devlin et al.(2019)Devlin, Chang, Lee, and Toutanova</label><mixed-citation>Devlin, J., Chang, M., Lee, K., and Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186, Association for Computational Linguistics, <ext-link xlink:href="https://doi.org/10.18653/v1/N19-1423" ext-link-type="DOI">10.18653/v1/N19-1423</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx11"><label>Dosovitskiy et al.(2021)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit, and Houlsby</label><mixed-citation>Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, in: 9th International Conference on Learning Representations (ICLR), openreview.net, <uri>https://openreview.net/forum?id=YicbFdNTTy</uri> (last access: 10 November 2025), 2021.</mixed-citation></ref>
      <ref id="bib1.bibx12"><label>Dumoulin et al.(2018)Dumoulin, Perez, Schucher, Strub, Vries, Courville, and Bengio</label><mixed-citation>Dumoulin, V., Perez, E., Schucher, N., Strub, F., de Vries, H., Courville, A., and Bengio, Y.: Feature-wise transformations, Distill, <ext-link xlink:href="https://doi.org/10.23915/distill.00011" ext-link-type="DOI">10.23915/distill.00011</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx13"><label>Ebrahimi et al.(2023)Ebrahimi, Arik, Dong, and Pfister</label><mixed-citation>Ebrahimi, S., Arik, S. O., Dong, Y., and Pfister, T.: LANISTR: Multimodal learning from structured and unstructured data, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.2305.16556" ext-link-type="DOI">10.48550/arXiv.2305.16556</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx14"><label>Frame et al.(2025)Frame, Araki, Bhuiyan, Bindas, Rapp, Bolotin, Deardorff, Liu, Haces-Garcia, Liao, Frazier, and Ogden</label><mixed-citation>Frame, J. M., Araki, R., Bhuiyan, S. A., Bindas, T., Rapp, J., Bolotin, L., Deardorff, E., Liu, Q., Haces-Garcia, F., Liao, M., Frazier, N., and Ogden, F. L.: Machine Learning for a Heterogeneous Water Modeling Framework, JAWRA Journal of the American Water Resources Association, 61, e70000, <ext-link xlink:href="https://doi.org/10.1111/1752-1688.70000" ext-link-type="DOI">10.1111/1752-1688.70000</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx15"><label>Franken et al.(2022)Franken, Gullentops, Wolfs, Defloor, Cabus, and De Jongh</label><mixed-citation>Franken, T., Gullentops, C., Wolfs, V., Defloor, W., Cabus, P., and De Jongh, I.: An operational framework for data driven low flow forecasts in Flanders, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-6191, <ext-link xlink:href="https://doi.org/10.5194/egusphere-egu22-6191" ext-link-type="DOI">10.5194/egusphere-egu22-6191</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx16"><label>Gao et al.(2018)Gao, Merz, Lischeid, and Schneider</label><mixed-citation>Gao, Y., Merz, C., Lischeid, G., and Schneider, M.: A review on missing hydrological data processing, Environmental Earth Sciences, 77, 47, <ext-link xlink:href="https://doi.org/10.1007/s12665-018-7228-6" ext-link-type="DOI">10.1007/s12665-018-7228-6</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx17"><label>Gauch(2025a)</label><mixed-citation>Gauch, M.: Models and predictions for “How to deal w___ missing input data”, Zenodo [data set], <ext-link xlink:href="https://doi.org/10.5281/zenodo.16983185" ext-link-type="DOI">10.5281/zenodo.16983185</ext-link>, 2025a.</mixed-citation></ref>
      <ref id="bib1.bibx18"><label>Gauch(2025b)</label><mixed-citation>Gauch, M.: How to deal w___ missing inputs: Code (v1.0), Zenodo [code, ICE], <ext-link xlink:href="https://doi.org/10.5281/zenodo.17362593" ext-link-type="DOI">10.5281/zenodo.17362593</ext-link>, 2025b.</mixed-citation></ref>
      <ref id="bib1.bibx19"><label>Gauch et al.(2023)Gauch, Kratzert, Gilon, Gupta, Mai, Nearing, Tolson, Hochreiter, and Klotz</label><mixed-citation>Gauch, M., Kratzert, F., Gilon, O., Gupta, H., Mai, J., Nearing, G., Tolson, B., Hochreiter, S., and Klotz, D.: In Defense of Metrics: Metrics Sufficiently Encode Typical Human Preferences Regarding Hydrological Model Performance, Water Resources Research, 59, e2022WR033918, <ext-link xlink:href="https://doi.org/10.1029/2022WR033918" ext-link-type="DOI">10.1029/2022WR033918</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx20"><label>Graves(2013)</label><mixed-citation>Graves, A.: Generating Sequences With Recurrent Neural Networks, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.1308.0850" ext-link-type="DOI">10.48550/arXiv.1308.0850</ext-link>, 2013.</mixed-citation></ref>
      <ref id="bib1.bibx21"><label>Gupta et al.(2009)Gupta, Kling, Yilmaz, and Martinez</label><mixed-citation>Gupta, H. V., Kling, H., Yilmaz, K. K., and Martinez, G. F.: Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling, Journal of Hydrology, 377, 80–91, 2009.</mixed-citation></ref>
      <ref id="bib1.bibx22"><label>He et al.(2022)He, Chen, Xie, Li, Dollár, and Girshick</label><mixed-citation>He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R.: Masked Autoencoders Are Scalable Vision Learners, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16000–16009, <ext-link xlink:href="https://doi.org/10.1109/CVPR52688.2022.01553" ext-link-type="DOI">10.1109/CVPR52688.2022.01553</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx23"><label>Kratzert(2019a)</label><mixed-citation>Kratzert, F.: CAMELS Extended Maurer Forcing Data, HydroShare [data set], <ext-link xlink:href="https://doi.org/10.4211/hs.17c896843cf940339c3c3496d0c1c077" ext-link-type="DOI">10.4211/hs.17c896843cf940339c3c3496d0c1c077</ext-link>, 2019a.</mixed-citation></ref>
      <ref id="bib1.bibx24"><label>Kratzert(2019b)</label><mixed-citation>Kratzert, F.: CAMELS Extended NLDAS Forcing Data, HydroShare [data set], <ext-link xlink:href="https://doi.org/10.4211/hs.0a68bfd7ddf642a8be9041d60f40868c" ext-link-type="DOI">10.4211/hs.0a68bfd7ddf642a8be9041d60f40868c</ext-link>, 2019b.</mixed-citation></ref>
      <ref id="bib1.bibx25"><label>Kratzert et al.(2019)Kratzert, Klotz, Shalev, Klambauer, Hochreiter, and Nearing</label><mixed-citation>Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., and Nearing, G.: Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets, Hydrol. Earth Syst. Sci., 23, 5089–5110, <ext-link xlink:href="https://doi.org/10.5194/hess-23-5089-2019" ext-link-type="DOI">10.5194/hess-23-5089-2019</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx26"><label>Kratzert et al.(2021)Kratzert, Klotz, Hochreiter, and Nearing</label><mixed-citation>Kratzert, F., Klotz, D., Hochreiter, S., and Nearing, G. S.: A note on leveraging synergy in multiple meteorological data sets with deep learning for rainfall–runoff modeling, Hydrol. Earth Syst. Sci., 25, 2685–2703, <ext-link xlink:href="https://doi.org/10.5194/hess-25-2685-2021" ext-link-type="DOI">10.5194/hess-25-2685-2021</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx27"><label>Kratzert et al.(2022)Kratzert, Gauch, Nearing, and Klotz</label><mixed-citation>Kratzert, F., Gauch, M., Nearing, G., and Klotz, D.: NeuralHydrology – A Python library for Deep Learning research in hydrology, Journal of Open Source Software, 7, 4050, <ext-link xlink:href="https://doi.org/10.21105/joss.04050" ext-link-type="DOI">10.21105/joss.04050</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx28"><label>Kratzert et al.(2023)Kratzert, Nearing, Addor, Erickson, Gauch, Gilon, Gudmundsson, Hassidim, Klotz, Nevo et al.</label><mixed-citation>Kratzert, F., Nearing, G., Addor, N., Erickson, T., Gauch, M., Gilon, O., Gudmundsson, L., Hassidim, A., Klotz, D., Nevo, S., Shalev, G., and Matias, Y.: Caravan – A global community dataset for large-sample hydrology, Scientific Data, 10, 61, <ext-link xlink:href="https://doi.org/10.1038/s41597-023-01975-w" ext-link-type="DOI">10.1038/s41597-023-01975-w</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx29"><label>Kratzert et al.(2024)Kratzert, Gauch, Klotz, and Nearing</label><mixed-citation>Kratzert, F., Gauch, M., Klotz, D., and Nearing, G.: HESS Opinions: Never train a Long Short-Term Memory (LSTM) network on a single basin, Hydrol. Earth Syst. Sci., 28, 4187–4201, <ext-link xlink:href="https://doi.org/10.5194/hess-28-4187-2024" ext-link-type="DOI">10.5194/hess-28-4187-2024</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx30"><label>Maurer et al.(2002)Maurer, Wood, Adam, Lettenmaier, and Nijssen</label><mixed-citation>Maurer, E. P., Wood, A. W., Adam, J. C., Lettenmaier, D. P., and Nijssen, B.: A Long-Term Hydrologically Based Dataset of Land Surface Fluxes and States for the Conterminous United States, Journal of Climate, 15, 3237–3251, 2002.</mixed-citation></ref>
      <ref id="bib1.bibx31"><label>Mitchell and Jolley(1988)</label><mixed-citation>Mitchell, M. and Jolley, J.: Research design explained, Holt, Rinehart &amp; Winston Inc., ISBN: 0030040248, <uri>https://psycnet.apa.org/record/1987-98845-000</uri> (last access: 10 November 2025), 1988.</mixed-citation></ref>
      <ref id="bib1.bibx32"><label>Nash and Sutcliffe(1970)</label><mixed-citation>Nash, J. E. and Sutcliffe, J. V.: River flow forecasting through conceptual models part I – A discussion of principles, Journal of Hydrology, 10, 282–290, 1970.</mixed-citation></ref>
      <ref id="bib1.bibx33"><label>Nearing et al.(2024)Nearing, Cohen, Dube, Gauch, Gilon, Harrigan, Hassidim, Klotz, Kratzert, Metzger, Nevo, Pappenberger, Prudhomme, Shalev, Shenzis, Tekalign, Weitzner, and Matias</label><mixed-citation>Nearing, G., Cohen, D., Dube, V., Gauch, M., Gilon, O., Harrigan, S., Hassidim, A., Klotz, D., Kratzert, F., Metzger, A., Nevo, S., Pappenberger, F., Prudhomme, C., Shalev, G., Shenzis, S., Tekalign, T. Y., Weitzner, D., and Matias, Y.: Global prediction of extreme floods in ungauged watersheds, Nature, 627, 559–563, <ext-link xlink:href="https://doi.org/10.1038/s41586-024-07145-1" ext-link-type="DOI">10.1038/s41586-024-07145-1</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx34"><label>Newman et al.(2015)Newman, Clark, Sampson, Wood, Hay, Bock, Viger, Blodgett, Brekke, Arnold et al.</label><mixed-citation>Newman, A. J., Clark, M. P., Sampson, K., Wood, A., Hay, L. E., Bock, A., Viger, R. J., Blodgett, D., Brekke, L., Arnold, J. R., Hopson, T., and Duan, Q.: Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: data set characteristics and assessment of regional variability in hydrologic model performance, Hydrol. Earth Syst. Sci., 19, 209–223, <ext-link xlink:href="https://doi.org/10.5194/hess-19-209-2015" ext-link-type="DOI">10.5194/hess-19-209-2015</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bibx35"><label>Perez et al.(2018)Perez, Strub, de Vries, Dumoulin, and Courville</label><mixed-citation>Perez, E., Strub, F., de Vries, H., Dumoulin, V., and Courville, A.: FiLM: Visual Reasoning with a General Conditioning Layer, Proceedings of the AAAI Conference on Artificial Intelligence, 32, 3942–3951, <ext-link xlink:href="https://doi.org/10.1609/aaai.v32i1.11671" ext-link-type="DOI">10.1609/aaai.v32i1.11671</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx36"><label>Raffel et al.(2020)Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li, and Liu</label><mixed-citation>Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J.: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Journal of Machine Learning Research, 21, 1–67, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx37"><label>Ramsauer et al.(2021)Ramsauer, Schäfl, Lehner, Seidl, Widrich, Gruber, Holzleitner, Adler, Kreil, Kopp, Klambauer, Brandstetter, and Hochreiter</label><mixed-citation>Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Gruber, L., Holzleitner, M., Adler, T., Kreil, D., Kopp, M. K., Klambauer, G., Brandstetter, J., and Hochreiter, S.: Hopfield Networks is All You Need, in: 9th International Conference on Learning Representations (ICLR), openreview.net, <uri>https://openreview.net/forum?id=tL89RnzIiCd</uri> (last access: 10 November 2025), 2021.</mixed-citation></ref>
      <ref id="bib1.bibx38"><label>Rasiya Koya and Roy(2024)</label><mixed-citation>Rasiya Koya, S. and Roy, T.: Temporal Fusion Transformers for streamflow Prediction: Value of combining attention with recurrence, Journal of Hydrology, 637, 131301, <ext-link xlink:href="https://doi.org/10.1016/j.jhydrol.2024.131301" ext-link-type="DOI">10.1016/j.jhydrol.2024.131301</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx39"><label>Read et al.(2021)Read, Sampson, Lambl, Butcher, Gulland, and Elkurdy</label><mixed-citation>Read, L., Sampson, A. K., Lambl, D., Butcher, P., Gulland, L., and Elkurdy, M.: Lessons learned applying a machine learning hydrologic forecast model in a live forecasting competition, in: AGU Fall Meeting Abstracts, vol. 2021, H22A–07, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx40"><label>Rohrer(2021)</label><mixed-citation>Rohrer, B.: Transformers from Scratch, <uri>https://e2eml.school/transformers.html</uri> (last access: 10 November 2025), 2021.</mixed-citation></ref>
      <ref id="bib1.bibx41"><label>Schafer(1997)</label><mixed-citation>Schafer, J. L.: Analysis of incomplete multivariate data, CRC Press, <ext-link xlink:href="https://doi.org/10.1201/9781439821862" ext-link-type="DOI">10.1201/9781439821862</ext-link>, 1997.</mixed-citation></ref>
      <ref id="bib1.bibx42"><label>Shalev and Kratzert(2024)</label><mixed-citation>Shalev, G. and Kratzert, F.: Caravan MultiMet: Extending Caravan with Multiple Weather Nowcasts and Forecasts, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.2411.09459" ext-link-type="DOI">10.48550/arXiv.2411.09459</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx43"><label>Srivastava et al.(2014)Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov</label><mixed-citation>Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting, Journal of Machine Learning Research, 15, 1929–1958, 2014.</mixed-citation></ref>
      <ref id="bib1.bibx44"><label>Thornton et al.(1997)Thornton, Running, and White</label><mixed-citation>Thornton, P. E., Running, S. W., and White, M. A.: Generating surfaces of daily meteorological variables over large regions of complex terrain, Journal of Hydrology, 190, 214–251, 1997.</mixed-citation></ref>
      <ref id="bib1.bibx45"><label>Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin</label><mixed-citation>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.: Attention is All you Need, in: Advances in Neural Information Processing Systems, vol. 30, Curran Associates, <uri>https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html</uri> (last access: 10 November 2025), 2017.</mixed-citation></ref>
      <ref id="bib1.bibx46"><label>Wells(1960)</label><mixed-citation>Wells, J.: Compilation of records of surface waters of the United States through September 1950: Part 1-B. North Atlantic slope basins, New York to York River, Tech. rep., US Geological Survey, <ext-link xlink:href="https://doi.org/10.3133/wsp1302" ext-link-type="DOI">10.3133/wsp1302</ext-link>, 1960.</mixed-citation></ref>
      <ref id="bib1.bibx47"><label>Wu et al.(2020)Wu, Zhang, Ilyas, and Rekatsinas</label><mixed-citation>Wu, R., Zhang, A., Ilyas, I., and Rekatsinas, T.: Attention-based Learning for Missing Data Imputation in HoloClean, Proceedings of Machine Learning and Systems, 2, 307–325, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx48"><label>Xia et al.(2012)Xia, Mitchell, Ek, Sheffield, Cosgrove, Wood, Luo, Alonge, Wei, Meng, Livneh, Lettenmaier, Koren, Duan, Mo, Fan, and Mocko</label><mixed-citation>Xia, Y., Mitchell, K., Ek, M., Sheffield, J., Cosgrove, B., Wood, E., Luo, L., Alonge, C., Wei, H., Meng, J., Livneh, B., Lettenmaier, D., Koren, V., Duan, Q., Mo, K., Fan, Y., and Mocko, D.: Continental-scale water and energy flux analysis and validation for the North American Land Data Assimilation System project phase 2 (NLDAS-2): 1. Intercomparison and application of model products, Journal of Geophysical Research: Atmospheres, 117, <ext-link xlink:href="https://doi.org/10.1029/2011JD016048" ext-link-type="DOI">10.1029/2011JD016048</ext-link>, 2012.</mixed-citation></ref>
      <ref id="bib1.bibx49"><label>Yozgatligil et al.(2013)Yozgatligil, Aslan, Iyigun, and Batmaz</label><mixed-citation>Yozgatligil, C., Aslan, S., Iyigun, C., and Batmaz, I.: Comparison of missing value imputation methods in time series: the case of Turkish meteorological data, Theoretical and Applied Climatology, 112, 143–167, 2013.</mixed-citation></ref>

  </ref-list></back>
    <!--<article-title-html>How to deal w___ missing input data</article-title-html>
<abstract-html/>
<ref-html id="bib1.bib1"><label>Addor et al.(2017)Addor, Newman, Mizukami, and
Clark</label><mixed-citation>
      
Addor, N., Newman, A. J., Mizukami, N., and Clark, M. P.: The CAMELS data set: catchment attributes and meteorology for large-sample studies, Hydrol. Earth Syst. Sci., 21, 5293–5313, <a href="https://doi.org/10.5194/hess-21-5293-2017" target="_blank">https://doi.org/10.5194/hess-21-5293-2017</a>, 2017.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib2"><label>Afifi and Elashoff(1966)</label><mixed-citation>
      
Afifi, A. A. and Elashoff, R. M.: Missing Observations in Multivariate
Statistics I. Review of the Literature, Journal of the American Statistical
Association, 61, 595–604, 1966.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib3"><label>Alammar(2018)</label><mixed-citation>
      
Alammar, J.: Visualizing A Neural Machine Translation Model (Mechanics of
Seq2seq Models With Attention), Google Research Blog,
<a href="https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/" target="_blank">https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/</a> (last access: 10 November 2025),
2018.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib4"><label>Auer et al.(2024)Auer, Gauch, Kratzert, Nearing, Hochreiter, and
Klotz</label><mixed-citation>
      
Auer, A., Gauch, M., Kratzert, F., Nearing, G., Hochreiter, S., and Klotz, D.: A data-centric perspective on the information needed for hydrological uncertainty predictions, Hydrol. Earth Syst. Sci., 28, 4099–4126, <a href="https://doi.org/10.5194/hess-28-4099-2024" target="_blank">https://doi.org/10.5194/hess-28-4099-2024</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib5"><label>Bahdanau et al.(2015)Bahdanau, Cho, and Bengio</label><mixed-citation>
      
Bahdanau, D., Cho, K., and Bengio, Y.: Neural Machine Translation by Jointly
Learning to Align and Translate, in: 3rd International Conference on Learning
Representations (ICLR), arXiv, <a href="https://doi.org/10.48550/arXiv.1409.0473" target="_blank">https://doi.org/10.48550/arXiv.1409.0473</a>, 2015.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib6"><label>Bishop and Bishop(2023)</label><mixed-citation>
      
Bishop, C. M. and Bishop, H.: Deep learning: Foundations and concepts, Springer
Nature, <a href="https://doi.org/10.1007/978-3-031-45468-4" target="_blank">https://doi.org/10.1007/978-3-031-45468-4</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib7"><label>Brown et al.(2020)Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal,
Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan,
Child, Ramesh, Ziegler, Wu, Winter, Hesse, Chen, Sigler, Litwin, Gray, Chess,
Clark, Berner, McCandlish, Radford, Sutskever, and Amodei</label><mixed-citation>
      
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P.,
Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S.,
Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler,
D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray,
S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever,
I., and Amodei, D.: Language Models are Few-Shot Learners, in: Advances in
Neural Information Processing Systems, vol. 33, 1877–1901, Curran
Associates, <a href="https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html" target="_blank"/> (last access: 10 November 2025), 2020.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib8"><label>Clerc-Schwarzenbach et al.(2024)Clerc-Schwarzenbach, Selleri, Neri,
Toth, van Meerveld, and Seibert</label><mixed-citation>
      
Clerc-Schwarzenbach, F., Selleri, G., Neri, M., Toth, E., van Meerveld, I., and Seibert, J.: Large-sample hydrology – a few camels or a whole caravan?, Hydrol. Earth Syst. Sci., 28, 4219–4237, <a href="https://doi.org/10.5194/hess-28-4219-2024" target="_blank">https://doi.org/10.5194/hess-28-4219-2024</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib9"><label>Cohen(2024)</label><mixed-citation>
      
Cohen, D.: An improved flood forecasting AI model, trained and evaluated
globally, Google Research Blog,
<a href="https://research.google/blog/a-flood-forecasting-ai-model-trained-and-evaluated-globally/" target="_blank"/> (last access: 10 November 2025),
2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib10"><label>Devlin et al.(2019)Devlin, Chang, Lee, and
Toutanova</label><mixed-citation>
      
Devlin, J., Chang, M., Lee, K., and Toutanova, K.: BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding, in: Proceedings of the
2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, 4171–4186, Association for Computational Linguistics, <a href="https://doi.org/10.18653/v1/N19-1423" target="_blank">https://doi.org/10.18653/v1/N19-1423</a>, 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib11"><label>Dosovitskiy et al.(2021)Dosovitskiy, Beyer, Kolesnikov, Weissenborn,
Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit, and
Houlsby</label><mixed-citation>
      
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X.,
Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S.,
Uszkoreit, J., and Houlsby, N.: An Image is Worth 16x16 Words: Transformers
for Image Recognition at Scale, in: 9th International Conference on Learning
Representations (ICLR), openreview.net, <a href="https://openreview.net/forum?id=YicbFdNTTy" target="_blank"/> (last access: 10 November 2025), 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib12"><label>Dumoulin et al.(2018)Dumoulin, Perez, Schucher, Strub, Vries,
Courville, and Bengio</label><mixed-citation>
      
Dumoulin, V., Perez, E., Schucher, N., Strub, F., Vries, H. d., Courville, A.,
and Bengio, Y.: Feature-wise transformations, Distill, <a href="https://doi.org/10.23915/distill.00011" target="_blank">https://doi.org/10.23915/distill.00011</a>, 2018.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib13"><label>Ebrahimi et al.(2023)Ebrahimi, Arik, Dong, and
Pfister</label><mixed-citation>
      
Ebrahimi, S., Arik, S. O., Dong, Y., and Pfister, T.: LANISTR: Multimodal
learning from structured and unstructured data, arXiv [preprint], <a href="https://doi.org/10.48550/arXiv.2305.16556" target="_blank">https://doi.org/10.48550/arXiv.2305.16556</a>,
2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib14"><label>Frame et al.(2025)Frame, Araki, Bhuiyan, Bindas, Rapp, Bolotin,
Deardorff, Liu, Haces-Garcia, Liao, Frazier, and Ogden</label><mixed-citation>
      
Frame, J. M., Araki, R., Bhuiyan, S. A., Bindas, T., Rapp, J., Bolotin, L.,
Deardorff, E., Liu, Q., Haces-Garcia, F., Liao, M., Frazier, N., and Ogden,
F. L.: Machine Learning for a Heterogeneous Water Modeling Framework, JAWRA
Journal of the American Water Resources Association, 61, e70000, <a href="https://doi.org/10.1111/1752-1688.70000" target="_blank">https://doi.org/10.1111/1752-1688.70000</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib15"><label>Franken et al.(2022)Franken, Gullentops, Wolfs, Defloor, Cabus, and
De Jongh</label><mixed-citation>
      
Franken, T., Gullentops, C., Wolfs, V., Defloor, W., Cabus, P., and De Jongh, I.: An operational framework for data driven low flow forecasts in Flanders, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-6191, <a href="https://doi.org/10.5194/egusphere-egu22-6191" target="_blank">https://doi.org/10.5194/egusphere-egu22-6191</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib16"><label>Gao et al.(2018)Gao, Merz, Lischeid, and Schneider</label><mixed-citation>
      
Gao, Y., Merz, C., Lischeid, G., and Schneider, M.: A review on missing
hydrological data processing, Environmental earth sciences, 77, 47, <a href="https://doi.org/10.1007/s12665-018-7228-6" target="_blank">https://doi.org/10.1007/s12665-018-7228-6</a>, 2018.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib17"><label>Gauch(2025a)</label><mixed-citation>
      
Gauch, M.: Models and predictions for “How to deal w___ missing input data”, Zenodo [data set], <a href="https://doi.org/10.5281/zenodo.16983185" target="_blank">https://doi.org/10.5281/zenodo.16983185</a>, 2025a.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib18"><label>Gauch(2025b)</label><mixed-citation>
      
Gauch, M.: How to deal w___ missing inputs: Code (v1.0), Zenodo [code, ICE], <a href="https://doi.org/10.5281/zenodo.17362593" target="_blank">https://doi.org/10.5281/zenodo.17362593</a>, 2025b.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib19"><label>Gauch et al.(2023)Gauch, Kratzert, Gilon, Gupta, Mai, Nearing,
Tolson, Hochreiter, and Klotz</label><mixed-citation>
      
Gauch, M., Kratzert, F., Gilon, O., Gupta, H., Mai, J., Nearing, G., Tolson,
B., Hochreiter, S., and Klotz, D.: In Defense of Metrics: Metrics
Sufficiently Encode Typical Human Preferences Regarding Hydrological Model
Performance, Water Resources Research, 59, e2022WR033918, <a href="https://doi.org/10.1029/2022WR033918" target="_blank">https://doi.org/10.1029/2022WR033918</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib20"><label>Graves(2013)</label><mixed-citation>
      
Graves, A.: Generating Sequences With Recurrent Neural Networks, arXiv [preprint], <a href="https://doi.org/10.48550/arXiv.1308.0850" target="_blank">https://doi.org/10.48550/arXiv.1308.0850</a>,
2013.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib21"><label>Gupta et al.(2009)Gupta, Kling, Yilmaz, and
Martinez</label><mixed-citation>
      
Gupta, H. V., Kling, H., Yilmaz, K. K., and Martinez, G. F.: Decomposition of
the mean squared error and NSE performance criteria: Implications for
improving hydrological modelling, Journal of Hydrology, 377, 80–91, 2009.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib22"><label>He et al.(2022)He, Chen, Xie, Li, Dollár, and
Girshick</label><mixed-citation>
      
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R.: Masked
Autoencoders Are Scalable Vision Learners, in: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR),
16000–16009, <a href="https://doi.org/10.1109/CVPR52688.2022.01553" target="_blank">https://doi.org/10.1109/CVPR52688.2022.01553</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib23"><label>Kratzert(2019a)</label><mixed-citation>
      
Kratzert, F.: CAMELS Extended Maurer Forcing Data, HydroShare [data set], <a href="https://doi.org/10.4211/hs.17c896843cf940339c3c3496d0c1c077" target="_blank">https://doi.org/10.4211/hs.17c896843cf940339c3c3496d0c1c077</a>, 2019a.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib24"><label>Kratzert(2019b)</label><mixed-citation>
      
Kratzert, F.: CAMELS Extended NLDAS Forcing Data, HydroShare [data set], <a href="https://doi.org/10.4211/hs.0a68bfd7ddf642a8be9041d60f40868c" target="_blank">https://doi.org/10.4211/hs.0a68bfd7ddf642a8be9041d60f40868c</a>, 2019b.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib25"><label>Kratzert et al.(2019)Kratzert, Klotz, Shalev, Klambauer, Hochreiter,
and Nearing</label><mixed-citation>
      
Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., and Nearing, G.: Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets, Hydrol. Earth Syst. Sci., 23, 5089–5110, <a href="https://doi.org/10.5194/hess-23-5089-2019" target="_blank">https://doi.org/10.5194/hess-23-5089-2019</a>, 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib26"><label>Kratzert et al.(2021)Kratzert, Klotz, Hochreiter, and
Nearing</label><mixed-citation>
      
Kratzert, F., Klotz, D., Hochreiter, S., and Nearing, G. S.: A note on leveraging synergy in multiple meteorological data sets with deep learning for rainfall–runoff modeling, Hydrol. Earth Syst. Sci., 25, 2685–2703, <a href="https://doi.org/10.5194/hess-25-2685-2021" target="_blank">https://doi.org/10.5194/hess-25-2685-2021</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib27"><label>Kratzert et al.(2022)Kratzert, Gauch, Nearing, and
Klotz</label><mixed-citation>
      
Kratzert, F., Gauch, M., Nearing, G., and Klotz, D.: NeuralHydrology – A
Python library for Deep Learning research in hydrology, Journal of Open
Source Software, 7, 4050, <a href="https://doi.org/10.21105/joss.04050" target="_blank">https://doi.org/10.21105/joss.04050</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib28"><label>Kratzert et al.(2023)Kratzert, Nearing, Addor, Erickson, Gauch,
Gilon, Gudmundsson, Hassidim, Klotz, Nevo et al.</label><mixed-citation>
      
Kratzert, F., Nearing, G., Addor, N., Erickson, T., Gauch, M., Gilon, O., Gudmundsson, L., Hassidim, A., Klotz, D., Nevo, S., Shalev, G., and Matias, Y.: Caravan – A global community dataset for large-sample hydrology, Scientific Data, 10, 61, <a href="https://doi.org/10.1038/s41597-023-01975-w" target="_blank">https://doi.org/10.1038/s41597-023-01975-w</a>,
2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib29"><label>Kratzert et al.(2024)Kratzert, Gauch, Klotz, and
Nearing</label><mixed-citation>
      
Kratzert, F., Gauch, M., Klotz, D., and Nearing, G.: HESS Opinions: Never train a Long Short-Term Memory (LSTM) network on a single basin, Hydrol. Earth Syst. Sci., 28, 4187–4201, <a href="https://doi.org/10.5194/hess-28-4187-2024" target="_blank">https://doi.org/10.5194/hess-28-4187-2024</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib30"><label>Maurer et al.(2002)Maurer, Wood, Adam, Lettenmaier, and
Nijssen</label><mixed-citation>
      
Maurer, E. P., Wood, A. W., Adam, J. C., Lettenmaier, D. P., and Nijssen, B.: A
Long-Term Hydrologically Based Dataset of Land Surface Fluxes and States for
the Conterminous United States, Journal of Climate, 15, 3237–3251, 2002.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib31"><label>Mitchell and Jolley(1988)</label><mixed-citation>
      
Mitchell, M. and Jolley, J.: Research design explained, Holt, Rinehart &amp;
Winston Inc., ISBN: 0030040248, <a href="https://psycnet.apa.org/record/1987-98845-000" target="_blank">https://psycnet.apa.org/record/1987-98845-000</a> (last access: 10 November 2025), 1988.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib32"><label>Nash and Sutcliffe(1970)</label><mixed-citation>
      
Nash, J. E. and Sutcliffe, J. V.: River flow forecasting through conceptual
models part I – A discussion of principles, Journal of Hydrology, 10,
282–290, 1970.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib33"><label>Nearing et al.(2024)Nearing, Cohen, Dube, Gauch, Gilon, Harrigan,
Hassidim, Klotz, Kratzert, Metzger, Nevo, Pappenberger, Prudhomme, Shalev,
Shenzis, Tekalign, Weitzner, and Matias</label><mixed-citation>
      
Nearing, G., Cohen, D., Dube, V., Gauch, M., Gilon, O., Harrigan, S., Hassidim,
A., Klotz, D., Kratzert, F., Metzger, A., Nevo, S., Pappenberger, F.,
Prudhomme, C., Shalev, G., Shenzis, S., Tekalign, T. Y., Weitzner, D., and
Matias, Y.: Global prediction of extreme floods in ungauged watersheds,
Nature, 627, 559–563, <a href="https://doi.org/10.1038/s41586-024-07145-1" target="_blank">https://doi.org/10.1038/s41586-024-07145-1</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib34"><label>Newman et al.(2015)Newman, Clark, Sampson, Wood, Hay, Bock, Viger,
Blodgett, Brekke, Arnold et al.</label><mixed-citation>
      
Newman, A. J., Clark, M. P., Sampson, K., Wood, A., Hay, L. E., Bock, A., Viger, R. J., Blodgett, D., Brekke, L., Arnold, J. R., Hopson, T., and Duan, Q.: Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: data set characteristics and assessment of regional variability in hydrologic model performance, Hydrol. Earth Syst. Sci., 19, 209–223, <a href="https://doi.org/10.5194/hess-19-209-2015" target="_blank">https://doi.org/10.5194/hess-19-209-2015</a>, 2015.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib35"><label>Perez et al.(2018)Perez, Strub, de Vries, Dumoulin, and
Courville</label><mixed-citation>
      
Perez, E., Strub, F., de Vries, H., Dumoulin, V., and Courville, A.: FiLM:
Visual Reasoning with a General Conditioning Layer, Proceedings of the AAAI
Conference on Artificial Intelligence, 32, 3942–3951, <a href="https://doi.org/10.1609/aaai.v32i1.11671" target="_blank">https://doi.org/10.1609/aaai.v32i1.11671</a>, 2018.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib36"><label>Raffel et al.(2020)Raffel, Shazeer, Roberts, Lee, Narang, Matena,
Zhou, Li, and Liu</label><mixed-citation>
      
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou,
Y., Li, W., and Liu, P. J.: Exploring the Limits of Transfer Learning with a
Unified Text-to-Text Transformer, Journal of Machine Learning Research, 21,
1–67, 2020.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib37"><label>Ramsauer et al.(2021)Ramsauer, Schäfl, Lehner, Seidl, Widrich,
Gruber, Holzleitner, Adler, Kreil, Kopp, Klambauer, Brandstetter, and
Hochreiter</label><mixed-citation>
      
Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Gruber, L.,
Holzleitner, M., Adler, T., Kreil, D., Kopp, M. K., Klambauer, G.,
Brandstetter, J., and Hochreiter, S.: Hopfield Networks is All You Need, in:
9th International Conference on Learning Representations (ICLR), openreview.net, <a href="https://openreview.net/forum?id=tL89RnzIiCd" target="_blank">https://openreview.net/forum?id=tL89RnzIiCd</a> (last access: 10 November 2025), 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib38"><label>Rasiya Koya and Roy(2024)</label><mixed-citation>
      
Rasiya Koya, S. and Roy, T.: Temporal Fusion Transformers for streamflow
Prediction: Value of combining attention with recurrence, Journal of
Hydrology, 637, 131301, <a href="https://doi.org/10.1016/j.jhydrol.2024.131301" target="_blank">https://doi.org/10.1016/j.jhydrol.2024.131301</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib39"><label>Read et al.(2021)Read, Sampson, Lambl, Butcher, Gulland, and
Elkurdy</label><mixed-citation>
      
Read, L., Sampson, A. K., Lambl, D., Butcher, P., Gulland, L., and Elkurdy, M.:
Lessons learned applying a machine learning hydrologic forecast model in a
live forecasting competition, in: AGU Fall Meeting Abstracts, Vol. 2021,
H22A–07, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib40"><label>Rohrer(2021)</label><mixed-citation>
      
Rohrer, B.: Transformers from Scratch,
<a href="https://e2eml.school/transformers.html" target="_blank">https://e2eml.school/transformers.html</a> (last access: 10 November 2025), 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib41"><label>Schafer(1997)</label><mixed-citation>
      
Schafer, J. L.: Analysis of incomplete multivariate data, CRC Press, <a href="http://dx.doi.org/10.1201/9781439821862" target="_blank">http://dx.doi.org/10.1201/9781439821862</a>, 1997.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib42"><label>Shalev and Kratzert(2024)</label><mixed-citation>
      
Shalev, G. and Kratzert, F.: Caravan MultiMet: Extending Caravan with
Multiple Weather Nowcasts and Forecasts, arXiv [preprint], <a href="https://doi.org/10.48550/arXiv.2411.09459" target="_blank">https://doi.org/10.48550/arXiv.2411.09459</a>,
2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib43"><label>Srivastava et al.(2014)Srivastava, Hinton, Krizhevsky, Sutskever, and
Salakhutdinov</label><mixed-citation>
      
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov,
R.: Dropout: a simple way to prevent neural networks from overfitting, Journal of
Machine Learning Research, 15, 1929–1958, 2014.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib44"><label>Thornton et al.(1997)Thornton, Running, and
White</label><mixed-citation>
      
Thornton, P. E., Running, S. W., and White, M. A.: Generating surfaces of daily
meteorological variables over large regions of complex terrain, Journal of
Hydrology, 190, 214–251, 1997.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib45"><label>Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones,
Gomez, Kaiser, and Polosukhin</label><mixed-citation>
      
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N.,
Kaiser, Ł., and Polosukhin, I.: Attention is All you Need, in: Advances in
Neural Information Processing Systems, vol. 30, Curran Associates, <a href="https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html" target="_blank">https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html</a> (last access: 10 November 2025), 2017.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib46"><label>Wells(1960)</label><mixed-citation>
      
Wells, J.: Compilation of records of surface waters of the United States
through September 1950: Part 1-B. North Atlantic slope basins, New York
to York River, Tech. rep., US Geological Survey, <a href="https://doi.org/10.3133/wsp1302" target="_blank">https://doi.org/10.3133/wsp1302</a>, 1960.


    </mixed-citation></ref-html>
<ref-html id="bib1.bib47"><label>Wu et al.(2020)Wu, Zhang, Ilyas, and Rekatsinas</label><mixed-citation>
      
Wu, R., Zhang, A., Ilyas, I., and Rekatsinas, T.: Attention-based Learning for
Missing Data Imputation in HoloClean, Proceedings of Machine Learning
and Systems, 2, 307–325, 2020.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib48"><label>Xia et al.(2012)Xia, Mitchell, Ek, Sheffield, Cosgrove, Wood, Luo,
Alonge, Wei, Meng, Livneh, Lettenmaier, Koren, Duan, Mo, Fan, and
Mocko</label><mixed-citation>
      
Xia, Y., Mitchell, K., Ek, M., Sheffield, J., Cosgrove, B., Wood, E., Luo, L.,
Alonge, C., Wei, H., Meng, J., Livneh, B., Lettenmaier, D., Koren, V., Duan,
Q., Mo, K., Fan, Y., and Mocko, D.: Continental-scale water and energy flux
analysis and validation for the North American Land Data Assimilation
System project phase 2 (NLDAS-2): 1. Intercomparison and application of
model products, Journal of Geophysical Research: Atmospheres, 117, <a href="https://doi.org/10.1029/2011JD016048" target="_blank">https://doi.org/10.1029/2011JD016048</a>, 2012.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib49"><label>Yozgatligil et al.(2013)Yozgatligil, Aslan, Iyigun, and
Batmaz</label><mixed-citation>
      
Yozgatligil, C., Aslan, S., Iyigun, C., and Batmaz, I.: Comparison of missing
value imputation methods in time series: the case of Turkish meteorological
data, Theoretical and Applied Climatology, 112, 143–167, 2013.

    </mixed-citation></ref-html>--></article>
