In socio-hydrology, human–water interactions are simulated by mathematical models. Although integrating these socio-hydrological models with observation data is necessary for improving the understanding of human–water interactions, the methodological development of model–data integration in socio-hydrology is in its infancy. Here we propose applying sequential data assimilation, which has been widely used in geoscience, to a socio-hydrological model. We developed a particle filter for a widely adopted flood risk model and performed an idealized observation system simulation experiment and a real data experiment to demonstrate the potential of sequential data assimilation in socio-hydrology. In these experiments, the flood risk model's parameters, the input forcing data, and empirical social data were assumed to be somewhat imperfect. We tested whether data assimilation can contribute to accurately reconstructing historical human–flood interactions by integrating these imperfect models and imperfect, sparsely distributed data. Our results highlight that it is important to sequentially constrain both state variables and parameters when the input forcing is uncertain. Our proposed method can accurately estimate the model's unknown parameters, even if the true model parameters vary over time. Even a small amount of empirical data can significantly improve the simulation skill of the flood risk model. Therefore, sequential data assimilation is useful for reconstructing historical socio-hydrological processes through the synergistic effect of models and data.

Socio-hydrology is an emerging research field in which two-way feedback between social and water systems is investigated (Sivapalan et al., 2012, 2014). Understanding complex socio-hydrological phenomena contributes to solving water crises around the world. Socio-hydrology has been recognized as an important scientific grand challenge in meeting the United Nations' Sustainable Development Goals (Di Baldassarre et al., 2019).

The most popular approach to socio-hydrology is developing dynamic models which compute nonlinear interactions between humans and water. For instance, Di Baldassarre et al. (2013) developed a simplified model, which described human–flood interactions, to understand the levee effect in which high levees generate a false sense of security and induce social vulnerabilities to severe floods in communities (see also Viglione et al., 2014; Ciullo et al., 2017). Van Emmerik et al. (2014) developed a stylized model, which described two-way feedback between the environment and economic activities, to understand the historical competition for water between agricultural development and environmental health in Australia (see also Roobavannan et al., 2017). Pande and Savenije (2016) modeled economic activities of smallholder farmers to analyze the agrarian crisis in Marathwada, India. While the socio-hydrological models described above assumed the existence of a single lumped decision maker, Yu et al. (2017) incorporated collective action into their model and analyzed the dynamics of community-managed flood protection systems in coastal Bangladesh. Please refer to Di Baldassarre et al. (2019) for a comprehensive review of socio-hydrological modeling.

In addition to these modeling approaches, both qualitative and quantitative data related to socio-hydrological processes are important for understanding human–water interactions. For instance, Mostert (2018) revealed historical changes in river management, from water resources development to protection and restoration, by analyzing qualitative data. Dang and Konar (2018) applied econometric methods to analyze quantitative data in both human and water domains and quantified the causal relationship between trade openness and water use. Kreibich et al. (2017) performed a detailed case study analysis on paired floods, i.e., consecutive flood events which occurred in the same region, with the second flood causing significantly lower damage. They found that the reduction in vulnerability played a key role in the successful adaptation to the second flood.

Although integrating models and data is expected to contribute to an accurate understanding of socio-hydrological processes (Mount et al., 2016), the methodological development of model–data integration in socio-hydrology is in its infancy. Generally, mathematical models can provide spatiotemporally continuous state variables and quantitative scenarios for future socio-hydrological developments. In addition, mathematical models can quantitatively provide possible scenarios unrealized in the real world, which gives insight into the targeted processes (e.g., Viglione et al., 2014). The major limitation of socio-hydrological models is that they are often inaccurate due to the uncertainty in their input forcing, parameters, and descriptions of the processes. On the other hand, hydrological and social data are often more reliable than numerical models and can provide a more complete understanding of the socio-hydrological processes (e.g., Mostert, 2018), although data also have uncertainties. However, in many cases, relevant data in socio-hydrology are so sparsely distributed that it is difficult to completely reconstruct the historical socio-hydrological processes from data alone. The other limitation of the data-driven approach is that causal relationships cannot easily be quantified from empirical data alone (e.g., Dang and Konar, 2018). Considering the advantages and disadvantages of models and data, previous studies used social statistics to calibrate and validate their socio-hydrological models (e.g., Barendrecht et al., 2019; Roobavannan et al., 2017; Ciullo et al., 2017; van Emmerik et al., 2014; Gonzales and Ajami, 2017).

In geosciences, sequential data assimilation has been widely used for the model–data integration. Data assimilation sequentially adjusts the predicted state variables and parameters of dynamic models by integrating observation data into models based on Bayes' theorem. Data assimilation has been widely applied to numerical weather prediction (e.g., Miyoshi and Yamane, 2007; Bauer et al., 2015; Poterjoy et al., 2019; Sawada et al., 2019), atmospheric reanalysis (e.g., Kobayashi et al., 2015; Hersbach et al., 2019), and hydrology and land surface modeling (e.g., Moradkhani et al., 2005; Sawada et al., 2015; Rasmussen et al., 2015; Lievens et al., 2017). The applicability of the data assimilation approach to socio-hydrological models has yet to be investigated.

In this study, we aim to develop the methodology of sequential data assimilation for the flood risk model proposed by Di Baldassarre et al. (2013). From a series of idealized experiments and a real data experiment in the city of Rome, we demonstrate the potential of data assimilation to accurately reconstruct the historical human–flood interactions. We focus on the case in which the socio-hydrological model's parameters, input forcing data, and social data are somewhat inaccurate.

In this study, we used a socio-hydrological flood risk model proposed by Di Baldassarre et al. (2013). This model conceptualizes human–flood interactions by a set of simple equations which describe the states of flood, economy, technology, politics, and society. Based on this original model of Di Baldassarre et al. (2013), many similar flood risk models have been proposed, validated, and applied (e.g., Viglione et al., 2014; Ciullo et al., 2017; Barendrecht et al., 2019). Here we briefly describe this model. Please refer to Di Baldassarre et al. (2013) for a complete description of this model.

The governing equations of the flood risk model are shown as follows:

Equation (1) calculates the intensity of the flooding events

Parameters of the flood risk model.

In this study, we used a sampling importance resampling particle filtering (SIRPF) algorithm as a method of data assimilation. The SIRPF algorithm has been widely used in hydrological data assimilation (e.g., Moradkhani et al., 2005; Qin et al., 2009; Sawada et al., 2015). Compared with other data assimilation algorithms, such as the ensemble Kalman filter, SIRPF is robust against model nonlinearity and the associated non-Gaussian error distributions. The disadvantage of SIRPF is that it requires prohibitive computational resources when the numerical model is computationally expensive, which is not the case for the flood risk model.

The flood risk model can be formulated as a discrete state–space dynamic system as follows:

The SIRPF algorithm is a Monte Carlo approximation of a Bayesian update of the state
variables and parameters as follows:

The posterior probability of the state variables and parameters can be
approximated as follows:
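For orientation, the Bayesian update and its particle approximation can be written in generic notation. This is a sketch under stated assumptions: the symbols below (the augmented vector of state variables and parameters $\mathbf{x}_t$, the observations $\mathbf{y}_t$, the normalized weights $w_t^{i}$, and the ensemble size $N$) are chosen for illustration and are not necessarily those of the numbered equations referred to in this paper. The weight expression assumes that all ensemble members carry equal weight before the update, as is the case after resampling.

```latex
p(\mathbf{x}_t \mid \mathbf{y}_{1:t}) \propto
  p(\mathbf{y}_t \mid \mathbf{x}_t)\, p(\mathbf{x}_t \mid \mathbf{y}_{1:t-1}),
\qquad
p(\mathbf{x}_t \mid \mathbf{y}_{1:t}) \approx
  \sum_{i=1}^{N} w_t^{i}\, \delta\!\left(\mathbf{x}_t - \mathbf{x}_t^{i}\right),
\qquad
w_t^{i} = \frac{p(\mathbf{y}_t \mid \mathbf{x}_t^{i})}
               {\sum_{j=1}^{N} p(\mathbf{y}_t \mid \mathbf{x}_t^{j})}.
```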

The implementation of SIRPF is as follows:

Updating the model state variables from time

Calculating the simulated observations for all ensembles (Eq. 9).

Calculating the likelihood for each ensemble member (Eq. 11).

Obtaining the weights for all ensembles (Eq. 15).

Applying a resampling procedure according to the normalized weights. The normalized weights of ensemble

Adding the perturbation to the ensembles of parameters (Moradkhani et al., 2005), since there are no mechanisms to increase the variance of parameters of ensemble members, as follows:
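The six steps listed above can be sketched in Python as a single assimilation cycle. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function name `sirpf_step` and its arguments are hypothetical, the likelihood is assumed to be Gaussian with independent errors, resampling is multinomial, and the parameter perturbation is a fixed-width Gaussian jitter standing in for the perturbation formula of this paper, which is not reproduced here.

```python
import numpy as np

def sirpf_step(states, params, forecast, observe, y_obs, obs_std, jitter_std, rng):
    """One cycle of a sampling importance resampling particle filter (sketch).

    states   : (N, n_state) array of ensemble state vectors
    params   : (N, n_param) array of ensemble parameter vectors
    forecast : callable(states, params) -> states at the observation time
    observe  : callable(states) -> (N, n_obs) simulated observations
    y_obs    : (n_obs,) observation vector
    obs_std  : observation error standard deviation (scalar or (n_obs,))
    """
    n = states.shape[0]
    # Step 1: advance every ensemble member to the observation time.
    states = forecast(states, params)
    # Step 2: compute simulated observations for all members.
    y_sim = observe(states)
    # Step 3: Gaussian log-likelihood of each member (assumed error model).
    log_lik = -0.5 * np.sum(((y_obs - y_sim) / obs_std) ** 2, axis=1)
    # Step 4: normalized weights; subtracting the maximum avoids underflow.
    weights = np.exp(log_lik - log_lik.max())
    weights /= weights.sum()
    # Step 5: multinomial resampling according to the normalized weights.
    idx = rng.choice(n, size=n, replace=True, p=weights)
    states, params = states[idx], params[idx]
    # Step 6: jitter the parameters to restore their ensemble spread
    # (assumed fixed-width Gaussian, not the formula of this paper).
    params = params + rng.normal(0.0, jitter_std, size=params.shape)
    return states, params, weights
```

In use, `forecast` would wrap the flood risk model and `observe` would select the observed variables; calling `sirpf_step` at each observation time yields the sequentially updated ensemble.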

In this study, we performed three observation system simulation experiments (OSSEs). In each OSSE, we generated the synthetic truth of the state and flux variables by driving the flood risk model with specified parameters and input. Then, we generated synthetic observations by adding noise to this synthetic truth. These synthetic observations were assimilated into the model by SIRPF. The performance of SIRPF was evaluated by comparing the state variables estimated by SIRPF with the synthetic truth. The model parameters used to generate the synthetic truth can be found in Table 1. They are identical to those of Di Baldassarre et al. (2013). The OSSE has been recognized as an important preliminary step for verifying newly developed data assimilation systems (e.g., Moradkhani et al., 2005; Vrugt et al., 2013; Penny and Miyoshi, 2016; Sawada et al., 2018).
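The observation-generation step of an OSSE can be illustrated with a short Python sketch. It assumes Gaussian noise whose standard deviation is a fixed fraction of the synthetic truth and observations taken at a regular interval; the function name `make_synthetic_observations` and its default values (a 10-step interval and a 10 % error) are illustrative choices rather than a description of the code used in this study.

```python
import numpy as np

def make_synthetic_observations(truth, obs_interval=10, rel_error=0.10, seed=0):
    """Sample noisy observations from a synthetic-truth time series (sketch).

    truth        : 1-D array with one value per model time step
    obs_interval : number of time steps between observations
    rel_error    : observation error std as a fraction of the true value
    Returns the observation time indices and the noisy observations.
    """
    rng = np.random.default_rng(seed)
    truth = np.asarray(truth, dtype=float)
    # Observation times: every obs_interval steps after the initial time.
    times = np.arange(obs_interval, truth.size, obs_interval)
    # Gaussian white noise scaled by the magnitude of the truth.
    noise_std = rel_error * np.abs(truth[times])
    obs = truth[times] + rng.normal(0.0, noise_std)
    return times, obs
```

Changing `obs_interval` to 50 or 100 mimics the sparser observation networks examined in the sensitivity experiments.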

The high water level for the synthetic truth was generated by the following:

Synthetic observations were generated by adding the Gaussian white noise to
the

We used the ensemble mean of root mean square errors (mRMSEs) as an
evaluation metric as follows:
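As an illustration, an ensemble mean of RMSEs can be computed as sketched below. The sketch assumes that the RMSE of each ensemble member is taken over time against the synthetic truth and that these RMSEs are then averaged over the members; the function name `mrmse` and the array layout are hypothetical.

```python
import numpy as np

def mrmse(ensemble, truth):
    """Ensemble mean of per-member RMSEs against the truth (sketch).

    ensemble : (N, T) array, N ensemble members by T time steps
    truth    : (T,) array of synthetic-truth values
    """
    ensemble = np.asarray(ensemble, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # RMSE over time for each ensemble member.
    rmse_per_member = np.sqrt(np.mean((ensemble - truth) ** 2, axis=1))
    # Average the member-wise RMSEs over the ensemble.
    return float(rmse_per_member.mean())
```

Dividing the value from a run without data assimilation by the value from a data assimilation run gives a ratio that exceeds 1 when assimilation reduces the error.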

In the first OSSE, we assumed that there is no uncertainty in the model
parameters. We used the same parameter values as in the synthetic truth run,
and we did not estimate the parameters. Our SIRPF algorithm updated only the state variables. Although the model itself had no uncertainty, the input data, i.e., the time series of the high water level, were assumed to be uncertain.
Lognormal multiplicative noise was added to the synthetic true high water
level so that different ensemble members have different high water levels in
the data assimilation experiment. The two parameters of the lognormal
distribution, commonly called

In the second OSSE, we assumed that some of the synthetic true parameter
values were unknown. The unknown parameters in experiment 2 were the
cost of levee raising

To further demonstrate the potential of sequential data assimilation in
socio-hydrology, we assumed that the description of the model was biased in
experiment 3. Here we assumed that two of the model parameters were
temporally varied by the unknown dynamics. Specifically, the rate at which
new properties can be built,

In addition to the OSSEs, we performed the real-world experiment in the city of Rome, Italy. Ciullo et al. (2017) collected real-world data and calibrated their flood risk model. Using the data collected by Ciullo et al. (2017), we performed the data assimilation experiment. It should be noted that the flood risk model of Ciullo et al. (2017) is different from our model (i.e., Di Baldassarre et al., 2013), although they are conceptually similar.

All the data were collected from Fig. 1 of Ciullo et al. (2017) by
WebPlotDigitizer (

We added lognormal multiplicative noise to the observed high water level as we did in the OSSEs. The observation errors of levee height and population were set to 10 % and 25 % of the observed values, respectively. Since Ciullo et al. (2017) showed a large uncertainty in the estimation of the theoretical maximum population (see above), it is reasonable to assume that the estimation of the population values also has a relatively large uncertainty.
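The perturbation of the input forcing and the assignment of observation errors described here can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the noise is drawn independently for each ensemble member and time step, and the lognormal parameters `mu` and `sigma` are placeholders rather than the values used in this study.

```python
import numpy as np

def perturb_forcing(high_water, n_members, mu=0.0, sigma=0.1, seed=0):
    """Return an (n_members, T) ensemble of perturbed high water levels (sketch).

    mu and sigma are the mean and standard deviation of the underlying
    normal distribution; the defaults are placeholders, not the paper's values.
    """
    rng = np.random.default_rng(seed)
    high_water = np.asarray(high_water, dtype=float)
    # Strictly positive multiplicative factors, one per member and time step.
    factors = rng.lognormal(mean=mu, sigma=sigma, size=(n_members, high_water.size))
    return high_water * factors

def observation_error_std(observed, fraction):
    """Observation error std as a fixed fraction of the observed values."""
    return fraction * np.abs(np.asarray(observed, dtype=float))
```

With this sketch, `observation_error_std(levee_height, 0.10)` and `observation_error_std(population, 0.25)` would correspond to the 10 % and 25 % settings, where `levee_height` and `population` stand for the digitized observation series.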

As in the second and third OSSEs, we have four unknown parameters in this
real-world experiment. We used the same settings of the parameters as for the OSSEs,
which are shown in Table 1, except for

The initial conditions of

Figure 1 shows the time series of the model variables calculated by 5000
ensembles with no data assimilation. Although the ensemble mean of the state
variables is close to the synthetic truth, the ensembles have a large
spread, especially for

Time series of

Figure 2 indicates that this uncertainty is mitigated by assimilating the
observations of

Time series of

RMSE of the no data assimilation (NoDA) experiment and the data assimilation (DA) experiment in which all observations are assimilated every 10 years, with 5000 ensembles, in experiment 1 (see Sect. 3.1).

While we can observe all of

The ratio of RMSEs of the no data assimilation (NoDA) experiment to those of the data assimilation (DA) experiments in which all of observations (

While observations were assimilated every 10 years in Fig. 2 and Table 2, Fig. 4 shows the sensitivity of the performance of our SIRPF algorithm to the observation interval. Our SIRPF algorithm improves the estimation of the state variables even when an observation is available only once every 50 or 100 years (see also Fig. S1 in the Supplement for the time series of the model's variables), which is promising since we cannot expect frequent observations in real-world applications.

The ratio of the RMSEs of the no data assimilation (NoDA) experiment to those of the data assimilation (DA) experiments in which all of observations (

We have set the observation error to 10 % of the synthetic truth thus far. The improvement of the simulation skill can be found with larger observation errors (Fig. S2). Although the SIRPF algorithm's performance gradually declines as the observation error increases, our SIRPF algorithm can significantly improve the simulation skill with a 25 % observation error.

Although we have demonstrated the potential of our SIRPF algorithm with 5000 ensemble members thus far, the simulation skill also improves with much smaller ensemble sizes. The performance of our SIRPF algorithm with 20 ensemble members is similar to that with 5000 (Fig. S3).

Time series of

Figure 5 reveals that the flood risk model completely loses its ability to estimate the human–flood interactions if there are uncertainties in model parameters and high water levels, as described in Sect. 3. In contrast to experiment 1, the ensemble mean cannot accurately reproduce the synthetic truth.

Figure 6 indicates that our SIRPF algorithm can accurately estimate the model state
variables by assimilating the observations of

Time series of

Time series of

RMSE of the no data assimilation (NoDA) experiment and the data assimilation (DA) experiment in which all observations are assimilated every 10 years, with 5000 ensembles, in experiment 2 (see Sect. 3.2).

We analyzed the impacts of the individual observation types on the
simulation skill as we did in experiment 1. Figure 8a shows that the
effects of the individual observation types are similar to what we found in
experiment 1, as follows: (1) improving the ability to simulate
unobservable state variables is possible with our SIRPF algorithm, (2) observing

The ratio of the RMSEs of the no data assimilation (NoDA) experiment to those of the data assimilation (DA) experiments in which all of observations (

The good performance of our SIRPF algorithm can be found with the longer observation intervals, as we found in experiment 1. Figure 9 indicates that our SIRPF algorithm can improve the estimation of the state variables and parameters when we can obtain observations once in 50 or 100 years (see also Figs. S4 and S5 for the time series of the model's variables).

The ratio of the RMSEs of the no data assimilation (NoDA) experiment to those of the data assimilation (DA) experiments in which all of observations (

As we found in experiment 1, the SIRPF algorithm's performance declines with increased observation errors (Fig. S6). However, it is promising that our SIRPF algorithm can improve the simulation skill with larger observation errors of up to 25 % of the synthetic truth, considering that the observations in the socio-hydrological domain are often inaccurate.

In contrast to experiment 1, a larger ensemble size is required to stably estimate both state variables and parameters (Fig. S7). The increased degrees of freedom and the nonlinear relationship between parameters and observations increase the necessary ensemble size.

In addition to experiment 2, two of the unknown parameters (

Time series of

RMSE of the no data assimilation (NoDA) experiment and the data assimilation (DA) experiment in which all observations are assimilated every 10 years, with 5000 ensembles, in experiment 3 (see Sect. 3.3).

Time series of

Figure 12 shows the time series of the model variables calculated by 5000 ensembles with no data assimilation. The 5000-member simulation reveals two bifurcated trajectories of the social system. In one, a high levee is built and stable economic growth is maintained. In the other, no levee is built and the economy is repeatedly damaged by severe floods (the ensemble mean shown in Fig. 12b implies that many ensemble members have a zero levee height).

Time series of

In reality, the city of Rome constructed the levee in response to the severe
flood that occurred on 28 December 1870. After the construction of this levee, no
major flood losses occurred, allowing steady and undisturbed growth.
Figure 13 indicates that our SIRPF algorithm successfully constrains the trajectory of
the ensemble simulation to the real world (i.e., high levee and stable
economic growth) by assimilating the real data of

Time series of

We analyzed the impacts of the individual observation types (i.e.,

Same as Fig. 13 but only real data of

Same as Fig. 13 but only real data of

On the other hand, assimilating only levee height data cannot provide similar results to those shown above. Figure 15 shows the time series of the
model variables from the data assimilation experiment in which we assimilated
the observation data of

In this study, we developed a sequential data assimilation system for a widely adopted socio-hydrological model, i.e., the flood risk model by Di Baldassarre et al. (2013). We demonstrated that our SIRPF algorithm for the flood risk model is useful for reconstructing historical human–flood interactions, which can be called socio-hydrological reanalysis, by integrating sparsely distributed observations and imperfect numerical simulations. In atmospheric science, atmospheric reanalysis has been intensively analyzed to understand complex feedback in the atmosphere, which cannot be done by analyzing observation data alone due to their sparsity. Socio-hydrological reanalysis can serve as a reliable and spatiotemporally homogeneous data set and may help deepen the understanding of human–water interactions. In addition, socio-hydrological reanalysis can be used as an initial condition for predicting future changes in socio-hydrological processes, just as atmospheric scientists predict the future weather and/or climate using atmospheric reanalysis. Since it is impossible to directly observe all state variables and parameters as initial conditions, socio-hydrological reanalysis is crucially important for accurate prediction. Socio-hydrological data assimilation has a high potential to improve the understanding of the complex feedback between social and flood systems and to predict their future. Our idealized OSSE and real data experiments reveal several important findings.

First, the sequential data assimilation can mitigate the negative impact of the uncertainty in the input forcing on the simulation of socio-hydrological state variables. We found that the small perturbation of high water levels greatly affects the long-term trajectory of the socio-hydrological state variables, as Viglione et al. (2014) also found. It is necessary to sequentially constrain the state variables and parameters by sequential data assimilation if the input forcing is uncertain, although previous studies on the model–data integration in socio-hydrology mainly focused on parameter calibration and assumed no uncertainty in the input forcing (e.g., Barendrecht et al., 2019; Roobavannan et al., 2017; Ciullo et al., 2017; van Emmerik et al., 2014; Gonzales and Ajami, 2017). To deeply understand the socio-hydrological processes, long-term historical analysis should be performed. Although there are many studies on the accurate reconstruction of the historical weather conditions (e.g., Toride et al., 2017), it may be necessary to tackle the uncertainty in hydrometeorological data sets used for the input forcing of the socio-hydrological models.

Second, our SIRPF algorithm can efficiently improve the simulation of the
socio-hydrological state variables using sparsely distributed data. Not all
model variables need to be observed to constrain the model's
state variables and parameters. In some cases, observations of a single
state variable are enough to accurately reconstruct the socio-hydrological
state. In addition, observation intervals can be longer than 10 years. Since
it is difficult to obtain large volumes of data in socio-hydrology, this
finding is promising. We also give some insight into the informative
observation types in the flood risk model. With uncertain high water levels,
observations of the intensity of flooding events

Third, our SIRPF algorithm is robust to the imperfection of the socio-hydrological model. The unknown parameters can be efficiently estimated by the sequential data assimilation. While previous studies calibrated their socio-hydrological models by evaluating the trajectory over the whole study period, iteratively performing the long-term model integration (e.g., Barendrecht et al., 2019; Roobavannan et al., 2017; Ciullo et al., 2017; van Emmerik et al., 2014; Gonzales and Ajami, 2017), we sequentially optimized the parameters based on relatively short-term time series, thus allowing the parameters to vary over the study period. The advantage of this strategy is that we can deal with time-variant parameters, as previously demonstrated in applications of hydrological models (e.g., Pathiraja et al., 2018). In model development, parameters are formulated as time-invariant values, so the existence of time-variant parameters indicates an imperfect description of the dynamic model. Sequential data assimilation can mitigate the negative impact of this imperfect model description. Vrugt et al. (2013) pointed out that parameter optimization by sequential filters is unstable if the parameter sensitivity changes over time (e.g., parameters affect the model's dynamics differently in different seasons), which may be a potential limitation of our strategy compared with Bayesian inference based on the long-term trajectory, as given by Barendrecht et al. (2019).

A major limitation of this study is that we assume the modeled state variables can be observed directly, although it is difficult to directly observe the state variables of socio-hydrological models. For example, it is impossible to directly observe the social awareness of flood risk in the flood risk model, and several previous studies obtained a proxy of the social memory from interview data (Barendrecht et al., 2019) and the number of Google searches (Gonzales and Ajami, 2017). When these indirect observations are assimilated into a model, the (nonlinear) observation operator (see Eq. 9), the assignment of the observation error, and the assimilation methods should be carefully designed, as previously discussed in the context of numerical weather prediction (e.g., Sawada et al., 2019; Okamoto et al., 2019; Minamide and Zhang, 2017). Future work will focus on methodological development to efficiently assimilate observations in the social domain that have a complicated structure of observation operators and errors.

Code and data are available upon request from the corresponding author.

The supplement related to this article is available online at:

YS designed the study. RH and YS jointly developed the data assimilation system for the flood risk model and performed the numerical experiments. YS and RH contributed to the interpretation of the results. YS wrote the first draft of the paper, and RH contributed to the editing of the paper.

The authors declare that they have no conflict of interest.

We thank Giuliano Di Baldassarre for sharing the original source code of the flood risk model. We thank the two anonymous referees for their constructive comments. The Data Integration and Analysis System (DIAS) provided us with the computational resources.

This paper was edited by Giuliano Di Baldassarre and reviewed by two anonymous referees.