Combining split-sample testing and hidden Markov modelling to assess the robustness of hydrological models

The impacts of climate and land-use changes make the stationary assumption in hydrology obsolete. Moreover, there is still considerable uncertainty regarding the future evolution of the Earth’s climate and the extent of the alteration of flow regimes. Climate change impact assessment in the water sector typically involves a modelling chain in which a hydrological model is needed to generate hydrologic projections from climate forcings. Considering the inherent uncertainty of the future climate, it is crucial to assess the performance of the hydrologic model over a wide range of climates and their corresponding hydrologic conditions. In this paper, numerous, contrasted, climate sequences identified by a hidden Markov model (HMM) are used in a differential split-sample testing framework to assess the robustness of a hydrologic model. The differential split-sample test based on a HMM classification is implemented on the time series of monthly river discharges in the upper Senegal River basin in West Africa, a region characterized by the presence of low-frequency climate signals. A comparison with the results obtained using classical rupture tests shows that the diversity of hydrologic sequences identified using the HMM can help with assessing the robustness of the hydrologic model.


Introduction
According to some authors, humanity has entered a new geological epoch, the Anthropocene, characterized by rapid environmental changes due to human activities (Falkenmark et al., 2019). Among those activities, the massive release of carbon dioxide since the industrial revolution is expected to lead to global warming, which in turn will affect the hy-drological cycle (Gleeson et al., 2020). In the past, water engineers were able to design and operate water infrastructure based on the assumption that the climate was stationary and hence that time series of recorded hydrologic variables such as precipitation and river discharge were representative of future hydrologic conditions (Bernier, 1977;Payrastre, 2003;Naghettini, 2017). Now that the climate is changing, this assumption of stationarity is considered obsolete or even "dead" according to Milly et al. (2008). To deal with this issue, water planners and managers have devoted significant efforts to the development of new decision analytic frameworks that explicitly capture the uncertainties attached to climate change and its impacts on water resources (Brown and Wilby, 2012;Prudhomme et al., 2010).
There are essentially two categories of decision analytic frameworks: top-down versus bottom-up. The first relies on the sequential coupling of models: general circulation models (GCMs) are run to project future precipitation and temperatures which are then downscaled and used as inputs to hydrologic models whose outputs are then processed by water systems models (Peel and Blöschl, 2011). This is consistent with the traditional "predict-then-act" decision-making paradigm (Weaver et al., 2013). The second category rather seeks to identify robust solutions, i.e. solutions that will perform relatively well across a wide range of hydrologic conditions (Lempert et al., 2006). In terms of decision-making paradigm, the idea here is to "minimize regret".
Despite their differences, both frameworks rely at some point on a hydrological model to transform the climate forcings into streamflows. The hydrological model can be stochastic (Borgomeo et al., 2014;Poff et al., 2016), distributed, or conceptual (Fortin et al., 2007;Ludwig et al., 2009). When the model is conceptual, its performances must Published by Copernicus Publications on behalf of the European Geosciences Union. 4612 E. Guilpart et al.: Combining split-sample testing and HMM to assess hydrological models be assessed over contrasting climatic periods, because it should be able to perform well over contrasted hydroclimatic conditions (Klemes, 1986). For that purpose, the differential split-sample test principle of Klemes (1986) suggests dividing the whole period into independent periods with different stationary features. The hydrological model is then calibrated with a specific period and validated with another period(s). However, as the technique used to subdivide a time series affects the intrinsic variability embedded in the subsequences, it may impact the calibration and validation steps (Thirel et al., 2015a, b;Stephens et al., 2019;Motavita et al., 2019;Dakhlaoui et al., 2019;Huang et al., 2020).
Several statistical tests have then been proposed to detect shifts and trends in time series including the Mann-Kendall test (Mann, 1945;Kendall, 1948) and the Pettitt test (Pettitt, 1979). A review of those tests can be found in Liu et al. (2016). However, most of those tests can only make the distinction between two periods, before and after the change point, and are therefore unable to handle more complex climate sequences with multiple change points. In certain regions, for example, time series of river discharges are characterized by low-frequency shifts and hence multiple change points, because the underlying hydrological processes are influenced by low-frequency climate signals such as the El Niño-Southern Oscillation (Bracken et al., 2014;Nalley et al., 2019).
Hidden Markov models (HMMs) can be used to identify a succession of subsequences in a time series (Rabiner, 1989). Rather than focusing on shifts in the mean of a process, HMMs estimate shifts in the state of a process (Whiting et al., 2004). In other words, a HMM labels the observations according to their state, which ultimately leads to a new time series with states alongside the original one with the observations. If the latter is a time series of river discharges, then the HMM will generate a new time series of climate states. In hydrology, HMMs are typically used to analyze time series exhibiting a regime-like behaviour characterized by long-term persistence (Akintug and Rasmussen, 2005;Whiting et al., 2004;Turner and Galelli, 2016).
In this article, we combine a classification obtained by a HMM with the differential split-sample testing framework. The goal is to improve the robustness of the calibration/validation of a hydrological model, which is a prerequisite to climate change impact assessment. The term "robustness" refers to the ability of the hydrological model to perform well under contrasted hydroclimatic conditions. This definition is coherent with the so-called robust decisionmaking framework that is often advocated to handle the deep uncertainty attached to climate change (Lempert et al., 2006). This is illustrated using the Senegal River basin (SRB) as a case study. Headwaters in the SRB are still largely natural areas (Descroix et al., 2020;Faty, 2017), and the flow regime in the upper part of the basin exhibits regime-shifting behaviour with departures from the inter-annual average over extended periods of time (Faye et al., 2015;Paturel et al., 2004;Da-costa et al., 2002). These characteristics make the SRB an interesting case study to illustrate the differential split-sample testing framework with hydrologic sequences identified from a HMM.
The paper is organized as follows. Section 2 describes the methodology as well as the case study. Results are then discussed in Sect. 3. Finally, concluding remarks are given in Sect. 4.

Calibration and validation of a hydrological model under contrasted climates
Generally speaking, the calibration/validation of a hydrological model seeks to identify the unknown parameters of the model on one portion of historical data and then to judge the performance of the calibrated model over another portion (Roche et al., 2012). Subdividing the whole period into subsequences must be done carefully, keeping in mind that the validation period must be close to the conditions to which the model will be applied operationally (Brigode et al., 2013). Klemes (1986) proposes a hierarchical scheme for the systematic testing of hydrological models. When calibrating/validating a model under non-stationary conditions, the author recommends readers to follow the differential splitsample test. Depending on the nature of the change leading to non-stationary conditions, climate or land use, the differential split-sample test can take different forms. Since this paper is concerned with the robustness of hydrological models for climate change impact assessments, we focus on the differential split-sample test to handle non-stationary conditions due to a changing climate. In that case, the time series of river discharges must be divided into at least two stationary subsequences with contrasted climates, e.g. dry and wet; we then calibrate the model on one subsequence and use the other one for validation. The main idea is to make sure that the model is able to perform well under the transition required: from drier to wetter conditions or vice versa. This amounts to testing the stability of the parameters for different climate conditions (Brigode et al., 2013).
As explained in the introduction, classical rupture tests make the distinction between only two periods, therefore limiting the number of transitions that can be explored to assess the robustness of the hydrological model. This paper addresses this limitation by identifying multiple subsequences using a hidden Markov model (HMM), which are then used in a differential split-sample testing framework.

Identifying stationary subsequences
Identifying multiple subsequences in a time series of river discharges comes down to detecting shifts in the flow regime, which can be done using a statistical test like the Pettitt test or with the help of a HMM. The non-parametric trend Pettitt test divides the streamflow record of length T into two subsequences denoted T pettitt.P 1 and T pettitt.P 2 respectively. It involves identifying the change point Y marking the transition from one subsequence to the next. Given a random variable q (e.g. annual streamflow), the Pettitt test is defined as the following (Pettitt, 1979): with L being the length of the time series q.
where K gives the year of the change point if the test is significant (p ≤ 0.05) (Pettitt, 1979). Hidden Markov modelling is a class of probabilistic model that can be used to label the observations (Rabiner, 1989). The motivation for adopting this type of model in hydrology is that the flow regime can be represented by a state variable that can take only a limited number of values (e.g. dry or wet for two states; dry, normal, or wet for three states). In other words, in parallel to the time series of historical river discharges, there exists another time series with discrete climate states. Denote {q 1 , q 2 , . . ., q L } the time series of annual flows and { 1 , 2 , . . ., L } the time series of states which can only take N possible values (Fig. 1).
The state variable is unobserved and is accordingly referred to as a hidden variable. The hidden state t is modelled as a N state Markov chain fully characterized by its transition probability matrix M with elements M ij : where M ij describes the transition probability to switch from the state i at time t − 1 to state j at time t. The observed variable q t is assumed to have been drawn from a probability distribution whose parameters are conditional upon the distinct state at time t such that, when t is known, the distribution of q t depends only on the current state t and not on previous states or observations. M (q t |q t−1 , . . .q 1 , t , . . ., 1 ) = M (q t | t ) .
Fitting a HMM to the observed sequence (here the time series of annual flows) requires evaluating the likelihood of observing that sequence, as calculated under an N-state HMM (see Appendix A for more details). In this study, we use the expectation-maximization (EM) algorithm, which is an iterative method for finding the maximum-likelihood estimate of the parameters of an underlying distribution when some of the data are missing. In the context of HMM, the EM algorithm is known as the Baum-Welch algorithm (Welch, 2003), and the hidden climate states are treated as missing data (Bilmes, 1998;Zucchini et al., 2017).
The EM algorithm consists of two main phases: an expectation phase called "E step", followed by a maximization phase called "M step". Given the current estimate of the HMM parameters θ , the following steps are repeated until acceptable convergence is achieved: the E step phase of the algorithm computes the expected value of unobserved data (i.e. hidden climate states) using the current estimate of the parameters and the observed data. The M step phase of the algorithm then provides a new estimate of the parameters by using the data from the E step phase as if they were actually measured data. These parameters are then used to calculate the distribution of unobserved data in the next E step phase of the algorithm. The resulting values of θ are then the stationary point of the likelihood of the observed data (please refer to Appendix B for more details).
Given the observation sequence, we want to determine the sequence of hidden climate states { 1 , 2 , . . ., T } that have most likely (under the fitted HMM) given rise to the time series of annual river discharges. In the literature, this is known as the decoding procedure. In this study we use the Viterbi algorithm (Viterbi, 1967) to unfold the sequence of hidden climate states (called the Viterbi path). This, in turn, enables us to divide the whole period into numerous climate subsequences. Figure 2 depicts the possible combinations offered by Pettitt's test and HMM classifications. The hydrological model is calibrated on a specific subsequence, and the validation is achieved on others. Thus, the model performances (i.e. the robustness) are assessed over a large panel of hydroclimatic conditions. More specifically, the robustness of the hydrological model can be assessed after examining the differences between calibration and validation scores for the different cases (transitions) that can be investigated once the subsequences are identified. If those differences remain stable, then the hydrologic model is robust vis-à-vis contrasted hydroclimatic conditions.

The study case: Senegal River basin and its sub-basins
The use of HMM-derived subsequences in a differential splitsample testing framework to assess the robustness of a calibrated hydrological model is illustrated with the Senegal River basin. The Senegal River drains a basin shared by four countries in West Africa : Guinea, Mali, Mauritania, and Senegal. There are three main tributaries: (i) the Bafing River contributing to ∼ 50 % of the Senegal River flows, (ii) the Bakoye River (∼ 15 %), and (iii) the Faleme River (35 %). Flowing northward for 500 km, the Bafing River collects precipitation on the Fouta Djallon, a high plateau considered the water tower of West Africa. After merging with the Bakoye River, the Senegal River runs northwest for 200 km before the confluence with the Faleme River at Bakel, the last major tributary. After Bakel, the river meanders over 800 km through the floodplain and then discharges into the Atlantic Ocean.
The basin is located in the Sudano-Guinean zone, which is yearly influenced by the monsoon, a rainy season from April to October (Lahtela, 2003;Bodian, 2011). A consequence of the monsoon is a strong north-south precipitation gradient, ranging from 1900 mm yr −1 in the south to 100 mm yr −1 in the north Bodian et al., 2015). In addition, precipitation present strong annual and inter-annual historical variabilities (Faye et al., 2015), with a wet episode (1950s-1970s) and a dry episode (1970s-1990s). With this historical climatic variability, as well as a strong spatial heterogeneity of its hydroclimatic components, the SRB is an interesting case study to analyze the robustness of hydrological models.
To take advantage of the hydroclimatic specificities of the SRB and its heterogeneity, we have divided the SRB into three sub-basins (Fig. 3b, c, d and Table 1). This allows us to demonstrate the potential of the proposed protocol based on a HMM classification on basins with contrasting hydrologic characteristics. Sub-basins have been delimited using the GRASS-3.4 software and the Shuttle Radar Topography Mission (SRTM) 1 arcsec elevation data set.
Generally speaking, streamflows are considered an integrative signal of the whole basin hydroclimatic conditions, meaning that river discharges are the result of hydrological processes taking place upstream and are influenced by changes in precipitation, land use, etc. For the Senegal River basin, most of the runoff and headwaters are located in the Fouta Djallon, a sparsely populated plateau where vegetation cover is relatively stable (Descroix et al., 2020) anthropogenic impacts on runoff seem to be negligible (Faty, 2017;Bader et al., 2014;OMVS, 2011). The areas mainly concerned with massive land-use conversions are located downstream of Bakel, a region not considered in our analysis because it marginally contributes to the river discharge. This  study relies on a time series of naturalized flows at Bakel produced by Bader et al. (2014) after removing the influence of the Manantali Dam. In Daka Saidou and Oualia sub-basins, however, river discharges are still natural. Consequently, we can assume that changes in the flow regime can only be attributed to shifting climate conditions.

The selected hydrological model
The selected hydrological model is GR2M (Mouelhi, 2003), a monthly time-step conceptual model that has already been used in the SRB with satisfactory results (Ardoin- Bardin, 2004;Ardoin-Bardin et al., 2005;Bodian et al., 2012Bodian et al., , 2015Bodian et al., , 2016. Moreover, since the simulated flows will be processed by a hydro-economic model of the SRB working on a monthly time step (Tilmant et al., 2020), there was no need for a hydrological model working on a shorter time step. GR2M has only two parameters: X1 and X2, controlling the production and the transfer functions respectively (Fig. 4). We use the GR2M version included in the environment "airGR", developed by Coron et al. (2017). The GR2M calibration/validation phase requires three time series: (i) a time series of monthly precipitation (P ) in the basin, (ii) a time series of monthly potential evapotranspiration (PET), and (iii) a time series of monthly river discharges (q) at the outlet.

Implementing the differential split-sample test on HMM-derived subsequences
Many authors have pointed out that selecting the most accurate hydrological and meteorological inputs can significantly reduce the total error during the calibration/validation of a hydrological model (Paturel et al., 1995;Huard and Mailhot, 2006;Kavetski et al., 2006;Huard and Mailhot, 2008;Renard et al., 2010). Based on a comparison with meteorological observations compiled by SIEREM (Système d'Informations Environnementales sur les Ressources en Eau et leur Modélisation) and details given by Bader et al. (2014) about hydrological data, the following dataset is selected: (1) time series of precipitation were extracted from the HSM-SIEREM dataset, stretching from 1940 to 1998 (Boyer et al., 2006); (2) the PET time series comes from the Climate Research Unit (CRU) (Harris et al., 2020) and covers a period from 1901 to 2018; (3) monthly river discharge data at sub-basin outlets are naturalized flows extracted from Bader et al. (2014) for the 1903-2012 period. Based on the above datasets, the analysis is carried out on the period 1940-1998 (59 years), which is denoted by "the full historical record" (T 1940-1998 ) in the remainder of this paper. Selecting an objective function to calibrate a conceptual hydrological model is one of the main concerns of the hydrological community (Garcia et al., 2017;Krause et al., 2005;Madsen, 2003). Here, two objective functions are selected: (1) the Nash-Sutcliffe efficiency (NSE) (Nash and Sutcliffe, 1970) and the (2) Kling-Gupta efficiency (KGE) (Gupta et al., 2009). The former is a popular criterion, and since it mainly focuses on high flows, it is particularly relevant for rivers where much of the annual discharge is generated during the high-flow season, which is the case in the SRB. The latter allows for a multi-objective calibration that considers more components than just the errors, i.e. correlation, bias, and variability.
Mathematically, the NSE and KGE formulations can be written as where q obs is the observed flow at time t, q sim is the simulated flow at time t, µ obs is the mean of observed flows, β the ratio between the mean simulated flow and the mean observed flow value β = µ sim /µ obs , α the ratio between the standard deviation of simulated flows and the standard deviation of observed flows α = σ sim /σ obs , and r is given by For the identification of the subsequences, we have implemented Pettitt's test, a two-state HMM classification (a dry state and a wet state), and a three-state HMM classification (dry, normal, and wet states). As shown in Fig. 5, seven transitioning cases can be investigated within the differential split-sample testing framework.
-If relevant, the Pettitt test offers two calibration/validation possibilities: calibration on T pettitt.dry and validation on T pettitt.wet (and vice versa).
-The two-state HMM classification offers two possibilities too: calibration on T 2HMM.dry and validation on T 2HMM.wet (and vice versa).
-Similarly, the three-state HMM classification leads to three possibilities: calibration on T 3HMM.dry and validation on T 3HMM.nor + T 3HMM.wet (and corollaries).
Pettitt test and HMM classifications have been carried out on the time series of annual flows. T pettitt.wet and T pettitt.dry are both subsequences made of contiguous years as the original time series is split into two. In that case, the temporal persistence found in the original time series is very much preserved. However, for T 2HMM.wet and T 2HMM.dry , the situation is different since they are made of numerous, not necessarily contiguous, wet or dry subsequences respectively. This is also true for T 3HMM.wet , T 3HMM.nor , and T 3HMM.dry .
Even though the KGE is based on a decomposition of the NSE, the corresponding scores cannot be directly compared.
Therefore, we will discuss the results obtained with NSE and KGE separately. During all calibration phases, the first year is considered warming-up period and not considered. Recall that the identification of change points is done on the time series of annual flows, while the hydrological model simulates monthly river discharges.

Analysis of simulation results
First, we applied the calibration/validation protocol on the full historical record (T 1940-1998 ). The results (Sect. 3.1) highlight the relevance of a HMM classification for long time series (59 years) with a historical contrasted climate.
Then, the protocol is implemented, independently, on two shorter periods: 1945-1971 (T 1945-1971 , 27 years) and 1972-1998 (T 1972-1998 , 27 years). This second test aims at illustrating the relevance of HMM classifications for shorter time series, which do not display a clear climate trend. Results are respectively given in Sect. 3.2 and 3.3.

Subsequence identification and calibration/validation results for the full historical record T 1940-1998
The results of the division of the full historical record T 1940-1998 are displayed in Table 2 and Fig. 6a. Calibration and validation values are given in Fig. 6b and in Table 3. For the three sub-basins, Pettitt's test is significant and shows a rupture in 1970 or 1971 (Table 2, Fig. 6a vertical red line). The two-state HMM classification provides similar results with nearly aligned climate subsequences for all sub-basins. This is also true for the three-state HMM classification. It must also be noted that Pettitt's test-derived and HMM-derived subsequences are quite long, ranging from 15 to 33 years.
From the examination of the transition probability matrices M in Table 2, we can see that the states are clearly distinct both with the two-state and the three-state HMM classification. As a matter of fact, the values close to one on the diagonal indicate that when the climate is in a particular state, it will likely remain in that state in the next time period (year).
The examination of Fig. 6b reveals that, for Daka Saidou (upstream), the model's scores (NSE or KGE) in calibration and validation are gathered in the top right corner, indicating that the model is able to reproduce well the subsequences of the historical record that have been used in calibration and then in validation. A statistical analysis of the subsequences shows that those sharing the same (hidden) climate state have similar statistics, but the differences across climate states are relatively small at Daka Saidou compared to the other two sub-basins located downstream (Bakel and Oualia). For example, the mean monthly streamflows of dry subsequences represent 64 %, 61 %, and 54 % of wet subsequences for the Pettitt test two-state-HMM and three-state-HMM classifica-   tion respectively, which, as we will see later, are much higher ratios than those found at Bakel or Oualia. In other words, although the climate subsequences are indeed statistically distinct, they are nevertheless not that far apart, meaning that, regardless of the transitions, the conditions for calibration and validation are not significantly different. For Oualia and Bakel sub-basins, however, calibration/validation scores are more scattered; only some model versions are able to perform consistently over contrasted climates. For Oualia, calibrations on dry and normal subsequences (cases 2, 3, 5, and 6) provide relatively good values and similar validation scores (difference between calibration and validation scores lower than 0.1), meaning that the associated model versions could be considered robust. Those results suggest that the "wet version" of the model struggles to simulate very dry months (especially during the dry subsequences). It seems that the wet version of the model does not handle well the intermittent streams which can be observed in the northern (driest) part of Oualia sub-basin during dry years. For Bakel, we can see that the calibration/validation scores obtained from the HMM-derived wet and dry subsequences (cases 3 and 4) are systematically better than those calculated for the subsequences identified by the Pettitt test (cases 1 and 2). Here, calibrating on dry conditions and validating on wet conditions do not systematically perform better than the other way around. In contrast to Oualia, Bakel subbasin drains a portion of the Fouta Djallon with the Faleme and Bafing rivers and is therefore less sensitive to those intermittent streams found in the north. Finally, as we move downstream, we can see that the difference between the NSE and KGE scores tends to increase. As pointed out by Gupta et al. (2009), since the NSE uses the observed mean as baseline, it can lead to overestimation of model skill for highly seasonal time series. Here, due to the north-south precipitation gradient, the seasonality of the flow regime increases once the river leaves the Fouta Djallon since it receives the contribution of more and more intermittent tributaries. T 1945-1971 This section examines the period 1945-1971 (T 1945-1971 ), which can be considered a wet historical episode in the SRB. The results of the division are displayed in Table 4 and Fig. 7a. Calibration and validation values are given in Fig. 7b and in Table 5.

Subsequence identification and calibration/validation results for
For the three sub-basins, Pettitt's tests are inconclusive, indicating that there is no clear climatic trend in the T  period. However, the HMM is still able to make the distinction between climate states and thus identify corresponding subsequences, which can then be exploited in a differ-   . The numbers of years used for the calibration or validation are given in parenthesis. T 1940-1998 Daka Saidou

The full historical record
Pettitt's test    (7 years ential split-sample test. The subsequences provided by twostate-HMM and the three-state-HMM classifications are not necessary aligned for all sub-basins. As T 1945-1971 is a 26year period, the two-state-HMM-derived and the three-state-HMM-derived subsequences have lengths ranging from 5 to 19 years. Various authors have discussed the minimum length required to achieve a calibration or a validation without reaching a consensus, even though a number from 2 to 8 years could be enough depending on the "hydrological events" included in the subsequences (Razavi and Coulibaly, 2013;Juston et al., 2009;Singh and Bárdossy, 2012). In our case, the technique used to identify the subsequences (HMM classifications) seeks to provide relatively homogeneous ones. Here, we assume that 5 years is acceptable, but we do not investigate this issue further since it is beyond the scope of our paper. The transition probability matrices for the two-state HMM and three-state HMM are now diverging from an identity matrix, indicating that the temporal persistence is less pronounced than that found in the full historical records. In addition, we note that the mean annual flow of dry subsequences (T 1945-19712HMM.dry and T 1945-1971 ) are relatively high (in comparison with T 1940-19982HMM.dry and T 1940-1998 ). Compared to the results associated with the full historical records and discussed in the previous section, we can see that the calibration/validation scores for the five remaining cases are more concentrated, especially for the two downstream sub-basins (Oualia and Bakel). During this wet episode, the hydrological models seem to be more robust, which is consistent with the fact that the intrinsic variability of the hydrologic conditions within that episode is smaller than the variability that characterizes the full historical records. With the north-south gradient that characterizes precipitation in the SRB, intermittent tributaries in the north under dry or normal conditions are turned into permanent rivers which are better handled by conceptual models like GR2M. T 1972-1998 Here, we focus on the period 1972-1998 (T 1972-1998 ), which can be considered a dry historical episode in the SRB. The results of the division and the corresponding parameters are displayed in Table 6 and Fig. 8a. Calibration and validation values are given in Fig. 8b and in Table 7. Likewise in Sect. 3.2, there is no clear monotonic climatic trend such as Pettitt's test is inconclusive (p values bigger than 0.05). Again, the HMM remains here a useful tool to identify subsequences, which are not necessary aligned for all sub-basins.

Subsequence identification and calibration/validation results for
For Daka Saidou, all calibration and validation scores are higher than 0.9, and the differences are small (below 0.1). For Oualia sub-basin, the calibration and validation of the hydrological model face the typical challenges associated with the  high spatial and temporal variability that characterizes the formation and propagation of river flows in dryland regions. The dry episode, centered around the 1980s, was triggered by a sustained reduction in precipitation (Faye et al., 2015), which was even more pronounced in the north where some tributaries became intermittent rivers . Conceptual hydrological models like GR2M are indeed not well equipped to deal with sudden and widespread transitions from wet to dry conditions (Gutierrez-Jurado et al., 2021). For Bakel, no clear pattern emerges: some "dry versions" have as bad (or good) performances as some "wet versions". The poorest scores are nevertheless obtained when the calibration and validation are carried out on homogeneous subsequences associated with extreme climate states (dry-wet, wet-dry), e.g. cases 5-2 and 7-1.

Conclusions
This article proposes a HMM-based classification to deal with complex climate sequences and shows how the resulting classification can be used in a differential split-sample test to assess the robustness of a hydrological model. A modelling experiment is carried out in the Senegal River basin using the GR2M model and historical flows from 1940-1998. Then, two other periods have been investigated: a wet episode ) and a dry one .
The main concluding remarks are the following: -When records display a single point change, a classical rupture trend (as Pettitt test) remains an adequate tool to divide the records into two climate subsequences.
-If the records contain multiple change points, a HMM classification can divide the series into several climate subsequences without the need for additional data.
-Regardless of the division method used, the range of climate conditions over which the hydrological model can perform depends on the intrinsic variability of the series. Compared to the Pettitt test, however, the HMM classification allows for a finer labelling of the years and therefore better exploiting the intrinsic variability in the series to enrich a differential split-sample test.
-We encourage modellers to explore as many cases as possible to calibrate/validate a hydrological model according to the differential split-sample test. The parameter's stability over contrasted hydroclimatic conditions seems to depend on the studied period, on the objective functions, on the subsequence identification techniques, and on the basin.

4624
E. Guilpart et al.: Combining split-sample testing and HMM to assess hydrological models Appendix A: Likelihood of hidden Markov models We suppose there is an observation sequence Q = {q 1 , q 2 , . . ., q T } and the associated (unobserved) state variables = { 1 , 2 , . . ., T } generated by such a model. Given the set of HMM parameters θ = {µ, σ, M, δ}, the joint density of complete data set Z = (Q, ) can be expressed as p(Z|θ ) = p(Q, |θ ) = p(Q| , θ )p( |θ ). (A1) Assuming the data belonging to each hidden state are characterized by a specific Gaussian probability distribution, the two terms on the right-hand side are p(Q| , θ ) = T t=1 p q t |µ t , σ t , The complete data likelihood function ζ (θ |Z) can be calculated as ζ (θ |Z) = ζ (θ |Q, ) = p(Q, |θ ). (A4) For a HMM which has the initial distribution δ and transition probability matrix M for the Markov chain, let us define the probability mass function of Q if the Markov chain is in state i at time t as p i (q) = p(Q = q| = i), with i = 1, 2, . . .N . The general form of likelihood function is then given by the following (Zucchini et al., 2017): ζ = δ (q 1 )M (q 2 ). . .M (q T )1 , where (q) is defined as the diagonal matrix with i being the diagonal element p i (q) and 1 being an N dimensional vector of 1.

Appendix B: HMM likelihood maximization with EM algorithm
In order to set out the likelihood computation in the form of the Baum-Welch algorithm (Welch, 2003), which involves the forward α(t) and backward β(t) probabilities, we define α(t) and β(t) as respectively. More specifically, α i (t) is the probability of observing the partial sequence q 1 , q 2 , . . ., q t and ending up in state i at time t, and β i (t) is the probability of observing the remaining sequence. Numerical calculation of α i (t) and β i (t) is not trivial (Akintug and Rasmussen, 2005). Here we use the method suggested by Durbin et al. (1998) for scaling forward and backward probabilities to overcome this problem. Now let us define u j (t) and v j k (t) as the following (Zucchini et al., 2017): u j (t) = p ( t = j |Q, θ ) = α j (t)β j (t) ζ , v j k (t) = p ( t−1 = j, t = k|Q) = α j (t − 1)M j k p k (q t )β k (t)/ζ, where M j k is the probability of transition from hidden climate state j to climate state k, and ζ is the likelihood function. With EM algorithm, we aim to maximize the loglikelihood of the parameters of interest θ, based on complete data (i.e. both the observed data and the hidden climate states). Now let us represent the sequence of climate states (missing data) by the Markov chain by the zero-one random variables. The complete data log-likelihood can be formulated as log(ζ (θ|Z)) = where u j (t) = 1 if and only if t = j (t = 1, 2, . . ., T ), and transition probability v j k (t) = 1 if and only if t−1 = j and Figure B1. Expectation-maximization algorithm for a HMM parameter estimation. t = k(t = 2, 3, . . ., T ); N is the number of hidden climate states, δ j is the initial transition of Markov chain, and p j (.) is the probability mass function if the Markov chain is in state j at time t. Maximization of the complete data log-likelihood function is performed with the EM algorithm through an iterative process presented in Fig. B1.