Suitability of 17 gridded rainfall and temperature datasets for large-scale hydrological modelling in West Africa

This study evaluates the ability of different gridded rainfall datasets to plausibly represent the spatio-temporal patterns of multiple hydrological processes (i.e. streamflow, actual evaporation, soil moisture and terrestrial water storage) for large-scale hydrological modelling in the predominantly semi-arid Volta River basin (VRB) in West Africa. Seventeen precipitation products based essentially on gauge-corrected satellite data (TAMSAT, CHIRPS, ARC, RFE, MSWEP, GSMaP, PERSIANN-CDR, CMORPH-CRT, TRMM 3B42 and TRMM 3B42RT) and on reanalysis (ERA5, PGF, EWEMBI, WFDEI-GPCC, WFDEI-CRU, MERRA-2 and JRA-55) are compared as input for the fully distributed mesoscale Hydrologic Model (mHM). To assess the model sensitivity to meteorological forcing during rainfall partitioning into evaporation and runoff, six different temperature reanalysis datasets are used in combination with the precipitation datasets, which results in evaluating 102 combinations of rainfall–temperature input data. The model is recalibrated for each of the 102 input combinations, and the model responses are evaluated by using in situ streamflow data and satellite remote-sensing datasets from GLEAM evaporation, ESA CCI soil moisture and GRACE terrestrial water storage. A bias-insensitive metric is used to assess the impact of meteorological forcing on the simulation of the spatial patterns of hydrological processes. The results of the process-based evaluation show that the rainfall datasets have contrasting performances across the four climatic zones present in the VRB.
The top three best-performing rainfall datasets are TAMSAT, CHIRPS and PERSIANN-CDR for streamflow; ARC, RFE and CMORPH-CRT for terrestrial water storage; MERRA-2, EWEMBI/WFDEI-GPCC and PGF for the temporal dynamics of soil moisture; MSWEP, TAMSAT and ARC for the spatial patterns of soil moisture; ARC, RFE and GSMaP-std for the temporal dynamics of actual evaporation; and MSWEP, TAMSAT and MERRA-2 for the spatial patterns of actual evaporation. No single rainfall or temperature dataset consistently ranks first in reproducing the spatio-temporal variability of all hydrological processes. A dataset that is best in reproducing the temporal dynamics is not necessarily the best for the spatial patterns. In addition, the results suggest that there is more uncertainty in representing the spatial patterns of hydrological processes than their temporal dynamics. Finally, some region-tailored datasets outperform the global datasets, thereby stressing the necessity and importance of regional evaluation studies for satellite and reanalysis meteorological datasets, which are increasingly becoming an alternative to in situ measurements in data-scarce regions.

Precipitation is one of the major components of the water cycle, which has led to numerous initiatives to understand its generation and to estimate its amount and variability on Earth (Maidment et al., 2015; Cui et al., 2019). In hydrological modelling (Singh, 2018), precipitation is the most important driver variable that determines the spatio-temporal variability of other hydrological fluxes and state variables (Thiemig et al., 2013; Bárdossy and Das, 2008).
With the development of distributed hydrological models that facilitate large-scale predictions (Fatichi et al., 2016; Ocio et al., 2019), there is a growing need to inform and evaluate those models with distributed observational datasets to improve spatio-temporal process representation (Baroni et al., 2019; Paniconi and Putti, 2015; Hrachowitz and Clark, 2017). A key challenge is the spatio-temporal intermittency of precipitation, which complicates both its measurement and its spatial interpolation (Tauro et al., 2018; Acharya et al., 2019; Bárdossy and Pegram, 2013; P. D. Wagner et al., 2012), especially in regions with particular features such as complex topography, convection-driven precipitation or snowfall occurrence. A comprehensive description of precipitation measurement techniques can be found in previous studies (e.g. Tapiador et al., 2012; Stephens and Kummerow, 2007; Kidd and Huffman, 2011; Levizzani et al., 2020). The drawbacks of in situ measurements of precipitation include limited and uneven areal coverage, deficiencies in instruments and costly maintenance (Awange et al., 2019; Harrison et al., 2019), and they have led to the advent of precipitation estimation from space (Barrett and Martin, 1981). Precipitation estimates from space are spatially homogeneous and cover inaccessible regions with uninterrupted records over time (Beck et al., 2019b; Funk et al., 2015).
The advent of satellite-based rainfall products (SRPs) has opened up new avenues for water resources monitoring and prediction, especially in data-scarce regions (Serrat-Capdevila et al., 2014; Sheffield et al., 2018; Hrachowitz et al., 2013). Although the use of SRPs in hydrology is increasing (Xu et al., 2014; Chen and Wang, 2018), they have not yet been fully adopted for operational purposes (Ciabatta et al., 2016; Kidd and Levizzani, 2011). The limited uptake of SRPs in hydrology is due to measurement bias, inadequate spatio-temporal resolutions (e.g. for extreme-event simulation), the shortness of the records for some applications (e.g. climate change impact assessments) and the scepticism of some potential users with regard to the data quality (Marra et al., 2019). In the past decades, a large number of SRPs have been developed with different objectives, spatial and temporal resolutions, input sources, algorithms and acquisition methods (Ashouri et al., 2015; Brocca et al., 2019). Several studies provide a review of SRPs (e.g. Maidment et al., 2014; Sun et al., 2018; Maggioni et al., 2016; Le Coz and van de Giesen, 2019).
In addition to SRPs, there are also atmospheric retrospective analysis (or reanalysis) datasets of precipitation. A reanalysis system is composed of a forecast model and a data assimilation scheme that integrates spatio-temporal observations of meteorological variables (i.e. temperature, humidity, wind and pressure) to generate gridded atmospheric data (Lorenz and Kunstmann, 2012;Schröder et al., 2018). Precipitation is one of the reanalysis model-generated fields that generally has more uncertainties than the meteorological state fields (Roca et al., 2019). Reanalysis datasets are often used in hydrological modelling (Tang et al., 2019;Duan et al., 2019;Gründemann et al., 2018), and sometimes they are preferred over SRPs because of their usually long-term records suitable for climate change studies and because of their higher performance in predictable large-scale stratiform systems (Seyyedi et al., 2015;Potter et al., 2018).
Despite the progress in satellite instruments, which has led to substantial advances in improving precipitation estimates (Tang et al., 2019), there are known inconsistencies among the available SRPs (Sun et al., 2018; Tapiador et al., 2017). SRPs are subject to inherent errors originating mainly from precipitation retrieval instruments and algorithms, sampling frequency, and inadequate representation of cloud physics in some regions (Laiti et al., 2018; Alazzy et al., 2017; Romilly and Gebremichael, 2011). While SRPs are subject to systematic biases, reanalysis products have uncertainties resulting from their model forcing parameters, low spatial resolution with poor representation of sub-grid processes and the model physics (Bosilovich et al., 2008; Laiti et al., 2018). Uncertainty quantification in both SRPs and reanalysis data is the subject of intense research (e.g. Maggioni et al., 2016; Awange et al., 2016; Westerberg and Birkel, 2015). The error quantification of SRPs and reanalysis products is usually done by comparing them with in situ measurements (e.g. Dembélé and Zwart, 2016; Thiemig et al., 2012; Beck et al., 2019a; Caroletti et al., 2019; Satgé et al., 2020), or by assessing their reliability as forcing for hydrological models (e.g. Duethmann et al., 2013; Pan et al., 2010; Nkiaka et al., 2017). Other evaluation approaches include triple collocation, which is a technique that estimates the variance of unknown errors of three independent variables without a reference or observed variable (e.g. Massari et al., 2017; Alemohammad et al., 2015; McColl et al., 2014; Roebeling et al., 2012). Compared to the ground-truthing approach, the hydrological evaluation approach has received limited attention (Camici et al., 2018; Poméon et al., 2017).
In rainfall-runoff modelling (Peel and McMahon, 2020), the non-linearity of hydrological processes (Blöschl and Zehe, 2005;Clark et al., 2009) can reduce or amplify the errors in the input rainfall data used and result in a satisfactory or poor representation of the hydrological responses (Maggioni and Massari, 2018;Nijssen, 2004). Consequently, the hydrological model can give a good representation of a hydrological state or flux variable for the wrong reasons (cf. Kirchner, 2006), thereby potentially leading to unfortunate consequences for water resources management (Zambrano-Bigiarini et al., 2017). When testing models as hypotheses (Beven, 2018;Pfister and Kirchner, 2017), type I errors (i.e. false positive model acceptability; Beven, 2010) should be avoided to ensure a high predictive skill of the model and its correctness for good decision-making. This sheds light on the importance of assessing the reliability of hydrological predictions generated with the use of SRPs and reanalysis products (Behrangi et al., 2011;Kuczera et al., 2010). In this context, knowing the adequacy and coherence of meteorological data in reproducing hydrological processes is a prerequisite to data selection for water resources management (Casse et al., 2015;Laiti et al., 2018).
In the context of hydrological evaluation of precipitation datasets, some limitations can be identified in previous studies. Some studies only evaluate a small number of precipitation datasets or do not consider reanalysis products (e.g. Bitew and Gebremichael, 2011;Ma et al., 2018;Liu et al., 2017;Bhattacharya et al., 2019). Usually, the influence of temperature datasets in combination with rainfall datasets is not tested (e.g. Satgé et al., 2019;Camici et al., 2018;Casse et al., 2015;Qi et al., 2016;Zhang et al., 2019), with the exception of a few studies (e.g. Laiti et al., 2018;Lauri et al., 2014), despite the importance of this interaction for evaporation simulation. Most studies evaluate a single hydrological state or flux variable, generally streamflow (e.g. Poméon et al., 2017;Seyyedi et al., 2015;Shayeghi et al., 2020;X.-H. Li et al., 2012) or soil moisture (e.g. Brocca et al., 2013). Some studies use lumped or semi-distributed models, therefore averaging the rainfall amount over large areas (e.g. Duan et al., 2019;Tang et al., 2019;Tobin and Bennett, 2014;Gosset et al., 2013;Shawul and Chakma, 2020), which reduces the bias effect that could occur at the pixel level with a fully distributed model. Often, the model is not recalibrated for each precipitation dataset (e.g. Voisin et al., 2008;Su et al., 2008;L. Li et al., 2012;Tramblay et al., 2016), which is, however, a prerequisite for reliable input field assessment (Stisen et al., 2012). Moreover, some studies perform a global-scale analysis and ignore regionally tailored products (e.g. Beck et al., 2017b;Fekete et al., 2004), which can outperform global products (e.g. Thiemig et al., 2013). Finally, to the best of our knowledge, no study has evaluated the simultaneous impact of various precipitation and temperature datasets on the spatial patterns of several hydrological processes (i.e. soil moisture and evaporation).
In light of the above, we propose to study the adequacy of different combinations of 17 precipitation datasets (10 SRPs and 7 reanalysis products) and 6 temperature datasets from reanalysis, when used as forcing data for a fully distributed hydrological model, in reproducing the spatio-temporal variability of multiple hydrological processes (i.e. streamflow, actual evaporation, soil moisture and terrestrial water storage). In total, 102 rainfall-temperature input data combinations are tested with the mesoscale Hydrologic Model (mHM) by recalibrating the model for each of the input data combinations. The experiment is carried out in the poorly gauged and predominantly semi-arid Volta River basin (VRB), located in West Africa, over the period 2003-2012. It is noteworthy that the goal of this study is not to estimate the intrinsic quality of the meteorological forcing (i.e. precipitation and temperature) but rather to understand the impact of the propagation of associated uncertainties on the simulation of hydrological processes (Bhuiyan et al., 2019;Falck et al., 2015;Marthews et al., 2020).
The VRB case study is particularly interesting from both scientific and societal perspectives. On the one hand, precipitation modelling in tropical monsoon climates is a challenging task due to strong seasonality and diurnal variations of rainfall (Turner et al., 2011; Pfeifroth et al., 2016; Cook and Vizy, 2019), and due to isolated convection systems in semi-arid regions (Taylor et al., 2017; Mathon et al., 2002; Parker and Diop-Kane, 2017). On the other hand, open-access and good-quality datasets are needed for water resources management in West Africa (Roudier et al., 2014; Serdeczny et al., 2017; Di Baldassarre et al., 2010; Dinku, 2019). The following research questions are addressed:
1. What is the impact of different gridded rainfall and temperature datasets on the simulation of hydrological fluxes and state variables?
2. How important is the choice of meteorological datasets for the representation of spatial patterns versus temporal dynamics?
Overall, the objective of this work aligns with the efforts to solve the current scientific challenges in hydrology (i.e. uncertainty in large-scale measurements and data, spatial heterogeneity and modelling methods; Blöschl et al., 2019;Wilby, 2019). Moreover, a growing interest in using satellite remote-sensing data in hydrological modelling is expected (McCabe et al., 2017;Wilkinson et al., 2016). Therefore, knowing the suitability of the input data for hydrological modelling is a prerequisite for reliable spatio-temporal predictions, as the goal is to increase model performance with minimum uncertainty (Beven, 2016;McMillan et al., 2018;Savenije, 2009).

Overview of the modelling experiment
The adequacy of the rainfall and temperature datasets to plausibly reproduce various hydrological processes is tested with all the 102 possible combinations of 17 rainfall and 6 temperature datasets used as meteorological forcing (see Sect. 2.2). Different temperature datasets are used to allow flexibility in rainfall partitioning into evaporation and runoff because temperature is a key variable for the calculation of potential evaporation (Kirchner and Allen, 2020;Zheng et al., 2019;Van Stan et al., 2020). The hydrological model is recalibrated for each of the 102 combinations of rainfall-temperature datasets (Fig. 1).
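As a minimal sketch of how the 102 forcing combinations arise, the snippet below enumerates every rainfall-temperature pair. The rainfall names are those listed in the abstract; the temperature labels `T1` to `T6` are hypothetical placeholders, since the six temperature products are only named in Table 1, and the function name is illustrative rather than part of mHM.

```python
from itertools import product

# The 17 rainfall products named in the abstract of this study.
RAINFALL = [
    # satellite-based products (10)
    "TAMSAT", "CHIRPS", "ARC", "RFE", "MSWEP", "GSMaP", "PERSIANN-CDR",
    "CMORPH-CRT", "TRMM 3B42", "TRMM 3B42RT",
    # reanalysis products (7)
    "ERA5", "PGF", "EWEMBI", "WFDEI-GPCC", "WFDEI-CRU", "MERRA-2", "JRA-55",
]

# Assumption: placeholder labels standing in for the six temperature
# reanalysis datasets, which are listed in Table 1 of the paper.
TEMPERATURE = [f"T{i}" for i in range(1, 7)]


def forcing_combinations(rainfall, temperature):
    """Return every (rainfall, temperature) pair used as model forcing."""
    return list(product(rainfall, temperature))


combos = forcing_combinations(RAINFALL, TEMPERATURE)
print(len(combos))  # 17 x 6 = 102 combinations, one calibration run each
```

Each pair would correspond to one independent calibration run, which is what allows differences in performance to be attributed to the forcing rather than to a shared parameter set.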
The differences in the performance of model outputs are assumed to result from the propagation of the input data uncertainty through the model simulations (Nikolopoulos et al., 2010; Fallah et al., 2020). Uncertainties resulting from the hydrological model structure can be assumed to remain consistent for all the input datasets, and therefore they should not hinder the interpretation of the results, because only the parameters change during model calibration, not the model structure (Raimonet et al., 2017).

Modelling datasets
In addition to the meteorological datasets (Table 1), an ensemble of datasets is required for the set-up, calibration and evaluation of the hydrological model (Table 2). The streamflow datasets obtained from different organizations (see Acknowledgements) were pre-processed (i.e. gap-filling and quality control) in the work of Dembélé et al. (2019).
Multiple satellite datasets are used to evaluate the modelled hydrological fluxes and state variables. For the evaluation of the modelled water storages, the GRACE-derived terrestrial water storage ($S_t$) anomaly data release RL05 (Landerer and Swenson, 2012; Swenson, 2012) is used. The ensemble mean of different products from three processing centres (i.e. Jet Propulsion Laboratory, Center for Space Research at the University of Texas and Geoforschungszentrum Potsdam) is preferred because it is more effective in reducing noise in the Earth's gravity signal than the individual products (Sakumura et al., 2014). The surface soil moisture ($S_u$) data representing the first soil layer (i.e. 2-5 cm depth) are obtained from ESA CCI using the combination of both active and passive microwave products (W. Wagner et al., 2012). Actual evaporation ($E_a$) data are obtained from the GLEAM land surface model that aggregates components of terrestrial evaporation based on the fraction of land cover types per grid cell (Martens et al., 2017). A full description of the datasets is accessible through the references and web links provided in Tables 1 and 2.

Study area
The transboundary Volta River basin (VRB) covers approximately 415 600 km² (Fig. 2) shared among six countries of West Africa (i.e. Burkina Faso, Ghana, Togo, Mali, Benin and Côte d'Ivoire). The relief is predominantly flat with 95 % of the basin below 400 m a.s.l. (De Condappa and Lemoalle, 2009). The Volta River flows over 1850 km with a drainage system composed of four sub-basins known as Black Volta (152 800 km²), White Volta (113 400 km²), Oti (74 500 km²) and Lower Volta (74 900 km²). Before reaching the Atlantic Ocean at the Gulf of Guinea, the Volta River transits through Lake Volta (area: 8502 km²; volume: 148 km³), formed by the Akosombo Dam (7.94 × 10⁶ m³) (Williams et al., 2016; Dembélé et al., 2020b). The dominant land cover is savannah composed of grassland interspersed with shrubs and trees over 75 % of the basin area, followed by cropland (13 %), forest (9 %), waterbodies (2 %) and bare land and settlements (1 %).
Climate in West Africa is unique and complex (Berthou et al., 2019; Bichet and Diedhiou, 2018; Nicholson et al., 2018a). The seasonal and latitudinal oscillation of the Intertropical Convergence Zone (ITCZ) is the predominant rainfall generation mechanism in West Africa (Biasutti, 2019), which produces a south-north gradient of increasing aridity in the VRB. The ITCZ is a narrow belt of clouds associated with intense convective activity resulting from the near-surface convergence of warm and moist trade winds (Schneider et al., 2014; Dezfuli, 2017). The warm northeasterly Harmattan winds emanate from the Sahara, and the moist southwest monsoon winds originate in the Atlantic Ocean (Nicholson, 2013; Vizy and Cook, 2018). Rainfall in West Africa is characterized by its interannual and multidecadal variability (Biasutti et al., 2018; Thorncroft et al., 2011; Nicholson et al., 2018b). Four eco-climatic zones (i.e. Sahelian, Sudano-Sahelian, Sudanian and Guinean; Fig. 2a) are commonly identified based on the average annual precipitation and agricultural features (FAO/GIEWS, 1998; Mul et al., 2015). The maps of spatial patterns of rainfall and temperature in the VRB for different datasets are shown in Appendix Figs. A1 and A2. The climatologies of rainfall and temperature per climatic zone are provided in the Supplement (Figs. S11-S14).
The VRB is a data-scarce region, unlike regions in Europe and the USA where large amounts of ground measurements are widely and freely accessible. The few datasets collected by local organizations in the VRB are not easily accessible due to the transboundary nature of the basin, which is shared among six countries. Moreover, the VRB region has a low density of meteorological stations (cf. Fig. 1 of Dembélé and Zwart, 2016, and Fig. 1 of Satgé et al., 2020). A thorough evaluation of satellite and reanalysis datasets with ground measurements in the VRB cannot be limited to a few stations because of the large size of the basin and the strong spatial variability of rainfall. In addition, a robust ground evaluation would require independent in situ measurements that are not used in the development of the SRPs and reanalysis datasets (Beck et al., 2019a), and such independent measurements are a luxury in West Africa. These limitations in in situ data availability further motivate the hydrological evaluation of SRPs and reanalysis datasets.

Hydrological model set-up
The fully distributed mesoscale Hydrologic Model (mHM, version 5.9; Samaniego et al., 2010;Kumar et al., 2013) is used in this study. It is a conceptual model that simulates dominant hydrological processes (e.g. evaporation, soil moisture, subsurface storage and discharge) per grid cell in the modelling domain. The Muskingum-Cunge method (Cunge, 1969) is used for routing the total grid-generated runoff using a multiscale routing model (Thober et al., 2019). A multiscale parameter regionalization technique (MPR; Samaniego et al., 2017) is used to account for sub-grid variability of the basin physical characteristics (e.g. soil texture, topography and land cover). For this study, 36 global parameters are determined through model calibration (Table S24).
In this study, the Hargreaves and Samani method (Hargreaves and Samani, 1985), solely based on air temperature data, is used to calculate the reference evaporation ($E_{ref}$). Potential evaporation ($E_p$) is calculated by adjusting $E_{ref}$ to vegetation cover (Allen et al., 1998; Birhanu et al., 2019). A dynamical scaling function ($F_{DS}$) (cf. Demirel et al., 2018) is used to account for vegetation-climate interactions (Bai et al., 2018; Jiao et al., 2017). $E_p$ is formulated as follows:

$$E_p = F_{DS} \cdot E_{ref}, \tag{1}$$
$$F_{DS} = a + b \left(1 - e^{c \cdot I_{LA}}\right), \tag{2}$$

where $I_{LA}$ represents the leaf area index, $a$ is the intercept term, $b$ represents the vegetation-dependent component and $c$ describes the degree of non-linearity in the $I_{LA}$ dependency. The coefficients $a$, $b$ and $c$ are determined during model calibration. Actual evaporation ($E_a$, i.e. all evaporative fluxes including transpiration) depends on plant water availability, i.e. on root distribution in the subsurface and soil moisture availability (Feddes et al., 1976); this is emulated in mHM by computing $E_a$ as a fraction of $E_p$ at different soil layers. A multi-layer infiltration capacity approach is used to calculate soil moisture based on a three-layer soil scheme (5, 30 and 100 cm depths). As no snow occurs in the VRB, terrestrial water storage is calculated per grid cell by summing up the surface water storage on impervious areas and all subsurface water storage (i.e. reservoirs generating soil moisture, baseflow and interflow). The model is run at a daily time step with a spatial discretization of 0.25° (∼28 km at the Equator).
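The temperature-based evaporation chain described above can be illustrated with a short sketch. It assumes the commonly used FAO-56 form of the Hargreaves-Samani equation (coefficient 0.0023, extraterrestrial radiation computed from latitude and day of year) and the exponential scaling function of Demirel et al. (2018); it is not the mHM source code, and the function names and the coefficient values in the example are hypothetical.

```python
import math


def extraterrestrial_radiation(lat_deg, day_of_year):
    """Daily extraterrestrial radiation Ra in MJ m-2 day-1 (FAO-56 form).

    Assumption: valid away from polar latitudes, where the sunset hour
    angle is defined (this holds for the 5-15 deg N range of the VRB).
    """
    phi = math.radians(lat_deg)
    angle = 2.0 * math.pi * day_of_year / 365.0
    d_r = 1.0 + 0.033 * math.cos(angle)           # inverse relative Earth-Sun distance
    delta = 0.409 * math.sin(angle - 1.39)        # solar declination (rad)
    omega_s = math.acos(-math.tan(phi) * math.tan(delta))  # sunset hour angle (rad)
    g_sc = 0.0820                                 # solar constant (MJ m-2 min-1)
    return (24.0 * 60.0 / math.pi) * g_sc * d_r * (
        omega_s * math.sin(phi) * math.sin(delta)
        + math.cos(phi) * math.cos(delta) * math.sin(omega_s)
    )


def reference_evaporation(t_mean, t_min, t_max, ra):
    """Hargreaves-Samani reference evaporation in mm/day.

    Temperatures are in deg C; 0.408 converts MJ m-2 day-1 to mm/day.
    """
    t_range = max(t_max - t_min, 0.0)
    return 0.0023 * 0.408 * ra * (t_mean + 17.8) * math.sqrt(t_range)


def dynamic_scaling(lai, a, b, c):
    """Scaling function F_DS = a + b * (1 - exp(c * LAI))."""
    return a + b * (1.0 - math.exp(c * lai))


def potential_evaporation(e_ref, lai, a, b, c):
    """Potential evaporation E_p = F_DS * E_ref."""
    return dynamic_scaling(lai, a, b, c) * e_ref


# Illustrative values only: a grid cell at 10 deg N at the end of June.
ra = extraterrestrial_radiation(lat_deg=10.0, day_of_year=180)
e_ref = reference_evaporation(t_mean=28.0, t_min=22.0, t_max=34.0, ra=ra)
e_p = potential_evaporation(e_ref, lai=2.0, a=0.5, b=0.8, c=-0.5)
print(round(ra, 1), round(e_ref, 2), round(e_p, 2))
```

With a negative `c`, the scaling factor grows with leaf area index and saturates at `a + b`, so denser vegetation raises potential evaporation up to a ceiling. In the study, `a`, `b` and `c` are calibrated rather than prescribed.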
The modelling experiment covers the period 2000-2012 with a 3-year model warm-up period (2000-2002), 6 years for model calibration (2003-2008) and 4 years for model evaluation (2009-2012). The model is calibrated and evaluated with the available daily in situ streamflow datasets from 11 locations (Fig. 2a), while the evaluation with satellite datasets of evaporation, soil moisture and terrestrial water storage is done at a monthly time step to avoid the impact of mismatches in the daily data retrieval periods among the satellite data sources. An illustration of natural variability of streamflow (Fig. S16), precipitation (Figs. S1 and S5) and temperature (Figs. S3-S4 and S6-S8) is provided in the Supplement.

Multisite model calibration on streamflow data
A multisite calibration strategy is adopted by simultaneously constraining the model with the 11 streamflow ($Q$) gauging stations (Fig. 2) to infer a unique parameter set for the whole basin. The objective function $\Phi_Q$ combines the Nash-Sutcliffe efficiency (Nash and Sutcliffe, 1970) of streamflow ($E_{NS}$) and the Nash-Sutcliffe efficiency of the logarithm of streamflow ($E_{NSlog}$), and it is formulated such that it has to be minimized. In this formulation, $Q_{mod}$ and $Q_{obs}$ are the modelled and the observed streamflow, $t$ is the number of time steps of the calibration period, and $g$ is the number of streamflow gauging stations present within the modelling domain. $\Phi_Q$ is calculated with all the streamflow gauging stations, and it ranges from its ideal value of 0 to positive infinity. The model is calibrated solely with $Q$ data because it is the only available in situ measurement, and to avoid potential trade-offs of a multivariate calibration that would result in difficulties in identifying the source of variation in the model performance (i.e. input data vs. model parametrization) (Dembélé et al., 2020b). The parameter estimation is done with the dynamically dimensioned search algorithm (Tolson and Shoemaker, 2007) using 4000 iterations for each of the 102 rainfall-temperature dataset combinations.
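To make the ingredients of the multisite objective concrete, the sketch below computes the Nash-Sutcliffe efficiency and its logarithmic variant per gauge and then aggregates them into a single value to be minimized. The two efficiencies follow their standard definitions. The aggregation (mean over gauges of the Euclidean distance to the ideal point where both efficiencies equal 1) and the small offset `eps` are assumptions chosen for illustration, not necessarily the exact formulation used in the study.

```python
import math


def nse(obs, mod):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((m - o) ** 2 for o, m in zip(obs, mod))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst


def nse_log(obs, mod, eps=1e-6):
    """NSE of log-transformed flows, which emphasizes low flows.

    Assumption: eps is a small offset guarding against log(0).
    """
    log_obs = [math.log(o + eps) for o in obs]
    log_mod = [math.log(m + eps) for m in mod]
    return nse(log_obs, log_mod)


def multisite_objective(gauges):
    """Hypothetical aggregation over gauges; 0 is ideal, larger is worse.

    `gauges` is a list of (obs, mod) pairs, one pair per gauging station.
    """
    distances = [
        math.hypot(1.0 - nse(obs, mod), 1.0 - nse_log(obs, mod))
        for obs, mod in gauges
    ]
    return sum(distances) / len(distances)


# Two illustrative gauges with short daily series (m3/s).
gauge_a = ([10.0, 40.0, 90.0, 30.0], [12.0, 35.0, 80.0, 33.0])
gauge_b = ([2.0, 5.0, 9.0, 4.0], [2.5, 4.0, 10.0, 3.5])
print(multisite_objective([gauge_a, gauge_b]))
```

Like the objective described in the text, this sketch ranges from 0 (perfect fit at every gauge) to positive infinity, and a single parameter set is judged against all gauges at once.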

Multivariable model evaluation with streamflow and satellite data
In addition to $E_{NS}$ and $E_{NSlog}$, the Kling-Gupta efficiency ($E_{KG}$) (Kling et al., 2012) is used to evaluate the model performance for streamflow:

$$E_{KG} = 1 - \sqrt{(r_{KG} - 1)^2 + (\beta_{KG} - 1)^2 + (\gamma_{KG} - 1)^2}, \tag{4}$$

where $r_{KG}$ is the Pearson correlation coefficient, $\beta_{KG}$ is the bias term (i.e. the ratio of the means), and $\gamma_{KG}$ is the variability term (i.e. the ratio of the coefficients of variation) between $Q_{obs}$ and $Q_{mod}$. $E_{KG}$ ranges from negative infinity to its optimal value of unity. As a reference, $E_{KG} > -0.41$ indicates that the model is better than the mean observed flow.
In addition to $Q$, several non-commensurable and satellite-based variables are used for model evaluation (Table 2). The bias-insensitive Pearson's correlation coefficient ($r$) is used to assess the temporal dynamics of $S_t$, $S_u$ and $E_a$ because the model is not calibrated on these variables, and their evaluation datasets are satellite-derived products that encompass uncertainties and can be biased.
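The Kling-Gupta efficiency in the form of Kling et al. (2012) can be sketched as follows, using the three components named in the text: Pearson correlation, ratio of means and ratio of coefficients of variation. The helper names are illustrative, and the use of population standard deviations is an assumption (the ratio of the coefficients of variation is unaffected by that choice).

```python
import math


def _mean(x):
    return sum(x) / len(x)


def _std(x):
    # Population standard deviation (assumption; the CV ratio is unaffected).
    m = _mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    mx, my = _mean(x), _mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (_std(x) * _std(y))


def kge_components(obs, mod):
    """Return (r, beta, gamma): correlation, bias ratio, variability ratio."""
    r = pearson(obs, mod)
    beta = _mean(mod) / _mean(obs)
    gamma = (_std(mod) / _mean(mod)) / (_std(obs) / _mean(obs))
    return r, beta, gamma


def kge(obs, mod):
    """Kling-Gupta efficiency (Kling et al., 2012 form); 1 is optimal."""
    r, beta, gamma = kge_components(obs, mod)
    return 1.0 - math.sqrt((r - 1.0) ** 2 + (beta - 1.0) ** 2 + (gamma - 1.0) ** 2)


q_obs = [1.0, 2.0, 4.0, 8.0]
q_mod = [2.0 * q for q in q_obs]   # perfect timing, but flows doubled
print(kge_components(q_obs, q_mod), kge(q_obs, q_mod))
```

In this example the correlation and variability terms are ideal while the bias ratio is 2, so the score drops to about 0 purely because of volume bias. Inspecting the components separately is what allows the study to attribute uncertainty to bias, variability or dynamics.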
The spatial pattern representation of hydrological processes is assessed by using a bias-insensitive and multicomponent metric developed by Dembélé et al. (2020b). The proposed spatial pattern efficiency ($E_{SP}$) metric is formulated similarly to the $E_{KG}$ (Eq. 4), but it focuses only on the spatial pattern of variables rather than on their absolute values (like the SPAEF; Koch et al., 2018). $E_{SP}$ simultaneously assesses the dynamics, the spatial variability, and the locational matching of grid cells between the observed ($X_{obs}$) and modelled ($X_{mod}$) variables. Considering two variables $X_{obs}$ and $X_{mod}$ composed of $n$ cells, $E_{SP}$ is defined as follows:

$$E_{SP} = 1 - \sqrt{(r_s - 1)^2 + (\gamma - 1)^2 + (\alpha - 1)^2}, \tag{5}$$
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)}, \tag{6}$$
$$\gamma = \frac{\sigma_{mod} / \mu_{mod}}{\sigma_{obs} / \mu_{obs}}, \tag{7}$$
$$\alpha = 1 - E_{RMS}\left(Z_{X_{mod}}, Z_{X_{obs}}\right), \tag{8}$$

where $r_s$ is the Spearman rank-order correlation coefficient, with $d_i$ being the difference between the ranks of the $i$th cell of $X_{mod}$ and $X_{obs}$. $\gamma$ is the variability ratio (i.e. the ratio of the coefficients of variation) that assesses the similarity in the dispersion of the probability distributions of $X_{mod}$ and $X_{obs}$, with $\mu$ and $\sigma$ representing the mean and the standard deviation, and $\alpha$ is the spatial location matching term calculated from the root-mean-squared error ($E_{RMS}$) of the standardized values (z scores, $Z_X$) of $X_{mod}$ and $X_{obs}$ (Dembélé et al., 2020b). $E_{SP}$ ranges from negative infinity to 1, which is its optimal value. Like $E_{KG}$, $E_{SP}$ does not have an inherent benchmark. For $E_{SP} = 0$, the ranks of the observed and modelled variables are moderately related (i.e. $r_s = 0.55$), while no association among the ranks (i.e. $r_s = 0$) results in $E_{SP} = -0.67$ (cf. Supplement of Dembélé et al., 2020b). However, the main point of using $E_{SP}$ here is not to strictly conclude how well the modelled spatial patterns reproduce the observed patterns (which would require a benchmark; Schaefli and Gupta, 2007; Seibert et al., 2018) but rather to determine whether one modelled spatial pattern is better than another.
The spatial pattern evaluation is completed for $S_u$ and $E_a$, while only the temporal dynamics of $S_t$ are assessed due to the coarse spatial resolution of the GRACE data.
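The three components of the spatial pattern efficiency can be sketched for two gridded fields flattened into lists of cell values. This is an illustration under stated assumptions rather than the reference implementation of Dembélé et al. (2020b): ranks are computed without tie handling, population standard deviations are used, the location term is taken as one minus the root-mean-squared error of the z scores, and all function names are hypothetical.

```python
import math


def _mean(x):
    return sum(x) / len(x)


def _std(x):
    m = _mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))


def _ranks(x):
    """Ranks starting at 1; assumption: no tied values among grid cells."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0] * len(x)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def spearman(x, y):
    """Spearman rank correlation from squared rank differences (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(_ranks(x), _ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def _zscores(x):
    m, s = _mean(x), _std(x)
    return [(v - m) / s for v in x]


def spatial_pattern_efficiency(x_obs, x_mod):
    """E_SP sketch: rank correlation, CV ratio and z-score matching."""
    r_s = spearman(x_mod, x_obs)
    gamma = (_std(x_mod) / _mean(x_mod)) / (_std(x_obs) / _mean(x_obs))
    z_mod, z_obs = _zscores(x_mod), _zscores(x_obs)
    rmse = math.sqrt(_mean([(a - b) ** 2 for a, b in zip(z_mod, z_obs)]))
    alpha = 1.0 - rmse
    return 1.0 - math.sqrt((r_s - 1.0) ** 2 + (gamma - 1.0) ** 2 + (alpha - 1.0) ** 2)


# Five grid cells of soil moisture; the second field is three times too wet.
obs_field = [0.10, 0.25, 0.18, 0.40, 0.33]
biased_field = [3.0 * v for v in obs_field]
inverted_field = [1.0 - v for v in obs_field]
print(spatial_pattern_efficiency(obs_field, biased_field))
print(spatial_pattern_efficiency(obs_field, inverted_field))
```

A field that is uniformly three times too wet still scores close to 1 because ranks, coefficients of variation and z scores are unchanged by a multiplicative bias, whereas an inverted pattern is heavily penalized. This is the sense in which the metric targets pattern rather than magnitude.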
The relative variation in model performance is assessed with the second-order coefficient of variation ($V_2$) (Kvålseth, 2017). $V_2$ is an alternative to the classic Pearson's coefficient of variation (CV), which has significant limitations that are comprehensively discussed by Kvålseth (2017). The limitations of the CV include its difficult and non-intuitive interpretation because of the lack of an upper bound, its high sensitivity to outliers, its dependence on the sample mean and problems with negative values. For all sample data $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, with $\mathbb{R} = (-\infty, \infty)$, $V_2$ is defined as follows:

$$V_2 = \sqrt{\frac{s^2}{s^2 + \bar{x}^2}}, \tag{9}$$

where $s$ is the standard deviation and $\bar{x}$ is the mean of $x$. $V_2$ varies from 0 to 1 (or 0 % to 100 %) and represents the distance between $x$ and $\bar{x}$ relative to the distance between $x$ and the origin.
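A minimal sketch of the second-order coefficient of variation follows. It assumes the population variance (division by n), for which the distance interpretation given in the text holds exactly; the function name is illustrative.

```python
import math


def second_order_cv(x):
    """Second-order coefficient of variation V2 = sqrt(s^2 / (s^2 + mean^2)).

    Assumption: s^2 is the population variance. V2 is undefined when all
    values are zero, which is not handled here.
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return math.sqrt(var / (var + mean ** 2))


# Median scores of one rainfall product across six temperature datasets
# (made-up numbers for illustration only).
scores = [0.71, 0.73, 0.70, 0.72, 0.74, 0.71]
print(f"{100.0 * second_order_cv(scores):.1f} %")

# Unlike the classic CV, V2 stays bounded when the mean approaches zero.
print(second_order_cv([-1.0, 1.0]))
```

Because the result is bounded between 0 and 1, values computed for different variables or metrics can be compared directly, including metrics such as the spatial pattern efficiency whose mean can be close to zero or negative.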
The results are presented and discussed for the entire simulation period (2003-2012, i.e. combined calibration and evaluation periods) because reliable meteorological datasets are expected to produce a plausible representation of hydrological processes independently of the modelling period (Bisselink et al., 2016). Separate results are provided for the calibration and evaluation periods in the Supplement.

Model performance for streamflow
Similar model performance patterns are obtained with $E_{KG}$, $E_{NS}$ and $E_{NSlog}$ of daily streamflow ($Q$) (Fig. 3). Therefore, only $E_{KG}$ is retained for the description of the results. All input dataset combinations show a median $E_{KG} > 0.5$, except those having JRA-55 as rainfall input (Fig. 3), which can be explained by the coarse spatial resolution of that product. The ranking of the rainfall and temperature datasets based on the model performance for $Q$ is provided in Appendix Table A1. The analysis of model performance for $Q$ is done for the entire VRB and not per climatic zone due to the limited number of stations. As expected, the discrepancies in median $E_{KG}$ are more pronounced across rainfall datasets than across temperature datasets, as visible in the colour-coded ranking of the products in Fig. 3. The rank of a given rainfall product among all rainfall products hardly varies with the temperature product used. The overall stronger impact of the choice of the rainfall dataset on $E_{KG}$ of $Q$ also becomes clear from the $V_2$ of the median $E_{KG}$ (Table S3). For rainfall datasets, the $V_2$ across temperature datasets varies between 0.5 % for GSMaP-std and 4 % for JRA-55, with an average $V_2$ of 2 %. For temperature datasets, the $V_2$ of median $E_{KG}$ of $Q$ across rainfall datasets varies between 10 % for MERRA-2 and 12 % for ERA5, with an average $V_2$ of 11 %. This result suggests that the choice of rainfall dataset has a stronger impact on the $E_{KG}$ of $Q$ than the choice of a temperature dataset. The analysis of the components of $E_{KG}$ (i.e. the Pearson correlation $r_{KG}$, the bias $\beta_{KG}$ and the variation $\gamma_{KG}$) reveals that, when choosing a rainfall dataset, there is more uncertainty in the bias of $Q$ ($V_2$ = 14 %) than in its variability ($V_2$ = 6 %) and in its dynamics ($V_2$ = 3 %), which is in agreement with the work of Thiemig et al. (2013).
Detailed results on the performance for $Q$ (i.e. $E_{NS}$, $E_{NSlog}$, $E_{KG}$, $r_{KG}$, $\beta_{KG}$ and $\gamma_{KG}$) and the ranking of the datasets with separate results for the calibration and evaluation periods are provided in the Supplement (Tables S1-S18, Figs. S17-S26).

Model performance for terrestrial water storage
The model performance for the temporal dynamics of monthly terrestrial water storage ($S_t$) compared to the GRACE product is shown in Fig. 4 (see the Supplement for monthly time series, Figs. S38-S42). The average Pearson correlation coefficient ($r$) of $S_t$ for all datasets in the entire VRB is 0.80, with discrepancies across climatic zones. The driest and wettest climatic zones show the lowest performances, i.e. Sahelian ($r$ = 0.67) and Guinean ($r$ = 0.60) zones, compared to the intermediate climatic zones, i.e. Sudano-Sahelian ($r$ = 0.72) and Sudanian ($r$ = 0.79) zones. Table A1 provides the ranking of all the meteorological datasets for the model performance for $S_t$.
The rainfall datasets show different performances across climatic zones, with ARC showing the highest score for all the climatic zones except the Guinean zone, where CMORPH-CRT ranks first. The choice of the rainfall dataset leads to an average V 2 of 15 % for the r of S t , while the average V 2 is 5 % for the choice of the temperature dataset. Detailed results are provided in the Supplement (Table S19, Figs. S27-S37).

Model performance for soil moisture
Figure 5 shows the model performance for the temporal dynamics of monthly soil moisture (S u ) compared to the ESA CCI product (see the Supplement for monthly time series, Figs. S54-S58). The average r of S u for the entire VRB over all datasets is 0.93. The r of S u decreases from the drier to the wetter climatic zones: Sahelian (r = 0.94), Sudano-Sahelian (r = 0.94), Sudanian (r = 0.92) and Guinean (r = 0.86). The ranking of the meteorological datasets based on the model performance for S u is provided in Table A1. EWEMBI and WFDEI-GPCC show the highest performance in the Sahelian and Sudano-Sahelian zones respectively, while MERRA-2 shows the highest performance in the Sudanian and Guinean zones. The choice of the rainfall dataset leads to an average V 2 of 4 % for the temporal dynamics of S u , while the average V 2 is 2 % for the choice of the temperature dataset.
The spatial patterns of S u show considerable differences when using different combinations of rainfall and temperature input datasets, as illustrated in Fig. 6 (see similar maps for all the meteorological datasets in the Supplement, Figs. S59-S60). The south-north gradient of increasing aridity is not similarly spread among the rainfall-temperature dataset combinations. More interestingly, west-east differences in the spatial patterns of S u can be observed. These differences in spatial pattern reproduction can also be seen in the spatial pattern efficiency metric (E SP ) of S u for the 102 rainfall-temperature dataset combinations (Fig. 7). The average E SP of S u in the VRB over all datasets is −0.11.
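A bias-insensitive spatial comparison of the kind summarized by E SP can be sketched as follows. The sketch assumes a three-component form that combines the Spearman rank correlation, the ratio of the coefficients of variation, and one minus the root-mean-square error of the standardized (z-score) fields. The exact definition used in the study is given in its methods section, which is not reproduced here, so the formula, the function names and the toy grids below should be read as assumptions.

```python
import numpy as np

def _ranks(x):
    """Ordinal ranks of a 1-D array (ties are not averaged; a simplification)."""
    order = np.argsort(x)
    ranks = np.empty(x.size, dtype=float)
    ranks[order] = np.arange(x.size)
    return ranks

def spatial_pattern_efficiency(sim, obs):
    """Assumed three-component spatial pattern score for two gridded fields."""
    sim = np.asarray(sim, dtype=float).ravel()
    obs = np.asarray(obs, dtype=float).ravel()
    valid = np.isfinite(sim) & np.isfinite(obs)       # skip masked/NaN cells
    sim, obs = sim[valid], obs[valid]

    r_s = np.corrcoef(_ranks(sim), _ranks(obs))[0, 1]  # rank correlation
    gamma = (sim.std() / sim.mean()) / (obs.std() / obs.mean())  # CV ratio
    z_sim = (sim - sim.mean()) / sim.std()             # standardized fields
    z_obs = (obs - obs.mean()) / obs.std()
    alpha = 1.0 - np.sqrt(np.mean((z_sim - z_obs) ** 2))

    return 1.0 - np.sqrt((r_s - 1) ** 2 + (gamma - 1) ** 2 + (alpha - 1) ** 2)

# Toy 2 x 2 fields (hypothetical values).
obs_map = np.array([[0.1, 0.2], [0.3, 0.5]])
print(spatial_pattern_efficiency(1.5 * obs_map, obs_map))        # scaled field
print(spatial_pattern_efficiency(obs_map[::-1, ::-1], obs_map))  # reversed pattern
```

Under these assumptions, a field that differs from the reference only by a constant multiplicative factor keeps a perfect score, whereas a field with the correct values in the wrong locations is strongly penalized, which is why negative averages such as the one reported above are possible.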
For the entire VRB, the choice of the rainfall dataset leads to an average variation of 61 % for the E SP of S u , while the choice of the temperature dataset involves a variation of 45 %. Lower impacts of data choices are observed in the climatic zones where the climate is homogeneous as compared to the entire VRB. The choice of rainfall dataset is more critical for the E SP of S u in the driest and wettest climatic zones,

Model performance for actual evaporation
The model performance for the temporal dynamics of monthly actual evaporation (E a ) compared to the GLEAM product is shown in Fig. 8 (see the Supplement for monthly time series, Figs. S72-S76). The average r of E a for the entire VRB over all datasets is 0.93. Similarly to S u , the r of E a is higher in the drier climatic zones: Sahelian (r = 0.94), Sudano-Sahelian (r = 0.94), Sudanian (r = 0.89) and Guinean (r = 0.81). However, the predictive skill of the model for the temporal dynamics of S u is higher than its predictive skill for E a in the wetter climatic zones. Table A1 shows the ranking of all the meteorological datasets for the model performance for E a . The rainfall datasets show different performances across climatic zones, with the following best datasets: PERSIANN-CDR in the Sahelian zone, EWEMBI and WFDEI-GPCC in the Sudano-Sahelian zone, and ARC in the Sudanian and Guinean zones. The choice of the rainfall dataset leads to an average V 2 of 4 % for the temporal dynamics of E a , while the average V 2 is 1.5 % for the choice of the temperature dataset.
As for S u , the choice of input datasets has a considerable impact on the reproduction of the spatial patterns of E a (Fig. 9). Similar maps for all the meteorological datasets are provided in the Supplement (Figs. S77-S78). It can be observed that different rainfall-temperature combinations used to force the model result in large discrepancies in the spatial pattern of E a , especially in the southern region. The south-north gradient of increasing aridity with west-east differences is represented differently among the rainfall-temperature dataset combinations (see e.g. the difference between the first two columns of the first row in Fig. 9). The E SP of E a for the 102 rainfall-temperature dataset combinations in the VRB is given in Fig. 10. The average E SP of E a in the VRB over all datasets is 0.07, which is higher than for S u (E SP = −0.11).
The choice of the rainfall dataset for the VRB affects the E SP of E a on average by 93 %, while the choice of the temperature dataset involves a variation of 33 %. However, lower impacts of data choices are observed in the climatic zones. The choice of rainfall dataset is more critical for the E SP of E a in the driest and wettest climatic zones, i.

Discussion
This study builds upon and expands existing research on the evaluation of meteorological datasets in several ways: (i) the evaluation of the spatial patterns of multiple hydrological processes (i.e. streamflow, actual evaporation, soil moisture and terrestrial water storage) in addition to the more classically evaluated temporal dynamics; (ii) the evaluation of a large number of both satellite-based and reanalysis rainfall datasets considered in combination with different temperature datasets; and (iii) the assessment of the model performance across four considerably different climatic zones from semi-arid to sub-humid.
The overall outcome of this analysis is the ranking of all the meteorological datasets based on their ability to simulate various hydrological processes across different climatic zones in the VRB (Table A1). It is worth noting that the overall ranking shows which product is best or worst at simulating a given hydrological flux or state variable. However, the ranking does not systematically tell whether a dataset is good or bad. Only the skill scores can be used to make a judgement on the adequacy of a given dataset to produce plausible model outputs.
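The distinction between rank and skill can be made concrete with a small sketch. It averages a matrix of skill scores over the temperature datasets for each rainfall product, in the spirit of Table A1, and then ranks the products. The score values, the generic product layout and the function name are hypothetical and serve only to show that a top-ranked product can still have a low score.

```python
import numpy as np

def rank_by_mean_score(scores, axis):
    """Average scores along `axis` and rank the result (1 = best, highest mean)."""
    mean_scores = np.asarray(scores, dtype=float).mean(axis=axis)
    order = np.argsort(-mean_scores)              # indices from best to worst
    ranks = np.empty(order.size, dtype=int)
    ranks[order] = np.arange(1, order.size + 1)
    return mean_scores, ranks

# Hypothetical skill scores: rows = 3 rainfall products, columns = 2 temperature products.
scores = np.array([[0.10, 0.20],
                   [-0.30, -0.10],
                   [0.05, 0.15]])

rain_mean, rain_rank = rank_by_mean_score(scores, axis=1)  # average over temperature
temp_mean, temp_rank = rank_by_mean_score(scores, axis=0)  # average over rainfall
print(rain_mean, rain_rank)
print(temp_mean, temp_rank)
```

Here the first rainfall product ranks first with a mean score of only 0.15, far from the ideal value of 1 for an efficiency-type metric, so the rank alone says nothing about whether the simulation is adequate.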
The results show that there is no single rainfall dataset outperforming the others in reproducing all hydrological processes across different climatic zones. These findings align with previous studies in the sense that there is no rainfall dataset that is the best everywhere (Beck et al., 2017b; Sylla et al., 2013). For datasets providing both rainfall and temperature data, the combination of the two variables as model input is not necessarily the best option for obtaining the highest performance in modelling a given hydrological state or flux variable. The best rainfall-temperature combinations for the spatio-temporal representation of each hydrological flux and state variable are provided in the Supplement (Fig. S15).
The results are primarily valid for the study region in West Africa, and any wider generalization of the findings should be made with caution and only after repeating similar evaluation studies elsewhere. Nevertheless, the key message is that there is no single best rainfall dataset for all hydrological processes and that the best rainfall dataset for temporal dynamics might not be the best for spatial patterns. Therefore, different rainfall datasets should be evaluated before choosing the most suitable one for hydrological modelling in large catchments.
Moreover, when comparing the results of this study to the findings of Satgé et al. (2020) based on a point-to-pixel evaluation of gridded rainfall datasets in West Africa, it is noticeable that the ground evaluation might lead to different results as compared to the hydrological evaluation adopted in the current study. The skill of a rainfall product in reproducing ground measurements well under a point-to-pixel evaluation does not necessarily correlate with its performance for hydrological modelling, particularly in large and complex hydroclimatic environments such as the VRB.
Despite the efforts to produce a comprehensive evaluation of the meteorological datasets, the results obtained might be subject to uncertainties related to potential model structural deficiencies as well as errors in the observational datasets used for the model evaluation (McMillan et al., 2010; Renard et al., 2010; Gupta and Govindaraju, 2019). The distribution of the final model parameters (Figs. S79-S80) highlights the possibility of obtaining equally good model performances for different parameter sets (i.e. equifinality), which can be a justification for model recalibration. Moreover, it can be noticed that most of the model parameters are sensitive to the change in meteorological input datasets (Fig. S79). A detailed analysis of parameter variability as a function of input data is beyond the scope of the current study but could form the basis of future research, namely to identify data errors by analysing parameter patterns (e.g. rooting depth) and to resolve potential structural deficiencies of mHM. However, mHM is chosen because of its adequacy for the experiment of this study (for model selection, see Addor and Melsen, 2019). The structure of mHM allows the representation of seamless spatial patterns of hydrological processes through the MPR scheme. In addition, mHM facilitates parameter regionalization and is therefore convenient for large-scale modelling, and it harnesses the full potential of the forcing datasets as it is a fully distributed model that has performed well in previous studies including those in the VRB (e.g. Poméon et al., 2018; Dembélé et al., 2020b). Regarding the model evaluation, except for streamflow, the comparison between the observed and modelled hydrological processes is made only with regard to their temporal dynamics and spatial patterns using bias-insensitive metrics, which limits the potential impact of satellite data uncertainty.
The model is calibrated only on Q data despite the known limitations of the Q-only calibration. However, calibrating the model on additional variables would result in additional model performance improvement that would not be separable from the contribution of the input datasets to the model performance. Therefore, regarding the goal of this study, the Q-only calibration was the best option to isolate the impact of various meteorological forcing datasets on the plausibility of hydrological processes. As no rainfall dataset ranks first in simulating all the hydrological processes, this study confirms that model calibration on multiple variables is a way forward in improving the overall representation of the hydrological system and increasing the predictive skill of hydrological models (Dembélé et al., 2020b; Dembélé et al., 2020a). The domain-wide calibration strategy adopted in this study generates a unique parameter set for the simulation of multiple hydrological processes across several catchments with different hydroclimatic features, which results in local differences in model performance. However, domain-wide calibration has proved to perform similarly to domain-split calibration in previous studies (Mizukami et al., 2017), and it was ideal for this study because of the interest in simulating seamless spatial patterns, which might not have been possible with separately simulated portions of the basin. Moreover, the main goal of this study is to assess the adequacy of the meteorological datasets for large-scale hydrological modelling, knowing that these datasets usually have a coarse spatial resolution with pixels often averaged over regions with strong sub-grid variability.
Finally, the importance of regional evaluation is emphasized by this study because some region-tailored datasets (e.g. TAMSAT and ARC) which are not included in global-scale studies (e.g. Beck et al., 2017b; Essou et al., 2016) outperform global datasets. The decision to use a given dataset is motivated not only by the availability or the accuracy of the data but also by data accessibility (e.g. storage platforms, openness, format and pre-processing requirement). The findings of this study provide further awareness for the data users and improvement avenues for data producers in their quest for the most accurate products (e.g. Massari et al., 2020; Contractor et al., 2020; Berg et al., 2018; Brocca et al., 2014; Cucchi et al., 2020; Beck et al., 2017a).

Conclusion
This modelling study evaluates the ability of multiple combinations of rainfall-temperature datasets to reproduce plausible hydrological processes and patterns. The experiment is done in the Volta River basin with the fully distributed mesoscale Hydrologic Model (mHM) over a 10-year period (2003-2012), using 17 rainfall and 6 temperature datasets from satellite and reanalysis sources. The spatial and temporal representation of streamflow, terrestrial water storage, soil moisture and actual evaporation are evaluated using in situ and satellite remote-sensing observational datasets. The key findings are as follows:
- No rainfall dataset consistently outperforms all the others in producing the highest model performance for all hydrological processes, and the best dataset for the temporal dynamics is not necessarily the best for the spatial patterns.
- Rainfall datasets have a higher impact on the spatio-temporal representation of hydrological processes than temperature datasets, but the latter have a greater influence on the spatial patterns of soil moisture.
- The large-scale performance of the meteorological datasets is not always valid for sub-regions in the same basin.
The findings of this study give a critical insight into the performance of several meteorological datasets in the challenging hydroclimatic environment of West Africa. They are expected to foster further research initiatives on improving the gridded meteorological datasets and further draw users' attention to the contrasting performances of these datasets in modelling hydrological fluxes and state variables. Efforts should be devoted to reporting on the impact of data uncertainties on process representation in hydrological modelling, especially when model outputs are used for decision-making. Future studies can test the transferability of the model's global parameters across different input datasets, i.e. how reliable a parameter set obtained with a given input dataset is for running the same model with a different input dataset. The answer to this research question will shed light on the necessity of model recalibration when using different meteorological forcing. Furthermore, the predictive skill of the model can be improved with a parameter sensitivity analysis to determine parameters that affect the spatio-temporal representation of each hydrological flux and state variable.

Table A1. Model performance for streamflow (Q), terrestrial water storage (S t ), soil moisture (S u ) and actual evaporation (E a ) using various rainfall-temperature dataset combinations as model inputs. Each score for a given rainfall product represents the average over individual combinations with 6 temperature datasets, while each score for a given temperature dataset is the average over combinations with 17 rainfall datasets. The skill scores of the temporal dynamics are obtained with the Kling-Gupta efficiency (E KG ), the Nash-Sutcliffe efficiency (E NS ) and the Nash-Sutcliffe efficiency of the logarithm (E NSlog ) for Q, and the Pearson's correlation coefficient (r) for S t , S u and E a . The spatial pattern efficiency (E SP ) is used to assess the spatial representation of S u and E a . The skill scores are ranked from the best (blue) to the worst (red). The results are shown for the four climatic zones in the Volta River basin (VRB) over the simulation period (2003-2012).
Data availability. The meteorological and modelling datasets used in this study are freely available via the web links provided in Table 1 and Table 2. More information on satellite-based precipitation datasets can be found at http://ipwg.isac.cnr.it/ (last access: 10 December 2019) (IPWG, 2019). The modelling database is available at https://doi.org/10.5281/zenodo.3662308 (Dembélé, 2020).
Author contributions. MD performed the analyses and drafted the manuscript. All authors contributed to the writing, review and editing process that led to the final manuscript.
Competing interests. The authors declare that they have no conflict of interest.
Acknowledgements. We thank the providers of the datasets used in this study (see Tables 1 and 2).
Review statement. This paper was edited by Albrecht Weerts and reviewed by Nadav Peleg and one anonymous referee.