Suitability of 17 rainfall and temperature gridded datasets for large- scale hydrological modelling in West Africa

This study evaluates the ability of different gridded rainfall datasets to plausibly represent the spatiotemporal 10 patterns of multiple hydrological processes (i.e. streamflow, actual evaporation, soil moisture and terrestrial water storage) for large-scale hydrological modelling in the predominantly semi-arid Volta River Basin (VRB) in West Africa. Seventeen precipitation products based on satellite data (TAMSAT, CHIRPS, ARC, RFE, MSWEP, GSMaP, PERSIANN-CDR, CMORPH-CRT, TRMM 3B42, TRMM 3B42RT) and on reanalysis (JRA-55, EWEMBI, WFDEI-GPCC, WFDEI-CRU, MERRA-2, PGF and ERA5) are compared as input for the fully distributed mesoscale Hydrologic Model (mHM). To assess 15 the model sensitivity to meteorological forcing during rainfall partitioning into evaporation and runoff, six different temperature reanalysis datasets are used in combination with the precipitation datasets, which results in evaluating 102 combinations of rainfall-temperature input data. The model is recalibrated for each of the 102 input combinations, and the model responses are evaluated by using in-situ streamflow data and satellite remote sensing datasets from GLEAM evaporation, ESA CCI soil moisture, and GRACE terrestrial water storage. A bias-insensitive metric is used to assess the 20 impact of meteorological forcing on the simulation of the spatial patterns of hydrological processes. The results of the processbased evaluation show that the rainfall datasets have contrasting performances across the four climatic zones present in the VRB, suggesting that, in general, basin-wide hydrological model performance might be misleading and invalid for a smaller spatial domain. No single rainfall or temperature dataset consistently ranks first in reproducing the spatiotemporal variability of all hydrological processes. A dataset that is best in reproducing the temporal dynamics is not necessarily the best for the 25 spatial patterns. In addition, the results suggest that there is more uncertainty in representing the spatial patterns of hydrological processes than their temporal dynamics. Finally, some region-tailored datasets outperform the global datasets, thereby stressing the necessity and importance of regional evaluation studies for satellite and reanalysis meteorological datasets.

Keywords: Precipitation; Atmospheric forcing; Hydrological consistency; Process-based evaluation; Data uncertainty propagation; Ungauged basins; Data scarce regions 1 Introduction 35 Our understanding of environmental systems is underpinned by observational data whose unavailability and uncertainties hinder research and operational applications. Among other factors, atmospheric data quality is of prime importance for the reliability of hydro-meteorological and climatological studies (Ledesma and Futter, 2017;Zandler et al., 2019). Precipitation is one of the major components of the water cycle, which has led to numerous initiatives on understanding its generation, and estimating its amount and variability on Earth (Maidment et al., 2015;Cui et al., 2019). In hydrological modelling (Singh, 40 2018;Beven, 2019), precipitation is the most important driver variable that determines the spatiotemporal variability of other hydrological fluxes and state variables (Thiemig et al., 2013;Bárdossy and Das, 2008).
With the development of distributed hydrological models that facilitate large-scale predictions Fatichi et al., 2016;Ocio et al., 2019), there is a growing need to inform and evaluate those models with distributed observational datasets to improve spatiotemporal process representation (Baroni et al., 2019;Paniconi and Putti, 2015;Hrachowitz and Clark, 2017). 45 A key challenge is the spatiotemporal intermittency of precipitation, which is a major challenge for its measurement and its spatial interpolation (Tauro et al., 2018;Acharya et al., 2019;Bárdossy and Pegram, 2013;Wagner et al., 2012a), especially in regions with particular features such as complex topography, convection-driven precipitation or snowfall occurrence. A comprehensive description of precipitation measurement techniques can be found in previous studies (e.g. Tapiador et al., 2012;Stephens and Kummerow, 2007;Kidd and Huffman, 2011;Levizzani et al., 2020). The drawbacks of in-situ 50 measurements of precipitation include limited and uneven areal coverage, deficiencies in instruments and costly maintenance Awange et al., 2019;Harrison et al., 2019), and have led to the advent of precipitation estimation from space (Barrett and Martin, 1981). Precipitation estimates from space are spatially homogeneous and cover inaccessible regions with uninterrupted records over time (Beck et al., 2019b;Funk et al., 2015).
The advent of satellite-based rainfall products (SRPs) has opened up new avenues for water resources monitoring and 55 prediction, especially in data sparse regions (Serrat-Capdevila et al., 2014;Sheffield et al., 2018;Hrachowitz et al., 2013).
Although, the use of SRPs in hydrology is increasing (Xu et al., 2014;Chen and Wang, 2018), they have not been fully adopted for operational purposes yet (Ciabatta et al., 2016;Kidd and Levizzani, 2011). The limited uptake of SRPs in hydrology is due to measurement bias, inadequate spatiotemporal resolutions (e.g. for extreme event simulation) and shortness of the records for some applications (e.g., climate change impact assessments), and the skepticism of some potential users with regard to the 60 data quality (Marra et al., 2019). In the past decades, a large number of SRPs have been developed with different objectives, spatial and temporal resolutions, input sources, algorithms and acquisition methods Ashouri et al., 2015;Brocca et al., 2019). Several studies provide a review of SRPs (e.g. Maidment et al., 2014;Sun et al., 2018;Maggioni et al., 2016;Le Coz and van de Giesen, 2019).
In addition to SRPs, there are also atmospheric retrospective analysis (or reanalysis) datasets of precipitation. A reanalysis 65 system is composed of a forecast model and a data assimilation scheme that integrates spatiotemporal observations of meteorological variables (i.e. temperature, humidity, wind and pressure) to generate gridded atmospheric data (Lorenz and Kunstmann, 2012;Schröder et al., 2018). Precipitation is one of the reanalysis model-generated fields that generally has more uncertainties than the meteorological state fields (Roca et al., 2019). Reanalysis datasets are often used in hydrological modelling (Tang et al., 2019;Duan et al., 2019;Gründemann et al., 2018), and sometimes they are preferred over SRPs because 70 of their usually long-term records suitable for climate change studies, and because of their higher performance in predictable large-scale stratiform systems (Seyyedi et al., 2015;Potter et al., 2018).
Despite the progress in satellite instruments, which has led to substantial advances in improving precipitation estimates Tang et al., 2019), there are known inconsistencies among the available SRPs (Sun et al., 2018;Tapiador et al., 2017). SRPs are subject to inherent errors originating mainly from precipitation retrieval instruments and 75 algorithms, sampling frequency, and inadequate representation of cloud physics in some regions (Laiti et al., 2018;Alazzy et al., 2017;Romilly and Gebremichael, 2011). While on the one hand SRPs are subject to systematic biases, reanalysis products on the other hand have uncertainties resulting from their model forcing parameters, low spatial resolution with poor representation of sub-grid processes, and the model physics (Bosilovich et al., 2008;Laiti et al., 2018). Uncertainty quantification both in SRPs and reanalysis data is subject to intense research (e.g. Maggioni et al., 2016;Gebremichael, 80 2010;Awange et al., 2016;Westerberg and Birkel, 2015). The errors quantification of SRPs and reanalysis products is usually done by comparing them with in-situ measurements (e.g. Dembélé and Zwart, 2016;Thiemig et al., 2012;Beck et al., 2019a;Caroletti et al., 2019;Satgé et al., 2020), or by assessing their reliability as forcing for hydrological models (e.g. Duethmann et al., 2013;Pan et al., 2010;Nkiaka et al., 2017). Other evaluation approaches include triple collocation, which is a technique that estimates the variance of unknown errors of three independent variables without a reference or observed 85 variable (e.g. Massari et al., 2017;Alemohammad et al., 2015;McColl et al., 2014;Roebeling et al., 2012). Compared to the ground-truthing approach, the hydrological evaluation approach has received limited attention (Camici et al., 2018;Poméon et al., 2017).
In rainfall-runoff modelling (Peel and McMahon, 2020), the non-linearity of hydrological processes (Blöschl and Zehe, 2005;Clark et al., 2009) can reduce or amplify the errors in the used input rainfall data and result in a satisfactory or poor 90 representation of the hydrological responses (Maggioni and Massari, 2018;Nijssen, 2004). Consequently, the hydrological model can give a good representation of a hydrological state or flux variable for the wrong reasons (cf. Kirchner, 2006), thereby potentially leading to unfortunate consequences for water resources management (Zambrano-Bigiarini et al., 2017). When testing models as hypotheses (Beven, 2018;Pfister and Kirchner, 2017), type I errors (i.e. false positive model acceptability; Beven, 2010) should be avoided to ensure a high predictive skill of the model and its correctness for good decision-making. 95 This sheds light on the importance of assessing the reliability of hydrological predictions generated with the use of SRPs and reanalysis products (Behrangi et al., 2011;Kuczera et al., 2010). In this context, knowing the adequacy and coherence of meteorological data in reproducing hydrological processes is a prerequisite to data selection for water resources management (Casse et al., 2015;Laiti et al., 2018).
In the context of hydrological evaluation of precipitation datasets, some limitations can be identified in previous studies. Some 100 studies only evaluate a small number of precipitation datasets or do not consider reanalysis products (e.g. Bitew and Gebremichael, 2011;Ma et al., 2018;Liu et al., 2017;Bhattacharya et al., 2019). Usually, the influence of temperature datasets in combination with rainfall datasets is not tested (e.g. Satgé et al., 2019;Camici et al., 2018;Casse et al., 2015;Qi et al., 2016;, with the exception of a few studies (e.g. Laiti et al., 2018;Lauri et al., 2014), despite the importance of this interaction for evaporation simulation. Most studies evaluate a single hydrological state or flux variable, generally 105 streamflow (e.g. Poméon et al., 2017;Seyyedi et al., 2015;Shayeghi et al., 2020;Li et al., 2012b), or soil moisture (e.g. Brocca et al., 2013). Some studies use lumped or semi-distributed models, therefore averaging the rainfall amount on large areas (e.g. Duan et al., 2019;Tang et al., 2019;Tobin and Bennett, 2014;Gosset et al., 2013;Shawul and Chakma, 2020), which reduces the bias effect that could occur at the pixel level with a fully distributed model. Often, the model is not recalibrated for each precipitation dataset (e.g. Voisin et al., 2008;Su et al., 2008;Li et al., 2012a;Tramblay et al., 2016), which is, however, a 110 prerequisite for reliable input field assessment (Stisen et al., 2012). Moreover, some studies perform a global-scale analysis and ignore regionally tailored products (e.g. Beck et al., 2017b;Mazzoleni et al., 2019;Fekete et al., 2004), which can outperform global products (e.g. Thiemig et al., 2013). Finally, to the best of our knowledge, no study evaluated the simultaneous impact of various precipitation and temperature datasets on the spatial patterns of several hydrological processes (i.e. soil moisture and evaporation). 115 In light of the above, we propose to study the adequacy of different combinations of 17 precipitation datasets (10 SRPs and 7 reanalysis products) and 6 temperature datasets from reanalysis, when used as forcing data for a fully distributed hydrological model, in reproducing the spatiotemporal variability of multiple hydrological processes (i.e. streamflow, actual evaporation, soil moisture, and terrestrial water storage). In total, 102 rainfall-temperature input data combinations are tested with the mesoscale Hydrologic Model (mHM) by recalibrating the model for each of the input data combinations. The experiment is 120 carried out in the poorly gauged and predominantly semi-arid Volta River Basin (VRB) located in West Africa, over the period 2003-2012. It is noteworthy that the goal of this study is not to estimate the intrinsic quality of the meteorological forcing (i.e. precipitation and temperature) but rather to understand the impact of the propagation of associated uncertainties on the simulation of hydrological processes (Bhuiyan et al., 2019;Falck et al., 2015;Marthews et al., 2020).
The VRB case study is particularly interesting from both scientific and societal perspectives. On the one hand, precipitation 125 modelling in tropical monsoon climates is a challenging task due to strong seasonality and diurnal variations of rainfall (Turner et al., 2011;Pfeifroth et al., 2016;Cook and Vizy, 2019), and due to isolated convection systems in semi-arid regions (Taylor et al., 2017;Mathon et al., 2002;Parker and Diop-Kane, 2017). On the other hand, open access and good quality datasets are needed for water resources management in West Africa (Roudier et al., 2014;Serdeczny et al., 2017;Di Baldassarre et al., 2010;Dinku, 2019). The following research questions are addressed: 130 1) What is the impact of different gridded rainfall and temperature datasets on the simulation of hydrological fluxes and state variables?
2) How important is the choice of meteorological datasets for the representation of spatial patterns versus temporal dynamics?
Overall, the objective of this work aligns with the efforts in solving the current scientific challenges in hydrology (i.e. 135 uncertainty in large-scale measurements and data, spatial heterogeneity and modelling methods; Blöschl et al., 2019;Wilby, 2019). Moreover, a growing interest in using satellite remote sensing data in hydrological modelling is expected (McCabe et al., 2017;Peters-Lidard et al., 2017;Wilkinson et al., 2016). Therefore, knowing the suitability of the input data for hydrological modelling is a prerequisite for reliable spatiotemporal predictions, as the goal is to increase model performance with minimum uncertainty (Beven, 2016;McMillan et al., 2018;Savenije, 2009). 140 2 Methodology

Overview of the modelling experiment
The adequacy of the rainfall and temperature datasets to plausibly reproduce various hydrological processes is tested with all the 102 possible combinations of 17 rainfall and 6 temperature datasets used as meteorological forcing (see section 2.2).
Different temperature datasets are used to allow flexibility in rainfall partitioning into evaporation and runoff because 145 temperature is a key variable for the calculation of potential evaporation (Kirchner and Allen, 2020;Zheng et al., 2019;Van Stan et al., 2020). The hydrological model is recalibrated for each of the 102 combinations of rainfall-temperature datasets ( Figure 1). The differences in the performance of model outputs are assumed to result from the propagation of the input data uncertainty through the model simulations (Nikolopoulos et al., 2010;Fallah et al., 2020). In case of uncertainties resulting from the hydrological model structure, these uncertainties can be assumed to remain consistent for all the input datasets and therefore it should not hinder the interpretation of the results, because only the parameters change during model calibration and not the 155 model structure (Raimonet et al., 2017).
170 Table 1. Meteorological datasets with used spatial resolution; the table presents the characteristics of the datasets used in this study, although different spatial and temporal resolutions can be available from the data providers. G: gauge, S: satellite, R: reanalysis, P: precipitation, T: temperature, NP: near-present.

Datasets
Name

Modelling datasets
In addition to the meteorological datasets (Table 1), an ensemble of datasets is required for the set-up and the calibration and evaluation of the hydrological model ( Table 2). The streamflow datasets obtained from different organizations (see acknowledgements) were pre-processed (i.e. gap-filling and quality control) in the work of Dembélé et al. (2019).
Multiple satellite datasets are used to evaluate the modelled hydrological fluxes and state variables. For the evaluation of the 180 modelled water storages, the GRACE-derived terrestrial water storage (St) anomaly data release RL05 (Landerer and Swenson, 2012;Swenson, 2012) is used. The ensemble mean of different products from three processing centers (i.e. Jet Propulsion Laboratory, Center for Space Research at University of Texas, and Geoforschungs Zentrum Potsdam) is preferred because it is more effective in reducing noise in the Earth's gravity signal as compared to the individual products (Sakumura et al., 2014).
The surface soil moisture (Su) data representing the first soil layer (i.e. 2-5 cm depth) is obtained from ESA CCI (Dorigo et al.,185 2017) using the combination of both active and passive microwave products (Gruber et al., 2017;Wagner et al., 2012b). Actual evaporation (Ea) data is obtained from the GLEAM land surface model that aggregates components of terrestrial evaporation based on the fraction of land cover types per grid cell (Martens et al., 2017). A full description of the datasets is accessible through the references and web links provided in Table 1 and Table 2. 190  composed of four sub-basins known as Black Volta (152,800 km 2 ), White Volta (113,400 km 2 ), Oti (74,500 km 2 ), and Lower Volta (74,900 km 2 ). Before reaching the Atlantic Ocean at the Gulf of Guinea, the Volta River transits in the Lake Volta (area: 8,502 km 2 ; volume: 148 km 3 ) formed by the Akosombo dam (7.94 10 6 m 3 ) (Williams et al., 2016;Dembélé et al., 2020b). The dominant land cover is savannah composed of grassland interspersed with shrubs and trees over 75% of the basin area, followed by cropland (13%), forest (9%), water bodies (2%) and bare land and settlements (1%). Climate in West Africa is unique and 205 complex (Berthou et al., 2019;Bichet and Diedhiou, 2018;Nicholson et al., 2018a). The seasonal and latitudinal oscillation of the Inter-Tropical Convergence Zone (ITCZ) is the predominant rainfall generation mechanism in West Africa (Biasutti, 2019), thereby depicting a south-north gradient of increasing aridity in the VRB. The ITCZ is a narrow belt of clouds associated with intense convective activity resulting from the near-surface convergence of warm and moist trade winds (Schneider et al., 2014;Dezfuli, 2017). The warm northeasterly Harmattan winds emanate from the Sahara and the moist southwest monsoon 210 winds originate in the Atlantic ocean (Nicholson, 2013;Vizy and Cook, 2018). Rainfall in West Africa is characterized by its interannual and multidecadal variability (Biasutti et al., 2018;Thorncroft et al., 2011;Nicholson et al., 2018b). Four ecoclimatic zones (i.e. Sahelian, Sudano-Sahelian, Sudanian and Guinean; Table 3) are commonly identified based on the average annual precipitation and agricultural features (FAO/GIEWS, 1998;Mul et al., 2015). The aridity index in Table 3 is derived from the global aridity index database (Trabucco and Zomer, 2018). The maps of spatial patterns of rainfall and temperature 215 in the VRB for different datasets are shown in Appendix A1 and Appendix A2. The climatology of rainfall and temperature per climatic zones are provided in the Supporting Information (SI, Figures S11-S14).

Hydrological Model Setup
The fully distributed mesoscale Hydrologic Model (mHM, version 5.9; Samaniego et al., 2010; is used in 225 this study. It is a conceptual model that simulates dominant hydrological processes (e.g. evaporation, soil moisture, subsurface storage, and discharge) per grid cell in the modelling domain. The Muskingum-Cunge method (Cunge, 1969) is used for routing the total grid-generated runoff using a multiscale routing model (Thober et al., 2019). A multiscale parameter regionalization technique (MPR; Samaniego et al., 2017) is used to account for sub-grid variability of the basin physical characteristics (e.g. soil texture, topography and land cover). For this study, 36 global parameters are determined through 230 model calibration (Table S24 in the Supporting Information).
In this study, the Hargreaves and Samani method (Hargreaves and Samani, 1985), solely based on air temperature data, is used to calculate the reference evaporation (Eref). Potential evaporation (Ep) is calculated by adjusting Eref to vegetation cover (Allen et al., 1998;Birhanu et al., 2019). A dynamical scaling function (FDS) (cf. Demirel et al., 2018) is used to account for vegetationclimate interactions (Bai et al., 2018;Jiao et al., 2017). Ep is formulated as follows: 235 where ILA represents the leaf area index, a is the intercept term, b represents the vegetation dependent component, and c describes the degree of nonlinearity in the ILA dependency. The coefficients a, b, and c are determined during model calibration.
Actual evaporation (i.e. all evaporative fluxes including transpiration, Ea) depends on plant water availability, i.e. on root distribution in the subsurface and soil moisture availability (Feddes et al., 1976); this is emulated in mHM by computing Ea as a fraction of Ep at different soil layers. A multi-layer infiltration capacity approach is used to calculate soil moisture based on 240 a three-layer soil scheme (5 cm, 30 cm and 100 cm depths). As no snow occurs in the VRB, terrestrial water storage is calculated per grid cell by summing up the surface water storage on impervious areas and all subsurface water storage (i.e. reservoirs generating soil moisture, baseflow and interflow). The model is run at a daily time step with a spatial discretization of 0.25° (~28 km at the equator).
The modelling experiment covers the period 2000-2012 with 3-year model warm-up period (2000)(2001)(2002), 6 years for model 245 calibration (2003)(2004)(2005)(2006)(2007)(2008) and 4 years for model evaluation (2009)(2010)(2011)(2012). The model is calibrated and evaluated with the available daily in-situ streamflow datasets from 11 locations (Figure 2a), while the evaluation with satellite datasets of evaporation, soil moisture and terrestrial water storage is done at a monthly time step to avoid the impact of mismatches in the daily data retrieval periods among the satellite data sources. An illustration of natural variability of streamflow ( Figure S16), precipitation (Figures S1 and S5) and temperature  are provided in the Supporting Information (SI). 250

Multisite model calibration on streamflow data
A multisite calibration strategy is adopted by simultaneously constraining the model with the 11 streamflow (Q) gauging stations (Figure 2) to infer a unique parameter set for the whole basin. The objective function combines the Nash-Sutcliffe efficiency (Nash and Sutcliffe, 1970) of streamflow (ENS) and the Nash-Sutcliffe efficiency of the logarithm of streamflow 255 (ENSlog), and it is formulated such that it has to be minimized: where Qmod and Qobs are the modelled and the observed streamflow, t is the number of time steps of the calibration period, and g is the number of streamflow gauging stations present within the modelling domain. is calculated with all the streamflow gauging stations, and it ranges from its ideal value that is 0 to positive infinity.
The model is calibrated solely with Q data because it is the only available in-situ measurement, and to avoid potential trade-260 offs of a multivariate calibration that would result in difficulties in identifying the source of variation in the model performance (i.e. input data vs. model parametrization) (Dembélé et al., 2020b). The parameter estimation is done with the dynamically dimensioned search algorithm (Tolson and Shoemaker, 2007) using 4,000 iterations for each of the 102 rainfall-temperature dataset combinations.

Multivariable model evaluation with streamflow and satellite data
In addition to ENS and ENSlog, the Kling-Gupta efficiency (EKG) (Kling et al., 2012) is used to evaluate the model performance for streamflow.
where KG is the Pearson correlation coefficient, KG is the bias term (i.e. the ratio of the means), and KG is the variability term (i.e. the ratio of the coefficients of variation) between Qobs and Qmod. The EKG ranges from negative infinity to its optimal value 270 that is unity. As a reference, EKG > -0.41 indicates that the model is better than the mean observed flow (Knoben et al., 2019).
In addition to Q, several non-commensurable and satellite-based variables are used for model evaluation ( Table 2). The biasinsensitive Pearson's correlation coefficient (r) is used to assess the temporal dynamics of St, Su and Ea because the model is not calibrated on these variables, and their evaluation datasets are satellite-derived products that encompass uncertainties and 275 can be biased.
The spatial pattern representation of hydrological processes is assessed by using a bias-insensitive and multi-component metric developed by Dembélé et al. (2020b). The proposed spatial pattern efficiency (ESP) metric is formulated similarly to the EKG (Equation 4) but it focuses only on the spatial pattern of variables rather than on their absolute values (like the SPAEF; Koch 280 et al., 2018). ESP simultaneously assesses the dynamics, the spatial variability, and the locational matching of grid cells between the observed (Xobs) and modelled (Xmod) variables. Considering two variables Xobs and Xmod composed of n cells, ESP is defined as follows: = and (9) where rs is the Spearman rank-order correlation coefficient with di the difference between the ranks of the i th cell of Xmod and Xobs. is the variability ratio (i.e. the ratio of the coefficients of variation) that assesses the similarity in the dispersion of the 285 probability distributions of Xmod and Xobs, with and representing the mean and the standard deviation, and the spatial location matching term calculated as the root mean squared error (ERMS) of the standardized values (z-scores, ZX) of Xmod and Xobs (Dembélé et al., 2020b). ESP ranges from negative infinity to 1, which is its optimal value. ESP does not have an inherent benchmark, also like EKG (Knoben et al., 2019). For ESP = 0, the ranks of the observed and modelled variables are moderately related (i.e. rs = 0.55), while no association among the ranks (i.e. rs = 0) results in ESP = -0.67 (cf. Supplementary Material of 290 Dembélé et al., 2020b). However, the main point of using ESP here is not to strictly conclude how well the modelled spatial patterns reproduce the observed patterns, otherwise a benchmark should be used (Schaefli and Gupta, 2007;Seibert et al., 2018), but rather to determine if a modelled spatial pattern is better than another. The spatial pattern evaluation is completed for Su and Ea, while only the temporal dynamics of St are assessed due to the coarse spatial resolution of the GRACE data.

295
The relative variation in model performance is assessed with the second-order coefficient of variation (V2) (Kvålseth, 2017).
V2 is an alternative to the classic Pearson's coefficient of variation (CV), which has significant limitations that are comprehensively discussed by Kvålseth (2017). The limitations of the CV include its difficult and non-intuitive interpretation because of the lack of an upper bound, its high sensitivity to outliers, its dependence on the sample mean and problems with negative values. For all sample data x = (x1,…, xn) ∈ R n , with R = (-∞, ∞), V2 is defined as follows: 300 where s is the standard deviation and ̅ is the mean of x. V2 varies from 0 to 1 or 0% to 100%, and represents the distance between x and ̅ relative to the distance between x and the origin zero.

Results
The results are presented and discussed for the entire simulation period (2003-2012, i.e. combined calibration and evaluation 305 periods) because reliable meteorological datasets are expected to produce a plausible representation of hydrological processes independently of the modelling period (Bisselink et al., 2016). Separated results are provided for the calibration and evaluation periods in the SI.

Model performance for streamflow 310
Similar model performance patterns are obtained with EKG, ENS and ENSlog of daily streamflow (Q) (Figure 3). Therefore, only EKG is retained for the description of the results. All input dataset combinations show a median EKG > 0.5, except those having JRA-55 as rainfall input (Figure 3), which can be justified by the coarse spatial resolution of that product. The ranking of the rainfall and temperature datasets based on the model performance for Q is provided in Appendix A3. The analysis of model performance for Q is done for the entire VRB and not per climatic zone due to the limited number of stations. As expected, 315 the discrepancies in median EKG are more pronounced across rainfall datasets than across temperature datasets, as visible in the color-coded ranking of the products in Figure 3. For a given rainfall product, the ranking among all rainfall products hardly varies with different temperature products. The ranking of all the datasets for the model performance for Q is also summarized in Appendix A3. The overall stronger impact of the choice of the rainfall dataset on EKG of Q becomes also clear from the second-order coefficient of variations (V2) of the median EKG (Table S3 in SI). For rainfall datasets, the V2 across temperature 320 datasets varies between 0.5% for GSMaP-std and 4% for JRA-55, with an average V2 of 2%. For temperature datasets, the V2 of median EKG of Q across rainfall datasets varies between 10% for MERRA-2 and 12% for ERA5, with an average V2 of 11%.
This result suggests that the choice of a rainfall dataset has a stronger impact on the EKG of Q than the choice of a temperature dataset.
The analysis of the components of EKG (i.e. the Pearson correlation KG , the bias KG and the variation KG ) reveals that, when 325 choosing a rainfall dataset, there is more uncertainty in the bias of Q (V2 = 14%) than in its variability (V2 = 6%) and in its dynamics (V2 = 3%), which is in agreement with the work of Thiemig et al. (2013). Detailed results on the performance for Q (i.e. ENS, ENSlog, EKG, KG , KG and KG ) and the ranking of the datasets with separate results for the calibration and evaluation periods are provided in the SI (Tables S1-S18, Figures S17-S26).    The spatial patterns of Su show considerable differences when using different combinations of rainfall and temperature input datasets, as illustrated in Figure 6 (see similar maps for all the meteorological datasets in the SI, Figures S59-S60). The southnorth gradient of increasing aridity is not similarly spread among the rainfall-temperature dataset combinations. More interestingly, west-east differences in the spatial patterns of Su can be observed. These differences in spatial pattern reproduction can also be seen in the spatial pattern efficiency metric (ESP) of Su for the 102 rainfall-temperature dataset 365 combinations (Figure 7). The average ESP of Su in the VRB over all datasets is -0.11.  (Su) obtained with different forcing of rainfall (y-axis, blue font) and temperature (x-axis, red font) datasets. The values are normalized between 0 and 1 to emphasize spatial patterns and to use a unique color scale. Figure 7. Spatial pattern efficiency (ESP) of soil moisture (Su) over the entire simulation period (2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012) for the Volta River basin (VRB) using different combinations of precipitation and temperature datasets used as input for hydrological modelling. Each boxplot has 120 values corresponding to the number of months. The boxplots are colored from the best (blue) to the worst performance (red) based on the median value.

Model performance for actual evaporation
The model performance for the temporal dynamics of monthly actual evaporation (Ea) compared to the GLEAM product is shown in Figure 8 (see the SI for monthly time series, Figures S72-S76). The average r of Ea for the entire VRB over all datasets is 0.93. Similarly to Su, the r of Ea is higher is the driest climatic zones: Sahelian (r = 0.94), Sudano-Sahelian (r = 0.94), Sudanian (r = 0.89) and Guinean (r = 0.81). However, the predictive skill of the model for the temporal dynamics of Ea 395 is higher than its predictive skill for Ea in the wetter climatic zones. Appendix A3 shows the ranking of all the meteorological datasets for the model performance for Ea. The rainfall datasets show different performances across climatic zones, with the following best datasets: PERSIANN-CDR in the Sahelian zone, EWEMBI and WFDEI-GPCC in the Soudano-Sahelian zone, ARC in the Sudanian and Guinean zones. The choice of the rainfall dataset leads to an average V2 of 4% for the temporal dynamics of Ea, while the average V2 is 1.5% for the choice of the temperature dataset, which aligns with the findings of Jung   (2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012) considering 102 combinations of rainfall (y-axis) and temperature datasets (subplots on x-axis) used as forcing for the hydrological model.

405
As for Su, the choice of input datasets has a considerable impact on the reproduction of the spatial patterns of Ea (Figure 9). Similar maps for all the meteorological datasets are provided in the SI (Figures S77-S78). It can be observed that different rainfall-temperature combinations used to force the model result in large discrepancies in the spatial pattern of Ea, especially in the southern region. The south-north gradient of increasing aridity with west-east differences is represented differently 410 among the rainfall-temperature dataset combinations (see e.g., the difference between the first two columns of the first row in The ESP of Ea for the 102 rainfall-temperature dataset combinations in the VRB is given in Figure 10. The average ESP of Ea in the VRB over all datasets is 0.07, which is higher than for Su (ESP = -0.11). The choice of the rainfall dataset for the VRB affects the ESP of Ea on average by 93%, while the choice of the temperature dataset involves a variation 33%. However, lower

Discussions
This study builds upon and expands existing research studies on the evaluation of meteorological datasets in several ways: (i) The evaluation of the spatial patterns of multiple hydrological processes (i.e. streamflow, actual evaporation, soil moisture, and terrestrial water storage) in addition to the more classically evaluated temporal dynamic.
(ii) The evaluation of a high number of both satellite-based and reanalysis rainfall datasets considered in combination with 435 different temperature datasets.
(iii) The assessment of the model performance across four considerably different climatic zones from semi-arid to subhumid.
The overall outcome of this analysis is the ranking of all the meteorological datasets based on their ability to simulate various hydrological processes across different climatic zones in the VRB (Appendix A3). It is worth noting that the overall ranking 440 shows which product is best or worst at simulating a given hydrological flux or state variable. However, the ranking does not systematically tell whether a dataset is good or bad. Only the skill scores can be used to draw a judgement on the adequacy of a given dataset to produce plausible model outputs.
The results show that there is no single rainfall dataset outperforming the others in reproducing all hydrological processes across different climatic zones. These findings align with previous studies in the sense that there is no rainfall dataset that is 445 the best everywhere (Beck et al., 2017b;Sylla et al., 2013). For datasets providing both rainfall and temperature data, the combination of the two variables as model input is not necessarily the best option for obtaining the highest performance in modelling a given hydrological state or flux variable. The best rainfall-temperature combinations for the spatiotemporal representation of each hydrological flux and state variable are provided in the SI ( Figure S15).
The results are primarily valid for the study region in West Africa, while a wider generalization of the findings should be done 450 with caution and after repeating similar evaluation studies at other places. Nevertheless, the key message is that: there is no rainfall dataset of all hydrological processes and the best rainfall dataset for temporal dynamics might not be the best for spatial patterns. Therefore, different rainfall datasets should be evaluated before choosing the most suitable one for hydrological modelling in large catchments.
Moreover, when comparing the results of this study to the findings of Satgé et al. (2020) based on a point-to-pixel evaluation 455 of gridded rainfall datasets in West Africa, it is noticeable that the ground evaluation might lead to different results as compared to the hydrological evaluation adopted in the current study. The skill of a rainfall product in well reproducing ground measurements under a point-to-pixel evaluation does not necessarily correlate with its performance for hydrological modelling, particularly in large and complex hydroclimatic environments such as the VRB.
Despite the efforts to produce a comprehensive evaluation of the meteorological datasets, the results obtained might be subject 460 to uncertainties related to the potential model structural deficiencies as well as errors in the observational datasets used for the model evaluation (McMillan et al., 2010;Renard et al., 2010;Gupta and Govindaraju, 2019). The distribution of the final model parameters ( Figures S79-S80) highlights the possibility of obtaining equally good model performances for different parameter sets (i.e. equifinality), which can be a justification for model recalibration. Moreover, it can be noticed that most of the model parameters are sensitive to the change in meteorological input datasets ( Figure S79). A detailed analysis of parameter 465 variability as a function of input data is beyond the scope of the current study, but could build the basis of future research, namely to identify data errors by analyzing parameter patterns (e.g. rooting depth), and resolve potential structural deficiencies of the mHM model. However, the mHM is chosen because of its adequacy for the experiment of this study (for model selection, see Addor and Melsen, 2019). The structure of mHM allows the representation of seamless spatial patterns of hydrological processes through the MPR scheme . In addition, mHM facilitates parameter regionalization and is 470 therefore convenient for large-scale modelling, and it harnesses the full potential of the forcing datasets as it is a fully distributed model that has performed well in previous studies including those in the VRB (e.g. Poméon et al., 2018;Dembélé et al., 2020b). Regarding the model evaluation, the comparison between the observed and modelled hydrological processes is done only on their temporal dynamics and spatial patterns using bias-insensitive metrics, except for streamflow, which limits the potential impact of satellite data uncertainty. 475 The model is calibrated only on Q data despite the known limitations of the Q-only calibration . However, calibrating the model on additional variables would result in additional model performance improvement that would not be separable from the contribution of the input datasets to the model performance. Therefore, regarding the goal of this study, the Q-only calibration was the best option to obtain the impact of various meteorological forcing datasets on the plausibility of hydrological processes. As no rainfall dataset ranks first in simulating all the hydrological processes, this study confirms that 480 model calibration on multiple variables is a way forward in improving the overall representation of the hydrological system and increasing the predictive skill of hydrological models (Dembélé et al., 2020b;Dembélé et al., 2020a). The domain-wide calibration strategy adopted in this study generates a unique parameter set for the simulation of multiple hydrological processes across several catchments with different hydroclimatic features, which has the consequence of having local differences in model performance. However, domain-wide calibration has proved to perform similarly to domain-split calibration in previous 485 studies (Mizukami et al., 2017), and it was ideal for this study because of the interest in simulating seamless spatial patterns, which might have not been possible with separately simulated portions of the basin. Moreover, the main goal of this study is to assess the adequacy of the meteorological datasets for large-scale hydrological modelling, knowing that these datasets usually have a coarse spatial resolution with pixels often averaged over regions with strong sub-grid variability.
Finally, the importance of regional evaluation is emphasized by this study because some region-tailored datasets (e.g. 490 TAMSAT and ARC) which are not included in global scale studies (e.g. Beck et al., 2017b;Mazzoleni et al., 2019;Essou et al., 2016) outperform global datasets. The decision to use a given dataset is not only motivated by the availability or the accuracy of the data, but also by data accessibility (i.e. storage platforms, openness, format, pre-processing requirement, etc.).
The findings of this study provide further awareness for the data users and improvement avenues for data producers in their quest of the most accurate products (e.g. Massari et al., 2020;Contractor et al., 2020;Berg et al., 2018;Brocca et al., 2014;Cucchi 495 et al., 2020;Beck et al., 2017a).

Conclusion
This modelling study evaluates the ability of multiple combinations of rainfall-temperature datasets to reproduce plausible hydrological processes and patterns. The experiment is done in the Volta River basin with the fully distributed mesoscale Hydrologic Model (mHM) over a 10-year period (2003-2012), using 17 rainfall and 6 temperature datasets from satellite and 500 reanalysis sources. The spatial and temporal representation of streamflow, terrestrial water storage, soil moisture and actual evaporation are evaluated using in-situ and satellite remote sensing observational datasets. The key findings are: -No rainfall dataset consistently outperforms all the others in reproducing the highest model performance for all hydrological processes, and the best dataset for the temporal dynamics is not necessarily the best for the spatial patterns. 505 -Rainfall datasets have a higher impact on the spatiotemporal representation of hydrological processes than temperature datasets, but the later have a higher influence on the spatial patterns of soil moisture.
-The large-scale performance for the meteorological datasets is not always valid for sub-regions in the same basin.
The findings of this study give a critical insight of the performance for several meteorological datasets in the challenging hydroclimatic environment of West Africa. They are expected to foster further research initiatives in improving the gridded 510 meteorological datasets and further draw users' attention on the contrasting performances of these datasets in modelling hydrological fluxes and state variables. Efforts should be devoted in reporting on the impact of data uncertainties on process representation in hydrological modelling, especially when model outputs are used for decision-making.
Future studies can test the transferability of the model's global parameters across different input datasets, i.e. how reliable a parameter set obtained with a given input dataset is for running the same model with a different input dataset. The answer to 515 this research question will shed light on the necessity of model recalibration when using different meteorological forcing.
Furthermore, the predictive skill of the model can be improved with a parameter sensitivity analysis to determine parameters that affect the spatiotemporal representation of each hydrological flux and state variable.  (Su) and actual evaporation (Ea) using various rainfall-temperature dataset combinations as model inputs. Each score for a given rainfall product represents the average over 530 individual combinations with 6 temperature datasets, while the score is the average over combinations with 17 rainfall datasets for   Variables each temperature dataset. The skill scores of the temporal dynamics are obtained with the Kling-Gupta efficiency (EKG), the Nash-Sutcliffe efficiency (ENS) and the Nash-Sutcliffe efficiency of the logarithm (ENSlog) for Q, and the Pearson's correlation coefficient (r) for St, Su and Ea. The spatial pattern efficiency (ESP) is used to assess the spatial representation of Su and Ea. The skill scores are ranked from the best (blue) to the worst (red). The results are shown for the four climatic zones in the Volta River basin (VRB) over 535 the simulation period (2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012).
Supplement. The supplement related to this article is available online at: to be provided by the journal Data availability. The meteorological and modelling datasets used in this study are freely available via the web links provided 540 in Table 1 and Table 2. More information on satellite-based precipitation datasets can be found at http://ipwg.isac.cnr.it/. The modelling database is available at https://doi.org/10.5281/zenodo.3662308.
Author contributions. MD performed the analyses and drafted the manuscript. All authors contributed to the writing, review and editing process that lead to the final manuscript. 545 Competing interests. The authors declare that they have no conflict of interest.
Acknowledgements. We thank the providers of the datasets used in this study (see Table 1 and Table 2 Review statement. This paper was edited by Albrecht Weerts and reviewed by Nadav Peleg and one anonymous referee.