CABra: a novel large-sample dataset for Brazilian catchments

In this paper, we present the Catchments Attributes for Brazil (CABra), a large-sample dataset for Brazilian catchments that includes long-term data (30 years) for 735 catchments in eight main catchment attribute classes (climate, streamflow, groundwater, geology, soil, topography, land cover, and hydrologic disturbance). We collected and synthesized data from multiple sources (ground stations, remote sensing, and gridded datasets). To prepare the dataset, we delineated all the catchments using the Multi-Error-Removed Improved-Terrain Digital Elevation Model and the coordinates of the streamflow stations provided by the Brazilian Water Agency, including only the stations with 30 years (1980-2010) of data and less than 10% of missing records. Catchment areas range from 9 to 4,800,000 km² and the mean daily streamflow varies from 0.02 to 9 mm day⁻¹. Several signatures and indices were calculated based on the climate and streamflow data. Additionally, our dataset includes boundary shapefiles, geographic coordinates, and drainage area for each catchment, aside from more than 100 attributes within the attribute classes. The collection and processing methods are discussed along with the limitations of each of our multiple data sources. CABra is intended to improve hydrology-related data collection in Brazil and pave the way for a better understanding of different hydrologic drivers related to climate, landscape, and hydrology, which is particularly important in Brazil, a country with continental-scale river basins and widely heterogeneous landscape characteristics. In addition to benefitting catchment hydrology investigations, CABra will expand the exploration of novel hydrologic hypotheses and thereby advance our understanding of the behavior of Brazilian catchments. The dataset is freely available at https://doi.org/10.5281/zenodo.4070147.


Introduction
The integrated assessment of large-sample catchment attributes is fundamental for the description and classification of landscape properties, leading to an improved understanding of similarities (or dissimilarities) between catchments. Large-sample catchment hydrology is essential for understanding hydrological processes (Beven et al., 2020). It provides an attractive avenue for general inferences that would otherwise be impossible to draw from individual or small groups of catchments, and it allows the testing of new and existing hypotheses in hydrologic sciences (Addor et al., 2017; Gupta et al., 2014; Lyon and Troch, 2010; Wagener et al., 2007).
https://doi.org/10.5194/hess-2020-521 Preprint. Discussion started: 14 October 2020. © Author(s) 2020. CC BY 4.0 License.

The CABra dataset is recommended for a wide range of users for decision-making at multiple scales (local, national, or regional), covering all Brazilian biomes (Amazon, Cerrado, Atlantic Forest, Pantanal, Caatinga, and Pampa). CABra was created to ensure easy access to its information and provide high-quality data, with attributes useful for a variety of hydrometeorological modeling and assessment tasks. Each catchment has several attributes, ranging from the file information described in Table 1 to the attributes described throughout this article. Moreover, we made all the geospatial data (shapefiles of the boundaries) available to users.

Catchment delineation and topography
Brazil does not have an official database of national catchment boundaries, and the Brazilian Water Agency (ANA) does not make its geospatial database available. Because of this, and to avoid uncertainties in the existing datasets for South America, we generated all the CABra catchment boundaries used in this study from scratch. Digital Elevation Model (DEM) quality and resolution are crucial at this stage, since all subsequent analyses with the multi-source information used in the CABra dataset are area-averaged. For example, it is well known that errors in topographic indices, e.g., slope and catchment area and boundary, are highly sensitive to DEM resolution and accuracy, and it is suggested that, if available, a high-resolution DEM should be used instead of a low-resolution one because of the negative effects of the terrain generalization caused by the latter (Mukherjee et al., 2012; Vaze et al., 2010; Wechsler, 2007; Zhou and Liu, 2004). We delineated the CABra catchments following the procedure described in Maidment (2002), using streamflow gauge locations from ANA's database and a high-resolution elevation product, i.e., the Multi-Error-Removed Improved-Terrain Digital Elevation Model, with a 90-m spatial resolution at the Equator (Yamazaki et al., 2017) (Fig. 2).

In the first stage, which we call "terrain processing", the DEM was sink-filled to avoid possible errors due to peaks or depressions. Then, the flow direction and flow accumulation were calculated, which indicate the direction and accumulation of flow, respectively, in each grid cell within the catchment. The next step was to define the stream network in the catchment. To define a river stream, we adopted a threshold of 100 cells accumulating water, a value chosen considering the DEM spatial resolution and the range of catchment sizes. All the previous steps were run for the whole extent of South America: even though all outlets are located in Brazilian territory, some of the drainage areas extend well beyond it.
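The flow-accumulation and stream-definition steps above can be illustrated with a minimal sketch. It assumes a D8 (steepest-descent) flow-direction scheme on an already sink-filled grid; the text does not state which routing scheme was used, so the scheme and the function names `d8_flow_accumulation` and `stream_mask` are illustrative assumptions, not the authors' implementation.

```python
def d8_flow_accumulation(dem):
    """Count the cells draining through each cell (the cell itself included).

    Assumes a D8 scheme on a sink-filled DEM given as a list of rows.
    """
    rows, cols = len(dem), len(dem[0])
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]
    receiver = {}
    for r in range(rows):
        for c in range(cols):
            best, best_slope = None, 0.0
            for dr, dc in neighbours:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    distance = (dr * dr + dc * dc) ** 0.5
                    slope = (dem[r][c] - dem[rr][cc]) / distance
                    if slope > best_slope:
                        best, best_slope = (rr, cc), slope
            receiver[(r, c)] = best  # None for outlets and flat cells

    acc = [[1] * cols for _ in range(rows)]
    # Visit cells from highest to lowest so donors are summed before receivers.
    ordered = sorted(receiver, key=lambda rc: dem[rc[0]][rc[1]], reverse=True)
    for r, c in ordered:
        target = receiver[(r, c)]
        if target is not None:
            acc[target[0]][target[1]] += acc[r][c]
    return acc


def stream_mask(acc, threshold=100):
    """Mark cells whose accumulation reaches the stream-definition threshold."""
    return [[cell >= threshold for cell in row] for row in acc]


# Toy 3 x 3 surface tilted towards the lower-right corner (hypothetical data).
dem = [[9.0, 8.0, 7.0],
       [8.0, 7.0, 6.0],
       [7.0, 6.0, 5.0]]
acc = d8_flow_accumulation(dem)
streams = stream_mask(acc, threshold=3)  # small threshold for the toy grid
```

With 90-m cells, the 100-cell threshold used in the text corresponds to a contributing area of roughly 100 × 0.0081 km² = 0.81 km² near the Equator.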
The second step was catchment delineation, which used the products generated in the previous step and the coordinates of the streamflow gauges. Each streamflow gauge coordinate was first plotted as a point, and its position relative to the stream network was checked and corrected if necessary. The correction procedure was performed for 132 of the CABra catchments. Then, each corrected point was used as a catchment outlet and the drainage area was delineated using the ArcHydro tool. Aside from the catchment limits, perimeters, and areas, we also extracted stream information, such as the stream network and hierarchy (Strahler, 1952, 1957). It is important to highlight that we manually inspected each catchment outlet and area to overcome the limitation of unchecked boundaries in other existing catchment datasets for Brazil (CAMELS-BR, by Chagas et al., 2020) and South America (Do et al., 2018), which were based on a DEM with a spatial resolution of 500 m. Moreover, this proved to be a crucial procedure for accurate delineation, since the positions of several outlets needed to be corrected to represent the real expected catchment boundary.
Once the catchment boundaries were delimited, we calculated six attributes related to the topography of each catchment: area, slope, maximum, minimum, and mean elevation, and streamflow gauge elevation.

Figure 3 summarizes the topographic attributes for the CABra catchments. Catchment areas range from 9 to 4.8 × 10⁶ km² (Fig. 3a). This large range of areas shows how Brazilian hydrology can be, at the same time, local and continental, which calls for a better understanding of hydrologic processes. Many of the largest catchments are on the main stem of one of the 12 hydrologic regions of Brazil, especially the Amazon, Tocantins/Araguaia, São Francisco, Paraguay, and Paraná.

The mean elevation of CABra catchments ranges from close to zero to up to 2000 m, with the highest values found in the southern and south-eastern portions.
In turn, steep areas can be found in the coastal and mountainous areas of the southeast and south (Fig. 3b and Fig. 3c).
Most of the Brazilian catchments have a flat topography, though, with a mean slope of up to 10%. Figure 3d shows the gauge elevation. Note the difference between the gauge elevation and the mean catchment elevation in Fig. 3b. The gauge elevation considers only the elevation at the gauge position in the landscape, thereby providing only local information, while the mean catchment elevation considers the average elevation of the entire catchment. An example of this difference is the largest CABra catchment, i.e., the Amazon. The mean elevation in the Amazon basin would be low; however, the western part of the basin has some of the highest peaks of the Andes, where the gauge elevation would be much higher.

Methodology
We present daily time series of area-averaged precipitation, minimum, maximum, and mean temperatures, solar radiation, relative humidity, wind speed, evapotranspiration, and potential evapotranspiration (calculated by the Penman-Monteith, Priestley-Taylor, and Hargreaves methods). The reference climate product of Xavier et al. (2016) (here referred to as "REF") is available at http://careyking.com/data-downloads/. This product has a much finer spatial resolution and is based on a higher number of rain gauge stations than other widely used products (~4,000 stations for Brazil, in comparison to ~600 stations for South America in the CRU TS3.1 product). However, the REF dataset covers only the Brazilian territory, while the CABra dataset has 20 catchments with upstream areas outside Brazil. To overcome this, we incorporated the ERA5 (Hersbach et al., 2020) climate data into the CABra dataset (here referred to as "ERA5"). ERA5 is the most recent version of the climate reanalysis from the European Centre for Medium-Range Weather Forecasts (ECMWF) and provides hourly, daily, and monthly data on several atmospheric, sea, and land variables on a 0.25° × 0.25° grid, from 1950 to the present. As a reanalysis dataset, ERA5 uses past observations and models to generate accurate and consistent time series of climate variables and parameters, and it is one of the most widely used datasets in geosciences (Hersbach et al., 2020). To produce a more reliable product for all the CABra catchments, we generated an ensemble mean product (here referred to as "ENS") from the two aforementioned datasets, i.e., the REF and ERA5 climate products. The procedure was conducted with the Climate Data Operators (CDO, Schulzweida, 2019) and aimed at a better characterization and representation of the climate based on two independent estimates, which implies a more robust representation of the phenomenon than a single-member analysis (Abramowitz et al., 2018).
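As a rough illustration of the ensemble mean, the sketch below averages two co-located series cell by cell. The fallback to the single available member where REF has no data (for example, outside Brazil) is an assumption made for this example; the text does not describe how such cells were treated, and the function name `ensemble_mean` is hypothetical.

```python
import math


def ensemble_mean(ref, era5):
    """Cell-wise mean of two members, ignoring missing (NaN) values.

    Assumption: where only one member has data, that value is kept as is.
    """
    result = []
    for ref_value, era5_value in zip(ref, era5):
        members = [v for v in (ref_value, era5_value) if not math.isnan(v)]
        result.append(sum(members) / len(members) if members else math.nan)
    return result


# Hypothetical daily precipitation (mm) at three grid cells;
# the second cell lies outside the REF coverage.
ref = [4.0, math.nan, 2.0]
era5 = [6.0, 3.0, 4.0]
ens = ensemble_mean(ref, era5)
```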
The precipitation seasonality (Woods, 2009), which indicates the timing of the precipitation seasonal cycle relative to the temperature seasonal cycle (values close to +1 indicate summer precipitation and values close to -1 indicate winter precipitation), was calculated for the ensemble product.
The actual evapotranspiration adopted in CABra is derived from the Global Land Evaporation Amsterdam Model version 3 (GLEAM v3, Martens et al., 2017), which is a set of algorithms that estimates the components of land evaporation based on satellite observations of climatic and environmental variables. The calculation of the actual evapotranspiration by GLEAM v3 combines a potential evapotranspiration module (Priestley and Taylor method), an interception loss module (a Gash analytical model), and a stress module (a semi-empirical relationship to root-zone moisture and vegetation optical depth). The GLEAM dataset is one of the most commonly used datasets in evapotranspiration applications (Forzieri et al., 2018; Schumacher et al., 2019; Zhang et al., 2016).
Even though the REF dataset presents a reference evapotranspiration product (calculated by the Penman-Monteith method following the FAO-56 guidelines), it covers only the Brazilian territory and does not comprise all the areas of the catchments included in the CABra dataset. To overcome this limitation, we calculated the daily potential evapotranspiration (PET) by three different widely used methods based on energy balance and mass transfer, radiation, and temperature, using meteorological variables from the ERA5 and the ensemble products as inputs. The first method was the FAO-56 Penman-Monteith equation (Allen et al., 1998), which is the standard for reference evapotranspiration and assumes a hypothetical crop similar to a surface of short grass of uniform height, actively growing and sufficiently watered. The equation takes the following form:

$$ET_0 = \frac{0.408\,\Delta\,(R_n - G) + \gamma\,\dfrac{900}{T + 273}\,u_2\,(e_s - e_a)}{\Delta + \gamma\,(1 + 0.34\,u_2)} \qquad (1)$$

where $ET_0$ is the reference evapotranspiration, in mm day⁻¹, $R_n$ is the net radiation, in MJ m⁻² day⁻¹, $G$ is the soil heat flux, in MJ m⁻² day⁻¹, $T$ is the mean daily temperature at 2 m height, in °C, $u_2$ is the wind speed at 2 m height, in m s⁻¹, $e_s$ is the saturation vapor pressure, in kPa, $e_a$ is the actual vapor pressure, in kPa, $\Delta$ is the slope of the vapor pressure curve, in kPa °C⁻¹, and $\gamma$ is the psychrometric constant, in kPa °C⁻¹.
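A minimal sketch of the daily FAO-56 Penman-Monteith calculation is given below. It uses the standard FAO-56 expressions for the slope of the vapor pressure curve and the psychrometric constant; the function names, the default atmospheric pressure, and the input values are illustrative assumptions rather than the processing chain used for CABra.

```python
import math


def saturation_vapour_pressure(t_c):
    """Saturation vapour pressure (kPa) at air temperature t_c (deg C)."""
    return 0.6108 * math.exp(17.27 * t_c / (t_c + 237.3))


def slope_vapour_pressure_curve(t_c):
    """Slope of the saturation vapour pressure curve (kPa per deg C)."""
    return 4098.0 * saturation_vapour_pressure(t_c) / (t_c + 237.3) ** 2


def fao56_penman_monteith(rn, g, t_mean, u2, es, ea, pressure_kpa=101.3):
    """Reference evapotranspiration (mm/day) from daily inputs.

    rn, g in MJ m-2 day-1; t_mean in deg C; u2 in m/s; es, ea in kPa.
    pressure_kpa defaults to sea-level pressure (an assumption).
    """
    delta = slope_vapour_pressure_curve(t_mean)
    gamma = 0.000665 * pressure_kpa  # psychrometric constant (kPa per deg C)
    numerator = (0.408 * delta * (rn - g)
                 + gamma * 900.0 / (t_mean + 273.0) * u2 * (es - ea))
    denominator = delta + gamma * (1.0 + 0.34 * u2)
    return numerator / denominator


# Illustrative (hypothetical) daily inputs for a mid-latitude summer day.
et0 = fao56_penman_monteith(rn=13.28, g=0.0, t_mean=16.9, u2=2.078,
                            es=1.997, ea=1.409, pressure_kpa=100.1)
```

For these inputs the result is a little under 4 mm day⁻¹.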
The radiation-based method chosen for the CABra dataset is the Priestley-Taylor equation (PT) (Priestley and Taylor, 1972).
The PT method considers that when large areas, such as catchments, are saturated, the main force that governs evaporation is the net radiation, and that under certain conditions knowledge of the net radiation and the ground dryness is enough to determine the vapor and sensible heat fluxes at the surface. Moreover, it is one of the most commonly used models to estimate evapotranspiration due to its low input requirements (Maes et al., 2018; McMahon et al., 2013; Shuttleworth, 1996). The PT equation takes the following form:

$$PET = \alpha\,\frac{\Delta}{\Delta + \gamma}\,(R_n - G) \qquad (2)$$

where $PET$ is the potential evapotranspiration, in mm day⁻¹, $\alpha$ is the Priestley-Taylor constant, dimensionless, $R_n$ is the net radiation, in MJ m⁻² day⁻¹, $G$ is the soil heat flux, in MJ m⁻² day⁻¹, $\Delta$ is the slope of the vapor pressure curve, in kPa °C⁻¹, and $\gamma$ is the psychrometric constant, in kPa °C⁻¹. Considering that PT only accounts for daytime evapotranspiration and that $G$ is negligible during the daytime, we used $G = 0$ in our calculations.
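The Priestley-Taylor calculation can be sketched as follows. To obtain a depth in mm day⁻¹ from radiation in MJ m⁻² day⁻¹, this example divides by the latent heat of vaporization (about 2.45 MJ kg⁻¹); that conversion, the default psychrometric constant, and the function name are assumptions made for the illustration and are not taken from the text.

```python
import math

LATENT_HEAT = 2.45  # MJ kg-1, assumed constant latent heat of vaporization


def priestley_taylor(rn, t_mean, g=0.0, alpha=1.26, gamma=0.0674):
    """Potential evapotranspiration (mm/day) by the Priestley-Taylor method.

    rn, g in MJ m-2 day-1; t_mean in deg C; gamma in kPa per deg C
    (default is the sea-level value, an assumption).
    """
    es = 0.6108 * math.exp(17.27 * t_mean / (t_mean + 237.3))
    delta = 4098.0 * es / (t_mean + 237.3) ** 2
    return alpha * delta / (delta + gamma) * (rn - g) / LATENT_HEAT


# Hypothetical warm, humid day with G set to zero, as in the text.
pet = priestley_taylor(rn=15.0, t_mean=25.0)
```

Keeping `alpha` as a parameter makes it easy to test how sensitive the estimate is to departures from the mean value of 1.26.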
The main limitation on the application of the PT method is the requirement of the Priestley-Taylor constant α, which is related to the ratio between the actual evapotranspiration and the equilibrium evaporation rate (Eichinger et al., 1996). In the development of the method, α values of up to 1.34 were reported, and the authors concluded that the best estimate for α is an overall mean of 1.26. However, it is known that the α value is scenario-dependent, and its variability is not taken into account when using the mean value proposed in its development (Guo et al., 2007).
The third method adopted here is the Hargreaves equation. The method was developed by Hargreaves (1975) for irrigation planning and design, and it is a temperature-based equation widely used to calculate the potential evapotranspiration due to its easy application and low input requirements (Equation 3).
where $PET$ is the potential evapotranspiration, in mm day⁻¹, $R_s$ is the solar radiation, in MJ m⁻² day⁻¹, and $T$ is the daily mean temperature, in °C.
The main limitation of this equation is that its estimates are subject to error due to the large range of temperatures caused by weather fronts at the daily scale. On the other hand, it is less biased than other methods when applied to small and not well-watered catchments (Hargreaves and Allen, 2003).
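For illustration, a commonly cited form of the Hargreaves (1975) radiation-temperature equation is PET = 0.0135 R_s (T + 17.8), with R_s expressed as an equivalent evaporation depth. The sketch below assumes that form and converts R_s from MJ m⁻² day⁻¹ by dividing by a latent heat of 2.45 MJ kg⁻¹; both the exact form and the conversion are assumptions, and they may differ from the equation used for CABra.

```python
LATENT_HEAT = 2.45  # MJ kg-1, assumed constant latent heat of vaporization


def hargreaves_1975(rs, t_mean):
    """Potential evapotranspiration (mm/day), assumed Hargreaves (1975) form.

    rs: solar radiation in MJ m-2 day-1; t_mean: daily mean temperature (deg C).
    """
    rs_mm = rs / LATENT_HEAT  # radiation as equivalent evaporation (mm/day)
    return 0.0135 * rs_mm * (t_mean + 17.8)


# Hypothetical inputs: 20 MJ m-2 day-1 of solar radiation at 25 deg C.
pet = hargreaves_1975(rs=20.0, t_mean=25.0)
```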
From the climatic variables and attributes, we carried out an analysis of the annual water balance in the Budyko space, an empirical approach applied to the study of the hydrological behavior of catchments. The Budyko hypothesis (Budyko, 1948, 1974) considers that the ratio between the long-term annual actual evapotranspiration (ET) and precipitation (P) is a function of the ratio between the long-term potential evapotranspiration (PET) and precipitation (P). The Budyko framework has been used to assess global impacts of climate change on water resources (Berghuijs et al., 2017; Roderick et al., 2014) and to gain further insight into water balance controls at mean annual timescales (Donohue et al., 2007; Berghuijs et al., 2017; Meira Neto et al., 2020).
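The Budyko relationship can be made concrete with a short sketch. It uses the classical non-parametric Budyko (1974) curve, ET/P = sqrt(φ tanh(1/φ) (1 − exp(−φ))), where φ = PET/P is the aridity index; the text does not state which functional form was plotted, so this choice and the function names are assumptions for the example.

```python
import math


def budyko_evaporative_index(aridity_index):
    """Expected long-term ET/P for a given PET/P (classical Budyko curve)."""
    phi = aridity_index
    return math.sqrt(phi * math.tanh(1.0 / phi) * (1.0 - math.exp(-phi)))


def limitation(aridity_index):
    """Classify a catchment by the side of PET/P = 1 on which it falls."""
    return "water-limited" if aridity_index > 1.0 else "energy-limited"


# Hypothetical long-term means (mm/day) for one catchment.
precipitation, pet = 4.0, 3.0
phi = pet / precipitation
expected_et_over_p = budyko_evaporative_index(phi)
regime = limitation(phi)
```

Catchments whose observed ET/P departs strongly from the curve value at their aridity index are the outliers with respect to the Budyko hypothesis.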
Figure 4 shows some of the climate attributes for the CABra dataset. Regarding the precipitation derived from our ensemble of Xavier et al. (2016) and ERA5 (Fig. 4a), we found the highest values, reaching up to 10 mm day⁻¹, in the northern portion, and the lowest values, below 1 mm day⁻¹, in the north-eastern portion. Despite the wide range in daily precipitation, most of the catchments (~80%) presented area-averaged precipitation between 3 and 6 mm day⁻¹. Figure 4d shows the area-averaged solar radiation reaching the surface, ranging from 10 to 20 MJ m⁻² day⁻¹, with most of the catchments showing daily values higher than 15 MJ m⁻² day⁻¹. The spatial distribution of solar radiation is reflected in the temperature values in CABra catchments (Fig. 4e and Fig. 4f). The southern and south-eastern portions present the lowest values of both the maximum and minimum temperatures. This is due to the lower values of solar radiation and the high altitudes found in these regions of Brazil. Other areas of Brazil are located at lower latitudes and are subject to higher solar radiation, and due to their flat relief, the temperatures are higher than in the south. Figure 4b indicates that, in most of the CABra catchments (~85%), the precipitation seasonal cycle is in phase with the temperature seasonal dynamics, which means that most of the precipitation occurs in the summer (seas > 0). There are only a few catchments in the northern portion of Brazil that have precipitation in the winter (seas < 0), and this can be explained by the strong influence of the sea breeze on convective precipitation in this region. According to Ahrens (2010) and Kousky et al. (1984), the Amazonian coastal area is highly influenced by the sea breeze, which can occur on 3 out of every 4 days, with the formation of convective activity inland.

The Budyko framework (Budyko, 1948, 1974) shows that half of the CABra catchments are water-limited and the other half are energy-limited (Fig. 6). The lowest aridity index values are found in the Amazon and the Atlantic Forest, while the warmer and drier climates are found in the Cerrado and Caatinga biomes. This may be related to the vegetation physiognomies found in these biomes (tropical forests for the first group and grass and shrub for the second) and, especially, to the water availability and radiation incidence in these biomes. Although we found some outliers that are not explained by the Budyko hypothesis, most of the CABra catchments follow the expected behavior of the long-term mean water balance proposed by Budyko (1948, 1974).

Methodology
The CABra dataset provides daily streamflow records for 735 catchments in Brazil. We used data from ANA streamflow gauges, where each gauge is related to one of the abovementioned catchments. This dataset is available in the HIDROWEB database (see http://www.snirh.gov.br/hidroweb/). ANA's database contains raw time series from tens of thousands of gauges of streamflow, precipitation, water quality, and sediment discharge, with a consistency level for each observation.
Due to the inconsistencies and missing records in the streamflow data provided by ANA, we implemented filters to take into account only the reliable data for the CABra dataset.
During our analysis, we found four main issues with ANA's database collected from HIDROWEB: (a) missing streamflow values for a period of the time series; (b) duplicate streamflow values with different consistency levels; (c) duplicate values with the same consistency level; and (d) duplicate dates with different values and consistency levels. In the first filter step, we overcame the last three issues by keeping only one of the duplicated values/dates, based on the best level of consistency.
The first issue is more complex and difficult to overcome, as in some cases the missing data reach almost 100% for some gauges. Since long streamflow time series are needed for reliable hydrologic investigations, we defined a threshold for the selection of the streamflow gauges considered in the CABra dataset based on the following conditions: at least 30 years of data, comprising the hydrologic years from 1980 to 2010, with up to 10% of missing data. The application of these filters led to 735 streamflow gauges and, consequently, 735 catchments.
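The two filter steps can be sketched as below. The record layout, the convention that a higher consistency level is the more reliable one, and the period from 1 October 1980 to 30 September 2010 are assumptions made for this example; the function names are hypothetical.

```python
from datetime import date, timedelta


def deduplicate(records):
    """Keep one flow per day, preferring the highest consistency level.

    records: iterable of (day, flow, consistency_level) tuples.
    Ties on the consistency level keep the first record encountered.
    """
    best = {}
    for day, flow, level in records:
        if day not in best or level > best[day][1]:
            best[day] = (flow, level)
    return {day: flow for day, (flow, _) in best.items()}


def missing_fraction(series, start, end):
    """Fraction of days in [start, end] without a streamflow value."""
    n_days = (end - start).days + 1
    present = sum(
        1 for offset in range(n_days)
        if series.get(start + timedelta(days=offset)) is not None
    )
    return 1.0 - present / n_days


def passes_filter(series, start=date(1980, 10, 1), end=date(2010, 9, 30),
                  max_missing=0.10):
    """Apply the 'up to 10% missing data' criterion over the study period."""
    return missing_fraction(series, start, end) <= max_missing


# Hypothetical gauge with a duplicated date and one missing value.
raw = [
    (date(1980, 10, 1), 5.0, 1),
    (date(1980, 10, 1), 5.5, 2),   # same date, better consistency level
    (date(1980, 10, 2), None, 1),  # missing observation
]
clean = deduplicate(raw)
gap = missing_fraction(clean, date(1980, 10, 1), date(1980, 10, 2))
```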
After applying the filters, we calculated a variety of hydrological signatures for the 735 selected catchments, which can provide a better understanding of the patterns of functionality and behavior of the catchments. From the quantification of hydrological characteristics, it is possible to explain the variability in responses to climate forcings. We selected hydrological signatures that can be obtained from widely available hydrological series (see Table 4), as done by Sawicz et al. (2011) and Westerberg and McMillan (2015). A list with more hydrological signatures can be found in Yadav et al. (2007).

Figure 7 shows the hydrologic signatures calculated for the CABra catchments for the period between the hydrologic years 1980 and 2010. The mean daily flow for the Brazilian catchments ranges from less than 1 mm day⁻¹ to up to 9 mm day⁻¹, with an overall mean of 2 mm day⁻¹. The highest values were found in the extreme north of the Amazon, where the daily flows reached 8 mm day⁻¹ due to high amounts of precipitation throughout the year, and in the Atlantic Forest, in the southeast, where the steep relief, with higher slope values, favors runoff over infiltration. This can be seen in Fig. 7b, related to the runoff coefficient, where we noted high values in the southern and northwestern portions of Brazil. Most of the CABra catchments presented a runoff coefficient of up to 0.5, though.
Our results also revealed that the Brazilian catchments are mainly dependent on baseflow, since all of them presented a baseflow index greater than 70%. The lowest values were found in the Caatinga biome, where we also found the lowest mean daily flows. The half-flow date (considering October 1st as the beginning of the hydrologic year) indicates that ~80% of Brazilian catchments reach half of the total accumulated annual flow in less than 200 days (Fig. 7d), showing a high correlation with the seasonal cycle of precipitation. The catchments with later half-flow dates are found in the Pampa biome, where there is no well-defined rainy/dry season, and in the Amazon, where the amounts of accumulated annual streamflow are too high and the peak of precipitation is near the end of the hydrologic year (Almagro et al., 2020).
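Two of the signatures discussed above, the runoff coefficient and the half-flow date, can be sketched in a few lines. The definitions used here (ratio of total flow to total precipitation, and the first day on which the cumulative flow reaches half of the annual total) are common ones and are assumed for the example; the function names are hypothetical.

```python
def runoff_coefficient(flow_mm, precip_mm):
    """Ratio of total streamflow to total precipitation over the same period."""
    return sum(flow_mm) / sum(precip_mm)


def half_flow_day(flow_mm):
    """Day of the hydrologic year (1-based) on which half of the annual
    flow has accumulated; the series is assumed to start on October 1st."""
    half = sum(flow_mm) / 2.0
    cumulative = 0.0
    for day, q in enumerate(flow_mm, start=1):
        cumulative += q
        if cumulative >= half:
            return day
    return None


# Hypothetical five-day "year" to keep the arithmetic easy to follow.
flow = [1.0, 1.0, 1.0, 1.0, 4.0]
precip = [2.0, 2.0, 4.0, 4.0, 4.0]
rc = runoff_coefficient(flow, precip)
hfd = half_flow_day(flow)
```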
The analysis of the slope of the flow duration curve, in Fig. 7e, shows the lowest values in a great portion of Brazil, ranging from the Cerrado to the Atlantic Forest and Pampa biomes.
In our analyses, we also found values of the slope of the flow duration curve reaching infinity in the north-eastern portion of Brazil, in the Caatinga biome, which indicates the existence of catchments with ephemeral rivers in that region that are mainly dependent on direct runoff. This can also be seen in Fig. 7f, related to the streamflow elasticity. The highest values, up to 4, are located in catchments within the same region, indicating the strong dependence of those catchments on precipitation events to generate their streamflow. Moreover, we note that most Brazilian catchments are inelastic to changes in precipitation. This can be explained by the high values of the baseflow index, which maintain the streamflow throughout the year. Fig. 7g, Fig. 7h, and Fig. 7i show the results related to the low flows of CABra catchments.
In general, Brazilian catchments present a low flow (5th quantile) lower than 1 mm day⁻¹, on up to 50 days per year, with a mean duration of up to 25 consecutive days. Despite the mean values, we note high values (up to 3 mm day⁻¹) in the Amazon. Additionally, higher values of frequency and duration of low flows are found in the north-eastern portion of Brazil, with the mean frequency reaching 150 days and the mean duration reaching 100 days for some catchments. In turn, Fig. 7j,
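The infinite slopes reported for ephemeral rivers follow directly from a log-based definition of the flow duration curve slope. The sketch below assumes the definition of Sawicz et al. (2011), i.e. the difference between the logarithms of the flows exceeded 33% and 66% of the time divided by 0.33; the simple rank-based quantile and the function names are illustrative assumptions.

```python
import math


def flow_exceeded(flows, probability):
    """Flow exceeded for roughly the given fraction of time (rank-based)."""
    ranked = sorted(flows, reverse=True)
    index = min(len(ranked) - 1, int(probability * len(ranked)))
    return ranked[index]


def fdc_slope(flows):
    """Slope of the flow duration curve between 33% and 66% exceedance."""
    q33 = flow_exceeded(flows, 0.33)
    q66 = flow_exceeded(flows, 0.66)
    if q33 <= 0.0:
        return math.nan  # undefined when even the higher flow is zero
    if q66 <= 0.0:
        return math.inf  # log(0): the river is dry at 66% exceedance
    return (math.log(q33) - math.log(q66)) / (0.66 - 0.33)


# Hypothetical daily flows (mm/day) for a perennial and an ephemeral river.
perennial = [10, 8, 6, 5, 4, 3, 2, 2, 1, 1]
ephemeral = [5, 4, 3, 3, 2, 2, 0, 0, 0, 0]
slope_perennial = fdc_slope(perennial)
slope_ephemeral = fdc_slope(ephemeral)
```

A river that is dry more than about a third of the time has zero flow at the 66% exceedance level, so the logarithm diverges and the slope is reported as infinite.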

Methodology
The CABra dataset presents four attributes regarding groundwater in the catchments (Table 5). They are related to the water table (water table depth and height above the nearest drainage) and to the aquifer in which the catchment lies (aquifer name and rock type). The first attribute is the area-averaged water table depth. This information was extracted from Fan et al. (2013), a global water table depth map generated using a coupled climate-sea-terrain model. The results were validated against observations and show the global patterns of shallow groundwater, making it possible to understand how groundwater affects terrestrial ecosystems, such as soil moisture and land hydrology, during rainfall deficits (Fan et al., 2013; Lo et al., 2010).
The second attribute is the Height Above the Nearest Drainage (HAND), which is also related to the water table but is an indirect way to infer the water table depth. The HAND is a drainage-normalized version of a digital elevation model, where the height is defined as the vertical distance from a hillslope (at the surface cell) to a respective "outlet-to-the-drainage" cell, as defined by Nobre et al. (2011). Considering the local gravitational potential, the HAND model shows robust correlations between soil water conditions and its values. Additionally, the authors created three classes to easily infer the water table depth (at the surface, shallow, or deep) using only a digital elevation model; water table depth is otherwise difficult to obtain and scarce at large scales. We also present the aquifer in which most of the catchment area lies and the most common rock type of the aquifer. This information was provided by the ANA database and is important for understanding the aquifer geology and its implications for groundwater storage and recharge.

Results and discussion
Our analyses showed a close relationship between the water table depth from Fan et al. (2013) and the HAND. In the northern portion of Brazil, especially in the Amazon, we find shallow water table depths, while in the south-east, especially in the Atlantic Forest, we noted the deepest water table depths (see Fig. 8a and Fig. 8b). This could be related to the altitude of each catchment, since the HAND is a product derived from a digital elevation model. Where a catchment lies at a high elevation, the water table is deeper than in catchments at low elevations. This is particularly noticeable in the coastal area of the Atlantic Forest, which has high altitudes and, at the same time, is close to sea level. Figure 8c shows that most of the CABra catchments are dominated by fractured and porous rocks. Fractured rocks store water in fractures, creating large pockets of water, and due to the nature of the rock, they are hard to drill. Porous rocks store water in the soil pores (especially in sandy soils originating from sedimentary rocks), and it is common to find large amounts of water in them. Moreover, they are easier to drill than other types, which leads to more exploitation of their water. Two of the world's largest aquifers are in Brazil and are porous: the Guarani Aquifer in the Cerrado biome and the Alter do Chão Aquifer in the Amazon biome. The third aquifer type found in CABra catchments is the karstic one. This kind of aquifer is like the fractured one, but the fractures are much bigger, thereby forming subsurface rivers and lakes. It can be found in the São Francisco River Basin.

Methodology
The CABra dataset has eight attributes related to the soil type, properties, and texture (Table 6). The soil type presented here is the most common type in each catchment (the largest share among the different types), derived from the Brazilian soil map developed by the Brazilian Agricultural Research Corporation (EMBRAPA, in Portuguese) (Santos et al., 2011). To meet international standards for soil classification, we converted the classes to the widely used World Reference Base (WRB) (FAO, 2014). Given the importance of soil depth, density, texture, and organic matter for understanding soil-water dynamics and root growth (Dexter, 2004; Saxton et al., 1986; Saxton and Rawls, 2006; Shirazi and Boersma, 1984), we also present their mean areal values. These fields were taken from SoilGrids250m, a global high-resolution gridded soil information product based on field measurements, data assimilation, and machine learning. This is the most detailed and accurate global soil product and is crucial for the development of large-scale studies in many fields (ecology, climate, hydrology). However, despite all the improvements brought by SoilGrids250m, the data still have limitations, and one of the biggest is the high level of uncertainty for some of its products, such as the depth to bedrock and coarse fragments. We also employed the United States Department of Agriculture (USDA) soil texture classification, a widely used method for soil definition based on the mechanical limits of soil particles. Moreover, previous studies showed that the USDA soil texture classification can potentially reflect other soil parameters and characteristics (Groenendyk et al., 2015; Twarakavi et al., 2010), making it a powerful tool with low input requirements.

Results and discussion
The catchments presented 12 main soil classes, with the Ferrasols, Acrisols, and Nitisols being the most common soil types 400 in more than 90% of the CABra catchments (Fig. 9a). The Ferrasols were the dominant soil type in approximately 75% of the catchments, typical of equatorial and tropical regions, which have an advanced stage of weathering of their constitutive material, being normally deep (>1m), well-drained, and acidic soils (high pH levels can occur in areas with a strong dry season, such as observed in the Caatinga biome). Acrisols are formed mainly by minerals, with an evident increase in the clay content from the surface to horizon B, with variable depth and drainage, but always with high acidity. The third most 405 common soil type is the Nitisols, which have a clay texture, with a well-developed B horizon structure, and are usually deep and well-drained with moderate acidity (EMBRAPA, 2018).
We noted that most of the catchments present soil texture dominated by sand and clay (Fig. 9c, Fig. 9d, and Fig. 9e). Southeastern, northern, and central regions of Brazil are dominated by sandy clay loam soils, while the southern portion is dominated by clay, which can reach up to 80%, making this region one of the most productive in terms of agriculture in 410 Brazil. By the employment of the USDA texture triangle, we found 6 classes: clay, clay loam, loam, sandy clay, sandy clay https://doi.org/10.5194/hess-2020-521 Preprint. Discussion started: 14 October 2020 c Author(s) 2020. CC BY 4.0 License. loam, and sandy loam (see Fig. 9b). The soils presenting a clay and clay loam texture are in the southern portion, especially where the Nitisols occur, which is also the region with a significant portion of the Brazilian agricultural production.
Most of the catchments present a mixed texture, the sandy clay loam, which extends from the south through the central to the northern regions of Brazil. There is a spatial correlation between the soil organic carbon, bulk density, and the depth to bedrock, as we can see in Fig. 9f, Fig. 9g, and Fig. 9h. In the southern and south-eastern portions, especially in the Atlantic Forest biome, we have a combination of high soil organic carbon, low bulk density, and shallow depth to bedrock. These characteristics, allied to the favorable climate, make this kind of soil attractive to agriculture. On the other hand, other Brazilian regions present the opposite.

Methodology
The CABra dataset presents four attributes related to the geology of the catchments (Table 7), namely the predominant lithology class, the subsurface porosity, the subsurface permeability, and the subsurface hydraulic conductivity. The lithology class is derived from the Global Lithologic Map (GLiM) (Hartmann and Moosdorf, 2012). The GLiM is a high-resolution global dataset that describes the geochemical, mineralogical, and physical properties of the rocks in 16 main lithological classes. Moreover, GLiM allows us to better understand the geology of smaller areas, such as our CABra catchments. We also use a GLiM-derived product of subsurface porosity and permeability named GLobal HYdrogeology MaPS (GLHYMPS), developed by Gleeson et al. (2014). The GLHYMPS is the first large-scale high-resolution mapping of porosity and permeability and fills the lack of robust and spatially distributed subsurface geology maps.
Porosity, the fraction of void space in a material (soil in our case), controls how much fluid (water) can be stored in this material, or in the soil subsurface. The movement of the stored water in the soil is controlled by the permeability, which is the capacity of a porous material (again, soil) to transmit fluids. Both parameters are fundamental to the knowledge of fluid flow rates and their impacts on Earth's subsurface. When using this kind of high-resolution data for large-scale studies, we can improve our understanding of the dynamics between groundwater and land surface. Considering the saturated hydraulic conductivity as one of the most important physical properties in the quantitative and qualitative assessment of the water movement in the soil, we present its values in the CABra dataset. Following the assumption that the hydraulic conductivity is separable into the contributions of the porous matrix of the soil and of the density and viscosity of the fluid, we estimated the hydraulic conductivity of the CABra catchments using its relation to the permeability (Equation 4), as described in Grant (2005).
$$K = \frac{k \rho g}{\mu} \qquad (4)$$

where K is the subsurface hydraulic conductivity, k is the subsurface permeability, ρ is the density of the fluid, g is the gravitational acceleration (9.8 m s⁻²), and µ is the viscosity of the fluid. In our study, we have considered water as the fluid, so we have used ρ = 999.97 kg m⁻³ and µ = 0.001 kg m⁻¹ s⁻¹.
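The permeability-to-conductivity conversion described above can be sketched in a few lines of Python. The constants are the values stated in the text; the function names are hypothetical, and the log10 helper is only an assumption about how one might handle permeabilities that are stored in log scale.

```python
import math

RHO_WATER = 999.97  # fluid density, kg m^-3 (value stated in the text)
GRAVITY = 9.8       # gravitational acceleration, m s^-2 (value stated in the text)
MU_WATER = 0.001    # fluid viscosity, kg m^-1 s^-1 (value stated in the text)


def hydraulic_conductivity(k_m2, rho=RHO_WATER, g=GRAVITY, mu=MU_WATER):
    """Hydraulic conductivity K (m s^-1) from permeability k (m^2): K = k*rho*g/mu."""
    if k_m2 <= 0:
        raise ValueError("permeability must be positive")
    return k_m2 * rho * g / mu


def log10_conductivity(log10_k):
    """Convert a log10 permeability (k in m^2) into a log10 conductivity (K in m s^-1)."""
    return math.log10(hydraulic_conductivity(10.0 ** log10_k))


# Hypothetical input: a log10 permeability of -13.4 (k in m^2).
print(round(log10_conductivity(-13.4), 2))
```

For water, the factor ρg/µ is close to 10⁷ m⁻¹ s⁻¹, so the conversion shifts log-scale permeability up by roughly seven orders of magnitude; a log10 permeability of -13.4 maps to a log10 conductivity of about -6.4.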

Results and discussion
Regarding the lithology class, the catchments present 10 different classes according to the GLiM dataset: siliciclastic sedimentary rocks, acid volcanic rocks, unconsolidated sediments, acid plutonic rocks, metamorphic rocks, mixed sedimentary rocks, basic volcanic rocks, carbonate sedimentary rocks, intermediate volcanic rocks, and pyroclastic rocks (Fig. 10). We found that 35% of the catchments have metamorphic rocks as the most common lithologic class, a result of continuous weathering on the original rock. These catchments are located especially in the southern portion of Brazil, in mountainous areas. Approximately 39% of CABra catchments are formed by sedimentary rocks, considering their subdivision into siliciclastic, unconsolidated, and mixed classes, which result from sediment deposition. They are mostly located in flat areas, such as in the Paraná River Basin and São Francisco River Basin, in the central and north-eastern portions of Brazil. In 25% of the catchments, igneous rocks (plutonic and volcanic), resulting from volcanic eruptions, are the most common lithology class. These catchments are located mainly in the Atlantic Forest biome, although we can find some catchments in the Amazon.
With respect to the subsurface porosity, most CABra catchments presented values lower than 20%, with a mean value of 10%. Catchments in the Atlantic Forest presented the lowest values of the catchment set. Results regarding the subsurface permeability and hydraulic conductivity reinforce the heterogeneity and random occurrence of these soil properties. As we can see in Fig. 10c and Fig. 10d, there is no well-defined spatial behavior for them. Subsurface permeability ranges from -14 to -12 in log scale (k in m²), with a mean of -13.4, while the subsurface hydraulic conductivity presented a mean value of -6.4 in log scale (K in m s⁻¹), varying between -10 and -4.

Methodology
The CABra dataset presents 14 attributes regarding the land-cover and land-use of the Brazilian catchments (Table 8). They are related to the area-averaged land-cover and land-use itself (dominant cover type, and the cover fractions of 9 main classes of use: bare soil, forest, grass, shrub, moss, crops, urban, snow, and water) and to the area-averaged intra-annual variability of the vegetation biomass, here represented by the Normalized Difference Vegetation Index (NDVI). The land-cover and land-use map used in the CABra dataset is the Copernicus Global Land Cover, which has a 100-m spatial resolution, results from a classification of the PROBA-V satellite observations of the year 2015, and follows the UN FAO Land Cover Classification System (Buchhorn et al., 2019). It is available at https://land.copernicus.eu/global/lcviewer.
As an indicator of the vegetation biomass of the land-cover through the year, we use the seasonal NDVI for each CABra catchment, derived from the Long Term Statistics (LTS) of the NDVI from the Copernicus Global Land Service. This dataset is an NDVI mean for each month of the year during the 1999-2017 period, obtained from the SPOT-VGT and PROBA-V sensors at a 1-km spatial resolution, available at https://land.copernicus.eu/global/products/ndvi. The NDVI is obtained from the normalized difference between the spectral reflectance in the red and near-infrared bands of the satellite image (Tucker, 1979) (Equation 5) and ranges from -1 to +1, with the highest values attributed to areas with greater vegetation cover.
$$NDVI = \frac{NIR - RED}{NIR + RED} \qquad (5)$$

where NIR is the surface spectral reflectance in the near-infrared band and RED is the surface spectral reflectance in the red band.

We observed that most of the Brazilian catchments are covered by forest and grass (Fig. 11). Shrub is the dominant cover for most of the Caatinga catchments, while grass is the dominant one in the Cerrado (tropical savannah). The forest cover is dominant especially in the Amazon and Atlantic Forest, as these two biomes are known for tropical forest occurrence. Even though the forest cover is not the most common for all the CABra catchments, ~85% of them present at least 20% of it (Fig. 11b). The grass cover fraction presented values up to 40% of the area for most of the catchments but reached 60% in some cases (Fig. 11c). The highest values were found in the Cerrado and Atlantic Forest biomes, in central and south-eastern portions of Brazil.
Large areas of natural cover were converted to agricultural lands (including crops and pasture) in past years (Gibbs et al., 2010, 2014), and satellite sensors and classifier algorithms cannot separate natural grassland from pasture/managed grasslands, as described in the PROBA-V documentation. Figure 11d gives us a better idea of this. The fraction of shrub cover in the Cerrado is probably the remaining natural cover of this biome, since this is the expected type of vegetation. As seen in Fig. 11e, only a few catchments present crops as the dominant cover type, mostly in the central and southern regions, but we can also see the great fraction of crop cover in the MATOPIBA region, one of the largest agricultural frontiers in Brazil (Gibbs et al., 2014; Pires et al., 2016; Spera et al., 2016). Figure 11f shows that there are only a few cases of urban catchments, within or close to major Brazilian cities, showing that the CABra dataset is mainly composed of either natural or minimally (hydrologically) modified catchments.

The CABra dataset presents 6 attributes related to the hydrologic disturbances on catchments' water fluxes (Table 9).
Anthropic changes in water flux patterns, which happen outside the range of natural flow and climate extremes, can directly impact the water availability and quality, stream channel geometry and sedimentation, and the equilibrium of ecosystems (Boulton et al., 1992; Coleman et al., 2011; Whited et al., 2007). Natural conditions of catchments are constantly modified by human interactions such as land-cover and land-use changes, flow regulation, water abstractions, soil impermeabilization, and many others, which can drastically alter the way hydrologic fluxes in the catchments respond.
Considering the relevance of the abovementioned human interactions, we provide information about the number and volume of the reservoirs (which can regulate streamflow) and the water demand extracted from ANA (2017). Using some of the CABra attributes, we have also created a hydrologic disturbance index, which readily gives CABra users the degree of human interactions that can modify water fluxes in each catchment. In the development of this index, we have considered the fraction of urban cover in each catchment, the distance to the nearest urban area of each catchment, the number of reservoirs [...]

The hydrologic disturbance index is dimensionless, and its terms are the normalized fraction of urban cover, the normalized distance to the nearest urban area, the normalized fraction of crops cover, the normalized number of reservoirs, the normalized percentage of the catchment's area covered by reservoirs, the normalized reservoirs' regulation capacity of the catchment's mean annual flow, and the normalized catchment's annual water demand.

The results of the spatial distribution of the hydrological disturbance index and its components are shown in Fig. 13. Most CABra catchments are close to an urban cover (it can be a large city or a small village), with a distance of up to 10 km.
However, we could also find catchments up to 100 km away from the urban cover. As seen in Fig. 13b and Fig. 13c, most CABra catchments present a fraction of urban cover up to 10%, with high values close to large cities, and a fraction of crops cover up to 40%, with the highest values in central and southern portions. As these factors carry a high weight in the hydrological disturbance index, they are a good indicator of the most disturbed catchments.
Results from the reservoirs in CABra catchments are shown in Fig. 13d, Fig. 13e, Fig. 13f, and Fig. 13g. The number of reservoirs in the catchment ranges from zero to 48,404. Even though we found the largest number of reservoirs in a large catchment, this relationship is not linear. There are some catchments, especially in the São Francisco River Basin, which present an extremely high number of reservoirs due to the low amounts of annual precipitation and intensive drought in the region. Moreover, catchments in the São Francisco River Basin present the highest values of the total volume of reservoirs.
These reservoirs are used for many anthropogenic purposes, such as hydroelectric power plants, irrigation, drinking water supply, fish-farming, and recreation. These high values of the total volume of reservoirs, especially in the drier regions, could lead to a strong streamflow regulation, as seen in Fig. 13g. In most of the CABra catchments, reservoirs can regulate up to 25% of the annual flow, but there are some cases in the Caatinga biome where the regulation capacity reaches up to ten times the annual flow, making these catchments susceptible to non-natural events.
The water demand in CABra catchments ranges from zero (in the Amazon) to 171 mm year⁻¹ (in the Caatinga) and is related to drinking water supply and irrigation of agricultural areas (Fig. 13h). The integrated analysis of the above-mentioned attributes is shown in Fig. 13i, as the new hydrological disturbance index. Most of the CABra catchments present an index value of up to 0.2, indicating a low anthropic interference on water fluxes. Higher values, above 0.4, indicate catchments with significant interference on water fluxes, which may be related to one or more terms of the equation. High values of the hydrological disturbance index in the central and southern portions of Brazil may be related to agricultural development, while in the south-eastern part they may be related to urbanization, and in the north-eastern part to the presence of numerous voluminous reservoirs. As expected, low values were found in the Amazon and in mountainous areas of the Atlantic Forest. The hydrological disturbance index can be especially useful for the users of the CABra dataset, allowing them to quickly view the general state of the anthropogenic interferences on water fluxes, which is an important consideration in a wide range of studies.
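To make the idea of an index built from normalized terms more concrete, the following Python sketch combines per-catchment attributes into a value between 0 and 1. It is only an illustration under loud assumptions: min-max normalization, equal weights by default, and an inverted distance term (a shorter distance to an urban area meaning more disturbance) are all hypothetical choices, since the actual weights and normalization of the CABra index cannot be recovered from this text. All names are hypothetical as well.

```python
def min_max_normalize(values):
    """Scale a list of values to the 0-1 range (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def disturbance_index(components, weights=None, inverted=()):
    """Weighted mean of normalized components, one value per catchment.

    components: dict mapping an attribute name to a list of raw values.
    weights:    dict of weights per attribute (assumption: equal if omitted).
    inverted:   names whose normalized value is flipped (1 - value),
                e.g. distance to the nearest urban area.
    """
    names = list(components)
    n = len(components[names[0]])
    if any(len(components[name]) != n for name in names):
        raise ValueError("all components need one value per catchment")
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)

    index = [0.0] * n
    for name in names:
        normalized = min_max_normalize(components[name])
        if name in inverted:
            normalized = [1.0 - v for v in normalized]
        for i, value in enumerate(normalized):
            index[i] += weights[name] * value / total_weight
    return index


# Hypothetical values for three catchments.
example = {
    "urban_fraction": [0.00, 0.05, 0.10],
    "distance_to_urban_km": [100.0, 10.0, 0.0],
    "crops_fraction": [0.0, 0.4, 0.2],
}
print(disturbance_index(example, inverted=("distance_to_urban_km",)))
```

Because every term is scaled to the same range before being combined, attributes with very different units (fractions, kilometres, counts, mm year⁻¹) can contribute to a single dimensionless value.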

Comparison with the CAMELS-BR and broader implications for hydrological studies
The CABra and the CAMELS-BR (Chagas et al., 2020) both contain large samples of hydroclimatic, landscape, and other attributes for Brazilian catchments. Their striking similarities in concept and goals highlight the urgent need for the creation of such a database for Brazilian catchments. However, it is important to notice that multiple differences exist between the two datasets, as we discuss below.

The first main difference between CABra and CAMELS-BR is related to the catchment delineation procedures adopted.
CAMELS-BR uses the basin masks from the GSIM (Do et al., 2018) product, where a 500-m digital elevation model was used for the delineation of catchment boundaries and extraction of topographic indices. GSIM has a quality filter allowing for up to 50% of error in the catchment area when compared with ANA's value, as described in Do et al. (2018). As previously explained, the CABra catchment boundaries (delineated using streamflow gauge locations from ANA) use a high-definition (90-m) elevation product. We have manually inspected each of the 735 catchments to minimize further errors, correcting the geographic position of the outlet to coincide with the stream network, achieving a mean error of 2% against ANA's areas. It is important to highlight that a suitable watershed delineation is of paramount importance for catchment hydrology studies because errors in this process are further propagated to all computed attributes that depend on area and location.

Regarding the daily streamflow data, in the CABra dataset we have retained catchments with less than 10% missing streamflow records over 30 hydrologic years (1980-2010), which resulted in the final selection of 735 catchments. On the other hand, CAMELS-BR contains 897 catchments with less than 5% missing data, while considering 20 hydrologic years (1990-2009). Our choice of a longer time series was predicated on the commonly adopted rationale which assumes 30 years as the basis for establishing long-term climatology as well as hydrologic indices (Huntingford et al., 2014; Tetzlaff et al., 2017), which we in turn believe will lead to better characterization of the hydrological and climatological processes taking place.
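The record-completeness criterion discussed above can be sketched as a simple check on a daily series. This is a minimal illustration, not the processing code used for CABra: the function names are hypothetical, the series is assumed to be a dictionary of dates to values (with `None` or an absent date counting as missing), and the exact start and end dates of the hydrologic years are left as inputs because they are not specified here.

```python
from datetime import date, timedelta


def missing_fraction(records, start, end):
    """Fraction of days in [start, end] without a valid streamflow value.

    records: dict mapping datetime.date to a float, or to None when missing.
    """
    n_days = (end - start).days + 1
    if n_days <= 0:
        raise ValueError("end must not be earlier than start")
    n_missing = 0
    day = start
    while day <= end:
        if records.get(day) is None:
            n_missing += 1
        day += timedelta(days=1)
    return n_missing / n_days


def passes_selection(records, start, end, max_missing=0.10):
    """True when the missing fraction is strictly below the threshold."""
    return missing_fraction(records, start, end) < max_missing


# Hypothetical 20-day series with a single missing day.
series = {date(2000, 1, 1) + timedelta(days=i): 1.0 for i in range(20)}
series[date(2000, 1, 5)] = None
print(passes_selection(series, date(2000, 1, 1), date(2000, 1, 20)))
```

Setting `max_missing=0.05` would mimic the stricter threshold reported for CAMELS-BR, applied over its own 20-year period.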
Another important difference between both datasets is related to the choice of databases used for providing the daily meteorological time series and estimating the related indices. CAMELS-BR uses three widely used gridded datasets (based on remote sensing/reanalysis/gauge blends of rainfall), i.e., CHIRPS v2.0, CPC, and MSWEP v2.2, with the first one chosen for the climatic indices (because of its spatial resolution of 0.05° × 0.05°), while the CABra uses the Xavier et al. (2016) [...] (Melo et al., 2015). Other uses of this dataset include the evaluation of precipitation from downscaled global circulation models (Almagro et al., 2020), as well as other meteorological variables used in regional studies (Battisti et al., 2019; Bender and Sentelhas, 2018; Monteiro et al., 2018), aside from being widely used for hydrological studies (Almagro et al., 2017; Avila-Diaz et al., 2020; Lima and AghaKouchak, 2017; Souza et al., 2016).
Additional differences in the meteorological time series are also worth noting. CAMELS-BR provides the model-based PET estimates extracted from the GLEAM product (Martens et al., 2017), while daily temperatures (maximum, minimum, and average) are the only PET-related variables provided in a daily time series format. The CABra dataset provides the PET computed following 3 widely used methods, along with all necessary variables for its computation, such as solar radiation, wind speed, temperature, and relative humidity. Our choice to compute PET instead of using model-based estimates should allow for more transparency and reproducibility of results obtained using our dataset. Also, the choice of providing a wider range of meteorological variables allows the user to estimate PET based on different methods while enhancing the reach of our dataset for studies that might benefit from additional meteorological variables.
While the soil and geology attributes of both CABra and CAMELS-BR are derived from the same data sources (i.e., the SoilGrids250m, the GLiM, and the GLHYMPS v2.0), CABra provides the following additional variables not available in CAMELS-BR: subsurface permeability (subsurface hydraulic conductivity for geology attribute), soil type, textural class, and soil bulk density, which can be used to estimate soil porosity. Regarding groundwater attributes, CABra contains the rock type and name of the aquifer, the water table depths from Fan et al. (2013), and the HAND estimates, while CAMELS-BR contains only the water table depth estimates from Fan et al. (2013).
In terms of land-cover attributes, CABra and CAMELS-BR present similar attributes, but the data source is different. CABra adopted a product with a higher spatial resolution (100-m against 300-m) and more recent observations (2015 against 2009) than CAMELS-BR. Due to this better spatial resolution, we chose to use a more recent land-cover product, even though it falls outside the timespan of the hydrologic time series. CABra also brings information about the seasonal vegetation biomass of the catchment, in terms of NDVI, which is not present in CAMELS-BR.
Finally, both datasets take into account the human influence within each catchment, which is essential to a holistic understanding of the catchment behavior due to anthropogenic interactions and is lacking in most large-sample datasets. CAMELS-BR presents data about water use, the volume of reservoirs, and the degree of regulation of the reservoirs. However, there is no combination or integration of these attributes in a specific index or approach. On the other hand, CABra presents eight attributes, i.e., distance to urbanization, the fraction of non-natural land-cover (crops and urban areas), water demand, reservoirs' count, area, volume, and streamflow regulation capacity (the last two are also found in CAMELS-BR), which can affect the hydrologic behavior of the catchment in terms of water quantity, quality, and regulation. Additionally, we developed a new hydrologic disturbance index (HDI), which considers all eight attributes mentioned above. The HDI is a quantitative index of the level of anthropization, being reproducible and practical for identifying more or less human-impacted catchments.

Conclusions
In this study, we have collected, synthesized, organized, and made available more than 100 topography, climate, streamflow, groundwater, soil, geology, land-use and land-cover, and hydrologic disturbance attributes for 735 catchments in Brazil. To do so, we have used several sources, such as observed time series, observed and modeled gridded data, remote sensing data, and reanalysis data. Moreover, we have calculated some attributes to provide more accurate data than those available in the literature, including potential evapotranspiration, and to provide previously unavailable data, such as the hydrological disturbance index. As this dataset deals with catchment-scale averaged attributes, we have paid particular attention to DEM resolution and catchment delineation, while also manually inspecting each of the CABra catchments.
The development of the CABra dataset opens up several opportunities to test and develop hypotheses in a unique environment like Brazil, with its vast and rich diversity in hydrology and landscapes. Finding relationships between the catchments' attributes will enable hydrologists to identify the drivers of the water fluxes in the catchment. We hope our dataset will aid catchment classification efforts that will ultimately unravel the underlying dominant controls of Brazilian regional hydrology across space and time. At the same time, the CABra dataset covers hydroclimatologic and ecologic regions that are fundamentally different from those covered by other similar large-sample datasets (United States, Great Britain, Chile, etc.), complementing global assessments and expanding the possible uses of our dataset to multiple scientific areas, such as geology, agronomy, and ecohydrology.
We intend to expand the CABra dataset in the future. Information and attributes related to relevant fields of work, such as soil erosion, ecology, biology, and chemistry, as well as climate change projections, will be added to the CABra dataset in future releases. Thus, CABra represents a robust multi-source data collection effort for Brazil and is intended to play a key role in advancing the scientific understanding of climate-landscape-hydrology interactions. As such, we hope it will guide large-sample hydrology investigations and pave the way for testing novel hypotheses by both the Brazilian and the international scientific community.

Author contribution
AA, PTSO, AAMN, and PT conceived the ideas and designed the methodology for the study; AA collected, processed, and analyzed the data; AA, PTSO, and AAMN led the writing of the initial draft; TR and PT edited and reviewed the manuscript; all authors contributed and gave final approval for publication.