Nonstationary weather and water extremes: a review of methods for their detection, attribution, and management

Hydroclimatic extremes such as intense rainfall, floods, droughts, heatwaves, and wind/storms have devastating effects each year. One of the key challenges for society is understanding how these extremes are evolving and likely to unfold beyond their historical distributions under the influence of multiple drivers such as changes in climate, land cover, and other human factors. Methods for analysing hydroclimatic extremes have advanced considerably in recent decades. Here we provide a review of the drivers, metrics and methods for the detection, attribution, prediction and projection of nonstationary hydroclimatic extremes. We discuss issues and uncertainty associated with these approaches (e.g. arising from insufficient record length, spurious nonstationarities, or incomplete representation of nonstationary sources in modelling frameworks), examine empirical and simulation-based frameworks for analysis of nonstationary extremes, and identify gaps for future research.

Figure 1. What is nonstationarity? Examples of (a) a stationary time series with constant mean and variance; and (b) three nonstationary time series in the form of a shift in mean (trend, step-change) and a shift in variance. Solid and dashed black lines represent the mean and the variance of the time series, respectively.

[...] Lettenmaier, 2006;Spinoni et al., 2017), and heatwaves/heat stress (e.g. Oliver et al., 2018;Lorenz et al., 2019;Ouarda and Charron, 2018) have been widely detected, and have led to proclamations about the "death" of stationarity in water management (Milly et al., 2008). In contrast, trends in strong winds/storms are less certain and the influence of anthropogenic climate change is difficult to attribute (Shaw et al., 2016;Elsner et al., 2008;Martínez-Alvarado et al., 2018;Wohland et al., 2019).
Disentangling natural and anthropogenic drivers of nonstationarity is problematic, as the two can be interlinked, and even seemingly 'natural' drivers such as climate modes may shift under the effects of anthropogenic climate change (e.g. Maher et al., 2018). Although the drivers of abrupt nonstationarities (step-changes) may be apparent (e.g. water abstraction, reservoir filling and operations), drivers of more incremental nonstationarities (trend/variance) (e.g. climate variability and change, and land cover change) may be harder to attribute and/or obfuscated by other confounding factors.
Deciding whether to employ nonstationary methods for the purpose of managing extremes is thus a key challenge facing weather and water scientists and practitioners today (e.g. Faulkner and Sharkey, 2020), as it is unclear how to represent increased uncertainty arising from climate change (e.g. Wasko et al., 2021). Detecting the presence, and attributing the source, of nonstationarity in hydroclimatic extremes is nevertheless vital for understanding and managing water resources in a changing world. Nonstationarity may have dramatic impacts for infrastructure (François et al., 2019), property, and society over a range of overlapping timescales. Many different drivers may alter extremes simultaneously over both the short-term (such as human management impacts) and medium- to long-term (such as land cover and climate change impacts) (e.g. Warner, 1987;Rust et al., 2019). Short-term trends are widely present in variables like streamflow, which exhibit long memory (time-dependence) and periodicities. However, such trends are not necessarily indicative of nonstationarity (e.g. Koutsoyiannis, 2006;Koutsoyiannis and Montanari, 2015). "Nonsense correlations" can be found between time-varying variables without any physical significance (Yule, 1926), and apparent trends or oscillations may arise in time series from purely random causes (Slutzky, 1937). In fact, random fluctuations or deviations from the mean are to be expected in stationary time series (e.g. Wunsch, 1999), especially those with long memory. Multi-decadal (e.g. 30-year) shifts may simply be temporary excursions in longer (e.g. 100-year) records, or a function of the start and end dates chosen for the trend analysis (Harrigan et al., 2018), and may not warrant application of nonstationary analysis.
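This point can be illustrated with a short simulation (a purely illustrative sketch, not taken from the cited studies): a stationary AR(1) series with strong autocorrelation routinely displays large apparent "trends" over 30-year windows, echoing the warnings of Yule (1926) and Slutzky (1937).

```python
# Illustrative sketch: apparent trends in a stationary long-memory-like series.
import numpy as np

rng = np.random.default_rng(42)

def ar1_series(n, phi=0.9, sigma=1.0):
    """Generate a stationary AR(1) series: x[t] = phi * x[t-1] + noise."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(0.0, sigma)
    return x

years = 300
x = ar1_series(years)

# Least-squares slope fitted to every 30-year moving window.
window = 30
t = np.arange(window)
slopes = np.array([np.polyfit(t, x[i:i + window], 1)[0]
                   for i in range(years - window + 1)])

# Despite stationarity, some windows show strong apparent "trends".
print(f"max |30-yr slope|: {np.abs(slopes).max():.3f}")
```

The series is stationary by construction, yet the distribution of 30-year slopes is wide, so a single multi-decadal trend is weak evidence of nonstationarity on its own.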
A growing body of literature has shown that inappropriately applying nonstationary models to short time series may have the undesired effect of increasing uncertainty; in cases where model structure and/or underlying physical drivers are uncertain, stationary models may be the preferred option for design and management of extremes (Serinaldi and Kilsby, 2015). Nonstationarity should not be presumed to occur everywhere simply because of climate change (Lins, 2012); nonstationary approaches have value primarily when there are good reasons to suspect physically-plausible and predictable drivers of change.
Over the last two decades, hundreds of papers have addressed the detection, attribution, prediction and projection of nonstationarity in precipitation, floods, drought, heat stress, temperature, extreme winds and storms. However, a comprehensive, introductory overview of these methods across hydroclimatic extremes, including an overarching discussion of the key challenges that can arise, has not been published to date. This manuscript is the first to review the entire nonstationary management process, from identifying metrics to analysis through to management. This paper offers a synthesis of methods for quantifying, attributing, and managing nonstationarity over multiple spatial and temporal scales, along with their limitations. The structure follows the logical order of steps employed in a detection, attribution and management framework (Figure 2). Challenges are presented for each step throughout the paper. Section 2 describes the most widely used indices for diagnosing the symptoms of nonstationarity via changes in magnitude, frequency and timing. Section 3 identifies essential data pre-requisites for detecting nonstationarity, such as homogeneity analysis. Section 4 discusses the techniques used to detect nonstationarity, including regression-based methods for gradual change, pooled approaches for analysis of rare extremes, step-change analysis for abrupt change, and other methods for discerning changing seasonality. Section 5 introduces the key drivers of nonstationarity of hydroclimatic extremes. Section 6 reviews approaches for attribution of nonstationary extremes, including both observation- and model-based approaches, and issues with attribution and engineering design under nonstationarity. Finally, Section 7 discusses approaches to manage nonstationarity via engineering design and model projections, including key limitations.
The overall aim of the paper is to highlight the most significant issues and considerations when detecting, attributing, predicting and projecting nonstationarity in hydroclimatic extremes.

2 Symptoms of nonstationarity
Nonstationarities in hydroclimatic extremes may be expressed through a significant shift in the mean, variance, or shape of a given time series (Figure 1). Such departures are generally diagnosed by symptoms such as a change in the magnitude (events becoming more or less extreme), frequency (events occurring more or less often than before) and timing (events occurring earlier or later in the year) of seasonal or annual extremes. There are many different ways of describing the symptoms of nonstationarity; each is discussed in turn below, and examples are provided in Table 1.

Magnitude
Significant changes in the magnitude of extremes are relevant to society, engineers, decision-makers, and insurers alike. The magnitude or intensity of an event is generally described by estimating the percentiles of a distribution over a given period.
Examples may include the peak rainfall, peak streamflow, peak intensity of a drought, maximum/minimum temperature, or peak wind speed within a given day, month, season or year (Figure 3). Significant changes are detected by evaluating alterations in these percentiles over time (e.g. a time series of annual maximum daily streamflow). More generally, magnitudes can also be described via metrics characterising the intensity or flashiness of an event, such as its spatial extent, duration or time-to-peak (Figure 3).
The magnitude or intensity of precipitation (in mm, Figure 3a) can be assessed using various metrics. Many of these belong to the set of indices proposed in 2002 by the Expert Team on Climate Change Detection and Indices (ETCCDI) (see e.g. Frich et al., 2002;Zhang et al., 2011). Precipitation metrics include the maximum depth of precipitation accumulation for a given duration (1-, 3-, or 6-hour; 1- or 5-day, termed Rx1day and Rx5day respectively) within a given month, season, or year (e.g. Champion et al., 2019;Sun et al., 2020a); the percentage of rain that fell in the monthly maximum 1-hour precipitation (or some other period); the 90th, 95th or 99th percentile precipitation amount (over 1-, 3-, or 6-h) during a month/year, or specifically on wet days (Moberg and Jones, 2005); or even the total precipitation accumulated from hours exceeding specified percentiles over a month/year (e.g. Donat et al., 2013). For instance, at the global scale, analysis of the Rx1day and Rx5day precipitation accumulations found that extreme precipitation has increased at about two thirds of stations, a significantly greater proportion than can be expected by chance (Sun et al., 2020a). An important concept is the 'probable maximum precipitation' (PMP), i.e. the greatest depth of precipitation that is possible in a given place and time and for a given storm duration. For a complete state-of-the-art review of the PMP concept, see Salas et al. (2020). PMP can be computed via hydrometeorological, statistical, grid-based and site-specific approaches using both stationary and nonstationary methods (e.g. Lee and Singh, 2020). PMP is expected to increase in many regions in future decades due to increases in atmospheric moisture content and moisture transport into storms (Kunkel et al., 2013).
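The Rx1day and Rx5day indices described above can be sketched in a few lines (a minimal illustration on synthetic data; the variable names are not from any specific software package):

```python
# Sketch of ETCCDI-style Rx1day / Rx5day: annual maxima of 1-day and
# consecutive 5-day precipitation accumulations.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("1991-01-01", "2020-12-31", freq="D")
# Synthetic daily precipitation (mm): mostly dry days, occasional heavy rain.
precip = pd.Series(
    rng.gamma(0.4, 8.0, len(dates)) * (rng.random(len(dates)) < 0.3),
    index=dates)

rx1day = precip.resample("YS").max()                  # annual max 1-day total
rx5day = precip.rolling(5).sum().resample("YS").max() # annual max 5-day total

print(rx1day.head(3).round(1))
print(rx5day.head(3).round(1))
```

A trend test applied to the resulting annual series is then a typical first step in detecting nonstationarity in precipitation magnitude.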
Figure 3. [...] Drought; (d) Temperature; (e) Wind (storms). All these variables can be described using indicators of event duration and magnitude (peak intensity).

Percentiles of the daily streamflow distribution are commonly used for floods (Figure 3b). Examples include the Q90 (the 90th percentile of the distribution, i.e. the flow that is exceeded 10% of the time; confusingly, this flow is referred to as 'Q90' in North America and 'Q10' in the British and Irish islands), Q95, Q99 (flow that is exceeded 3.65 days per year, on average), Q99.9 (0.37 days per year), and AMAX (the annual maximum streamflow). When considering events more extreme than AMAX, hydroclimatic extremes are commonly expressed as 1-in-20, 1-in-50, or 1-in-100 year events (e.g. Milly et al., 2002;Slater et al., 2021).
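These flow percentiles and the AMAX series can be computed directly from a daily record (an illustrative sketch on synthetic data; Q90 follows the North American convention described above):

```python
# Sketch: flow percentiles and AMAX from a daily streamflow series.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dates = pd.date_range("2000-01-01", "2019-12-31", freq="D")
flow = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=len(dates)),
                 index=dates)

q90 = np.percentile(flow, 90)     # exceeded 10% of the time
q95 = np.percentile(flow, 95)
q99 = np.percentile(flow, 99)     # exceeded ~3.65 days/yr on average
amax = flow.resample("YS").max()  # annual maximum daily streamflow

exceed_days_per_year = (flow > q99).sum() / 20
print(f"Q90={q90:.1f}, Q95={q95:.1f}, Q99={q99:.1f}")
print(f"days/yr above Q99: {exceed_days_per_year:.2f}")
```

Note that the average exceedance rate of Q99 works out near 3.65 days per year by construction of the percentile.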
Hydrograph analyses may also be used to extract metrics such as flood volume and duration (Figure 3b). Other indicators describe the flashiness of an event, such as the time-to-peak, which is defined as the total number of hours from the sharp rise of the hydrograph until the peak discharge. The spatial extent of a flood event can be described using metrics such as the number of catchments flooding simultaneously (Uhlemann et al., 2010) or the 'flood synchrony scale' (FSS), which evaluates the largest radius around a stream gauge where more than half of the surrounding stream gauges also record flooding within the same week (Berghuijs et al., 2019a). Studies are increasingly evaluating the spatial dependence of flooding across multiple basins by considering meteorological, temporal and land surface processes leading to simultaneous flooding across varying spatial scales (e.g. Kemter et al., 2020;Wilby and Quinn, 2013). Droughts (Figure 3c) differ from other extreme weather because they develop more slowly and last longer; they are broadly defined as "a sustained period of below-normal water availability" (Tallaksen and Van Lanen, 2004), sometimes referred to as a "creeping phenomenon" (Mishra and Singh, 2010;Wilhite, 2016). These characteristics make it more difficult to assess nonstationarity, as there are fewer events to compare over time and drought onset and termination are challenging to pinpoint (Parry et al., 2016). Not all droughts are defined by aridity; rainfall deficit alone does not imply a drought (Van Loon, 2015).
Instead, a combination of factors in the hydrological cycle interact to yield below-normal conditions (Figure 3c). Drought has typically been classified as: meteorological (precipitation deficit); hydrological (surface and subsurface water deficit relative to local water uses); agricultural (declining soil moisture and crop failure); or socioeconomic (failure of water resources systems to meet demand) (Van Loon, 2015). Since the definition of "normal conditions" depends on spatial and temporal scales, drought anomalies are typically defined locally using composite indicators, including precipitation and temperature. Various metrics exist to measure drought stress; they either reflect deficits in precipitation, or combine precipitation, temperature and evaporation. The World Meteorological Organisation (WMO) recommends the use of the Standardized Precipitation Index (SPI), which reflects standard deviations from normal rainfall (WMO, 2016;McKee et al., 1993). Other well-known indices that rely on monthly precipitation include the Palmer Drought Severity Index (PDSI) (Palmer, 1965), deciles (Gibbs and Maher, 1967), and the rainfall anomaly index and its modified version (RAI/mRAI) (e.g. Hänsel et al., 2016). Some metrics combine variables; for instance, PDSI is computed using precipitation and potential evapotranspiration. The Standardized Precipitation Evapotranspiration Index (SPEI) can be interpreted similarly to SPI, but reflects both evaporative demand and precipitation inputs to a system (Vicente-Serrano et al., 2010). The spatial characteristics of drought have also become increasingly relevant, as more studies examine their areal extent over time, using metrics like spatial patterns of drought intensity mapped across livelihood zones over time (e.g. Leelaruban and Padmanabhan, 2017;Mekonen et al., 2020).
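The core of the SPI calculation can be sketched as follows (a deliberately simplified illustration on synthetic data: operational SPI implementations, following McKee et al., 1993, additionally accumulate precipitation over 1-12 month windows and handle zero-precipitation months explicitly):

```python
# Simplified SPI sketch: fit a gamma law per calendar month, then map the
# fitted cumulative probability onto a standard normal quantile.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_years = 50
# Synthetic monthly precipitation totals (mm).
monthly = rng.gamma(shape=2.0, scale=40.0, size=(n_years, 12))

spi = np.empty_like(monthly)
for m in range(12):
    sample = monthly[:, m]
    # Fit a gamma distribution to this calendar month across all years...
    shape, loc, scale = stats.gamma.fit(sample, floc=0)
    # ...then transform each value to its standard-normal equivalent.
    cdf = stats.gamma.cdf(sample, shape, loc=loc, scale=scale)
    spi[:, m] = stats.norm.ppf(cdf)

print(f"SPI mean={spi.mean():.2f}, std={spi.std():.2f}")  # near 0 and 1
```

By construction, the resulting index is approximately standard normal, so values below about -1.5 or -2 indicate severe dry anomalies regardless of the local rainfall climatology.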
For temperature extremes (Figure 3d), studies may monitor the hottest/coldest day, warmest/coolest night, or the extreme temperature range (TXx-TNn, °C) over a given period: month, season, or year (e.g. Donat et al., 2013;Papalexiou et al., 2018). Globally, the highest temperature of the year, for example, increased by 0.19 °C per decade during 1966-2015, but accelerated to 0.25 °C per decade during 1986-2015, a faster increase than the mean annual temperature (Figure 4b, Papalexiou et al., 2018). Percentiles of the distribution are also commonly assessed (Zhang et al., 2005;Kjellström et al., 2007), while some authors work with combined temperature-humidity metrics (Raymond et al., 2020b;Knutson and Ploshay, 2016), which offer a more complete measure of atmospheric heat content (Pielke Sr et al., 2004;Peterson et al., 2011;Matthews, 2020), and may therefore be more closely aligned with levels of thermal stress felt by humans (Mora et al., 2017;Matthews, 2018). Other approaches emphasise duration by focussing on heatwaves, defined as periods of consecutive days when heat is higher than normal (Perkins and Alexander, 2013). This very broad categorization has seen a plethora of thresholds (both absolute-value and percentile-based) and metrics (temperature and combined temperature-humidity indices) applied to heatwave studies, with the choice shaped by interests in potential impacts. Changes in the magnitude of extreme wind events (Figure 3e) may also be tracked using wind speed percentiles such as the 90th, 95th, 98th and 99th seasonal or annual percentiles (e.g. Donat et al., 2011;Young and Ribal, 2019;Wang et al., 2009).
For instance, the 90th percentile of 10 m wind speed from ERA-Interim reanalysis data exhibits increasing wind speeds over the tropical oceans and large parts of South America but decreasing trends over Eastern Europe and Northwest Asia (Figure 4e, Torralba et al., 2017), although trends vary widely across reanalysis products (Torralba et al., 2017;Wohland et al., 2019). Wind intensity (in metres per second) may be explicitly measured over two-minute sustained periods, or three-second gust periods (Pryor et al., 2014). Wind events may also be inferred from gradients in sea-level pressure fields (Jones et al., 2016;Matthews et al., 2016b). Winds associated with Western Hemisphere tropical cyclones are described using storm scales such as the Saffir-Simpson hurricane wind scale (SSHWS) (e.g. Elsner et al., 2008;Karl et al., 2008). This classifies wind intensity into five storm categories: one (119-153 km/h), two (154-177 km/h), three (178-208 km/h), four (209-251 km/h) and five (≥252 km/h). Cyclone and typhoon magnitudes are equally described using metrics such as the seasonal mean lifetime peak intensity, intensification rate, and intensification duration (Mei et al., 2015).

Table 1. Examples of indicators employed for detection of nonstationarity in weather and water extremes. More detailed tables of metrics can be found in Donat et al. (2013) for temperature and precipitation, or Ekström et al. (2018) for a detailed list of hydroclimatic metrics.

Type        Common indicators          Example application                Example reference
Magnitude   Percentiles / quantiles    90th p.tile of rain on wet days    Moberg and Jones (2005)
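The Saffir-Simpson category bounds listed earlier map directly into a small classification function (a sketch of the km/h bounds quoted in the text; the function name is illustrative):

```python
# Sketch of the Saffir-Simpson hurricane wind scale (SSHWS) categories,
# using the km/h category bounds quoted in the text.
def saffir_simpson_category(wind_kmh: float) -> int:
    """Return SSHWS category 1-5, or 0 for below-hurricane-strength winds."""
    bounds = [(252, 5), (209, 4), (178, 3), (154, 2), (119, 1)]
    for lower, category in bounds:
        if wind_kmh >= lower:
            return category
    return 0

print(saffir_simpson_category(130))  # 1
print(saffir_simpson_category(260))  # 5
```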

Frequency
A second broad category of nonstationarity 'symptoms' is the frequency of events. Many metrics are used to describe changes in the frequency of hydroclimatic extremes, such as annual exceedance probabilities and counts of occurrences above or below thresholds. Examples of such 'peak over threshold' (POT) approaches may include the number of days or hours above/beneath a temperature threshold, or the number of flood or drought events, within a given season/year. In reality, the magnitude and frequency of extremes are closely related, such that when magnitudes increase, one can also typically expect to find more peaks over a given threshold (see Figure 4). Frequency-based metrics, however, generally enable better detection of changes in extremes than magnitude-based metrics (e.g. Mallakpour and Villarini, 2015) because they often reflect a larger sample of data and are less prone to measurement errors. For example, while block-maxima approaches often include just one value per year/season, POT approaches count the total number of exceedances above a threshold. This fact is exploited by those using documentary evidence to evaluate flood frequency (Macdonald et al., 2006). The thresholds for detection of changes in frequency should be set high enough to describe a meaningful extreme event, yet low enough to compile an adequate sample size.

For precipitation, independent events must first be identified for POT analysis. Various approaches exist for identification of events, such as the fitting of Poisson models (Restrepo-Posada and Eagleson, 1982). Multiple methods also exist for the selection of the most appropriate threshold (Caeiro and Gomes, 2016). Thresholds are generally chosen based on the local precipitation distribution, such as the 95th, 98th, 99th or 99.5th percentile of rain over a 1-, 6-, 12-, or 24-h period (e.g. Wi et al., 2016). Percentile or fixed thresholds (such as the 10 mm or 20 mm daily total, R10mm or R20mm) are then used to count monthly/annual days with heavy precipitation exceeding or equalling these values. Alternatively, a mean residual life plot (an exploratory graphical approach) can be used to select a suitably high threshold (e.g. Coles, 2001). These thresholds can be calculated for individual years or using the entire multi-year record. For an overview of threshold selection methods, see Anagnostopoulou and Tolika (2012). At the global scale, increases in the frequency of extreme precipitation have been more pronounced than changes in magnitude (Figure 4a-b, Papalexiou and Montanari, 2019). An alternative approach for estimating the probability of intense rainfall events is the changing likelihood of an historical precipitation analogue (Matthews et al., 2016a).
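Fixed- and percentile-threshold counts of the kind described above (R10mm/R20mm-style metrics) can be sketched as follows (illustrative synthetic data; the 1 mm wet-day cut-off is a common but not universal convention):

```python
# Sketch of annual frequency counts above fixed and percentile thresholds.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
dates = pd.date_range("2001-01-01", "2020-12-31", freq="D")
precip = pd.Series(
    rng.gamma(0.4, 10.0, len(dates)) * (rng.random(len(dates)) < 0.35),
    index=dates)

r10mm = (precip >= 10).resample("YS").sum()  # annual count of days >= 10 mm
r20mm = (precip >= 20).resample("YS").sum()  # annual count of days >= 20 mm

# Percentile threshold computed on wet days (>= 1 mm) over the whole record.
wet = precip[precip >= 1.0]
p95 = np.percentile(wet, 95)
heavy_days = (precip >= p95).resample("YS").sum()

print(r10mm.head(3))
```

Trends in such annual count series are then tested with the detection methods described later in the review.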
The frequency of temperature extremes is assessed using metrics such as the percentage of time when the daily minimum or maximum temperature is below or above a given percentile, or the total annual count of ice/frost days (ID, daily maximum below 0 °C; FD, daily minimum below 0 °C) (e.g. Donat et al., 2013). Mwagona et al. (2018) observed changes in cold/warm night frequency at 116 stations in Northeast China. They report a decrease in cold night frequency during winter and spring, while warm night frequency increased primarily in summer. Similar metrics are used to describe changes in the frequency of heatwaves (Perkins-Kirkpatrick and Lewis, 2020) or growing season days, such as the number of days with plant heat stress (with maximum temperature exceeding 35 °C) (e.g. Rivington et al., 2013). Alternatively, the accumulated frost (sum of degree days where the minimum temperature is below 0 °C) (e.g. Harding et al., 2015) may be linked to crop yields during different growth phases.
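A run-based heatwave count of the kind used in these frequency metrics can be sketched as follows (illustrative synthetic data; the 90th-percentile threshold and three-day minimum duration are common choices but, as noted above, far from universal):

```python
# Sketch: count heatwave events as runs of >= 3 consecutive days with daily
# maximum temperature above the local 90th percentile.
import numpy as np

rng = np.random.default_rng(3)
# Synthetic daily Tmax with a seasonal cycle plus noise (10 years).
tmax = 25 + 5 * np.sin(np.linspace(0, 20 * np.pi, 3650)) + rng.normal(0, 2, 3650)

threshold = np.percentile(tmax, 90)
hot = tmax > threshold

def heatwave_events(hot_days, min_length=3):
    """Return (start, length) for runs of consecutive hot days >= min_length."""
    events, run_start = [], None
    for i, h in enumerate(hot_days):
        if h and run_start is None:
            run_start = i
        elif not h and run_start is not None:
            if i - run_start >= min_length:
                events.append((run_start, i - run_start))
            run_start = None
    if run_start is not None and len(hot_days) - run_start >= min_length:
        events.append((run_start, len(hot_days) - run_start))
    return events

events = heatwave_events(hot)
print(f"{len(events)} heatwaves detected")
```

The annual count of such events (or of their total days) then forms the frequency series whose trend is tested.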
The frequency of wind extremes is often assessed in terms of the number of wind storms (e.g. Wild et al., 2015) or cyclones (e.g. Matthews et al., 2016b) in a given period. The number of events can be estimated using tracking algorithms, ranging from simple algorithms based on mean sea-level pressure values for cyclone activity (Murray and Simmonds, 1991;Donat et al., 2011) and the exceedance of the 98th percentile wind speed for wind storms (Leckebusch et al., 2008) to more complex tracking of 'sting jets' (Hart et al., 2017). The frequency of extreme winds may be quantified from reanalysis data, noting that the trend magnitude and sign can be sensitive to the reanalysis product (Befort et al., 2016;Wohland et al., 2019;Torralba et al., 2017).
POT methods are widely used to evaluate changes in the frequency of hydroclimatic events. These methods require selection of a reference threshold (a magnitude) and a period (e.g. one week) to de-cluster independent events. This is a common challenge for most hydroclimatic extremes (Figure 3). For floods, many studies apply a somewhat arbitrary streamflow threshold that is on average exceeded twice per year (e.g. Hodgkins et al., 2019;Slater and Villarini, 2017a). In practice, alternative thresholds may be equally valid, such as the number of days when water levels exceed official flood thresholds (Slater and Villarini, 2016). However, setting a lower threshold means that events are less likely to be independent and/or of practical significance. 'Clusters' of consecutive events are, thus, further de-clustered by specifying a period between events. For instance, Wi et al. (2016) separated extreme rainfall events by an interval of 7 days, but other requirements have been proposed to ensure independence. Lang et al. (1999, p.105) highlight that in 1976, the U.S. Water Resources Council imposed a separation between flood events of "at least as many days as five plus the natural logarithm of square miles of basin area", including a drop between consecutive peaks 'below 75% of the lowest of the two flood events'. The time between independent events depends on the catchment size, as a longer duration is expected in larger catchments because of the slower draw-down of hydrograph limbs. In practice, such thresholds are far too short to adequately distinguish truly independent events, given the timescales of hydroclimatic variability, which are often greater than a year. Similar concerns apply when isolating successive drought events (Bell et al., 2013;Thomas et al., 2014;Parry et al., 2016).
Metrics to identify drought termination, and subsequently drought independence, include storage deficit methods that quantify the volume of water in relation to normal water storage conditions (Thomas et al., 2014) and, more generally, the return from maximum negative anomalies to above-average conditions (Parry et al., 2016). Hence, specification of the threshold and de-clustering technique make POT approaches more complicated to implement than block maxima approaches, where only one extreme per 'block' (unit of time) is selected. The more flexible selection of extremes and larger sample size of frequency-based (i.e. POT) approaches may be preferred when record length is a limiting factor, whereas simpler magnitude-based (i.e. block maxima) approaches may be preferred when longer records are available.
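The threshold-plus-separation logic described above can be sketched with a simple de-clustering rule (an illustrative sketch on synthetic data, using the 7-day separation of the Wi et al. example; real studies use the catchment- or variable-specific rules discussed in the text):

```python
# Sketch of peaks-over-threshold (POT) selection with a fixed separation window.
import numpy as np

rng = np.random.default_rng(5)
series = rng.gamma(2.0, 10.0, 3650)       # synthetic daily series (10 years)
threshold = np.percentile(series, 99)

def decluster_pot(x, thr, min_sep=7):
    """Keep exceedances of thr, largest first, each at least min_sep days
    from every peak already retained (a simple de-clustering rule)."""
    candidates = np.argsort(x)[::-1]      # indices sorted by value, descending
    peaks = []
    for i in candidates:
        if x[i] <= thr:
            break                         # remaining candidates are below thr
        if all(abs(int(i) - j) >= min_sep for j in peaks):
            peaks.append(int(i))
    return sorted(peaks)

peaks = decluster_pot(series, threshold)
print(f"{len(peaks)} independent exceedances above the 99th percentile")
```

The annual count of such de-clustered peaks forms the POT frequency series; the choice of `min_sep` is exactly the independence judgement debated in the text.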

More severe hydroclimatic extremes (such as 1-in-10, -20, -50, -100, or -200-year events) are typically evaluated using return periods (or 'expected waiting times'). Alternatively, annual exceedance probabilities (AEPs) define the probability that a threshold will be exceeded in a given year. Other metrics have been proposed for engineering design. Reviews by Salas et al. (2018) and François et al. (2019) highlight ongoing disagreements about the utility of nonstationary methods for the design of engineering structures. As the uncertainties of nonstationary model structures may exceed those of stationary models (Serinaldi and Kilsby, 2015), specific strategies are required to manage the consequences of those uncertainties (François et al., 2019).
There are thus ongoing debates about which concepts and methods are most appropriate for estimation of extremes, such as the return period, risk, reliability or Equivalent Reliability (ER), Design Life Level (DLL) or Average Design Life Level (ADLL), and Expected Number of Exceedances (ENE) (e.g. Read and Vogel, 2015;Rootzén and Katz, 2013;Yan et al., 2017;Salas and Obeysekera, 2014). For instance, return period metrics may exhibit limitations in the case of time-correlated hydroclimatological extremes, so alternatives such as the 'equivalent return period' (ERP, i.e. "the period that would lead to the same probability of failure pertaining to a given return period T in the framework of classical statistics (independent case)", Volpi et al., 2015) may be preferred.
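The relationship between return period, AEP, and the expected number of exceedances (ENE) over a design life can be made concrete with a toy calculation (the linear doubling of AEP below is a purely illustrative assumption, not an estimate for any real site):

```python
# Sketch relating return period, AEP, and ENE over a design life.
import numpy as np

T = 100          # "1-in-100 year" event
aep = 1.0 / T    # stationary annual exceedance probability
n = 50           # design life in years

# Probability of at least one exceedance over the design life (stationary).
p_any_stationary = 1 - (1 - aep) ** n

# Nonstationary case: suppose the AEP doubles linearly over the design life.
aep_t = np.linspace(aep, 2 * aep, n)
ene = aep_t.sum()                      # expected number of exceedances
p_any_nonstat = 1 - np.prod(1 - aep_t)

print(f"stationary P(>=1 exceedance in {n} yr): {p_any_stationary:.3f}")
print(f"nonstationary ENE over {n} yr: {ene:.3f}")
print(f"nonstationary P(>=1 exceedance): {p_any_nonstat:.3f}")
```

Even a modest upward drift in AEP raises both the ENE and the lifetime failure probability relative to the stationary calculation, which is why design-life metrics such as DLL and ENE have been proposed.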
It is not just individual characteristics of weather and water extremes that can change over time, but also the interdependence between different characteristics, such as frequency, magnitude, and volumes. Myhre et al. (2019) highlight that in a warming climate, increases in extreme rainfall are likely to be driven by shifts in both the intensity and frequency of events, with increases in frequency being most important. Brunner et al. (2019) assessed future changes in flood peak-volume dependencies and found that the interdependence between variables may change more strongly than the individual variables themselves. This interdependency also applies to other variable pairs of joint interest, such as drought duration and deficit, or precipitation intensity and duration. Recognising the interdependence between magnitude and frequency, many studies employ intensity-duration-frequency (IDF) metrics, which describe both the magnitude and frequency at once. It has recently been shown that generalized extreme value (GEV) distribution parameters scale robustly with event duration at the global scale (R² > 0.88), hence a universal IDF formula can be applied to estimate rainfall intensity for a continuous range of durations, including at the sub-daily scale (Courty et al., 2019). There is growing interest in the nonstationarity of IDF curves (Cheng and AghaKouchak, 2014;Ganguli and Coulibaly, 2017) as well as the implications of this nonstationarity for compound hydroclimatic extremes globally (AghaKouchak et al., 2020).
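The GEV fit underlying such frequency and IDF analyses can be sketched with SciPy (an illustrative sketch on synthetic annual maxima; note that SciPy's shape parameter `c` has the opposite sign to the usual hydrological convention):

```python
# Sketch: fit a GEV distribution to annual maxima and read off return levels.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
# Synthetic annual maxima drawn from a known GEV law (80 "years").
amax = stats.genextreme.rvs(c=-0.1, loc=50, scale=10, size=80,
                            random_state=rng)

c, loc, scale = stats.genextreme.fit(amax)

# Return levels: the quantiles with AEPs of 1/10 and 1/100.
q10 = stats.genextreme.ppf(1 - 1.0 / 10, c, loc=loc, scale=scale)
q100 = stats.genextreme.ppf(1 - 1.0 / 100, c, loc=loc, scale=scale)
print(f"fitted 1-in-10: {q10:.1f}; 1-in-100: {q100:.1f}")
```

Nonstationary variants of this model, discussed later in the review, let the location (and sometimes scale) parameter vary with time or a physical covariate rather than remain constant.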

Timing
Nonstationarity in the timing and seasonality of weather and water extremes has been examined far less than trends in magnitude and frequency (see examples in Table 1). Timing and seasonality provide information that is relevant for the management of water resources and analysis of underlying drivers of change. For instance, the start of field operations for farming may be estimated as the day of the year when 'the sum of average temperature from 1st January exceeds 200 °C' (e.g. Rivington et al., 2013;Harding et al., 2015). Similarly, the start of the growing season may be measured as the first of five consecutive days with average temperature exceeding 5 °C (Rivington et al., 2013). Changes in these indicators of hydroclimatic extremes may have substantial impacts (e.g. on crop yields). Additionally, changes in timing and seasonality can also affect the impacts of extreme events.
For example, the risk from compound tropical cyclones and heatwaves is sensitive to the seasonal cycles in tropical cyclone probability (which peaks in late summer) and extreme heat (mid-summer). A greater frequency of tropical cyclones earlier in summer, or more extreme heat later in summer, would increase the risk of compounding and attendant impacts.
Changes in the timing of seasonal streamflow are typically assessed using the centre of volume (CV) date (Court, 1962) or the mean date of flood occurrence (mean flood day, MFD). For example, Hodgkins and Dudley (2006) assessed changes in flood timing over the conterminous USA from 1913-2002 using the winter-spring CV dates. They found that a third of stations north of 44 degrees had significantly earlier flows, likely related to changes in winter and spring air temperatures affecting winter snowpack. The MFD has been used to assess changes in streamflow timing in specific countries such as Wales (Macdonald et al., 2010) and Spain (Mediero et al., 2014). Probabilistic methods for identifying flood seasonality and its trends (Cunderlik et al., 2004) have been applied in Canada (Cunderlik and Ouarda, 2009) and the Northeastern United States (Collins, 2019). In Europe, an analysis of 4,262 streamflow stations in 38 countries used the date of occurrence of the highest annual peak flow to assess changes in flood timing (Blöschl et al., 2017). This showed significant changes in the seasonal timing of floods at the regional scale: in Northeastern Europe, 81% of stations had shifted towards earlier floods (by 8 days per 50 years); in Western Europe, 50% of stations had shifted towards earlier floods (by 15 days per 50 years); and around the North Sea, 50% of the stations had shifted towards later floods (by 8 days per 50 years), as seen in Figure 4c (Blöschl et al., 2017). Wasko et al. (2020b) evaluated global shifts in the timing of streamflow based on the local water year, and found that shifts in the timing of annual floods were three times greater than shifts in mean streamflow. Further, the drivers of streamflow timing depend on the magnitude of the event: less extreme events tend to correspond with soil moisture timing, while more extreme events depend more on rainfall timing (Wasko et al., 2020a).
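The centre-of-volume date is straightforward to compute per water year (an illustrative sketch on a synthetic snowmelt-like regime; real analyses would use the local water year and observed daily flows):

```python
# Sketch of the centre-of-volume (CV) date (Court, 1962): the day of year by
# which half of the annual flow volume has passed the gauge.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
dates = pd.date_range("2010-01-01", "2014-12-31", freq="D")
doy = dates.dayofyear.values
# Synthetic snowmelt-like regime peaking around day 120, plus baseflow noise.
flow = pd.Series(
    np.exp(-((doy - 120) ** 2) / (2 * 40.0 ** 2)) * 100
    + rng.random(len(dates)) * 5,
    index=dates)

def centre_of_volume_day(q: pd.Series) -> int:
    """Day of year when cumulative flow first reaches half the annual total."""
    half = q.sum() / 2.0
    mask = (q.cumsum() >= half).values
    return int(q.index.dayofyear[mask][0])

cv_days = flow.groupby(flow.index.year).apply(centre_of_volume_day)
print(cv_days)
```

A trend in the resulting annual series of CV dates (e.g. towards earlier days of year) is the timing signal examined in the snowmelt studies cited above.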
Nonstationarity in the timing of extreme precipitation has also been used to better understand the causal factors of extremes. [...] early summer, and less frequent in late summer, due to summer drying.
3 Data considerations before detecting and attributing nonstationarity

3.1 Spurious nonstationarities: the issue of data quality

Confidence in nonstationarity detection, attribution, and prediction rests on confidence in the homogeneity and quality of input data, including primary hydro-meteorological series from individual observations, accompanying metadata and qualitative information. Data quality issues tend to be particularly prevalent with measurement of extremes. Hydroclimatologists increasingly need to be aware of data quality issues associated with the homogeneity of remotely sensed data and their derivatives. For example, inaccuracies may arise from orbital drift (Weber and Wunderle, 2019), inference of precipitation from vegetation in data-sparse regions, or changing land-surface reflectance such as snow cover over mountainous regions (Karaseva et al., 2012). Some of the most common sources of data errors and biases that reduce homogeneity or cause nonstationarity in ground-based information over timescales of years to decades are: site or instrument changes; biases and drift in field procedures (time of sample, preferred values); unstable rating curves and channel cross-sections; changes in network density and cover; and post-processing and archiving (unit changes). For example, the England and Wales Precipitation series provides a specific example of spurious nonstationarity (a long-term trend towards wetter winters) arising from a combination of climate drivers (cold winters with more snowfall in the early nineteenth century) with non-standard rain gauges before the [...]

Gridded products and reanalysis data are also not immune from such data quality issues and are further affected by time-space variations in raw data inputs and version updates (Sterl, 2004;Ferguson and Villarini, 2014). Detecting spurious nonstationarities within raw data should be one of the first steps when evaluating time series that have yet to be quality controlled.
Approaches have been proposed for uncovering homogeneity issues (see Figure 5b).
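One simple screening approach for homogeneity is the double-mass curve, which compares cumulative totals at a candidate station against a trusted neighbouring reference: a homogeneous pair plots as a straight line, whereas a break in slope flags a possible station-specific artefact (e.g. a gauge move) rather than a climatic signal. The sketch below is illustrative only, using synthetic annual precipitation with a deliberately introduced 20% jump; all station behaviour and parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(1950, 2020)

# Synthetic annual precipitation at a homogeneous reference station and a
# candidate station with an artificial +20% shift after a (hypothetical)
# 1990 gauge move.
reference = rng.gamma(shape=50, scale=16, size=years.size)      # ~800 mm/yr
candidate = reference * rng.normal(1.0, 0.05, size=years.size)  # correlated
candidate[years >= 1990] *= 1.20                                # spurious jump

# Double-mass curve: cumulative candidate vs cumulative reference totals.
# A slope break around 1990 points to an inhomogeneity in the candidate.
cum_ref, cum_cand = np.cumsum(reference), np.cumsum(candidate)
slope_before = np.polyfit(cum_ref[years < 1990], cum_cand[years < 1990], 1)[0]
slope_after = np.polyfit(cum_ref[years >= 1990], cum_cand[years >= 1990], 1)[0]
print(f"slope before 1990: {slope_before:.2f}, after: {slope_after:.2f}")
```

In practice the reference is often a composite of several neighbouring stations, and any detected break should be corroborated against station metadata before correction.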

3.2 Record length and completeness
The observed record length required for assessments of nonstationarity depends on the type of process under consideration, the aims of analysis, properties of the underlying data, as well as the timescales of the sources of nonstationarity (drivers of change). For instance, Atlantic sea surface temperatures (SSTs) vary over periods longer than 50 years (McCarthy et al., 2015; Sutton and Dong, 2012) and affect concurrent precipitation and temperature patterns: the North Atlantic was particularly cold during the middle of the climate normal period 1961-1990 due to the Atlantic Multidecadal Oscillation/Great Salinity Anomaly (Dickson et al., 1988). Hence, even 50 years of data is insufficient to robustly detect true nonstationarities because the start and end dates of records may substantially affect the sign (direction) and magnitude of trends (Harrigan et al., 2018), especially in records that exhibit multi-decadal periodicity. Hundreds of years are required to adequately identify certain stationary models (Thyer et al., 2006), let alone nonstationary models. Additionally, highly variable time series require a larger percentage change in the mean of the data to identify a statistically significant change compared with less variable time series (e.g. Chiew and McMahon, 1993). In places where series have low signal-to-noise ratios, the time required to detect plausible trends (e.g. in precipitation, evapotranspiration, and discharge extremes) can thus be centuries long (e.g. Ziegler et al., 2005; Wilby, 2006).
In some cases, faster detection may be possible using seasonal, rather than annual time series (e.g. Ziegler et al., 2005).
The mismatch between the temporal scales of drivers of climate variability versus the availability and quality of observations can also result in misleading conclusions. For example, numerous studies have reported decreasing precipitation in Mediterranean regions since the 1960s (Longobardi and Villani, 2010; Gudmundsson and Seneviratne, 2016), with some attributing this decline and corresponding increase in drought frequency to anthropogenic forcing in the Mediterranean basin (e.g. Barkhordarian et al., 2013; Gudmundsson and Seneviratne, 2016; Hoerling et al., 2012). However, when viewed in the context of rescued and quality-assured data beginning in the mid-19th Century, these recent trends in precipitation are within the range of long-term variability.
Palaeoclimatic reconstructions can also provide insight into current conditions. For example, warm and cool season rainfall was reconstructed in Australia to investigate recent observed trend magnitude in the context of palaeoclimatic variability (Freund et al., 2017). Hydroclimatic reconstructions of the last 500 years have considerable potential to place recent observations into a long-term context that is not achievable from short observation-based record lengths alone. Although such datasets lengthen the period available for analysis and better reflect ranges of variability in extremes such as drought (Murphy et al., 2020a), they are subject to limitations from changes in measurement practice, decreasing density of observations in early records, and a lack of consideration of issues such as changes in land cover and shifts in channel capacity (Slater et al., 2015).
In many regions, temporal and spatial data sparsity is likely to remain a key issue, hindering robust detection, attribution, and prediction of water and climate nonstationarities. Therefore, different trend detection and attribution methods should be considered (e.g. using lower quantiles and peak-over-threshold methods; see Section 4), while implementing holistic 'multiple working hypotheses' approaches (Chamberlin, 1890;Harrigan et al., 2014) to avoid overlooking potential drivers of change.
Finally, gaps in extreme hydroclimatic time series may also affect the detection rate of significant trends. Detection rates are lower in shorter records, in records with smaller changes (lower regression slopes), in records with larger gaps, and/or when the data gaps are located towards the beginning or end of a time series (Slater and Villarini, 2017a).

4 Detection of nonstationarity
Detection of nonstationarity in hydroclimatological extremes requires a sound examination of the data before applying any statistical tests, which broadly seek to detect two types of nonstationarity (see Figure 1b): monotonic change (trends) and step-changes (change points). Such changes can be considered as symptoms of nonstationarity if they represent a significant departure from normality within a long-term record. Nonstationarity may be detected either in individual time series (point-based analysis) or in larger ensembles of stations (spatially coherent trends) (Hall et al., 2013). Here, we provide an overview of existing methods employed in the fields of weather and water extremes. For a description of change-detection methods in hydrology we refer the reader to Helsel et al. (2020), and for floods to dedicated reviews. For an overview of methods and challenges in the detection and attribution of climate extremes, see Easterling et al. (2016).

4.1 Regression-based methods for detection of incremental change
The detection of trends in hydroclimatic time series generally employs two key approaches: detection of trends in magnitude (e.g. quantiles or block maxima, such as the annual maxima, AMAX) or frequency-based methods (e.g. use of point process modelling frameworks to model the peaks-over-threshold (POT) series, also referred to as the partial-duration series, PDS) (e.g. Coles, 2001; Salas et al., 2018).
The non-parametric Mann-Kendall (MK) test (Mann, 1945; Kendall, 1975) is a distribution-free test, frequently employed to detect monotonic trends in time series without assuming a linear trend. Instead, MK simply evaluates whether the central tendency or median of the distribution changes monotonically over time (see Helsel et al., 2020). The test statistic, Kendall's τ, is a rank correlation coefficient which ranges from -1 to +1 (e.g. see Figure 4c, left panel). For instance, Westra et al. (2013) evaluated trends in annual maximum daily precipitation at 8,326 precipitation stations with at least 30 years of record and found increases at approximately two-thirds of these stations. A modified version of the MK test can also be applied to autocorrelated data (Hamed, 2009a, b). The Theil-Sen slope estimator (Sen, 1968; Theil, 1992; Hipel and McLeod, 1994) has often been used alongside MK to estimate the magnitude of the trend (e.g. Hannaford et al., 2021). Other studies also apply ordinary least squares (OLS) linear regression to estimate trends in precipitation (e.g. Figure 4a, Papalexiou and Montanari, 2019), temperature (e.g. Papalexiou et al., 2018), and flood flows (e.g. Hecht and Vogel, 2020).
Practical advantages of OLS methods include ease of use, straightforward expression of uncertainty, graphical communication, and usability for providing decision-relevant information (e.g. Hecht and Vogel, 2020). In cases where the assumptions of OLS are not met (such as linearity, independence, normality, and equal variance of the residuals), non-parametric alternatives such as the Theil-Sen slope estimator or quantile regression (QR) may be used (both trend lines can be plotted). Instead of estimating the conditional mean of the response variable, QR considers different conditional quantiles (including the median) of a distribution. QR has been used for precipitation trends (e.g. Tan and Shao, 2017), air temperature (e.g. Barbosa et al., 2011), surface wind speed (e.g. Gilliland and Keim, 2018), and flood trends (e.g. Villarini and Slater, 2018), and is also regularly employed to investigate scaling properties between hydroclimatological variables; see Section 5.1 (e.g. Wasko and Sharma, 2014).
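The MK test and Theil-Sen estimator described above can be applied in a few lines with standard scientific libraries. The sketch below uses a synthetic annual-maximum precipitation series with an imposed trend; the data, trend magnitude, and noise model are illustrative assumptions, not observations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
years = np.arange(1970, 2020)
# Hypothetical annual-maximum daily precipitation (mm) with a weak upward trend.
amax = 40 + 0.25 * (years - years[0]) + rng.gamma(2.0, 4.0, years.size)

# Mann-Kendall-type test: Kendall's tau between time and the series;
# no linear-trend assumption, only monotonic association.
tau, p_value = stats.kendalltau(years, amax)

# Theil-Sen: median of pairwise slopes, robust to outliers; commonly
# reported alongside MK as the trend magnitude.
slope, intercept, lo, up = stats.theilslopes(amax, years)

print(f"tau={tau:.2f}, p={p_value:.4f}, slope={slope:.2f} mm/yr "
      f"(95% CI {lo:.2f} to {up:.2f})")
```

Note that for serially correlated series, a modified MK test (e.g. Hamed's variance correction, as cited above) should be preferred over this plain version.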
When the empirical distribution of hydroclimatic extremes is known, many prefer to select an appropriate distribution and evaluate how the distribution parameters vary as a function of covariates such as time (e.g. Katz, 2013).
In the nonstationary case, a time or climate covariate can be employed to detect changes in the parameters of the distribution, as illustrated in Figure 6. The advantage of employing climate covariates is that climate model predictions/projections can then be employed as covariates to estimate future change (as discussed in section 7.3, e.g. Du et al., 2015). Criteria for model selection, such as the Akaike Information Criterion (AIC) or Schwarz Bayesian Criterion (SBC, also known as the Bayesian Information Criterion, BIC), can then be used to determine whether the stationary or nonstationary model is the better fit (e.g. Figure 6a). The AIC and BIC assess the trade-off between goodness of fit and model complexity, so the improvement in goodness of fit must be sufficient to overcome the complexity penalty. Small differences in AIC are not always meaningful, such that several models may be equally acceptable (e.g. Wagenmakers and Farrell, 2004). In cases where the nonstationary model performs significantly better than a stationary model (in terms of goodness-of-fit and uncertainty), then the time series may be considered as nonstationary, pending sufficient record length (see section 3.2).
Increasingly, distributional regression modelling frameworks such as Vector Generalized Linear and Additive Models (VGLM/VGAM) or Generalized Additive Models for Location, Scale and Shape (GAMLSS) are being chosen for their flexibility in evaluating nonstationarity of hydroclimatic extremes (e.g. Serinaldi and Kilsby, 2015). These frameworks are a generalization of generalized linear models (GLMs) allowing a broader range of distributions as well as different relationships between parameters and explanatory variables (linear, nonlinear, or smooth nonparametric).
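The stationary-versus-nonstationary comparison via AIC can be sketched with a simple maximum-likelihood fit. Below, synthetic annual maxima are drawn from a Gumbel distribution whose location drifts upward over time (the distribution choice, trend, and parameter values are assumptions for illustration); a model with a linear time covariate in the location parameter is then compared against a stationary fit. A log transform keeps the scale parameter positive, echoing the log link mentioned below.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(7)
t = np.arange(80)  # 80 hypothetical years
# Synthetic annual maxima whose Gumbel location drifts upward over time.
x = stats.gumbel_r.rvs(loc=100 + 0.5 * t, scale=15, size=t.size,
                       random_state=rng)

def nll_stat(params):
    """Negative log-likelihood, stationary model (constant loc, scale)."""
    loc, log_scale = params
    return -stats.gumbel_r.logpdf(x, loc, np.exp(log_scale)).sum()

def nll_nonstat(params):
    """Negative log-likelihood, nonstationary model: loc = a + b * t."""
    a, b, log_scale = params
    return -stats.gumbel_r.logpdf(x, a + b * t, np.exp(log_scale)).sum()

fit_s = optimize.minimize(nll_stat, x0=[x.mean(), np.log(x.std())],
                          method="Nelder-Mead")
fit_n = optimize.minimize(nll_nonstat, x0=[x.mean(), 0.0, np.log(x.std())],
                          method="Nelder-Mead")

# AIC = 2k + 2*NLL: the trend model must overcome its complexity penalty.
aic_s = 2 * 2 + 2 * fit_s.fun
aic_n = 2 * 3 + 2 * fit_n.fun
print(f"AIC stationary: {aic_s:.1f}, nonstationary: {aic_n:.1f}")
```

With the strong imposed trend the nonstationary model wins clearly; with a weak trend or short record the AIC difference shrinks, illustrating why small differences should not be over-interpreted.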
GAMLSS models, for instance, can have up to four parameters, µ, σ, ν, and τ, which allow modelling of the location (mean, median, mode), scale (spread: standard deviation, coefficient of variation) and shape (skewness and kurtosis) of a distribution. An example of nonstationarity detection is shown in Figure 6. Here, two nonstationary (with time-varying parameters) and two stationary (constant parameters) models are fit using both the Gamma and Weibull distributions to observed time series of instantaneous (15-minute) peak maxima (Figure 6a). In the nonstationary case, the µ and σ model parameters both depend linearly on the time covariate (in years).

A logarithmic link function is employed to ensure the distribution parameters remain positive. The goodness-of-fit of both stationary and nonstationary models is assessed using SBC (Figure 6a). A detrended quantile-quantile (worm) plot showing the residuals for different ranges of the explanatory variable(s) can also be used to diagnose model fit (Figure 6b). The model fit is satisfactory if the worm is relatively flat and if data points lie within the confidence intervals. In the case of the River Ouse, we find the nonstationary Gamma model is the best-fitting model. However, the Gamma and Weibull nonstationary model fits are fairly similar (Figure 6a), and if the worm plots indicate a similar goodness-of-fit, it may well be that both are "acceptable", but the nonstationary Gamma model is simply slightly better. Here, both the µ and σ parameters are increasing over time (Figure 6c-d). The 50-year flood (specific discharge) increased from 0.123 m³/s/km² in 1900 to 0.183 m³/s/km² in 2018 (Figure 6f).
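Once time-varying µ and σ are fitted, the T-year event in any given year is simply the (1 − 1/T) quantile of that year's distribution. The sketch below illustrates the idea with made-up (not fitted) linear trends in the Gamma parameters, using the GAMLSS GA parameterisation (µ = mean, σ = coefficient of variation); the numbers do not correspond to the River Ouse example.

```python
from scipy import stats

def gamma_params(year):
    """Illustrative time-varying Gamma parameters (assumed, not fitted):
    mu is the mean and sigma the coefficient of variation, both increasing
    linearly with time, as in the GAMLSS GA family."""
    mu = 0.05 + 0.0004 * (year - 1900)     # mean specific discharge, m3/s/km2
    sigma = 0.40 + 0.0010 * (year - 1900)  # coefficient of variation
    # GAMLSS GA parameterisation: shape = 1/sigma^2, scale = mu * sigma^2
    return 1.0 / sigma**2, mu * sigma**2

# The 50-year event is the 0.98 quantile, and now differs year by year.
for year in (1900, 2018):
    shape, scale = gamma_params(year)
    q50 = stats.gamma.ppf(1 - 1 / 50, shape, scale=scale)
    print(year, round(q50, 3))
```

Under a nonstationary model, a design value such as "the 50-year flood" thus becomes a function of the covariate rather than a single number.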
GAMLSS methods have been applied for different hydroclimatic extremes. For example, Bazrafshan and Hejabi (2018) developed a Nonstationary Reconnaissance Drought Index (NRDI) to assess drought nonstationarity in Iran, and found large differences between the NRDI and a traditional RDI for timeframes longer than 6 months. Sun et al. (2020b) also evaluated changes in a Nonstationary Standardised Runoff Index (NSRI) using GAMLSS over the Heihe River Basin in China. An evaluation of global changes in 20-year river floods since the 1970s found a majority of increases in temperate climates, but decreases in cold, polar, arid and tropical climates (Figure 4d, Slater et al., 2021). Importantly, the covariate in a GAMLSS nonstationary model is often time (e.g. Villarini et al., 2009a; Serinaldi and Kilsby, 2015), but may also include other physical drivers such as climate modes (e.g. Villarini and Serinaldi, 2012), urban or agricultural land cover (e.g. Villarini et al., 2009b; Slater and Villarini, 2017b), or other hydro-climatic variables such as dew point temperature (e.g. Lee et al., 2020).
Finally, there is growing interest in using interpretable machine learning methods for detection of nonstationarities in weather and water extremes. For example, Prophet (Taylor and Letham, 2018) is a decomposable time series forecasting model, similar to Generalized Additive Models (GAMs) (Hastie and Tibshirani, 1987), which is increasingly popular for hydro-climatological applications. This approach allows users to determine the locations in time when there are significant changes.

4.2 Pooled methods for detecting changes in extremes
As noted above, trend detection of hydroclimatic extremes is problematic when there is uncertainty arising from short record lengths or small samples. Extreme events such as the annual maximum, or 1-in-n (50, 100)-year events tend to be highly variable and require lengthy time series to ensure robust detection of significant nonstationarities. In cases where the sample size of observed records is insufficient, alternative methods have been proposed, ranging from "pooled" sampling to scaling approaches.
Various pooled methods can be used to address the issue of limited sample size over large spatial scales. One pooled approach is to extract the single largest event over an n-year period from multiple independent gauge-based records, effectively substituting space (a large spatial sample across many gauges) for time (a long temporal sample at individual gauges). In other words, by pooling the data from multiple records or datasets, the data sample is increased for greater statistical robustness. Such approaches have revealed, for example, alternating flood-rich and flood-poor periods at regional scales, likely associated with hydroclimate variability. Max-stable models (Coles, 2001) are another method of pooling data directly. The approach simulates spatial fields with observations from various point locations, to increase the precision of statistical inference (e.g. Westra and Sisson, 2011).
Most statistical tests used to detect nonstationarity suffer from a substantial loss of power when applied to shorter time series (Yue et al., 2002b; Vogel et al., 2013; Prosdocimi et al., 2014, 2019). Pooled frequency analysis may improve the estimation of events associated with long return periods in rainfall intensity-duration-frequency curves at sites where historical rainfall records are short or ungauged, by compiling data from many rainfall records in a region (Requena et al., 2019).
A different pooling approach for detection of trends in extremes is the "UNSEEN" approach - UNprecedented Simulated Extremes using ENsembles (e.g. Van den Brink et al., 2005; Thompson et al., 2017). UNSEEN pools members of seasonal predictions or large-ensemble climate models (e.g. Deser et al., 2020), using the members as multiple realisations of a plausible 'alternate reality'. In this way, historical sample sizes can be vastly increased to provide greater statistical confidence in extreme estimates. The method was further developed through an "UNSEEN-trends" approach that enables detection of nonstationary extremes (e.g. precipitation) such as 100-year events, from short (e.g. 30-year) climate model records (Kelder et al., 2020). The UNSEEN-trends approach has potential for detection of nonstationarities in a range of climate extremes, and can be applied to century-long seasonal hindcasts (e.g. Weisheimer et al., 2017, 2020).
However, if any of these pooled approaches is limited to a short period (e.g. half a century), it may not necessarily overcome all the challenges of record length. For instance, if a region is influenced by climate variability (e.g. a positive AMO phase), increasing the sample size will not overcome the lack of data from other climate phases: the data still only originate from one of many possible states, and it is not possible to definitively conclude nonstationarity in the behaviour of extremes over longer time periods.
Thus, caution must be employed: short-term pooled approaches may be useful for investigating recent weather and water extremes (e.g. recent land cover changes, temperature scaling, or present-day climate risk analysis) but are less useful for explaining long-term drivers (e.g. climate variability), where longer records would be required. This caveat is equally applicable to all methods limited to short time periods.
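The core idea of ensemble pooling can be shown in a few lines: treating each ensemble member as an exchangeable realisation of the same climate multiplies the effective sample size, so a 100-year event can be estimated empirically from, say, 25 members of 40 years each. Everything below (ensemble size, distribution, parameters) is an illustrative assumption, not a real model output.

```python
import numpy as np

rng = np.random.default_rng(3)
n_members, n_years = 25, 40  # hypothetical large-ensemble setup

# Each member is one plausible 40-year realisation of annual maxima,
# drawn here from a known Gumbel so the answer can be checked.
ensemble = rng.gumbel(loc=100, scale=20, size=(n_members, n_years))

# A single 40-year record cannot constrain the 100-year event empirically;
# pooling 25 x 40 = 1000 member-years can.
pooled = ensemble.ravel()
emp_100yr = np.quantile(pooled, 1 - 1 / 100)
true_100yr = 100 + 20 * (-np.log(-np.log(1 - 1 / 100)))  # Gumbel quantile
print(f"pooled estimate: {emp_100yr:.1f}, true: {true_100yr:.1f}")
```

In a real UNSEEN application, member independence and fidelity to observations must first be verified, and, as cautioned above, pooling over a short window cannot substitute for sampling other climate phases.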

4.3 Weather generators and synthetic time series
Weather generators and synthetic time series are powerful tools to capture nonstationarity in weather variables and, more recently, even in streamflow at the catchment level. Such approaches can be used to inform decision-making. For instance, Benoit et al. (2018) developed a synthetic simulation approach for different types of rainfall, such as stratiform winter rainfall, spring rain showers, or heavy summer convective rainfall, based on their space-time-intensity structure. The rain types can be conditioned on different meteorological covariates (such as pressure, wind, temperature, or humidity) (Benoit et al., 2020).
The authors then applied this approach under an RCP8.5 (high) emissions scenario. In hydrology, synthetic design hydrographs (SDHs) may be used to test the sensitivity of the peak, volume, and shape of a flood hydrograph to the flood-generating mechanism and catchment properties (e.g. Brunner et al., 2018; Yue et al., 2002a). For instance, an empirical, wavelet-based model was developed for stochastic simulation of streamflow time series (made available within the R package PRSim). The method was shown to be suitable for sites exhibiting nonstationarities and spatial dependence of streamflow extremes, and can be used for water management applications.
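A minimal weather-generator sketch illustrates the principle: daily rainfall occurrence is drawn from a Bernoulli process and wet-day amounts from a Gamma distribution whose mean is scaled over time to impose nonstationarity. This is a deliberately crude toy (all probabilities, parameters, and the 1%/yr trend are assumptions), far simpler than the rain-type or wavelet-based generators cited above.

```python
import numpy as np

rng = np.random.default_rng(11)

def simulate_year(year, base_year=1980, days=365):
    """Toy nonstationary daily rainfall generator (illustrative only):
    Bernoulli wet-day occurrence, Gamma wet-day amounts, and a +1%/yr
    trend imposed on the mean wet-day intensity."""
    wet = rng.random(days) < 0.30                  # assumed wet-day probability
    intensity_factor = 1.01 ** (year - base_year)  # assumed +1%/yr on the mean
    amounts = rng.gamma(shape=0.8, scale=10 * intensity_factor, size=days)
    return np.where(wet, amounts, 0.0)

early = np.concatenate([simulate_year(y) for y in range(1980, 1990)])
late = np.concatenate([simulate_year(y) for y in range(2040, 2050)])
print(early[early > 0].mean(), late[late > 0].mean())
```

Real generators additionally preserve wet/dry spell persistence, seasonality, and spatial correlation, and condition their parameters on physical covariates rather than a fixed trend.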

4.4 Change point analyses for detection of abrupt change
Abrupt changes in hydroclimatic extremes are often a sign that there has been significant human intervention, such as the construction of a dam or diversion, which may significantly affect a catchment's water balance. Change point (also known as step-trend) tests can thus be used either to detect such a shift, or to compare two different periods separated by a long gap (Helsel et al., 2020). Several change-point tests are available. The nonparametric Pettitt test is one of the most widely applied, and allows the user to determine the timing of the change point and its significance (Pettitt, 1979). The performance of several methods - binary segmentation (Scott and Knott, 1974), Bayesian analysis (Erdman and Emerson, 2008), wild binary segmentation (WBS) (Fryzlewicz et al., 2014), nonparametric PELT (Haynes et al., 2017), and the Mann-Whitney test (Mann and Whitney, 1947) - for the estimation of abrupt changes in the mean, variance, or median of a time series was tested using both simulated and historical data with known change points (Ryberg et al., 2019). Although these methods offer potential benefits (such as the detection of multiple change points), the comparison found that the Pettitt test delivered the best combination of change point detection and minimization of false positive results. The parametric tests (PELT, binary segmentation, and Bayesian analysis) generally performed poorly at detecting known change points in peak streamflow, whilst the non-parametric tests (non-parametric PELT and Mann-Whitney) as well as the parametric WBS resulted in unacceptable false positive rates (Ryberg et al., 2019). The Mood test (Mood, 1954; Ross et al., 2011) for abrupt changes in scale was also evaluated in the same study but located only about 25% of known change points in historical data (with a relatively low false positive rate) and approximately 29% of change points in simulated data (with a relatively high false positive rate).
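The Pettitt test is simple enough to implement directly: it scans all candidate split points, computes a Mann-Whitney-type statistic for each, and uses the maximum with an approximate p-value. The sketch below is a minimal implementation applied to a synthetic series with an assumed step change (e.g. mimicking a reservoir intervention); the data and shift size are illustrative.

```python
import numpy as np

def pettitt(x):
    """Minimal Pettitt (1979) change-point test: returns the most likely
    change-point index (last index of the first segment) and an
    approximate two-sided p-value."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # U_t = sum over pairs (i <= t, j > t) of sign(x_j - x_i)
    sgn = np.sign(x[None, :] - x[:, None])  # sgn[i, j] = sign(x_j - x_i)
    u = np.array([sgn[: t + 1, t + 1:].sum() for t in range(n - 1)])
    k = np.abs(u).max()
    t_hat = int(np.abs(u).argmax())
    p = 2.0 * np.exp(-6.0 * k**2 / (n**3 + n**2))  # asymptotic approximation
    return t_hat, min(p, 1.0)

rng = np.random.default_rng(5)
series = np.concatenate([rng.normal(10, 2, 40),   # pre-intervention
                         rng.normal(14, 2, 40)])  # post-intervention shift
t_hat, p = pettitt(series)
print(t_hat, p)
```

As noted above, a detected change point should always be cross-checked against metadata (dam construction dates, gauge moves) before being attributed to a physical driver.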
4.5 Circular statistics and methods for detection of shifts in timing

The null hypothesis of circular statistics, when applied to timing, is that data are evenly distributed (uniform), with no tendency to cluster. Different methods can be used to calculate the trend in timing, such as circular regression (Wasko et al., 2020a); measures of linear-circular association (Villarini, 2016); the Theil-Sen slope estimator (Blöschl et al., 2017); linear regression (Barnett et al., 2015); and linear regression after standardization on the local water year (Wasko et al., 2020b). Three seasonality types can also be detected: circular uniform (no preferential direction, indicating that an event has the same probability on any day of the year); reflective symmetric (unimodal); and asymmetric (multimodal, e.g. locations with multiple generating processes, such as snowmelt and mesoscale convective systems in the case of flooding) (Villarini, 2016).

For precipitation, Gu et al. (2017) found a significant shift in the seasonality of extreme rainfall using circular statistics, which suggested that pathways of seasonal vapor flux and tropical cyclones were likely driving these changes. For flooding, an approach based on seasonality statistics was developed to evaluate changes in the dominant flood-generating processes across Europe. Berghuijs et al. (2019b) use the mean date of occurrence of three processes (extreme precipitation, soil moisture excess, and snowmelt) to estimate the relative importance of each process at 3,777 European catchments with at least 20 years of flood peak timing data. They found that the relative importance of mechanisms had not changed significantly over 50 years.
Similarly, Macdonald et al. (2010) used the mean day of the flood (MDF) to assess changes in flood timing over Wales.
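Computing a mean date of occurrence requires circular, not linear, averaging: dates are mapped onto a circle so that late-December and early-January events average to around the new year rather than to midsummer. The sketch below uses hypothetical flood dates straddling the year boundary.

```python
import numpy as np

def mean_day_of_year(days, year_length=365.25):
    """Circular mean of event dates (day-of-year), as used for the mean
    date of flooding. Also returns the mean resultant length, a measure
    of clustering (0 = uniform timing, 1 = fully clustered)."""
    theta = 2 * np.pi * np.asarray(days, dtype=float) / year_length
    x, y = np.cos(theta).mean(), np.sin(theta).mean()
    mean_theta = np.arctan2(y, x) % (2 * np.pi)
    concentration = np.hypot(x, y)
    return mean_theta * year_length / (2 * np.pi), concentration

# Hypothetical flood dates clustered around the new year.
dates = [355, 360, 364, 2, 5, 10]
mean_day, r = mean_day_of_year(dates)
print(round(mean_day, 1), round(r, 2))
```

A naive arithmetic mean of these dates lands near day 183 (midsummer), the opposite of the true winter-centred regime, which is precisely the failure mode circular statistics avoid.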

5 Drivers of nonstationarity
The terms hydroclimatic 'mechanisms', 'agents', and 'drivers' are sometimes used interchangeably or with different intents.
Here, we distinguish between these terms as follows. The expressions hydroclimatic agents or mechanisms refer to the processes generating hydroclimatic extremes. Depending on the temporal and spatial scale, precipitation-generating agents might include local convection, synoptic weather patterns or cyclones; flood-generating mechanisms might additionally include snowmelt or ice-jam release. The expression 'nonstationary drivers' refers to longer-term processes which may cause significant shifts in the underlying distributions of hydroclimatic extremes, via climate or land cover change. Here we focus more on the drivers of nonstationarity than the extreme-generating mechanisms and agents. It is important to note that other (non-process related) factors may also generate spurious nonstationarities (see Section 3.1).
Among the different hydroclimatic extremes, it is widely recognised that temperature is easier to attribute than precipitation, floods and storms. Dynamically driven extremes are harder to attribute because they typically have smaller signal-to-noise ratios and higher uncertainty (see e.g. Trenberth et al., 2015). The drivers or sources of nonstationarity are predominantly artificial, i.e. human impacts such as land cover change, river regulation, and anthropogenic climate change. They may operate over a range of timescales, from the annual timescales of water management (abstraction, water transfers) and some abrupt land-surface changes (e.g. urbanisation or impacts of forest fires) to the (multi)decadal timescales of climate change. In contrast, natural periodicities, such as the seasonal cycle, weather patterns, climate modes of variability, and geological climate cycles, operate over their own characteristic timescales, from days to millennia.

5.1 Climate variability and change
Climate change is widely recognised as one of the most important multidecadal to millennial drivers of changes in hydroclimatic extremes. Potential impacts may be expressed primarily through the water cycle, as storminess and rainfall extremes are expected to intensify under climate change. Generally, peak rainfall intensities increase with both regional and global temperature, broadly following the Clausius-Clapeyron (C-C) relation of roughly 7% more atmospheric moisture-holding capacity per degree of warming. Some studies have shown that extreme precipitation is expected to become more intense, and weaker rainfall less intense (Wasko and Sharma, 2015). However, further investigation is needed, as there are contrasting results regarding the decline in light precipitation, so the "compensation" mechanism is still unclear (Markonis et al., 2019). Increases in precipitation extremes in response to warming are greater for convective precipitation (a super C-C rate) compared with stratiform precipitation (a normal C-C rate, Berg et al., 2013). Further, it has been suggested that the transition from C-C scaling to super C-C scaling may occur at higher temperatures in regions with a climatologically lower frequency of convective events (e.g. approx. 20°C in South Korea vs 12°C in certain European regions, Park and Min, 2017). One other key question is whether precipitation intensities are likely to increase proportionally more over shorter timescales. Research suggests that temperature-rainfall scaling increases over hourly timescales (Lenderink and Van Meijgaard, 2008). For example, UK summer precipitation is estimated to increase by 30-40% for short-duration extreme events (Kendon et al., 2014).
What does temperature scaling mean for floods and droughts? There is still little evidence that increases in heavy rainfall events at higher temperatures translate into similar increases in streamflow (Wasko and Sharma, 2017). Increased extreme rainfall does not necessarily lead to increased flooding (Blöschl et al., 2019).
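The C-C and super C-C rates translate into simple back-of-envelope arithmetic for expected intensity changes. The sketch below applies exponential scaling at the commonly quoted ~7%/°C (C-C) and ~14%/°C (super C-C) rates to a hypothetical hourly extreme; the base intensity and warming level are assumptions, and real scaling rates vary by region, season, and event duration.

```python
# Back-of-envelope Clausius-Clapeyron scaling (illustrative rates):
# ~7%/degC for C-C scaling and ~14%/degC for super-C-C (convective) scaling.
def scaled_intensity(intensity_mm_h, delta_t_c, rate=0.07):
    """Expected rainfall intensity after delta_t_c degrees of warming,
    assuming a fixed exponential scaling rate per degree."""
    return intensity_mm_h * (1 + rate) ** delta_t_c

base = 20.0  # hypothetical hourly extreme, mm/h
for rate, label in ((0.07, "C-C"), (0.14, "super C-C")):
    print(label, round(scaled_intensity(base, delta_t_c=3.0, rate=rate), 1))
```

At 3°C of warming the two rates diverge substantially (roughly +23% versus +48% on the base intensity), which is why distinguishing convective from stratiform scaling matters for projections.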
There are many other factors that affect flood response besides precipitation intensity, including the duration and extent of precipitation events, antecedent soil moisture conditions, catchment size, vegetation cover, and catchment imperviousness and roughness (Sharma et al., 2018). Studies have found that lower annual recurrence interval floods are more likely to be reduced due to drier antecedent soil moisture conditions, whereas higher annual recurrence interval floods are more likely to increase due to increases in extreme rainfall (Figure 7 in Wasko and Nathan, 2019). In temperate climates of Northwestern Europe, extreme precipitation is an important flood driver; while in drier climates of Southern Europe, both antecedent soil moisture and extreme precipitation matter (with greater importance of soil moisture for smaller floods) (Bertola et al., 2021). Increases in precipitation intensity are more likely to cause increases in streamflow in smaller catchments (Wasko and Sharma, 2017; Wasko and Nathan, 2019). Perhaps counter-intuitively, streamflow is likely to decrease in catchments that are experiencing significant reductions in the fraction of precipitation falling as snow (Berghuijs et al., 2014). In high-altitude, snowmelt-dominated regions such as the Hindu Kush Himalayan "water towers of Asia" (Immerzeel et al., 2020), where a large amount of water is stored as snow or ice, expected climate shifts could accelerate glacier and snowpack melting. This may in turn lead to more frequent glacial lake outburst floods, flash floods and riverine floods, thus posing potential risks to the 240 million people in the region and the 1.9 billion people living downstream (Wester et al., 2019). Yet, it is important to note that in many locations, the effects of climatic shifts may represent only a minor source of variability when compared with more direct anthropogenic influences such as reservoirs and land cover change (Lins, 2012).
The effects of climate change on wind are less well understood than those on precipitation, temperature, or hydrological extremes. Although the intensity and location of future storm tracks depend on changing temperature gradients, the response is likely to be variable, because those temperature gradients are expected to change in a non-uniform manner. For example, Harvey et al. (2014) found a consistent positive relationship between the change in temperature gradient and storm track intensity across CMIP5 models, but a highly variable storm track response (e.g. with strengthening over the UK region, yet weakening over the US). This is explained by a weakening temperature gradient over the US (due to Arctic amplification), and a strengthening gradient over the eastern Atlantic, likely explained by changes to ocean circulation. Over large spatial scales, climate change also affects the characteristics (magnitude, frequency, and timing) of the synoptic-scale phenomena that generate hydroclimatic extremes, such as atmospheric rivers (ARs), tropical cyclones, and atmospheric circulation patterns (e.g. Hirschboeck, 1988; Schlef et al., 2019). For example, ARs play a major role in flood occurrence in many regions of the world (Lavers et al., 2011, 2012; Paltan et al., 2017), and thus changes in their frequency and characteristics could alter the properties of future hydroclimatic extremes. In regions such as Norway where rainfall extremes are primarily driven by ARs (Whan et al., 2020), changes in the phase of precipitation are leading to increasingly rainfall-dominated, rather than snowfall-dominated regimes, which may alter flood characteristics and water resource management. Where snow would otherwise be stored in the catchment, rainfall is likely to contribute more directly to runoff, leading to more severe AR-induced floods, but less severe snowmelt-driven floods later in the season.
The effects of climate variability on hydroclimatic extremes are well recognised. Hazard-rich and hazard-poor periods in the historical record tend to be driven by the spatial and temporal periodicities of multiple, sometimes overlapping, climate modes.
For instance, during a positive phase of the North Atlantic Oscillation (NAO), the jet stream tends to shift northwards, generating a greater frequency of heavy rain events over the British and Irish islands (Hannaford and Marsh, 2008). The NAO, together with the Scandinavian pattern (SCA), is also the main driver of European windstorm variability (Walz et al., 2018). Understanding the exact "mix of atmospheric influences" driving flood-rich periods, such as the exceptionally flood-rich past three decades in Europe, requires ongoing work (Blöschl et al., 2020). El Niño Southern Oscillation (ENSO) phases have different effects on flooding and droughts depending on locality. In Central America, for instance, the West Coast experiences increased likelihood of droughts during the El Niño phase of ENSO, while the East Coast has increased flood risk (Cid-Serrano et al., 2015; Enfield and Mayer, 1997; Aguilar et al., 2005). Moreover, El Niño events have been observed to drive episodes of extreme heat in Southeast Asia (Thirumalai et al., 2017), whilst also increasing the lifetime and strength of tropical cyclones in the Western North Pacific (Camargo and Sobel, 2005). Temperature extremes in Northern Europe are associated with atmospheric blocking situations, both during winter cold spells (Sillmann et al., 2011) and summer heatwaves (Schaller et al., 2018). However, the interlinkages between climate variability and change are still not fully understood for atmospheric blocking and remain challenging to disentangle (e.g. Woollings and Blackburn, 2012; O'Reilly et al., 2019).

5.2 Land cover changes
Land cover modulates the response of local, regional, and remote regions to shifts in climate variability and change. Land is both a source and sink of greenhouse gases: expansion of areas under agriculture and forestry may alter hydroclimatic extremes through a combination of both biophysical effects (e.g. photosynthesis, respiration, drying, greening) and greenhouse gas feedbacks (see Chapter 2 of IPCC (2019)).
The effect of urbanisation on hydroclimatic extremes is perhaps one of the most well documented land cover changes, but there are still many unknowns. Cities alter the local atmosphere through the urban heat island (UHI) effect, which increases the mean annual surface air temperature within cities relative to surrounding rural areas. The magnitude and diurnal amplitude of the UHI effect varies between cities (Ward et al., 2016), lifting night-time temperatures more than daytime temperatures (Hausfather et al., 2013). For example, in Kuala Lumpur, Malaysia, hourly intensities of extreme rainfall have increased by ~35% in the last 30 years due to the UHI creating an unstable atmosphere (i.e. altering the local atmospheric vertical structure, Li et al., 2019). The effects of these local land surface changes are compounded by the thermodynamic and dynamical effects of climate change. Local changes in extreme temperature are driven over short/medium spatial and temporal scales by changes in the land surface such as the UHI effect and large-scale irrigation (e.g. Mahmood et al., 2014), but over multi-decadal timescales they are principally caused by large-scale shifts in greenhouse gases modifying the global mean temperature.
Urbanisation effects on flooding and drought are also widely acknowledged but not yet fully understood. Prosdocimi et al. (2015) evaluated the impacts of urban land cover in a paired-catchment study in the UK using both block maxima and POT approaches. They found a significant effect of urbanization on high flows in all seasons, with the strongest effect in summer.
A recent analysis of 280 stream gauges in the United States using panel regression found that annual maximum floods increase by 3.3% on average for every 1% increase in impervious land cover (Blum et al., 2020). Further, the vertical structure of cities can alter precipitation and flood extremes. When hurricane Harvey hit Houston in August 2017, enhanced rainfall was produced by the storm system's drag induced by increased surface roughness. Another recent study examined the effect of urbanization on long-term persistence in the streamflow records of 22 catchments in the Northeastern United States using scaling exponents (Jovanovic et al., 2016). They found evidence that streamflow responds more quickly to rainfall in urbanized catchments than in less urbanized counterparts.
Increases in vegetation cover, termed "greening", are occurring in many regions, both as a result of human afforestation and potentially through a fertilization effect caused by increases in global CO2 (expected to lead to greater water consumption, e.g. Ukkola et al., 2016). Nonlinear feedbacks between atmospheric and land surface processes can also lead to varying carbon uptake in vegetation (e.g. soil moisture may alter photosynthesis rates; Humphrey et al., 2021). Afforestation and deforestation affect both floods and droughts, and sometimes in more ways than one. For example, afforestation can increase streamflow magnitude through ditching but also reduce flood peaks through canopy interception, evapotranspiration, and drier antecedent conditions (e.g. Birkinshaw et al., 2014; Soulsby et al., 2017). The effects of changes in vegetated land are believed to be far more pronounced for low flows than high flows (e.g. Birkinshaw et al., 2014; Bathurst et al., 2020; Do et al., 2017; Vicente-Serrano et al., 2019), although there have been very few large-sample studies using observational records. To understand the influence of land cover changes on catchment hydrology, numerical modelling is often used to assess potential impacts and provide evidence for ambitious policy-related land cover changes before implementation. Some have employed theoretical land cover changes (e.g. Gao et al., 2018; Iacob et al., 2017); others scenario-based land cover changes (e.g. Harrison et al., 2019). The response of hydroclimatic extremes to land cover change varies over both spatial and temporal scales. At the very fine scale, the hydraulic structure of the soil may change due to reorganisation of macro- and micro-pores associated with land management practices within a single parcel of land. At continental and multi-decadal scales, changes in the magnitude, intensity, and pathway of storms may lead to widespread changes in runoff.
These effects may not be distinguishable over short periods in individual catchment data, so long-term monitoring is required to understand their impact on catchment hydrology over a range of spatial and temporal scales.
Major land cover changes may also have remote teleconnection effects on hydroclimatic extremes. For instance, it has been shown that tropical deforestation alters precipitation patterns not only locally, but also in the mid- and high latitudes. Deforestation of the Amazon and Central Africa may reduce precipitation in the U.S. Midwest, while deforestation of Southeast Asia has been shown to affect China (Avissar and Werth, 2005). In Northern India, Saeed et al. (2009) found that irrigation suppresses the development of monsoon-driving land-sea temperature gradients and, thereby, exerts a first-order control on precipitation in Central India and the Bay of Bengal.

Water resources management and geomorphological change
With the growing demand for food and increased productivity following the mechanisation of agriculture, large swathes of land have been subject to arterial and land drainage. Such installations can take various forms. Their impact on hydrological response is poorly understood, but they may alter catchment morphology and soil and groundwater hydrological response, depending on the extent of the catchment affected and the extent of works completed on river channels. The nature of hydrological nonstationarities associated with arterial and field drainage has been debated: some argue that changes should not be detectable in discharge as abrupt shifts, while others have identified change points in specific components of the hydrological regime (Harrigan et al., 2014). Numerous studies showed that arterial and field drainage can affect flood peaks (increasing magnitude and reducing time to peak, e.g. Wilcock and Wilcock (1995), Bhattarai and O'Connor (2004)). Conversely, understanding of the impacts of drainage on low flows and drought remains poor, as does our understanding of how other components of runoff response are affected, due to inadequate monitoring of sub-surface hydrological processes.
Changes in river channel morphology and conveyance capacity are also poorly-understood drivers of changes in flooding.
In the Mississippi River for example, construction of wing dikes, navigational structures, and levees has contributed significantly to increases in flood levels (Pinter et al., 2008). Decreases in river conveyance may significantly affect the frequency of overbank flooding and can be identified from trends in flood stages (Pinter et al., 2006) as well as from stream gauge transect data (James, 1991; Smelser and Schmidt, 1998). It is now possible to estimate the relative effect of hydrologic and geomorphic drivers of changing flood risk (Slater et al., 2015) and even to estimate the influence of tropical cyclones on conveyance and flood risk (Li et al., 2020).
Water management can also generate significant nonstationarities. Dam construction increased dramatically over the last century and is likely to continue apace in developing regions. Grill et al. (2015) highlighted that on a global basis, 48% of river volume was moderately to severely impacted by either flow regulation, river fragmentation (i.e. diminished connectivity within river systems), or both as a result of dam construction. The impacts on trends in river flow have been debated. Lorenzo-Lacruz et al. (2012) examined trends in Iberian river flows over the period 1945-2005, finding that river regulation by dams was more likely to affect the magnitude of trends rather than the direction of change, with important seasonal differences. Examining trends in floods at a global scale, Do et al. (2017) found that the presence of dams did not have a large impact on trend results in all catchments. Rather, catchment size and local context were deemed to be most important in determining response. Abstractions and discharges from watercourses can also drive nonstationarities; thus an important task in the detection of climate or land cover impacts is to ensure naturalised flow regimes are employed.

Land-atmosphere feedbacks
Nonstationarity can also arise when non-linear dynamical systems exhibit multiple metastable states. Much debate has arisen in the literature surrounding the flawed diagnosis of nonstationarity in systems that have metastable states due to internal feedbacks. The difficulty is particularly acute when records are short and understanding of the system is poor. Examples of metastable states imposed by the climate system include the persistence of periods of high levels in Lake Victoria (Sutcliffe and Parks, 1999), now thought likely to be driven by the Indian Ocean Dipole (Schott et al., 2009). Important examples of land-atmosphere feedbacks internal to hydrological systems include the role of rainfall recycling in Amazon dieback (Wenzel et al., 2014). Feedbacks are also thought to have amplified climate variability in the Sahel and in the 20th-century Western U.S. dust bowl (Berg et al., 2016). More recently, the role of soil moisture storage in controlling land-surface feedbacks was highlighted through enhancement of heatwaves and associated severe drought conditions in Central Europe (e.g. Seneviratne et al., 2010; Kornhuber et al., 2020). In Australia, both positive and negative correlations have been found between daily soil moisture and next-day rainfall, depending on spatial scale, location, and season (Holgate et al., 2019). Of key importance is the potential for human water use to affect the water balance in regions where hydrological systems are close to sensitive thresholds (Gleick and Palaniappan, 2010). These examples demonstrate that internal feedbacks in hydrological systems can amplify nonstationary drivers of change. Moreover, some of these feedbacks exhibit behaviour which suggests that they introduce tipping points in the Earth system, providing not only a response to external changes in the hydrological system but also potentially becoming a driver of nonstationarity in their own right (Lenton et al., 2008).
Analyses which address the mechanisms behind such feedbacks will be required in order to account for their effects in the future.
Most of the time there are many drivers affecting hydroclimatic extremes simultaneously, across multiple overlapping temporal and spatial scales (Figure 7). This makes it all the more challenging to discern their individual effects.

Compound drivers: difficulties in understanding nonstationarity in compound risk and consecutive disasters
Interest in compound hydrological extremes is growing rapidly, not least because these events can deliver particularly severe societal impacts (de Ruiter et al., 2019; Raymond et al., 2020a; Zscheischler et al., 2020). Although rarity and complexity pose challenges for their assessment, physical reasoning and empirical methods have highlighted that they can be sensitive to anthropogenic climate warming (AghaKouchak et al., 2020) as well as modes of climate variability (Hillier et al., 2020). It is also possible that novel compound hazards (such as enhanced wildfires in 2019 in California, caused by a combination of wind and low-humidity events) may emerge as the climate continues to change (Raymond et al., 2020a; Matthews et al., 2019), causing potentially devastating consequences for communities taken by surprise (Masys et al., 2016). With so much at stake, losses may be too substantial to learn from novel hazards once they have emerged (Masys et al., 2016; Sornette and Ouillon, 2012). Instead, modelling and simulation can be used to explore extreme unseen events (Section 4.2), as can resampling and stochastic weather generation techniques that provide a quantitative underpinning to storyline approaches (Woo, 2019; Shepherd et al., 2018; Matthews et al., 2019).
The role of spatial and temporal correlations in hydrological hazards like drought is important for probabilistic calculations of risk. For example, simultaneous crop failure in Russia, South West Australia and South West China caused a major spike in grain prices in 2011-12 (e.g. Gaupp et al., 2020). Crop production losses can be underestimated by a factor of three if spatially-correlated risks are ignored, but losses can be reduced by international cooperation to pool risks (e.g. Gaupp et al., 2020). Coincident hydrological extremes at the global scale can be driven by climate modes such as ENSO, the Pacific Decadal Oscillation, and the Atlantic Multi-decadal Oscillation (De Luca et al., 2020).
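The penalty for ignoring spatial correlation can be illustrated with a minimal Monte Carlo sketch: a bivariate Gaussian "drought index" for two regions, compared against an independence assumption. All numbers here are illustrative and not taken from the studies cited above.

```python
import numpy as np

rng = np.random.default_rng(4)
n, rho = 200_000, 0.6

# Correlated standardised drought indices for two breadbasket regions
cov = [[1.0, rho], [rho, 1.0]]
corr = rng.multivariate_normal([0.0, 0.0], cov, size=n)
# Same marginals, but assuming independence between regions
indep = rng.normal(size=(n, 2))

thr = 1.645  # roughly a 1-in-20-year drought in each region
p_joint_corr = np.mean((corr > thr).all(axis=1))
p_joint_indep = np.mean((indep > thr).all(axis=1))
# Positive correlation inflates the joint failure probability well above
# the independence estimate of ~0.05 * 0.05 = 0.0025
```

Under independence the joint probability is simply the product of the marginals; with rho = 0.6 the simultaneous-failure probability is several times larger, which is the mechanism behind the underestimated crop losses noted above.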

Attribution of nonstationarity in weather and water extremes
The term 'attribution' implies understanding the causes or driving processes of change. In the climate sphere, it is often understood in a strict sense as quantifying the influence of humans on the climate or demonstrating that climate change is consistent with climate model predictions rather than alternative causal explanations. However, beyond climate science, the field of attribution is still relatively new. In hydroclimatology, the concept of attribution is used more broadly to disentangle the wide range of drivers of hydroclimatic extremes, which include climate variability, land cover change, and land-atmosphere feedbacks (section 5). Attribution is typically preceded by exploratory data analysis (EDA) to understand the underlying data (section 6.1). Attribution methods can then broadly be divided into empirical (section 6.2) and simulation-based (section 6.3) approaches. Empirical approaches use statistical techniques to relate the causes to detected changes in the observed record. Simulation-based approaches use model simulations to explain drivers of changes in climatic extremes. In both cases, attribution requires developing multiple working hypotheses based on a process-based understanding of potential drivers of change (Harrigan et al., 2014).

Exploratory data analysis (EDA)
Exploratory analyses play an essential role in understanding historical changes. Often, such analyses involve visual exploration of the time series and of the statistical test results for many gauges to identify common patterns or outliers. At the most basic level, temporal trend lines (e.g. Figure 6f) from multiple stations can be plotted on the same graph to identify any record that diverges from the coherence of the pooled analysis. Similarly, trend persistence plots may be created for multiple stations to identify suspect data where nonstationarities may be present. For example, Noone et al. (2016) computed the MK Z-score over different time periods, starting with the full record, then shortening the start year one year at a time to a minimum length required for statistical robustness (such as 30 years). Plotting persistence lines for multiple stations allowed them to flag stations with possibly spurious nonstationarities. Wavelet power spectra can also be plotted for multiple stations to assess whether a common periodicity can be detected across multiple sites. For instance, Rust et al. (2019) used continuous wavelet transforms to identify multi-annual periodicities of the North Atlantic Oscillation in UK groundwater records.
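A multi-temporal Mann-Kendall scan of this kind can be sketched as follows. This is a minimal illustration: it omits the tie and serial-correlation corrections used in practice, and the function names are our own.

```python
import numpy as np

def mann_kendall_z(x):
    """Mann-Kendall Z-score (no tie/autocorrelation correction)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # S statistic: sum of signs of all pairwise forward differences
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        return (s - 1) / np.sqrt(var_s)
    if s < 0:
        return (s + 1) / np.sqrt(var_s)
    return 0.0

def trend_persistence(series, first_year, min_len=30):
    """Z-score for windows [year, end-of-record] as the start year advances."""
    out = {}
    for offset in range(len(series) - min_len + 1):
        out[first_year + offset] = mann_kendall_z(series[offset:])
    return out
```

Plotting the resulting Z-scores against start year for many stations produces the persistence lines described above; |Z| > 1.96 indicates a trend significant at the 5% level.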

Empirical attribution approaches
Statistical attribution of hydroclimatic nonstationarity is typically regression-based. Statistical methods require less computational power than physical model-based approaches, but cannot account for complex feedbacks between processes. Additionally, statistical approaches are unable to distinguish between correlation and causation, even when a predictor is physically plausible. The first step in attributing the drivers of a nonstationary process is to detect the presence of nonstationarity in the time series. This can be done by applying a trend and significance test (section 4.1); or by fitting a distributional regression model with constant parameters (stationary case), then again with time-varying parameters (see Figure 6a), and evaluating which provides the better description of the data. If the stationary model fit is better (e.g. lower AIC or SBC), then it is not recommended to proceed further. If the time-varying model is better, and if there are physically-plausible reasons to suspect nonstationarity, then attribution may proceed. However, it is important to note that small differences in AIC and SBC are not always meaningful (Burnham and Anderson, 2004), and visual assessment is generally recommended. If a time series is deemed nonstationary, regression can be used to determine potential drivers of change by introducing additional predictors that are representative of the drivers (e.g. changing land cover, climate, or reservoir indices). Regression model coefficients can be used to quantify the effect of covariates (e.g. Prosdocimi et al., 2015).
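The model-comparison step can be sketched by fitting a stationary GEV and a GEV whose location parameter varies linearly with time, then comparing AIC. This is an illustrative sketch only: operational analyses typically use dedicated packages and may also let the scale parameter vary. Note that scipy's shape parameter `c` is the negative of the conventional GEV shape.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import genextreme

def fit_gev_stationary(x):
    """Stationary GEV fit; returns parameters and AIC."""
    c, loc, scale = genextreme.fit(x)
    nll = -genextreme.logpdf(x, c, loc=loc, scale=scale).sum()
    return (c, loc, scale), 2 * 3 + 2 * nll

def fit_gev_trend(x, t):
    """GEV with location mu(t) = b0 + b1*t; scale and shape held constant."""
    t = (t - t.mean()) / t.std()  # standardise the covariate for stability

    def nll(p):
        b0, b1, log_sigma, c = p
        return -genextreme.logpdf(x, c, loc=b0 + b1 * t,
                                  scale=np.exp(log_sigma)).sum()

    res = minimize(nll, x0=[x.mean(), 0.0, np.log(x.std()), 0.1],
                   method="Nelder-Mead", options={"maxiter": 5000})
    return res.x, 2 * 4 + 2 * res.fun  # parameters, AIC
```

If the time-varying model yields a clearly lower AIC (and the fitted trend coefficient is physically plausible), attribution may proceed as described above.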
Panel regression techniques are increasingly popular because they can be used to leverage temporal and spatial variation to isolate a causal effect, separate from other drivers of change (Blum et al., 2020). These methods pool both dynamic (e.g. land cover) and static (e.g. physical catchment characteristics) data across time and space. They are able to identify generalized relationships between drivers and hydrologic response (i.e. reliable model coefficients), and are particularly powerful in regional and large-sample analyses (Bassiouni et al., 2016; Steinschneider et al., 2013). Examples of recent applications of panel regression include estimating low-flow response to rainfall in ungauged basins (Bassiouni et al., 2016), estimating the effects of urbanization on annual runoff coefficients (Steinschneider et al., 2013) and peak flows in the United States (Blum et al., 2020) and in Belgium (De Niel and Willems, 2019). Others have applied panel regression to analyse the effects of forest cover change and other socioeconomic factors on the frequency of large floods in developing regions (Ferreira and Ghimire, 2012) or to examine the effects of deforestation and agricultural development on streamflow in Brazil (Levy et al., 2018).
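The core of the fixed-effects ("within") panel estimator can be sketched in a few lines for a single regressor. This is a simplification of the cited studies, which use richer specifications (multiple regressors, time effects, clustered standard errors).

```python
import numpy as np

def fixed_effects_slope(y, x, groups):
    """Within (fixed-effects) estimator for one regressor: demean y and x
    within each catchment, then pool and run OLS through the origin."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    yd = np.empty_like(y)
    xd = np.empty_like(x)
    for g in np.unique(groups):
        m = groups == g
        yd[m] = y[m] - y[m].mean()  # removes the catchment-specific intercept
        xd[m] = x[m] - x[m].mean()
    return (xd @ yd) / (xd @ xd)
```

Demeaning within each catchment absorbs all time-invariant catchment characteristics, so the pooled slope reflects only the covariation of the driver (e.g. impervious cover) and the response (e.g. flood magnitude) over time.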

Simulation-based attribution approaches
Unlike statistical methods, where models are "fit" to observations, climate model experiments are physically-based: they solve sets of mathematical equations representing various Earth-system processes. The strength of these methods relies on their use of physics to accurately describe real-world feedbacks and recognise important physical limits, for example, the cap on extreme humid heat in the tropics linked to deep convection (Sherwood and Huber, 2010; Zhang and Fueglistaler, 2020). Here we provide a brief overview, but we point the reader to Hegerl et al. (2010) and Stocker et al. (2013) for details. In attribution studies, climate models can be run under different scenarios to identify the role of anthropogenic forcing in contributing to the likelihood of an observed extreme. For example, comparing the probability of an extreme event simulated by a model for the current climate (p1) with its probability in a modelled pre-industrial scenario (p0) indicates how much "more likely" (p1/p0) the event has become relative to a "counter-factual" world in which the pre-industrial climate persisted (see Figure 8a-b). The probability ratio framework is also used to identify the "fraction of attributable risk" (1 - p0/p1), the portion of the event's probability that can be assigned to anthropogenic modification of the climate (Fischer and Knutti, 2015).
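Given ensembles of an event-relevant variable from factual and counter-factual simulations, the probability ratio and fraction of attributable risk reduce to simple exceedance counts. The sketch below uses synthetic Gaussian ensembles standing in for model output.

```python
import numpy as np

def pr_far(factual, counterfactual, threshold):
    """Probability ratio p1/p0 and fraction of attributable risk 1 - p0/p1."""
    p1 = np.mean(np.asarray(factual) > threshold)         # current climate
    p0 = np.mean(np.asarray(counterfactual) > threshold)  # pre-industrial
    return p1 / p0, 1.0 - p0 / p1

rng = np.random.default_rng(0)
factual = rng.normal(1.0, 1.0, 100_000)  # warmed-climate ensemble (synthetic)
counter = rng.normal(0.0, 1.0, 100_000)  # pre-industrial ensemble (synthetic)
pr, far = pr_far(factual, counter, threshold=2.0)
```

With a one-standard-deviation warming shift, exceedances of the threshold become several times more likely in the factual ensemble; in practice, sampling uncertainty in p0 and p1 is usually quantified by bootstrapping.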
Changes in event magnitude can also be examined using the same framework. Wang et al. (2018) applied such methods to the exceptional rainfall caused by Hurricane Harvey in 2017. The analysis indicated that, in a counter-factual environment of no trends in SSTs or tropospheric variables, the storm would have delivered almost 30% less precipitation. Such methods, which blend climate change scenarios with numerical weather prediction models, can also be used to assess how the characteristics of observed extremes may differ in counter-factual worlds (hotter or colder than present). Lackmann (2015) demonstrated this for Hurricane Sandy, noting that climate warming since pre-industrial levels caused a small northward shift and intensification of the storm's track; further intensification and northward displacement could be expected in a counterfactual climate even warmer than observed in 2012. Statistical comparisons, such as the Anderson-Darling (AD; Figure 8c) or Kolmogorov-Smirnov (KS) tests, can formally confirm any significant differences between historical observations and future simulated extremes. The AD test is often more powerful when comparing two distributions than the KS test (Engmann and Cousineau, 2011).
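Both comparisons are readily available in scipy; the sketch below runs them on synthetic Gumbel samples standing in for observed and simulated block maxima.

```python
import numpy as np
from scipy.stats import anderson_ksamp, ks_2samp

rng = np.random.default_rng(1)
hist = rng.gumbel(loc=100.0, scale=12.0, size=300)  # "observed" annual maxima
futr = rng.gumbel(loc=110.0, scale=14.0, size=300)  # "simulated future" maxima

ad = anderson_ksamp([hist, futr])  # k-sample AD; may warn that p is capped
ks = ks_2samp(hist, futr)          # two-sample KS
# Small p-values (ad.significance_level, ks.pvalue) indicate the two
# distributions differ significantly
```

The AD statistic weights the tails more heavily than KS, which is one reason it tends to be the more powerful of the two for extremes.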
Other studies have compared observations and multi-model simulations using an 'optimal fingerprinting' technique to attribute observed changes to human-induced climate change. The fingerprint method was also recently used to attribute global trends in mean and extreme streamflow to anthropogenic climate change (Gudmundsson et al., 2021). Attribution often requires running large model ensemble experiments, which can be computationally expensive. The project Climateprediction.net was established to make use of thousands of home computers volunteered by individuals, to enable ensemble runs of global or regional climate models. These many runs can be used to quantify internal ensemble variance and compare it to observed events. The project now runs regional large-ensemble citizen science experiments called "weather@home", used for probabilistic event attribution (Massey et al., 2015; Guillod et al., 2017). For example, this regional ensemble approach was used to demonstrate the human influence on climate in the 2014 Southern England winter floods (Schaller et al., 2016) and the 2003 European summer heat (Mitchell et al., 2016).

Event attribution
Event attribution has emerged as a major field of research in the last decade, aiming to assess whether specific extreme events can be ascribed to human-induced climate change (Stott et al., 2016). Climate models can be used to estimate the influence of anthropogenic forcing on such events.

Issues with attribution: complexity, confounding variables, and undetected drivers

Attribution encounters different challenges depending on the variables being considered. In the case of hydrological extremes such as floods or droughts, there is substantial complexity in disentangling multiple drivers, which may have a confounding influence. One of the principal obstacles in hydrology is the lack of data on the drivers of nonstationarity at appropriate temporal and spatial scales. Important drivers of hydrological change (such as arterial drainage, land use changes, or adjustments in the conveyance capacity of river channels) may be overlooked if the initial attribution framework is too narrow, i.e. if it does not consider the possibility of multiple plausible drivers. Hydrologists seeking to quantify the influence of land cover signals are faced with a conundrum: how to disentangle multiple driving factors in big datasets that have not been assembled for such a purpose.
There are many national hydrological reference networks (also known as benchmark networks) that are minimally affected by anthropogenic influences and designed to enable detection of global change signals (e.g. Whitfield et al., 2012; Harrigan et al., 2018). These data, however, are not well-suited to detect drivers such as land cover changes: currently no benchmark networks exist for detection and attribution of the effects of land cover changes on streamflow at any scale. Hydrologists seeking to attribute such effects may find spurious relationships, or that the data are affected by other confounding variables such as upstream dams or abstraction. Because of the multiple confounding variables that affect streamflow, a broader attribution framework, with multiple working hypotheses, is needed. Complexity is reduced for attribution of extreme high temperatures because they scale strongly with global mean warming (Buzan and Huber, 2020), particularly over land. Some confounding influences remain even for heatwaves because of their sensitivity to land use change (Miralles et al., 2019). For example, any modification leading to surface drying (e.g. cessation of irrigation) may partition more radiant energy into sensible heat (at the expense of latent heat), thereby amplifying high temperature extremes (Miralles et al., 2014; Peterson et al., 2011). This principle of trading off between sensible and latent heat is often exploited in reverse by urban designers trying to engineer cool cities (Coccolo et al., 2018), but it may confound local-scale attribution studies focusing on temperature extremes. The total sensible and latent heat content of the air is therefore recommended for studies of extreme heat events (Pielke Sr et al., 2004; Matthews, 2020), not least because it is more tightly coupled to physiological heat stress (Matthews, 2018).

Adjusting for nonstationarity in engineering design
It is widely accepted that assuming stationarity in the hydrologic variables used for long-lived engineering designs is no longer tenable (Milly et al., 2008). Differences in opinion persist, however, about how environmental change information can be used in engineering designs. For instance, a decade ago, some claimed that climate models were not yet 'ready for prime time' (Kundzewicz and Stakhiv, 2010). The reasoning was that outputs from climate models and downscaling procedures were too coarse and uncertain for use in site-specific adaptations to infrastructure. This claim was premised, however, on a 'predict-then-act' approach to nonstationarity, where the goal was to characterize, or even constrain, major sources of uncertainty for decision-makers (Clark et al., 2016).
An alternative view is that nonstationary hydroclimatic information can be applied in smarter ways by stress-testing the performance of a design or adaptation decision via risk and reliability metrics (e.g. Brown and Wilby, 2012). To differentiate between the impacts of climate change and natural variability, the degree of "hydrologic stress" can be estimated relative to a "baseline" range of behaviour using climate model ensembles that represent aleatory uncertainty under baseline and stressed conditions. Such stress-testing frameworks favour early engagement with the decision-maker in the design process to identify key system vulnerabilities, performance criteria, and trade-offs between management goals (Poff et al., 2016). Furthermore, so-called 'scenario-neutral' methods can establish when and where national safety margins for nonstationary hydroclimatic conditions might be inadequate (Prudhomme et al., 2010; Broderick et al., 2019). When combined with storylines that describe more elaborate scenarios of change (such as shifts in runoff volume/timing from forest dieback or from snowpack dust in hotter and drier conditions), more holistic capture of nonstationarity in adaptation planning is feasible (e.g. Yates et al., 2015).
A few agencies have issued specific guidance for incorporating climate change into detailed engineering designs. Approaches include applying projected rates of change (e.g. for sea level rise); using changes in primary design variables (e.g. annual daily maximum precipitation amount) from regional climate projections as an adjustment factor (or climate change allowance) to uplift baseline series; or rescaling the parameters of extreme value distributions to reflect multi-decadal climate variability. For a review of these procedures and other robust decision-making approaches in the water sector, see Wasko et al. (2021). Approaches for incorporating uncertainty into flood design include climate factors (i.e. adjusting peak flow estimates); the 'prudent approach' (i.e. selecting a larger return period based on the precautionary principle); and robustness-based decision methods (which may be suited to a large range of plausible futures) (François et al., 2019). Other sectors are recognizing the need to adopt new standards as well, for example, to improve the thermal performance of buildings (e.g. Lomas and Giridharan).
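The climate change allowance approach amounts to a simple multiplicative uplift of the baseline design series, sketched below. The allowance percentage and precipitation values are illustrative; actual allowances are prescribed by the relevant national guidance.

```python
import numpy as np

def apply_allowance(baseline, allowance_pct):
    """Uplift a baseline design series by a climate change allowance (%)."""
    return np.asarray(baseline, dtype=float) * (1.0 + allowance_pct / 100.0)

# e.g. annual maximum daily precipitation (mm/day), with a 20% allowance
amax = np.array([55.0, 72.0, 61.0, 90.0])
design_series = apply_allowance(amax, 20.0)
```

The uplifted series then feeds into the usual frequency analysis, so the allowance propagates directly into design return levels.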

Model-based projections
Climate model projections (such as CMIP) are used to understand how weather and water extremes may develop according to different scenarios. They are typically initialized once with perturbations to the initial conditions and with stochastic physics.
Despite the advantages of climate models for understanding nonstationary extremes, they also have several non-trivial limitations. Their immense computational demand means that ensemble experiments may take years to complete. Their skill depends on the ability to represent climate dynamics accurately, and so most climate models have notable biases when projecting nonstationarities decades ahead. For instance, several CMIP5 climate models with 6-hourly fields exhibit 'implausible' projections (McSweeney et al., 2015). Care should be taken when using climate projections, as their biases (and changes over time of the biases) may considerably distort the multi-model means of water cycle components (Liepert and Lo, 2013). Another issue with using climate models to understand nonstationarity is their tendency to "drift" over time, which means they develop progressive changes beyond natural variability (e.g. Covey et al., 2006; Gupta et al., 2013; Hermanson et al., 2018). This behaviour strongly depends on both the model and climate variable being analysed. For example, precipitation drift contributes to less than 10% of the historical trend in most of the CMIP5 models, whereas drift in steric sea level might contribute up to 30-60% of the historical trend (Gupta et al., 2013). Hydrologists are interested in catchment-scale projections, but climate models have much coarser resolutions, which necessitates downscaling for use in process-based models. Many different statistical downscaling and bias correction methods are employed. For extreme precipitation and flood projections over Europe, see Madsen et al. (2014).
Downscaling has many benefits, but also a range of weaknesses such as uncertainties due to data sparsity, representation of summer/subdaily rainfall, and errors inherited from the driving global climate model (e.g. Maraun et al., 2010; Pielke Sr and Wilby, 2012). Ensemble modelling approaches enhance confidence in our understanding of nonstationarities by better quantifying the uncertainty arising from models versus the uncertainty arising from internal variability. Sample size can be increased by employing many models (such as CMIP); or by using single-model initial-condition large ensembles (SMILEs), which quantify the effect of internal variability (Deser et al., 2020). For instance, Maher et al. (2020) showed that surface temperature trends in near-term (15 to 30 year) projections are dominated by internal variability, using six SMILEs and the CMIP5 output. Post-processing methods such as the Bayesian Ensemble Uncertainty Processor (BEUP) have been developed to quantify the predictive uncertainty associated with both ensemble climate predictions and hydrologic data (Reggiani et al., 2009). In BEUP approaches, ensemble weather predictions are used as inputs to hydrologic models and the resulting ensemble streamflow predictions are then post-processed by the BEUP, delivering a full probability distribution of the expected flow. The climate uncertainty is represented by the prediction ensembles, while the aggregated hydrologic uncertainty is quantified and added through the BEUP post-processing (Han and Coulibaly, 2019).
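Empirical quantile mapping, one of the simplest of the bias correction techniques mentioned above, can be sketched as follows. Real applications handle the distribution tails, seasonality, and trend preservation far more carefully than this minimal version.

```python
import numpy as np

def quantile_map(model_series, obs_ref, model_ref):
    """Empirical quantile mapping: map each model value through the model
    reference quantiles onto the observed reference quantiles."""
    q = np.linspace(0.01, 0.99, 99)
    obs_q = np.quantile(obs_ref, q)
    mod_q = np.quantile(model_ref, q)
    # np.interp clamps values beyond the 1st/99th percentiles to the end
    # quantiles, so extreme tails need special treatment in practice
    return np.interp(model_series, mod_q, obs_q)
```

Because the mapping is built from a reference period, it assumes the model's bias structure is stationary, which is itself a questionable assumption in the present context and a known limitation of quantile mapping.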

Hybrid dynamical-statistical approaches
Hybrid dynamical-statistical (or statistical-dynamical) approaches offer several advantages over purely statistical or purely dynamical methods for predicting and projecting nonstationary extremes. A hybrid model uses climate predictions that are sometimes bias-corrected or merged using techniques such as regression or Bayesian model averaging. Hybrid approaches are particularly useful for hydroclimatic extremes where dynamical models have limited skill (e.g. rainfall predictions). They have been applied to a wide range of weather and water extremes including precipitation (Wang et al., 2012; Fernando et al., 2019), temperature (Strazzo et al., 2019), hurricanes (Kim and Webster, 2010; Vecchi et al., 2011), floods, and droughts (see Hao et al. (2018) for a review). For example, Lee et al. (2020) used projected surface air temperature (SAT), specific humidity and surface pressure from two global and four regional climate models, and two Representative Concentration Pathway (RCP) climate change scenarios as covariates to compute dew-point temperature (DPT). They then employed the SAT and DPT to project nonstationary precipitation peaks over threshold (POT) using a Generalized Pareto distribution. This indirect approach produced more consistent projections of precipitation across a range of climate models than the raw (direct) precipitation projections. Finally, Slater and Villarini (2018) used forecasts of seasonal precipitation and temperature from the NMME alongside antecedent climate conditions, agriculture, and population density, to generate enhanced flood predictions.
In this approach, the regression model relating extremes (e.g. flood indices) to covariates (e.g. precipitation) can be used to develop future predictions/projections of extremes from climate model outputs.
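The core of such a covariate-based approach is a nonstationary extreme-value model whose parameters depend on climate variables. The sketch below fits a Generalized Pareto distribution whose scale parameter varies log-linearly with a temperature covariate, in the spirit of the POT approach of Lee et al. (2020); the data, coefficients, and covariate are synthetic, and this is not the authors' code:

```python
# Sketch: nonstationary peaks-over-threshold model, with the Generalized
# Pareto scale parameter depending log-linearly on a temperature covariate.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import genpareto

rng = np.random.default_rng(0)
n = 500
temp = rng.normal(15, 3, n)                 # covariate (e.g. SAT, degC)
true_scale = np.exp(0.5 + 0.05 * temp)      # scale grows with temperature
excess = genpareto.rvs(c=0.1, scale=true_scale, random_state=rng)

def nll(params):
    """Negative log-likelihood with log-linear scale: sigma = exp(b0 + b1*T)."""
    b0, b1, shape = params
    scale = np.exp(b0 + b1 * temp)
    return -genpareto.logpdf(excess, c=shape, scale=scale).sum()

fit = minimize(nll, x0=[0.0, 0.0, 0.05], method="Nelder-Mead")
b0, b1, shape = fit.x
print(f"fitted covariate slope b1 = {b1:.3f} (true value 0.05)")
```

To project future extremes, the fitted coefficients would simply be evaluated with covariates (here, `temp`) taken from climate model output rather than observations.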
There is scope for developing enhanced statistical-dynamical prediction using sophisticated post-processing techniques such as ensemble model output statistics (EMOS). For instance, Bayesian joint probability (BJP) modelling, an EMOS-type method, employs a joint probability distribution to characterize the relationship between the raw GCM ensemble mean and observations. BJP has been found to outperform traditional quantile mapping approaches in terms of bias, reliability, and coherence (Zhao et al., 2017).
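The joint-probability idea can be illustrated with a deliberately simplified Gaussian version: fit the joint distribution of (raw forecast, observation) from hindcasts, then issue the conditional distribution of the observation given a new forecast. This is only a sketch of the EMOS/BJP concept with synthetic data, not the BJP method itself (which uses transformed variables and Bayesian inference):

```python
# Sketch: joint-probability calibration of a biased raw forecast.
# Fit a bivariate normal to (forecast, obs) hindcast pairs, then use the
# conditional distribution of obs given the forecast as the calibrated output.
import numpy as np

rng = np.random.default_rng(1)
truth = rng.normal(20, 4, 300)                      # "observed" seasonal values
raw_fc = 0.8 * truth + 5 + rng.normal(0, 2, 300)    # biased, noisy raw forecasts

# Moments of the joint (forecast, obs) distribution from hindcasts.
mu_f, mu_o = raw_fc.mean(), truth.mean()
var_f = raw_fc.var(ddof=1)
var_o = truth.var(ddof=1)
cov = np.cov(raw_fc, truth, ddof=1)[0, 1]

def calibrated(fc_value):
    """Conditional normal (mean, sd) of the observation given a raw forecast."""
    mean = mu_o + cov / var_f * (fc_value - mu_f)
    sd = np.sqrt(var_o - cov**2 / var_f)
    return mean, sd

mean, sd = calibrated(25.0)
print(f"calibrated forecast: {mean:.1f} +/- {sd:.1f}")
```

Note that when the forecast carries no information (`cov` near zero), the calibrated output collapses to the observed climatology, a desirable "coherence" property of EMOS-type methods.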

Validity of models for evaluation of nonstationarity
Process-based models are often used in climate science and hydrological impact studies precisely to cope with the uncertainty and nonlinear feedbacks that arise because of nonstationarity, but they are not always able to solve those problems. Incorporating nonstationarity in the calibration of model parameters and physics remains an issue. Models are used in situations where nonstationarity is present, sometimes with the express purpose of understanding the nonstationary process (e.g. Crooks and Kay, 2015). Models that assume stationarity, or that are heavily reliant on calibration (with fixed parameters), are less credible at representing nonstationarity. However, a sharp distinction between statistical models (which assume that data on past system behaviour offer insight into future behaviour) and physically-based models (which incorporate fundamental physical, biological, and chemical principles) is unhelpful. In practice, many physical principles (e.g. conservation of mass, energy, momentum) are encoded in or relied upon by statistical models, while many data-specific parameters are hard-wired into physical models (Hrachowitz and Clark, 2017). As always, the key is to understand all of the steps at which assumptions of stationarity are made, because these determine model realism when applied under nonstationary conditions (Kirchner, 2009).
More generally, models used for attribution and prediction suffer from incomplete representation of all sources of nonstationarity. For instance, hydrodynamic models used for inundation modelling may be inaccurate for projecting flood extents because they treat land surface or river channel properties as fixed, for example by assuming that bankfull flooding occurs on average once every two years (a two-year flood recurrence interval) (e.g. Kettner et al., 2018). One promising approach is to begin to tackle nonstationarity within an Earth System Modelling perspective that includes interactions between the atmosphere, oceans, land, biosphere, and human activities. Such an interdisciplinary approach is expected to accelerate scientific progress (Harrigan et al., 2020). However, obstacles remain in the way of a more holistic framework. For instance, grid resolution will inevitably determine which features can be included (e.g. river catchments) or must be excluded (e.g. field drains).

Conclusions and recommendations
The use of models for evaluating and managing the presence of nonstationarity in weather and water extremes has long been controversial, owing to a lack of clarity about the timescales required for identifying nonstationarity; spurious identification of nonstationarities arising from data inhomogeneity; the mistaking of short-term excursions for long-term change; and/or the application of nonstationary models without prior or complete knowledge of the expected drivers of change and their timescales. Caution is undoubtedly necessary when attempting to detect and project nonstationary behaviour (Serinaldi et al., 2018). Nonstationarity should not be inferred from data alone but should rest on a sound a priori understanding of the physical drivers of change. When there are plausible reasons to suspect nonstationarity, statistical tests can be applied with care. Awareness of the constraints imposed by the models on the extraction and interpretation of information is essential (Von Storch and Zwiers, 2001). This review outlines a 'toolkit' to help guide investigations into the detection, attribution, and management of nonstationary extremes (Figure 2). Recognising nonstationary symptoms (section 2) is essential for developing risk management strategies under global change, including both nature-based and/or engineered solutions, where appropriate. However, methods used for hazard risk management should only factor nonstationarity into their design after thorough exploratory data analysis (section 3), due consideration of the uncertainties inherent in nonstationary techniques (Serinaldi and Kilsby, 2015), and evaluation of the transferability and robustness of model parameters from the calibration period to a nonstationary future (Broderick et al., 2016; Fowler et al., 2016, 2020).
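As an example of the kind of careful exploratory testing advocated above, the nonparametric Mann-Kendall trend test is a standard first check. The sketch below implements it from its textbook definition on synthetic data; note that real applications should also account for serial correlation, which this minimal version omits:

```python
# Sketch: Mann-Kendall test for a monotonic trend (no-ties variance formula).
import numpy as np
from scipy.stats import norm

def mann_kendall(x):
    """Return the Mann-Kendall z statistic and two-sided p-value."""
    n = len(x)
    # S statistic: sum of signs over all pairs (i < j).
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18   # variance of S, assuming no ties
    z = (s - np.sign(s)) / np.sqrt(var_s)    # continuity-corrected z score
    return z, 2 * norm.sf(abs(z))

rng = np.random.default_rng(7)
trended = 0.1 * np.arange(50) + rng.normal(0, 1, 50)  # synthetic trending series
z, p = mann_kendall(trended)
print(f"z = {z:.2f}, p = {p:.4g}")
```

A small p-value here is only a symptom; per the discussion above, it should prompt a search for a physically plausible driver rather than an automatic switch to a nonstationary model.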
When faced with incomplete knowledge of hazard-generating processes, 'multiple working hypotheses' frameworks must be developed to ensure robust detection of drivers (Chamberlin, 1890; Clark et al., 2011; Harrigan et al., 2014) (sections 4-5). Rather than seeking a single explanation for extremes, the precautionary principle requires a more inclusive investigation of possible underlying causes. Systematically perturbing one or more components of a physically-based model is presently the leading approach to control drivers and evaluate system behaviour. However, emerging attribution methods that combine measures of information transfer with dynamical systems approaches have also been proposed to better assess causal co-dependencies (Hall and Perdigão, 2021).
Attribution remains an ongoing challenge (section 6), as the brevity of observational records frequently precludes robust detection of nonstationarities. Pooled approaches, both using model ensemble members (e.g. Thompson et al., 2017; Kelder et al., 2020; Massey et al., 2015) and spatial pooling of observations (e.g. Prosdocimi et al., 2019; Blum et al., 2020), hold considerable promise for enhanced detection and attribution. These methods allow investigation of various types of nonstationarity and their driving processes, including the inter-dependencies between driving variables, by increasing the sample size available to detect significant change. However, they may not overcome all challenges, such as distinguishing between periodicities and temporal trends, if the time window is too short. Over regional scales, large-sample approaches are also one route to understanding the more general properties of environmental hazards and their drivers (e.g. Addor et al., 2020).
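The gain in detection power from spatial pooling can be illustrated with a toy experiment: a weak common trend that is invisible at any single site becomes detectable when standardized records from many sites are pooled. All data below are synthetic, and the regression-on-pooled-records design is a simplification of the cited studies:

```python
# Sketch: spatial pooling to boost sample size for trend detection.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(3)
n_sites, n_years = 20, 40
years = np.arange(n_years)

# Each site: weak common trend, large site-specific interannual noise.
maxima = 0.02 * years + rng.normal(0, 1, (n_sites, n_years))

# Standardize each site's record, then pool all site-years into one regression.
z = (maxima - maxima.mean(axis=1, keepdims=True)) \
    / maxima.std(axis=1, ddof=1, keepdims=True)
pooled = linregress(np.tile(years, n_sites), z.ravel())
single = linregress(years, z[0])

print(f"pooled p = {pooled.pvalue:.3g}, single-site p = {single.pvalue:.3g}")
```

In practice, spatial correlation between nearby sites reduces the effective sample size, so this sketch (which assumes independent site noise) overstates the gain; pooled methods must account for cross-site dependence.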
The final step in the nonstationary workflow is the management of future extremes (section 7). Beyond the consideration of single extremes, there are fundamental issues with the prediction and management of compound extremes. Just as the lack of observations is a key challenge for individual extremes, the difficulty is even greater when it comes to understanding the intersection, sequencing, and joint occurrence of very rare events, which by definition are sparse in the observed record.
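A simple numerical illustration of why joint occurrence is hard to manage: when the drivers of two hazards are correlated (e.g. heavy rainfall and storm surge), their joint exceedance probability can be many times larger than an independence assumption would suggest. The variables and correlation below are synthetic and purely illustrative:

```python
# Sketch: empirical joint exceedance probability of two correlated drivers
# versus the (incorrect) independence assumption.
import numpy as np

rng = np.random.default_rng(11)
n = 200_000
cov = [[1.0, 0.6], [0.6, 1.0]]                  # correlated hazard drivers
rain, surge = rng.multivariate_normal([0.0, 0.0], cov, n).T

thr = 2.0                                       # ~97.7th percentile of each driver
p_rain = (rain > thr).mean()
p_surge = (surge > thr).mean()
p_joint = ((rain > thr) & (surge > thr)).mean()

print(f"independence estimate: {p_rain * p_surge:.5f}, empirical: {p_joint:.5f}")
```

Even this large synthetic sample yields only a few hundred joint exceedances, underlining why observed records are rarely long enough to constrain compound-event probabilities directly.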
Pooled or ensemble approaches are also likely to be an option for tackling some of the major outstanding research directions, such as future climate changes; the effects of land-atmosphere interactions and feedbacks; their impacts; land cover effects; and cascading events (AghaKouchak et al., 2020). However, as climate models become more complex, there is no guarantee that nonstationary signals will emerge more clearly than before. Dynamical changes in atmospheric circulation are highly uncertain, and post-processing of large multi-model ensembles is required to help extract the signal from the noise (e.g. Smith et al., 2020). The 'promise' of reduced uncertainty with better science is also not guaranteed: statistical post-processing and the most advanced techniques may still be inconsequential if a process is poorly captured within the modelled world. Thus, alongside sophisticated detection and attribution techniques, there remains a need for practical tools for managing future change. Methods range from simple look-up tables and factors for engineering design, through storylines for evaluating adaptation options in more holistic ways (e.g. Yates et al., 2015), to user-driven design of environmental decision support systems (EDSS; e.g. McIntosh et al., 2011; Zulkafli et al., 2017). Alongside the more quantitative methods, storylines may allow us to imagine futures with multiple nonstationary drivers, and to frame narratives in terms that are meaningful for decision making (Hazeleger et al., 2015; Shepherd et al., 2018; Sillmann et al., 2021). Developing intelligible ways to translate the science, including "low regret" options for adaptation, remains a key area for research.