Grain size analysis is key to understanding the sediment dynamics of river systems. We propose

Understanding the hydrological and geomorphological processes of rivers is crucial for their sustainable development, for mitigating the risk of extreme flood events, and for preserving biodiversity in aquatic habitats. Grain size data for gravel- and cobble-bed streams are key to advancing the understanding and modeling of such processes

One key indicator for modeling the sediment dynamics of a river system is the

To the best of our knowledge, this includes at least the following German-speaking countries: Switzerland, Germany, and Austria.

This procedure of surface sampling is commonly referred to as pebble counts along transects. An obvious idea to accelerate data acquisition is to estimate grain size distribution from images. So-called

Other researchers have proposed to analyze 3D data acquired with terrestrial or airborne lidar or through photogrammetric stereo matching

While automatic grain size estimation from ground-level images is more efficient than traditional field measurements

Illustration of the two final products generated with

In this paper, we propose a novel approach based on convolutional neural networks (CNNs) that efficiently maps grain size distributions over entire gravel bars, using georeferenced and orthorectified images acquired with a low-cost UAV. This not only allows our generic approach to estimate the full grain size distribution at each location in the orthophoto but also to estimate characteristic grain sizes directly using the same model architecture (Fig.

It is worth noting that the annotation strategy and the CNN are not tightly coupled. Since the CNN is agnostic to the sampling scheme, it could be trained on grain size data created with different sampling strategies to meet other national standards.

We evaluate the performance of our method and its robustness to new, unseen locations with different imaging conditions (e.g., weather, lighting, shadows) and environmental factors (e.g., wet grains, algae cover) through cross-validation on a set of 25 gravel bars

end-to-end estimation of the full grain size distribution at particular locations in the orthophoto, over areas of

robust mapping of grain size distribution over entire gravel bars;

generic approach to map characteristic grain sizes with the same model architecture;

mapping of mean diameters

robust estimation of

In this section, we review related work on automated grain size estimation from images. We refer the reader to

While not clearly explained in

In contrast to previous work, we view the frequency or volume distribution of grain sizes as a probability distribution (of sampling a certain size), and we fit our model by minimizing the discrepancy between the predicted and ground truth distributions. Our method is inspired by

We collected a dataset of 1491 digitized line samples acquired from a total of 25 different gravel bars on six Swiss rivers (see Table

With the exception of location

Overview map with the 25 ground truth locations of the investigated gravel bars in Switzerland.

One example image tile from each of the 25 sites is shown in Fig.

Example image tiles (

We acquired images with an off-the-shelf consumer UAV, namely, the DJI Phantom 4 Pro. Its camera has a 20 megapixel CMOS sensor (

The accuracy of the image scale has a direct effect on the grain size measurement from georeferenced images

We introduce a new annotation strategy (Fig.

Overview of the line sampling procedure.

Our annotation strategy has several advantages. First, digital line sampling is the one-to-one counterpart of the current state-of-the-art field method in the digital domain. Second, the labeling process is more convenient, as it can be carried out remotely and with arbitrary breaks. Third, image-based line sampling is repeatable and reproducible. Multiple experts can label the exact same location, which makes it possible to compute standard deviations and quantify the uncertainty of the ground truth. Finally,

Overview of the ground truth data.

In total,

Many hydrological parameters are continuous by nature and can be estimated via regression. Neural networks are generic machine learning algorithms that can perform both classification and regression. In the following, we discuss details of our methodology for regressing grain size distributions of entire gravel bars from UAV images.

Before feeding image tiles to the CNN, we apply a few standard preprocessing steps. To simplify the implicit encoding of the metric scale into the CNN output, the ground sampling distance (GSD) of the image tiles is unified to 0.25 cm. The expected resolution of a

Finally, following best practice for neural networks, we normalize the intensities of the RGB channels to be standard normal distributed with mean of 0 and standard deviation of 1, which leads to faster convergence of gradient-based optimization
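As an illustration, this per-channel standardization can be sketched in a few lines of NumPy (a minimal sketch, not the paper's implementation; in practice the channel statistics would be precomputed over the whole training set rather than per tile):

```python
import numpy as np

def standardize_channels(tile, mean=None, std=None):
    """Normalize each RGB channel of an (H, W, 3) tile to zero mean and
    unit standard deviation. If mean/std are not given, per-tile
    statistics are used as a stand-in for dataset-wide statistics."""
    tile = tile.astype(np.float64)
    if mean is None:
        mean = tile.mean(axis=(0, 1))
    if std is None:
        std = tile.std(axis=(0, 1))
    return (tile - mean) / std
```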

Our CNN architecture, which we call

A popular technique to improve convergence is

Our proposed

We tested different network depths (i.e., number of blocks/layers) and found the following architecture to work best:

In contrast to

As CNNs are modular learning machines, the same CNN architecture can be used to predict different outputs. As already described, we can predict either discrete (relative) distributions or scalars such as a characteristic grain size. We thus train

relative frequency distribution (

relative volume distribution (

characteristic mean diameter (

Depending on the target type (probability distribution or scalar), we choose a suitable

We use the ADAM optimizer

To enhance the diversity of the training data, many techniques for image data augmentation have been proposed, which simulate natural variations of the data. We employ random horizontal and vertical flipping of the input images. This makes the model more robust and, in particular, avoids overfitting to certain sun angles with their associated shadow directions.
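A minimal sketch of this flip augmentation (illustrative only; the label is left untouched, since grain size statistics are invariant under flips):

```python
import numpy as np

def random_flip(tile, rng=None):
    """Randomly flip an (H, W, C) image tile horizontally and/or
    vertically with probability 0.5 each; the grain size label is
    unchanged by these transformations."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < 0.5:
        tile = tile[:, ::-1]  # horizontal flip (mirror width axis)
    if rng.random() < 0.5:
        tile = tile[::-1, :]  # vertical flip (mirror height axis)
    return tile
```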

Various error metrics exist to compare ground truth distributions to predicted distributions. Here, we focus on three popular and intuitive metrics that perform best for our task: the Earth mover's distance (shortened to EMD; also known as the Wasserstein metric), the Kullback–Leibler divergence (KLD), and the Intersection over Union (IoU; also known as the Jaccard index).

The Earth mover's distance (Eq.

For completeness, we note that there is a smoothed and symmetric (but less popular) variant of KLD, i.e., the Jensen–Shannon divergence.
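For discrete histograms with unit-spaced bins, these metrics reduce to a few lines of NumPy. The sketch below is illustrative rather than the paper's implementation; in 1D, the EMD equals the L1 distance between the cumulative distributions:

```python
import numpy as np

def emd_1d(p, q):
    """Earth mover's distance between two 1-D histograms with
    unit-spaced bins: L1 distance of the cumulative distributions."""
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())

def kld(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q); eps avoids log(0)."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon divergence: the smoothed, symmetric KLD variant."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

def iou(p, q):
    """Histogram intersection over union (Jaccard index)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.minimum(p, q).sum() / np.maximum(p, q).sum())
```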

To optimize and evaluate CNN variants that directly predict scalar values (like, for example, GRAINet, which directly predicts the mean diameter

The trained

To avoid any train–test split bias, we randomly shuffle the full dataset and create 10 disjoint subsets, such that each sample is contained only in a single subset. Each of these subsets is used once as the hold-out test set, while the remaining nine subsets are used for training
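The shuffling and splitting step can be sketched as follows (illustrative only; sample indices stand in for annotated image tiles):

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Randomly shuffle sample indices and split them into k disjoint
    folds; each fold serves once as the hold-out test set while the
    remaining folds are used for training."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]
```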

Whether or not a model is useful in practice strongly depends on its capability to generalize across a wide range of scenes unseen during training. Modern CNNs have millions of parameters, and in combination with their nonlinear properties, these models have high capacity. Thus, if not properly regularized or if trained on a too small dataset, CNNs can memorize spurious correlations specific to the training locations, which would result in poor generalization to unseen data. We are particularly interested in whether the proposed approach can be applied to a new (unseen) gravel bar. In order to validate whether

The predictive accuracy of machine learning models depends on the quality of the labels used for training. In fact, the labeling method itself introduces part of the label noise that degrades model performance. Grain annotation in images is somewhat subjective and thus varies across annotators. The advantage of our

On the one hand, by combining the output of

dense high-resolution maps of the spatial distribution of characteristic grain sizes,

grading curves for entire gravel bars, by averaging the grading curves at individual line samples.
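The second product can be sketched as follows: the per-tile relative distributions are averaged and then accumulated into a single grading curve for the whole bar (an illustrative sketch, not the paper's code):

```python
import numpy as np

def bar_grading_curve(predicted_distributions):
    """Aggregate per-tile relative distributions (rows of an (N, B)
    array that each sum to 1) into one grading curve for an entire
    gravel bar: average the distributions, then accumulate."""
    mean_dist = np.asarray(predicted_distributions, float).mean(axis=0)
    return np.cumsum(mean_dist)
```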

For all experiments, the data are separated into three disjoint sets: a

The initial learning rate is empirically set to 0.0003, and each batch contains eight image tiles, which is the maximum possible within the 8 GB memory limit of our GPU (Nvidia GTX 1080). While we run all experiments for 150 epochs for convenience, the final model weights are not defined by the last epoch but taken from the epoch with the lowest validation loss. An individual experiment takes less than 4 h to train. Due to the extensive cross-validation, we parallelize across multiple GPUs to run the experiments in reasonable time.
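The model selection rule described above, i.e., keeping the weights of the epoch with the lowest validation loss rather than those of the final epoch, amounts to:

```python
def best_epoch(val_losses):
    """Return the index of the epoch with the lowest validation loss;
    the model weights saved at that epoch are used for testing."""
    return min(range(len(val_losses)), key=lambda e: val_losses[e])
```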

Our proposed

relative frequency distribution (

relative volume distribution (

characteristic mean diameter (

In order to get an empirical upper bound for the achievable accuracy, we compare the performance of

We evaluate the quality of the ground truth data in two ways. First, the

Comparison of digital line samples with 22 in situ line samples collected in the field.

From 22 out of the 25 gravel bars, two to three field measurements from experienced experts were available (see Fig.

We compute statistics of three to five repeated annotations of 17 randomly selected image tiles (see Table

Repeated annotations of the same tile by two experts.

Recall that this comparison of multiple annotators is only possible because

As explained in Sect.

Results for

When estimating the relative

The regression performance for the relative

In comparison to the values reported in Table

Looking at the difference between the

Example image tiles where the

Error cases where the

Figure

Comparing the predictions with the ground truth distributions in Fig.

In combination with the smoother output of the CNN, the sharp jump in the

To investigate to what degree the texture features learned by the CNN are interpretable with respect to grain sizes, we visualize the activation maps of the last convolution layer, before the global average pooling, in Fig.

Activation maps after the last convolutional layer for two examples. Each of the 21 maps corresponds to a specific histogram bin of the grain size distribution, where bin 0 corresponds to the smallest and bin 20 to the largest grains. Light colors are low activation, and darker red denotes higher activation.

We compute grading curves from the predicted relative

Grading curves resulting from optimizing different loss functions (from left to right: EMD, KLD, IoU) for two example gravel bars, namely,

Grading curves of the 25 gravel bars, estimated with random 10-fold cross-validation.

This qualitative comparison indicates that regressing the

Characteristic grain sizes can be derived from the predicted distributions, or

We again analyze the effect of different loss functions, namely, the mean squared error (MSE) and the mean absolute error (MAE) when training

Results for

Scatter plots of the estimated

Mean absolute error of the predicted

If our target quantity is the

We conclude that end-to-end regression of

Robust estimates of characteristic grain sizes (e.g.,

Error of the mean

The average standard deviation

Variation of

Maps of the characteristic mean diameter

Obviously, the map created with

Hence,

We study the generalization capability of

Grading curves of the 25 gravel bars, estimated with geographical cross-validation.

Mean absolute error of the

Grading curves for all 25 gravel bars are given in Fig.

We also study the generalization regarding the estimation of the

Example image tile resampled to lower resolutions. Left to right: full resolution of 0.25 cm and tested downsampling factors: 2, 4, 8, 16, 32, and 40, corresponding to 0.5, 1.0, 2.0, 4.0, 8.0, and 10.0 cm, respectively.
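Integer-factor downsampling of this kind can be sketched as block averaging (an assumption for illustration; the actual resampling method used for the study may differ):

```python
import numpy as np

def downsample(tile, factor):
    """Block-average an (H, W, C) tile by an integer factor, e.g.,
    factor 2 turns a 0.25 cm GSD into 0.5 cm. Edge rows/columns that
    do not fill a complete block are dropped."""
    h, w, c = tile.shape
    h, w = h - h % factor, w - w % factor
    blocks = tile[:h, :w].reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))
```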

Performance of

Interestingly, the performance for regressing

We have shown that

Obviously, creating a large, manually labeled training dataset is time-consuming, which is a property our CNN shares with other supervised machine learning methods. However, at test time the proposed approach requires no parameter tuning by the user, which is a considerable advantage for large-scale applications, where traditional image processing pipelines struggle, since they are fairly sensitive to varying imaging conditions. Semiautomatic image labeling with the support of traditional image processing tools

The CNN predictions for a full orthophoto are masked manually to the gravel bars. Our CNN is trained only on gravel images and did not see any purely non-gravel image patches with, e.g., vegetation, sand, or water. Consequently, such inputs lie far outside the training distribution and result in arbitrary predictions that need to be masked out by the user. The network could also be trained to ignore samples with land cover other than gravel, but this is beyond the scope of the present paper. It could be added in the future to further reduce manual work.

We present experiments to evaluate the generalization of our approach to new locations, i.e., unseen gravel bars. In this setup, the data are exploited best, allowing the CNN to learn features invariant to the imaging conditions by providing 24 different training orthophotos in each experiment. This experimental setup is valid for investigating geographical generalization, since there is no strong correlation between bars from the same river. An alternative experiment would be to hold out all bars from a specific river for testing. This might be necessary in some geographical settings with slowly varying river properties, to avoid any misinterpretation and overly optimistic results. We have compared the average performance drop in the generalization experiment for five gravel bars on individual river reaches, i.e., bars that are separated by tributaries with new sediment input, against that for all bars. Both groups yield comparable performance drops. We conclude that, within our dataset, seeing bars from the same river during training does not lead to over-optimistic results (see Fig.

The per bar generalization experiment is furthermore justified by the fact that the characteristics of the investigated gravel bars vary greatly along the same river, both quantitatively (mean

If we were to hold out, say, the whole river Aare, we would not only substantially reduce the number of training samples but also the diversity of imaging conditions. In fact, within our experimental setup we already present one hold-one-river-out experiment for the river Kleine Emme, from which only one bar is included in our dataset. Even though this bar contains the largest number of digital line samples, its estimated grain size distribution fits rather well in the geographical cross-validation experiment (see Fig.

Ultimately, it is important to keep in mind that data-driven approaches, like the one proposed, will only give reasonable estimates if the test data approximately match the training data distribution. These approaches will not perform well for out-of-distribution (OOD) samples. Detecting such OOD samples is an open problem and an active research direction.

While existing statistical approaches are limited to output characteristic grain sizes (

Nevertheless, we present a generic learning approach; i.e., the same architecture can also be trained to directly predict other desired grain size metrics derived from the distribution, such as the mean diameter

A direct quantitative comparison to previous work with different application focus and different data is only possible to a limited extent. For example,

Our CNN-based approach makes it possible to robustly estimate grain size distributions and characteristic mean diameters from raw images. By analyzing global image features,

Advantages are manifold. First, results are objective and reproducible, as they are not influenced by a subjectively chosen sampling location and grain selection. Second, the resulting curves and

Our experiments highlight some limitations due to the limited sample size for training. While 10-fold cross-validation yields very satisfying results, the poorer performance of the geographical cross-validation reveals that collecting and annotating sufficiently large and varied training sets is essential. Unseen local environmental factors, such as wet stones or algae cover, caused performance drops in the generalization experiment. However, if the model has seen a few such samples (random cross-validation), its performance is more robust against such disturbances. Additionally, the performance of

Finally, the best performance has been achieved with high-resolution imagery taken at 10 m flying altitude. At this altitude it takes approximately 15 min to cover an area of 1 ha with a DJI Phantom 4 Pro (which has a maximum flight time of approx. 30 min per battery). It would be advantageous to reduce the flight time per area by flying at higher altitudes. As our resolution study on artificially downsampled images shows, the CNN may yield satisfactory performance on images with 1–2 cm resolution, corresponding to 40–80 m flying altitude. While this is a promising result, it remains to be tested on images taken at such flying altitudes. We expect that retraining the model with high-altitude image–label pairs will lead to similar performance as in the artificial case.

The presented

We have presented

As CNNs are generic machine learning models, they offer great flexibility to directly predict other variables, like, for example, the ratio

Illustration of the convolutional neural network architecture. Left: full architecture where

Overview of the 25 investigated gravel bars. From left to right: slope and bed width of the river at this location with the corresponding annual mean water runoff. Number of annotated image tiles (

Statistics of the mean diameter

Mean error (bias) of the

Resolution study for

Scatter plots of

Tiles grouped by river name. The examples illustrate the variability between different bars along the same river.

Generalization performance per gravel bar. Increase of mean absolute error (MAE) from the random cross-validation to the geographical generalization experiment for the

The code with a demonstration on a subset of the data is available:

NL and AR developed the code and carried out the experiments. AI designed the data acquisition and analyzed the results. KS, JW, and RH provided guidance during project planning and experimentation. All authors contributed to the article, under the lead of NL and AI.

The authors declare that they have no conflict of interest.

We thank Hunziker, Zarn & Partner for sharing the ground truth data for this research project.

This paper was edited by Matjaz Mikos and reviewed by Patrice Carbonneau and two anonymous referees.