One of the main objectives of the scientific enterprise is the development of well-performing yet parsimonious models for all natural phenomena and systems. In the 21st century, scientists usually represent their models, hypotheses, and experimental observations using digital computers. Measuring the performance and parsimony of computer models is therefore a key theoretical and practical challenge for 21st-century science. “Performance” here refers to a model's ability to reduce predictive uncertainty about an object of interest. “Parsimony” (or complexity) comprises two aspects: descriptive complexity – the size of the model itself, which can be measured by the disk space it occupies – and computational complexity – the effort the model requires to provide output. Descriptive complexity is related to inference quality and generality; computational complexity is often a practical and economic concern where computing resources are limited.

In this context, this paper has two distinct but related goals. The first is to propose a practical method of measuring computational complexity using the utility software “Strace”, which counts the total number of memory visits while running a model on a computer. The second is to propose the “bit by bit” method, which combines measuring computational complexity by “Strace” and measuring model performance by information loss relative to observations, both in bit. For demonstration, we apply the “bit by bit” method to watershed models representing a wide diversity of modelling strategies (artificial neural network, auto-regressive, process-based, and others). We demonstrate that computational complexity as measured by “Strace” is sensitive to all aspects of a model, such as the size of the model itself, the input data it reads, its numerical scheme, and its time stepping. We further demonstrate that, for each model, the bit counts for computational complexity exceed those for performance by several orders of magnitude, and that the differences among the models in both computational complexity and performance can be explained by their setup and are in accordance with expectations.

We conclude that measuring computational complexity by “Strace” is practical, and it is also general in the sense that it can be applied to any model that can be run on a digital computer. We further conclude that the “bit by bit” approach is general in the sense that it measures two key aspects of a model in the single unit of bit. We suggest that it can be enhanced by additionally measuring a model's descriptive complexity – also in bit.

One of the main objectives of the scientific enterprise is the development of parsimonious yet well-performing models for all natural phenomena and systems. Such models should produce output in agreement with observations of the related real-world system, i.e. perform well in terms of accuracy and precision and overall “rightness” (Kirchner, 2006). Another key aspect of evaluating such models is their complexity; i.e. they should be brief, elegant, explainable, understandable, communicable, teachable, and small. Mathematical analytical models – e.g. Newton's laws – represent an ideal type of model because they combine performance (high accuracy and precision when compared with experimental observations) with minimal yet adequate complexity (high elegance, brevity, and communicability). Another key aspect of model complexity is how efficiently a model produces its output. This is especially relevant for large models used in operational settings, where computational effort or – closely related – computation times are an issue. In Fig. 1a, these key aspects of model evaluation are referred to as “descriptive complexity”, “computational complexity”, and “performance”. A simple example to illustrate their relation: suppose we want to bake a cake; then the length of the recipe measures its descriptive complexity, the time or effort it takes to actually prepare the cake by following the recipe instructions measures its computational complexity, and the (dis)agreement of our cake with the gold standard cake from the pastry shop measures its performance.

Many approaches exist to guide model development (Fig. 1b), and they differ in how they handle, and how much emphasis they place on, each of the three previously discussed key aspects (see e.g. Schoups et al., 2008). In the following, we briefly describe some of these guidelines to provide the background for the “bit by bit” approach suggested in this paper.

In the framework of algorithmic information theory (AIT) (Kolmogorov, 1968; Solomonoff, 1964, 1978; Chaitin, 1966), the descriptive complexity of a model is measured by its size, expressed in bit, when stored as instructions for a computer. It is therefore a formalization of Occam's razor. The same concept of descriptive complexity can also be applied directly to data: the complexity of data is formalized as its shortest description length, and the best model for the data is that shortest description, i.e. the shortest computer program that has the data as output. It is noteworthy that all these approaches employing Occam's razor place their emphasis on descriptive complexity and performance, completely independent of any practical considerations such as limited storage space or computing power; i.e. they ignore computational complexity. So while Occam's razor promotes models that achieve effective compression of experimental data, compression for the sake of meeting constraints in a storage-limited world is not the primary goal; rather the reverse: finding the shortest description is the process of inference itself, achieved by distilling patterns from data in order to find general predictive laws.
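Descriptive complexity in the AIT sense – the shortest program reproducing the data – is not computable, but any general-purpose compressor yields an upper bound on it. The following is a minimal illustration of this idea, not part of the study's method; both byte series are made up:

```python
import random
import zlib

# Two 4000-byte series: one highly regular, one pseudo-random.
regular = ("0.5;" * 1000).encode()
rng = random.Random(42)
irregular = bytes(rng.randrange(256) for _ in range(4000))

def description_bound_bit(data):
    """Upper bound on descriptive complexity: compressed size in bit."""
    return len(zlib.compress(data, 9)) * 8

# The regular series admits a far shorter description than the
# pseudo-random one, although both have the same raw size.
print(description_bound_bit(regular) < description_bound_bit(irregular))  # -> True
```

The bound is loose – a dedicated model of the data could compress it further – but it makes the link between regularity, compressibility, and description length concrete.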

In summary, both Occam's razor and the AIT-based extension argued for by Weijs and Ruddell (2020) are designed with a focus on inference, i.e. on distilling small and universal laws from experimental data, while the focus of validation set approaches is mainly on performance. In neither of them is the model's effort of actually making its predictions directly considered. This effort, however, can be an important quality of a model in settings where computing resources are limited. In earth science modelling, this is the rule rather than the exception, for the following reasons: (i) the scales of earth systems cannot be separated easily, and in some cases not at all, so even for local questions it may be necessary to simulate large systems at a level of great spatio-temporal detail; (ii) calibration of model parameters from data requires many repeated model runs for parameter identification; (iii) models used in optimal decision making require repeated runs to identify the optimal alternative. The efficiency with which models generate their output is the subject of the discipline of analysis of algorithms (AOA). In AOA, it is referred to as computational complexity and can be measured in terms of two finite resources needed for computation: time and/or space. Time complexity relates to the time a computer needs to arrive at the result; it can be measured in terms of clock cycles or the number of floating-point operations, and often it is the scaling with input size that is of interest. Space complexity relates to the memory used, i.e. the maximum number of binary transistor states needed during the execution of the program. As with descriptive complexity, the reads of this memory can be interpreted as answers to yes/no questions and can be measured in bit.
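As a toy illustration of time complexity (a hypothetical sketch, not the paper's measurement method), one can count the elementary arithmetic operations of an explicit single-linear-reservoir time loop and observe that the count scales linearly with the number of simulated time steps:

```python
def reservoir_ops(n_steps, k=0.1, inflow=1.0):
    """Explicit-Euler single linear reservoir; returns the number of
    elementary arithmetic operations performed over the simulation."""
    storage, ops = 0.0, 0
    for _ in range(n_steps):
        outflow = k * storage                  # 1 multiplication
        storage = storage + inflow - outflow   # 1 addition, 1 subtraction
        ops += 3
    return ops

print(reservoir_ops(100), reservoir_ops(1000))  # -> 300 3000
```

Counting operations abstracts away from hardware, whereas the memory-visit counting used later in this paper captures the full cost of a concrete implementation, including input reading and numerical scheme.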

In the context of the guidelines for model development discussed in the previous section, this paper has two distinct but related goals: the first is to propose a practical method of measuring computational complexity using the utility software “Strace”, which counts the total number of memory visits while running a model on a computer; the second is to propose the “bit by bit” method, which combines measuring computational complexity by “Strace” and measuring model performance by information loss relative to observations, both in bit.

For demonstration, we run hydrological models of various types (artificial neural network, auto-regressive, simple and more advanced process-based, and both approximate and exact restatements of experimental observations) that all aim to perform the same task: predicting discharge at the outlet of a watershed. Akin to Weijs and Ruddell (2020), we examine possible trade-offs between computational complexity and information loss. It is important to note that the purpose of the model comparison here is not primarily to identify the best among the different modelling approaches; rather, it serves as a demonstration of how “Strace” is sensitive to all facets of a model and how differences among the models can be explained by their setup and are in accordance with expectations. In short, the aim is to provide a proof of concept.

The remainder of the paper is structured as follows: in Sect. 2, we describe the real-world system we seek to represent (a small Alpine watershed in western Austria), the range of models we use for demonstration of the “bit by bit” concept, and the implementation environment and criteria for measuring model performance and computational complexity. In Sect. 3, we present and compare the models in terms of these criteria and illuminate differences between descriptive and computational complexity. In Sect. 4, we draw conclusions, discuss the limitations of the approach, and provide directions for future work.

A brief note on the uses of the term “complexity”, in this paper and in the hydrological sciences in general: in this paper, we use it in very specific ways to refer to different characteristics of a model. We have adopted the term “descriptive complexity” from algorithmic information theory to express the parsimony of a model by its size in bit when stored on a computer, and the term “computational complexity” from analysis of algorithms to express the efficiency at which a model generates its output, measured by the number of memory visits during program execution. In the hydrological sciences in general, “complexity” is most often used in the wide sense of its dictionary definition (see “complex” in Merriam-Webster).
Our research contributes to the large existing body of complexity studies in hydrology, and we believe that expressing all key aspects of computer-based models – performance, descriptive complexity, and computational complexity – in the single general unit of bit can facilitate comprehensive model evaluation and optimization.

The real-world system we seek to represent with our models is the Dornbirnerach catchment in western Austria. Upstream of river gauge Hoher Steg (Q_Host), the target of our model predictions, the catchment covers 113 km².

We selected altogether eight modelling approaches with the aim of covering a wide range of model characteristics, such as type (ignorant, perfect, conceptual-hydrological, and data-driven), structure (single and double linear reservoir), numerical scheme (explicit and iterative), and precision (double and integer). The models are listed and described in Table 1; additional information is given in Fig. 2. We trained/calibrated each model on a 5-year calibration period (1 January 1996–31 December 2000) and validated it on a 5-year validation period (1 January 2001–31 December 2005).

Models used in the study and their characteristics.

All models were implemented as Python scripts running on Python 3.6 with the installed packages Numpy, Pandas, Scipy, Keras and H5py. The experiments were done on a computer running Red Hat Enterprise Linux Server release 7.4 on a 16-core Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00 GHz processor.

All models were evaluated in terms of the two criteria described in the introduction: performance, i.e. the model's ability to reduce predictive uncertainty about the target, and computational complexity, i.e. the effort required to make the model generate a prediction about the target. Similar to Weijs and Ruddell (2020), we express both quantities in bit, so as to be able to investigate whether direct comparison, or combining both counts in a single measure, helps interpretation.

As in Weijs and Ruddell (2020), we express model performance in terms of information losses. In information theory, the information content of an event is defined as the negative logarithm of its probability; taking the logarithm to base 2 expresses this information in bit.

As an alternative to measuring information losses of model predictions relative to an upper benchmark – the observations – as described above, it is also possible to measure information gains relative to a lower benchmark – the entropy of a uniform distribution – which expresses minimum prior knowledge. Weijs and Ruddell (2020), whose work we refer to throughout the text, used information losses because they translate directly to a description length. For reasons of comparability, we apply the same concept here.
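The underlying quantity is easy to state: an observation to which a model assigned probability p carries −log₂ p bit of information (surprise). A two-line sketch, included purely for illustration:

```python
import math

def information_bit(p):
    """Information content (surprise) of an outcome with probability p, in bit."""
    return -math.log2(p)

# Halving the assigned probability costs exactly one extra bit.
print(information_bit(0.5), information_bit(0.125))  # -> 1.0 3.0
```

Averaging this surprise over all time steps of a prediction series yields the (conditional) entropy used below as the performance measure.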

To avoid fitting of theoretical functions to the empirical data distributions, we calculated conditional entropy of discrete (binned)
distributions, i.e. normalized empirical histograms. Choice of the binning
scheme has important implications for the values of the information measures derived from the binned distributions: while the lower bound for entropy,

When calculated in the described manner, a lower bound and two upper benchmarks for the values of conditional entropy can be stated: if the model
perfectly predicts the true target value, it will be zero. Non-zero values
of conditional entropy quantify exactly the information lost by using an imperfect prediction. If predictor and target are independent, the conditional entropy will be equal to the unconditional entropy of the
target, which in our case is
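The binned conditional entropy and its two benchmarks described above can be sketched as follows. This is a minimal illustration with made-up data; the study's actual binning scheme may differ:

```python
import numpy as np

def conditional_entropy_bit(target, predictor, bins):
    """H(target | predictor) in bit, from a 2-D binned histogram."""
    joint, _, _ = np.histogram2d(target, predictor, bins=bins)
    p_joint = joint / joint.sum()          # p(d, s)
    p_pred = p_joint.sum(axis=0)           # p(s)
    with np.errstate(divide="ignore", invalid="ignore"):
        p_cond = p_joint / p_pred          # p(d | s), column-wise
        terms = np.where(p_joint > 0, p_joint * np.log2(p_cond), 0.0)
    return -terms.sum() + 0.0              # "+ 0.0" normalizes -0.0 to 0.0

x = np.repeat(np.arange(4.0), 25)          # toy "observed" series, 4 distinct values
print(conditional_entropy_bit(x, x, bins=4))            # perfect model   -> 0.0
print(conditional_entropy_bit(x, np.zeros_like(x), 4))  # uninformative -> H(D) = 2.0
```

A perfect prediction leaves zero remaining uncertainty; an uninformative predictor leaves the full unconditional entropy of the binned target, here log₂ 4 = 2 bit.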

We quantify computational complexity by the total number of memory read and write visits (in bit) on a computer while running the model. In the context of information theory, these bit counts and the bits measuring model performance by conditional entropy in the previous section can both be interpreted in the same manner as a number of binary yes/no questions that were either already asked and answered during the model run (in the former case) or still need to be asked (in the latter case) in order to fully reproduce the data.

Counting memory visits while running a computer program can be conveniently done with “Strace”, a troubleshooting and monitoring utility for Linux.

To evaluate the reproducibility of the counts, we repeated each model run 100 times, clearing the memory cache between individual runs. As the counts were in fact very close, we simply took their average as a single value representing model computational complexity. The main steps of applying “Strace” in our work were as follows.

We traced the read() and write() system calls of the models while executing their code in Python and wrote them into a target log file running the following command in the Linux command line: “strace -o target.log -e trace

After generating the target log file, we calculated the sum of all read operations from the target log file running the following command: “cat target.log
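Since the exact commands above are abbreviated, the aggregation step can be sketched as follows: an strace log records each read()/write() system call together with the number of bytes it returned, which can be summed and converted to bit. The log excerpt below is hypothetical, and the byte-to-bit conversion (×8) is our reading of the counting convention:

```python
import re

# Hypothetical excerpt of an strace log; strace prints each system call
# with its return value (bytes transferred) after " = ".
LOG = """\
read(3, "1996-01-01\\t2.4\\n", 4096) = 15
write(1, "q_sim = 0.74\\n", 13) = 13
read(3, "", 4096) = 0
"""

def total_bit(log_text):
    """Sum the bytes returned by read()/write() calls and convert to bit."""
    total_bytes = 0
    for line in log_text.splitlines():
        m = re.match(r"(?:read|write)\(.*\)\s*=\s*(\d+)$", line)
        if m:
            total_bytes += int(m.group(1))
    return total_bytes * 8

print(total_bit(LOG))  # (15 + 13 + 0) bytes -> 224 bit
```

In the study itself, this summation was done directly on the Linux command line rather than in Python; the sketch only makes the counting logic explicit.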

As stated previously, it is

Before discussing the model results for the six use cases in terms of performance and computational complexity, we first provide a short, exemplary visualization of the model predictions to illustrate their general behaviour. In Fig. 3, observed precipitation at Ebnit (Fig. 3a) and observed and simulated discharge time series (Fig. 3b) of all models at gauge Hoher Steg are shown for a rainfall-runoff event in June 2002, which lies within the validation period. The observed hydrograph (bold blue) shows a flood peak of 71 m³ s⁻¹.

Here we discuss the model results in terms of model performance and model
computational complexity for six use cases. Model performance is expressed
as the remaining uncertainty, at each time step, about the observed data D
given the related model simulation

Model performance, expressed by its inverse (information loss per time step, measured by conditional entropy in bit), vs. model computational complexity, measured by the average number of memory visits per time step in bit, for Model-00 to Model-08.

In the lower left corner of Fig. 4, a black square indicates a loose upper bound of the descriptive complexity of a single recording of our target discharge series Q_Host. The value (18.8 bit) was calculated by simply dividing the size of the Q_Host validation data set by the number of time steps. This represents the raw size of a single data point in the series, without any compression, and if we wish we can compare it to the computational effort of

We started this paper by stating that one of the main objectives of the scientific enterprise is the development of well-performing yet parsimonious models for natural phenomena and systems, that models nowadays are mainly computer models, and that three key aspects for evaluating such models are descriptive complexity, computational complexity, and performance. We continued by describing several paradigms to guide model development: Occam's razor puts an emphasis on descriptive complexity and is often combined with performance considerations, but it ignores computational complexity; Weijs and Ruddell (2020) express both model performance and descriptive complexity in bit and, by adding the two, obtain a single measure for what they call “strong parsimony”; validation set approaches focus on performance and promote general and parsimonious models only indirectly, by evaluating models on data not seen during calibration. None of these approaches directly incorporates computational complexity. We suggested closing this gap with “Strace”, a troubleshooting and monitoring utility, which measures computational complexity by the total number of memory visits while running a model on a computer. We further proposed the “bit by bit” method, which combines measuring computational complexity by “Strace” and measuring model performance by information loss relative to observations, all in bit, akin to Weijs and Ruddell (2020).

For a proof of concept, we applied the “bit by bit” method in combination with a validation set approach – to also consider descriptive complexity, if only indirectly – to a range of watershed models (artificial neural network, auto-regressive, and simple and advanced process-based models with various numerical schemes). Of the tested models, a third-order auto-regressive model provided the best trade-off between computational complexity and performance, while the LSTM and a conceptual model operating at high temporal resolution showed very high computational complexity. For all models, computational complexity (in bit) exceeded the missing information (in bit) expressing model performance by about 3 orders of magnitude. We also compared a simple upper bound of the descriptive complexity of the target data set to model computational complexity: the latter exceeded the former by about 2 orders of magnitude. Apart from these specific results, the main take-home messages from this proof-of-concept application are that (i) measuring computational complexity by “Strace” is general in the sense that it can be applied to any model that can be run on a digital computer; (ii) “Strace” is sensitive to all aspects of a model, such as the size of the model itself, the input data it reads, its numerical scheme, and its time stepping; and (iii) the “bit by bit” approach is general in the sense that it measures two key aspects of a model in the single unit of bit, such that they can be used together to guide model analysis and optimization in a Pareto trade-off manner in the general setting of incremental learning. The approach can be especially useful in operational settings where the speed of information processing is a bottleneck. Unlike approaches that estimate computational complexity via model execution time, the bit counting by “Strace” is unaffected by other processes on the computer competing for CPU time, which increases the reproducibility and unambiguity of the results.
The “bit by bit” approach can help promote better model code in two ways: computational complexity is sensitive to poor (inefficient) coding and performance is sensitive to wrong (erroneous) coding. This is relevant as computer models in the earth sciences have grown increasingly complex in recent years, and efficient, modular, and error-free code is a precondition for further progress (Hutton et al., 2016).

During the development of this paper, we encountered several interesting – and still open – questions. The first concerns where to set the system boundaries. For example, should forcing data – which are often key drivers of model performance – be considered part of the model and hence be included in the counting, or not? If we consider a model that performs well even with limited input data to be more parsimonious than another that relies heavily on information contained in the input, we should include them. However, one could also argue that the input is not part of the model and should therefore be excluded from the counting. The same question applies to the extent to which the computational setting on the computer should be included in the counting; this remains open for debate. We also still struggle to provide a rigorous description of the nature and strength of the relation between descriptive and computational complexity. Clearly they describe two distinctly different characteristics of a model, but they are also related, as “Strace” counts both the size of a program and the computational effort of running it. Both the descriptive complexity and the performance of a model are typically orders of magnitude smaller than its computational complexity, which renders a simple additive combination of the three into a single, overall measure of modelling quality impractical. Nevertheless, we suggest that combining the approach of Weijs and Ruddell (2020) with measuring computational complexity by “Strace” will be worth exploring in the future. It potentially offers a comprehensive and multi-faceted way of model evaluation applicable across the earth sciences, in which all key aspects of a model are expressed in a single unit: bit.

The code used to conduct all analyses in this paper is publicly available at

All data used to conduct the analyses in this paper and the result files are publicly available at

EA wrote all Python scripts and code related to Strace and conducted all model runs. UE designed the study and wrote all Matlab scripts. EA, UE, SVW, BLR and RAPP wrote the manuscript together.

The authors declare that they have no conflict of interest.

We thank Jörg Meyer from KIT-SCC for helpful discussions about the best way to bit-count models, Markus Götz from KIT-SCC for discussions about the LSTM model, and Clemens Mathis from Wasserwirtschaft Vorarlberg, Austria, for providing the case study data. We gratefully acknowledge support by the Deutsche Forschungsgemeinschaft (DFG) and the Open Access Publishing Fund of the Karlsruhe Institute of Technology (KIT). RP also acknowledges FCT under references UIDB/00329/2020 and UID/EEA/50008/2019. Benjamin L. Ruddell acknowledges Northern Arizona University for providing start-up grants used in part for this work. We thank Elena Toth and an anonymous referee for their detailed comments, which helped improve the clarity of the manuscript.

The article processing charges for this open-access publication were covered by a Research Centre of the Helmholtz Association.

This paper was edited by Roberto Greco and reviewed by Elena Toth and one anonymous referee.