Upload
monica-macias
View
218
Download
0
Embed Size (px)
Citation preview
8/3/2019 Stat Mod 1011
1/67
Brandenburg Universityof Technology at Cottbus
Dept. of Ecosystems andEnvironmental Informatics
Statistical Modelling
Univ.-Prof. Dr. habil. Albrecht Gnauck
International Master Course of StudyHydroinformatics EuroAquae
Winter term 2010/2011
8/3/2019 Stat Mod 1011
2/67
Contents
1. Events and data1.1 Analysis and control of aquatic ecosystems1.2 Statistical management of ecological data
1.3 Sampling strategies1.4 Re-sampling and pre-treatment of data
2. Probability functions and statistical measures2.1 Probability functions of ecological data2.2 Normal and skewed probability distribution functions2.3 Comparison of expectations2.4 Statistical measures
3. Statistical test procedures3.1 Introduction
3.2 General procedure of hypothesis testing3.3 Rules of decision3.4 Selected test procedures
4. Linear regression and correlation analysis4.1 Steps of linear regression4.2 Confidence region of regression line4.3 The power of linear regression4.4 Empirical covariance and statistical measures of correlation
5. Nonlinear regression analysis5.1 Polynomial regression5.2 Periodic regression5.3 Trend functions5.4 Comparison of regression functions
6. Time series analysis6.1 Dynamic behaviour of time series6.2 Description of time series in the time and frequency domain6.3 Stationary processes6.4 Correlation and spectral functions
7. Analysis of cycling processes7.1 Introduction7.2 Fourier analysis7.3 Digital data filter7.4 Wavelets
Literature
2
8/3/2019 Stat Mod 1011
3/67
1. Events and data
Statistical modellingof hydrological systems is an important task to extract
information from former and actual states of aquatic ecosystems (aquifers,
freshwater ecosystems, marine ecosystems) by means of water quantity data
and water quality data. Holism and reductionism are the two different ap-
proaches to study and model ecological processes and systems. Both ap-
proaches are needed for ecosystems modelling, simulation and management.
Holism
Aquatic ecosystems are complex systems with nonlinear interrelation-
ships. Holism attempts to reveal the properties of ecosystems by studying the
system as a whole. The system properties cannot be found by a study of thesystem components separately. It is required that the study be on the system
level. This does imply that a study of the ecosystem components is not suffi-
cient. The components of ecosystems are coordinated to such an extent that
ecosystems work as indivisible unities. A study of ecosystem components level
will never reveal the ecosystem properties.
Reductionism
To simplify the ecosystem study and to facilitate the interpretation of ecologicalprocesses the ecosystem components are separated from the system
level. This method is useful to find governing relationships in real systems. This
method has obvious shortcomings when the functioning of the entire ecosystem
is to be revealed. As an example: A forest is more than the sum of all trees.
The analysis and control of dynamic aquatic ecosystems such as ponds,
lakes, reservoirs and river basins is often a complicated task because of the
high number of system elements (or components) and interrelationships be-
tween system elements and between system elements and their environments.
To solve management problems the system has to be decomposed and nonlin-
ear interrelationships have to be linearised. Furthermore, the controllability of
aquatic ecosystems has to refer to different and parallel working subsystems
and system states. The quality of aquatic ecosystem analysis depends on the
flexibility of statistical models used. The restricted information structure of com-
plex aquatic ecosystems and aggregation of information lead to uncertainties ofthe modelling process and of the resulting models. Dynamic processes within
3
8/3/2019 Stat Mod 1011
4/67
aquatic ecosystems are initiated by switching of input and state variables. They
result in rapid changes of system states and output variables (non-autonomous
control) or in low changes (autonomous control). In general, complex dynamic
systems like hydrological systems (aquatic ecosystems) are characterised by
three features (table 1).
Table 1: General characterisation of complex dynamic systems
Feature Solving procedure
High dimension Decomposition of system
Uncertainty Analysis of dynamic characteristics(observability, controllability, pertur-bability, reachability, robustness, sta-bility, sensitivity)
Restricted information structure Aggregation of information
Statistical modellingof hydrological systems is based on data. They are ob-
servations about characteristics and/or attributes of hydrological input, state and
output variables. A group of state variables under study is called a (statistical)
population (e. g. data of water flow, salinity data of river water, BOD data of a
waste water treatment plant). If the frequency distribution of the attributes of a
population is known, then it is possible to describe it by a probability densityfunction or probability distribution function, which is an analytical function de-
fined by a number of parameters.
For the study of aquatic ecosystems a subset of the population or a sample is
used. A population is denoted as univariate if only one variable (or water qua-
lity indicator) is considered. Common univariate measures are averages as
measures of location of centres of data clouds along an axis, and measures of
their dispersion as variance or spanning width. If more than one variable (or
indicator) is considered the population is denoted as a multivariate one.
Regression and correlation analysis belong to experimental statistical
modelling of hydrological systems which is based on methods of the theory
of probability. To solve practical problems such approaches are necessary
which are compatible with the stochastic nature of the input variables and state
equations. Statistical procedures will be the adequate mathematical methods as
long as the processes within the systems and their describing equations are
4
8/3/2019 Stat Mod 1011
5/67
unknown. A distinction is made between two groups of methods depending on
whether the variable time is included or not: Static methods (without consid-
eration of time as variable) and dynamic methods (with consideration of the
variable time). The latter one is often called time series analysis or dynamic sta-
tistics. Simple and multiple linear and non-linear regression and correlation be-
long to static methods as well as multivariate statistical procedures. Static pro-
cedures answer the question whether there is a relationship between two or
more variables of an environmental system. This question can be answered by
a regression analysis which gives out the type of relationship between vari-
ables.
Statistical modelling is done for different purposes. Administrations as well asindustrial and agricultural companies use statistical data and results to plan their
operations and economic developments. Researchers use statistics mainly as a
first step to derive new scientific results. Therefore, the topics of statistical
modelling can be formulated by:
1. Data sampling (Methods: Sampling design, re-sampling, plausibility
checks, outlier correction).
2. Data analysis to fulfil the requirements of environmental administra-tions and associations (Methods: Descriptive statistics, frequency distri-
butions, averages, variances, error correction, significance tests).
3. Data analysis to fulfil the requirements of different professional users
(e.g. industry, agriculture, forestry) (Methods: Explanatory statistics, mul-
tivariate statistics, geostatistics, time series analysis).
4. Basic research (Methods: Regression and correlation analysis, multi-
variate statistics, advanced statistical techniques, digital data filtering,
frequency analysis).
Disturbances of statistical analysis of hydrological dataare given by:
1. Mostly, only small sets of data of representative regularly sampled data
are available.
2. The power of natural and artificial (man-made) external as well as natural
internal driving forces on hydrological indicators influence the quality
ofdata to be obtained.
5
8/3/2019 Stat Mod 1011
6/67
3. Mostly, the a-priori process information on water quality indicators is
low.
4. Hydrologic processes possess different rate constants.
5. Cycling effects in hydrological data are induced by natural internal or
external as well as by man-made external processes.
Classification of hydrological data
Hydrological data may be classified by their origin:
1. Measured and/or observed data of hydrological indicators will be ob-
tained by field samples and/or laboratory experiments. They are directly
observed (direct observations) or indirectly observed (due to calibration
of analytical instruments or sensors).2. Summary data will be derived from statistics or from restricted observ-
able ecological, respective water quality indicators.
3. Simulated data will be obtained by simulation models.
1.1 Analysis and control of aquatic ecosystems
An aquatic ecosystem is a biotic and functional system or unit, which is able to
sustain life and includes all biological and non-biological variables in that unit.
Spatial and temporal scales are not specified a-priori, but are entirely based
upon the objectives of the ecosystem study. Ecosystems are often called com-
plex systems.
Several approaches exist to study the behaviour of ecosystems.
Empirical studies collect bits of information. An attempt is made to integrate
and assemble the studies into a complete picture.
Comparative studies are presented to compare some structural and functionalcomponents for a range of ecosystem types.
Experimental studies where manipulations of a whole ecosystem are used to
identify and elucidate ecological mechanisms.
Modelling and computer simulation studies to work out ecosystem man-
agement plans and to derive eco-technological tools for goal oriented control
actions.
6
8/3/2019 Stat Mod 1011
7/67
Information systems and decision support systems studies to support in-
dustrial, agricultural and administrative ecological decisions and to work out
medium-term and long-term development plans for ecological management.
Like many words for which people have an intuitive understanding, a system isdifficult to define precisely. In relation to the physical and biological sciences, a
system is an organised collection of interrelated physical components charac-
terised by a boundary and functional unity. A system is a collection of communi-
cating materials and processes that together perform some set of functions. A
system is an interlocking complex of processes characterised by many recipro-
cal cause-effect pathways.
A system is a set of interrelated objects (elements, parts) that have certain gen-
eral properties:
1. It fulfils a certain function, i.e. it can be defined by a system purpose recog-
nisable by an observer.
2. It has a characteristic constellation of essential system elements and an es-
sential system structure which determine its function, purpose, and identity.
3. It loses its identity if it is destroyed.
Analysis and control of aquatic ecosystems are often complicated because
of the high number of system elements and interrelationships between sys-
tem elements and between an ecosystem and its environment. Mostly, an eco-
system will be analysed as one unit. Dynamic processes within ecosystems are
initiated by switching processes of input and state variables with different trans-
fer time constants (fig. 1). If they are overlaid by external and internal distur-
bances it can not be distinguished which part of ecosystem response and its
intensity stem from a single ecological element.
For ecosystem analysis, the complex structure of an ecosystem requires its de-
composition and linearization of nonlinear interrelationships. The controllability
of ecosystems has to refer to different working elements (or subsystems) and
system states. Therefore, the whole ecosystem will be divided into several sub-
systems with internal and external feedbacks. This leads to uncertain state-
ments on the ecosystem behaviour. The quality of ecosystem analysis depends
7
8/3/2019 Stat Mod 1011
8/67
on the flexibility of mathematical models used for computation. Restricted infor-
mation structure and aggregation of information lead to model errors.
Figure 1: Switching processes within a freshwater ecosystem
Ecosystems are multidimensional systems with several input and output
variables. They can be seen as black box, grey box or white box systems. In
dependence of the numbers of input and output variables SIMO-, MIMO-, SISO-
and MISO-systems will be distinguished.
Ecosystems can be considered as stochastic transfer systems described
by its state variables and parameters. They are characterised by measurable
inputs, immeasurable (stochastic) disturbances as well as by measurement er-
rors. In the case of real systems, disturbances, input signals and measurement
errors will be overlaid and produce disturbed (and unsure) output signals.
Transfer functions are represented by
1. Pulse functionx(t) = 0 for t < 0 and t > T,x(t) =x0 for 0 t T,
2. Jump function:x(t) =x0(t) with (t) = 0 for t < 0 and (t) = 1 for t 0,
3. Harmonic function: x(t) = x0 + cos(t+) for - < t < + orx(t) = x0ej(t+) =x0
+e
jt withx0+=x0 e
j,
8
8/3/2019 Stat Mod 1011
9/67
4. White noise function.
Other transfer functions are
1. Exponential function:x(t) =x0 e-t/T
for 0 t < + orx(t) =x0+
ejt e
tfor
0 t < + and 0,2. Periodic function: x(t) = a0/2 + i aicos(i0t) + i bisin(i0t) or
x(t) = i ciej(i0t),
3. Dirac impulse:x(t) = 0 for t < 0 and t > T,x(t) = (t) with (t) = 0 for t 0
and (t)dt = 1,
4. Ramp function:x(t) = 0 for t < 0 andx(t) = at for t 0 or
5. Time discrete signal: x~(t) = kx(kT)(t-kT) with k = 0, 1, 2,
and T 1/(2fmax) where fmax is the maximum frequency contained in the
data serie.
Feedback structures (or couplings) within ecosystems are given by simple
feed-forward, feed-back self-tuning or complicated couplings between the eco-
system elements.
1.2 Statistical management of ecological data
To handle and investigate hydrological data with sense they should be charac-
terised by some relationships. The increase of information content of hydrologi-
cal data analysis is expressed by the number of data operations. Four scales
can be distinguished (fig. 2).
Increaseofinformationcontent
Ratio Scale
Interval Scale
Ordinal Scale
Nominal Scale
Figure 2: Data scales in hydrological research
9
8/3/2019 Stat Mod 1011
10/67
Transformations from one data scale to another serve as unificators of vari-
ables (tab. 2). The information content (knowledge, antithesis of uncertainty)
and the scale level should not be changed during sampling and/or statistical
data analysis. If there is no empiric equivalence scale, then the data are valu-
ated as comparable.
Table 2: Comparison of data scales in hydrology
Scale Arithmetic operation Statistical measure
Ratio Scale +, -, , / Geometric mean
Interval Scale +, - Arithmetic mean
Ordinal Scale none Median, Quartiles
Nominal Scale none Frequencies only
Nominal scale:
No relationship between events, sometimes they are coded by numbers (e. g.
lottery, pie charts), no arithmetic operation possible.
Ordinal scale:
Ranking of events or representations, classification of environmental indicators
(e. g. EU water quality classes, soil classes etc.), ordinal comparisons are pos-
sible: Class I > Class II, estimation of median and quartiles.
Interval scale:
Ordinal scale with equal intervals (e. g. water temperature), statements on dis-
tances and differences between data are allowable. No natural origin (Zero
point) exists.
Ratio scale:
It is an interval scale with a natural origin and allows statements on ratios (e.
g. concentrations).
One of the most important characteristic of hydrological data is its uncertainty
which can be characterised as a state or condition of incomplete or unreliable
knowledge. Sources of uncertainty are characterised by
1. Statistical analysis depends on the a-priori information of essential hy-
drological variables considered.
2. Hydrological variables and theirrates of changes have different scales
in time and space.
10
8/3/2019 Stat Mod 1011
11/67
3. Mostly, a small set of representative data will be available.
The strength of disturbances of the data observed leads to fuzzy effects of in-
terpretations.
Figure 3 shows different types of annual water quality data series which can
be distinguished by their statistical measures:
0
10
20
30
TW(C)
0
5
10
NH4(mg/l)
0.6
0.8
1
Lf(mS/cm)
7
8
9
pH-value
0
0.5
1
NO2(mg/l)
0
0.5
1
1.5
o-PO4-P(mg/l)
J FM AM J J A SO ND0
5
10
15
time (month)
O2(mg/l)
J F M AM J J A SO ND0
5
10
15
time (month)
NO
3(mg/l)
J F M AM J J A SO ND0
20
40
time (month)
DO
C(mg/l)
Figure 3: Data series of water quality samples
The quality and usability of hydrological data are usually highly depending on
the suitability of the sample and the adequacy of the sampling or monitoring
program. The goal of sampling is to get information about the frequency distri-
butions of data indicating environmental states or about the distribution parame-
ters. These estimates are called sample statistics and form a base to give
prognoses on environmental developments in general, but also on hydrological
changes. If an investigation is based on samples then sampling statistics de-
pends on the particular sampling environment, on stationary or instationary
external or internal effects as well as on random influences. Sampling fre-
quency depends on hydrologic process dynamics, on the degree of water pollu-
tion, on the type of pollution, and on the type of substance. Different results may
11
8/3/2019 Stat Mod 1011
12/67
be obtained if different samples are selected. This variation in the data from
sample to sample is called sampling variability. The difference between a sta-
tistic and the true population value is called sampling error. It increases if more
random factors influence the sampling procedure. There is a margin of uncer-
tainty expressed in terms of the sampling variance of the estimator. Sampling
variance is a measure of the precision of the estimates.
Comparison of hydrological data series:
1. Average is time dependent, dispersion is approximately time constant.
2. Average is approximately time constant, dispersion is time dependent.
3. Average and dispersion are time dependent.
Variability within data series is caused by:
1. Environmental influences or factors,
2. Intrinsic factors between water samples,
3. Different sample treatment,
4. Different data treatment.
1.3 Sampling strategies
Ecological data are obtained by field samples and/or laboratory analysis.
They are directly observed (direct observations) or indirectly observed (due to
calibration of analytical instruments and sensors). Summary data are derived
from statistics or by restricted observable indicators. Simulated data are ob-
tained by simulation models.
Sampling design is based on different procedures. The most common used
designs are
1. Systematic (periodically) sampling (yearly, monthly, weekly, daily, andhourly).
2. Sampling based on the level ofadmissible fault of the annual mean.
3. Random sampling.
4. Sample size for normal distributed data without trend and peri-
odicities:
n = ((t(95)v)/e(x*))2 with t(95) = 1.96, v = x*/s100 and e(x*) = 10% allow-
able deviation from mean.
12
8/3/2019 Stat Mod 1011
13/67
The sampling location in space and time can have a very real effect on the qual-
ity and usefulness of data in hydrology. Site selection should be made primarily
on the basis of the goal of the study as well as on the nature of the hydrologic
process or phenomenon under consideration. Optimum number of samples,
frequency of sampling and spacing can be estimated either by preliminary sam-
pling experiments, by conclusions from expert knowledge, by practical experi-
ences, or by statistical sampling design formulas and methods. Geostatistical
methods can be helpful to determine optimal space distribution of sampling
points.
The sampling procedure covers three parts.
1. Hypothesis (program purpose, sampling design, formulation of ques-tions),
2. Observation (sampling techniques, sampling protocol, analytical tech-
niques),
3. Interpretation (data analysis, interpretation of results)
Recommendations for hydrological sampling:
1. The goals and needs for hydrological data collection should be formu-
lated explicitly for each application before sampling is started.2. Priorknowledge of factors that affect hydrological variables to be sam-
pled should be given.
3. During sampling significant changes of external and internal driving
forces should not take place.
4. Existing estimates may be sufficient if they were obtained by an unbi-
ased sampling design.
5. Sampling design in hydrology should cover the water budget (surface
and groundwater), hydrochemical variables (organic and inorganic sub-
stances, metabolites), hydrophysical variables (considering internal and
external driving forces), hydrobiological variables (life cycle of plants and
organisms, conversion of organic and inorganic substances), microbi-
ological variables, and other variables as required.
Disturbances of data analysis:
Only small sets of representative regular sampled data are avail-able.
13
8/3/2019 Stat Mod 1011
14/67
The power of external and internal driving forces on water qual-
ity (hydrological) indicators influences the quality of data to be ob-
tained.
The a-priori process information on water quality (hydrological)
indicators is low.
Water quality (hydrologic) processes possess different rate con-
stants.
1.4 Re-sampling and pre-treatment of data
Series of measurements of hydrological data are time series of data recorded at
discrete points in time often with unequal sampling intervals. In practice, they
often contain missing data or they are based on different sampling intervals in
time and space. To extract hydrologic process information from single data
(events) the data series should be completed and based on a regular sampling
grid. The application of static and dynamic statistical methods for analysing
such data sets requires equidistant data. Re-sampling generally means data
interpolation or, in the case of noisy information, dataapproximation. Figure
4 gives an overview on these procedures.
R a w h y d r o lo g i c a l d a ta
In te rpo la t i on
E q u i d is t a n t d a t a
A p p ro x im a tio n D ig ita l d a ta f i lte r in g
S t a t ic D y n a m ic
F u n c t io n a l r e l a t io n s h i p
H i g h p a s sL o w p a s s
C o n s i s t e n t d a t a
Figure 4: Interpolation, approximation and digital filtering of data
14
8/3/2019 Stat Mod 1011
15/67
The goal of the application of interpolation and approximation methods onto
incomplete time series is to fill the intervals between two grid points so that se-
ries of measurements with small unique sampling intervals are kept. Table 3
contains some commonly used interpolation methods.
Table 3: Interpolation methods
Method Algorithm Characteristics
+
+ 3,x*, s.
The test statistic: r= (|(x+
-x*)|/s)n/(n-1),wherex+ is to be expected as an outlier,x* is the expectation of the sample, s is
the standard deviation of the sample, and n sample size. Choice of signifi-
cance level = 0.05, degrees of freedom f= n 2.
Decision: Acceptance ifrcalc< rtab, otherwise rejection (cf. table 10).
Table 10: Table of r test (according to Kaiser and Gottschalk 1974)
f = n - 1 P(95) P(99) P(99,9)
1 1,409 1,414 1,4142 1,645 1,715 1,7303 1,757 1,918 1,9824 1,814 2,051 2,1785 1,848 2,142 2,3296 1,870 2,208 2,4477 1,885 2,256 2,5408 1,895 2,294 2,6169 1,903 2,324 2,67810 1,910 2,348 2,73012 1,920 2,385 2,81214 1,926 2,412 2,87416 1,931 2,432 2,29118 1,935 2,447 2,20520 1,937 2, 460 2,99050 1,951 2,529 3,166100 1,956 2,553 3,227200 1,958 2,564 3,265300 1,958 2,566 3,271500 1,959 2,570 3,279700 1,959 2,572 3,283
1,960 2,576 3,291
29
8/3/2019 Stat Mod 1011
30/67
Example:
From laboratory analysis of water quality exist a small data set of BOD data with
x1 = 30,4 mg/l,x2 = 30,1 mg/l,x3 = 30,5 mg/l,x4 = 30,9 mg/l,x5 = 29,2 mg/l. The
last value is expected to be an outlier. That would mean the data set is inhomo-
geneous.
x* = 30,2; s = 0,638 ; n = 5
Test statistic: r= (|(29,2 30,2)|/0,638)5/(5-1) = (1,0/0,638)5/4= 1,5671,118
= 1,752
Comparison: rcalc and rtab for f= n 2 = 3: rcalc = 1,752, r(95) = 1,757; r(99) =
1,918; r(99,9) = 1,982.
Decision: Ifrcalc< rtab, then acceptx5: 1,752 < 1,757.
Result and interpretation: The value x5 is not an outlier and belongs to the data
set. The data set itself seems to be homogeneous. In the case that a value has
been found as an outlier the average and variance have to be re-calculated and
tested again.
30
8/3/2019 Stat Mod 1011
31/67
4. Linear regression and correlation analysis
A regression analysis is required for problems in which stochastic dependen-
cies (stochastic cause-effect relationships) have to be described by functions
with one or more several variables. Linear regression analysis is one of the
best studied statistical methods. Goal of a simple or multiple linear regression
analysis is the determination of a linear relationship between two or more
measurable (or observable) variables or characteristicsXand Yof a hydrologi-
cal system. The measurement values of size n consist of n pairs of data (x1, y1),
(x2, y2),, (xn, yn) (orn-tupels of data) which can be considered as realisations
of a two-dimensional (or n-dimensional) random vector (X, Y).
4.1 Steps of linear regression
1. Step: Scatter-plot of variables of interest (fig. 10).
0,0 5,0 10,0 15,0 20,0 25,0 30,0
Temp
7,6
7,8
8,0
8,2
8,4
8,6
8,8
9,0
pH
Figure 10: Scatterplot of hydrological variables
2. Step: Estimate the relationship (positive or negative) between variables.
Directions of relationships
1. Positive relationship: Increasing values of X and increasing values of Y.
2. Negative relationship: Increasing values of X and decreasing values of Y.
3. No relationship between X and Y (e. g. parallels to the axes).
The relationships can be strong or weak.
31
8/3/2019 Stat Mod 1011
32/67
3. Step: Formulate the (linear) model equation (fig. 11):
pH = 7.868 + 0.025 Temp
7,6
7,8
8,0
8,2
8,4
8,6
8,8
9,0
0,0 5,0 10,0 15,0 20,0 25,0 30,0
Temp
observed
linear
Linear regression between Temp and pH
Figure 11: Linear relationship between variables
4.2 Confidence region of regression line
4. Step: Calculate the confidence region of the regression line.
The general model of linear regression is given by y = a +bx. Using the confi-
dence intervals ofa and b a confidence region of the (mean) linear model EY =
a + bx can be defined by gu < EY < go where gu = y* - sy*t and go = y* + sy*t. The
limits of confidence are symmetric hyperbolas around the linear regression
model y* = a* + b*x. They get their minimum for x = x* and increase with for
other x values. Therefore, the confidence statements will be fuzzier. The width
of the confidence band L depend from sy* and can be calculated by L = 2 sy*t.
4.3 The power of linear regression
The strength of a relationship is expressed by the empirical (linear) correlation
coefficient: r= (xi x*)(yi y*)/(xi x*)2(yi y*)
2. By means of this formula
(explanation see chapter 4.4) the next step of linear regression procedure is
derived.
32
8/3/2019 Stat Mod 1011
33/67
5. Step: Calculate the power of relationship: r = 0.493 or B = r2
= 0.243.
The calculation algorithm is presented in chapter 4.4.
To derive statistical characteristics of a linear regression model the following
cases should be distinguished:
1. b high, r high, s low,
2. b high, r low, s high,
3. b low, r low, s low,
4. b low, r very low, s high.
4.4 Empirical Covariance and statistical measures of correlation
A correlation analysis answers the question about the strength and direction of
a linear (but not severe functional) relationship between two or more variables.
The power or intensity of such a relationship is expressed by correlation. Meas-
ures of correlation are the correlation coefficient r, the performance index B = r2
or the partial correlation coefficient rxy,z.
Combining data series of different water quantity or water quality variables re-
ferring to two or more measurable characteristics sets of pairs of data (x1, y1),
(x2, y2) ,, (xn, yn) orn-tupel of data will be obtained (fig. 12).
X
120100806040200Y
120
100
80
60
40
20
0
-20
Figure 12: Scatterplot of a bivariate relationship.
These sets of data can be seen as realisations of a two- or multi-dimensional
stochastic vector (X, Y,). Normal probability distribution of data pairs or data
tupel is a (strong) prerequisite.
33
8/3/2019 Stat Mod 1011
34/67
A visualisation of a relationship between three variables is possible but in some
cases not really helpful. The information content is high but cannot be extracted
very clearly (fig. 13).
Y
120 140
0
20
120100
40
60
100
80
80
100
120
8060
X Z
6040 4020 200 0
Figure 13: 3-D scatterplot of variables
Such relationships are characterised by statistical measures which are denoted
as correlation measures. In principle, arithmetic means and empirical variances
of data series are used:
x* = 1/n xi and y* = 1/n yi
sx2
= 1/(n-1) (xi x*)2
and sy2
= 1/(n-1) (yi y*)2.
A new data series with n pairs of data (xi, yi), i = 1, , n is formed by two vari-
ables {X} and {Y}. The empirical covariance sxy will be calculated as follows:
)yy()xx(1n
1s i
n
1iixy
==
==
n
1iii
)yxnyx(1n
1.
sxy can be positive or negative. For small values ofxi, the difference between
arithmetic mean and xi will be negative. For big values ofxi, the difference be-
tween arithmetic mean and xi will be positive. This is also valid for data yi. For
this reason, a negative covariance characterises a relationship where big values
xi are connected with small values yi mostly and vice versa.
By normalisation ofsxy with empirical standard deviations sx und sy one gets the
empirical coefficient of correlation rxy:
34
8/3/2019 Stat Mod 1011
35/67
ss
sr
yx
xyxy = .
Because ofsxy = syx also rxy = ryx is valid. rxy is a measure of strength and direc-tion of a linear relationship between hydrological variablesXand Y.
Statistical measures of correlation between two or more hydrological variables
are mainly based on the assumption that the data sets are subsets of Gaussian
distributed data sets. The rank correlation procedure functions without assum-
ing a normal probability distribution of the data set to be analysed.
Empirical bivariate correlation coefficient
r= (xi x*)(yi y*)/(xi x*)2(yi y*)2
Performance index (coefficient of determination)
B = r2
Partial correlation coefficients
rxy,z = (rxy - rxzryz)/(1 rxz2)(1 ryz
2)
rxz,y = (rxz - rxyryz)/(1 rxy2)(1 ryz
2)
ryz,x = (ryz - rxyrxz)/(1 rxy2
)(1 rxz2
)
Multiple correlation coefficients
x, y, z x = f(y, z)
Rx, yz = rxy2
+ rxz2
2rxyrxzryz)/(1 - ryz2)
Multiple performance index
BBx.yz = (rxy2 + rxz
2 2rxyrxzryz)/(1 - ryz2)
SPEARMANs rank correlation
(Valid for small sample size, normal probability distribution not necessary)
)1(
)(6
12
1
2
= =
nn
yx
r
n
iii
S)1(
6
12
1
2
= =
nn
iDn
i
Table 11 contains data and an explanation of the the ranking procedure for a
SPEARMAN-test.
35
8/3/2019 Stat Mod 1011
36/67
Table 11: Data and procedure of rank correlation
xi R(xi) yi R(yi) Di Di2
0,5 5,5 4 3 2,5 6,25
0,8 7,5 6 5 2,5 6,25
1,1 10 2 1 9 81
0,5 5,5 10 8 -2,5 6,250,4 4 8 6 -2 4
0,3 2 12 10 -8 64
0,9 9 5 4 5 25
0,8 7,5 3 2 5,5 30,25
0,3 2 9 7 -5 25
0,3 2 11 9 -7 49
297
Result: rS = -0,8
Comparison ofrS and rStab (positive values only):
Forn 30 the table of probability values ofrS has to be used. Forn > 30 the
table of standardised normal probability distribution should be used:
rSTab(95) = 0.5515; rSTab(99) = 0.7333; rSTab(99,9) = 0,8667.
Decision: If rS rStab, then reject rS. the example shows that for each signifi-
cance level rSrStab is valid.
Result and interpretation: Between both data sets exists a relatively strong
negative correlation.
36
8/3/2019 Stat Mod 1011
37/67
5. Nonlinear regression analysis
In the case that a linear regression model is not valid or insufficient other re-
gression models should be tested. From this statement the following step of
(linear) regression procedure is derived:
6. Step: Find out other model types if the linear model is insufficient (fig. 13).
Figure 14 contains some standard nonlinear regression models computed by
means of SPSS. The results are presented in table 12.
7,6
7,8
8,0
8,2
8,4
8,6
8,8
9,0
0,0 5,0 10,0 15,0 20,0 25,0 30,0
Temp
observed
linearlogarithmic
invers
squared
cubic
composed
power
S-shaped
growth
exponential
logistic
pH
Figure 14: Linear and nonlinear regression curves
Table 12: Results of nonlinear regression modelsModel B b0 b1 b2 b3LIN 0.243 7.8683 0.0252
LOG 0.198 7.6798 0.2194
INV 0.158 8.3568 -1.2875
QUA 0.308 8.1097 -0.0310 0.0022
CUB 0.432 7.4486 0.2131 -0.0191 0.0005
COM 0.238 7.8731 1.0030
POW 0.194 7.6972 0.0262
S 0.156 2.1219 -0.1544
GRO 0.238 2.0635 0.0030
EXP 0.238 7.8731 0.0030LGS 0.238 0.1270 0.9970
37
8/3/2019 Stat Mod 1011
38/67
When comparing the performance indexes of these standard models the best
statistical model is the cubic one. But this model represents the data cloud by
43.2% only. The remaining 56.8% are not described by the model. As an overall
outcome of this analysis all of these models should be rejected and other types
of nonlinear models should be investigated.
5.1 Polynomial regression
The basic model is given by y= a0+ aixi, i= 1,, where n is called the order
of the polynomial. Figure 15 shows polynomials of different order. Each of the
polynomials represents the given data set by a relatively high degree of per-
formance. For 6th
and 7th
order polynomials the performance will be B = 1.
Figure 15: Examples of polynomial regression
38
8/3/2019 Stat Mod 1011
39/67
By comparing the graphs different interpretations are possible. For the polyno-
mial of 7th
order the graph indicates negative values which do not exist. The
advantage of polynomial regression is to get an algorithm for calculation of the
existing nonlinear relationship between hydrological variables. Disadvantages
are the high number of coefficients and sometimes physically not realistic re-
sults. The best models are not the ones where the graphs are joining all data
points.
Other model types used in water quality management are multiple linear or
nonlinear regression models (e. g. DO(t) = a0 + a1TW + a2Q + a3BSB or DO(t) =
a0+ a1TW + a2Q + a3BSB + a4TW + a5Q + a6BSB + a7TW) or models derived
from control theory (e. g. stochastic transfer method). A continuous dynamicprocess is described by a time discrete model applying the z-transformation on
a difference equation, G(z) = B(z-1
)/A(z-1
) +(z)
5.2 Periodic regression
The basic relationship is given by y = a + b1sin x + b2cos x. The equation
represents the simplest form of periodic regression or so-called Fourier polyno-
mial. In an extended form this method is called Fourier analysis (see chapter 7).In figure 16 water temperature of a reservoir at three depth levels (0m, 10m,
25m) and the approximating graphs are presented.
Figure 16: Periodic regression of water temperature in a reservoir
39
8/3/2019 Stat Mod 1011
40/67
It can clearly be seen that water temperature (and all other hydrological cycling
variables) can be approximated very well by periodic functions. The advantage
of this family of regression type functions is the visualisation of a cycling proc-
ess, the disadvantage is that the functions are valid for fixed cycling periods
only.
5.3 Trend functions
Medium-term and long-term temporal and spatial developments (trends) of hy-
drological variables can be estimated by simple, explicitly given functions. Pa-
rameter estimation is done by the method of least squares (MKQ). Figure 17
shows the development of BOD in along a river stretch following a polynomial of
2nd order.
y = 0,0908x2
- 0,5374x + 2,6386
R2
= 0,9501
0,0
0,5
1,0
1,5
2,0
2,5
TeK0030 SPK0010 SPK0020 Hv0190 Hv0200
sampling point
BOD(mg/l)
Figure 17: Polynomial trend function for BOD in a river
Other examples of linear and nonlinear trend functions are presented in figures
18 to 20.
y = 0,0166x + 0,0854
R2
= 0,8938
0,00
0,05
0,10
0,15
0,20
25014 TeK0030 SPK0010 SPK0020 Hv0190 Hv0200
sampling point
o-PO4-P(mg/l)
Figure 18: Linear trend of phosphate phosphorus in a channel
40
8/3/2019 Stat Mod 1011
41/67
The linear function (also denoted as a polynomial of 1st
order) is able to follow
the increasing trend of phosphate phosphorus load due to waste water input in
a low flow channel with acceptable accuracy. The deviations of regression line
from measurements are small. For the same river stretch, the approximating 2nd
order polynomial of water flow (fig. 19) shows stronger deviations after conjunc-
tion of the main river with a channel. The reason for this are changing hydraulic
conditions and increasing values of water flow. The stationary or uniform flow
conditions of the first part of the water body are disturbed now. Considering the
performance index the graph should be acceptable. But the regression model is
not able to compensate the positive jump in water flow because it works with
fixed parameters (coefficients). Therefore, another regression model should
used.
y = 3,3914x2
- 18,053x + 51,117
R2
= 0,809
0
10
20
30
40
50
60
70
25014 TeK0030 SPK0010 SPK0020 Hv0190 Hv0200
sampling point
flow(m3/s)
Figure 19: Quadratic trend function of water flow
On the other hand, for the same river stretch the trend of chlorophyll-a is ex-
pressed by a 2nd
order polynomial again (fig. 20).
y = 1,4627x2
- 6,4221 x + 67,115
R2
= 0,6459
0
20
40
60
80
25014 TeK0030 SPK0010 SPK0020 H v0190 H v0200
sampling point
Chlorophyll-a(g/l)
Figure 20: Quadratic trend function of chlorophyll-a
41
8/3/2019 Stat Mod 1011
42/67
The performance index is lower than before in fig. 19 for water flow because of
some disturbances caused by hydrophysical phenomenon. But the trend fol-
lows the computed polynomial. Taking into account the variations in chlorophyll
measurements the trend polynomial is quite acceptable.
The following table gives a survey on trend functions used to estimate the de-
velopments of water quality in a river (table 13). All polynomials are of 2nd order.
The signs in the last column indicate significance on a 95% probability level.
Table 13: Trend functions of water quality in the River Havel
Water quality indicator Trend R P (95%)
Water flowTemperature
ConductivityChloride
polynomialpolynomial
polynomialpolynomial
0,81260,6177
0,19710,0382
++
--
DOBODCSV
polynomialpolynomialpolynomial
0,38580,42640,7611
+++
NH4-NNO2-NNO3-NO-PO4-PTPSiO2
exponentialexponentialexponentialexponentialpolynomialpolynomial
0,56690,48790,47460,86830,08220,8888
++++-+
Suspended matterChlorophyll-aInorg. part of biomassLoss of org. matter
polynomialpolynomialpolynomialpolynomial
0,02270,60320,67420,1418
-++-
As can be seen from table 13, polynomial and exponential trend functions are
sufficient to describe the changing water quality mathematically.
Interpretations of trend functions can be given as follows:
Linear trend:
y(t) = a0 (t) + a1 (t) x(t).
(Interpretation of parameters: (a0) mean initial value, (a1) mean rate of
change)
Squared trend:
y(t) = a0 (t) + a1 (t) x(t) + a2 (t) x2 (t).
(Interpretation of parameters: (a0) - mean initial value, (a1) - mean rate of
change, (a2) mean process acceleration)
42
8/3/2019 Stat Mod 1011
43/67
Polynomial trend:
y(t) = a0(t) + a1 (t) x(t) + a2(t) x2(t) + ..... + an (t) x
n(t).
(Interpretation of parameters is mostly impossible).
Exponential trend:x(t) = x(0) e
- kt+ E.
(Interpretation according to 1st
order kinetics:x(0) initial concentration value, k
rate of change, E random quota).
5.4 Comparison of regression functions
To describe one and the same data set different nonlinear models can be ap-
plied.
Figure 21: Comparison of different regression functions for the same data set
43
8/3/2019 Stat Mod 1011
44/67
By comparing the initial and the final reach of regression functions the best
functional relationship will be selected (fig. 21). Also the linear model seems to
be suitable. As can be seen in part H, the middle range of all computed models
shows very small variations while the initial and the final part of the graphs show
a spreading of curves. An evaluation of the quality of fit can be given by:
Linear coefficient of determination (performance index):
R2
= B = ( -y y ) / (y - y ) ,
Nonlinear performance index: Bnl = 1 - ( (y - )y2
/ (n-1) sy ),
Residual sum of squares: SR = (yi - )y2, or
Residual dispersion: s
2
= SR/(n m 1) (n number of data, m number ofparameters).
44
8/3/2019 Stat Mod 1011
45/67
6. Time series analysis
The distinction between discrete and continuous variables is not a clear dichot-
omy because continuous processes (seen from a physical point of view of un-
derstanding nature) will be observed at discrete time events. Therefore, mostly
random variables are observed.
6.1 Dynamic behaviour of time series
Freshwater ecosystems may be seen as switching networks where inputs are
transformed into outputs by an operatorwhich describes the transient behav-
iour of ecological processes (fig. 22). The overall operator transforms input
signals into output signals: y(t) = x(t) where the signals will be smoothed
(damped), and there exists some redundancy between input and output signals.
x(t) y(t)Figure 22: Schematic diagram of a transfer process
Therefore, water related processes are represented by time varying signals. In
figure 23, NO3-N raw data are described by a polynomial trend as follows: NO3-
N(t) = 1,8987 0,0754 t + 0,0028 t2
- 0,00003 t3. An exact mathematical (or
functional) description of random fluctuations is not possible. The function de-
scribes more or less the mean behaviour of the process.
706050403020100
3,5
3,0
2,5
2,0
1,5
1,0
,5
0,0
Figure 23: Approximation of a time varying process by a function
45
8/3/2019 Stat Mod 1011
46/67
6.2 Description of time series in time and in frequency domain
Hydrological systems can be seen as stochastic transfer systems described by
system state variables and parameters. They are characterised by measurable
inputs, not measurable (stochastic) disturbances as well as by measuring er-
rors. Disturbances, input signals and measurement errors will be overlaid and
will produce output signals.
Mathematical descriptions of hydrological time series can be represented by
time domain functions (cf. transfer functions, pp. 8 and 9).
In the frequency domain hydrological time series are represented by Fourier-
transforms of correlation functions, by coherency functions as well as by wave-
lets.
6.3 Stationary processes
Because of time lags between input and output processes stationary processes
will then be reached when all transient processes are decayed. Therefore,
some statistical characteristics of signals should only be grasped. If statistical
characteristics do not change in time, then these processes are called station-
ary processes. Process averages and dispersions will not change so much in
time. Therefore, stationary random processes can be investigated on different
time intervals between - < t < + .
Statistical characteristics of stationary random processes can be expressed
by
1. Probability density functionp(x) of signals X(t),
2. Auto-correlation functionxx(),
3. Spectral power density function Sxx()
A time varying process is expressed by a stochastic signal X(t). For each time
stroke tn one measured valueXn(t) will be obtained. The further development of
the process can be predicted only for a short time interval. When the process is
described by an analytical (deterministic) function f(t) then the time behaviour
can be predicted completely. Only some statistical statements on the future de-
velopment of the processX(t) can be given:
Prob(X(tn+1) x) P(x),
46
8/3/2019 Stat Mod 1011
47/67
or
Prob(a < X(t) b) = p(x)dx.
The Gaussian distribution with a bell-shaped density is one of the most impor-
tant probability density distributions wherep(x) = 1/2exp-(x-x*)2
/22
. Impor-
tant expectations are linear average: E(x) = xp(x) dxand squared average:
E(x2) = x2p(x) dx.
6.4 Correlation and spectral functions
The probability density function gives an information about the probability of the
processX(t) that the amplitude at time t lies betweenxand (x+ x):
Prob(x
8/3/2019 Stat Mod 1011
48/67
spectrum of a stationary signal which is a distribution of the variance of the sig-
nal as a function of frequency. The frequency components that account for the
largest share of the variance are revealed. Each peak represents the part of the
variance of the signal that is due to a cycle of a different period or length. Sig-
nificant periodicity in the signal will induce a sharp peak in a periodogram. The
auto-covariance function is the time domain counterpart of the periodogram.
The periodogram of water temperature (figure 24) shows a single distinct peak
which indicates the major cyclic behaviour. The low frequency component is
responsible for the general tendency of the indicator.
Figure 24: Periodogram of water temperature of the Lower Havel River
Figure 25: Periodogram of pH
The periodogram of pH in figure 25 shows that the highest variance is displayed
by a low frequency. Small fluctuations are not dominant and can be neglected.
48
8/3/2019 Stat Mod 1011
49/67
Only long term changes are responsible for the overall observed behaviour of
the indicator.
The periodogram of pH is similar to that of dissolved oxygen presented in figure
26. High variances at low frequencies are observed. This means that the gen-eral tendency of this indicator is determined by long term changes.
Figure 26: Periodogram of dissolved oxygen.
For the indicator of phytoplankton biomass the periodogram is shown in figure
27. The periodogram represents low frequency components which exhibit the
highest variances and some small fluctuation at higher frequencies. They de-
termine the long term behaviour of the indicator. Two distinct peaks reveal two
cycles of different periods and amplitudes.
Figure 27: Periodogram of chlorophyll-a
The cross-power spectrum Sxy() of two stochastic ecological processes x(t)
and y(t) is the Fourier transform of the CCF:
49
8/3/2019 Stat Mod 1011
50/67
Sxy() = 1/2xy()e-j d
It is a complex function.
The coherency function Co() is a measure of synchronicity of (two) signals. It
is calculated on the base of periodograms of both signals by
Coxy() = |Sxy()|2/Sxx()Syy(),
where |Sxy()| = Re(Sxx())2
+ Re(Syy())2
and for the phase shift between
both signals () = arc tan (Im(Sxy())/Re(Sxy())) is valid.
The limitation of CCF is considered by what is called a window function h():
~Sxy() = 1/2xy()h()e
-j d = Sxy()H( - ),
where H() is the Fourier transform of h() which distorts Sxy() to~Sxy().
50
8/3/2019 Stat Mod 1011
51/67
7. Analysis of cycling processes
Cycling processes in hydrology are natural. In fig. 28 some examples of cycling
processes with different periods and frequencies are presented.
0
100
200
Q(m3/s)
200
600
1000
EC(S/cm)
0
15
30
Tw(C)
0
10
20
O2(mg/l)
1985 1987 1989 1991 1993 19956
8
10
time (a)
pH
Figure 28: Cycling water quality indicators
Such processes are caused mostly by natural external driving forces but also by
natural internal driving forces. They lay out different time and frequency behav-
iour of the water quality (hydrological) processes. Water quality processes are
characterised by different time parameters such as time delay, threshold val-
ues, altering, physiological parameters and others. State transitions take
place on intervals (ai (t), bi (t)) with probability densities wi(t) of time delays of
system variables and probabilities pi(t) for each realisation of a state transition:
Forai(t) wi(t) bi(t):pi(t) = wi(t) dt.
On the other hand, hydrological variables vary often with high frequencies be-
cause of random changes of internal system states and/or fluctuations of vari-
ables. Switching processes of input variables take place at certain different time
events. Time delays in the courses of action of system components lead to re-
tardations in the changes of system states and to redundancies in the data
transfer.
51
8/3/2019 Stat Mod 1011
52/67
A state transition can be characterised by a quadrupel
i(t) = {ai(t), bi(t), wi (t),pi(t)).
A classification of hydrological systems can be given by its characteristics of
signals and by the type of change of dynamic properties (table 14).
Table 14: Classification of hydrological systems
Classification Remark
Characteristics of signals
Modulation
Quantification
Change of amplitudes, frequencies andphases of signals
Discretisation of time domain of ampli-tudes and duration interval of signals
Adaptability of system
adaptive
non-adaptive
Change of systems states, change of in-puts and disturbances, change of parame-ters, change of system structure
fixed parameters, no change of ecosystemstructure
7.1 Introduction
Mathematical equations describe either the time dependency (function of time t)
which is called description in the time domain or the frequency dependency(function of frequency or cycles per time unit) which is called description in
the frequency domain. Mostly, cycling (or periodic) processes in hydrological
context are caused by natural external driving forces. On the other hand, aperi-
odic hydrological processes are mainly influenced by artificial (man-made) ex-
ternal driving forces.
Another distinction can be made by the ability to reproduce a time-varying proc-
ess. In the case of correct reproduction and forecast of a process it is called a
deterministic one. Otherwise it is called a non-deterministic or stochastic (ran-
dom) process. Each deterministic process x(t) is characterised by its time de-
velopment (or behaviour) x = x(t) with - < t < +.
A harmonic process is described by a trigonometric function
x(t) = x0cos(i + i) with - < t < +.
i = 2/Ti is the basic cycling frequency (circle frequency), Ti is the period of
cycle, and i is the shift of phase.
52
8/3/2019 Stat Mod 1011
53/67
7.2 Fourier analysis
A periodic process with period T0 is described by a Fourier series of the form
x(t) = a0/2 + aicos(i0t) + bisin(i0t),
with - i + , 0 = 2/T0 frequency of the basic cycle, T0 period of cycle.
The amplitudes ai and bi are calculated as follows: ai = 1/T0x(t)cos(i0t)dt,
bi = 1/T0x(t)sin(i0t)dtand a0 = 1/2T0x(t)dt.
The Fourier polynomial is an approximation which represents the minimum
mean squared deviation of a cycling process. Then, the amplitudes of the ap-
proximating function are given by Ai = ai2
+ bi2.
Phase shifts are given in the interval [0, 2] by i = arc tan bi/ai.
Figure 29 shows a Fourier approximation of global radiation process. It can be
seen that the approximation is shifted from the real frequencies due to a fixed
frequency. This fact causes some error.
1996 1997 1998 1999 2000 2001 2002 20030
50
100
150
200
250
300
350
400
450
500
time (a)
globalradiation(W/m2)
raw datacomponent with max. amplitude (f=1/352d)
Figure 29: Fourier approximation of global radiation
Fourier approximations can be used to explain the variance of a cycling process
by its basic frequency. Table 14 gives an example on the usefulness of this
method for physical, chemical and biological environmental or ecological vari-
ables respectively. The 3rd
column of table 14 contains the values of total vari-ance of the time series under consideration. The last column contains the val-
53
8/3/2019 Stat Mod 1011
54/67
ues of variance which are explained by the dominant cycle contained in the
timw series. The best results will be obtained for physical variables, followed by
chemical variables. Insufficient results are obtained for biological variables.
Table 14: Fourier analysis of water quality indicators
Indicator Reservoir Totalvariance
(%)
Aver-age
Std.dev.
Variance of theyearly cycle(% of totalvariance)
TEMP Saidenbach 90.0 12.0 44.43 84.62
Neunzehnhain 90.0 11.9 33.41 74.35
Kliava 95.7 11.1 56.80 92.76Slapy 95.9 12.0 52.25 90.63
DO Saidenbach 71.6 10.5 2.90 22.74
Neunzehnhain 76.2 10.1 1.98 22.19Kliava 75.9 10.2 4.37 37.35Slapy 76.2 7.7 8.88 34.81
CHA Saidenbach 36.8 5.4 32.87 1.94
Neunzehnhain 35.4 1.7 1.59 1.96
Example:Approximation of water temperature of reservoirs (yearly domi-
nant harmonic cycle):
Reservoir Saidenbach
TEMP(t) = 12.0 + 1.458cos((6/180)t) 4.462sin((6/180)t
Reservoir Neunzehnhain
TEMP(t) = 11.9 + 0.693cos((6/180)t) + 4.415sin((6/180)t
Reservoir Kliava
TEMP(t) = 11.1 - 6.650cos((9/180)t) - 7.820sin((9/180)t)
Reservoir Slapy
TEMP(t) = 12.0 - 7.073cos((10/180)t) - 6.684sin((10/180)t)
7.3 Digital data filter
Digital filter function transfer sequences of input signals to sequences of output
signals by compressing or decompressing noisy information contained in the
measured signals of hydrological processes. The results of applying digital fil-
ters are consistent data series which can be used for modelling, simulation and
optimisation in hydrological sciences.
Basic filter functions are derived from an ideal low pass filter:
54
8/3/2019 Stat Mod 1011
55/67
Ideal low pass
)(11|)(| 2
2
FH
+=
Butterworth filter (power low pass)
+=1
1|)(| 22
nH
(Amplitude response should be as flat as possible in the pass band).
Tschebyshev filter, type 1
)(1
1
|)(| 222
cH n+=
( - ripple factor (or eccentricity), = 0.1526. In the pass band a ripple is ac-
cepted. The transition from pass band to stop band is steeper than for the But-
terworth filter).
Tschebyshev filter, type 2 (inverse Chebyshev-Filter)
)(*11|)(| 22
2
cH
n+= with * = 2/(1-).
(In the stop band a ripple is accepted.)
Elliptic filter (Cauer filter)
)(*1
1|)(|22
2
FH
n+=
(Ripples arise in the pass band and in the stop range. One gets the steepest
transition between both frequency bands).
To get an acceptable transfer behaviour filters of order 1 to 3 should be used
only. Figures 30 to 33 represent the transfer behaviours of digital filters for dif-
ferent water quality time series. Higher order filters show rippling transfer be-
haviours and cause nonlinear effects in the output sequences of signals. This
leads to misinterpretations and unexplainable events within the data series.
55
8/3/2019 Stat Mod 1011
56/67
0 50 100 150 200 250 300 3500
0.5
1
1.5
2
2.5
3
3.5
reciprocal of critical frequency (d)
standarderrorO2(mg/l)
order 1order 2order 3order 4order 5order 6
Figure 30: Selection of filter order of a Butterworth filter for DO
The higher order filters lead to changing (welling) transfer behaviour during the
filtering process as can be seen in figs. 30 and 31.
0 50 100 150 200 250 300 3500
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
reciprocal of critical frequency (d)
standarderrorpH-value
order 1order 2
order 3order 4order 5order 6
Figure 31: Selection of filter order of a Butterworth filter for pH
They show this behaviour for a Butterworth filter. Tchebychev 1 filters for chlo-
rophyll-a and for water temperature (figs. 32 and 33) demonstrate the distur-
bances within the transfer process.
56
8/3/2019 Stat Mod 1011
57/67
0 50 100 150 200 250 300 350
0
0.005
0.01
0.015
0.02
0.025
0.03
0.035
reciprocal of critical frequency (d)
standarderrortotalchlorophy
ll-a(mg/l)
order 1order 2order 3order 4order 5order 6
Figure 32: Tschebychev 1 filter for chlorophyll-a
0 50 100 150 200 250 300 3500
0.5
1
1.5
2
2.5
reciprocal of critical frequency (d)
standarderrorwatertemperature(C)
order 1order 2order 3order 4order 5order 6
Figure 33: Tschebychev 1 filter for water temperature
The first step of digital data filtering procedures is the selection of a complete
hydrological time series. If the data series contains some gaps interpolation
methods should be used to get a time series with equidistant data. This is a
strong prerequisite for all further steps. Fig. 34 shows such a data series for the
variable conductivity of the Oder River at Frankfurt.
57
8/3/2019 Stat Mod 1011
58/67
1993 1994 1995 1996 1997 1998 1999 2000
400
600
800
1000
1200
1400
1600
time (a)
conductivity(S/cm
)
Figure 34: Original data series of conductivity
In the next step the critical frequency is calculated from spectral density function
(fig. 35). As confidence band the 95% - confidence region should be selected.
For the example a critical frequency fg = 0.053 was used.
0 0.1 0.2 0.3 0.4 0.510
0
101
102
103
104
105
106
frequency (1/d)
powerdensity(conductivity)
fg
power density
upper bound (confidence interval 95%)lower bound (confidence interval 95%)
Figure 35: Selection of critical frequency of the filter
The last step consists of computation of the digital filter and reconstruction of
the original data series. In case of the Oder River an elliptic filter was used to
reconstruct the original time series and to get a consistent time series for mod-
elling (fig. 36).
58
8/3/2019 Stat Mod 1011
59/67
1980 1990 2000200
300
400
500
600
700
800
900
1000
1100
1200
time (a)
con
duc
tiv
ity
(S/cm
)
Raw DataElliptic Filter (1. order, f
8/3/2019 Stat Mod 1011
60/67
7.4 Wavelets
Wavelet analysis has been proven quite useful for time scale based signal
analysis. It is a solution for the time scale analysis problem because it offers an
effective approach to extract both the information on the time localization and
the frequency content of the time series. It has the ability to decompose time
series into several sub-series which may be associated with particular time
scales. As a result, the interpretation of features in hydrological time series may
be facilitated by first applying an appropriate wavelet transform and subse-
quently interpreting each individual sub-series.
The following questions can be effectively answered with the help of wavelet
analysis:
1. What is the dominant scale of variation influencing the observed gen-
eral tendency of the indicator?
2. Are the variations from one day to the next more prominent than the
variations from one week to the next?
3. Are the statistical variations in the hydrological indicatorhomogenous
across time?
4. What are the time dependent variations such as the presence oftrends?
5. How are two indicators related on a scale by scale basis? How do they
covary at different scales?
The wavelet analysis imitates the windowed Fourier analysis by using basis
functions (wavelets) that are better suited to capture local behaviour of non-
stationary signals. The wavelet transformation is a function of two variables
W(u,s) obtained by projecting a signal X(t) on to a particular wavelet and is
given by
,)()(),( , dtttXsuW su
=
=s
ut
stsu
1)(,
which gives a translated and dilated version of the original wavelet function. The
coefficients that are obtained are a function of the location and scale parame-
ters. Applying shifted and scaled versions of a wavelet function decomposes the
signal into simpler components. It is the effect of the shifting and scaling proc-
60
8/3/2019 Stat Mod 1011
61/67
ess what makes this representation possible and is referred to as multiresolu-
tion analysis.
The wavelet transform is usually applied in the form of a filter bank, comprising
two filters. The scaling filter known as the father wavelet is a low pass filterwhile the wavelet filter known as the mother wavelet is a high pass filter. Given
a signal X(t) of length n = 2j, the filtering procedure can be performed a maxi-
mum ofjtime, giving rise tojdifferent wavelet scales. The wavelet coefficients
or detail coefficients are produced by the wavelet filter while the scaling filter
gives rise to the smooth version of the signal used at the next scale. Given the
respective father and mother wavelets,
=
J
JJ
kJ kt222 2,
= 1)( dtt
and
=
j
jj
kj
kt
2
22 2,
= 0)( dtt
where J,k is the father wavelet and j,k is the mother wavelet with the scale
parameter s being restricted to the dyadic scale 2j. If a signal is projected onto
a given basis function
= kJkJ tfS ,, )( ,
then
= kjkj tfd ,, )(
will be obtained with SJ,k being the coefficients for the father wavelet at a maxi-
mum scale of 2j (the smooth coefficients) and dj,k being the detail coefficients
from the mother wavelet at all scales from 1 toj, to the maximal scale. Based on
these coefficients, the function f(t) can be represented by
)(....)()()( ,1,1,,,, tdtdtStf kk
kkJ
k
kjkJ
k
kJ+++=
and can be equally represented by
f(t)=Sj+ Dj + Dj-1+ + Dj + D1
61
8/3/2019 Stat Mod 1011
62/67
where
)(,, tSS kJk
kJJ =
and
)(,, tdD kjk
kjj = .
Multiresolution decomposition (MRD) reveals the variations at different scales
denoted by d. Figure 37 shows the details of the multiresolution analysis of
dissolved oxygen sampled at daily interval.
Figure 37: Multiresolution analysis details of dissolved oxygen signal
sampled at daily intervals
The details reveal the high frequency variations present in the dissolved oxygen
time series or provide an additive decomposition of the high frequency variation
on a scale by scale basis. The notations d1, d2, d3, d4, d5, d6 and d7 reveal the
variations occurring at one day, 2 days, 4 days, 8 days, 16 days 32 and 64 days
respectively. This progressive decomposition reveals the differences in fluctua-
tions from one scale to another. It effectively shows that the lower scales are
less important compared to the higher scales of variation.
62
8/3/2019 Stat Mod 1011
63/67
Multiresolution analysis (MRA) filters information in the signal at different scales
represented by a. In fig. 38 an example of a MRA and MRD is given for long-
term observations of dissolved oxygen in eutrophic freshwater ecosystem. Tak-
ing of from the original signal s all high frequent events the basic nature of the
cycling process comes out. This can be seen at level a7.
Figure 38: Wavelet analysis of DO
Figure 38 reveals that the variations occurring at a time scale of 1 day are
equally of relatively low intensity and are not able to influence the general ten-
dency observed in the dissolved oxygen signal. However, the fluctuations oc-curring at higher time scales such as scale 8 are strong enough to influence the
long term behaviour of the signal. Hence, the long term tendency observed in
the dissolved oxygen time series is significantly influenced only by the fluctua-
tions occurring at the higher scales and not the lower scales. At the lower
scales, the fluctuations are higher during the warmer months than during the
colder months. At the higher scales such as scale 32, the fluctuations are high
throughout the year. It is quite interesting to examine the variance at different
scales to effectively quantify these variations.
63
8/3/2019 Stat Mod 1011
64/67
An overview on the respective frequencies is given in table 15.
Table 15: Frequencies and scales of MRA and MRD
MRA scale MRD scale Frequency
a1 d1 1a2 d2 2
a3 d3 4
a4 d4 8
a5 d5 16
a6 d6 32
a7 d7 64
a8 d8 128a9 d9 256
a10 d10 512
a11 d11 1024
a12 d12 2048
The variance of a signal can equally be decomposed by using this technique.
For a signal Xt the time varying variance of the scale Sj of a wavelet coefficient
Wj,t can be calculated by
)var(2
1)( ,
2
, tj
j
jtx wS
S = .
Similar to the wavelet variance of a univariate signal, the wavelet covariance
decomposes the covariance between two signals on a scale by scale basis by
),cov()( ,,11
txtj
j
x xxS =
=
.
The wavelet variance shown in figure 39 reveals the intensity of variation from
one scale to the next of the dissolved oxygen time series. This graphical repre-
sentation of the wavelet variance enables the researcher to answer questions
concerning the dominant scale of variation in the time series, the homogeneity
of variations from one scale to the next, the importance of the variations at one
scale compared to the variations occurring at another scale.
64
8/3/2019 Stat Mod 1011
65/67
*
*
*
* **
0.05
0.10
0.20
0.50
1.00
2.00
Wavelet Scale
L
L
L LLU
U
U
U UU
1 2 4 8 16
32
Figures 39: Wavelet variance of dissolved oxygen with db4
65
8/3/2019 Stat Mod 1011
66/67
Literature
Adorf, H.-M., 1995: Interpolation of Irregularly Sampled Data Series - A Survey.
In: Shaw, R. A., H. E. Payne and J. J. E. Hayes (eds.): Astronomical Data
Analysis Software and Systems IV. ASP Conference Series, Vol. 77,
Academic Press, New York, pp. 1-4.
Box, G. E. P., G. M. Jenkins and G. C. Reinsel, 1994: Time Series Analysis. 3rd
ed., Prentice Hall, Englewood Cliffs.
Brmaud, P., 2002: Mathematical Principles of Signal Processing. Springer,
New York, 2002.
Brockwell, P. J. and R. A. Davis, 1998: Introduction to Time Series and Fore-
casting. Springer, Berlin.
Franses, P. H., 1999: Periodicity and Structural Breaks in Environmetric Time
Series. In: Mahendrarajah, S., A. J. Jakeman and M. McAleer (eds.):
Modelling Change in Integrated Economic and Environmental Systems.
Wiley, New York.
Gentili, S., Magnaterra, L., and G. Passerini, 2004: An Introduction to the statis-
tical filling of environmental data time series. In: Latini, G. and G.
Passerini (eds.): Handling Missing Data. WIT Press, Southampton, pp. 1-
27.
Han, J. and M. Kamber, 2006: Data Mining Concepts and Techniques. Mor-
gan Kaufmann, New York.
Hipel, K. W. and A. I. McLeod, 1994: Time Series Modelling of Water Re-
sources and Environmental Systems. Elsevier, Amsterdam.
Jrgensen, S. E. und W. J. Mitsch (eds.), 1983: Application of Ecological Mod-
elling in
Keith, L. H. (ed.), 1988: Principles of Environmental Sampling. ACS Profes-
sional Reference Book, ASC, Salem.
Latini, G. and G. Passerini (eds.), 2004: Handling Missing Data. WIT Press,
Southampton.
Little, R. J. A. and D. B. Rubin, 1983: Missing Data in Large Data Sets. In:
Wright, T. (ed.): Statistical Methods and the Improvement of Data Quality.
Academic Press, London, pp. 73-82.
Little, R. J. A. and D. B. Rubin, 1987: Statistical Analysis with Missing Data.Wiley, Chichester.
66
8/3/2019 Stat Mod 1011
67/67
Mallat, S. (1998): A wavelet tour of signal processing. Academic Press, New
York.
Mller, W. G., 2001: Collecting Spatial Data. Springer, Berlin.
Pollock, D. S. G., 1999: A Handbook of Time-Series Analysis, Signal Process-
ing and Dy
Powell, T. M. and J. H. Steele (eds.), 1995: Ecological Time Series. Chapman &
Hall, New York.
Rebecca, M., 1998: Spectral analysis of time-series data. Guilford Press, New
York.
Reckhow, K. H. und S. C. Chapra, 1983: Engineering Approaches for Lake
Management. Vol. 1: Data Analysis and Empirical Modelling. Butterworth,
Woburn.
Shumway, R. H. and D. S. Stoffer, 2000: Time Series Analysis and Its Applica-
tions. Springer, New York.
Stein, M. L., 1999: Interpolation of Spatial Data. Springer, Berlin.
Strakraba, M. und A. Gnauck, 1985: Freshwater Ecosystems Modelling and
Simulation. Elsevier, Amsterdam.