Stat Mod 1011

8/3/2019 Stat Mod 1011

1/67

Brandenburg University of Technology at Cottbus

Dept. of Ecosystems and Environmental Informatics

Statistical Modelling

Univ.-Prof. Dr. habil. Albrecht Gnauck

International Master Course of Study Hydroinformatics EuroAquae

Winter term 2010/2011


Contents

1. Events and data
1.1 Analysis and control of aquatic ecosystems
1.2 Statistical management of ecological data
1.3 Sampling strategies
1.4 Re-sampling and pre-treatment of data

2. Probability functions and statistical measures
2.1 Probability functions of ecological data
2.2 Normal and skewed probability distribution functions
2.3 Comparison of expectations
2.4 Statistical measures

3. Statistical test procedures
3.1 Introduction
3.2 General procedure of hypothesis testing
3.3 Rules of decision
3.4 Selected test procedures

4. Linear regression and correlation analysis
4.1 Steps of linear regression
4.2 Confidence region of regression line
4.3 The power of linear regression
4.4 Empirical covariance and statistical measures of correlation

5. Nonlinear regression analysis
5.1 Polynomial regression
5.2 Periodic regression
5.3 Trend functions
5.4 Comparison of regression functions

6. Time series analysis
6.1 Dynamic behaviour of time series
6.2 Description of time series in the time and frequency domain
6.3 Stationary processes
6.4 Correlation and spectral functions

7. Analysis of cycling processes
7.1 Introduction
7.2 Fourier analysis
7.3 Digital data filter
7.4 Wavelets

Literature


1. Events and data

Statistical modelling of hydrological systems is an important task for extracting information on former and current states of aquatic ecosystems (aquifers, freshwater ecosystems, marine ecosystems) by means of water quantity data and water quality data. Holism and reductionism are two different approaches to studying and modelling ecological processes and systems. Both approaches are needed for ecosystem modelling, simulation and management.

Holism

Aquatic ecosystems are complex systems with nonlinear interrelationships. Holism attempts to reveal the properties of ecosystems by studying the system as a whole. The system properties cannot be found by studying the system components separately; the study has to be carried out at the system level. This implies that a study of the ecosystem components alone is not sufficient. The components of ecosystems are coordinated to such an extent that ecosystems work as indivisible units. A study at the component level will never reveal the ecosystem properties.

Reductionism

To simplify the ecosystem study and to facilitate the interpretation of ecological processes, the ecosystem components are separated from the system level. This method is useful for finding governing relationships in real systems, but it has obvious shortcomings when the functioning of the entire ecosystem is to be revealed. As an example: a forest is more than the sum of all its trees.

The analysis and control of dynamic aquatic ecosystems such as ponds, lakes, reservoirs and river basins is often a complicated task because of the high number of system elements (or components) and of interrelationships between system elements and between system elements and their environments. To solve management problems, the system has to be decomposed and nonlinear interrelationships have to be linearised. Furthermore, the controllability of aquatic ecosystems has to refer to different subsystems working in parallel and to different system states. The quality of aquatic ecosystem analysis depends on the flexibility of the statistical models used. The restricted information structure of complex aquatic ecosystems and the aggregation of information lead to uncertainties in the modelling process and in the resulting models. Dynamic processes within aquatic ecosystems are initiated by switching of input and state variables. They result in rapid changes of system states and output variables (non-autonomous control) or in slow changes (autonomous control). In general, complex dynamic systems like hydrological systems (aquatic ecosystems) are characterised by three features (table 1).

Table 1: General characterisation of complex dynamic systems

Feature | Solving procedure
High dimension | Decomposition of system
Uncertainty | Analysis of dynamic characteristics (observability, controllability, perturbability, reachability, robustness, stability, sensitivity)
Restricted information structure | Aggregation of information

Statistical modelling of hydrological systems is based on data. Data are observations of characteristics and/or attributes of hydrological input, state and output variables. A group of state variables under study is called a (statistical) population (e.g. data of water flow, salinity data of river water, BOD data of a waste water treatment plant). If the frequency distribution of the attributes of a population is known, it can be described by a probability density function or probability distribution function, which is an analytical function defined by a number of parameters.

For the study of aquatic ecosystems, a subset of the population, called a sample, is used. A population is denoted as univariate if only one variable (or water quality indicator) is considered. Common univariate measures are averages, which locate the centre of a data cloud along an axis, and measures of dispersion such as the variance or the range. If more than one variable (or indicator) is considered, the population is denoted as multivariate.

Regression and correlation analysis belong to the experimental statistical modelling of hydrological systems, which is based on methods of probability theory. Solving practical problems requires approaches that are compatible with the stochastic nature of the input variables and state equations. Statistical procedures are the adequate mathematical methods as long as the processes within the systems and their describing equations are unknown. A distinction is made between two groups of methods depending on whether the variable time is included or not: static methods (time is not considered as a variable) and dynamic methods (time is considered as a variable). The latter group is often called time series analysis or dynamic statistics. Simple and multiple linear and nonlinear regression and correlation belong to the static methods, as do multivariate statistical procedures. Static procedures answer the question of whether there is a relationship between two or more variables of an environmental system. A regression analysis answers this question and yields the type of relationship between the variables.

Statistical modelling is done for different purposes. Administrations as well as industrial and agricultural companies use statistical data and results to plan their operations and economic development. Researchers use statistics mainly as a first step towards deriving new scientific results. The topics of statistical modelling can therefore be formulated as follows:

1. Data sampling (Methods: sampling design, re-sampling, plausibility checks, outlier correction).

2. Data analysis to fulfil the requirements of environmental administrations and associations (Methods: descriptive statistics, frequency distributions, averages, variances, error correction, significance tests).

3. Data analysis to fulfil the requirements of different professional users (e.g. industry, agriculture, forestry) (Methods: explanatory statistics, multivariate statistics, geostatistics, time series analysis).

4. Basic research (Methods: regression and correlation analysis, multivariate statistics, advanced statistical techniques, digital data filtering, frequency analysis).

Disturbances of the statistical analysis of hydrological data arise from the following:

1. Mostly, only small sets of representative, regularly sampled data are available.

2. The power of natural and artificial (man-made) external driving forces as well as natural internal driving forces on hydrological indicators influences the quality of the data to be obtained.

3. Mostly, the a-priori process information on water quality indicators is low.

4. Hydrologic processes possess different rate constants.

5. Cycling effects in hydrological data are induced by natural internal or external processes as well as by man-made external processes.

Classification of hydrological data

Hydrological data may be classified by their origin:

1. Measured and/or observed data of hydrological indicators are obtained from field samples and/or laboratory experiments. They are directly observed (direct observations) or indirectly observed (via calibration of analytical instruments or sensors).

2. Summary data are derived from statistics or from ecological or water quality indicators with restricted observability.

3. Simulated data are obtained from simulation models.

1.1 Analysis and control of aquatic ecosystems

An aquatic ecosystem is a biotic and functional system or unit which is able to sustain life and includes all biological and non-biological variables in that unit. Spatial and temporal scales are not specified a priori, but are entirely based upon the objectives of the ecosystem study. Ecosystems are often called complex systems.

Several approaches exist to study the behaviour of ecosystems.

Empirical studies collect bits of information. An attempt is made to integrate and assemble the studies into a complete picture.

Comparative studies compare some structural and functional components for a range of ecosystem types.

Experimental studies use manipulations of a whole ecosystem to identify and elucidate ecological mechanisms.

Modelling and computer simulation studies are used to work out ecosystem management plans and to derive eco-technological tools for goal-oriented control actions.

Information systems and decision support systems studies support industrial, agricultural and administrative ecological decisions and are used to work out medium-term and long-term development plans for ecological management.

Like many words for which people have an intuitive understanding, a system is difficult to define precisely. In relation to the physical and biological sciences, a system is an organised collection of interrelated physical components characterised by a boundary and functional unity. A system is a collection of communicating materials and processes that together perform some set of functions. A system is an interlocking complex of processes characterised by many reciprocal cause-effect pathways.

A system is a set of interrelated objects (elements, parts) that have certain general properties:

1. It fulfils a certain function, i.e. it can be defined by a system purpose recognisable by an observer.

2. It has a characteristic constellation of essential system elements and an essential system structure which determine its function, purpose, and identity.

3. It loses its identity if it is destroyed.

Analysis and control of aquatic ecosystems are often complicated because of the high number of system elements and of interrelationships between system elements and between an ecosystem and its environment. Mostly, an ecosystem is analysed as one unit. Dynamic processes within ecosystems are initiated by switching processes of input and state variables with different transfer time constants (fig. 1). If these processes are overlaid by external and internal disturbances, it cannot be determined which part of the ecosystem response, and how much of its intensity, stems from a single ecological element.

For ecosystem analysis, the complex structure of an ecosystem requires its decomposition and the linearisation of nonlinear interrelationships. The controllability of ecosystems has to refer to different working elements (or subsystems) and system states. Therefore, the whole ecosystem is divided into several subsystems with internal and external feedbacks. This leads to uncertain statements on the ecosystem behaviour. The quality of ecosystem analysis depends on the flexibility of the mathematical models used for computation. A restricted information structure and the aggregation of information lead to model errors.

Figure 1: Switching processes within a freshwater ecosystem

Ecosystems are multidimensional systems with several input and output variables. They can be seen as black box, grey box or white box systems. Depending on the number of input and output variables, SISO, SIMO, MISO and MIMO systems are distinguished.

Ecosystems can be considered as stochastic transfer systems described by their state variables and parameters. They are characterised by measurable inputs, unmeasurable (stochastic) disturbances and measurement errors. In real systems, disturbances, input signals and measurement errors are superimposed and produce disturbed (and uncertain) output signals.

Transfer functions are represented by

1. Pulse function: $x(t) = 0$ for $t < 0$ and $t > T$, $x(t) = x_0$ for $0 \le t \le T$,

2. Jump function: $x(t) = x_0 \sigma(t)$ with $\sigma(t) = 0$ for $t < 0$ and $\sigma(t) = 1$ for $t \ge 0$,

3. Harmonic function: $x(t) = x_0 \cos(\omega t + \varphi)$ for $-\infty < t < +\infty$, or $x(t) = x_0 e^{j(\omega t + \varphi)} = x_0^{+} e^{j\omega t}$ with $x_0^{+} = x_0 e^{j\varphi}$,

4. White noise function.

Other transfer functions are

1. Exponential function: $x(t) = x_0 e^{-t/T}$ for $0 \le t < +\infty$, or $x(t) = x_0^{+} e^{j\omega t} e^{\delta t}$ for $0 \le t < +\infty$ and $\delta \le 0$,

2. Periodic function: $x(t) = a_0/2 + \sum_i a_i \cos(i \omega_0 t) + \sum_i b_i \sin(i \omega_0 t)$ or $x(t) = \sum_i c_i e^{j i \omega_0 t}$,

3. Dirac impulse: $x(t) = \delta(t)$ with $\delta(t) = 0$ for $t \ne 0$ and $\int \delta(t)\,dt = 1$,

4. Ramp function: $x(t) = 0$ for $t < 0$ and $x(t) = at$ for $t \ge 0$,

5. Time discrete signal: $\tilde{x}(t) = \sum_k x(kT)\,\delta(t - kT)$ with $k = 0, 1, 2, \ldots$ and $T \le 1/(2 f_{max})$, where $f_{max}$ is the maximum frequency contained in the data series.
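The signal types listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the default parameter values and the annual-cycle example are hypothetical and do not come from the lecture notes.

```python
import math


def pulse(t, x0=1.0, T=1.0):
    """Pulse function: x0 for 0 <= t <= T, otherwise 0."""
    return x0 if 0.0 <= t <= T else 0.0


def jump(t, x0=1.0):
    """Jump (step) function: 0 for t < 0, x0 for t >= 0."""
    return x0 if t >= 0.0 else 0.0


def ramp(t, a=1.0):
    """Ramp function: 0 for t < 0, a*t for t >= 0."""
    return a * t if t >= 0.0 else 0.0


def harmonic(t, x0=1.0, omega=2.0 * math.pi, phi=0.0):
    """Harmonic function x0*cos(omega*t + phi)."""
    return x0 * math.cos(omega * t + phi)


def max_sampling_interval(f_max):
    """Largest sampling interval T that satisfies T <= 1/(2*f_max)."""
    return 1.0 / (2.0 * f_max)


# Hypothetical example: the fastest cycle of interest is annual,
# i.e. f_max = 1/365 per day.
T_max_days = max_sampling_interval(1.0 / 365.0)
```

For the hypothetical annual cycle, the sampling condition allows at most about half a year between samples; in practice much shorter intervals are chosen so that faster processes are also resolved.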

Feedback structures (or couplings) within ecosystems are given by simple feed-forward, feedback, self-tuning or more complicated couplings between the ecosystem elements.

1.2 Statistical management of ecological data

To handle and investigate hydrological data meaningfully, they should be characterised by some relationships. The information content of hydrological data increases with the number of data operations that are permitted. Four scales can be distinguished (fig. 2).

Figure 2: Data scales in hydrological research (the information content increases from the nominal scale through the ordinal and interval scales to the ratio scale)

Transformations from one data scale to another serve to unify variables (tab. 2). The information content (knowledge, the antithesis of uncertainty) and the scale level should not be changed during sampling and/or statistical data analysis. If there is no empirical equivalence scale, the data are treated as comparable.

Table 2: Comparison of data scales in hydrology

Scale | Arithmetic operation | Statistical measure
Ratio scale | +, -, ×, / | Geometric mean
Interval scale | +, - | Arithmetic mean
Ordinal scale | none | Median, quartiles
Nominal scale | none | Frequencies only

Nominal scale:

No relationship exists between events; they are sometimes coded by numbers (e.g. lottery, pie charts). No arithmetic operation is possible.

Ordinal scale:

Ranking of events or representations and classification of environmental indicators (e.g. EU water quality classes, soil classes). Ordinal comparisons are possible (Class I > Class II), as is the estimation of median and quartiles.

Interval scale:

Ordinal scale with equal intervals (e.g. water temperature). Statements on distances and differences between data are allowable. No natural origin (zero point) exists.

Ratio scale:

Interval scale with a natural origin; it allows statements on ratios (e.g. concentrations).
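The link between scale level and admissible statistical measure (table 2) can be illustrated with a small Python sketch. The helper `summarise` and all data values are hypothetical; the choice of `median_low` for ordinal data is an assumption made here so that the result is always one of the observed classes.

```python
import statistics
from collections import Counter


def summarise(values, scale):
    """Return the measure that Table 2 associates with the given scale."""
    if scale == "nominal":
        # Frequencies only
        return dict(Counter(values))
    if scale == "ordinal":
        # Median of the ranked classes (an observed class, never an average)
        return {"median": statistics.median_low(values)}
    if scale == "interval":
        return {"arithmetic_mean": statistics.mean(values)}
    if scale == "ratio":
        return {"geometric_mean": statistics.geometric_mean(values)}
    raise ValueError("unknown scale: " + scale)


# Hypothetical data for each scale level
water_body = ["lake", "river", "lake"]   # nominal
quality_class = [1, 2, 2, 3, 4]          # ordinal (coded quality classes)
water_temp = [4.0, 8.0, 12.0]            # interval (degrees Celsius)
concentration = [1.0, 4.0]               # ratio (mg/l)

print(summarise(water_body, "nominal"))
print(summarise(quality_class, "ordinal"))
print(summarise(water_temp, "interval"))
print(summarise(concentration, "ratio"))
```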

One of the most important characteristics of hydrological data is their uncertainty, which can be characterised as a state or condition of incomplete or unreliable knowledge. Sources of uncertainty are:

1. Statistical analysis depends on the a-priori information on the essential hydrological variables considered.

2. Hydrological variables and their rates of change have different scales in time and space.

3. Mostly, only a small set of representative data is available.

The strength of the disturbances of the observed data leads to fuzzy interpretations.

Figure 3 shows different types of annual water quality data series which can

be distinguished by their statistical measures:

Figure 3: Data series of water quality samples (nine panels plotted over time in months, January to December: TW (°C), NH4 (mg/l), Lf (mS/cm), pH value, NO2 (mg/l), o-PO4-P (mg/l), O2 (mg/l), NO3 (mg/l), DOC (mg/l))

The quality and usability of hydrological data usually depend strongly on the suitability of the sample and the adequacy of the sampling or monitoring program. The goal of sampling is to obtain information about the frequency distributions of data indicating environmental states, or about the distribution parameters. These estimates are called sample statistics and form a basis for prognoses of environmental developments in general, but also of hydrological changes. If an investigation is based on samples, the sampling statistics depend on the particular sampling environment, on stationary or non-stationary external or internal effects, and on random influences. The sampling frequency depends on the hydrologic process dynamics, on the degree of water pollution, on the type of pollution, and on the type of substance. Different results may be obtained if different samples are selected. This variation in the data from sample to sample is called sampling variability. The difference between a statistic and the true population value is called sampling error. It increases if more random factors influence the sampling procedure. There is a margin of uncertainty, expressed in terms of the sampling variance of the estimator. Sampling variance is a measure of the precision of the estimates.

Comparison of hydrological data series:

1. Average is time dependent, dispersion is approximately time constant.

2. Average is approximately time constant, dispersion is time dependent.

3. Average and dispersion are time dependent.

Variability within data series is caused by:

1. Environmental influences or factors,

2. Intrinsic factors between water samples,

3. Different sample treatment,

4. Different data treatment.

1.3 Sampling strategies

Ecological data are obtained by field samples and/or laboratory analysis. They are directly observed (direct observations) or indirectly observed (due to calibration of analytical instruments and sensors). Summary data are derived from statistics or from indicators with restricted observability. Simulated data are obtained by simulation models.

Sampling design is based on different procedures. The most commonly used designs are

1. Systematic (periodic) sampling (yearly, monthly, weekly, daily or hourly).

2. Sampling based on the level of admissible error of the annual mean.

3. Random sampling.

4. Sample size for normally distributed data without trend and periodicities: $n = \left( t(95) \cdot v / e(x^*) \right)^2$ with $t(95) = 1.96$, the coefficient of variation $v = (s / x^*) \cdot 100$ in per cent, and $e(x^*) = 10\,\%$ as the allowable deviation from the mean.
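The sample size formula in item 4 can be evaluated as follows. This is a minimal sketch: the function name and the example values are hypothetical, v is taken to be the coefficient of variation in per cent, and the result is rounded up to the next whole sample (an assumption, since the notes do not state a rounding rule).

```python
import math


def required_sample_size(mean, std, allowable_dev_percent=10.0, t=1.96):
    """n = (t * v / e)^2 with v = std/mean*100 (coefficient of variation, %)."""
    v = std / mean * 100.0
    n = (t * v / allowable_dev_percent) ** 2
    # Round up: a fraction of a sample cannot be taken (assumption).
    return math.ceil(n)


# Hypothetical indicator with mean 10 mg/l and standard deviation 5 mg/l
n_samples = required_sample_size(mean=10.0, std=5.0)
print(n_samples)
```

The required number of samples grows with the square of the coefficient of variation, so strongly fluctuating indicators need far more samples than stable ones for the same allowable deviation.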


The sampling location in space and time can have a substantial effect on the quality and usefulness of data in hydrology. Site selection should be made primarily on the basis of the goal of the study and of the nature of the hydrologic process or phenomenon under consideration. The optimum number of samples, the frequency of sampling and the spacing can be estimated by preliminary sampling experiments, by conclusions from expert knowledge, by practical experience, or by statistical sampling design formulas and methods. Geostatistical methods can help to determine the optimal spatial distribution of sampling points.

The sampling procedure covers three parts:

1. Hypothesis (program purpose, sampling design, formulation of questions),

2. Observation (sampling techniques, sampling protocol, analytical techniques),

3. Interpretation (data analysis, interpretation of results).

Recommendations for hydrological sampling:

1. The goals and needs for hydrological data collection should be formulated explicitly for each application before sampling is started.

2. Prior knowledge of factors that affect the hydrological variables to be sampled should be available.

3. During sampling, significant changes of external and internal driving forces should not take place.

4. Existing estimates may be sufficient if they were obtained by an unbiased sampling design.

5. Sampling design in hydrology should cover the water budget (surface and groundwater), hydrochemical variables (organic and inorganic substances, metabolites), hydrophysical variables (considering internal and external driving forces), hydrobiological variables (life cycle of plants and organisms, conversion of organic and inorganic substances), microbiological variables, and other variables as required.

Disturbances of data analysis:

1. Only small sets of representative, regularly sampled data are available.

2. The power of external and internal driving forces on water quality (hydrological) indicators influences the quality of the data to be obtained.

3. The a-priori process information on water quality (hydrological) indicators is low.

4. Water quality (hydrologic) processes possess different rate constants.

1.4 Re-sampling and pre-treatment of data

Series of measurements of hydrological data are time series recorded at discrete points in time, often with unequal sampling intervals. In practice, they often contain missing data or are based on different sampling intervals in time and space. To extract hydrologic process information from single data (events), the data series should be completed and based on a regular sampling grid. The application of static and dynamic statistical methods to such data sets requires equidistant data. Re-sampling generally means data interpolation or, in the case of noisy information, data approximation. Figure 4 gives an overview of these procedures.

Figure 4: Interpolation, approximation and digital filtering of data (flow chart elements: raw hydrological data; interpolation; equidistant data; approximation (static, dynamic; functional relationship); digital data filtering (high pass, low pass); consistent data)
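Re-sampling onto a regular grid can be sketched with piecewise linear interpolation, one of the simplest interpolation methods. The function name `resample_linear` and the example values are hypothetical; the sketch assumes strictly increasing sampling times and data without strong noise (noisy data would call for approximation instead).

```python
def resample_linear(times, values, step):
    """Interpolate irregularly sampled data onto an equidistant grid.

    times must be strictly increasing; the grid starts at times[0]
    and advances by step until the last sampling time is passed.
    """
    grid, resampled = [], []
    j = 0  # index of the left neighbour of the current grid point
    k = 0  # grid point counter
    t = times[0]
    while t <= times[-1] + 1e-12:
        # Move to the interval [times[j], times[j+1]] that contains t
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        w = (t - t0) / (t1 - t0)
        resampled.append(values[j] + w * (values[j + 1] - values[j]))
        grid.append(t)
        k += 1
        t = times[0] + k * step
    return grid, resampled


# Hypothetical series with a gap between day 1 and day 4
days = [0.0, 1.0, 4.0]
obs = [0.0, 2.0, 8.0]
grid, filled = resample_linear(days, obs, step=1.0)
```

The resulting series is equidistant, which is the prerequisite named above for applying static and dynamic statistical methods.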


The goal of applying interpolation and approximation methods to incomplete time series is to fill the intervals between two grid points so that series of measurements with small, uniform sampling intervals are obtained. Table 3 contains some commonly used interpolation methods.

Table 3: Interpolation methods

Method Algorithm Characteristics

+

+ 3,x*, s.

The test statistic is $r = \frac{|x^{+} - x^*|}{s} \sqrt{\frac{n}{n-1}}$, where $x^{+}$ is the value suspected to be an outlier, $x^*$ is the mean of the sample, $s$ is the standard deviation of the sample, and $n$ is the sample size. Choice of significance level: $\alpha = 0.05$; degrees of freedom: $f = n - 2$.

Decision: The value is accepted if $r_{calc} < r_{tab}$, otherwise it is rejected (cf. table 10).

Table 10: Table of r test (according to Kaiser and Gottschalk 1974)

f = n - 2 | P(95) | P(99) | P(99.9)
1 | 1.409 | 1.414 | 1.414
2 | 1.645 | 1.715 | 1.730
3 | 1.757 | 1.918 | 1.982
4 | 1.814 | 2.051 | 2.178
5 | 1.848 | 2.142 | 2.329
6 | 1.870 | 2.208 | 2.447
7 | 1.885 | 2.256 | 2.540
8 | 1.895 | 2.294 | 2.616
9 | 1.903 | 2.324 | 2.678
10 | 1.910 | 2.348 | 2.730
12 | 1.920 | 2.385 | 2.812
14 | 1.926 | 2.412 | 2.874
16 | 1.931 | 2.432 | 2.291
18 | 1.935 | 2.447 | 2.205
20 | 1.937 | 2.460 | 2.990
50 | 1.951 | 2.529 | 3.166
100 | 1.956 | 2.553 | 3.227
200 | 1.958 | 2.564 | 3.265
300 | 1.958 | 2.566 | 3.271
500 | 1.959 | 2.570 | 3.279
700 | 1.959 | 2.572 | 3.283
∞ | 1.960 | 2.576 | 3.291


Example:

Laboratory analysis of water quality yields a small data set of BOD values: $x_1 = 30.4$ mg/l, $x_2 = 30.1$ mg/l, $x_3 = 30.5$ mg/l, $x_4 = 30.9$ mg/l, $x_5 = 29.2$ mg/l. The last value is suspected to be an outlier, which would mean that the data set is inhomogeneous.

$x^* = 30.2$; $s = 0.638$; $n = 5$

Test statistic: $r = \frac{|29.2 - 30.2|}{0.638} \sqrt{\frac{5}{5-1}} = \frac{1.0}{0.638} \sqrt{\frac{5}{4}} = 1.567 \cdot 1.118 = 1.752$

Comparison of $r_{calc}$ and $r_{tab}$ for $f = n - 2 = 3$: $r_{calc} = 1.752$; $r(95) = 1.757$; $r(99) = 1.918$; $r(99.9) = 1.982$.

Decision: If $r_{calc} < r_{tab}$, then accept $x_5$: $1.752 < 1.757$.

Result and interpretation: The value $x_5$ is not an outlier and belongs to the data set. The data set itself seems to be homogeneous. If a value has been identified as an outlier, the average and variance have to be re-calculated and the test has to be repeated.
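The worked example can be reproduced with a short Python sketch. The function name `r_statistic` is hypothetical; the inputs are the rounded values quoted in the example (mean 30.2, s = 0.638) and the critical values are copied from the f = 3 row of table 10.

```python
import math

# Critical values r_tab for f = 3, copied from Table 10
R_TAB_F3 = {"95": 1.757, "99": 1.918, "99.9": 1.982}


def r_statistic(x_suspect, mean, s, n):
    """r = |x+ - x*| / s * sqrt(n / (n - 1))."""
    return abs(x_suspect - mean) / s * math.sqrt(n / (n - 1))


# Rounded sample statistics as used in the example
r_calc = r_statistic(x_suspect=29.2, mean=30.2, s=0.638, n=5)

# The suspect value is accepted if r_calc < r_tab
is_outlier = r_calc >= R_TAB_F3["95"]
print(round(r_calc, 3), is_outlier)
```

The example is a borderline case, so the decision is sensitive to rounding: with the unrounded mean of the five values (30.22) the statistic rises to about 1.79, which would exceed the 95 % critical value of 1.757.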


4. Linear regression and correlation analysis

A regression analysis is required for problems in which stochastic dependencies (stochastic cause-effect relationships) have to be described by functions of one or more variables. Linear regression analysis is one of the best studied statistical methods. The goal of a simple or multiple linear regression analysis is the determination of a linear relationship between two or more measurable (or observable) variables or characteristics X and Y of a hydrological system. A set of measurements of size $n$ consists of $n$ pairs of data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ (or n-tuples of data), which can be considered as realisations of a two-dimensional (or multi-dimensional) random vector (X, Y).

4.1 Steps of linear regression

1. Step: Scatter-plot of variables of interest (fig. 10).

Figure 10: Scatterplot of hydrological variables (pH plotted against Temp)

2. Step: Estimate the relationship (positive or negative) between variables.

Directions of relationships

1. Positive relationship: Increasing values of X and increasing values of Y.

2. Negative relationship: Increasing values of X and decreasing values of Y.

3. No relationship between X and Y (e. g. parallels to the axes).

The relationships can be strong or weak.


3. Step: Formulate the (linear) model equation (fig. 11):

pH = 7.868 + 0.025 Temp

Figure 11: Linear relationship between variables (linear regression between Temp and pH; observed values and fitted line)
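A minimal sketch of how the coefficients of a linear model equation can be estimated, assuming the ordinary least-squares method (the notes do not name the estimation method explicitly). The function names and the test data are hypothetical; only the coefficients 7.868 and 0.025 are taken from the model equation in step 3.

```python
def fit_line(x, y):
    """Least-squares estimates (a, b) of the model y = a + b*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


def predict_ph(temp, a=7.868, b=0.025):
    """Evaluate the fitted model pH = 7.868 + 0.025 * Temp from step 3."""
    return a + b * temp


# Hypothetical data lying exactly on the line y = 1 + 2x
a_hat, b_hat = fit_line([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
ph_at_20 = predict_ph(20.0)
```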

4.2 Confidence region of regression line

4. Step: Calculate the confidence region of the regression line.

The general model of linear regression is given by $y = a + bx$. Using the confidence intervals of $a$ and $b$, a confidence region of the (mean) linear model $E(Y) = a + bx$ can be defined by $g_u < E(Y) < g_o$, where $g_u = y^* - s_{y^*} t$ and $g_o = y^* + s_{y^*} t$. The confidence limits are symmetric hyperbolas around the linear regression model $y^* = a^* + b^* x$. The band is narrowest at $x = x^*$ and widens for all other $x$ values, so the confidence statements become fuzzier with increasing distance from $x^*$. The width $L$ of the confidence band depends on $s_{y^*}$ and can be calculated by $L = 2 s_{y^*} t$.
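The confidence band can be sketched as follows. The notes do not define $s_{y^*}$, so this example assumes the usual textbook expression for the standard error of the fitted mean, $s_{y^*} = s_{res} \sqrt{1/n + (x - x^*)^2 / \sum (x_i - x^*)^2}$ with the residual variance computed on $n - 2$ degrees of freedom. The function name, the data and the quantile t = 3.182 (two-sided 95 % value of the t distribution for 3 degrees of freedom) are illustrative choices.

```python
import math


def confidence_band(x, y, x_new, t):
    """Lower and upper confidence limits (g_u, g_o) of E(Y) at x_new."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_mean - b * x_mean
    # Residual variance with n - 2 degrees of freedom (assumed definition)
    s_res2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    # Standard error of the fitted mean at x_new (assumed definition of s_y*)
    s_y = math.sqrt(s_res2 * (1.0 / n + (x_new - x_mean) ** 2 / sxx))
    y_hat = a + b * x_new
    return y_hat - t * s_y, y_hat + t * s_y


# Hypothetical data set with n = 5, hence f = n - 2 = 3
x_obs = [0.0, 1.0, 2.0, 3.0, 4.0]
y_obs = [1.1, 2.9, 5.2, 6.8, 9.1]
g_u, g_o = confidence_band(x_obs, y_obs, x_new=2.0, t=3.182)
band_width = g_o - g_u  # L = 2 * s_y* * t at the mean of x
```

Evaluating the function at several values of `x_new` shows the hyperbolic shape described above: the band is narrowest at the mean of the x values and symmetric around it.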

4.3 The power of linear regression

The strength of a relationship is expressed by the empirical (linear) correlation coefficient $r = \frac{\sum (x_i - x^*)(y_i - y^*)}{\sqrt{\sum (x_i - x^*)^2 \sum (y_i - y^*)^2}}$. By means of this formula (for an explanation see chapter 4.4) the next step of the linear regression procedure is derived.

5. Step: Calculate the power of the relationship: $r = 0.493$ or $B = r^2 = 0.243$.

The calculation algorithm is presented in chapter 4.4.

To derive statistical characteristics of a linear regression model the following

cases should be distinguished:

1. b high, r high, s low,

2. b high, r low, s high,

3. b low, r low, s low,

4. b low, r very low, s high.

4.4 Empirical Covariance and statistical measures of correlation

A correlation analysis answers the question about the strength and direction of a linear (but not strictly functional) relationship between two or more variables. The power or intensity of such a relationship is expressed by correlation. Measures of correlation are the correlation coefficient $r$, the performance index $B = r^2$ and the partial correlation coefficient $r_{xy,z}$.

Combining data series of different water quantity or water quality variables that refer to two or more measurable characteristics yields sets of pairs of data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ or n-tuples of data (fig. 12).

Figure 12: Scatterplot of a bivariate relationship (Y plotted against X).

These sets of data can be seen as realisations of a two- or multi-dimensional stochastic vector (X, Y, …). A normal probability distribution of the data pairs or data tuples is a (strong) prerequisite.

A visualisation of a relationship between three variables is possible but in some cases not really helpful. The information content is high but cannot be extracted very clearly (fig. 13).

Figure 13: 3-D scatterplot of variables (axes X, Y and Z)

Such relationships are characterised by statistical measures which are denoted as correlation measures. In principle, arithmetic means and empirical variances of the data series are used:

$x^* = \frac{1}{n} \sum x_i$ and $y^* = \frac{1}{n} \sum y_i$

$s_x^2 = \frac{1}{n-1} \sum (x_i - x^*)^2$ and $s_y^2 = \frac{1}{n-1} \sum (y_i - y^*)^2$.

A new data series with $n$ pairs of data $(x_i, y_i)$, $i = 1, \ldots, n$, is formed by two variables {X} and {Y}. The empirical covariance $s_{xy}$ is calculated as follows:

$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - x^*)(y_i - y^*) = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i y_i - n x^* y^* \right)$.

$s_{xy}$ can be positive or negative. For small values of $x_i$, the difference $x_i - x^*$ is negative; for large values of $x_i$, it is positive. The same holds for the data $y_i$. For this reason, a negative covariance characterises a relationship in which large values $x_i$ are mostly connected with small values $y_i$ and vice versa.

By normalisation of $s_{xy}$ with the empirical standard deviations $s_x$ and $s_y$ one gets the empirical coefficient of correlation $r_{xy}$:

$r_{xy} = \frac{s_{xy}}{s_x s_y}$.

Because $s_{xy} = s_{yx}$, also $r_{xy} = r_{yx}$ is valid. $r_{xy}$ is a measure of the strength and direction of a linear relationship between the hydrological variables X and Y.
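The two equivalent forms of the empirical covariance and the resulting correlation coefficient can be checked numerically. This is a minimal sketch; the function names and the data are hypothetical.

```python
import math


def covariance(x, y):
    """s_xy = 1/(n-1) * sum((x_i - x*)(y_i - y*))."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / (n - 1)


def covariance_shortcut(x, y):
    """s_xy = 1/(n-1) * (sum(x_i*y_i) - n*x* *y*)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    return (sum(xi * yi for xi, yi in zip(x, y)) - n * x_mean * y_mean) / (n - 1)


def correlation(x, y):
    """r_xy = s_xy / (s_x * s_y)."""
    s_x = math.sqrt(covariance(x, x))  # s_xx equals the variance of x
    s_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (s_x * s_y)


# Hypothetical data: y increases linearly with x, so r_xy should be +1
x_data = [1.0, 2.0, 3.0, 4.0]
y_data = [2.0, 4.0, 6.0, 8.0]
r_xy = correlation(x_data, y_data)
```

Reversing the order of `y_data` turns the relationship into a perfectly decreasing one, and the covariance and the correlation coefficient change sign accordingly.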

Statistical measures of correlation between two or more hydrological variables are mainly based on the assumption that the data sets are subsets of Gaussian distributed data sets. The rank correlation procedure works without assuming a normal probability distribution of the data set to be analysed.

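Empirical bivariate correlation coefficient

$r = \frac{\sum (x_i - x^*)(y_i - y^*)}{\sqrt{\sum (x_i - x^*)^2 \sum (y_i - y^*)^2}}$

Performance index (coefficient of determination)

$B = r^2$

Partial correlation coefficients

$r_{xy,z} = \frac{r_{xy} - r_{xz} r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}$

$r_{xz,y} = \frac{r_{xz} - r_{xy} r_{yz}}{\sqrt{(1 - r_{xy}^2)(1 - r_{yz}^2)}}$

$r_{yz,x} = \frac{r_{yz} - r_{xy} r_{xz}}{\sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}}$

Multiple correlation coefficients

For three variables x, y, z with $x = f(y, z)$:

$R_{x.yz} = \sqrt{\frac{r_{xy}^2 + r_{xz}^2 - 2 r_{xy} r_{xz} r_{yz}}{1 - r_{yz}^2}}$

Multiple performance index

$B_{x.yz} = \frac{r_{xy}^2 + r_{xz}^2 - 2 r_{xy} r_{xz} r_{yz}}{1 - r_{yz}^2}$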
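The partial and multiple correlation measures can be computed directly from the three pairwise correlation coefficients. A minimal sketch with hypothetical function names and hypothetical input values:

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """r_xy,z: correlation of x and y with the influence of z removed."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def multiple_performance_index(r_xy, r_xz, r_yz):
    """B_x.yz for x = f(y, z)."""
    return (r_xy ** 2 + r_xz ** 2 - 2 * r_xy * r_xz * r_yz) / (1 - r_yz ** 2)


def multiple_correlation(r_xy, r_xz, r_yz):
    """R_x.yz is the square root of the multiple performance index."""
    return math.sqrt(multiple_performance_index(r_xy, r_xz, r_yz))


# Hypothetical pairwise correlation coefficients
r_xy, r_xz, r_yz = 0.8, 0.6, 0.5
print(partial_correlation(r_xy, r_xz, r_yz))
print(multiple_correlation(r_xy, r_xz, r_yz))
```

The other two partial coefficients follow by permuting the arguments, for example `partial_correlation(r_xz, r_xy, r_yz)` for $r_{xz,y}$.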

SPEARMAN's rank correlation

(Valid for small sample sizes; a normal probability distribution is not necessary)

$r_S = 1 - \frac{6 \sum_{i=1}^{n} (R(x_i) - R(y_i))^2}{n (n^2 - 1)} = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n (n^2 - 1)}$

Table 11 contains data and an explanation of the ranking procedure for a SPEARMAN test.

Table 11: Data and procedure of rank correlation

xi     R(xi)   yi    R(yi)   Di      Di2
0.5    5.5     4     3       2.5     6.25
0.8    7.5     6     5       2.5     6.25
1.1    10      2     1       9       81
0.5    5.5     10    8       -2.5    6.25
0.4    4       8     6       -2      4
0.3    2       12    10      -8      64
0.9    9       5     4       5       25
0.8    7.5     3     2       5.5     30.25
0.3    2       9     7       -5      25
0.3    2       11    9       -7      49
Sum                                  297

Result: rS = -0.8
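The ranking procedure of table 11 can be retraced with a short sketch. Tied values receive the mean of the ranks they occupy, which reproduces the ranks 5.5, 7.5 and 2 in the table. The function names are chosen for this illustration only.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1   # positions are 0-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def spearman(x, y):
    """Return (r_S, sum of squared rank differences)."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1)), d2

# Data of table 11
x = [0.5, 0.8, 1.1, 0.5, 0.4, 0.3, 0.9, 0.8, 0.3, 0.3]
y = [4, 6, 2, 10, 8, 12, 5, 3, 9, 11]

r_s, sum_d2 = spearman(x, y)
print(sum_d2, round(r_s, 4))   # 297.0 -0.8
```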

Comparison of rS and rStab (positive values only):

For n ≤ 30 the table of probability values of rS has to be used. For n > 30 the table of the standardised normal probability distribution should be used:

rSTab(95) = 0.5515; rSTab(99) = 0.7333; rSTab(99.9) = 0.8667.

Decision: If |rS| < rStab, the correlation is rejected as not significant. In the example, |rS| = 0.8 exceeds the tabulated values for the 95% and 99% levels, but not the value for the 99.9% level.

Result and interpretation: A relatively strong negative correlation exists between both data sets.

5. Nonlinear regression analysis

If a linear regression model is not valid or is insufficient, other regression models should be tested. From this statement the following step of the (linear) regression procedure is derived:

Step 6: Find other model types if the linear model is insufficient (fig. 13).

Figure 14 shows some standard nonlinear regression models computed by means of SPSS. The results are presented in table 12.

[Figure 14: pH (axis 7.6 to 9.0) versus temperature (axis 0.0 to 30.0); observed data with linear, logarithmic, inverse, quadratic, cubic, compound, power, S-shaped, growth, exponential and logistic curves]

Figure 14: Linear and nonlinear regression curves

Table 12: Results of nonlinear regression models

Model   B       b0       b1        b2        b3
LIN     0.243   7.8683    0.0252
LOG     0.198   7.6798    0.2194
INV     0.158   8.3568   -1.2875
QUA     0.308   8.1097   -0.0310    0.0022
CUB     0.432   7.4486    0.2131   -0.0191   0.0005
COM     0.238   7.8731    1.0030
POW     0.194   7.6972    0.0262
S       0.156   2.1219   -0.1544
GRO     0.238   2.0635    0.0030
EXP     0.238   7.8731    0.0030
LGS     0.238   0.1270    0.9970

Comparing the performance indexes of these standard models, the best statistical model is the cubic one. But this model explains only 43.2% of the variation in the data cloud. The remaining 56.8% are not described by the model. As an overall outcome of this analysis, all of these models should be rejected and other types of nonlinear models should be investigated.
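The coefficients of table 12 can be turned into predictions with a few lines of code. The sketch below assumes that the coefficients follow the usual model forms of the SPSS curve estimation procedure (for example compound: y = b0 * b1^x, growth: y = exp(b0 + b1 x), S: y = exp(b0 + b1 / x)); this assignment is an assumption and is not stated in the table. The logistic model is left out because its form needs an upper bound that the table does not list.

```python
import math

# name: (performance index B, model function of temperature x); values from table 12
MODELS = {
    "LIN": (0.243, lambda x: 7.8683 + 0.0252 * x),
    "LOG": (0.198, lambda x: 7.6798 + 0.2194 * math.log(x)),
    "INV": (0.158, lambda x: 8.3568 - 1.2875 / x),
    "QUA": (0.308, lambda x: 8.1097 - 0.0310 * x + 0.0022 * x**2),
    "CUB": (0.432, lambda x: 7.4486 + 0.2131 * x - 0.0191 * x**2 + 0.0005 * x**3),
    "COM": (0.238, lambda x: 7.8731 * 1.0030**x),
    "POW": (0.194, lambda x: 7.6972 * x**0.0262),
    "S": (0.156, lambda x: math.exp(2.1219 - 0.1544 / x)),
    "GRO": (0.238, lambda x: math.exp(2.0635 + 0.0030 * x)),
    "EXP": (0.238, lambda x: 7.8731 * math.exp(0.0030 * x)),
}

temperature = 20.0  # example temperature inside the range of figure 14
for name, (b, model) in MODELS.items():
    print(f"{name}: B = {b:.3f}, pH({temperature}) = {model(temperature):.3f}")

best = max(MODELS, key=lambda name: MODELS[name][0])
print("highest performance index:", best)
```

Under these assumed forms the compound, growth and exponential models give practically the same curve, which agrees with their identical performance index of 0.238.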

5.1 Polynomial regression

The basic model is given by $y = a_0 + \sum_{i=1}^{n} a_i x^i$, where n is called the order of the polynomial. Figure 15 shows polynomials of different order. Each of the polynomials represents the given data set with a relatively high degree of performance. For 6th and 7th order polynomials the performance will be B = 1.

Figure 15: Examples of polynomial regression


By comparing the graphs different interpretations are possible. For the polynomial of 7th order the graph indicates negative values which do not exist. The advantage of polynomial regression is that it provides an algorithm for calculating the existing nonlinear relationship between hydrological variables. Disadvantages are the high number of coefficients and sometimes physically unrealistic results. The best models are not necessarily the ones whose graphs pass through all data points.
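The effect of the polynomial order on the performance index can be reproduced with a small least-squares sketch. The data values are hypothetical, and the helper names polyfit and performance_index are chosen for this illustration only. With five data points a 4th order polynomial passes through every point, so B reaches 1 although nothing has been learned about the process.

```python
def polyfit(x, y, order):
    """Least-squares coefficients a0..a_order, solved via the normal equations."""
    m = order + 1
    A = [[float(sum(xi ** (j + k) for xi in x)) for k in range(m)] for j in range(m)]
    b = [float(sum(yi * xi ** j for xi, yi in zip(x, y))) for j in range(m)]
    for col in range(m):                       # Gaussian elimination with pivoting
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    a = [0.0] * m
    for r in range(m - 1, -1, -1):             # back substitution
        a[r] = (b[r] - sum(A[r][c] * a[c] for c in range(r + 1, m))) / A[r][r]
    return a

def performance_index(x, y, coeffs):
    """B = 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y) / len(y)
    fitted = [sum(c * xi ** i for i, c in enumerate(coeffs)) for xi in x]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical data set with five observations
x = [0, 1, 2, 3, 4]
y = [2.1, 1.4, 1.9, 0.8, 1.6]

B = {order: performance_index(x, y, polyfit(x, y, order)) for order in range(1, 5)}
for order, value in B.items():
    print(f"order {order}: B = {value:.4f}")
```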

Other model types used in water quality management are multiple linear or nonlinear regression models (e. g. DO(t) = a0 + a1TW + a2Q + a3BOD or DO(t) = a0 + a1TW + a2Q + a3BOD + a4TW + a5Q + a6BOD + a7TW) or models derived from control theory (e. g. stochastic transfer method). A continuous dynamic process is described by a time-discrete model by applying the z-transformation to a difference equation, $G(z) = B(z^{-1})/A(z^{-1})$ plus an additive stochastic term.

5.2 Periodic regression

The basic relationship is given by y = a + b1 sin x + b2 cos x. The equation represents the simplest form of periodic regression, the so-called Fourier polynomial. In an extended form this method is called Fourier analysis (see chapter 7). In figure 16 the water temperature of a reservoir at three depth levels (0 m, 10 m, 25 m) and the approximating graphs are presented.
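A minimal sketch of the periodic regression y = a + b1 sin x + b2 cos x is given below. It assumes that the observations are equally spaced over exactly one cycle; in that case sine and cosine are orthogonal on the sampling grid and the least-squares coefficients reduce to simple sums. The monthly temperatures are hypothetical.

```python
import math

def periodic_regression(y):
    """Least-squares a, b1, b2 for y = a + b1*sin(x) + b2*cos(x).

    Assumes len(y) >= 3 values equally spaced over exactly one cycle,
    x_k = 2*pi*k/N.
    """
    n = len(y)
    x = [2 * math.pi * k / n for k in range(n)]
    a = sum(y) / n
    b1 = 2 / n * sum(yk * math.sin(xk) for xk, yk in zip(x, y))
    b2 = 2 / n * sum(yk * math.cos(xk) for xk, yk in zip(x, y))
    return a, b1, b2

# Hypothetical monthly mean water temperatures (deg C), January to December
temps = [3.1, 3.4, 5.8, 9.9, 14.6, 18.7, 20.9, 20.4, 17.0, 12.2, 7.5, 4.3]

a, b1, b2 = periodic_regression(temps)
fitted = [a + b1 * math.sin(2 * math.pi * k / 12) + b2 * math.cos(2 * math.pi * k / 12)
          for k in range(12)]
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```

For irregularly spaced data the three coefficients would have to be estimated by a general least-squares solution instead of these sums.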

Figure 16: Periodic regression of water temperature in a reservoir


It can clearly be seen that water temperature (and all other hydrological cycling variables) can be approximated very well by periodic functions. The advantage of this family of regression functions is the visualisation of a cycling process; the disadvantage is that the functions are valid for fixed cycle periods only.

5.3 Trend functions

Medium-term and long-term temporal and spatial developments (trends) of hydrological variables can be estimated by simple, explicitly given functions. Parameter estimation is done by the method of least squares. Figure 17 shows the development of BOD along a river stretch following a polynomial of 2nd order.

[Figure 17: BOD (mg/l, axis 0.0 to 2.5) at the sampling points TeK0030, SPK0010, SPK0020, Hv0190 and Hv0200; fitted trend y = 0.0908x² - 0.5374x + 2.6386, R² = 0.9501]

Figure 17: Polynomial trend function for BOD in a river

Other examples of linear and nonlinear trend functions are presented in figures

18 to 20.

[Figure 18: o-PO4-P (mg/l, axis 0.00 to 0.20) at the sampling points 25014, TeK0030, SPK0010, SPK0020, Hv0190 and Hv0200; fitted trend y = 0.0166x + 0.0854, R² = 0.8938]

Figure 18: Linear trend of phosphate phosphorus in a channel

The linear function (also denoted as a polynomial of 1st order) is able to follow the increasing trend of the phosphate phosphorus load due to waste water input in a low flow channel with acceptable accuracy. The deviations of the regression line from the measurements are small. For the same river stretch, the approximating 2nd order polynomial of water flow (fig. 19) shows stronger deviations after the confluence of the main river with a channel. The reasons for this are changing hydraulic conditions and increasing values of water flow. The stationary or uniform flow conditions of the first part of the water body are now disturbed. Considering the performance index, the graph should be acceptable. But the regression model is not able to compensate for the positive jump in water flow because it works with fixed parameters (coefficients). Therefore, another regression model should be used.

[Figure 19: water flow (m³/s, axis 0 to 70) at the sampling points 25014, TeK0030, SPK0010, SPK0020, Hv0190 and Hv0200; fitted trend y = 3.3914x² - 18.053x + 51.117, R² = 0.809]

Figure 19: Quadratic trend function of water flow

On the other hand, for the same river stretch the trend of chlorophyll-a is again expressed by a 2nd order polynomial (fig. 20).

[Figure 20: chlorophyll-a (µg/l, axis 0 to 80) at the sampling points 25014, TeK0030, SPK0010, SPK0020, Hv0190 and Hv0200; fitted trend y = 1.4627x² - 6.4221x + 67.115, R² = 0.6459]

Figure 20: Quadratic trend function of chlorophyll-a

The performance index is lower than that for water flow in fig. 19 because of some disturbances caused by hydrophysical phenomena. But the trend follows the computed polynomial. Taking into account the variations in the chlorophyll measurements, the trend polynomial is quite acceptable.

The following table gives a survey of trend functions used to estimate the development of water quality in a river (table 13). All polynomials are of 2nd order. The signs in the last column indicate significance at the 95% probability level.

Table 13: Trend functions of water quality in the River Havel

Water quality indicator    Trend         R        P (95%)
Water flow                 polynomial    0.8126   +
Temperature                polynomial    0.6177   +
Conductivity               polynomial    0.1971   -
Chloride                   polynomial    0.0382   -
DO                         polynomial    0.3858   +
BOD                        polynomial    0.4264   +
CSV                        polynomial    0.7611   +
NH4-N                      exponential   0.5669   +
NO2-N                      exponential   0.4879   +
NO3-N                      exponential   0.4746   +
O-PO4-P                    exponential   0.8683   +
TP                         polynomial    0.0822   -
SiO2                       polynomial    0.8888   +
Suspended matter           polynomial    0.0227   -
Chlorophyll-a              polynomial    0.6032   +
Inorg. part of biomass     polynomial    0.6742   +
Loss of org. matter        polynomial    0.1418   -

As can be seen from table 13, polynomial and exponential trend functions are

sufficient to describe the changing water quality mathematically.

Interpretations of trend functions can be given as follows:

Linear trend:

$$y(t) = a_0(t) + a_1(t)\,x(t).$$

(Interpretation of parameters: (a0) - mean initial value, (a1) - mean rate of change)

Squared trend:

$$y(t) = a_0(t) + a_1(t)\,x(t) + a_2(t)\,x^2(t).$$

(Interpretation of parameters: (a0) - mean initial value, (a1) - mean rate of change, (a2) - mean process acceleration)

Polynomial trend:

$$y(t) = a_0(t) + a_1(t)\,x(t) + a_2(t)\,x^2(t) + \dots + a_n(t)\,x^n(t).$$

(Interpretation of parameters is mostly impossible).

Exponential trend:

$$x(t) = x(0)\,e^{-kt} + E.$$

(Interpretation according to 1st order kinetics: x(0) - initial concentration value, k - rate of change, E - random component).
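The trend functions can be evaluated directly once their parameters are known. The sketch below uses the quadratic BOD trend of figure 17 and assumes that the five sampling points are numbered 1 to 5 along the river stretch, which is an assumption about how the trend line was fitted. The exponential example uses hypothetical values for x(0) and k.

```python
import math

def quadratic_trend(x, a0, a1, a2):
    """Squared trend y = a0 + a1*x + a2*x^2."""
    return a0 + a1 * x + a2 * x**2

def exponential_trend(t, x0, k):
    """Deterministic part of x(t) = x(0)*exp(-k*t) + E (random part E omitted)."""
    return x0 * math.exp(-k * t)

# BOD trend of figure 17: y = 0.0908 x^2 - 0.5374 x + 2.6386
a0, a1, a2 = 2.6386, -0.5374, 0.0908
bod = [quadratic_trend(x, a0, a1, a2) for x in range(1, 6)]
x_min = -a1 / (2 * a2)   # position of the minimum of the parabola
print([round(v, 3) for v in bod], round(x_min, 2))

# Hypothetical first-order decay: initial value 10 mg/l, k = 0.2 per day
x0, k = 10.0, 0.2
half_life = math.log(2) / k
print(round(half_life, 2), round(exponential_trend(half_life, x0, k), 2))
```

With this numbering the minimum of the BOD trend lies close to the third sampling point, and all five trend values stay within the axis range of figure 17.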

5.4 Comparison of regression functions

To describe one and the same data set different nonlinear models can be ap-

plied.

Figure 21: Comparison of different regression functions for the same data set


By comparing the initial and the final reach of the regression functions the best functional relationship will be selected (fig. 21). The linear model also seems to be suitable. As can be seen in part H, the middle range of all computed models shows very small variations, while the initial and the final parts of the graphs show a spreading of the curves. The quality of fit can be evaluated by:

Linear coefficient of determination (performance index): $R^2 = B = \sum(\hat{y}_i - \bar{y})^2 / \sum(y_i - \bar{y})^2$,

Nonlinear performance index: $B_{nl} = 1 - \sum(y_i - \hat{y}_i)^2 / \left((n-1)\,s_y^2\right)$,

Residual sum of squares: $S_R = \sum(y_i - \hat{y}_i)^2$, or

Residual dispersion: $s^2 = S_R/(n - m - 1)$ (n - number of data, m - number of parameters).
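The measures of fit can be computed from the observed values and the values predicted by any regression function. The following sketch uses hypothetical observed and fitted values; the function name fit_quality and the dictionary keys are chosen for this illustration only.

```python
def fit_quality(y, y_hat, m):
    """Measures of fit for observed y, fitted y_hat and m model parameters."""
    n = len(y)
    mean_y = sum(y) / n
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)            # equals (n-1) * s_y^2
    ss_reg = sum((fi - mean_y) ** 2 for fi in y_hat)
    s_r = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual sum of squares
    return {
        "B": ss_reg / ss_tot,             # linear performance index
        "B_nl": 1 - s_r / ss_tot,         # nonlinear performance index
        "S_R": s_r,
        "s2": s_r / (n - m - 1),          # residual dispersion
    }

# Hypothetical observations and values fitted by some regression function
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.1, 1.9, 3.2, 3.8, 5.0]

q = fit_quality(observed, fitted, m=2)
print({name: round(value, 4) for name, value in q.items()})
```

For a linear least-squares fit with intercept B and B_nl coincide; for other fitted curves, as in this example, they can differ, which is why both are listed.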


6. Time series analysis

The distinction between discrete and continuous variables is not a clear dichot-

omy because continuous processes (seen from a physical point of view of un-

derstanding nature) will be observed at discrete time events. Therefore, mostly

random variables are observed.

6.1 Dynamic behaviour of time series

Freshwater ecosystems may be seen as switching networks where inputs are transformed into outputs by an operator which describes the transient behaviour of ecological processes (fig. 22). The overall operator transforms input signals x(t) into output signals y(t), where the signals will be smoothed (damped), and there exists some redundancy between input and output signals.

[Figure 22: block diagram with input x(t) and output y(t)]

Figure 22: Schematic diagram of a transfer process

Therefore, water related processes are represented by time varying signals. In figure 23, NO3-N raw data are described by a polynomial trend as follows: NO3-N(t) = 1.8987 - 0.0754 t + 0.0028 t² - 0.00003 t³. An exact mathematical (or functional) description of random fluctuations is not possible. The function describes more or less the mean behaviour of the process.

[Figure 23: NO3-N raw data and polynomial trend; horizontal axis 0 to 70, vertical axis 0.0 to 3.5]

Figure 23: Approximation of a time varying process by a function

6.2 Description of time series in time and in frequency domain

Hydrological systems can be seen as stochastic transfer systems described by

system state variables and parameters. They are characterised by measurable

inputs, not measurable (stochastic) disturbances as well as by measuring er-

rors. Disturbances, input signals and measurement errors will be overlaid and

will produce output signals.

Mathematical descriptions of hydrological time series can be represented by

time domain functions (cf. transfer functions, pp. 8 and 9).

In the frequency domain hydrological time series are represented by Fourier-

transforms of correlation functions, by coherency functions as well as by wave-

lets.

6.3 Stationary processes

Because of time lags between input and output processes, stationary processes are reached only when all transient processes have decayed. Therefore, only some statistical characteristics of the signals can be captured. If the statistical characteristics do not change in time, then these processes are called stationary processes. Process averages and dispersions will not change much in time. Therefore, stationary random processes can be investigated on different time intervals within −∞ < t < +∞.

Statistical characteristics of stationary random processes can be expressed by

1. Probability density function $p(x)$ of signals $X(t)$,

2. Auto-correlation function $\varphi_{xx}(\tau)$,

3. Spectral power density function $S_{xx}(\omega)$.

A time varying process is expressed by a stochastic signal X(t). For each time step tn one measured value Xn(t) will be obtained. When the process is described by an analytical (deterministic) function f(t), the time behaviour can be predicted completely. For a stochastic process, the further development can be predicted only for a short time interval, and only some statistical statements on the future development of the process X(t) can be given:

$$\mathrm{Prob}(X(t_{n+1}) \le x) = P(x),$$

or

$$\mathrm{Prob}(a < X(t) \le b) = \int_a^b p(x)\,dx.$$

The Gaussian distribution with a bell-shaped density is one of the most important probability density distributions, where

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-x^*)^2}{2\sigma^2}\right).$$

Important expectations are the linear average $E(x) = \int x\,p(x)\,dx$ and the squared average $E(x^2) = \int x^2\,p(x)\,dx$.
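The two expectations can be checked numerically for a Gaussian density. The sketch below integrates with the trapezoidal rule over the mean plus or minus ten standard deviations; the mean and standard deviation are hypothetical values. For a Gaussian density the linear average equals the mean x*, and the squared average equals x*² + σ².

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Gaussian probability density p(x)."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def expectation(g, mean, sigma, steps=20000):
    """Trapezoidal approximation of the integral of g(x)*p(x) over mean +/- 10 sigma."""
    lo, hi = mean - 10 * sigma, mean + 10 * sigma
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * g(x) * gaussian_pdf(x, mean, sigma)
    return total * h

mean, sigma = 12.0, 4.0   # hypothetical water temperature statistics (deg C)
e_x = expectation(lambda x: x, mean, sigma)        # linear average
e_x2 = expectation(lambda x: x * x, mean, sigma)   # squared average
variance = e_x2 - e_x**2
print(round(e_x, 4), round(e_x2, 4), round(variance, 4))
```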

6.4 Correlation and spectral functions

The probability density function gives information about the probability that the amplitude of the process X(t) at time t lies between x and (x + Δx):

Prob(x


spectrum of a stationary signal which is a distribution of the variance of the sig-

nal as a function of frequency. The frequency components that account for the

largest share of the variance are revealed. Each peak represents the part of the

variance of the signal that is due to a cycle of a different period or length. Sig-

nificant periodicity in the signal will induce a sharp peak in a periodogram. The

auto-covariance function is the time domain counterpart of the periodogram.
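How a dominant cycle shows up as a peak can be illustrated with a small periodogram sketch. It uses a direct discrete Fourier transform of the mean-removed series and the scaling |X_k|²/N, which is one of several common conventions. The signal is synthetic: a hypothetical weekly water temperature series with a yearly cycle and a weaker quarterly cycle.

```python
import cmath
import math

def periodogram(x):
    """Return (frequency, power) pairs for the frequencies k/N, k = 1..N//2."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]   # remove the mean so that frequency 0 does not dominate
    result = []
    for k in range(1, n // 2 + 1):
        s = sum(z[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        result.append((k / n, abs(s) ** 2 / n))
    return result

# Synthetic series: 4 years of weekly values (52 samples per year)
n = 208
temps = [12 + 8 * math.sin(2 * math.pi * t / 52) + 1.5 * math.sin(2 * math.pi * t / 13)
         for t in range(n)]

pg = periodogram(temps)
f_peak, p_peak = max(pg, key=lambda fp: fp[1])
print(f"dominant period: {1 / f_peak:.1f} samples")
```

The largest peak lies at the frequency 1/52 per sample, i.e. at the yearly cycle, and a second, much smaller peak appears at 1/13.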

The periodogram of water temperature (figure 24) shows a single distinct peak

which indicates the major cyclic behaviour. The low frequency component is

responsible for the general tendency of the indicator.

Figure 24: Periodogram of water temperature of the Lower Havel River

Figure 25: Periodogram of pH

The periodogram of pH in figure 25 shows that the highest variance is displayed

by a low frequency. Small fluctuations are not dominant and can be neglected.


Only long term changes are responsible for the overall observed behaviour of

the indicator.

The periodogram of pH is similar to that of dissolved oxygen presented in figure 26. High variances at low frequencies are observed. This means that the general tendency of this indicator is determined by long term changes.

Figure 26: Periodogram of dissolved oxygen.

For the indicator of phytoplankton biomass the periodogram is shown in figure

27. The periodogram represents low frequency components which exhibit the

highest variances and some small fluctuation at higher frequencies. They de-

termine the long term behaviour of the indicator. Two distinct peaks reveal two

cycles of different periods and amplitudes.

Figure 27: Periodogram of chlorophyll-a

The cross-power spectrum $S_{xy}(\omega)$ of two stochastic ecological processes x(t) and y(t) is the Fourier transform of the cross-correlation function (CCF):

$$S_{xy}(\omega) = \frac{1}{2\pi}\int \varphi_{xy}(\tau)\,e^{-j\omega\tau}\,d\tau$$

It is a complex function.

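The coherency function $Co(\omega)$ is a measure of the synchronicity of (two) signals. It is calculated on the basis of the periodograms of both signals by

$$Co_{xy}(\omega) = \frac{|S_{xy}(\omega)|^2}{S_{xx}(\omega)\,S_{yy}(\omega)},$$

where $|S_{xy}(\omega)| = \sqrt{\mathrm{Re}(S_{xy}(\omega))^2 + \mathrm{Im}(S_{xy}(\omega))^2}$, and for the phase shift between both signals $\Phi(\omega) = \arctan\left(\mathrm{Im}(S_{xy}(\omega))/\mathrm{Re}(S_{xy}(\omega))\right)$ is valid.

The limitation of the CCF to a finite range is considered by what is called a window function $h(\tau)$:

$$\tilde{S}_{xy}(\omega) = \frac{1}{2\pi}\int \varphi_{xy}(\tau)\,h(\tau)\,e^{-j\omega\tau}\,d\tau,$$

which corresponds to a convolution of $S_{xy}(\omega)$ with $H(\omega)$, where $H(\omega)$ is the Fourier transform of $h(\tau)$, which distorts $S_{xy}(\omega)$ to $\tilde{S}_{xy}(\omega)$.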


7. Analysis of cycling processes

Cycling processes in hydrology are natural. In fig. 28 some examples of cycling

processes with different periods and frequencies are presented.

[Figure 28: time series 1985 to 1995 (time in a) of Q (m³/s), EC (µS/cm), Tw (°C), O2 (mg/l) and pH]

Figure 28: Cycling water quality indicators

Such processes are caused mostly by natural external driving forces but also by natural internal driving forces. They give rise to different time and frequency behaviour of the water quality (hydrological) processes. Water quality processes are characterised by different time parameters such as time delay, threshold values, altering, physiological parameters and others. State transitions take place on intervals (ai(t), bi(t)) with probability densities wi(t) of time delays of system variables and probabilities pi(t) for each realisation of a state transition:

For $a_i(t) \le w_i(t) \le b_i(t)$: $p_i(t) = \int w_i(t)\,dt$.

On the other hand, hydrological variables vary often with high frequencies be-

cause of random changes of internal system states and/or fluctuations of vari-

ables. Switching processes of input variables take place at certain different time

events. Time delays in the courses of action of system components lead to re-

tardations in the changes of system states and to redundancies in the data

transfer.


A state transition can be characterised by the quadruple

{ai(t), bi(t), wi(t), pi(t)}.

A classification of hydrological systems can be given by its characteristics of

signals and by the type of change of dynamic properties (table 14).

Table 14: Classification of hydrological systems

Classification               Remark

Characteristics of signals
  Modulation                 Change of amplitudes, frequencies and phases of signals
  Quantification             Discretisation of time domain of amplitudes and duration interval of signals

Adaptability of system
  adaptive                   Change of system states, change of inputs and disturbances, change of parameters, change of system structure
  non-adaptive               Fixed parameters, no change of ecosystem structure

7.1 Introduction

Mathematical equations describe either the time dependency (function of time t), which is called description in the time domain, or the frequency dependency (function of frequency or cycles per time unit), which is called description in the frequency domain. Mostly, cycling (or periodic) processes in hydrological context are caused by natural external driving forces. On the other hand, aperiodic hydrological processes are mainly influenced by artificial (man-made) external driving forces.

Another distinction can be made by the ability to reproduce a time-varying process. If a process can be reproduced and forecast correctly, it is called a deterministic one. Otherwise it is called a non-deterministic or stochastic (random) process. Each deterministic process x(t) is characterised by its time development (or behaviour) x = x(t) with −∞ < t < +∞.

A harmonic process is described by a trigonometric function

$$x(t) = x_0\cos(\omega_i t + \varphi_i), \qquad -\infty < t < +\infty.$$

$\omega_i = 2\pi/T_i$ is the basic cycling frequency (circle frequency), $T_i$ is the period of the cycle, and $\varphi_i$ is the phase shift.

7.2 Fourier analysis

A periodic process with period T0 is described by a Fourier series of the form

$$x(t) = \frac{a_0}{2} + \sum_i \left[a_i\cos(i\omega_0 t) + b_i\sin(i\omega_0 t)\right],$$

with the index i running from −∞ to +∞, $\omega_0 = 2\pi/T_0$ the frequency of the basic cycle and $T_0$ the period of the cycle. The amplitudes $a_i$ and $b_i$ are calculated as follows:

$$a_i = \frac{1}{T_0}\int x(t)\cos(i\omega_0 t)\,dt, \qquad b_i = \frac{1}{T_0}\int x(t)\sin(i\omega_0 t)\,dt, \qquad a_0 = \frac{1}{2T_0}\int x(t)\,dt.$$

The Fourier polynomial is an approximation which minimises the mean squared deviation from a cycling process. Then, the amplitudes of the approximating function are given by $A_i = \sqrt{a_i^2 + b_i^2}$.

Phase shifts are given in the interval $[0, 2\pi]$ by $\varphi_i = \arctan(b_i/a_i)$.

Figure 29 shows a Fourier approximation of the global radiation process. It can be seen that the approximation is shifted against the real cycles because of its fixed frequency. This causes some error.

[Figure 29: global radiation (W/m², axis 0 to 500) over time (a), 1996 to 2003; raw data and component with max. amplitude (f = 1/352 d)]

Figure 29: Fourier approximation of global radiation

Fourier approximations can be used to explain the variance of a cycling process by its basic frequency. Table 14 gives an example of the usefulness of this method for physical, chemical and biological environmental or ecological variables respectively. The 3rd column of table 14 contains the values of total variance of the time series under consideration. The last column contains the values of variance which are explained by the dominant cycle contained in the time series. The best results are obtained for physical variables, followed by chemical variables. Insufficient results are obtained for biological variables.

Table 14: Fourier analysis of water quality indicators

Indicator   Reservoir      Total variance (%)   Average   Std. dev.   Variance of the yearly cycle (% of total variance)
TEMP        Saidenbach     90.0                 12.0      44.43       84.62
            Neunzehnhain   90.0                 11.9      33.41       74.35
            Klíčava        95.7                 11.1      56.80       92.76
            Slapy          95.9                 12.0      52.25       90.63
DO          Saidenbach     71.6                 10.5      2.90        22.74
            Neunzehnhain   76.2                 10.1      1.98        22.19
            Klíčava        75.9                 10.2      4.37        37.35
            Slapy          76.2                 7.7       8.88        34.81
CHA         Saidenbach     36.8                 5.4       32.87       1.94
            Neunzehnhain   35.4                 1.7       1.59        1.96

Example: Approximation of water temperature of reservoirs (yearly dominant harmonic cycle):

Reservoir Saidenbach

TEMP(t) = 12.0 + 1.458cos((6/180)t) 4.462sin((6/180)t)

Reservoir Neunzehnhain

TEMP(t) = 11.9 + 0.693cos((6/180)t) + 4.415sin((6/180)t)

Reservoir Klíčava

TEMP(t) = 11.1 - 6.650cos((9/180)t) - 7.820sin((9/180)t)

Reservoir Slapy

TEMP(t) = 12.0 - 7.073cos((10/180)t) - 6.684sin((10/180)t)
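The cosine and sine coefficients of such an approximation can be combined into one amplitude and one phase shift, using the amplitude formula of this section. The sketch below does this for the Neunzehnhain and Slapy equations. It uses atan2 instead of the plain arc tangent so that the quadrant of the phase is resolved; this is a choice made for the illustration.

```python
import math

# (a, b) of the dominant cycle: TEMP(t) = mean + a*cos(w*t) + b*sin(w*t)
coefficients = {
    "Neunzehnhain": (0.693, 4.415),
    "Slapy": (-7.073, -6.684),
}

def amplitude_and_phase(a, b):
    """A = sqrt(a^2 + b^2); phase from atan2, mapped to the interval [0, 2*pi)."""
    amplitude = math.hypot(a, b)
    phase = math.atan2(b, a) % (2 * math.pi)
    return amplitude, phase

for name, (a, b) in coefficients.items():
    amp, phase = amplitude_and_phase(a, b)
    print(f"{name}: amplitude = {amp:.3f} deg C, phase shift = {phase:.3f} rad")
```

With these values the two terms can be written as a single cosine, a cos(wt) + b sin(wt) = A cos(wt - phase), which makes the size of the yearly temperature swing directly visible.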

7.3 Digital data filter

Digital filter functions transfer sequences of input signals to sequences of output signals by compressing or decompressing noisy information contained in the measured signals of hydrological processes. The results of applying digital filters are consistent data series which can be used for modelling, simulation and optimisation in hydrological sciences.

Basic filter functions are derived from an ideal low pass filter:

Ideal low pass

$$|H(\Omega)|^2 = \frac{1}{1 + F^2(\Omega)}$$

Butterworth filter (power low pass)

$$|H(\Omega)|^2 = \frac{1}{1 + \Omega^{2n}}$$

(Amplitude response should be as flat as possible in the pass band).

Chebyshev filter, type 1

$$|H(\Omega)|^2 = \frac{1}{1 + \varepsilon^2 c_n^2(\Omega)}$$

($\varepsilon$ - ripple factor (or eccentricity), $\varepsilon$ = 0.1526. In the pass band a ripple is accepted. The transition from pass band to stop band is steeper than for the Butterworth filter).

Chebyshev filter, type 2 (inverse Chebyshev filter)

$$|H(\Omega)|^2 = \frac{1}{1 + \varepsilon^{*2} c_n^2(\Omega)} \quad \text{with} \quad \varepsilon^* = \varepsilon^2/(1-\varepsilon).$$

(In the stop band a ripple is accepted.)

Elliptic filter (Cauer filter)

$$|H(\Omega)|^2 = \frac{1}{1 + \varepsilon^{*2} F_n^2(\Omega)}$$

(Ripples arise in the pass band and in the stop band. One gets the steepest transition between both frequency bands).

To get an acceptable transfer behaviour only filters of order 1 to 3 should be used. Figures 30 to 33 represent the transfer behaviours of digital filters for different water quality time series. Higher order filters show rippling transfer behaviours and cause nonlinear effects in the output sequences of signals. This leads to misinterpretations and unexplainable events within the data series.
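The squared amplitude responses of the Butterworth and the Chebyshev type 1 filter can be evaluated with a short sketch. It assumes a frequency normalised to the critical frequency (value 1 at the critical frequency) and takes c_n to be the Chebyshev polynomial of the first kind; both are assumptions of this illustration. The ripple factor 0.1526 is the value given above.

```python
def butterworth_sq(omega, n):
    """Squared magnitude of an n-th order Butterworth low pass (normalised frequency)."""
    return 1.0 / (1.0 + omega ** (2 * n))

def chebyshev_poly(n, x):
    """Chebyshev polynomial c_n(x) of the first kind via the three-term recurrence."""
    c_prev, c = 1.0, x
    if n == 0:
        return c_prev
    for _ in range(n - 1):
        c_prev, c = c, 2 * x * c - c_prev
    return c

def chebyshev1_sq(omega, n, eps=0.1526):
    """Squared magnitude of an n-th order Chebyshev type 1 low pass."""
    return 1.0 / (1.0 + eps**2 * chebyshev_poly(n, omega) ** 2)

for n in (1, 2, 3):
    print(n, round(butterworth_sq(0.5, n), 4), round(butterworth_sq(2.0, n), 4),
          round(chebyshev1_sq(2.0, n), 4))
```

At the critical frequency every Butterworth order gives the value 0.5; increasing the order only steepens the transition between pass band and stop band.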


[Figure 30: standard error of O2 (mg/l, axis 0 to 3.5) versus reciprocal of critical frequency (d, axis 0 to 350) for filter orders 1 to 6]

Figure 30: Selection of filter order of a Butterworth filter for DO

The higher order filters lead to a changing (rippling) transfer behaviour during the filtering process, as can be seen in figs. 30 and 31.

[Figure 31: standard error of pH value (axis 0 to 0.8), horizontal axis 0 to 350, for filter orders 1 to 6]

Figure 31: Selection of filter order of a Butterworth filter for pH

They show this behaviour for a Butterworth filter. Chebyshev type 1 filters for chlorophyll-a and for water temperature (figs. 32 and 33) demonstrate the disturbances within the transfer process.

[Figure 32: standard error of total chlorophyll-a (mg/l, axis 0 to 0.035), horizontal axis 0 to 350]

Figure 32: Chebyshev type 1 filter for chlorophyll-a

[Figure 33: standard error of water temperature (°C, axis 0 to 2.5), horizontal axis 0 to 350]

Figure 33: Chebyshev type 1 filter for water temperature

The first step of digital data filtering procedures is the selection of a complete hydrological time series. If the data series contains gaps, interpolation methods should be used to get a time series with equidistant data. This is a strict prerequisite for all further steps. Fig. 34 shows such a data series for the variable conductivity of the Oder River at Frankfurt.
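Closing gaps in an equidistant series can be sketched with linear interpolation, which is only one of several possible interpolation methods and is used here as an assumption. Missing values are marked by None; the conductivity readings are hypothetical.

```python
def fill_gaps(series):
    """Linearly interpolate None entries of an equidistant series.

    Gaps at the very start or end are left as None because there is
    no neighbour on one side to interpolate from.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

# Hypothetical daily conductivity readings (uS/cm) with missing days
raw = [820.0, None, 860.0, 900.0, None, None, 960.0]
print(fill_gaps(raw))   # [820.0, 840.0, 860.0, 900.0, 920.0, 940.0, 960.0]
```

Long gaps filled in this way carry no information about the real fluctuations, so the share of interpolated values should be kept in mind when the spectral density is interpreted.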


[Figure 34: conductivity (µS/cm, axis 400 to 1600) over time (a), 1993 to 2000]

Figure 34: Original data series of conductivity

In the next step the critical frequency is calculated from the spectral density function (fig. 35). The 95% confidence region should be selected as the confidence band. For the example a critical frequency fg = 0.053 was used.

[Figure 35: power density of conductivity (logarithmic axis, 10^0 to 10^6) versus frequency (1/d, axis 0 to 0.5), with upper and lower bounds of the 95% confidence interval and the critical frequency fg marked]

Figure 35: Selection of critical frequency of the filter

The last step consists of the computation of the digital filter and the reconstruction of the original data series. In the case of the Oder River an elliptic filter was used to reconstruct the original time series and to get a consistent time series for modelling (fig. 36).

[Figure 36: conductivity (µS/cm, axis 200 to 1200) over time (a), 1980 to 2000; raw data and elliptic filter (1st order)]

7.4 Wavelets

Wavelet analysis has proven quite useful for time scale based signal analysis. It is a solution for the time scale analysis problem because it offers an effective approach to extract both the information on the time localisation and the frequency content of the time series. It has the ability to decompose a time series into several sub-series which may be associated with particular time scales. As a result, the interpretation of features in hydrological time series may be facilitated by first applying an appropriate wavelet transform and subsequently interpreting each individual sub-series.

The following questions can be effectively answered with the help of wavelet

analysis:

1. What is the dominant scale of variation influencing the observed general tendency of the indicator?

2. Are the variations from one day to the next more prominent than the variations from one week to the next?

3. Are the statistical variations in the hydrological indicator homogeneous across time?

4. What are the time dependent variations, such as the presence of trends?

5. How are two indicators related on a scale by scale basis? How do they covary at different scales?

The wavelet analysis imitates the windowed Fourier analysis by using basis functions (wavelets) that are better suited to capture the local behaviour of non-stationary signals. The wavelet transformation is a function of two variables W(u,s) obtained by projecting a signal X(t) onto a particular wavelet and is given by

$$W(u,s) = \int_{-\infty}^{\infty} X(t)\,\psi_{u,s}(t)\,dt, \qquad \psi_{u,s}(t) = \frac{1}{\sqrt{s}}\,\psi\!\left(\frac{t-u}{s}\right),$$

which gives a translated and dilated version of the original wavelet function. The coefficients that are obtained are a function of the location and scale parameters. Applying shifted and scaled versions of a wavelet function decomposes the signal into simpler components. It is the effect of the shifting and scaling process that makes this representation possible; it is referred to as multiresolution analysis.

The wavelet transform is usually applied in the form of a filter bank, comprising two filters. The scaling filter, known as the father wavelet, is a low pass filter, while the wavelet filter, known as the mother wavelet, is a high pass filter. Given a signal X(t) of length $n = 2^j$, the filtering procedure can be performed a maximum of j times, giving rise to j different wavelet scales. The wavelet coefficients or detail coefficients are produced by the wavelet filter, while the scaling filter gives rise to the smooth version of the signal used at the next scale. Given the respective father and mother wavelets,

$$\phi_{J,k}(t) = 2^{-J/2}\,\phi\!\left(\frac{t - 2^J k}{2^J}\right), \qquad \int \phi(t)\,dt = 1$$

and

$$\psi_{j,k}(t) = 2^{-j/2}\,\psi\!\left(\frac{t - 2^j k}{2^j}\right), \qquad \int \psi(t)\,dt = 0,$$

where $\phi_{J,k}$ is the father wavelet and $\psi_{j,k}$ is the mother wavelet with the scale parameter s being restricted to the dyadic scale $2^j$. If a signal f(t) is projected onto a given basis function,

$$S_{J,k} = \int f(t)\,\phi_{J,k}(t)\,dt$$

and

$$d_{j,k} = \int f(t)\,\psi_{j,k}(t)\,dt$$

will be obtained, with $S_{J,k}$ being the coefficients for the father wavelet at the maximum scale $2^J$ (the smooth coefficients) and $d_{j,k}$ being the detail coefficients from the mother wavelet at all scales from 1 to J. Based on these coefficients, the function f(t) can be represented by

$$f(t) = \sum_k S_{J,k}\,\phi_{J,k}(t) + \sum_k d_{J,k}\,\psi_{J,k}(t) + \dots + \sum_k d_{1,k}\,\psi_{1,k}(t)$$

and can be equally represented by

$$f(t) = S_J + D_J + D_{J-1} + \dots + D_j + \dots + D_1$$

where

$$S_J = \sum_k S_{J,k}\,\phi_{J,k}(t)$$

and

$$D_j = \sum_k d_{j,k}\,\psi_{j,k}(t).$$
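The additive structure of this decomposition can be made concrete with the Haar wavelet, the simplest wavelet, which is used here only for illustration. For the Haar wavelet the smooth at level j is the mean over blocks of 2^j neighbouring values, and each detail is the difference between two successive smooths. The daily dissolved oxygen values are hypothetical.

```python
def haar_mra(signal, levels):
    """Additive Haar decomposition: signal = smooth + sum of details.

    Returns (smooth, details), where details[0] is D1 (finest scale).
    The length of the signal must be divisible by 2**levels.
    """
    n = len(signal)
    details = []
    current = list(signal)          # the smooth at level 0 is the signal itself
    for j in range(1, levels + 1):
        block = 2 ** j
        smooth = []
        for start in range(0, n, block):
            mean = sum(signal[start:start + block]) / block
            smooth.extend([mean] * block)
        details.append([c - s for c, s in zip(current, smooth)])
        current = smooth
    return current, details

# Hypothetical daily dissolved oxygen values (mg/l), 16 days
do = [9.8, 10.1, 9.5, 9.9, 8.7, 8.9, 8.2, 8.6, 7.9, 8.4, 8.8, 9.1, 9.6, 9.4, 10.0, 10.3]

smooth, details = haar_mra(do, levels=3)
rebuilt = [s + sum(d[i] for d in details) for i, s in enumerate(smooth)]
print([round(v, 2) for v in smooth])
```

Adding the smooth and all details returns the original series, which is the property expressed by the sum of S and the D terms.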

Multiresolution decomposition (MRD) reveals the variations at different scales denoted by d. Figure 37 shows the details of the multiresolution analysis of dissolved oxygen sampled at daily intervals.

Figure 37: Multiresolution analysis details of dissolved oxygen signal

sampled at daily intervals

The details reveal the high frequency variations present in the dissolved oxygen time series and provide an additive decomposition of the high frequency variation on a scale by scale basis. The details d1, d2, d3, d4, d5, d6 and d7 reveal the variations occurring at 1 day, 2 days, 4 days, 8 days, 16 days, 32 days and 64 days respectively. This progressive decomposition reveals the differences in fluctuations from one scale to another. It effectively shows that the lower scales are less important compared to the higher scales of variation.

Multiresolution analysis (MRA) filters information in the signal at different scales represented by a. In fig. 38 an example of an MRA and MRD is given for long-term observations of dissolved oxygen in a eutrophic freshwater ecosystem. When all high-frequency events are removed from the original signal s, the basic nature of the cycling process emerges. This can be seen at level a7.

Figure 38: Wavelet analysis of DO

Figure 38 reveals that the variations occurring at a time scale of 1 day are of relatively low intensity and are not able to influence the general tendency observed in the dissolved oxygen signal. However, the fluctuations occurring at higher time scales such as scale 8 are strong enough to influence the long term behaviour of the signal. Hence, the long term tendency observed in the dissolved oxygen time series is significantly influenced only by the fluctuations occurring at the higher scales and not the lower scales. At the lower scales, the fluctuations are higher during the warmer months than during the colder months. At the higher scales such as scale 32, the fluctuations are high throughout the year. It is therefore worthwhile to examine the variance at different scales to quantify these variations.

An overview on the respective frequencies is given in table 15.

Table 15: Frequencies and scales of MRA and MRD

MRA scale   MRD scale   Frequency
a1          d1          1
a2          d2          2
a3          d3          4
a4          d4          8
a5          d5          16
a6          d6          32
a7          d7          64
a8          d8          128
a9          d9          256
a10         d10         512
a11         d11         1024
a12         d12         2048
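The third column of Table 15 doubles from one level to the next. Assuming that it gives the scale in units of the sampling interval (one day, in line with the description of d1 to d7), which the table itself does not state, the relation between level j and scale can be written as:

```latex
S_j = 2^{\,j-1}, \qquad j = 1, \dots, 12, \qquad S_1 = 1,\; S_{12} = 2048 .
```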

The variance of a signal can also be decomposed by using this technique.
For a signal $x_t$, the time-varying wavelet variance at scale $S_j$ is obtained
from the wavelet coefficients $w_{j,t}$ by

$$\sigma^2_{x,t}(S_j) = \frac{1}{2 S_j}\,\operatorname{var}(w_{j,t}).$$

Similar to the wavelet variance of a univariate signal, the wavelet covariance

decomposes the covariance between two signals on a scale by scale basis by

$$\operatorname{cov}(x_{1,t}, x_{2,t}) = \sum_{j=1}^{\infty} \gamma_x(S_j).$$
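The scale-by-scale decomposition of variance and covariance can be checked numerically with a short sketch. Several points here are assumptions and not taken from the text: the Haar wavelet replaces db4, var and cov of the coefficients are computed as mean squares and mean products (the coefficients are treated as zero-mean), the covariance is given the same 1/(2 S_j) normalisation as the variance, and the function names and test signals are hypothetical.

```python
import numpy as np

def haar_dwt_details(signal):
    """Detail coefficients w_{j,t} of a full-depth orthonormal Haar DWT.

    Element j-1 of the returned list holds level j (scale S_j = 2**(j-1)).
    """
    v = np.asarray(signal, dtype=float)
    if v.size < 2 or v.size & (v.size - 1):
        raise ValueError("signal length must be a power of two")
    details = []
    while v.size > 1:
        even, odd = v[0::2], v[1::2]
        details.append((odd - even) / np.sqrt(2.0))  # wavelet coefficients
        v = (even + odd) / np.sqrt(2.0)              # scaling coefficients
    return details

def wavelet_covariance(x, y):
    """Covariance per scale, mean(w_x * w_y) / (2 * S_j), keyed by S_j."""
    wx, wy = haar_dwt_details(x), haar_dwt_details(y)
    if len(wx) != len(wy):
        raise ValueError("both signals must have the same length")
    # enumerate starts at 0, so 2**j is the scale S of level j + 1
    return {2 ** j: float(np.mean(a * b)) / (2 * 2 ** j)
            for j, (a, b) in enumerate(zip(wx, wy))}

def wavelet_variance(x):
    """Variance per scale: the covariance of a signal with itself."""
    return wavelet_covariance(x, x)

# Hypothetical pair of correlated signals
rng = np.random.default_rng(1)
x = rng.normal(size=512)
y = 0.6 * x + rng.normal(size=512)

var_by_scale = wavelet_variance(x)
cov_by_scale = wavelet_covariance(x, y)
```

Under these assumptions the per-scale values add up to the ordinary sample variance of x and the sample covariance of x and y (both with divisor N), so the largest entry of `var_by_scale` identifies the dominant scale of variation.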

The wavelet variance shown in figure 39 reveals the intensity of variation from
one scale to the next in the dissolved oxygen time series. This graphical
representation of the wavelet variance enables the researcher to answer questions
concerning the dominant scale of variation in the time series, the homogeneity
of variations from one scale to the next, and the importance of the variations at
one scale compared to the variations occurring at another scale.


Figure 39: Wavelet variance of dissolved oxygen with db4 (horizontal axis: wavelet scale, 1 to 32; vertical axis ticks from 0.05 to 2.00)


Literature

Adorf, H.-M., 1995: Interpolation of Irregularly Sampled Data Series - A Survey.

In: Shaw, R. A., H. E. Payne and J. J. E. Hayes (eds.): Astronomical Data

Analysis Software and Systems IV. ASP Conference Series, Vol. 77,

Academic Press, New York, pp. 1-4.

Box, G. E. P., G. M. Jenkins and G. C. Reinsel, 1994: Time Series Analysis. 3rd

ed., Prentice Hall, Englewood Cliffs.

Brémaud, P., 2002: Mathematical Principles of Signal Processing. Springer,

New York.

Brockwell, P. J. and R. A. Davis, 1998: Introduction to Time Series and Fore-

casting. Springer, Berlin.

Franses, P. H., 1999: Periodicity and Structural Breaks in Environmetric Time

Series. In: Mahendrarajah, S., A. J. Jakeman and M. McAleer (eds.):

Modelling Change in Integrated Economic and Environmental Systems.

Wiley, New York.

Gentili, S., Magnaterra, L., and G. Passerini, 2004: An Introduction to the statis-

tical filling of environmental data time series. In: Latini, G. and G.

Passerini (eds.): Handling Missing Data. WIT Press, Southampton, pp. 1-

27.

Han, J. and M. Kamber, 2006: Data Mining: Concepts and Techniques. Morgan

Kaufmann, New York.

Hipel, K. W. and A. I. McLeod, 1994: Time Series Modelling of Water Re-

sources and Environmental Systems. Elsevier, Amsterdam.

Jørgensen, S. E. and W. J. Mitsch (eds.), 1983: Application of Ecological Modelling in

Keith, L. H. (ed.), 1988: Principles of Environmental Sampling. ACS Profes-

sional Reference Book, ASC, Salem.

Latini, G. and G. Passerini (eds.), 2004: Handling Missing Data. WIT Press,

Southampton.

Little, R. J. A. and D. B. Rubin, 1983: Missing Data in Large Data Sets. In:

Wright, T. (ed.): Statistical Methods and the Improvement of Data Quality.

Academic Press, London, pp. 73-82.

Little, R. J. A. and D. B. Rubin, 1987: Statistical Analysis with Missing Data. Wiley, Chichester.


Mallat, S., 1998: A Wavelet Tour of Signal Processing. Academic Press, New

York.

Müller, W. G., 2001: Collecting Spatial Data. Springer, Berlin.

Pollock, D. S. G., 1999: A Handbook of Time-Series Analysis, Signal Process-

ing and Dy

Powell, T. M. and J. H. Steele (eds.), 1995: Ecological Time Series. Chapman &

Hall, New York.

Warner, R. M., 1998: Spectral Analysis of Time-Series Data. Guilford Press, New

York.

Reckhow, K. H. and S. C. Chapra, 1983: Engineering Approaches for Lake

Management. Vol. 1: Data Analysis and Empirical Modelling. Butterworth,

Woburn.

Shumway, R. H. and D. S. Stoffer, 2000: Time Series Analysis and Its Applica-

tions. Springer, New York.

Stein, M. L., 1999: Interpolation of Spatial Data. Springer, Berlin.

Straškraba, M. and A. Gnauck, 1985: Freshwater Ecosystems Modelling and

Simulation. Elsevier, Amsterdam.
