Stat Mod 1011


  • 8/3/2019 Stat Mod 1011

    1/67

    Brandenburg University of Technology at Cottbus

    Dept. of Ecosystems and Environmental Informatics

    Statistical Modelling

    Univ.-Prof. Dr. habil. Albrecht Gnauck

    International Master Course of Study Hydroinformatics EuroAquae

    Winter term 2010/2011


    Contents

    1. Events and data
       1.1 Analysis and control of aquatic ecosystems
       1.2 Statistical management of ecological data
       1.3 Sampling strategies
       1.4 Re-sampling and pre-treatment of data

    2. Probability functions and statistical measures
       2.1 Probability functions of ecological data
       2.2 Normal and skewed probability distribution functions
       2.3 Comparison of expectations
       2.4 Statistical measures

    3. Statistical test procedures
       3.1 Introduction
       3.2 General procedure of hypothesis testing
       3.3 Rules of decision
       3.4 Selected test procedures

    4. Linear regression and correlation analysis
       4.1 Steps of linear regression
       4.2 Confidence region of regression line
       4.3 The power of linear regression
       4.4 Empirical covariance and statistical measures of correlation

    5. Nonlinear regression analysis
       5.1 Polynomial regression
       5.2 Periodic regression
       5.3 Trend functions
       5.4 Comparison of regression functions

    6. Time series analysis
       6.1 Dynamic behaviour of time series
       6.2 Description of time series in the time and frequency domain
       6.3 Stationary processes
       6.4 Correlation and spectral functions

    7. Analysis of cycling processes
       7.1 Introduction
       7.2 Fourier analysis
       7.3 Digital data filter
       7.4 Wavelets

    Literature


    1. Events and data

    Statistical modelling of hydrological systems is an important task to extract

    information from former and actual states of aquatic ecosystems (aquifers,

    freshwater ecosystems, marine ecosystems) by means of water quantity data

    and water quality data. Holism and reductionism are the two different ap-

    proaches to study and model ecological processes and systems. Both ap-

    proaches are needed for ecosystems modelling, simulation and management.

    Holism

    Aquatic ecosystems are complex systems with nonlinear interrelation-

    ships. Holism attempts to reveal the properties of ecosystems by studying the

    system as a whole. The system properties cannot be found by a study of the system components separately. It is required that the study be on the system

    level. This does imply that a study of the ecosystem components is not suffi-

    cient. The components of ecosystems are coordinated to such an extent that

    ecosystems work as indivisible units. A study at the level of the ecosystem components alone will never reveal the ecosystem properties.

    Reductionism

    To simplify the ecosystem study and to facilitate the interpretation of ecological processes the ecosystem components are separated from the system

    level. This method is useful to find governing relationships in real systems. This

    method has obvious shortcomings when the functioning of the entire ecosystem

    is to be revealed. As an example: A forest is more than the sum of all trees.

    The analysis and control of dynamic aquatic ecosystems such as ponds,

    lakes, reservoirs and river basins is often a complicated task because of the

    high number of system elements (or components) and interrelationships be-

    tween system elements and between system elements and their environments.

    To solve management problems the system has to be decomposed and nonlin-

    ear interrelationships have to be linearised. Furthermore, the controllability of

    aquatic ecosystems has to refer to different and parallel working subsystems

    and system states. The quality of aquatic ecosystem analysis depends on the

    flexibility of statistical models used. The restricted information structure of complex aquatic ecosystems and aggregation of information lead to uncertainties of the modelling process and of the resulting models. Dynamic processes within


    aquatic ecosystems are initiated by switching of input and state variables. They

    result in rapid changes of system states and output variables (non-autonomous

    control) or in slow changes (autonomous control). In general, complex dynamic

    systems like hydrological systems (aquatic ecosystems) are characterised by

    three features (table 1).

    Table 1: General characterisation of complex dynamic systems

    Feature                            Solving procedure
    High dimension                     Decomposition of system
    Uncertainty                        Analysis of dynamic characteristics (observability, controllability, perturbability, reachability, robustness, stability, sensitivity)
    Restricted information structure   Aggregation of information

    Statistical modelling of hydrological systems is based on data. They are observations about characteristics and/or attributes of hydrological input, state and

    output variables. A group of state variables under study is called a (statistical)

    population (e. g. data of water flow, salinity data of river water, BOD data of a

    waste water treatment plant). If the frequency distribution of the attributes of a population is known, then it is possible to describe it by a probability density function or probability distribution function, which is an analytical function defined by a number of parameters.

    For the study of aquatic ecosystems a subset of the population, a sample, is used. A population is denoted as univariate if only one variable (or water quality indicator) is considered. Common univariate measures are averages, which locate the centre of a data cloud along an axis, and measures of dispersion such as the variance or the range. If more than one variable (or indicator) is considered, the population is denoted as multivariate.

    Regression and correlation analysis belong to experimental statistical

    modelling of hydrological systems which is based on methods of the theory

    of probability. To solve practical problems such approaches are necessary

    which are compatible with the stochastic nature of the input variables and state

    equations. Statistical procedures will be the adequate mathematical methods as

    long as the processes within the systems and their describing equations are


    unknown. A distinction is made between two groups of methods depending on

    whether the variable time is included or not: Static methods (without consid-

    eration of time as variable) and dynamic methods (with consideration of the

    variable time). The latter one is often called time series analysis or dynamic sta-

    tistics. Simple and multiple linear and non-linear regression and correlation be-

    long to static methods as well as multivariate statistical procedures. Static pro-

    cedures answer the question whether there is a relationship between two or

    more variables of an environmental system. This question can be answered by

    a regression analysis, which yields the type of relationship between variables.

    Statistical modelling is done for different purposes. Administrations as well as industrial and agricultural companies use statistical data and results to plan their

    operations and economic developments. Researchers use statistics mainly as a

    first step to derive new scientific results. Therefore, the topics of statistical

    modelling can be formulated by:

    1. Data sampling (Methods: Sampling design, re-sampling, plausibility

    checks, outlier correction).

    2. Data analysis to fulfil the requirements of environmental administrations and associations (Methods: Descriptive statistics, frequency distributions, averages, variances, error correction, significance tests).

    3. Data analysis to fulfil the requirements of different professional users

    (e.g. industry, agriculture, forestry) (Methods: Explanatory statistics, mul-

    tivariate statistics, geostatistics, time series analysis).

    4. Basic research (Methods: Regression and correlation analysis, multi-

    variate statistics, advanced statistical techniques, digital data filtering,

    frequency analysis).

    Disturbances of statistical analysis of hydrological data are given by:

    1. Mostly, only small sets of representative, regularly sampled data are available.

    2. The power of natural and artificial (man-made) external as well as natural internal driving forces on hydrological indicators influences the quality of the data to be obtained.


    3. Mostly, the a-priori process information on water quality indicators is

    low.

    4. Hydrologic processes possess different rate constants.

    5. Cycling effects in hydrological data are induced by natural internal or

    external as well as by man-made external processes.

    Classification of hydrological data

    Hydrological data may be classified by their origin:

    1. Measured and/or observed data of hydrological indicators will be ob-

    tained by field samples and/or laboratory experiments. They are directly

    observed (direct observations) or indirectly observed (due to calibration

    of analytical instruments or sensors).

    2. Summary data will be derived from statistics or from restricted observable ecological or water quality indicators.

    3. Simulated data will be obtained by simulation models.

    1.1 Analysis and control of aquatic ecosystems

    An aquatic ecosystem is a biotic and functional system or unit, which is able to

    sustain life and includes all biological and non-biological variables in that unit.

    Spatial and temporal scales are not specified a-priori, but are entirely based

    upon the objectives of the ecosystem study. Ecosystems are often called com-

    plex systems.

    Several approaches exist to study the behaviour of ecosystems.

    Empirical studies collect bits of information. An attempt is made to integrate

    and assemble the studies into a complete picture.

    Comparative studies are presented to compare some structural and functional components for a range of ecosystem types.

    Experimental studies where manipulations of a whole ecosystem are used to

    identify and elucidate ecological mechanisms.

    Modelling and computer simulation studies to work out ecosystem man-

    agement plans and to derive eco-technological tools for goal oriented control

    actions.


    Information systems and decision support systems studies to support in-

    dustrial, agricultural and administrative ecological decisions and to work out

    medium-term and long-term development plans for ecological management.

    Like many words for which people have an intuitive understanding, a system is difficult to define precisely. In relation to the physical and biological sciences, a

    system is an organised collection of interrelated physical components charac-

    terised by a boundary and functional unity. A system is a collection of communi-

    cating materials and processes that together perform some set of functions. A

    system is an interlocking complex of processes characterised by many recipro-

    cal cause-effect pathways.

    A system is a set of interrelated objects (elements, parts) that have certain gen-

    eral properties:

    1. It fulfils a certain function, i.e. it can be defined by a system purpose recog-

    nisable by an observer.

    2. It has a characteristic constellation of essential system elements and an es-

    sential system structure which determine its function, purpose, and identity.

    3. It loses its identity if it is destroyed.

    Analysis and control of aquatic ecosystems are often complicated because

    of the high number of system elements and interrelationships between sys-

    tem elements and between an ecosystem and its environment. Mostly, an eco-

    system will be analysed as one unit. Dynamic processes within ecosystems are

    initiated by switching processes of input and state variables with different trans-

    fer time constants (fig. 1). If they are overlaid by external and internal distur-

    bances it can not be distinguished which part of ecosystem response and its

    intensity stem from a single ecological element.

    For ecosystem analysis, the complex structure of an ecosystem requires its de-

    composition and linearization of nonlinear interrelationships. The controllability

    of ecosystems has to refer to different working elements (or subsystems) and

    system states. Therefore, the whole ecosystem will be divided into several sub-

    systems with internal and external feedbacks. This leads to uncertain state-

    ments on the ecosystem behaviour. The quality of ecosystem analysis depends


    on the flexibility of mathematical models used for computation. Restricted infor-

    mation structure and aggregation of information lead to model errors.

    Figure 1: Switching processes within a freshwater ecosystem

    Ecosystems are multidimensional systems with several input and output

    variables. They can be seen as black box, grey box or white box systems. Depending on the numbers of input and output variables, SISO, SIMO, MISO and MIMO systems are distinguished.

    Ecosystems can be considered as stochastic transfer systems described by their state variables and parameters. They are characterised by measurable

    inputs, immeasurable (stochastic) disturbances as well as by measurement er-

    rors. In the case of real systems, disturbances, input signals and measurement

    errors will be overlaid and produce disturbed (and unsure) output signals.

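    Transfer functions are represented by

    1. Pulse function: $x(t) = 0$ for $t < 0$ and $t > T$, $x(t) = x_0$ for $0 \le t \le T$,

    2. Jump function: $x(t) = x_0 \sigma(t)$ with $\sigma(t) = 0$ for $t < 0$ and $\sigma(t) = 1$ for $t \ge 0$,

    3. Harmonic function: $x(t) = x_0 \cos(\omega t + \varphi)$ for $-\infty < t < +\infty$ or $x(t) = x_0 e^{j(\omega t + \varphi)} = x_0^{+} e^{j\omega t}$ with $x_0^{+} = x_0 e^{j\varphi}$,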


    4. White noise function.

    Other transfer functions are

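    1. Exponential function: $x(t) = x_0 e^{-t/T}$ for $0 \le t < +\infty$ or $x(t) = x_0^{+} e^{j\omega t} e^{\alpha t}$ for $0 \le t < +\infty$ and $\alpha \le 0$,

    2. Periodic function: $x(t) = a_0/2 + \sum_i a_i \cos(i\omega_0 t) + \sum_i b_i \sin(i\omega_0 t)$ or $x(t) = \sum_i c_i e^{j i \omega_0 t}$,

    3. Dirac impulse: $x(t) = 0$ for $t < 0$ and $t > T$, $x(t) = \delta(t)$ with $\delta(t) = 0$ for $t \ne 0$ and $\int \delta(t)\,dt = 1$,

    4. Ramp function: $x(t) = 0$ for $t < 0$ and $x(t) = at$ for $t \ge 0$,

    5. Time discrete signal: $\tilde{x}(t) = \sum_k x(kT)\,\delta(t - kT)$ with $k = 0, 1, 2, \dots$ and $T \le 1/(2 f_{max})$, where $f_{max}$ is the maximum frequency contained in the data series.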
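    Some of the signals listed above can be sketched as small Python functions. This is only an illustrative sketch: the function names and default values are hypothetical, the exponential is set to zero before t = 0 as an assumption (the text defines it only for t >= 0), and the last function uses the sampling condition T <= 1/(2 f_max).

```python
import math

def pulse(t, x0=1.0, T=1.0):
    """Pulse function: x0 for 0 <= t <= T, otherwise 0."""
    return x0 if 0.0 <= t <= T else 0.0

def jump(t, x0=1.0):
    """Jump (step) function: 0 for t < 0, x0 for t >= 0."""
    return x0 if t >= 0.0 else 0.0

def ramp(t, a=1.0):
    """Ramp function: 0 for t < 0, a*t for t >= 0."""
    return a * t if t >= 0.0 else 0.0

def exponential(t, x0=1.0, T=1.0):
    """Decaying exponential x0*exp(-t/T) for t >= 0 (assumed 0 for t < 0)."""
    return x0 * math.exp(-t / T) if t >= 0.0 else 0.0

def max_sampling_interval(f_max):
    """Largest admissible sampling interval T = 1/(2*f_max)."""
    return 1.0 / (2.0 * f_max)
```

    For a process whose fastest cycle has a frequency of 0.5 per day, max_sampling_interval(0.5) gives a sampling interval of at most one day.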

    Feedback structures (or couplings) within ecosystems are given by simple

    feed-forward, feed-back self-tuning or complicated couplings between the eco-

    system elements.

    1.2 Statistical management of ecological data

    To handle and investigate hydrological data meaningfully, they should be characterised by the relationships that hold between the observations. The information content of hydrological data increases with the number of admissible data operations. Four scales can be distinguished (fig. 2).

    Figure 2: Data scales in hydrological research (information content increases from nominal scale through ordinal scale and interval scale to ratio scale)


    Transformations from one data scale to another serve to unify variables (tab. 2). The information content (knowledge, antithesis of uncertainty)

    and the scale level should not be changed during sampling and/or statistical

    data analysis. If there is no empiric equivalence scale, then the data are valu-

    ated as comparable.

    Table 2: Comparison of data scales in hydrology

    Scale            Arithmetic operation   Statistical measure
    Ratio scale      +, -, ×, /             Geometric mean
    Interval scale   +, -                   Arithmetic mean
    Ordinal scale    none                   Median, quartiles
    Nominal scale    none                   Frequencies only

    Nominal scale:

    No relationship between events, sometimes they are coded by numbers (e. g.

    lottery, pie charts), no arithmetic operation possible.

    Ordinal scale:

    Ranking of events or representations, classification of environmental indicators

    (e. g. EU water quality classes, soil classes etc.), ordinal comparisons are pos-

    sible: Class I > Class II, estimation of median and quartiles.

    Interval scale:

    Ordinal scale with equal intervals (e. g. water temperature), statements on dis-

    tances and differences between data are allowable. No natural origin (Zero

    point) exists.

    Ratio scale:

    It is an interval scale with a natural origin and allows statements on ratios (e.

    g. concentrations).
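    The scale-dependent measures of Table 2 can be illustrated with Python's standard statistics module. The data values below are hypothetical and serve only to show which measure fits which scale.

```python
import statistics
from collections import Counter

# Hypothetical example data, one list per scale level
concentrations = [1.0, 10.0, 100.0]             # ratio scale (mg/l)
temperatures = [4.0, 12.0, 20.0]                # interval scale (degrees C)
quality_classes = [1, 2, 2, 3, 4]               # ordinal scale (class numbers)
substrates = ["sand", "gravel", "sand", "mud"]  # nominal scale

geo_mean = statistics.geometric_mean(concentrations)  # ratio scale
arith_mean = statistics.mean(temperatures)            # interval scale
median_class = statistics.median(quality_classes)     # ordinal scale
frequencies = Counter(substrates)                     # nominal scale: counts only
```

    Applying an arithmetic mean to the class numbers or to the substrate codes would run without error, but the result would have no meaning on those scales.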

    One of the most important characteristics of hydrological data is their uncertainty

    which can be characterised as a state or condition of incomplete or unreliable

    knowledge. Sources of uncertainty are characterised by

    1. Statistical analysis depends on the a-priori information of essential hy-

    drological variables considered.

    2. Hydrological variables and their rates of change have different scales

    in time and space.


    3. Mostly, a small set of representative data will be available.

    The strength of disturbances of the data observed leads to fuzzy effects of in-

    terpretations.

    Figure 3 shows different types of annual water quality data series which can

    be distinguished by their statistical measures:

    Figure 3: Data series of water quality samples (nine panels plotted over the months J to D: TW (°C), NH4 (mg/l), Lf (mS/cm), pH-value, NO2 (mg/l), o-PO4-P (mg/l), O2 (mg/l), NO3 (mg/l), DOC (mg/l))

    The quality and usability of hydrological data usually depend strongly on the suitability of the sample and the adequacy of the sampling or monitoring program. The goal of sampling is to get information about the frequency distributions of data indicating environmental states or about the distribution parameters. Estimates of these quantities are called sample statistics and form a base for prognoses of environmental developments in general, but also of hydrological changes. If an investigation is based on samples, then the sample statistics depend on the particular sampling environment, on stationary or non-stationary external or internal effects as well as on random influences. Sampling frequency depends on hydrologic process dynamics, on the degree of water pollution, on the type of pollution, and on the type of substance. Different results may


    be obtained if different samples are selected. This variation in the data from

    sample to sample is called sampling variability. The difference between a sta-

    tistic and the true population value is called sampling error. It increases if more

    random factors influence the sampling procedure. There is a margin of uncer-

    tainty expressed in terms of the sampling variance of the estimator. Sampling

    variance is a measure of the precision of the estimates.

    Comparison of hydrological data series:

    1. Average is time dependent, dispersion is approximately time constant.

    2. Average is approximately time constant, dispersion is time dependent.

    3. Average and dispersion are time dependent.

    Variability within data series is caused by:

    1. Environmental influences or factors,

    2. Intrinsic factors between water samples,

    3. Different sample treatment,

    4. Different data treatment.

    1.3 Sampling strategies

    Ecological data are obtained by field samples and/or laboratory analysis.

    They are directly observed (direct observations) or indirectly observed (due to

    calibration of analytical instruments and sensors). Summary data are derived

    from statistics or by restricted observable indicators. Simulated data are ob-

    tained by simulation models.

    Sampling design is based on different procedures. The most commonly used designs are

    1. Systematic (periodic) sampling (yearly, monthly, weekly, daily, and hourly).

    2. Sampling based on the level of admissible error of the annual mean.

    3. Random sampling.

    4. Sample size for normally distributed data without trend and periodicities: $n = \left( t(95)\, v / e(x^*) \right)^2$ with $t(95) = 1.96$, the coefficient of variation $v = (s / x^*) \cdot 100$ in %, and $e(x^*) = 10\,\%$ as the allowable deviation from the mean.
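    The sample size formula of item 4 can be sketched in a few lines of Python. The sketch assumes that v is the coefficient of variation in percent (standard deviation divided by mean, times 100) and rounds the result up to the next whole sample, which is an added convention; the function name is hypothetical.

```python
import math

def sample_size_from_cv(v_percent, e_percent=10.0, t=1.96):
    """Sample size n = (t * v / e)^2, rounded up to a whole number.

    v_percent: coefficient of variation in % (assumed definition: s / mean * 100)
    e_percent: allowable deviation from the mean in %
    t:         t(95) = 1.96 as given in the text
    """
    n = (t * v_percent / e_percent) ** 2
    return math.ceil(n)

# Hypothetical example: a coefficient of variation of 25 %
n_required = sample_size_from_cv(25.0)
```

    With a coefficient of variation of 25 % and an allowable deviation of 10 %, the formula gives (1.96 · 25 / 10)² = 24.01, so 25 samples would be needed under these assumptions.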


    The sampling location in space and time can have a very real effect on the qual-

    ity and usefulness of data in hydrology. Site selection should be made primarily

    on the basis of the goal of the study as well as on the nature of the hydrologic

    process or phenomenon under consideration. Optimum number of samples,

    frequency of sampling and spacing can be estimated either by preliminary sam-

    pling experiments, by conclusions from expert knowledge, by practical experi-

    ences, or by statistical sampling design formulas and methods. Geostatistical

    methods can be helpful to determine optimal space distribution of sampling

    points.

    The sampling procedure covers three parts.

    1. Hypothesis (program purpose, sampling design, formulation of questions),

    2. Observation (sampling techniques, sampling protocol, analytical tech-

    niques),

    3. Interpretation (data analysis, interpretation of results)

    Recommendations for hydrological sampling:

    1. The goals and needs for hydrological data collection should be formulated explicitly for each application before sampling is started.

    2. Prior knowledge of factors that affect hydrological variables to be sampled should be given.

    3. During sampling significant changes of external and internal driving

    forces should not take place.

    4. Existing estimates may be sufficient if they were obtained by an unbi-

    ased sampling design.

    5. Sampling design in hydrology should cover the water budget (surface

    and groundwater), hydrochemical variables (organic and inorganic sub-

    stances, metabolites), hydrophysical variables (considering internal and

    external driving forces), hydrobiological variables (life cycle of plants and

    organisms, conversion of organic and inorganic substances), microbi-

    ological variables, and other variables as required.

    Disturbances of data analysis:

    Only small sets of representative, regularly sampled data are available.


    The power of external and internal driving forces on water qual-

    ity (hydrological) indicators influences the quality of data to be ob-

    tained.

    The a-priori process information on water quality (hydrological)

    indicators is low.

    Water quality (hydrologic) processes possess different rate con-

    stants.

    1.4 Re-sampling and pre-treatment of data

    Series of measurements of hydrological data are time series of data recorded at

    discrete points in time often with unequal sampling intervals. In practice, they

    often contain missing data or they are based on different sampling intervals in

    time and space. To extract hydrologic process information from single data

    (events) the data series should be completed and based on a regular sampling

    grid. The application of static and dynamic statistical methods for analysing

    such data sets requires equidistant data. Re-sampling generally means data

    interpolation or, in the case of noisy information, data approximation. Figure

    4 gives an overview on these procedures.

    Figure 4: Interpolation, approximation and digital filtering of data (flow chart with the elements: raw hydrological data; interpolation; equidistant data; approximation, static and dynamic, leading to a functional relationship; digital data filtering, high pass and low pass; consistent data)


    The goal of applying interpolation and approximation methods to incomplete time series is to fill the intervals between two grid points so that series of measurements with small, uniform sampling intervals are obtained. Table 3 contains some commonly used interpolation methods.
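    As an illustration of re-sampling onto a regular grid, the following sketch applies piecewise linear interpolation to an irregularly sampled series. Linear interpolation is chosen here only as the simplest case; the function name and the example data are hypothetical.

```python
def resample_linear(times, values, step):
    """Interpolate irregular samples linearly onto an equidistant grid.

    times must be strictly increasing and contain at least two points.
    Returns the grid points and the interpolated values.
    """
    n_steps = int((times[-1] - times[0]) / step + 1e-9)
    grid = [times[0] + k * step for k in range(n_steps + 1)]
    resampled = []
    j = 0  # index of the left neighbour of the current grid point
    for t in grid:
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        weight = (t - t0) / (t1 - t0)
        resampled.append(values[j] + weight * (values[j + 1] - values[j]))
    return grid, resampled

# Hypothetical series with gaps (time in days)
days = [0.0, 1.0, 3.0, 4.0, 7.0]
obs = [10.0, 12.0, 16.0, 15.0, 9.0]
grid, filled = resample_linear(days, obs, step=1.0)
```

    The gaps at days 2, 5 and 6 are filled with values lying on the straight lines between the neighbouring observations. For noisy data an approximation method would be more appropriate than interpolation, as stated above.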

    Table 3: Interpolation methods

    Method Algorithm Characteristics

    +

    + 3,x*, s.

    The test statistic: $r = \dfrac{|x^{+} - x^*|}{s} \sqrt{\dfrac{n}{n-1}}$, where $x^{+}$ is the value suspected to be an outlier, $x^*$ is the expectation of the sample, $s$ is the standard deviation of the sample, and $n$ is the sample size. Choice of significance level $\alpha = 0.05$, degrees of freedom $f = n - 2$.

    Decision: Acceptance if $r_{calc} < r_{tab}$, otherwise rejection (cf. table 10).

    Table 10: Table of r test (according to Kaiser and Gottschalk 1974)

    f = n - 2   P(95)   P(99)   P(99.9)
    1           1.409   1.414   1.414
    2           1.645   1.715   1.730
    3           1.757   1.918   1.982
    4           1.814   2.051   2.178
    5           1.848   2.142   2.329
    6           1.870   2.208   2.447
    7           1.885   2.256   2.540
    8           1.895   2.294   2.616
    9           1.903   2.324   2.678
    10          1.910   2.348   2.730
    12          1.920   2.385   2.812
    14          1.926   2.412   2.874
    16          1.931   2.432   2.291
    18          1.935   2.447   2.205
    20          1.937   2.460   2.990
    50          1.951   2.529   3.166
    100         1.956   2.553   3.227
    200         1.958   2.564   3.265
    300         1.958   2.566   3.271
    500         1.959   2.570   3.279
    700         1.959   2.572   3.283
    ∞           1.960   2.576   3.291


    Example:

    Laboratory analysis of water quality yields a small set of BOD data with x1 = 30.4 mg/l, x2 = 30.1 mg/l, x3 = 30.5 mg/l, x4 = 30.9 mg/l, x5 = 29.2 mg/l. The last value is suspected to be an outlier. That would mean the data set is inhomogeneous.

    x* = 30.2; s = 0.638; n = 5

    Test statistic: $r = \dfrac{|29.2 - 30.2|}{0.638} \sqrt{\dfrac{5}{5-1}} = \dfrac{1.0}{0.638} \sqrt{\dfrac{5}{4}} = 1.567 \cdot 1.118 = 1.752$

    Comparison of $r_{calc}$ and $r_{tab}$ for $f = n - 2 = 3$: $r_{calc} = 1.752$, $r(95) = 1.757$, $r(99) = 1.918$, $r(99.9) = 1.982$.

    Decision: If $r_{calc} < r_{tab}$, then accept $x_5$: $1.752 < 1.757$.

    Result and interpretation: The value x5 is not an outlier and belongs to the data

    set. The data set itself seems to be homogeneous. In the case that a value has

    been found as an outlier the average and variance have to be re-calculated and

    tested again.
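    The arithmetic of this example can be checked with a short sketch. It takes the rounded summary values from the example (mean 30.2, s = 0.638) as inputs; the function names are hypothetical and the dictionary holds only the first P(95) entries of table 10.

```python
import math

# First P(95) entries of table 10, keyed by degrees of freedom f = n - 2
R_TAB_95 = {1: 1.409, 2: 1.645, 3: 1.757, 4: 1.814, 5: 1.848}

def r_statistic(suspect, mean, s, n):
    """Test statistic r = |x+ - x*| / s * sqrt(n / (n - 1))."""
    return abs(suspect - mean) / s * math.sqrt(n / (n - 1))

def is_outlier(suspect, mean, s, n, r_tab=R_TAB_95):
    """Reject the suspect value if r_calc is not below the tabulated value."""
    return r_statistic(suspect, mean, s, n) >= r_tab[n - 2]

r_calc = r_statistic(29.2, 30.2, 0.638, 5)       # about 1.752
x5_rejected = is_outlier(29.2, 30.2, 0.638, 5)   # False with the rounded inputs
```

    The margin to r(95) = 1.757 is small. With the unrounded mean of 30.22 the statistic comes out at about 1.79, so the decision in this example is sensitive to rounding of the intermediate values.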


    4. Linear regression and correlation analysis

    A regression analysis is required for problems in which stochastic dependencies (stochastic cause-effect relationships) have to be described by functions with one or several variables. Linear regression analysis is one of the best studied statistical methods. The goal of a simple or multiple linear regression analysis is the determination of a linear relationship between two or more measurable (or observable) variables or characteristics X and Y of a hydrological system. A sample of size n consists of n pairs of data (x1, y1), (x2, y2), …, (xn, yn) (or n-tuples of data), which can be considered as realisations of a two-dimensional (or multi-dimensional) random vector (X, Y).

    4.1 Steps of linear regression

    1. Step: Scatter-plot of variables of interest (fig. 10).

    Figure 10: Scatterplot of hydrological variables (x-axis: Temp, 0 to 30; y-axis: pH, 7.6 to 9.0)

    2. Step: Estimate the relationship (positive or negative) between variables.

    Directions of relationships

    1. Positive relationship: Increasing values of X and increasing values of Y.

    2. Negative relationship: Increasing values of X and decreasing values of Y.

    3. No relationship between X and Y (e. g. parallels to the axes).

    The relationships can be strong or weak.


    3. Step: Formulate the (linear) model equation (fig. 11):

    pH = 7.868 + 0.025 Temp

    Figure 11: Linear relationship between variables (linear regression between Temp, x-axis 0 to 30, and pH, y-axis 7.6 to 9.0; observed values and fitted line)
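    How the coefficients of such a model equation are obtained can be sketched with an ordinary least-squares fit. The data below are hypothetical and are not the measurements behind figure 11, so the resulting coefficients differ from those given above.

```python
def fit_line(x, y):
    """Least-squares estimates (a, b) of the linear model y = a + b*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = y_mean - b * x_mean  # intercept
    return a, b

# Hypothetical water temperature and pH observations
temp = [0.0, 10.0, 20.0, 30.0]
ph = [7.9, 8.1, 8.4, 8.6]
a, b = fit_line(temp, ph)   # a = 7.89, b = 0.024 (up to rounding)
```

    For these values the fitted line is pH = 7.89 + 0.024 Temp, which has the same form as the model equation of step 3.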

    4.2 Confidence region of regression line

    4. Step: Calculate the confidence region of the regression line.

    The general model of linear regression is given by $y = a + bx$. Using the confidence intervals of $a$ and $b$, a confidence region of the (mean) linear model $E(Y) = a + bx$ can be defined by $g_u < E(Y) < g_o$, where $g_u = y^* - s_{y^*} t$ and $g_o = y^* + s_{y^*} t$. The limits of confidence are symmetric hyperbolas around the linear regression model $y^* = a^* + b^* x$. Their distance from the regression line is smallest at $x = x^*$ and increases for other $x$ values, so the confidence statements become fuzzier away from the mean. The width of the confidence band $L$ depends on $s_{y^*}$ and can be calculated by $L = 2 s_{y^*} t$.
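    The confidence limits can be sketched numerically as follows. The text does not state how the standard error of the fitted value is computed; the sketch assumes the usual textbook expression, residual standard deviation times sqrt(1/n + (x - x_mean)² / Sxx), with n - 2 degrees of freedom. The data are hypothetical and the t value is a tabulated constant supplied by the user.

```python
import math

def confidence_band(x, y, x_new, t):
    """Lower and upper confidence limits (g_u, g_o) of the fitted mean at x_new."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_mean - b * x_mean
    # Residual standard deviation with n - 2 degrees of freedom (assumed formula)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_res = math.sqrt(sse / (n - 2))
    # Standard error of the fitted value at x_new (assumed textbook expression)
    s_fit = s_res * math.sqrt(1.0 / n + (x_new - x_mean) ** 2 / sxx)
    y_fit = a + b * x_new
    return y_fit - t * s_fit, y_fit + t * s_fit

temp = [0.0, 10.0, 20.0, 30.0]   # hypothetical data
ph = [7.9, 8.1, 8.4, 8.6]
T_95_F2 = 4.303                  # two-sided 95 % t value for f = 2
g_u_mid, g_o_mid = confidence_band(temp, ph, 15.0, T_95_F2)
g_u_end, g_o_end = confidence_band(temp, ph, 30.0, T_95_F2)
```

    The band evaluated at the mean of the x values (15.0) is narrower than the band at the edge of the data (30.0), which reproduces the hyperbolic shape described above.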

    4.3 The power of linear regression

    The strength of a relationship is expressed by the empirical (linear) correlation coefficient: $r = \dfrac{\sum (x_i - x^*)(y_i - y^*)}{\sqrt{\sum (x_i - x^*)^2 \sum (y_i - y^*)^2}}$. By means of this formula (explanation see chapter 4.4) the next step of the linear regression procedure is derived.


    5. Step: Calculate the power of relationship: $r = 0.493$ or $B = r^2 = 0.243$.

    The calculation algorithm is presented in chapter 4.4.

    To derive statistical characteristics of a linear regression model the following

    cases should be distinguished:

    1. b high, r high, s low,

    2. b high, r low, s high,

    3. b low, r low, s low,

    4. b low, r very low, s high.

    4.4 Empirical Covariance and statistical measures of correlation

    A correlation analysis answers the question about the strength and direction of a linear (but not strictly functional) relationship between two or more variables. The power or intensity of such a relationship is expressed by correlation. Measures of correlation are the correlation coefficient $r$, the performance index $B = r^2$ or the partial correlation coefficient $r_{xy,z}$.

    Combining data series of different water quantity or water quality variables referring to two or more measurable characteristics, sets of pairs of data (x1, y1), (x2, y2), …, (xn, yn) or n-tuples of data will be obtained (fig. 12).

    Figure 12: Scatterplot of a bivariate relationship (x-axis: X, 0 to 120; y-axis: Y, -20 to 120)

    These sets of data can be seen as realisations of a two- or multi-dimensional stochastic vector (X, Y, …). A normal probability distribution of the data pairs or data tuples is a (strong) prerequisite.


    A visualisation of a relationship between three variables is possible but in some

    cases not really helpful. The information content is high but cannot be extracted

    very clearly (fig. 13).

    Figure 13: 3-D scatterplot of variables (axes X, Y and Z)

    Such relationships are characterised by statistical measures which are denoted

    as correlation measures. In principle, arithmetic means and empirical variances

    of data series are used:

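    $$x^* = \frac{1}{n}\sum_{i=1}^{n} x_i \quad\text{and}\quad y^* = \frac{1}{n}\sum_{i=1}^{n} y_i$$

    $$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - x^*)^2 \quad\text{and}\quad s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - y^*)^2.$$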

    A new data series with $n$ pairs of data $(x_i, y_i)$, $i = 1, \ldots, n$, is formed by two
    variables {X} and {Y}. The empirical covariance $s_{xy}$ is calculated as follows:

    $$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - x^*)(y_i - y^*) = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i y_i - n\,x^* y^*\right).$$

    $s_{xy}$ can be positive or negative. For small values of $x_i$ the deviation $x_i - x^*$
    from the arithmetic mean is negative; for big values of $x_i$ it is positive. The same
    holds for the data $y_i$. For this reason, a negative covariance characterises a
    relationship where big values $x_i$ are mostly connected with small values $y_i$ and
    vice versa.

    By normalising $s_{xy}$ with the empirical standard deviations $s_x$ and $s_y$ one gets the
    empirical coefficient of correlation $r_{xy}$:

    $$r_{xy} = \frac{s_{xy}}{s_x s_y}.$$

    Because $s_{xy} = s_{yx}$, also $r_{xy} = r_{yx}$ is valid. $r_{xy}$ is a measure of the strength and
    direction of a linear relationship between the hydrological variables X and Y.
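    The covariance and correlation formulas can be illustrated with a minimal sketch. The function names and the sample values are assumptions chosen for illustration only; they are not data from this course.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Empirical covariance s_xy with the 1/(n-1) normalisation."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)

def covariance_shortcut(x, y):
    """Equivalent form: (sum of x_i*y_i - n * mean_x * mean_y) / (n - 1)."""
    n = len(x)
    return (sum(xi * yi for xi, yi in zip(x, y)) - n * mean(x) * mean(y)) / (n - 1)

def correlation(x, y):
    """Empirical correlation coefficient r_xy = s_xy / (s_x * s_y)."""
    s_x = math.sqrt(covariance(x, x))  # s_xx is the empirical variance of x
    s_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (s_x * s_y)

# Hypothetical sample values (big x go with small y, so s_xy is negative)
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [9.1, 7.8, 6.2, 4.1, 2.9]

print("s_xy =", covariance(x, y))
print("r_xy =", correlation(x, y))
```

    Both covariance forms give the same value up to rounding, and swapping the arguments leaves the correlation coefficient unchanged.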

    Statistical measures of correlation between two or more hydrological variables

    are mainly based on the assumption that the data sets are subsets of Gaussian

    distributed data sets. The rank correlation procedure functions without assum-

    ing a normal probability distribution of the data set to be analysed.

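    Empirical bivariate correlation coefficient

    $$r = \frac{\sum_i (x_i - x^*)(y_i - y^*)}{\sqrt{\sum_i (x_i - x^*)^2 \sum_i (y_i - y^*)^2}}$$

    Performance index (coefficient of determination)

    $$B = r^2$$

    Partial correlation coefficients

    $$r_{xy,z} = \frac{r_{xy} - r_{xz} r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}$$

    $$r_{xz,y} = \frac{r_{xz} - r_{xy} r_{yz}}{\sqrt{(1 - r_{xy}^2)(1 - r_{yz}^2)}}$$

    $$r_{yz,x} = \frac{r_{yz} - r_{xy} r_{xz}}{\sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}}$$

    Multiple correlation coefficients

    Variables x, y, z with $x = f(y, z)$:

    $$R_{x.yz} = \sqrt{\frac{r_{xy}^2 + r_{xz}^2 - 2 r_{xy} r_{xz} r_{yz}}{1 - r_{yz}^2}}$$

    Multiple performance index

    $$B_{x.yz} = \frac{r_{xy}^2 + r_{xz}^2 - 2 r_{xy} r_{xz} r_{yz}}{1 - r_{yz}^2}$$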
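    As a small sketch of how the partial and multiple measures follow from the three pairwise coefficients, the formulas above can be coded directly. The function names and the numerical values of the pairwise coefficients are hypothetical.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """Partial correlation r_xy,z: correlation of x and y with z held fixed."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def multiple_b(r_xy, r_xz, r_yz):
    """Multiple performance index B_x.yz for x = f(y, z)."""
    return (r_xy ** 2 + r_xz ** 2 - 2 * r_xy * r_xz * r_yz) / (1 - r_yz ** 2)

def multiple_r(r_xy, r_xz, r_yz):
    """Multiple correlation coefficient R_x.yz (square root of B_x.yz)."""
    return math.sqrt(multiple_b(r_xy, r_xz, r_yz))

# Hypothetical pairwise correlation coefficients
r_xy, r_xz, r_yz = 0.6, 0.5, 0.3

print("r_xy,z =", partial_r(r_xy, r_xz, r_yz))
print("B_x.yz =", multiple_b(r_xy, r_xz, r_yz))
print("R_x.yz =", multiple_r(r_xy, r_xz, r_yz))
```

    If y and z are uncorrelated ($r_{yz} = 0$), the multiple performance index reduces to the sum $r_{xy}^2 + r_{xz}^2$.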

    SPEARMAN's rank correlation

    (Valid for small sample size, normal probability distribution not necessary)

    $$r_S = 1 - \frac{6\sum_{i=1}^{n}\left(R(x_i) - R(y_i)\right)^2}{n(n^2 - 1)} = 1 - \frac{6\sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}$$

    Table 11 contains data and an explanation of the ranking procedure for a
    SPEARMAN test.

    Table 11: Data and procedure of rank correlation

    x_i    R(x_i)   y_i   R(y_i)   D_i     D_i^2
    0.5     5.5       4     3       2.5      6.25
    0.8     7.5       6     5       2.5      6.25
    1.1    10         2     1       9       81
    0.5     5.5      10     8      -2.5      6.25
    0.4     4         8     6      -2        4
    0.3     2        12    10      -8       64
    0.9     9         5     4       5       25
    0.8     7.5       3     2       5.5     30.25
    0.3     2         9     7      -5       25
    0.3     2        11     9      -7       49
    Sum                                    297

    Result: $r_S = -0.8$

    Comparison of $r_S$ and $r_{S,tab}$ (positive values only):

    For $n \le 30$ the table of probability values of $r_S$ has to be used. For $n > 30$ the
    table of the standardised normal probability distribution should be used:

    $r_{S,tab}(95) = 0.5515$; $r_{S,tab}(99) = 0.7333$; $r_{S,tab}(99.9) = 0.8667$.

    Decision: If rS rStab, then reject rS. the example shows that for each signifi-

    cance level rSrStab is valid.

    Result and interpretation: Between both data sets exists a relatively strong

    negative correlation.
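    The ranking procedure of table 11 can be retraced with a short sketch. Tied values receive the average of the ranks they occupy, as in the table. The function names are assumptions made for this illustration; the simple $D_i^2$ formula is applied without a correction for ties.

```python
def average_ranks(values):
    """Ranks 1..n in ascending order; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman(x, y):
    """Return (sum of D_i^2, r_S) using r_S = 1 - 6*sum(D_i^2) / (n*(n^2 - 1))."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return sum_d2, 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Data of table 11
x = [0.5, 0.8, 1.1, 0.5, 0.4, 0.3, 0.9, 0.8, 0.3, 0.3]
y = [4, 6, 2, 10, 8, 12, 5, 3, 9, 11]

sum_d2, r_s = spearman(x, y)
print("ranks of x:", average_ranks(x))
print("sum of D_i^2:", sum_d2)
print("r_S:", round(r_s, 4))
```

    The sketch reproduces the rank columns, the sum 297 and the result $r_S = -0.8$ of table 11.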


    5. Nonlinear regression analysis

    In the case that a linear regression model is not valid or insufficient, other
    regression models should be tested. From this statement the following step of the
    (linear) regression procedure is derived:

    6. Step: Find out other model types if the linear model is insufficient (fig. 14).

    Figure 14 contains some standard nonlinear regression models computed by

    means of SPSS. The results are presented in table 12.

    Figure 14: Linear and nonlinear regression curves (pH against Temp; observed
    values and fitted curves: linear, logarithmic, inverse, quadratic, cubic, compound,
    power, S-shaped, growth, exponential, logistic)

    Table 12: Results of nonlinear regression models

    Model   B       b0       b1        b2        b3
    LIN     0.243   7.8683    0.0252
    LOG     0.198   7.6798    0.2194
    INV     0.158   8.3568   -1.2875
    QUA     0.308   8.1097   -0.0310    0.0022
    CUB     0.432   7.4486    0.2131   -0.0191    0.0005
    COM     0.238   7.8731    1.0030
    POW     0.194   7.6972    0.0262
    S       0.156   2.1219   -0.1544
    GRO     0.238   2.0635    0.0030
    EXP     0.238   7.8731    0.0030
    LGS     0.238   0.1270    0.9970
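    To make the coefficients of table 12 tangible, the following sketch evaluates every model at one water temperature. The functional forms are an assumption: they follow the usual definitions of the SPSS curve estimation models (for example compound $y = b_0 b_1^t$, growth $y = e^{b_0 + b_1 t}$, logistic without upper bound $y = 1/(b_0 b_1^t)$) and are not stated in the text itself. The evaluation temperature of 15 is likewise chosen freely.

```python
import math

# Model equations (assumed SPSS curve estimation forms), coefficients from table 12
MODELS = {
    "LIN": lambda t: 7.8683 + 0.0252 * t,
    "LOG": lambda t: 7.6798 + 0.2194 * math.log(t),
    "INV": lambda t: 8.3568 - 1.2875 / t,
    "QUA": lambda t: 8.1097 - 0.0310 * t + 0.0022 * t ** 2,
    "CUB": lambda t: 7.4486 + 0.2131 * t - 0.0191 * t ** 2 + 0.0005 * t ** 3,
    "COM": lambda t: 7.8731 * 1.0030 ** t,
    "POW": lambda t: 7.6972 * t ** 0.0262,
    "S":   lambda t: math.exp(2.1219 - 0.1544 / t),
    "GRO": lambda t: math.exp(2.0635 + 0.0030 * t),
    "EXP": lambda t: 7.8731 * math.exp(0.0030 * t),
    "LGS": lambda t: 1.0 / (0.1270 * 0.9970 ** t),
}

temperature = 15.0  # hypothetical water temperature
predictions = {name: model(temperature) for name, model in MODELS.items()}

for name, ph in predictions.items():
    print(f"{name:4s} pH = {ph:.3f}")
```

    Under these assumed forms the compound, growth, exponential and logistic models describe practically the same curve, which agrees with their identical performance index B = 0.238 in table 12.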


    When comparing the performance indexes of these standard models the best
    statistical model is the cubic one. But this model explains only 43.2% of the
    variation of the data cloud; the remaining 56.8% are not described by the model.
    As an overall outcome of this analysis all of these models should be rejected and
    other types of nonlinear models should be investigated.

    5.1 Polynomial regression

    The basic model is given by $y = a_0 + \sum_{i=1}^{n} a_i x^i$, where $n$ is called the order
    of the polynomial. Figure 15 shows polynomials of different order. Each of the
    polynomials represents the given data set with a relatively high degree of
    performance. For 6th and 7th order polynomials the performance will be $B = 1$.

    Figure 15: Examples of polynomial regression


    By comparing the graphs different interpretations are possible. For the polynomial
    of 7th order the graph indicates negative values which do not exist. The advantage
    of polynomial regression is that it provides an algorithm for calculating the
    existing nonlinear relationship between hydrological variables. Disadvantages
    are the high number of coefficients and sometimes physically unrealistic results.
    The best models are not the ones whose graphs join all data points.
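    The effect that a polynomial of sufficiently high order reaches B = 1 by passing through every data point can be reproduced with a minimal least-squares sketch. The data values, the rescaling of x to the interval [-1, 1] and all function names are assumptions for illustration; the solver uses the normal equations with Gaussian elimination, which is adequate only for such small, well-scaled examples.

```python
def fit_polynomial(x, y, order):
    """Least-squares coefficients a_0..a_order via the normal equations."""
    m = order + 1
    a_mat = [[sum(xi ** (r + c) for xi in x) for c in range(m)] for r in range(m)]
    rhs = [sum(yi * xi ** r for xi, yi in zip(x, y)) for r in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a_mat[r][col]))
        a_mat[col], a_mat[pivot] = a_mat[pivot], a_mat[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, m):
            factor = a_mat[r][col] / a_mat[col][col]
            for c in range(col, m):
                a_mat[r][c] -= factor * a_mat[col][c]
            rhs[r] -= factor * rhs[col]
    coeffs = [0.0] * m
    for r in reversed(range(m)):  # back substitution
        tail = sum(a_mat[r][c] * coeffs[c] for c in range(r + 1, m))
        coeffs[r] = (rhs[r] - tail) / a_mat[r][r]
    return coeffs

def performance_index(x, y, coeffs):
    """B = 1 - (residual sum of squares) / (total sum of squares)."""
    y_mean = sum(y) / len(y)
    fitted = [sum(a * xi ** j for j, a in enumerate(coeffs)) for xi in x]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Six hypothetical observations; x is rescaled to [-1, 1] for numerical stability
x = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]
y = [2.1, 1.2, 1.6, 0.9, 1.9, 2.4]

b_by_order = {n: performance_index(x, y, fit_polynomial(x, y, n)) for n in (1, 2, 5)}
for n, b in b_by_order.items():
    print(f"order {n}: B = {b:.4f}")
```

    With six data points the 5th order polynomial interpolates all of them, so B equals 1 up to rounding, although the curve between the points need not be physically meaningful.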

    Other model types used in water quality management are multiple linear or

    nonlinear regression models (e. g. DO(t) = a0 + a1TW + a2Q + a3BSB or DO(t) =

    a0+ a1TW + a2Q + a3BSB + a4TW + a5Q + a6BSB + a7TW) or models derived

    from control theory (e. g. stochastic transfer method). A continuous dynamic
    process is described by a time-discrete model applying the z-transformation to
    a difference equation, $G(z) = B(z^{-1})/A(z^{-1}) + (z)$

    5.2 Periodic regression

    The basic relationship is given by $y = a + b_1 \sin x + b_2 \cos x$. The equation
    represents the simplest form of periodic regression, the so-called Fourier
    polynomial. In an extended form this method is called Fourier analysis (see
    chapter 7). In figure 16 water temperature of a reservoir at three depth levels
    (0 m, 10 m, 25 m) and the approximating graphs are presented.

    Figure 16: Periodic regression of water temperature in a reservoir


    It can clearly be seen that water temperature (and all other hydrological cycling

    variables) can be approximated very well by periodic functions. The advantage

    of this family of regression type functions is the visualisation of a cycling proc-

    ess, the disadvantage is that the functions are valid for fixed cycling periods

    only.
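    A minimal sketch of the periodic regression $y = a + b_1 \sin x + b_2 \cos x$ is given below. It uses closed-form least-squares estimates that hold only under the stated assumption of equidistant samples covering a whole number of periods; the synthetic series and the function name are hypothetical.

```python
import math

def fit_periodic(x, y):
    """Least-squares a, b1, b2 for y = a + b1*sin(x) + b2*cos(x).

    Closed form, valid only for equidistant x covering whole periods
    (then the constant, sine and cosine terms are mutually orthogonal).
    """
    n = len(y)
    a = sum(y) / n
    b1 = 2.0 / n * sum(yi * math.sin(xi) for xi, yi in zip(x, y))
    b2 = 2.0 / n * sum(yi * math.cos(xi) for xi, yi in zip(x, y))
    return a, b1, b2

# Synthetic yearly cycle sampled at 12 equidistant points (hypothetical values)
n = 12
x = [2 * math.pi * k / n for k in range(n)]
y = [12.0 + 4.0 * math.sin(xi) - 6.0 * math.cos(xi) for xi in x]

a, b1, b2 = fit_periodic(x, y)
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

    For irregularly sampled data the three coefficients would have to be estimated by a general least-squares solver instead of these sums.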

    5.3 Trend functions

    Medium-term and long-term temporal and spatial developments (trends) of
    hydrological variables can be estimated by simple, explicitly given functions.
    Parameter estimation is done by the method of least squares (MKQ). Figure 17
    shows the development of BOD along a river stretch following a polynomial of
    2nd order.

    Figure 17: Polynomial trend function for BOD in a river (BOD in mg/l at the
    sampling points TeK0030, SPK0010, SPK0020, Hv0190 and Hv0200; fitted trend
    $y = 0.0908x^2 - 0.5374x + 2.6386$, $R^2 = 0.9501$)

    Other examples of linear and nonlinear trend functions are presented in figures

    18 to 20.

    Figure 18: Linear trend of phosphate phosphorus in a channel (o-PO4-P in mg/l at
    the sampling points 25014, TeK0030, SPK0010, SPK0020, Hv0190 and Hv0200; fitted
    trend $y = 0.0166x + 0.0854$, $R^2 = 0.8938$)

    The linear function (also denoted as a polynomial of 1st order) is able to follow
    the increasing trend of phosphate phosphorus load due to waste water input in
    a low flow channel with acceptable accuracy. The deviations of the regression line
    from the measurements are small. For the same river stretch, the approximating
    2nd order polynomial of water flow (fig. 19) shows stronger deviations after the
    confluence of the main river with a channel. The reasons for this are changing
    hydraulic conditions and increasing values of water flow. The stationary or uniform
    flow conditions of the first part of the water body are now disturbed. Judged by
    the performance index alone, the graph would be acceptable. But the regression
    model is not able to compensate for the positive jump in water flow because it
    works with fixed parameters (coefficients). Therefore, another regression model
    should be used.

    Figure 19: Quadratic trend function of water flow (flow in m3/s at the sampling
    points 25014, TeK0030, SPK0010, SPK0020, Hv0190 and Hv0200; fitted trend
    $y = 3.3914x^2 - 18.053x + 51.117$, $R^2 = 0.809$)

    On the other hand, for the same river stretch the trend of chlorophyll-a is again
    expressed by a 2nd order polynomial (fig. 20).

    Figure 20: Quadratic trend function of chlorophyll-a (chlorophyll-a in µg/l at the
    sampling points 25014, TeK0030, SPK0010, SPK0020, Hv0190 and Hv0200; fitted
    trend $y = 1.4627x^2 - 6.4221x + 67.115$, $R^2 = 0.6459$)

    The performance index is lower than that for water flow in fig. 19 because of
    some disturbances caused by hydrophysical phenomena. But the trend follows
    the computed polynomial. Taking into account the variations in the chlorophyll
    measurements, the trend polynomial is quite acceptable.

    The following table gives a survey on trend functions used to estimate the de-

    velopments of water quality in a river (table 13). All polynomials are of 2nd order.

    The signs in the last column indicate significance on a 95% probability level.

    Table 13: Trend functions of water quality in the River Havel

    Water quality indicator    Trend         R        P (95%)
    Water flow                 polynomial    0.8126   +
    Temperature                polynomial    0.6177   +
    Conductivity               polynomial    0.1971   -
    Chloride                   polynomial    0.0382   -
    DO                         polynomial    0.3858   +
    BOD                        polynomial    0.4264   +
    CSV                        polynomial    0.7611   +
    NH4-N                      exponential   0.5669   +
    NO2-N                      exponential   0.4879   +
    NO3-N                      exponential   0.4746   +
    O-PO4-P                    exponential   0.8683   +
    TP                         polynomial    0.0822   -
    SiO2                       polynomial    0.8888   +
    Suspended matter           polynomial    0.0227   -
    Chlorophyll-a              polynomial    0.6032   +
    Inorg. part of biomass     polynomial    0.6742   +
    Loss of org. matter        polynomial    0.1418   -

    As can be seen from table 13, polynomial and exponential trend functions are

    sufficient to describe the changing water quality mathematically.

    Interpretations of trend functions can be given as follows:

    Linear trend:

    $$y(t) = a_0(t) + a_1(t)\,x(t).$$

    (Interpretation of parameters: $a_0$: mean initial value, $a_1$: mean rate of
    change)

    Squared trend:

    $$y(t) = a_0(t) + a_1(t)\,x(t) + a_2(t)\,x^2(t).$$

    (Interpretation of parameters: $a_0$: mean initial value, $a_1$: mean rate of
    change, $a_2$: mean process acceleration)

    Polynomial trend:

    $$y(t) = a_0(t) + a_1(t)\,x(t) + a_2(t)\,x^2(t) + \ldots + a_n(t)\,x^n(t).$$

    (Interpretation of parameters is mostly impossible).

    Exponential trend:

    $$x(t) = x(0)\,e^{-kt} + E.$$

    (Interpretation according to 1st order kinetics: $x(0)$: initial concentration value,
    $k$: rate of change, $E$: random quota).
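    One common way to estimate the parameters of the exponential trend is to linearise it: taking logarithms of $x(t) = x(0)\,e^{-kt}$ gives $\ln x(t) = \ln x(0) - kt$, which can be fitted by ordinary linear regression. The sketch below rests on two assumptions that are not part of the text: the random quota $E$ is neglected and all observations are positive. The function name and the synthetic data are hypothetical.

```python
import math

def fit_exponential_trend(t, x):
    """Estimate x(0) and k of x(t) = x(0)*exp(-k*t) by regression on ln x."""
    n = len(t)
    logs = [math.log(v) for v in x]  # requires strictly positive observations
    t_mean = sum(t) / n
    l_mean = sum(logs) / n
    slope = (sum((ti - t_mean) * (li - l_mean) for ti, li in zip(t, logs))
             / sum((ti - t_mean) ** 2 for ti in t))
    intercept = l_mean - slope * t_mean
    return math.exp(intercept), -slope  # x(0), k

# Synthetic decay series without noise (hypothetical values)
t = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x = [5.0 * math.exp(-0.3 * ti) for ti in t]

x0, k = fit_exponential_trend(t, x)
print(f"x(0) = {x0:.3f}, k = {k:.3f}")
```

    With noisy data the logarithmic transformation changes the weighting of the errors, so the result is an approximation of the direct least-squares fit.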

    5.4 Comparison of regression functions

    To describe one and the same data set different nonlinear models can be ap-

    plied.

    Figure 21: Comparison of different regression functions for the same data set


    By comparing the initial and the final reach of regression functions the best

    functional relationship will be selected (fig. 21). Also the linear model seems to

    be suitable. As can be seen in part H, the middle range of all computed models

    shows very small variations while the initial and the final part of the graphs show

    a spreading of curves. An evaluation of the quality of fit can be given by:

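    Linear coefficient of determination (performance index), with model values
    $\hat{y}_i$ and arithmetic mean $y^*$:

    $$R^2 = B = \frac{\sum_i (\hat{y}_i - y^*)^2}{\sum_i (y_i - y^*)^2},$$

    Nonlinear performance index: $B_{nl} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{(n-1)\,s_y^2}$,

    Residual sum of squares: $S_R = \sum_i (y_i - \hat{y}_i)^2$, or

    Residual dispersion: $s^2 = S_R/(n - m - 1)$ ($n$: number of data, $m$: number of
    parameters).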
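    The four measures of fit can be computed side by side in a short sketch. The data, the function names and the choice $m = 1$ for a straight line (counting the slope only, so that $n - m - 1 = n - 2$) are assumptions made for this illustration.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = b0 + b1*x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
          / sum((xi - x_mean) ** 2 for xi in x))
    return y_mean - b1 * x_mean, b1

def fit_measures(y, y_hat, m):
    """Return B, B_nl, S_R and s^2 for observations y and model values y_hat."""
    n = len(y)
    y_mean = sum(y) / n
    ss_total = sum((yi - y_mean) ** 2 for yi in y)
    ss_model = sum((fi - y_mean) ** 2 for fi in y_hat)
    s_r = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    s_y2 = ss_total / (n - 1)  # empirical variance of y
    return {
        "B": ss_model / ss_total,
        "B_nl": 1 - s_r / ((n - 1) * s_y2),
        "S_R": s_r,
        "s2": s_r / (n - m - 1),
    }

# Hypothetical observations
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.2, 2.9, 4.1, 4.8, 6.3, 6.9]

b0, b1 = linear_fit(x, y)
y_hat = [b0 + b1 * xi for xi in x]
measures = fit_measures(y, y_hat, m=1)  # m = 1: slope only (assumption)
for name, value in measures.items():
    print(f"{name} = {value:.4f}")
```

    For a least-squares line with intercept both performance indexes coincide; for nonlinear models fitted in other ways only $B_{nl}$ remains a meaningful measure.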


    6. Time series analysis

    The distinction between discrete and continuous variables is not a clear dichot-

    omy because continuous processes (seen from a physical point of view of un-

    derstanding nature) will be observed at discrete time events. Therefore, mostly

    random variables are observed.

    6.1 Dynamic behaviour of time series

    Freshwater ecosystems may be seen as switching networks where inputs are

    transformed into outputs by an operatorwhich describes the transient behav-

    iour of ecological processes (fig. 22). The overall operator transforms input

    signals into output signals: y(t) = x(t) where the signals will be smoothed

    (damped), and there exists some redundancy between input and output signals.

    Figure 22: Schematic diagram of a transfer process (input x(t), output y(t))

    Therefore, water related processes are represented by time varying signals. In
    figure 23, NO3-N raw data are described by a polynomial trend as follows:
    $\mathrm{NO_3\text{-}N}(t) = 1.8987 - 0.0754\,t + 0.0028\,t^2 - 0.00003\,t^3$. An exact mathematical (or
    functional) description of random fluctuations is not possible. The function
    describes more or less the mean behaviour of the process.

    Figure 23: Approximation of a time varying process by a function

    6.2 Description of time series in time and in frequency domain

    Hydrological systems can be seen as stochastic transfer systems described by

    system state variables and parameters. They are characterised by measurable

    inputs, not measurable (stochastic) disturbances as well as by measuring er-

    rors. Disturbances, input signals and measurement errors will be overlaid and

    will produce output signals.

    Mathematical descriptions of hydrological time series can be represented by

    time domain functions (cf. transfer functions, pp. 8 and 9).

    In the frequency domain hydrological time series are represented by Fourier-

    transforms of correlation functions, by coherency functions as well as by wave-

    lets.

    6.3 Stationary processes

    Because of time lags between input and output processes, a stationary process
    is reached only when all transient processes have decayed. Therefore, only
    some statistical characteristics of the signals can be captured. If the statistical
    characteristics do not change in time, then these processes are called stationary
    processes: process averages and dispersions do not change much in time.
    Therefore, stationary random processes can be investigated on different
    time intervals within $-\infty < t < +\infty$.

    Statistical characteristics of stationary random processes can be expressed
    by

    1. Probability density function $p(x)$ of signals $X(t)$,

    2. Auto-correlation function $\Phi_{xx}(\tau)$,

    3. Spectral power density function $S_{xx}(\omega)$

    A time varying process is expressed by a stochastic signal $X(t)$. For each time
    step $t_n$ one measured value $X_n(t)$ will be obtained. The further development of
    the process can be predicted only for a short time interval. Only when the process
    is described by an analytical (deterministic) function $f(t)$ can the time behaviour
    be predicted completely. Otherwise, only some statistical statements on the future
    development of the process $X(t)$ can be given:

    $$\mathrm{Prob}(X(t_{n+1}) \le x) = P(x),$$

    or

    $$\mathrm{Prob}(a < X(t) \le b) = \int_a^b p(x)\,dx.$$

    The Gaussian distribution with a bell-shaped density is one of the most important
    probability density distributions, where
    $p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x - x^*)^2}{2\sigma^2}\right)$.
    Important expectations are the linear average $E(x) = \int x\,p(x)\,dx$ and the squared
    average $E(x^2) = \int x^2\,p(x)\,dx$.

    6.4 Correlation and spectral functions

    The probability density function gives information about the probability that the
    amplitude of the process $X(t)$ at time $t$ lies between $x$ and $(x + \Delta x)$:

    Prob(x


    spectrum of a stationary signal which is a distribution of the variance of the sig-

    nal as a function of frequency. The frequency components that account for the

    largest share of the variance are revealed. Each peak represents the part of the

    variance of the signal that is due to a cycle of a different period or length. Sig-

    nificant periodicity in the signal will induce a sharp peak in a periodogram. The

    auto-covariance function is the time domain counterpart of the periodogram.
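    How a periodogram reveals the dominant cycle can be shown with a small sketch based on the discrete Fourier transform. The normalisation $|X_k|^2/n$, the synthetic signal with a period of 12 time steps and the function name are assumptions for illustration; other scalings of the periodogram are in use.

```python
import cmath
import math

def periodogram(x):
    """Return (frequency, power) pairs for the frequencies k/n, k = 1..n/2."""
    n = len(x)
    mean = sum(x) / n
    centred = [v - mean for v in x]  # remove the mean so that k = 0 carries no power
    result = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centred))
        result.append((k / n, abs(coeff) ** 2 / n))
    return result

# Synthetic signal: strong 12-step cycle plus a weak 4-step cycle
n = 48
signal = [10.0 + 5.0 * math.cos(2 * math.pi * t / 12)
          + 0.5 * math.cos(2 * math.pi * t / 4) for t in range(n)]

spectrum = periodogram(signal)
peak_frequency, peak_power = max(spectrum, key=lambda pair: pair[1])
print("dominant frequency:", peak_frequency, "cycles per time step")
print("dominant period:", 1 / peak_frequency, "time steps")
```

    The largest peak appears at the frequency 1/12, the weaker cycle produces a much smaller peak at 1/4, and all other frequencies carry practically no variance.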

    The periodogram of water temperature (figure 24) shows a single distinct peak

    which indicates the major cyclic behaviour. The low frequency component is

    responsible for the general tendency of the indicator.

    Figure 24: Periodogram of water temperature of the Lower Havel River

    Figure 25: Periodogram of pH

    The periodogram of pH in figure 25 shows that the highest variance is displayed

    by a low frequency. Small fluctuations are not dominant and can be neglected.


    Only long term changes are responsible for the overall observed behaviour of

    the indicator.

    The periodogram of pH is similar to that of dissolved oxygen presented in figure
    26. High variances at low frequencies are observed. This means that the general
    tendency of this indicator is determined by long term changes.

    Figure 26: Periodogram of dissolved oxygen.

    For the indicator of phytoplankton biomass the periodogram is shown in figure

    27. The periodogram represents low frequency components which exhibit the

    highest variances and some small fluctuation at higher frequencies. They de-

    termine the long term behaviour of the indicator. Two distinct peaks reveal two

    cycles of different periods and amplitudes.

    Figure 27: Periodogram of chlorophyll-a

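    The cross-power spectrum $S_{xy}(\omega)$ of two stochastic ecological processes $x(t)$
    and $y(t)$ is the Fourier transform of the CCF $\Phi_{xy}(\tau)$:

    $$S_{xy}(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\Phi_{xy}(\tau)\,e^{-j\omega\tau}\,d\tau$$

    It is a complex function.

    The coherency function $Co_{xy}(\omega)$ is a measure of synchronicity of (two) signals. It
    is calculated on the basis of the periodograms of both signals by

    $$Co_{xy}(\omega) = \frac{|S_{xy}(\omega)|^2}{S_{xx}(\omega)\,S_{yy}(\omega)},$$

    where $|S_{xy}(\omega)|^2 = \mathrm{Re}(S_{xy}(\omega))^2 + \mathrm{Im}(S_{xy}(\omega))^2$, and the phase shift between
    both signals is given by $\arctan\left(\mathrm{Im}(S_{xy}(\omega))/\mathrm{Re}(S_{xy}(\omega))\right)$.

    The limitation of the CCF is taken into account by what is called a window function $h(\tau)$:

    $$\tilde{S}_{xy}(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\Phi_{xy}(\tau)\,h(\tau)\,e^{-j\omega\tau}\,d\tau = \int_{-\infty}^{\infty} S_{xy}(\nu)\,H(\omega - \nu)\,d\nu,$$

    where $H(\omega)$ is the Fourier transform of $h(\tau)$, which distorts $S_{xy}(\omega)$ to $\tilde{S}_{xy}(\omega)$.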


    7. Analysis of cycling processes

    Cycling processes in hydrology are natural. In fig. 28 some examples of cycling

    processes with different periods and frequencies are presented.

    Figure 28: Cycling water quality indicators (time series 1985 to 1995 of discharge
    Q in m3/s, conductivity EC in µS/cm, water temperature Tw in °C, dissolved
    oxygen O2 in mg/l and pH)

    Such processes are caused mostly by natural external driving forces but also by
    natural internal driving forces. They result in different time and frequency
    behaviour of the water quality (hydrological) processes. Water quality processes are
    characterised by different time parameters such as time delay, threshold values,
    altering, physiological parameters and others. State transitions take
    place on intervals $(a_i(t), b_i(t))$ with probability densities $w_i(t)$ of time delays of
    system variables and probabilities $p_i(t)$ for each realisation of a state transition:

    For $a_i(t) \le w_i(t) \le b_i(t)$: $p_i(t) = \int w_i(t)\,dt$.

    On the other hand, hydrological variables vary often with high frequencies be-

    cause of random changes of internal system states and/or fluctuations of vari-

    ables. Switching processes of input variables take place at certain different time

    events. Time delays in the courses of action of system components lead to re-

    tardations in the changes of system states and to redundancies in the data

    transfer.


    A state transition can be characterised by a quadruple

    i(t) = {ai(t), bi(t), wi (t),pi(t)).

    A classification of hydrological systems can be given by its characteristics of

    signals and by the type of change of dynamic properties (table 14).

    Table 14: Classification of hydrological systems

    Classification                Remark
    Characteristics of signals
      Modulation                  Change of amplitudes, frequencies and phases of signals
      Quantification              Discretisation of time domain of amplitudes and duration
                                  interval of signals
    Adaptability of system
      adaptive                    Change of system states, change of inputs and
                                  disturbances, change of parameters, change of system
                                  structure
      non-adaptive                Fixed parameters, no change of ecosystem structure

    7.1 Introduction

    Mathematical equations describe either the time dependency (function of time t),
    which is called description in the time domain, or the frequency dependency
    (function of frequency or cycles per time unit), which is called description in
    the frequency domain. Mostly, cycling (or periodic) processes in hydrological

    context are caused by natural external driving forces. On the other hand, aperi-

    odic hydrological processes are mainly influenced by artificial (man-made) ex-

    ternal driving forces.

    Another distinction can be made by the ability to reproduce a time-varying
    process. If a process can be reproduced and forecast correctly, it is called a
    deterministic one. Otherwise it is called a non-deterministic or stochastic
    (random) process. Each deterministic process $x(t)$ is characterised by its time
    development (or behaviour) $x = x(t)$ with $-\infty < t < +\infty$.

    A harmonic process is described by a trigonometric function

    $$x(t) = x_0\cos(\omega_i t + \varphi_i) \quad\text{with}\quad -\infty < t < +\infty.$$

    $\omega_i = 2\pi/T_i$ is the basic cycling frequency (circle frequency), $T_i$ is the period of
    the cycle, and $\varphi_i$ is the phase shift.

    7.2 Fourier analysis

    A periodic process with period $T_0$ is described by a Fourier series of the form

    $$x(t) = \frac{a_0}{2} + \sum_i \left(a_i\cos(i\omega_0 t) + b_i\sin(i\omega_0 t)\right),$$

    with $-\infty \le i \le +\infty$, $\omega_0 = 2\pi/T_0$: frequency of the basic cycle, $T_0$: period of the cycle.

    The amplitudes $a_i$ and $b_i$ are calculated as follows: $a_i = \frac{1}{T_0}\int x(t)\cos(i\omega_0 t)\,dt$,
    $b_i = \frac{1}{T_0}\int x(t)\sin(i\omega_0 t)\,dt$ and $a_0 = \frac{1}{2T_0}\int x(t)\,dt$.

    The Fourier polynomial is the approximation of a cycling process with the minimum
    mean squared deviation. The amplitudes of the approximating function are then
    given by $A_i = \sqrt{a_i^2 + b_i^2}$.

    Phase shifts are given in the interval $[0, 2\pi]$ by $\varphi_i = \arctan(b_i/a_i)$.
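    The conversion of a coefficient pair $(a_i, b_i)$ into amplitude and phase can be sketched as follows. The coefficient values are hypothetical. As an assumption of this sketch, the phase is computed with the two-argument arctangent, which selects the correct quadrant, and it is defined such that $a_i\cos(\omega t) + b_i\sin(\omega t) = A_i\cos(\omega t - \varphi_i)$; other sign conventions for the phase are equally possible.

```python
import math

def amplitude_and_phase(a, b):
    """Amplitude A = sqrt(a^2 + b^2) and phase in [0, 2*pi) of a*cos + b*sin."""
    amplitude = math.hypot(a, b)
    phase = math.atan2(b, a) % (2 * math.pi)  # quadrant-aware arctan(b/a)
    return amplitude, phase

# Hypothetical Fourier coefficients of one harmonic
a_i, b_i = 3.0, -4.0
A_i, phi_i = amplitude_and_phase(a_i, b_i)
print(f"A_i = {A_i:.3f}, phi_i = {phi_i:.3f} rad")

# Both representations describe the same harmonic
omega = 2 * math.pi / 12
for t in (0.0, 1.5, 7.0):
    left = a_i * math.cos(omega * t) + b_i * math.sin(omega * t)
    right = A_i * math.cos(omega * t - phi_i)
    print(f"t = {t}: {left:.6f} vs {right:.6f}")
```

    A plain arctan(b/a) would return the same angle for (3, -4) and (-3, 4), although the two harmonics are shifted by half a period against each other.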

    Figure 29 shows a Fourier approximation of global radiation process. It can be

    seen that the approximation is shifted from the real frequencies due to a fixed

    frequency. This fact causes some error.

    Figure 29: Fourier approximation of global radiation (global radiation in W/m2
    against time, 1996 to 2003; raw data and component with maximum amplitude,
    f = 1/352 d)

    Fourier approximations can be used to explain the variance of a cycling process
    by its basic frequency. Table 14 gives an example of the usefulness of this
    method for physical, chemical and biological environmental or ecological
    variables. The 3rd column of table 14 contains the values of total variance of the
    time series under consideration. The last column contains the values of variance
    which are explained by the dominant cycle contained in the time series. The best
    results are obtained for physical variables, followed by chemical variables.
    Insufficient results are obtained for biological variables.

    Table 14: Fourier analysis of water quality indicators

    Indicator   Reservoir      Total variance   Average   Std. dev.   Variance of the yearly cycle
                               (%)                                    (% of total variance)
    TEMP        Saidenbach     90.0             12.0      44.43       84.62
                Neunzehnhain   90.0             11.9      33.41       74.35
                Klíčava        95.7             11.1      56.80       92.76
                Slapy          95.9             12.0      52.25       90.63
    DO          Saidenbach     71.6             10.5       2.90       22.74
                Neunzehnhain   76.2             10.1       1.98       22.19
                Klíčava        75.9             10.2       4.37       37.35
                Slapy          76.2              7.7       8.88       34.81
    CHA         Saidenbach     36.8              5.4      32.87        1.94
                Neunzehnhain   35.4              1.7       1.59        1.96

    Example: Approximation of water temperature of reservoirs (yearly dominant
    harmonic cycle):

    Reservoir Saidenbach

    $TEMP(t) = 12.0 + 1.458\cos((6\pi/180)\,t) - 4.462\sin((6\pi/180)\,t)$

    Reservoir Neunzehnhain

    $TEMP(t) = 11.9 + 0.693\cos((6\pi/180)\,t) + 4.415\sin((6\pi/180)\,t)$

    Reservoir Klíčava

    $TEMP(t) = 11.1 - 6.650\cos((9\pi/180)\,t) - 7.820\sin((9\pi/180)\,t)$

    Reservoir Slapy

    $TEMP(t) = 12.0 - 7.073\cos((10\pi/180)\,t) - 6.684\sin((10\pi/180)\,t)$

    7.3 Digital data filter

    Digital filter functions transfer sequences of input signals to sequences of output
    signals by compressing or decompressing noisy information contained in the
    measured signals of hydrological processes. The results of applying digital
    filters are consistent data series which can be used for modelling, simulation and
    optimisation in hydrological sciences.

    Basic filter functions are derived from an ideal low pass filter:


    Ideal low pass

    $$|H(\omega)|^2 = \frac{1}{1 + F^2(\omega)}$$

    Butterworth filter (power low pass)

    $$|H(\omega)|^2 = \frac{1}{1 + \omega^{2n}}$$

    (Amplitude response should be as flat as possible in the pass band).

    Chebyshev filter, type 1

    $$|H(\omega)|^2 = \frac{1}{1 + \varepsilon^2 c_n^2(\omega)}$$

    ($\varepsilon$: ripple factor (or eccentricity), $\varepsilon = 0.1526$. In the pass band a ripple is
    accepted. The transition from pass band to stop band is steeper than for the
    Butterworth filter).

    Chebyshev filter, type 2 (inverse Chebyshev filter)

    $$|H(\omega)|^2 = \frac{1}{1 + \varepsilon^{*2} c_n^2(\omega)} \quad\text{with}\quad \varepsilon^* = \varepsilon^2/(1 - \varepsilon).$$

    (In the stop band a ripple is accepted.)

    Elliptic filter (Cauer filter)

    $$|H(\omega)|^2 = \frac{1}{1 + \varepsilon^{*2} F_n^2(\omega)}$$

    (Ripples arise in the pass band and in the stop band. One gets the steepest
    transition between both frequency bands).

    To get an acceptable transfer behaviour filters of order 1 to 3 should be used

    only. Figures 30 to 33 represent the transfer behaviours of digital filters for dif-

    ferent water quality time series. Higher order filters show rippling transfer be-

    haviours and cause nonlinear effects in the output sequences of signals. This

    leads to misinterpretations and unexplainable events within the data series.
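    The influence of the filter order on the amplitude response can be explored with a small sketch of the squared magnitude functions of the Butterworth and the Chebyshev type 1 low pass. The following points are assumptions of this sketch: the frequency is normalised so that the band edge lies at 1, $c_n$ is taken to be the Chebyshev polynomial of the first kind, and the function names are hypothetical. The sketch evaluates analogue prototype responses only and does not filter a data series.

```python
import math

def butterworth_power(omega, n):
    """|H|^2 = 1 / (1 + omega^(2n)) of an n-th order Butterworth low pass."""
    return 1.0 / (1.0 + omega ** (2 * n))

def chebyshev_polynomial(n, x):
    """Chebyshev polynomial of the first kind c_n(x)."""
    if abs(x) <= 1.0:
        return math.cos(n * math.acos(x))
    sign = 1.0 if x > 0 else (-1.0) ** n
    return sign * math.cosh(n * math.acosh(abs(x)))

def chebyshev1_power(omega, n, ripple=0.1526):
    """|H|^2 = 1 / (1 + ripple^2 * c_n(omega)^2) of a Chebyshev type 1 low pass."""
    return 1.0 / (1.0 + (ripple * chebyshev_polynomial(n, omega)) ** 2)

for n in (1, 2, 3, 6):
    row = ", ".join(f"{butterworth_power(w, n):.4f}" for w in (0.5, 1.0, 2.0))
    print(f"Butterworth order {n}: |H|^2 at omega = 0.5, 1, 2 -> {row}")

for n in (1, 2, 3, 6):
    row = ", ".join(f"{chebyshev1_power(w, n):.4f}" for w in (0.5, 1.0, 2.0))
    print(f"Chebyshev 1 order {n}: |H|^2 at omega = 0.5, 1, 2 -> {row}")
```

    Every Butterworth filter passes half of the power at the band edge, and a higher order only steepens the transition. The Chebyshev type 1 response stays between $1/(1 + \varepsilon^2)$ and 1 inside the pass band, which is the accepted ripple.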

    Figure 30: Selection of filter order of a Butterworth filter for DO (standard error
    of O2 in mg/l against the reciprocal of the critical frequency in d, for filter
    orders 1 to 6)

    The higher order filters lead to changing (welling) transfer behaviour during the

    filtering process as can be seen in figs. 30 and 31.

    Figure 31: Selection of filter order of a Butterworth filter for pH (standard error
    of the pH value against the reciprocal of the critical frequency in d, for filter
    orders 1 to 6)

    They show this behaviour for a Butterworth filter. Tchebychev 1 filters for chlo-

    rophyll-a and for water temperature (figs. 32 and 33) demonstrate the distur-

    bances within the transfer process.

    Figure 32: Tschebychev 1 filter for chlorophyll-a (standard error of total
    chlorophyll-a in mg/l against the reciprocal of the critical frequency in d, for
    filter orders 1 to 6)

    Figure 33: Tschebychev 1 filter for water temperature (standard error of water
    temperature in °C against the reciprocal of the critical frequency in d, for filter
    orders 1 to 6)

    The first step of digital data filtering procedures is the selection of a complete

    hydrological time series. If the data series contains some gaps interpolation

    methods should be used to get a time series with equidistant data. This is a

    strong prerequisite for all further steps. Fig. 34 shows such a data series for the

    variable conductivity of the Oder River at Frankfurt.

    Figure 34: Original data series of conductivity (conductivity in µS/cm against
    time, 1993 to 2000)

    In the next step the critical frequency is calculated from spectral density function

    (fig. 35). As confidence band the 95% - confidence region should be selected.

    For the example a critical frequency fg = 0.053 was used.

    Figure 35: Selection of critical frequency of the filter (power density of
    conductivity on a logarithmic scale against frequency in 1/d; power density with
    upper and lower bound of the 95% confidence interval; critical frequency fg
    marked)

    The last step consists of computation of the digital filter and reconstruction of

    the original data series. In case of the Oder River an elliptic filter was used to

    reconstruct the original time series and to get a consistent time series for mod-

    elling (fig. 36).

    Figure 36: Raw data and elliptic filter (1st order) for conductivity (conductivity
    in µS/cm against time)

    7.4 Wavelets

    Wavelet analysis has proven quite useful for time-scale-based signal analysis.
    It addresses the time-scale analysis problem because it offers an effective
    way to extract both the time localization and the frequency content of a
    time series. It can decompose a time series into several sub-series, each of
    which may be associated with a particular time scale. As a result, the
    interpretation of features in hydrological time series may be facilitated by
    first applying an appropriate wavelet transform and subsequently interpreting
    each individual sub-series.

    The following questions can be answered effectively with the help of wavelet
    analysis:

    1. What is the dominant scale of variation influencing the observed general
       tendency of the indicator?

    2. Are the variations from one day to the next more prominent than the
       variations from one week to the next?

    3. Are the statistical variations in the hydrological indicator homogeneous
       across time?

    4. What are the time-dependent variations, such as the presence of trends?

    5. How are two indicators related on a scale-by-scale basis? How do they
       covary at different scales?

    Wavelet analysis imitates windowed Fourier analysis by using basis functions
    (wavelets) that are better suited to capturing the local behaviour of
    non-stationary signals. The wavelet transformation is a function of two
    variables, W(u,s), obtained by projecting a signal X(t) onto a particular
    wavelet, and is given by

    $$W(u,s) = \int_{-\infty}^{\infty} X(t)\,\psi_{u,s}(t)\,dt, \qquad \psi_{u,s}(t) = \frac{1}{\sqrt{s}}\,\psi\!\left(\frac{t-u}{s}\right),$$

    where $\psi_{u,s}$ is a translated and dilated version of the original wavelet
    function $\psi$. The coefficients obtained are a function of the location
    parameter u and the scale parameter s. Applying shifted and scaled versions of
    a wavelet function decomposes the signal into simpler components. The shifting
    and scaling process is what makes this representation possible; it is referred
    to as multiresolution analysis.
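    A single coefficient W(u,s) can be approximated numerically as in the sketch
    below. It assumes the Haar function as mother wavelet and a simple Riemann
    sum for the integral; both choices, as well as the names `haar` and
    `wavelet_coefficient` and the two test signals, are illustrative assumptions.

```python
import math


def haar(x):
    """Haar mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere."""
    if 0.0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0


def wavelet_coefficient(signal, u, s, dt=1.0):
    """Riemann-sum approximation of W(u, s) for samples taken at t = 0, dt, 2 dt, ..."""
    total = 0.0
    for i, value in enumerate(signal):
        t = i * dt
        # psi_{u,s}(t) = psi((t - u) / s) / sqrt(s)
        total += value * haar((t - u) / s) / math.sqrt(s) * dt
    return total


flat = [5.0] * 16                 # constant signal
step = [0.0] * 8 + [1.0] * 8      # jump at t = 8

print(wavelet_coefficient(flat, u=4, s=8))   # close to zero: no variation at this scale
print(wavelet_coefficient(step, u=4, s=8))   # wavelet straddles the jump
print(wavelet_coefficient(step, u=0, s=8))   # wavelet lies before the jump
```

    The coefficient differs from zero only where the shifted wavelet overlaps
    the jump, which illustrates the time localization that a global Fourier
    coefficient does not provide.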

    The wavelet transform is usually applied in the form of a filter bank comprising
    two filters. The scaling filter, known as the father wavelet, is a low-pass
    filter, while the wavelet filter, known as the mother wavelet, is a high-pass
    filter. Given a signal X(t) of length $n = 2^J$, the filtering procedure can be
    performed a maximum of J times, giving rise to J different wavelet scales. The
    wavelet coefficients or detail coefficients are produced by the wavelet filter,
    while the scaling filter gives rise to the smooth version of the signal used at
    the next scale. The respective father and mother wavelets are given by

    $$\varphi_{J,k}(t) = 2^{-J/2}\,\varphi\!\left(\frac{t - 2^J k}{2^J}\right), \qquad \int \varphi(t)\,dt = 1$$

    and

    $$\psi_{j,k}(t) = 2^{-j/2}\,\psi\!\left(\frac{t - 2^j k}{2^j}\right), \qquad \int \psi(t)\,dt = 0,$$

    where $\varphi_{J,k}$ is the father wavelet and $\psi_{j,k}$ is the mother
    wavelet, with the scale parameter s restricted to the dyadic scale $2^j$. If a
    signal f(t) is projected onto these basis functions, the coefficients

    $$S_{J,k} = \int f(t)\,\varphi_{J,k}(t)\,dt$$

    and

    $$d_{j,k} = \int f(t)\,\psi_{j,k}(t)\,dt$$

    are obtained, with $S_{J,k}$ being the coefficients for the father wavelet at
    the maximum scale $2^J$ (the smooth coefficients) and $d_{j,k}$ being the
    detail coefficients from the mother wavelet at all levels j from 1 to J, the
    maximal scale. Based on these coefficients, the function f(t) can be
    represented by

    $$f(t) = \sum_k S_{J,k}\,\varphi_{J,k}(t) + \sum_k d_{J,k}\,\psi_{J,k}(t) + \dots + \sum_k d_{1,k}\,\psi_{1,k}(t)$$

    and can equally be represented by

    $$f(t) = S_J + D_J + D_{J-1} + \dots + D_j + \dots + D_1,$$

    where

    $$S_J = \sum_k S_{J,k}\,\varphi_{J,k}(t)$$

    and

    $$D_j = \sum_k d_{j,k}\,\psi_{j,k}(t).$$
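    The filter bank described above can be sketched for the simplest case. The
    code assumes the orthonormal Haar wavelet, whose low-pass and high-pass
    filters are scaled pairwise sums and differences, and a signal length that
    is a power of two. The function names and the sample signal are
    hypothetical; other wavelets such as db4 use longer filters.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_analysis(x):
    """One filter-bank stage: low-pass gives smooth, high-pass gives detail coefficients."""
    smooth = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return smooth, detail


def haar_synthesis(smooth, detail):
    """Invert one stage and recover the finer-scale series."""
    x = []
    for s, d in zip(smooth, detail):
        x.append((s + d) / SQRT2)
        x.append((s - d) / SQRT2)
    return x


def haar_dwt(x):
    """Full pyramid: returns the smooth coefficient(s) at level J and details[j - 1] for j = 1..J."""
    n = len(x)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    smooth = list(x)
    while len(smooth) > 1:
        smooth, d = haar_analysis(smooth)
        details.append(d)
    return smooth, details


def haar_idwt(smooth, details):
    """Reconstruct the signal from the smooth and detail coefficients."""
    x = list(smooth)
    for d in reversed(details):
        x = haar_synthesis(x, d)
    return x


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]   # n = 2^3, so J = 3
smooth, details = haar_dwt(signal)
print("coefficients per level:", [len(d) for d in details])
print("reconstructed:", [round(v, 6) for v in haar_idwt(smooth, details)])
```

    Each stage halves the number of smooth coefficients, so a signal of length
    2^J admits exactly J stages, and the reconstruction returns the original
    signal up to rounding error.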

    Multiresolution decomposition (MRD) reveals the variations at different scales,
    denoted by d. Figure 37 shows the details of the multiresolution analysis of
    dissolved oxygen sampled at daily intervals.

    Figure 37: Multiresolution analysis details of dissolved oxygen signal
    sampled at daily intervals

    The details reveal the high-frequency variations present in the dissolved oxygen
    time series, that is, they provide an additive decomposition of the
    high-frequency variation on a scale-by-scale basis. The components d1, d2, d3,
    d4, d5, d6 and d7 reveal the variations occurring at 1 day, 2 days, 4 days,
    8 days, 16 days, 32 days and 64 days, respectively. This progressive
    decomposition reveals the differences in fluctuations from one scale to
    another. It shows that the lower scales are less important than the higher
    scales of variation.


    Multiresolution analysis (MRA) filters information in the signal at different
    scales, represented by a. Fig. 38 gives an example of an MRA and MRD for
    long-term observations of dissolved oxygen in a eutrophic freshwater ecosystem.
    When all high-frequency events are removed from the original signal s, the
    basic nature of the cycling process emerges. This can be seen at level a7.
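    The relation between the approximations a and the details d can be sketched
    as follows. The example assumes the Haar wavelet, for which the
    approximation at level j is the mean over consecutive blocks of 2^j samples
    and the detail at level j is the difference between two successive
    approximations. The function name `haar_mra` and the sample values are
    hypothetical.

```python
def haar_mra(signal, levels):
    """Return (approximations, details) as series of the original length.

    approximations[j - 1] is a_j and details[j - 1] is d_j for j = 1..levels.
    """
    n = len(signal)
    if n % (2 ** levels) != 0:
        raise ValueError("length must be divisible by 2**levels")
    approximations, details = [], []
    previous = list(signal)              # a_0 is the signal itself
    for j in range(1, levels + 1):
        block = 2 ** j
        current = []
        for start in range(0, n, block):
            mean = sum(signal[start:start + block]) / block
            current.extend([mean] * block)
        approximations.append(current)
        # d_j holds what was removed when going from a_(j-1) to a_j
        details.append([p - c for p, c in zip(previous, current)])
        previous = current
    return approximations, details


signal = [8.0, 6.0, 7.0, 9.0, 3.0, 1.0, 2.0, 4.0]
approximations, details = haar_mra(signal, levels=3)
print("a3:", approximations[2])

# s = a3 + d3 + d2 + d1
rebuilt = [a + sum(d[i] for d in details)
           for i, a in enumerate(approximations[2])]
print("rebuilt:", rebuilt)
```

    The coarsest approximation retains only the slow component of the series,
    while the sum of the coarsest approximation and all details returns the
    original signal.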

    Figure 38: Wavelet analysis of DO

    Figure 38 reveals that the variations occurring at a time scale of 1 day are
    of relatively low intensity and are not able to influence the general tendency
    observed in the dissolved oxygen signal. However, the fluctuations occurring at
    higher time scales, such as scale 8, are strong enough to influence the
    long-term behaviour of the signal. Hence, the long-term tendency observed in
    the dissolved oxygen time series is significantly influenced only by the
    fluctuations occurring at the higher scales and not by those at the lower
    scales. At the lower scales, the fluctuations are higher during the warmer
    months than during the colder months. At the higher scales, such as scale 32,
    the fluctuations are high throughout the year. It is therefore instructive to
    examine the variance at different scales in order to quantify these variations.


    An overview of the respective frequencies is given in table 15.

    Table 15: Frequencies and scales of MRA and MRD

    MRA scale   MRD scale   Frequency
    a1          d1          1
    a2          d2          2
    a3          d3          4
    a4          d4          8
    a5          d5          16
    a6          d6          32
    a7          d7          64
    a8          d8          128
    a9          d9          256
    a10         d10         512
    a11         d11         1024
    a12         d12         2048

    The variance of a signal can equally be decomposed by using this technique.
    For a signal $X_t$ the time-varying variance at scale $S_j$ of the wavelet
    coefficients $w_{j,t}$ can be calculated by

    $$\sigma^2_{x,t}(S_j) = \frac{1}{2 S_j}\,\mathrm{var}(w_{j,t}).$$
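    A scale-by-scale variance decomposition of this kind can be sketched as
    follows. The code assumes orthonormal Haar detail coefficients, scales
    S_j = 2^(j-1), a signal length that is a power of two, and var(w) estimated
    as the mean of the squared coefficients (their expectation is taken as
    zero). These are illustrative assumptions; figure 39 of the original is
    based on the db4 wavelet instead.

```python
import math


def haar_details(x):
    """Orthonormal Haar detail coefficients for every level j = 1..J."""
    n = len(x)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two")
    smooth, levels = list(x), []
    while len(smooth) > 1:
        pairs = [(smooth[i], smooth[i + 1]) for i in range(0, len(smooth), 2)]
        levels.append([(a - b) / math.sqrt(2) for a, b in pairs])
        smooth = [(a + b) / math.sqrt(2) for a, b in pairs]
    return levels


def wavelet_variance(x):
    """Map each scale S_j = 2**(j - 1) to var(w_j) / (2 S_j)."""
    result = {}
    for j, w in enumerate(haar_details(x), start=1):
        scale = 2 ** (j - 1)
        var_w = sum(c * c for c in w) / len(w)   # mean square, zero mean assumed
        result[scale] = var_w / (2 * scale)
    return result


x = [2.0, 4.0, 6.0, 8.0, 5.0, 3.0, 1.0, 7.0]
wv = wavelet_variance(x)
for scale in sorted(wv):
    print(scale, round(wv[scale], 4))
print("sum over scales:", round(sum(wv.values()), 4))
```

    Under these assumptions the contributions of all scales add up to the
    (population) variance of the series, so each entry can be read as the share
    of the total variance attributable to that scale.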

    Similar to the wavelet variance of a univariate signal, the wavelet covariance
    decomposes the covariance between two signals $x_1$ and $x_2$ on a
    scale-by-scale basis by

    $$\gamma_{x_1 x_2}(S_j) = \frac{1}{2 S_j}\,\mathrm{cov}\bigl(w^{x_1}_{j,t},\, w^{x_2}_{j,t}\bigr).$$
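    How two indicators covary at different scales can be illustrated with a
    small sketch. It assumes orthonormal Haar detail coefficients, the same
    normalisation by 2 S_j as for the wavelet variance, and two artificial
    series that share a slow component but have opposite day-to-day
    fluctuations. All names and data are hypothetical.

```python
import math


def haar_details(x):
    """Orthonormal Haar detail coefficients for every level j = 1..J."""
    n = len(x)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two")
    smooth, levels = list(x), []
    while len(smooth) > 1:
        pairs = [(smooth[i], smooth[i + 1]) for i in range(0, len(smooth), 2)]
        levels.append([(a - b) / math.sqrt(2) for a, b in pairs])
        smooth = [(a + b) / math.sqrt(2) for a, b in pairs]
    return levels


def wavelet_covariance(x1, x2):
    """Map each scale S_j = 2**(j - 1) to cov(w1_j, w2_j) / (2 S_j)."""
    if len(x1) != len(x2):
        raise ValueError("series must have equal length")
    result = {}
    pairs = zip(haar_details(x1), haar_details(x2))
    for j, (w1, w2) in enumerate(pairs, start=1):
        scale = 2 ** (j - 1)
        cov_w = sum(a * b for a, b in zip(w1, w2)) / len(w1)  # zero means assumed
        result[scale] = cov_w / (2 * scale)
    return result


slow = [0.0, 0.0, 0.0, 0.0, 4.0, 4.0, 4.0, 4.0]          # common slow component
fast = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]      # opposite fast component
x1 = [s + f for s, f in zip(slow, fast)]
x2 = [s - f for s, f in zip(slow, fast)]

wc = wavelet_covariance(x1, x2)
for scale in sorted(wc):
    print(scale, round(wc[scale], 4))
```

    The two series covary negatively at scale 1 and positively at scale 4,
    although their overall covariance is positive. A single covariance value
    would hide this difference between scales.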

    The wavelet variance shown in figure 39 reveals the intensity of variation from
    one scale to the next in the dissolved oxygen time series. This graphical
    representation of the wavelet variance enables the researcher to answer
    questions concerning the dominant scale of variation in the time series, the
    homogeneity of variations from one scale to the next, and the importance of the
    variations at one scale compared with the variations occurring at another scale.

    Figure 39: Wavelet variance of dissolved oxygen with db4 (x-axis: wavelet scale, 1 to 32; y-axis: logarithmic, 0.05 to 2.00; plotted symbols: *, L, U)

