
Alternatives to Least-Squares

• Need a method that gives consistent estimates in the presence of coloured noise

• Generalized Least-Squares

• Instrumental Variable Method

• Maximum Likelihood Identification

Adaptive Control Lecture Notes, © Guy A. Dumont, 1997-2005

Generalized Least Squares

• Noise model

$$w(t) = \frac{e(t)}{C(q^{-1})}$$

where $\{e(t)\} = N(0, \sigma)$ and $C(q^{-1})$ is monic and of degree $n$.

• System described as

$$A(q^{-1})y(t) = B(q^{-1})u(t) + \frac{1}{C(q^{-1})}e(t)$$



• Defining filtered sequences

$$\tilde{y}(t) = C(q^{-1})y(t)$$

$$\tilde{u}(t) = C(q^{-1})u(t)$$

• System becomes

$$A(q^{-1})\tilde{y}(t) = B(q^{-1})\tilde{u}(t) + e(t)$$

• If $C(q^{-1})$ is known, then least-squares gives consistent estimates of $A$ and $B$, given $\tilde{y}$ and $\tilde{u}$.

• The problem, however, is that in practice $C(q^{-1})$ is not known, so $\{\tilde{y}\}$ and $\{\tilde{u}\}$ cannot be obtained.

• An iterative method proposed by Clarke (1967) solves that problem



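1. Set $\hat{C}(q^{-1}) = 1$

2. Compute $\{\tilde{y}(t)\}$ and $\{\tilde{u}(t)\}$ for $t = 1, \ldots, N$

3. Use the least-squares method to estimate $\hat{A}$ and $\hat{B}$ from $\tilde{y}$ and $\tilde{u}$

4. Compute the residuals

$$\hat{w}(t) = \hat{A}(q^{-1})y(t) - \hat{B}(q^{-1})u(t)$$

5. Use the least-squares method to estimate $\hat{C}$ from

$$\hat{C}(q^{-1})\hat{w}(t) = \varepsilon(t)$$

i.e.

$$\hat{w}(t) = -c_1\hat{w}(t-1) - c_2\hat{w}(t-2) - \cdots + \varepsilon(t)$$

where $\{\varepsilon(t)\}$ is white

6. If converged, then stop; otherwise repeat from step 2.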
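The six steps can be sketched in code for the simplest case. The following is a minimal illustration, not the notes' implementation: it assumes first-order $A$, $B$ and $C$, replaces the convergence test of step 6 by a fixed number of passes, and all function names (`solve2`, `ls_first_order`, `gls_first_order`, `simulate`) as well as the simulated system are hypothetical.

```python
import random

def solve2(s11, s12, s22, r1, r2):
    """Solve the symmetric system [[s11, s12], [s12, s22]] p = [r1, r2]."""
    det = s11 * s22 - s12 * s12
    return (s22 * r1 - s12 * r2) / det, (s11 * r2 - s12 * r1) / det

def ls_first_order(y, u):
    """Least-squares fit of y(t) = -a y(t-1) + b u(t-1) + noise; returns (a, b)."""
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t in range(1, len(y)):
        x1, x2 = -y[t - 1], u[t - 1]              # regressor row x^T(t)
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        r1 += x1 * y[t]
        r2 += x2 * y[t]
    return solve2(s11, s12, s22, r1, r2)

def gls_first_order(y, u, iterations=10):
    """GLS iteration for A = 1 + a q^-1, B = b q^-1, C = 1 + c q^-1."""
    a = b = 0.0
    c = 0.0                                        # step 1: C = 1
    for _ in range(iterations):                    # step 6 replaced by a fixed count
        # step 2: filter the data through the current C estimate
        yf = [y[0]] + [y[t] + c * y[t - 1] for t in range(1, len(y))]
        uf = [u[0]] + [u[t] + c * u[t - 1] for t in range(1, len(u))]
        a, b = ls_first_order(yf, uf)              # step 3
        # step 4: residuals w(t) = A y(t) - B u(t) on the unfiltered data
        w = [y[t] + a * y[t - 1] - b * u[t - 1] for t in range(1, len(y))]
        # step 5: scalar least-squares fit of w(t) = -c w(t-1) + eps(t)
        num = sum(w[t - 1] * w[t] for t in range(1, len(w)))
        den = sum(w[t - 1] ** 2 for t in range(1, len(w)))
        c = -num / den
    return a, b, c

def simulate(n, a=-0.8, b=1.0, c=-0.7, sigma=0.2, seed=0):
    """Simulate A y = B u + w with C w = e (illustrative values only)."""
    rng = random.Random(seed)
    u = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    y, w = [0.0], [0.0]
    for t in range(1, n):
        w.append(-c * w[t - 1] + sigma * rng.gauss(0.0, 1.0))
        y.append(-a * y[t - 1] + b * u[t - 1] + w[t])
    return y, u

if __name__ == "__main__":
    y, u = simulate(3000)
    print(gls_first_order(y, u))
```

A practical version would stop on the change in the loss function or on a whiteness test of the residuals rather than after a fixed number of passes.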



• For convergence, the loss function and/or the whiteness of the residuals can be tested.

• An advantage of GLS is that it not only may give consistent estimates of the deterministic part of the system, but also gives a representation of the noise, which the LS method does not.

• The consistency of GLS depends on the signal-to-noise ratio, the probability of consistent estimation increasing with the S/N.

• There is, however, no guarantee of obtaining consistent estimates.


Instrumental Variable Method (IV) [4]

• As seen previously, the LS estimate

$$\hat{\theta} = [X^T X]^{-1} X^T Y$$

is unbiased if $W$ is independent of $X$.

• Assume that a matrix $V$ is available, which is correlated with $X$ but not with $W$, and such that $V^T X$ is positive definite, i.e.

$$E[V^T X] \text{ is nonsingular}$$

$$E[V^T W] = 0$$

[4] T. Söderström and P.G. Stoica, Instrumental Variable Methods for System Identification, Springer-Verlag, Berlin, 1983.



• Then,

$$V^T Y = V^T X\theta + V^T W$$

and $\theta$ is estimated by

$$\hat{\theta} = [V^T X]^{-1} V^T Y$$

$V$ is called the instrumental variable matrix.



• Ideally $V$ is the noise-free process output, and the IV estimate $\hat{\theta}$ is then consistent.

• There are many possible ways to construct the instrumental variable. For instance, it may be built using an initial least-squares estimate:

$$\hat{A}_1 \hat{y}(t) = \hat{B}_1 u(t)$$

and the $k$th row of $V$ is given by

$$v_k^T = [-\hat{y}(k-1), \ldots, -\hat{y}(k-n), u(k-1), \ldots, u(k-n)]$$
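As a hedged illustration of this construction, the sketch below computes $[V^T X]^{-1} V^T Y$ for a first-order model, building the instruments from the output of an initial least-squares model driven by the input alone. The function names (`solve2`, `estimate`, `iv_first_order`, `simulate`) and the simulated system are assumptions made for the example.

```python
import random

def solve2(m11, m12, m21, m22, r1, r2):
    """Solve the general 2x2 system [[m11, m12], [m21, m22]] p = [r1, r2]."""
    det = m11 * m22 - m12 * m21
    return (m22 * r1 - m12 * r2) / det, (m11 * r2 - m21 * r1) / det

def estimate(y, u, inst):
    """Return (a, b) from [V^T X]^-1 V^T Y for y(t) = -a y(t-1) + b u(t-1).

    Row t of X is [-y(t-1), u(t-1)]; row t of V is [-inst(t-1), u(t-1)].
    Passing inst = y reduces to ordinary least-squares.
    """
    m11 = m12 = m21 = m22 = r1 = r2 = 0.0
    for t in range(1, len(y)):
        x1, x2 = -y[t - 1], u[t - 1]
        v1, v2 = -inst[t - 1], u[t - 1]
        m11 += v1 * x1
        m12 += v1 * x2
        m21 += v2 * x1
        m22 += v2 * x2
        r1 += v1 * y[t]
        r2 += v2 * y[t]
    return solve2(m11, m12, m21, m22, r1, r2)

def iv_first_order(y, u):
    """IV estimate with instruments taken from an initial LS model."""
    a1, b1 = estimate(y, u, y)                 # initial least-squares estimate
    yhat = [0.0]                               # model output: A1 yhat = B1 u
    for t in range(1, len(y)):
        yhat.append(-a1 * yhat[t - 1] + b1 * u[t - 1])
    return estimate(y, u, yhat)                # instruments built from yhat

def simulate(n, a=-0.8, b=1.0, c=-0.7, sigma=0.3, seed=0):
    """Simulate A y = B u + w with coloured noise C w = e (illustrative)."""
    rng = random.Random(seed)
    u = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    y, w = [0.0], [0.0]
    for t in range(1, n):
        w.append(-c * w[t - 1] + sigma * rng.gauss(0.0, 1.0))
        y.append(-a * y[t - 1] + b * u[t - 1] + w[t])
    return y, u

if __name__ == "__main__":
    y, u = simulate(5000)
    print("LS:", estimate(y, u, y))
    print("IV:", iv_first_order(y, u))
```

Because the instrument sequence is generated from the input only, it is uncorrelated with the noise in open loop; the sketch does not address the closed-loop case.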



• Consistent estimation cannot be guaranteed in general.

• A two-tier method in its off-line version, the IV method is more useful in its recursive form.

• Use of instrumental variables in closed-loop.

– Often the instrumental variable is constructed from the input sequence. This cannot be done in closed-loop, as the input is formed from old outputs, and hence is correlated with the noise $w$, unless $w$ is white. In that situation, the following choices are available.



– Delayed inputs and outputs. If the noise $w$ is assumed to be a moving average of order $n$, then choosing $v(t) = x(t-d)$ with $d > n$ gives an instrument uncorrelated with the noise. This, however, will only work with a time-varying regulator.

– Reference signals. Building the instruments from the setpoint will satisfy the noise independence condition. However, the setpoint must be a sufficiently rich signal for the estimates to converge.

– External signal. This in effect relates to the closed-loop identifiability condition (covered a bit later). A typical external signal is a white noise independent of $w(t)$.


Maximum-likelihood identification

The maximum-likelihood method considers the ARMAX model below, where $u$ is the input, $y$ the output and $e$ is zero-mean white noise with standard deviation $\sigma$:

$$A(q^{-1})y(t) = B(q^{-1})u(t-k) + C(q^{-1})e(t)$$

where

$$A(q^{-1}) = 1 + a_1 q^{-1} + \cdots + a_n q^{-n}$$

$$B(q^{-1}) = b_1 q^{-1} + \cdots + b_n q^{-n}$$

$$C(q^{-1}) = 1 + c_1 q^{-1} + \cdots + c_n q^{-n}$$

The parameters of $A$, $B$, $C$ as well as $\sigma$ are unknown.


Defining

$$\theta^T = [\, a_1 \cdots a_n \;\; b_1 \cdots b_n \;\; c_1 \cdots c_n \,]$$

$$x^T(t) = [\, -y(t-1) \cdots -y(t-n) \;\; u(t-k-1) \cdots u(t-k-n) \;\; e(t-1) \cdots e(t-n) \,]$$

the ARMAX model can be written as

$$y(t) = x^T(t)\theta + e(t)$$

Unfortunately, one cannot use the least-squares method on this model since the sequence $e(t)$ is unknown.

If the parameters are known, the past values of $e(t)$ can be reconstructed exactly from the sequence

$$\varepsilon(t) = \frac{A(q^{-1})y(t) - B(q^{-1})u(t-k)}{C(q^{-1})}$$



Defining the performance index below, the maximum-likelihood method is then summarized by the following steps.

$$V = \frac{1}{2}\sum_{t=1}^{N} \varepsilon^2(t)$$

• Minimize $V$ with respect to $\theta$, using for instance a Newton-Raphson algorithm. Note that $\varepsilon$ is linear in the parameters of $A$ and $B$ but not in those of $C$. We then have to use some iterative procedure

$$\theta_{i+1} = \theta_i - \alpha_i \left(V''(\theta_i)\right)^{-1} V'(\theta_i)$$

• The initial estimate $\theta_0$ is usually obtained from a least-squares estimate

• Estimate the noise variance as

$$\hat{\sigma}^2 = \frac{2}{N} V(\hat{\theta})$$
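To make the residual reconstruction and the variance estimate concrete, here is a small sketch for a first-order ARMAX model. It assumes a one-sample input delay (so the input term is $b_1 u(t-1)$) and zero values before the first sample; the function names `armax_residuals` and `loss_and_variance` are hypothetical. The minimization over $\theta$ itself is not shown.

```python
def armax_residuals(y, u, a1, b1, c1):
    """Reconstruct eps(t) from C eps = A y - B u for first-order A, B, C.

    Uses eps(t) = y(t) + a1 y(t-1) - b1 u(t-1) - c1 eps(t-1), with all
    signals taken as zero before the first sample (an assumption).
    """
    eps = []
    for t in range(len(y)):
        y_prev = y[t - 1] if t > 0 else 0.0
        u_prev = u[t - 1] if t > 0 else 0.0
        e_prev = eps[t - 1] if t > 0 else 0.0
        eps.append(y[t] + a1 * y_prev - b1 * u_prev - c1 * e_prev)
    return eps

def loss_and_variance(eps):
    """Return V = 1/2 * sum(eps^2) and the variance estimate 2 V / N."""
    v = 0.5 * sum(r * r for r in eps)
    return v, 2.0 * v / len(eps)
```

With the true parameters the reconstructed sequence coincides with the noise that generated the data, which is why $V$ can be evaluated for any candidate $\theta$ inside the iterative minimization.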


Properties of the Maximum-likelihood Estimate (MLE)

• If the model order is sufficient, the MLE is consistent, i.e. $\hat{\theta} \to \theta$ as $N \to \infty$.

• The MLE is asymptotically normal with mean $\theta$ and standard deviation $\sigma_\theta$.

• The MLE is asymptotically efficient, i.e. there is no other unbiased estimator giving a smaller $\sigma_\theta$.



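• The Cramér-Rao inequality says that there is a lower limit on the precision of an unbiased estimate, given by

$$\operatorname{cov}\hat{\theta} \geq M_\theta^{-1}$$

where $M_\theta$ is the Fisher information matrix, $M_\theta = -E[(\log L)_{\theta\theta}]$. For the MLE,

$$\sigma_\theta^2 = M_\theta^{-1}$$

i.e.

$$\sigma_\theta^2 = \sigma^2 V_{\theta\theta}^{-1}$$

If $\sigma$ is estimated, then

$$\sigma_\theta^2 = \frac{2V}{N} V_{\theta\theta}^{-1}$$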


Identification In Practice

1. Specify a model structure

2. Compute the best model in this structure

3. Evaluate the properties of this model

4. Test a new structure, go to step 1

5. Stop when satisfactory model is obtained


MATLAB System Identification Toolbox



• Most used package

• Graphical User Interface

• Automates all the steps

• Easy to use

• Familiarize yourself with it by running the examples


RECURSIVE IDENTIFICATION

• There are many situations when it is preferable to perform the identification on-line, such as in adaptive control.

• Identification methods need to be implemented in a recursive fashion, i.e. the parameter estimate at time $t$ should be computed as a function of the estimate at time $t-1$ and of the incoming information at time $t$.

• Recursive least-squares

• Recursive instrumental variables

• Recursive extended least-squares and recursive maximum likelihood


Recursive Least-Squares (RLS)

We have seen that, with $t$ observations available, the least-squares estimate is

$$\hat{\theta}(t) = [X^T(t)X(t)]^{-1} X^T(t) Y(t)$$

with

$$Y^T(t) = [\, y(1) \cdots y(t) \,]$$

$$X(t) = \begin{bmatrix} x^T(1) \\ \vdots \\ x^T(t) \end{bmatrix}$$

Assume one additional observation becomes available; the problem is then to find $\hat{\theta}(t+1)$ as a function of $\hat{\theta}(t)$, $y(t+1)$ and $u(t+1)$.

Defining $X(t+1)$ and $Y(t+1)$ as

$$X(t+1) = \begin{bmatrix} X(t) \\ x^T(t+1) \end{bmatrix} \qquad Y(t+1) = \begin{bmatrix} Y(t) \\ y(t+1) \end{bmatrix}$$

and defining $P(t)$ and $P(t+1)$ as

$$P(t) = [X^T(t)X(t)]^{-1} \qquad P(t+1) = [X^T(t+1)X(t+1)]^{-1}$$

one can write

$$P(t+1) = [X^T(t)X(t) + x(t+1)x^T(t+1)]^{-1}$$

$$\hat{\theta}(t+1) = P(t+1)[X^T(t)Y(t) + x(t+1)y(t+1)]$$


Matrix Inversion Lemma

Let $A$, $D$ and $[D^{-1} + CA^{-1}B]$ be nonsingular square matrices. Then $A + BDC$ is invertible and

$$(A + BDC)^{-1} = A^{-1} - A^{-1}B(D^{-1} + CA^{-1}B)^{-1}CA^{-1}$$

Proof. The simplest way to prove it is by direct multiplication:

$$\begin{aligned}
&(A + BDC)\left(A^{-1} - A^{-1}B(D^{-1} + CA^{-1}B)^{-1}CA^{-1}\right) \\
&\quad = I + BDCA^{-1} - B(D^{-1} + CA^{-1}B)^{-1}CA^{-1} - BDCA^{-1}B(D^{-1} + CA^{-1}B)^{-1}CA^{-1} \\
&\quad = I + BDCA^{-1} - BD(D^{-1} + CA^{-1}B)(D^{-1} + CA^{-1}B)^{-1}CA^{-1} \\
&\quad = I
\end{aligned}$$



An alternative form, useful for deriving recursive least-squares, is obtained when $B$ and $C$ are $n \times 1$ and $1 \times n$ (i.e. column and row vectors) and $D = 1$:

$$(A + BC)^{-1} = A^{-1} - \frac{A^{-1}BCA^{-1}}{1 + CA^{-1}B}$$

Now, consider

$$P(t+1) = [X^T(t)X(t) + x(t+1)x^T(t+1)]^{-1}$$

and use the matrix-inversion lemma with

$$A = X^T(t)X(t) \qquad B = x(t+1) \qquad C = x^T(t+1)$$



Some simple matrix manipulations then give the recursive least-squares algorithm:

$$\hat{\theta}(t+1) = \hat{\theta}(t) + K(t+1)[y(t+1) - x^T(t+1)\hat{\theta}(t)]$$

$$K(t+1) = \frac{P(t)x(t+1)}{1 + x^T(t+1)P(t)x(t+1)}$$

$$P(t+1) = P(t) - \frac{P(t)x(t+1)x^T(t+1)P(t)}{1 + x^T(t+1)P(t)x(t+1)}$$

Note that $K(t+1)$ can also be expressed as

$$K(t+1) = P(t+1)x(t+1)$$
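The manipulations can be spelled out in two short steps. The block below is a sketch of the standard argument; the shorthand $x = x(t+1)$, $y = y(t+1)$ and $P = P(t)$ is introduced here only to keep the lines short.

```latex
% Gain: multiply the rank-one update of P(t+1) by x
P(t+1)\,x = P x - \frac{P x \, x^T P x}{1 + x^T P x}
          = \frac{P x}{1 + x^T P x} = K(t+1)

% Estimate: use X^T(t) Y(t) = P^{-1}(t)\,\hat{\theta}(t)
% and P^{-1}(t) = P^{-1}(t+1) - x x^T
\hat{\theta}(t+1) = P(t+1)\left[\left(P^{-1}(t+1) - x x^T\right)\hat{\theta}(t) + x y\right]
                  = \hat{\theta}(t) + P(t+1)\,x\left[y - x^T \hat{\theta}(t)\right]
```

The first line also shows why the two expressions for $K(t+1)$ agree.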



• The recursive least-squares algorithm is the exact mathematical equivalent of batch least-squares.

• Once initialized, no matrix inversion is needed.

• Matrices stay the same size all the time.

• Computationally very efficient.

• $P$ is proportional to the covariance matrix of the estimate, and is thus called the covariance matrix.

• The algorithm has to be initialized with $\hat{\theta}(0)$ and $P(0)$. Generally, $P(0)$ is initialized as $\alpha I$, where $I$ is the identity matrix and $\alpha$ is a large positive number. The larger $\alpha$, the less confidence is put in the initial estimate $\hat{\theta}(0)$.
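The equivalence with batch least-squares and the role of $\alpha$ can be checked numerically. The sketch below is an illustration in plain Python lists, with hypothetical names (`rls_step`, `run_rls`, `batch_ls2`) and made-up data; with $P(0) = \alpha I$ the recursive result matches the batch one only up to a term of order $1/\alpha$.

```python
def rls_step(theta, P, x, y):
    """One RLS update; theta and x are lists, P is a symmetric list of lists."""
    n = len(theta)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]   # P x
    denom = 1.0 + sum(x[i] * Px[i] for i in range(n))                 # 1 + x^T P x
    err = y - sum(x[i] * theta[i] for i in range(n))                  # y - x^T theta
    theta = [theta[i] + Px[i] / denom * err for i in range(n)]        # gain K = P x / denom
    # P x x^T P equals (P x)(P x)^T because P is symmetric
    P = [[P[i][j] - Px[i] * Px[j] / denom for j in range(n)] for i in range(n)]
    return theta, P

def run_rls(X, Y, alpha=1e6):
    """Process all rows recursively, starting from theta(0) = 0, P(0) = alpha I."""
    n = len(X[0])
    theta = [0.0] * n
    P = [[alpha if i == j else 0.0 for j in range(n)] for i in range(n)]
    for x, y in zip(X, Y):
        theta, P = rls_step(theta, P, x, y)
    return theta

def batch_ls2(X, Y):
    """Batch least-squares for two parameters via the normal equations."""
    s11 = sum(x[0] * x[0] for x in X)
    s12 = sum(x[0] * x[1] for x in X)
    s22 = sum(x[1] * x[1] for x in X)
    r1 = sum(x[0] * y for x, y in zip(X, Y))
    r2 = sum(x[1] * y for x, y in zip(X, Y))
    det = s11 * s22 - s12 * s12
    return [(s22 * r1 - s12 * r2) / det, (s11 * r2 - s12 * r1) / det]

# Made-up data: a noisy straight line y = theta1 + theta2 * k
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
Y = [1.1, 2.9, 5.2, 7.1, 8.8]
print(run_rls(X, Y), batch_ls2(X, Y))
```

Lowering `alpha` pulls the recursive estimate toward the initial guess $\hat{\theta}(0) = 0$, which is the sense in which a small $\alpha$ expresses confidence in that guess.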


RLS and Kalman Filter

There are some very strong connections between the recursive least-squares algorithm and the Kalman filter. Indeed, the RLS algorithm has the structure of a Kalman filter:

$$\underbrace{\hat{\theta}(t+1)}_{\text{new}} = \underbrace{\hat{\theta}(t)}_{\text{old}} + \underbrace{K(t+1)[y(t+1) - x^T(t+1)\hat{\theta}(t)]}_{\text{correction}}$$

where $K(t+1)$ is the Kalman gain.


RLS

The following Matlab code is a straightforward implementation of the RLS algorithm:

function [thetaest,P] = rls(y,x,thetaest,P)
% RLS  One step of recursive least-squares
% y, x: current measurement (scalar) and regressor (column vector)
% thetaest, P: parameter estimates and covariance matrix
denom = 1 + x'*P*x;                        % scalar normalization
K = P*x/denom;                             % gain
P = P - (P*x*x'*P)/denom;                  % covariance matrix update
thetaest = thetaest + K*(y - x'*thetaest); % parameter estimate update
end


Recursive Extended Least-Squares and Recursive Maximum-Likelihood

Because the prediction error is not linear in the $C$-parameters, it is not possible to derive an exact recursive maximum-likelihood method as was done for least-squares.

The ARMAX model

$$A(q^{-1})y(t) = B(q^{-1})u(t) + C(q^{-1})e(t)$$

can be written as

$$y(t) = x^T(t)\theta + e(t)$$

with

$$\theta = [a_1, \ldots, a_n, b_1, \ldots, b_n, c_1, \ldots, c_n]^T$$

$$x^T(t) = [-y(t-1), \ldots, -y(t-n), u(t-1), \ldots, u(t-n), e(t-1), \ldots, e(t-n)]$$


Recursive Extended Least-Squares and Approximate Maximum-Likelihood

• If $e(t)$ were known, RLS could be used to estimate $\theta$; however, it is unknown and thus has to be estimated.

• It can be done in two ways, using either the prediction error or the residual.

• The first case corresponds to the RELS method, the second to the AML method.



• The one-step-ahead prediction error is defined as

$$\varepsilon(t) = y(t) - \hat{y}(t \mid t-1) = y(t) - x^T(t)\hat{\theta}(t-1)$$

$$x(t) = [-y(t-1), \ldots, u(t-1), \ldots, \varepsilon(t-1), \ldots, \varepsilon(t-n)]^T$$

• The residual is defined as

$$\eta(t) = y(t) - \hat{y}(t \mid t) = y(t) - x^T(t)\hat{\theta}(t)$$

$$x(t) = [-y(t-1), \ldots, u(t-1), \ldots, \eta(t-1), \ldots, \eta(t-n)]^T$$



• Sometimes $\varepsilon(t)$ and $\eta(t)$ are also referred to as the a priori and a posteriori prediction errors.

• Because it uses the latest estimate $\hat{\theta}(t)$, as opposed to $\hat{\theta}(t-1)$ for $\varepsilon(t)$, $\eta(t)$ is a better estimate, especially during transients.

• Note, however, that if $\hat{\theta}(t)$ converges as $t \to \infty$, then $\eta(t) \to \varepsilon(t)$.



The two schemes are then described by

$$\hat{\theta}(t+1) = \hat{\theta}(t) + K(t+1)[y(t+1) - x^T(t+1)\hat{\theta}(t)]$$

$$K(t+1) = \frac{P(t)x(t+1)}{1 + x^T(t+1)P(t)x(t+1)}$$

$$P(t+1) = P(t) - \frac{P(t)x(t+1)x^T(t+1)P(t)}{1 + x^T(t+1)P(t)x(t+1)}$$

but differ in their definition of $x(t)$.



• The RELS algorithm uses the prediction error. This algorithm is called RELS, Extended Matrix or RML1 in the literature. It has generally good convergence properties, and has been proved consistent for moving-average and first-order autoregressive processes. However, counterexamples to general convergence exist; see for example Ljung (1975).

• The AML algorithm uses the residual error. The AML has better convergence properties than the RML, and indeed convergence can be proven under rather unrestrictive conditions.
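The difference between the two schemes amounts to one line of code. The sketch below is an illustrative first-order implementation under stated assumptions: the names `rls_step`, `recursive_armax` and `simulate_armax`, the choice $P(0) = 100 I$, and the simulated system are all hypothetical, and a single run says nothing about convergence in general.

```python
import random

def rls_step(theta, P, x, y):
    """One RLS update; theta and x are lists, P is a symmetric list of lists."""
    n = len(theta)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(x[i] * Px[i] for i in range(n))
    err = y - sum(x[i] * theta[i] for i in range(n))
    theta = [theta[i] + Px[i] / denom * err for i in range(n)]
    P = [[P[i][j] - Px[i] * Px[j] / denom for j in range(n)] for i in range(n)]
    return theta, P

def recursive_armax(y, u, use_residual=False):
    """RELS (use_residual=False) or AML (use_residual=True), first order.

    theta = [a1, b1, c1]; the unknown e(t-1) in the regressor is replaced
    by the previous prediction error (RELS) or residual (AML).
    """
    theta = [0.0, 0.0, 0.0]
    P = [[100.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    last = 0.0
    for t in range(1, len(y)):
        x = [-y[t - 1], u[t - 1], last]
        eps = y[t] - sum(xi * th for xi, th in zip(x, theta))   # uses theta(t-1)
        theta, P = rls_step(theta, P, x, y[t])
        eta = y[t] - sum(xi * th for xi, th in zip(x, theta))   # uses theta(t)
        last = eta if use_residual else eps
    return theta

def simulate_armax(n, a=-0.8, b=1.0, c=0.5, sigma=0.1, seed=1):
    """Simulate y(t) = -a y(t-1) + b u(t-1) + e(t) + c e(t-1) (illustrative)."""
    rng = random.Random(seed)
    u = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    e = [sigma * rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [0.0]
    for t in range(1, n):
        y.append(-a * y[t - 1] + b * u[t - 1] + e[t] + c * e[t - 1])
    return y, u

if __name__ == "__main__":
    y, u = simulate_armax(3000)
    print("RELS:", recursive_armax(y, u))
    print("AML: ", recursive_armax(y, u, use_residual=True))
```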


Recursive Maximum-Likelihood

• Yet another approach.

• The ML can also be interpreted in terms of data filtering. Consider the performance index

$$V(t) = \frac{1}{2}\sum_{i=1}^{t} \varepsilon^2(i)$$

with $\varepsilon(t) = y(t) - x^T(t)\hat{\theta}(t-1)$.



• Because $V$ is a nonlinear function of $C$, it has to be approximated by a Taylor series truncated after the second term. The resulting scheme is then:

$$\hat{\theta}(t+1) = \hat{\theta}(t) + K(t+1)[y(t+1) - x^T(t+1)\hat{\theta}(t)]$$

$$K(t+1) = \frac{P(t)x_f(t+1)}{1 + x_f^T(t+1)P(t)x_f(t+1)}$$

$$P(t+1) = P(t) - \frac{P(t)x_f(t+1)x_f^T(t+1)P(t)}{1 + x_f^T(t+1)P(t)x_f(t+1)}$$

where $x_f$ denotes the regressor $x$ filtered through $1/\hat{C}(q^{-1})$.


Properties of AML and RML

Definition 1. A discrete transfer function $H(z)$ is said to be strictly positive real if it is stable and

$$\operatorname{Re} H(e^{jw}) > 0 \quad \forall w, \; -\pi < w \leq \pi$$

on the unit circle.

This condition can be checked by replacing $z$ by $\frac{1+jw}{1-jw}$ and extracting the real part of the resulting expression.

For the convergence of AML, the following theorem is then available.



Theorem (Ljung & Söderström, 1983). Assume both process and model are described by ARMAX models with model order ≥ process order. Then if

1. $\{u(t)\}$ is sufficiently rich

2. $\frac{1}{C(q^{-1})} - \frac{1}{2}$ is strictly positive real

then $\hat{\theta}(t)$ will converge such that

$$E[\varepsilon(t, \hat{\theta}) - e(t)]^2 = 0$$

If model and process have the same order, this implies

$$\hat{\theta}(t) \longrightarrow \theta \text{ as } t \longrightarrow \infty$$
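Condition 2 can be explored numerically by evaluating the real part on a frequency grid instead of using the bilinear substitution. The sketch below is only an indication, not a proof: a finite grid can miss narrow violations, stability of $1/C$ is not checked, and the function name `aml_condition_on_grid` and the two example polynomials are chosen for illustration.

```python
import cmath
import math

def aml_condition_on_grid(c_coeffs, points=2000):
    """Grid test of Re[1/C - 1/2] > 0 for C = 1 + c1 q^-1 + ... + cn q^-n.

    c_coeffs holds [c1, ..., cn]; q^-1 is evaluated as exp(-j w) for
    frequencies w spread over -pi < w <= pi.
    """
    for k in range(points):
        w = -math.pi + 2.0 * math.pi * (k + 1) / points
        q_inv = cmath.exp(-1j * w)                       # q^-1 on the unit circle
        C = 1.0 + sum(c * q_inv ** (i + 1) for i, c in enumerate(c_coeffs))
        if (1.0 / C).real - 0.5 <= 0.0:
            return False
    return True

print(aml_condition_on_grid([0.5]))          # first-order C with |c1| < 1
print(aml_condition_on_grid([1.5, 0.75]))    # second-order C, fails near w = 0
```

For the second polynomial, $C = 3.25$ at $w = 0$, so $1/C - 1/2 < 0$ there even though the roots of $C$ lie inside the unit circle.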


A Unified Algorithm

Looking at all the previous algorithms, it is obvious that they all have the same form, with only different parameters. They can all be represented by a recursive prediction-error method (RPEM):

$$\hat{\theta}(t+1) = \hat{\theta}(t) + K(t+1)\varepsilon(t+1)$$

$$K(t+1) = \frac{P(t)z(t+1)}{1 + x^T(t+1)P(t)z(t+1)}$$

$$P(t+1) = P(t) - \frac{P(t)z(t+1)x^T(t+1)P(t)}{1 + x^T(t+1)P(t)z(t+1)}$$


Tracking Time-Varying Parameters

All previous methods use the least-squares criterion

$$V(t) = \frac{1}{t}\sum_{i=1}^{t} [y(i) - x^T(i)\theta]^2$$

and thus identify the average behaviour of the process. When the parameters are time-varying, it is desirable to base the identification on the most recent data rather than on old data that are no longer representative of the process. This can be achieved by exponential discounting of old data, using the criterion

$$V(t) = \frac{1}{t}\sum_{i=1}^{t} \lambda^{t-i}[y(i) - x^T(i)\theta]^2$$

where $0 < \lambda \leq 1$ is called the forgetting factor.



The new criterion can also be written

$$V(t) = \lambda V(t-1) + [y(t) - x^T(t)\theta]^2$$

Then, it can be shown (Goodwin and Payne, 1977) that the RLS scheme becomes

$$\hat{\theta}(t+1) = \hat{\theta}(t) + K(t+1)[y(t+1) - x^T(t+1)\hat{\theta}(t)]$$

$$K(t+1) = \frac{P(t)x(t+1)}{\lambda + x^T(t+1)P(t)x(t+1)}$$

$$P(t+1) = \frac{1}{\lambda}\left(P(t) - \frac{P(t)x(t+1)x^T(t+1)P(t)}{\lambda + x^T(t+1)P(t)x(t+1)}\right)$$

In choosing $\lambda$, one has to compromise between fast tracking and long-term quality of the estimates. The use of forgetting may give rise to problems.



The smaller $\lambda$ is, the faster the algorithm can track, but the more the estimates will vary, even if the true parameters are time-invariant.

A small $\lambda$ may also cause blow-up of the covariance matrix $P$, since in the absence of excitation the covariance matrix update equation essentially becomes

$$P(t+1) = \frac{1}{\lambda}P(t)$$

in which case $P$ grows exponentially, leading to wild fluctuations in the parameter estimates.

One way around this is to vary the forgetting factor according to the prediction error $\varepsilon$, as in

$$\lambda(t) = 1 - k\varepsilon^2(t)$$

Then, in case of low excitation, $\varepsilon$ will be small and $\lambda$ will be close to 1. In case of large prediction errors, $\lambda$ will decrease.
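The blow-up is easy to reproduce with a single-parameter model, where $P$ and $x$ are scalars. The sketch below is an illustration with a hypothetical function name (`rls_forgetting_step`) and arbitrary values; it applies the fixed-$\lambda$ update, not the variable forgetting factor.

```python
def rls_forgetting_step(theta, P, x, y, lam):
    """Scalar RLS update with forgetting factor lam (0 < lam <= 1)."""
    denom = lam + x * P * x
    K = P * x / denom                       # gain
    theta = theta + K * (y - x * theta)     # estimate update
    P = (P - P * x * x * P / denom) / lam   # covariance update
    return theta, P

# No excitation (x = 0): each step multiplies P by 1/lam
theta, P = 0.0, 1.0
for _ in range(100):
    theta, P = rls_forgetting_step(theta, P, 0.0, 0.0, 0.95)
print(P)

# Constant excitation (x = 1, y = 2): P settles instead of growing
theta2, P2 = 0.0, 1.0
for _ in range(500):
    theta2, P2 = rls_forgetting_step(theta2, P2, 1.0, 2.0, 0.95)
print(theta2, P2)
```

In the excited scalar case the update reduces to $P(t+1) = P(t)/(\lambda + P(t))$, whose fixed point is $1 - \lambda$, so $P$ stays small as long as data keep arriving.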


Exponential Forgetting and Resetting Algorithm

The following scheme, due to Salgado, Goodwin and Middleton [5], is recommended:

$$\varepsilon(t+1) = y(t+1) - x^T(t+1)\hat{\theta}(t)$$

$$\hat{\theta}(t+1) = \hat{\theta}(t) + \frac{\alpha P(t)x(t+1)}{\lambda + x^T(t+1)P(t)x(t+1)}\varepsilon(t+1)$$

$$P(t+1) = \frac{1}{\lambda}\left[P(t) - \frac{P(t)x(t+1)x^T(t+1)P(t)}{\lambda + x^T(t+1)P(t)x(t+1)}\right] + \beta I - \gamma P(t)^2$$

where $I$ is the identity matrix, and $\alpha$, $\beta$ and $\gamma$ are constants.

[5] M.E. Salgado, G.C. Goodwin, and R.H. Middleton, "Exponential Forgetting and Resetting", International Journal of Control, vol. 47, no. 2, pp. 477-485, 1988.



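With the EFRA, the covariance matrix is bounded on both sides:

$$\sigma_{\min} I \leq P(t) \leq \sigma_{\max} I \quad \forall t$$

where

$$\sigma_{\min} \approx \frac{\beta}{\alpha - \eta} \qquad \sigma_{\max} \approx \frac{\eta}{\gamma} + \frac{\beta}{\eta}$$

with

$$\eta = \frac{1 - \lambda}{\lambda}$$

With $\alpha = 0.5$, $\beta = \gamma = 0.005$ and $\lambda = 0.95$, $\sigma_{\min} \approx 0.01$ and $\sigma_{\max} \approx 10$.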
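The upper bound can be observed in a scalar sketch of the update as written in these notes, using the quoted constants. The function name `efra_step` and the test signals are hypothetical, and only the no-excitation limit is compared with the formula for $\sigma_{\max}$.

```python
def efra_step(theta, P, x, y, alpha=0.5, beta=0.005, gamma=0.005, lam=0.95):
    """One scalar EFRA update (single parameter, so P and x are numbers)."""
    eps = y - x * theta                                  # prediction error
    denom = lam + x * P * x
    theta = theta + alpha * P * x * eps / denom          # estimate update
    P = (P - P * x * x * P / denom) / lam + beta - gamma * P * P
    return theta, P

# No excitation: P no longer grows without bound but levels off
theta, P = 0.0, 1.0
for _ in range(2000):
    theta, P = efra_step(theta, P, 0.0, 0.0)
eta = (1.0 - 0.95) / 0.95
print(P, eta / 0.005 + 0.005 / eta)      # simulated limit vs. sigma_max formula

# Constant excitation: the estimate still converges and P stays positive
theta2, P2 = 0.0, 1.0
for _ in range(2000):
    theta2, P2 = efra_step(theta2, P2, 1.0, 2.0)
print(theta2, P2)
```

With $x = 0$ the scalar update reads $P(t+1) = P(t)/\lambda + \beta - \gamma P(t)^2$, so the $-\gamma P^2$ term is what halts the exponential growth seen with plain forgetting.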

