
1

Independent Component Analysis

Reference: Independent Component Analysis: A Tutorial

by Aapo Hyvarinen, http://www.cis.hut.fi/projects/ica


2

Motivation of ICA

The Cocktail-Party Problem

• At a party, three people at different locations are speaking at the same time (S)

• The three voices are mixed together, so we cannot tell who said what
• Three microphones at different locations record the sound in the room (X)

• Can the signals recorded by the microphones (X) be separated back into the three original speech signals (S)?

Demo


3

Formulation of ICA

• Two speech signals s1(t) and s2(t) are received by two microphones; the mixed signals are x1(t) and x2(t):

• It will be very useful if we could estimate the original signals s1(t) and s2(t), from only the recorded signals x1(t) and x2(t)

$$x_1(t) = a_{11} s_1 + a_{12} s_2 \qquad (1)$$

$$x_2(t) = a_{21} s_1 + a_{22} s_2 \qquad (2)$$
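To make this two-microphone model concrete, here is a minimal NumPy sketch (not from the tutorial) that mixes two illustrative source signals with an assumed coefficient matrix a_ij:

```python
import numpy as np

# Two illustrative source signals (stand-ins for the two speakers)
t = np.linspace(0, 1, 1000)
s1 = np.sin(2 * np.pi * 5 * t)           # speaker 1
s2 = np.sign(np.sin(2 * np.pi * 3 * t))  # speaker 2 (square wave)
S = np.vstack([s1, s2])                  # shape (2, T)

# Assumed mixing coefficients a_ij (unknown in the real problem)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

X = A @ S                                # x1(t), x2(t): what the microphones record
print(X.shape)                           # (2, 1000)
```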


4

Formulation of ICA

• Suppose the aij are known; then solving the linear Equations 1 and 2 retrieves s1(t) and s2(t).
• The problem is that we do not know the aij.
• One approach is to use information about the statistical properties of the signals si(t) to estimate the aij.

• Assume s1(t) and s2(t) are statistically independent; then Independent Component Analysis techniques can recover s1(t) and s2(t) from the mixtures x1(t) and x2(t).


5

Original signals s1(t), s2(t)

Mixture signals x1(t), x2(t)

Recovered signals for s1(t), s2(t)


6

Definition of ICA

• For n linear mixtures x1, …, xn of n independent components

• The independent components si are latent variables, meaning that they cannot be directly observed, and the mixing matrix A is assumed to be unknown.

• We would like to estimate both A and s using only the observable random vector x and some statistical assumptions.

$$x_j(t) = a_{j1} s_1 + a_{j2} s_2 + \cdots + a_{jn} s_n, \ \text{for all } j \qquad (3)$$

$$x = A s \qquad (4)$$

$$x = \sum_{i=1}^{n} a_i s_i \qquad (5)$$


7

Definition of ICA

• x = As; y = Bx; y is a copy of s

• If C is non-mixing then y=Cs is a copy of s

• A square matrix is said to be non-mixing if it has one and only one non-zero entry in each row and each column

[Block diagram: s → A → x → B → y, where y is a copy of s]


8

Illustration of ICA

• We use two independent components with the following uniform distributions to illustrate the ICA model:

– The distribution has zero mean and variance equal to one.
– Let us mix these two independent components with the following mixing matrix.

– This gives us two mixed variables x1 and x2.

– The mixed data has a uniform distribution on a parallelogram.
– But x1 and x2 are not independent any more: when x1 attains its maximum or minimum, this also determines the value of x2.

$$p(s_i) = \begin{cases} \dfrac{1}{2\sqrt{3}} & \text{if } |s_i| \le \sqrt{3} \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

$$A_0 = \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}$$
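A short sketch of this illustration (with A0 as reconstructed above; treat the exact entries as an assumption): it draws uniform sources satisfying (7), mixes them, and checks that the mixtures become correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
# Uniform sources on [-sqrt(3), sqrt(3)]: zero mean, unit variance, as in (7)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 5000))

A0 = np.array([[2.0, 3.0],
               [2.0, 1.0]])
X = A0 @ S   # the joint distribution of (x1, x2) is uniform on a parallelogram

# s1 and s2 are uncorrelated, but the mixtures x1 and x2 are not
print(np.round(np.corrcoef(S)[0, 1], 3))   # ~0
print(np.round(np.corrcoef(X)[0, 1], 3))   # clearly nonzero
```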


9

Illustration of ICA

Fig 5. Joint density distribution of the original signal s1 and s2

Fig 6. Joint density distribution of the observed mixtures x1 and x2


10

Illustration of ICA

• The problem of estimating the ICA data model is now to estimate the mixing matrix A0 using only the information contained in the mixtures x1 and x2.

• We can see from Fig 6 an intuitive way of estimating A:
• the edges of the parallelogram are in the directions of the columns of A. That is, estimate the ICA model by
– first estimating the joint density of x1 and x2, and then
– locating the edges.

• However, this only works for random variables with uniform distributions

• We need a method that works for any type of distribution.


11

Ambiguities of ICA

Because y = Bx is just a copy of s:

– we cannot determine the variances (energies) of the independent components.

– we cannot determine the order of the independent components.

• applying a permutation matrix P to x = As, i.e., x = AP^{-1}Ps, Ps is still like the original signals, and

• AP^{-1} is just a new unknown mixing matrix, to be solved by the ICA algorithms,

• the order of s will be changed.


12

Properties of ICA - Independence

• The variables y1 and y2 are said to be independent if information on the value of y1 does not give any information on the value of y2, and vice versa.

• Let p(y1, y2) be the joint probability density function (pdf) of y1 and y2, and let p1(y1) be the marginal pdf of y1; then

• y1 and y2 are independent if and only if the joint pdf is factorizable.

• Thus, given two functions h1 and h2 , we always have

$$p_1(y_1) = \int p(y_1, y_2)\, dy_2 \qquad (9)$$

$$p(y_1, y_2) = p_1(y_1)\, p_2(y_2) \qquad (10)$$

$$E\{h_1(y_1)\, h_2(y_2)\} = E\{h_1(y_1)\}\, E\{h_2(y_2)\} \qquad (11)$$


13

Properties of ICA - Uncorrelated variables are only partly independent

– Two variables y1 and y2 are said to be uncorrelated if their covariance is zero

– If the variables are independent, they are uncorrelated, but the reverse is not true!

• For example: sin(x) and cos(x) are dependent (both are functions of x), yet for x uniform on [0, 2π) we have cov(sin(x), cos(x)) = 0.

$$E\{y_1 y_2\} - E\{y_1\}\, E\{y_2\} = 0 \qquad (13)$$
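A quick numerical check of this example (assuming x uniform on [0, 2π)): the covariance of sin(x) and cos(x) vanishes, yet condition (11) fails, so the two variables are not independent.

```python
import numpy as np

x = np.random.default_rng(0).uniform(0, 2 * np.pi, 100000)
y1, y2 = np.sin(x), np.cos(x)

# Uncorrelated: the covariance in (13) is (numerically) zero
print(np.mean(y1 * y2) - np.mean(y1) * np.mean(y2))             # ~0

# Not independent: E{h1(y1) h2(y2)} != E{h1(y1)} E{h2(y2)} for h(u) = u^2
print(np.mean(y1**2 * y2**2), np.mean(y1**2) * np.mean(y2**2))  # ~0.125 vs ~0.25
```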


14

Gaussian variables are forbidden

• The fundamental restriction in ICA is that the independent components must be nongaussian for ICA to be possible.

• assume the mixing matrix is orthogonal and si are gaussian, then x1 and x2 are gaussian, uncorrelated, and of unit variance.

• The joint pdf is

• The distribution is completely symmetric (shown in the figure on the next page); it does not contain any information on the directions of the columns of the mixing matrix A.

• Thus A cannot be estimated.

$$p(x_1, x_2) = \frac{1}{2\pi} \exp\!\left(-\frac{x_1^2 + x_2^2}{2}\right) \qquad (15)$$


15

Fig 7. Multivariate distribution of two independent gaussian variables


16

ICA Basics

• Source separation by ICA must go beyond second-order statistics.
• Any time structure is ignored, because the information contained in the data is exhaustively represented by the sample distribution of the observed vector.

• Source separation can be obtained by optimizing a 'contrast function'
– i.e., a function that measures independence.


17

Measures of independence

• Nongaussian is independent
– The key to estimating the ICA model is nongaussianity.
– The central limit theorem (CLT) tells us that the distribution of a sum of independent random variables tends toward a gaussian distribution. In other words,
– a mixture of two independent signals usually has a distribution that is closer to gaussian than either of the two original signals.

• Suppose we want to estimate y, one of the independent components of s, from x.

• Let us denote this by y = w^T x = Σ_i w_i x_i, where w is a vector to be determined.

• How can we use the CLT to determine w so that it equals one of the rows of the inverse of A?


18

Nongaussian is independent

• Let us make a change of variables, z = A^T w.
• Then we have y = w^T x = w^T A s = z^T s = Σ_i z_i s_i.

• thus y=zTs is more gaussian than the original variables si

• y becomes least gaussian when it equals one of the si; this happens precisely when

• only one of the elements zi of z is nonzero.

• Maximizing the nongaussianity of w^T x therefore gives us one of the independent components.


19

Measures of nongaussianity

• To use nongaussianity in ICA, we must have a quantitative measure of the nongaussianity of a random variable y.

Kurtosis
• The classical measure of nongaussianity is kurtosis, or the fourth-order cumulant:

$$\mathrm{kurt}(y) = E\{y^4\} - 3\left(E\{y^2\}\right)^2 \qquad (16)$$

• Assume y is of unit variance; then kurt(y) = E{y^4} - 3.
• Kurtosis is thus simply a normalized version of the fourth moment E{y^4}.
• For a gaussian y, the fourth moment equals 3(E{y^2})^2,

• thus, kurtosis is zero for a gaussian random variable.
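A sample-based sketch of (16), assuming zero-mean, unit-variance data obtained by standardizing; it shows the sign of kurtosis for gaussian, uniform, and Laplace samples.

```python
import numpy as np

def kurt(y):
    """Kurtosis as in (16): E{y^4} - 3(E{y^2})^2, after standardizing y."""
    y = (y - y.mean()) / y.std()
    return np.mean(y**4) - 3 * np.mean(y**2)**2

rng = np.random.default_rng(0)
n = 200000
print(round(kurt(rng.normal(size=n)), 2))         # ~0 for gaussian samples
print(round(kurt(rng.uniform(-1, 1, size=n)), 2)) # negative for uniform samples
print(round(kurt(rng.laplace(size=n)), 2))        # positive for Laplace samples
```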


20

Kurtosis

• Kurtosis can be both positive and negative.

• RVs with negative kurtosis are called subgaussian.

• Subgaussian RVs typically have a flat pdf, which is rather constant near zero and very small for larger values.
– The uniform distribution is a typical example of a subgaussian distribution.

• Supergaussian RVs (positive kurtosis) have a spiky pdf with heavy tails.
– The Laplace distribution is a typical example of a supergaussian distribution:

$$p(y) = \frac{1}{\sqrt{2}} \exp\!\left(-\sqrt{2}\,|y|\right) \qquad (17)$$


21

Kurtosis (c)

• Typically, nongaussianity is measured by the absolute value of kurtosis.

• Kurtosis can be estimated from the fourth moments of the sample data.

• If x1 and x2 are two independent RVs, it holds that

$$\mathrm{kurt}(x_1 + x_2) = \mathrm{kurt}(x_1) + \mathrm{kurt}(x_2) \qquad (18)$$

$$\mathrm{kurt}(\alpha x_1) = \alpha^4\, \mathrm{kurt}(x_1) \qquad (19)$$

• To illustrate with a simple example what the optimization landscape for kurtosis looks like, let us look at a 2-D model x = As.

• We seek one of the independent components as y = w^T x.
• Let z = A^T w; then y = w^T x = w^T A s = z^T s = z1 s1 + z2 s2.


22

Kurtosis (c)

• Using the additivity properties of kurtosis, we have

$$\mathrm{kurt}(y) = \mathrm{kurt}(z_1 s_1) + \mathrm{kurt}(z_2 s_2) = z_1^4\, \mathrm{kurt}(s_1) + z_2^4\, \mathrm{kurt}(s_2)$$

• let’s apply a constraint on y that the variance of y is equal to 1, that is the same assumption concerning s1 and s2.

• Thus E{y^2} = z1^2 + z2^2 = 1; this means that the vector z is constrained to the unit circle in the 2-D plane.

• The optimization problem becomes: what are the maxima of the function |kurt(y)| = |z1^4 kurt(s1) + z2^4 kurt(s2)| on the unit circle?

• The maxima are the points where exactly one element of z is ±1 and the other is zero, i.e., z = (±1, 0) or (0, ±1).

• These points correspond to y equaling ±si for one of the independent components.
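A small numerical sketch of this landscape, assuming two unit-variance uniform sources (negative kurtosis): it scans z around the unit circle and confirms that |kurt(y)| peaks where z has a single nonzero element.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 200000))  # two unit-variance uniform sources

def kurt(y):
    return np.mean(y**4) - 3 * np.mean(y**2)**2

# Scan z = (cos t, sin t) around the unit circle and evaluate |kurt(z^T s)|
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
landscape = [abs(kurt(np.cos(t) * s[0] + np.sin(t) * s[1])) for t in angles]

best = angles[int(np.argmax(landscape))]
print(np.round([np.cos(best), np.sin(best)], 2))  # close to (±1, 0) or (0, ±1)
```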


23

Kurtosis (c)

• In practice we could start from a weight vector w, compute the direction in which the kurtosis of y = w^T x is growing or decreasing most strongly based on the available sample x(1), …, x(T) of the mixture vector x, and use a gradient method for finding a new vector w.

• However, kurtosis has some drawbacks,

• the main problem is that kurtosis can be very sensitive to outliers, in other words kurtosis is not a robust measure of nongaussianity.

• In the following sections, we would like to introduce negentropy, whose properties are rather opposite to those of kurtosis.


24

Negentropy

• Negentropy is based on the information-theoretic quantity of entropy.

• The entropy of a RV is a measure of the degree of randomness of the observed variables.

• The more unpredictable and unstructured the variable is, the larger is its entropy.

• Entropy is defined for a RV Y as:

$$H(Y) = -\sum_i P(Y = a_i) \log P(Y = a_i) \quad \text{for a discrete RV} \qquad (20)$$

$$H(y) = -\int f(y) \log f(y)\, dy \quad \text{for an RV with density } f(y) \qquad (21)$$

• A fundamental result of information theory is that a gaussian variable has the largest entropy among all random variables of equal variance.

• Thus, entropy can be used to measure nongaussianity.


25

Negentropy

• To obtain a measure of nongaussianity that is zero for a gaussian variable and always nonnegative, one often uses negentropy J, which is defined as:

$$J(y) = H(y_{\mathrm{gauss}}) - H(y) \qquad (22)$$

• where ygauss is a gaussian RV of the same covariance matrix as y.

• the advantage of using Negentropy is it is in some sense the optimal estimator of nongaussianity, as far as statistical properties are concerned.

• The problem with using negentropy is that it is still computationally very difficult to estimate.

• Thus simpler approximations of negentropy seem necessary and useful.


26

Approximations of negentropy

• The classical method of approximating negentropy uses higher-order moments, for example:

$$J(y) \approx \frac{1}{12} E\{y^3\}^2 + \frac{1}{48} \mathrm{kurt}(y)^2 \qquad (23)$$

• The RV y is assumed to be of zero mean and unit variance.

• This approach still suffers from the same nonrobustness as kurtosis.

• Other approximations were developed based on the maximum-entropy principle:

$$J(y) \approx c \left[ E\{G(y)\} - E\{G(v)\} \right]^2 \qquad (25)$$

• where v is a gaussian variable of zero mean and unit variance, and G is a nonquadratic function.


27

Approximations of negentropy

• Taking G(y) = y^4, (25) becomes (23).
• Suppose instead that G is chosen to be slowly growing, as in the following contrast functions:

$$G_1(u) = \frac{1}{a_1} \log \cosh(a_1 u), \qquad G_2(u) = -\exp\!\left(-\frac{u^2}{2}\right) \qquad (26)$$

where 1 ≤ a1 ≤ 2.

• This approximation is conceptually simple, fast to compute, and especially robust.

• A practical algorithm based on these contrast functions will be presented in Section 6.
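The following is a rough sketch (not the tutorial's code) of the approximation (25) with the contrast function G1 from (26); the proportionality constant c is taken to be 1 and the gaussian reference term E{G(v)} is estimated by Monte Carlo, both of which are assumptions made here for illustration.

```python
import numpy as np

def negentropy_approx(y, a1=1.0, seed=0):
    """Sketch of (25) with G1(u) = (1/a1) log cosh(a1 u); c is assumed to be 1."""
    G = lambda u: np.log(np.cosh(a1 * u)) / a1
    y = (y - y.mean()) / y.std()                            # zero mean, unit variance
    v = np.random.default_rng(seed).standard_normal(10**6)  # gaussian reference variable
    return (np.mean(G(y)) - np.mean(G(v)))**2

rng = np.random.default_rng(1)
print(negentropy_approx(rng.normal(size=100000)))   # ~0 for gaussian data
print(negentropy_approx(rng.laplace(size=100000)))  # clearly positive for nongaussian data
```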


28

Preprocessing - centering

• Some preprocessing techniques make the problem of ICA estimation simpler and better conditioned.

Centering
• Center the variable x, i.e., subtract its mean vector m = E{x}, so as to make x a zero-mean variable.
• This preprocessing is done solely to simplify the ICA algorithms.
• After estimating the mixing matrix A with centered data, we can complete the estimation by adding the mean vector of s back to the centered estimates of s.

• The mean vector of s is given by A^{-1}m, where m is the mean vector that was subtracted in the preprocessing.


29

Preprocessing - whitening

• Another preprocessing step is to whiten the observed variables.

• Whitening means transforming the variable x linearly so that the new variable x̃ is white, i.e., its components are uncorrelated and their variances equal unity.

• In other words, x̃ is white means that the covariance matrix of x̃ equals the identity matrix:

$$E\{\tilde{x}\tilde{x}^T\} = I \qquad (33)$$

For example, for a three-dimensional vector the covariance matrix is

$$\begin{pmatrix} \mathrm{Cov}(1,1) & \mathrm{Cov}(1,2) & \mathrm{Cov}(1,3) \\ \mathrm{Cov}(2,1) & \mathrm{Cov}(2,2) & \mathrm{Cov}(2,3) \\ \mathrm{Cov}(3,1) & \mathrm{Cov}(3,2) & \mathrm{Cov}(3,3) \end{pmatrix}$$


30

Preprocessing - whitening

• The correlation between two variables x and y is

$$\rho(x, y) = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y}$$

• The covariance between x and y is

$$\mathrm{Cov}(x, y) = \frac{1}{n} \sum_i (x_i - \mu_x)(y_i - \mu_y)$$

• The covariance Cov(x, y) can be computed by

$$\mathrm{Cov}(x, y) = \frac{1}{n}\sum_i x_i y_i - \Big(\frac{1}{n}\sum_i x_i\Big)\Big(\frac{1}{n}\sum_i y_i\Big) = E[xy] - E[x]\,E[y]$$

• If two variables are uncorrelated, then ρ(x, y) = Cov(x, y) = 0.
• A covariance matrix equal to I means that Cov(x_i, x_j) = 0 whenever i ≠ j.
• If a vector's covariance matrix is the identity (white), then its components are uncorrelated.


31

Preprocessing - whitening

• Although uncorrelated variables are only partly independent, decorrelation (using second-order information) can be used to reduce the problem to a simpler form.

• The unwhitened mixing matrix A has n^2 parameters, whereas the whitened (orthogonal) mixing matrix Ã has fewer, about half as many, free parameters.


32

Fig 10. The data of Fig 6 after whitening. The square defining the distribution is clearly a rotated version of the original square in Fig 5. All that is left is to estimate the single angle that gives the rotation.


33

Preprocessing - whitening

• Whitening can be computed by the eigenvalue decomposition (EVD) of the covariance matrix E{xx^T} = E D E^T

– E is the orthogonal matrix of eigenvectors of E{xx^T}

– D is a diagonal matrix of its eigenvalues, D=diag(d1,…,dn)

– note that E{xxT} can be estimated in a standard way from the available sample of x(1), …, x(T).


34

Preprocessing - whitening

• Whitening can now be computed by

$$\tilde{x} = E D^{-1/2} E^T x \qquad (34)$$

where D^{-1/2} = diag(d_1^{-1/2}, …, d_n^{-1/2}).

• It is easy to verify that E{x̃x̃^T} = I, using (34) and E{xx^T} = E D E^T.

• Substituting x = As, whitening transforms the mixing matrix into a new matrix Ã:

$$\tilde{x} = E D^{-1/2} E^T A s = \tilde{A} s \qquad (35)$$

• Since

$$E\{\tilde{x}\tilde{x}^T\} = \tilde{A}\, E\{ss^T\}\, \tilde{A}^T = \tilde{A}\tilde{A}^T = I \qquad (36)$$

• the new mixing matrix Ã is orthogonal.
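A minimal NumPy sketch of whitening by EVD as in (34), assuming the data matrix has already been centered (function and variable names are illustrative):

```python
import numpy as np

def whiten(X):
    """Whitening via EVD of the covariance matrix, as in (34). X has shape (n, T), centered."""
    C = X @ X.T / X.shape[1]          # sample estimate of E{x x^T}
    d, E = np.linalg.eigh(C)          # C = E diag(d) E^T
    V = E @ np.diag(d ** -0.5) @ E.T  # whitening matrix E D^{-1/2} E^T
    return V @ X, V

rng = np.random.default_rng(0)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 10000))
X = np.array([[2.0, 3.0], [2.0, 1.0]]) @ S
X_white, V = whiten(X - X.mean(axis=1, keepdims=True))
print(np.round(np.cov(X_white), 2))   # approximately the identity matrix
```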


35

The FastICA Algorithm - FastICA for one unit

• The FastICA learning rule finds a direction, i.e., a unit vector w, such that the projection w^T x maximizes nongaussianity, as measured by the approximation of negentropy J(w^T x).

• The variance of y = w^T x must be constrained to unity; for whitened data this is equivalent to constraining the norm of w to unity, i.e., E{(w^T x)^2} = ||w||^2 = 1.

• In the following algorithm, g denotes the derivative of the nonquadratic function G used in (25):

$$J(y) \approx c \left[ E\{G(y)\} - E\{G(v)\} \right]^2 \qquad (25)$$


36

FastICA for one unit

• The FastICA algorithm:
1) Choose an initial (e.g., random) weight vector w.
2) Let w+ = E{x g(w^T x)} - E{g'(w^T x)} w.
3) Let w = w+ / ||w+|| (normalization improves stability).
4) If not converged, go back to step 2.

• The derivation is as follows:
– The optima of E{G(w^T x)} under the constraint E{(w^T x)^2} = ||w||^2 = 1 are obtained at points where

$$F(w) = E\{x\, g(w^T x)\} - \beta w = 0 \qquad (40)$$

– We solve this equation by Newton's method: w+ = w - JF(w)^{-1} F(w).
– The Jacobian matrix of F is

$$JF(w) = \frac{\partial F}{\partial w} = E\{x x^T g'(w^T x)\} - \beta I \qquad (41)$$


37

FastICA for one unit

• In order to simplify the inversion of this matrix, the first term of (41) is approximated. Since the data is sphered (whitened), a reasonable approximation is E{xx^T g'(w^T x)} ≈ E{xx^T} E{g'(w^T x)} = E{g'(w^T x)} I.

• Thus the Jacobian matrix becomes diagonal and can easily be inverted.

• Then the vector w can be updated by the approximative Newton iteration w+ = w - JF(w)^{-1} F(w), which gives (42) below.

• Multiplying both sides of (42) by β - E{g'(w^T x)} and simplifying algebraically gives the FastICA iteration of step 2 above.

$$w^+ = w - \frac{E\{x\, g(w^T x)\} - \beta w}{E\{g'(w^T x)\} - \beta} \qquad (42)$$
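Putting steps 1 to 4 together, here is a compact sketch of the one-unit iteration for whitened data, using g(u) = tanh(u) (the derivative of G1 with a1 = 1); the function name, convergence test, and defaults are illustrative assumptions, not the tutorial's implementation.

```python
import numpy as np

def fastica_one_unit(X, max_iter=200, tol=1e-6, seed=0):
    """One-unit FastICA sketch: X is whitened data of shape (n, T), g = tanh."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        wx = w @ X                                           # w^T x for every sample
        g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx)**2
        w_new = (X * g).mean(axis=1) - g_prime.mean() * w    # step 2
        w_new /= np.linalg.norm(w_new)                       # step 3
        if 1.0 - abs(w_new @ w) < tol:                       # converged (up to sign)
            return w_new
        w = w_new
    return w
```

Applying the resulting w to the whitened data, w^T x̃, yields one estimated independent component.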


38

FastICA for one unit (c)

• Discussion:
– Expectations must be replaced by their estimates, i.e., sample means.

– To compute the sample means, ideally all the available data should be used, but to limit computational cost only a part of the samples (a smaller sample size) may be used.

– If convergence is not satisfactory, one may then increase the sample size.


39

FastICA for several units

• To estimate several independent components, we need to run the FastICA algorithm using several units, with weight vectors w1, …, wn.

• To prevent different vectors from converging to the same maxima, we need to decorrelate the outputs w1^T x, …, wn^T x after every iteration.

• A simple way of decorrelating is to estimate the independent components one by one:
– when p independent components w1, …, wp have been estimated,
– run the one-unit fixed-point algorithm for w_{p+1}, and
– subtract from w_{p+1} the "projections" (w_{p+1}^T C w_j) w_j, j = 1, …, p, onto the previously estimated p vectors, and then renormalize w_{p+1}, as in (43) below:


40

FastICA for several units

$$\text{1. Let } w_{p+1} = w_{p+1} - \sum_{j=1}^{p} \left(w_{p+1}^T C\, w_j\right) w_j$$

$$\text{2. Let } w_{p+1} = \frac{w_{p+1}}{\sqrt{w_{p+1}^T C\, w_{p+1}}} \qquad (43)$$

The covariance matrix C=E{xxT} is equal to I, if the data is sphered.
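A small sketch of the deflation step (43); the function and argument names are illustrative, and for sphered data C can simply be taken as the identity matrix.

```python
import numpy as np

def deflate(w_new, W_prev, C):
    """Decorrelation of (43): subtract projections onto previously found vectors, renormalize."""
    for wj in W_prev:                          # step 1: remove (w^T C wj) wj for each found wj
        w_new = w_new - (w_new @ C @ wj) * wj
    return w_new / np.sqrt(w_new @ C @ w_new)  # step 2: renormalize so that w^T C w = 1
```

Running the one-unit iteration for w_{p+1} and applying this deflation after each update estimates the independent components one by one.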


41

Applications of ICA - Finding Hidden Factors in Financial Data

• Some financial data, such as currency exchange rates or daily returns of stocks, may have common underlying factors.

• ICA might reveal some driving mechanisms that otherwise remain hidden.

• In a recent study of a stock portfolio, it was found that ICA is a complementary tool to PCA, allowing the underlying structure of the data to be more readily observed.


42

Term project

• Use PCA, JADE, and FastICA to analyze Taiwan stock returns for underlying factors.

• JADE and FastICA packages can be found by searching on the www.

• Data are available at course web site.

• Due: