Multivariate Statistics Principal Component Analysis W. M. van der Veld University of Amsterdam


Page 1: Multivariate Statistics

Multivariate Statistics

Principal Component Analysis

W. M. van der Veld

University of Amsterdam

Page 2: Multivariate Statistics

Overview

• Eigenvectors and eigenvalues

• Principal Component Analysis (PCA)

• Visualization

• Example

• Practical issues

Page 3: Multivariate Statistics

Eigenvectors and eigenvalues

• Let A be a square matrix of order n x n.

• It can be shown that vectors k exist such that Ak = λk, with λ some scalar; k is called an eigenvector and λ the corresponding eigenvalue.

• The eigenvectors k and eigenvalues λ have many applications, but in this course we will only use them for principal component analysis.

• But they also play a role in cluster analysis, canonical correlations, and other methods.

Page 4: Multivariate Statistics

Eigenvectors and eigenvalues

• So, for the system of equations Ak=λk, only A is known.

• We have to solve for k and λ to find the eigenvector and eigenvalue.

• This set of equations cannot be solved directly with the method described last week, since m<n.

• The trivial solution k=0 is excluded.

• A solution can however be found under certain conditions.

• First an example to get a feeling for the equation.

Page 5: Multivariate Statistics

Eigenvectors and eigenvalues

• An example of Ak = λk. Let A = [4 1; 2 3], i.e. rows (4, 1) and (2, 3).

• One solution for k is k = (1, 1)': Ak = (4·1 + 1·1, 2·1 + 3·1)' = (5, 5)' = 5·(1, 1)' = 5k, so λ = 5.

• Another solution for k is k = (1, −2)': Ak = (4·1 + 1·(−2), 2·1 + 3·(−2))' = (2, −4)' = 2·(1, −2)' = 2k, so λ = 2.

Page 6: Multivariate Statistics

Eigenvectors and eigenvalues

• How did we find the eigenvectors?

• Before that we first have to find the eigenvalues!

• From Ak=λk it follows that Ak - λk = 0, which is:

• (A - λI)k = 0.

• Since k = 0 is excluded, there seems no solution.

• However, for homogeneous equations a non-trivial solution exists when rank(A − λI) < n, and this is only the case when |A − λI| = 0. This determinant can be written out as the characteristic equation:

|A − λI| = (−1)^n λ^n + a_(n−1) λ^(n−1) + … + a_1 λ + a_0 = 0, a polynomial of degree n in λ.

• This (|A − λI| = 0) is what I meant with certain conditions!

• We can now easily solve for λ.

Page 7: Multivariate Statistics

Eigenvectors and eigenvalues

• This determinant gives an equation in λ.

• |A − λI| = 0, with A − λI = [4−λ  1; 2  3−λ].

• So (4 − λ)(3 − λ) − 1·2 = 0, i.e. λ² − 7λ + 12 − 2 = λ² − 7λ + 10 = 0.

• This factors as (λ − 5)(λ − 2) = 0, so λ1 = 5 and λ2 = 2.

Page 8: Multivariate Statistics

Eigenvectors and eigenvalues

• It is now a matter of substituting λ into (A − λI)k = 0; start with λ1 = 5.

• Ak = λk  =>  Ak − λk = 0  =>  (A − λI)k = 0.

• A − 5I = [4−5  1; 2  3−5] = [−1  1; 2  −2].

• The equations −k1 + k2 = 0 and 2k1 − 2k2 = 0 both give k1 = k2, so one solution is k = (1, 1)'.

• Note that any multiplier of k would also satisfy the equation!

Page 9: Multivariate Statistics

Eigenvectors and eigenvalues

• The same for λ2 = 2.

• (A − λI)k = 0 with λ2 = 2:

• A − 2I = [4−2  1; 2  3−2] = [2  1; 2  1].

• Both equations read 2k1 + k2 = 0, so k2 = −2k1 and one solution is k = (1, −2)'.
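As a quick check of the hand computation above, the same eigenvalues and eigenvectors can be obtained numerically. The sketch below (Python/NumPy; not part of the original slides) verifies that A has eigenvalues 5 and 2 with eigenvectors proportional to (1, 1)' and (1, −2)'.

    import numpy as np

    A = np.array([[4.0, 1.0],
                  [2.0, 3.0]])

    # np.linalg.eig returns the eigenvalues and unit-length eigenvectors (as columns)
    eigenvalues, eigenvectors = np.linalg.eig(A)
    print(eigenvalues)    # 5 and 2 (possibly in a different order)
    print(eigenvectors)   # columns proportional to (1, 1)' and (1, -2)'

    # check Ak = lambda * k for every eigenpair
    for lam, k in zip(eigenvalues, eigenvectors.T):
        assert np.allclose(A @ k, lam * k)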

Page 10: Multivariate Statistics

Eigenvectors and eigenvalues

• This was the 2 x 2 case, but in general the matrix A is of order n x n.

• In that case we will find:
– n eigenvalues, and
– n corresponding eigenvectors.

• The eigenvectors can be collected as the columns k1, k2, …, kn of a matrix K.

• The eigenvalues can be collected in a diagonal matrix Λ, with λ1, λ2, …, λn on the diagonal.

• Hence the generalized form of Ak = λk is: AK = KΛ.
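The generalized relation can be checked numerically as well: with the eigenvectors as the columns of K and the eigenvalues on the diagonal of Λ, AK = KΛ holds. A minimal NumPy sketch (my own illustration, not part of the slides):

    import numpy as np

    A = np.array([[4.0, 1.0],
                  [2.0, 3.0]])
    eigenvalues, K = np.linalg.eig(A)   # eigenvectors as the columns of K
    Lambda = np.diag(eigenvalues)       # eigenvalues on the diagonal of Lambda

    # A K = K Lambda: every column of K is scaled by its own eigenvalue
    assert np.allclose(A @ K, K @ Lambda)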

Page 11: Multivariate Statistics

Principal Component Analysis

Page 12: Multivariate Statistics

Harold Hotelling (1895-1973)

• PCA was introduced by Harold Hotelling (1933).

• Harold Hotelling was appointed as a professor of economics at Columbia.

• But he was a statistician first, economist second.

• His work in mathematical statistics included his famous 1931 paper on the Student's t distribution for hypothesis testing, in which he laid out what has since been called "confidence intervals".

• In 1933 he wrote "Analysis of a Complex of Statistical Variables with Principal Components" in the Journal of Educational Psychology.

Page 13: Multivariate Statistics

Principal Component Analysis

• Principal components analysis (PCA) is a technique that can be used to simplify a dataset.

• More formally it is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on.

• PCA can be used for reducing dimensionality in a dataset while retaining those characteristics of the dataset that contribute most to its variance by eliminating the later principal components (by a more or less heuristic decision). These characteristics may be the "most important", but this is not necessarily the case, depending on the application.

Page 14: Multivariate Statistics

Principal Component Analysis

• The data reduction is accomplished via a linear transformation of the observed variables, such that:

yi = ai1x1 + ai2x2 + … + aipxp;  where i = 1..p

• Here the y's are the principal components, which are uncorrelated with each other.

Page 15: Multivariate Statistics

Digression

• The equation yi = ai1x1 + ai2x2 + … + aipxp: what does it say?

• Let’s assume for a certain respondent his answers to items x1, x2, …, xp are known, and we also know the coefficients aij.

• What does that imply for yi?

• The function is a prediction of y.

• What does the path model for this equation look like?

• What is so special about this equation?

• In multiple regression we have an observed y and observed x’s; this allows the estimation of the constants ai.

• PCA is a different thing. Although the equation is the same!

Page 16: Multivariate Statistics

Principal Component Analysis

• The equation: yi = ai1x1 + ai2x2 + … + aipxp

• In the PCA case:
– The y variables are not observed (unknown), and
– The constants a are unknown.

• So there are too many unknowns to solve the system.

• We can do PCA, so we must be able to solve it. But how?

• The idea is straightforward:
– Choose a's such that each principal component has maximum variance.
– Express the variance of y in terms of the observed (x) and unknown (a).
– Add a constraint to limit the number of solutions.
– Then set the derivative to zero, to find a maximum of the function.

Page 17: Multivariate Statistics

Principal Component Analysis

• The basic equation: yi = ai1x1 + ai2x2 + … + aipxp;

• Let x be a column vector with p random x variables.
– The x variables are, without loss of generality, expressed as deviations from the mean.

• So far we worked with the data matrix; now suddenly a vector with p random variables?

Page 18: Multivariate Statistics

Digression

• The vector x containing random variables is directly related to the data matrix X.

• The data matrix X is of order n × p, with element xij the observed value of variable j for respondent i; rows i = 1, …, n are the respondents and columns j = 1, …, p are the variables.

• The vector x = (x1, x2, …, xp)' collects the p random variables: the random variable x1 has its observed values in column 1 of X, …, and the random variable xp has its observed values in column p of X.

Page 19: Multivariate Statistics

Principal Component Analysis

• The basic equation: yi = ai1x1 + ai2x2 + … + aipxp.

• Let x be a column vector with p random x variables.
– The x variables are, without loss of generality, expressed as deviations from the mean.

• Let a be a p component column vector; then y = a'x.

• Because this function is unbounded, we can always find a vector a' for which the variance of the principal component is larger; hence a constraint is needed.

• Put a constraint on the unknowns so that a'a = 1.
– This (= 1) is an arbitrary choice,
– but it will turn out to make the algebra simpler.

• Now the number of solutions for y is constrained (bounded).

Page 20: Multivariate Statistics

Principal Component Analysis

• The variance of y is var(y) = var(a’x) = E((a’x)(a’x)’).

• Which is: E((a'x)(x'a)) = a'E(xx')a = a'Σa;
– because E(xx') = X'X/n = the variance-covariance matrix.

• Thus f: a’Σa.

• We have to find a maximum of this function.
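To make the identity var(y) = a'Σa concrete, it can be checked on simulated data. The sketch below (NumPy; the particular a and Σ are my own arbitrary choices, with a'a = 1, and are not from the slides) compares the sample variance of y = a'x with a'Σa.

    import numpy as np

    rng = np.random.default_rng(0)
    Sigma = np.array([[1.0, 0.7],
                      [0.7, 1.0]])
    a = np.array([0.6, 0.8])   # satisfies the constraint a'a = 1

    # draw mean-zero data with covariance Sigma, then form y = a'x per observation
    X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=100_000)
    y = X @ a

    print(np.var(y))       # sample variance of the component scores
    print(a @ Sigma @ a)   # a' Sigma a = 0.36 + 0.64 + 2*0.7*0.48 = 1.672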

Page 21: Multivariate Statistics

Principal Component Analysis

• So, we have to set the derivative of this function equal to zero.

• Don't forget the constraint (a'a = 1); it should be accounted for when finding the maximum.

• This can be achieved using a Lagrange multiplier, a mathematical shortcut.

• h: f − λ·g = a'Σa − λ(a'a − 1); where g: a'a − 1.

• ∂h/∂a' = 2Σa − 2λa.

• This is the derivative we need to find the maximum of the variance function.

• 2Σa − 2λa = 0 => divide both sides by 2.

• Σa − λa = 0 => take the factor a out.

• (Σ − λI)a = 0 => this should look familiar!

Page 22: Multivariate Statistics

Principal Component Analysis

• (Σ − λI)a = 0 can be solved via |Σ − λI| = 0, with a ≠ 0.
– Here λ is the eigenvalue belonging to the eigenvector a.

• Rewrite (Σ − λI)a = 0 so that: Σa = λIa = λa.

• If we premultiply both sides with a', then:

• a'Σa = a'λa = λa'a = λ;
– because a'a = 1.

• It follows that var(y) = λ;
– because var(y) = a'Σa.

• So the eigenvalues are the variances of the principal components.

• And the largest eigenvalue is the variance of the first principal component, etc.

• The elements of the eigenvector a, found by substituting the largest λ, are called the loadings of y.

Page 23: Multivariate Statistics

Principal Component Analysis

• The next principal component is found after taking away the variance of the first principal component, y1 = a1'x.

• To find y2 we require that it is uncorrelated with y1, in addition to the constraint a2'a2 = 1.

• Therefore:
– cor(y2, y1) = 0
– E(y2 y1') = 0
– E((a2'x)(a1'x)') = 0
– E(a2'xx'a1) = 0
– a2'E(xx')a1 = 0
– a2'Σa1 = 0
– a2'λ1a1 = λ1a2'a1 = 0  => because Σa1 = λ1a1, since (Σ − λ1I)a1 = 0

Page 24: Multivariate Statistics

Principal Component Analysis

• Since y2 = a2’x

• The variance of y2 is f2: a2’Σa2

• So, we have to set the derivative equal to zero.

• Don't forget to take the constraints into account when finding the maximum:
– a2'a2 = 1, and
– a2'Σa1 = λ1a2'a1 = 0.

• This can be achieved using Lagrange multipliers, a mathematical shortcut.

Page 25: Multivariate Statistics

Principal Component Analysis

• The result: ∂h2/∂a2’ = 2Σa2 – 2λ2a2 – 2ν2Σa1 = 0

• ν2 = 0 as a consequence of the constraints.

• Thus:
– 2Σa2 − 2λ2a2 = 0 => (Σ − λ2I)a2 = 0,
– which can be solved via |Σ − λ2I| = 0, with a2 ≠ 0.

• We solve this equation and take the largest remaining eigenvalue (λ2),

• then solve for the eigenvector (a2) that corresponds to this eigenvalue.

• Et cetera, for the other components.

Page 26: Multivariate Statistics

Principal Component Analysis

• Estimation is now rather straightforward.

• We use the ML estimate of Σ, which is Σ̂.

• Then we simply have to solve for λ̂ in |Σ̂ − λ̂I| = 0,

• where λ̂ is the ML estimate of λ.

• Then we simply have to solve for â in (Σ̂ − λ̂I)â = 0 by substituting the solution for λ̂,

• where â is the ML estimate of a.
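Putting the steps together, a PCA can be run by estimating Σ from the data and eigendecomposing it. The sketch below (NumPy; the function and variable names are my own, not from the slides) centers the data, forms the ML estimate Σ̂ = X'X/n, sorts the eigenpairs by eigenvalue, and returns the loadings and component scores.

    import numpy as np

    def pca(X):
        """Principal component analysis of an n x p data matrix X."""
        Xc = X - X.mean(axis=0)                      # variables as deviations from the mean
        Sigma_hat = (Xc.T @ Xc) / Xc.shape[0]        # ML estimate of the covariance matrix
        eigenvalues, A = np.linalg.eigh(Sigma_hat)   # eigh, because Sigma_hat is symmetric
        order = np.argsort(eigenvalues)[::-1]        # largest variance first
        eigenvalues, A = eigenvalues[order], A[:, order]
        scores = Xc @ A                              # y = a_i'x for every observation and component
        return eigenvalues, A, scores

    # eigenvalues[i] is the variance of the (i+1)-th principal component,
    # and column i of A holds the corresponding vector of loadings.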

Page 27: Multivariate Statistics

Visualization

Page 28: Multivariate Statistics

Visualization

• Let's consider R, which is the standardized Σ.

• And start simple with only two variables, v1 and v2.

• E(vv') = (1/n)·[Σv1v1  Σv1v2; Σv2v1  Σv2v2].

• Let's assume Σ = [1.0  0.7; 0.7  1.0].

• How can this situation be visualized?

• In a 2-dimensional space, one dimension for each variable.

• Each variable is a vector with length 1.

• The angle between the vectors represents the correlation.

Page 29: Multivariate Statistics

Visualization

• Intuitive proof that "the angle between the vectors represents the correlation"; cos(angle) = cor(v1, v2).

• If there is no angle, then they are actually the same (except for a constant).
– In that case cos(0) = 1, and cor(v1, v2) = 1.

• Now if they are uncorrelated, then the correlation is zero.
– In that case cor(v1, v2) = 0,
– and if cos(angle) = cor(v1, v2), then cos(angle) = 0,
– so angle = ½π, since cos(½π) = 0.

• So we can visualize the correlation matrix.

• The correlation between v1 and v2 is 0.7; thus angle ≈ ¼π, since cos(¼π) ≈ 0.707 ≈ 0.7.
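The ¼π is an approximation: cos(¼π) ≈ 0.707, which is close to the correlation of 0.7. A one-line check in NumPy (my own illustration):

    import numpy as np

    print(np.degrees(np.arccos(0.7)))   # about 45.6 degrees, close to 1/4 pi (45 degrees)
    print(np.cos(np.pi / 4))            # about 0.707, close to the correlation of 0.7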

Page 30: Multivariate Statistics

Visualization

[Figure: variables 1 and 2 drawn as vectors, with the first principal component between them.]

Page 31: Multivariate Statistics

Visualization

[Figure: V1 and V2 as vectors with the 1st PC; the projection of V1 on the principal component equals the constant a11 in the equation, and the projection of V2 equals a12.]

Page 32: Multivariate Statistics

Visualization

[Figure: the total projection on the 1st PC, which is the variance of the 1st PC, and thus λ1.]

Page 33: Multivariate Statistics

Visualization

[Figure: the 2nd PC added, orthogonal to the 1st; the projection of V1 on the 2nd PC (= 0) and the projection of V2 on the 2nd PC.]

Page 34: Multivariate Statistics

Visualization

[Figure: the total projection on the 1st PC (its variance, λ1) and the total projection on the 2nd PC (its variance, λ2).]

Page 35: Multivariate Statistics

Visualization

• Of course, PCA is concerned with finding the largest variance of the first component, etc.

• In this example, there is possibly a better alternative.

• So, what I presented above as solutions for the λ's and a were in fact non-optimal solutions.

• Let’s find an optimal solution.

Page 36: Multivariate Statistics

Visualization

[Figure: variables 1 and 2 drawn as vectors, with the first principal component now placed for the optimal solution.]

Page 37: Multivariate Statistics

Visualization

[Figure: the maximized projection of V1 on the principal component (a11 in the equation) and the maximized projection of V2 (a12 in the equation).]

Page 38: Multivariate Statistics

Visualization

[Figure: the total maximum projection on the 1st PC, which is its variance λ1, and the 'minimized' projections of V1 and V2 on the 2nd PC.]

Page 39: Multivariate Statistics

Visualization

[Figure: the total maximum projection on the 1st PC (its variance, λ1) and the total projection on the 2nd PC (its variance, λ2).]

Page 40: Multivariate Statistics

Example

Page 41: Multivariate Statistics

An example

• What does a PCA solution look like?

CD D M A CA

15.1 I feel excluded by others  1 2 3 4 5

15.2 I have the feeling that nobody really knows me well  1 2 3 4 5

15.3 I have the feeling that there is nobody I can turn to  1 2 3 4 5

15.4 I have the feeling that there are people who really understand me  1 2 3 4 5

15.5 I feel alone  1 2 3 4 5

15.6 I have the feeling that I do not really belong with anybody  1 2 3 4 5

15.7 I have the feeling that I am connected with people  1 2 3 4 5

15.8 I have the feeling that there are people I can turn to  1 2 3 4 5

• It might make sense to say that the weighted sum of these items is something that we could call loneliness.

Page 42: Multivariate Statistics

An example

• The loneliness items not only seem related at face value; the variables are also correlated.

V15_1 V15_2 V15_3 V15_4 V15_5 V15_6 V15_7 V15_8

V15_1 1.000

V15_2 .523 1.000

V15_3 .432 .554 1.000

V15_4 .217 .348 .346 1.000

V15_5 .606 .530 .434 .265 1.000

V15_6 .520 .567 .464 .258 .596 1.000

V15_7 .199 .282 .252 .374 .212 .304 1.000

V15_8 .186 .342 .341 .476 .237 .271 .566 1.000

Correlations between the loneliness items (n=679)
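Since the items are standardized here, a PCA can be run directly on this correlation matrix. The sketch below (NumPy; the matrix is typed in from the table above, and this is my own illustration rather than the SPSS run from the slides) extracts the eigenvalues; the largest one should come out close to the 3.746 reported for the first component a few slides later.

    import numpy as np

    # correlation matrix of the loneliness items, filled in symmetrically from the table above
    R = np.array([
        [1.000, 0.523, 0.432, 0.217, 0.606, 0.520, 0.199, 0.186],
        [0.523, 1.000, 0.554, 0.348, 0.530, 0.567, 0.282, 0.342],
        [0.432, 0.554, 1.000, 0.346, 0.434, 0.464, 0.252, 0.341],
        [0.217, 0.348, 0.346, 1.000, 0.265, 0.258, 0.374, 0.476],
        [0.606, 0.530, 0.434, 0.265, 1.000, 0.596, 0.212, 0.237],
        [0.520, 0.567, 0.464, 0.258, 0.596, 1.000, 0.304, 0.271],
        [0.199, 0.282, 0.252, 0.374, 0.212, 0.304, 1.000, 0.566],
        [0.186, 0.342, 0.341, 0.476, 0.237, 0.271, 0.566, 1.000],
    ])

    eigenvalues = np.linalg.eigvalsh(R)[::-1]   # sorted largest first
    print(eigenvalues)                          # first entry should be near 3.75
    print(eigenvalues / eigenvalues.sum())      # proportion of variance per component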

Page 43: Multivariate Statistics

An example

• What can we expect with PCA?
– There are 8 items, so there will be 8 PC's.
– On face value the items are related to one thing: loneliness.
– So there should be one PC (interpretable as loneliness) that accounts for most variance in the eight observed variables.

Page 44: Multivariate Statistics

An example

This is the variance explained by the principal components. Note that it is never 1, due to the fact that only the PC's with eigenvalue > 1 are used.

Page 45: Multivariate Statistics

An example

The complete PCA solution, with all 8 variables and 8 PC's.

The 'practical' solution, which has thrown away all PC's with an eigenvalue smaller than 1. This is an arbitrary choice, which is called the Kaiser criterion.

The first 2 PC's have an eigenvalue > 1; together they absorb 63.6% of all variance.

Page 46: Multivariate Statistics

An example

These are the constants ai from the matrix A'. They are also the projections of the variables on the principal component axes.

The square of a loading is the variance contributed to the component by the observed variable.

When you add the squared loadings, you obtain the eigenvalue of the component: Σ(squared loadings) = 3.746 for the first component, which is 3.746/8 ≈ 46.8% of the total variance; the second component adds about 17%, together giving the 63.6% reported on the previous slide.
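The rule "sum of squared loadings = eigenvalue" holds because the loadings are eigenvector elements rescaled by the square root of the eigenvalue. A minimal check on the 2 × 2 correlation matrix used in the visualization section (NumPy; my own illustration, not SPSS output):

    import numpy as np

    Sigma = np.array([[1.0, 0.7],
                      [0.7, 1.0]])
    eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
    eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]   # largest first

    # loadings: eigenvector elements rescaled by the square root of the eigenvalue
    loadings = eigenvectors * np.sqrt(eigenvalues)

    # adding the squared loadings within a column gives back that component's eigenvalue
    print(np.sum(loadings**2, axis=0))   # [1.7, 0.3]
    print(eigenvalues)                   # [1.7, 0.3]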

Page 47: Multivariate Statistics

An example

• What can we expect with PCA?
– There are 8 items, so there will be 8 PC's.
– On face value the items are related to one thing: loneliness.
– So there should be one PC (interpretable as loneliness) that accounts for most variance in the eight observed variables.

• We find 1 PC that absorbs almost 50% of the variance; that one might be called loneliness.

• However, we lose about 50% of the variance. The second PC hardly adds anything, let alone the other components.

• So the number of variables can be reduced from 8 to 1.

• However, at a huge loss.

Page 48: Multivariate Statistics

Practical issues

Page 49: Multivariate Statistics

Practical issues

• PCA is NOT factor analysis!!
– Neither exploratory nor confirmatory.
– In factor analysis it is assumed that the structure in the data is the result of an underlying factor structure. So, there is a theory.
– In PCA the original data are linearly transformed into a set of new uncorrelated variables with the maximum variance property. This is a mathematical optimization procedure that lacks a theory about the data.

• Many people think they use PCA. However, they use a rotated version of the PCA solution, for which the maximum variance property does not necessarily hold any more.

• The advantage is that such rotated solutions are often easier to interpret, because the PCA solution too often has no substantive meaning.

Page 50: Multivariate Statistics

Practical issues

• In PCA the PC’s are uncorrelated, beware of that when interpreting the PC’s.

• I have often seen the PC's interpreted as related constructs, e.g. loneliness and shyness; but I assume that such constructs are related, so a different interpretation should be found.

• Many times the solutions are rotated, to obtain results that are better interpretable.

Page 51: Multivariate Statistics

Practical issues

• Method of rotation:
– No rotation is the default in SPSS; unrotated solutions are hard to interpret because variables tend to load on multiple factors.
– Varimax rotation is an orthogonal rotation of the factor axes to maximize the variance of the squared loadings of a factor (column) on all the variables (rows) in a factor matrix. Each factor will tend to have either large or small loadings for any particular variable. A varimax solution yields results which make it as easy as possible to identify each variable with a single factor. This is the most common rotation option.
– Quartimax rotation is an orthogonal alternative which minimizes the number of factors needed to explain each variable.
– Direct oblimin rotation is the standard method when one wishes a non-orthogonal solution, that is, one in which the factors are allowed to be correlated. This will result in higher eigenvalues but diminished interpretability of the factors.
– Promax rotation is an alternative non-orthogonal rotation method which is computationally faster than the direct oblimin method.
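For reference, varimax rotation of a loading matrix can be written in a few lines. The sketch below is a commonly used SVD-based implementation of Kaiser's varimax criterion (my own sketch under the usual formulation; it is not the exact SPSS algorithm), applied to a p × k matrix with the components as columns.

    import numpy as np

    def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
        """Orthogonal (varimax) rotation of a p x k loading matrix."""
        p, k = loadings.shape
        R = np.eye(k)          # accumulated rotation matrix
        criterion = 0.0
        for _ in range(max_iter):
            L = loadings @ R
            # SVD step that increases the varimax criterion (Kaiser's procedure)
            u, s, vt = np.linalg.svd(
                loadings.T @ (L**3 - (gamma / p) * L @ np.diag(np.sum(L**2, axis=0)))
            )
            R = u @ vt
            new_criterion = s.sum()
            if new_criterion < criterion * (1 + tol):
                break          # converged: no further improvement
            criterion = new_criterion
        return loadings @ R    # rotated loadings

    # e.g. varimax(loadings[:, :2]) would rotate the first two retained components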

Page 52: Multivariate Statistics

Practical issues

We had this one solution (the unrotated loadings shown earlier).

This is the Varimax rotated solution. Notice that the loadings are now high for either component 1 or component 2. Because the loadings are used to interpret the PC's, this should make interpretation easier. It now seems that there are clearly two PC's, but they can be interpreted as the positively worded items (4, 7, 8) versus the negatively worded items (the others).

Page 53: Multivariate Statistics

An example

(This slide repeats the loneliness items already shown on Page 41.)

Page 54: Multivariate Statistics

Practical issues

• PCA is useful when you have constructs by definition.

• You can force there to be one component.

• You can calculate the PC scores, which are weighted sums of the observed variables, with the constants aij as weights.

• And these scores can then be used in further analysis.

• Use it with care, and think about and look at your data before doing any analysis.
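Component scores are just the weighted sums described above. A minimal sketch (NumPy; the names are my own) for turning a data matrix and a weight vector a into scores that can be used in further analysis:

    import numpy as np

    def component_scores(X, a):
        """Scores y = a'x for every row of the data matrix X, after centering the variables."""
        Xc = X - X.mean(axis=0)
        return Xc @ a

    # example: scores on the first principal component,
    # with A the loading matrix from a PCA (see the sketch at Page 26):
    #   eigenvalues, A, _ = pca(X)
    #   y1 = component_scores(X, A[:, 0])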