Efficient Independent Component Analysis on a GPU

Rui Ramalho, Pedro Tomás, Leonel Sousa

Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa, technology from seed

Frontiers of GPU Computing 2010, 29-06-10



Outline

• Motivation
• Independent Component Analysis
• FastICA Algorithm
• Experimental Results
• Conclusions


Blind Source Separation

• Blind Source Separation (BSS) is a signal processing technique that separates a set of signals (sources) from a set of mixed signals.

• Little is known about the original signals or the mixing process, only that the original signals are uncorrelated.

[Figure: signal plots over sample index 0 to 500, in three panels: Original signals, Mixed signals (after mixing), and Independent components (after separation).]


Blind Source Separation: Cocktail Party

• A classical example of blind source separation is the cocktail party problem.

– A number of people are talking simultaneously in a crowded room (at a cocktail party).

– Despite all the noise and cross talking, a human brain has little difficulty following a conversation.

– Machines have to rely on blind source separation.


Blind Source Separation: Applications

• Blind Source Separation has also been used in several other domains:

– EEG/MEG measurements (each sensor picks up a mixture of brain electrical activity, and BSS can be used to separate and identify the underlying sources).

– Denoising images (by treating the noise as an independent source it is possible to separate it from the image’s original components).

– Financial analysis (BSS can be used to uncover hidden factors in financial data).


Independent Component Analysis

• Independent Component Analysis (ICA) is a special case of Blind Source Separation.

• The sources of the mixed signals are assumed to be statistically independent (BSS only assumes that the sources are statistically uncorrelated).


Independent Component Analysis: ICA Model

• Under the ICA model, the observed variables are assumed to be a linear combination of several independent sources/signals.

• The objective of ICA is to find the matrix W that inverts the mixing operation performed by the matrix A, without knowledge of A or s.
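
The model equations did not survive extraction from the slide. In the usual ICA notation they can be written as below, where x (the observed mixtures) and y (the recovered components) are symbols assumed here; only A, W and s are named on the slide.

```latex
x = A s, \qquad y = W x \approx s, \qquad W \approx A^{-1}
```

Since neither A nor s is known, the sources can in general only be recovered up to a permutation and a scaling of the components.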


Independent Component Analysis: Measuring Statistical Independence

• One of the ways of measuring statistical independence is through negentropy:

– H(y) is the differential information entropy of y:

• In practice J(y) needs to be estimated. The estimator used by FastICA is:

– G is a nonquadratic nonlinear function

– ν is a Gaussian variable of zero mean and unit variance
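
The formulas on this slide were images and are missing from the text. The standard definitions from the FastICA literature, which are assumed (not confirmed) to match what the slide showed, are sketched below. Here y_gauss denotes a Gaussian variable with the same variance as y, and ν is the zero-mean, unit-variance Gaussian variable mentioned above.

```latex
J(y) = H(y_{\mathrm{gauss}}) - H(y)

H(y) = -\int p_y(\eta) \, \log p_y(\eta) \, d\eta

J(y) \propto \left[ E\{G(y)\} - E\{G(\nu)\} \right]^2
```

Because a Gaussian variable has the largest entropy among all variables of equal variance, J(y) is non-negative and is zero only when y is Gaussian.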


FastICA Algorithm

• The procedure for computing the independent components can be divided into three stages:

– Pre-processing: allows a number of simplifications in the FastICA algorithm.

– Weight vector computation: the FastICA algorithm itself.

– Decorrelation: prevents the weight vectors from converging to the same solution.


FastICA Algorithm: Preprocessing & Weight Vector Computation

• Preprocessing includes general tasks such as centering, whitening or filtering the data.

• The computation of each of the weight vectors is done by:

– g is the derivative of the nonquadratic function G used in the contrast function J

– This algorithm can be modified to compute all the ICs simultaneously (a symmetric approach).
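
The update rule itself is missing from the extracted slide. The one-unit fixed-point update commonly given for FastICA is w+ = E{x g(wᵀx)} − E{g'(wᵀx)} w, followed by normalizing w+ to unit length. A minimal sketch of one such step is shown below, assuming whitened data and g = tanh (the choice reported in the experimental setup). The function name and the list-of-lists data layout are illustrative assumptions, not the authors' code.

```python
import math


def fastica_one_unit_step(w, samples):
    """One fixed-point update of a single weight vector w.

    samples: list of whitened observation vectors x (lists of floats).
    Uses g(u) = tanh(u) and g'(u) = 1 - tanh(u)**2.
    """
    n = len(w)
    m = len(samples)
    e_xg = [0.0] * n  # accumulates the sum for E{x g(w^T x)}
    e_dg = 0.0        # accumulates the sum for E{g'(w^T x)}
    for x in samples:
        u = sum(wi * xi for wi, xi in zip(w, x))  # projection w^T x
        t = math.tanh(u)
        for i in range(n):
            e_xg[i] += x[i] * t
        e_dg += 1.0 - t * t
    # w+ = E{x g(w^T x)} - E{g'(w^T x)} w
    w_new = [e_xg[i] / m - (e_dg / m) * w[i] for i in range(n)]
    norm = math.sqrt(sum(v * v for v in w_new))
    return [v / norm for v in w_new]
```

In practice the step is repeated until the direction of w stops changing, for example until the absolute value of the dot product between consecutive estimates is close to 1.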


FastICA Algorithm: GPU Implementation

• The preprocessing stage is generally inexpensive and was implemented on the CPU.

• The FastICA algorithm is composed mostly of matrix operations that can be efficiently implemented using CUBLAS.

– The computation of the non-linear functions g and g’ has no data dependencies, so it parallelizes directly.

– The expected value is computed using hierarchical additions, storing the intermediate results in the GPU’s shared memory.
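
The hierarchical additions can be pictured as a tree reduction: in each step, pairs of partial sums are added, halving the number of active elements. The sketch below emulates that pattern sequentially in Python. The buffer stands in for shared memory, and the interleaved pairing is an assumption, since the slide does not say which reduction layout was used.

```python
def tree_reduce_mean(values):
    """Mean computed with pairwise (hierarchical) additions."""
    buf = list(values)  # stands in for a shared-memory buffer
    n = len(buf)
    if n == 0:
        raise ValueError("cannot reduce an empty sequence")
    stride = 1
    while stride < n:
        # On a GPU, each iteration of this inner loop would be handled by
        # its own thread, with a barrier before the stride is doubled.
        for i in range(0, n - stride, 2 * stride):
            buf[i] += buf[i + stride]
        stride *= 2
    return buf[0] / n
```

For n values the reduction needs about log2(n) steps instead of n − 1 sequential additions, which is what makes it a good fit for many threads working on one block of shared memory.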


FastICA Algorithm: Decorrelation

• To keep the estimated weight vectors from converging to the same result, they need to be decorrelated:

– After estimating p independent components, subtract the projections onto the previous p components from the (p+1)-th estimate:

– An alternative is to apply a symmetric decorrelation after every iteration:
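
Both formulas are missing from the extracted slide. In the FastICA literature the deflation step is usually written as w_{p+1} ← w_{p+1} − Σ_j (w_{p+1}ᵀ w_j) w_j followed by normalization, and the symmetric variant as W ← (W Wᵀ)^(−1/2) W, i.e. a multiplication by the inverse of the matrix square root. A minimal sketch of the deflation step follows, assuming the previously estimated vectors are already orthonormal. The function name is hypothetical.

```python
import math


def deflate(w_new, previous):
    """Remove from w_new its projections onto the previously estimated
    (orthonormal) weight vectors, then renormalize to unit length."""
    w = list(w_new)
    for wj in previous:
        proj = sum(a * b for a, b in zip(w_new, wj))  # w_new^T w_j
        w = [wi - proj * wji for wi, wji in zip(w, wj)]
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w]
```

With this scheme the components are estimated one after another, so estimation errors in early components propagate to later ones. The symmetric variant treats all weight vectors equally.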


Decorrelation: The Tricky Bit

• The computation of $(WW^T)^{1/2}$ is complex and can be done using the eigenvalues of $WW^T$.

– This can be done using the already available CPU-based high performance libraries (LAPACK).

– Alternatively, the eigenvalues can be computed directly on the GPU.


Jacobi Eigenvalue Algorithm

• The Jacobi Eigenvalue Algorithm successively uses Jacobi rotations to annihilate the off-diagonal elements of a given matrix A.

• A Jacobi rotation is given by:

– J is a Jacobi rotation matrix

– c = cos(θ)

– s = sin(θ)
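
The rotation formula on the slide was an image and is not in the text. One common way to write it is sketched below for a rotation in the (p, q) plane of a symmetric matrix A. The sign convention for s is an assumption and may differ from the one used on the slide.

```latex
A' = J^{T} A J, \qquad
J_{pp} = J_{qq} = c, \quad J_{pq} = s, \quad J_{qp} = -s, \quad
J_{ii} = 1 \ (i \neq p, q), \quad \text{all other entries } 0

\tan(2\theta) = \frac{2 a_{pq}}{a_{qq} - a_{pp}}
\;\Longrightarrow\; a'_{pq} = a'_{qp} = 0
```

With this choice of θ the rotation zeroes the selected off-diagonal pair, and repeating it over all pairs drives A towards a diagonal matrix whose entries are the eigenvalues.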


Jacobi Eigenvalue Algorithm

• Each Jacobi rotation only changes two columns and two rows of the matrix A. By carefully choosing the order of the rotations, up to N/2 rotations can be done simultaneously.

• The matrix J is a very sparse matrix, making CUBLAS unsuitable for this algorithm.
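
As an illustration of the two points above, the sketch below runs the Jacobi eigenvalue algorithm in plain Python. The rotations are grouped into rounds of disjoint (p, q) pairs using a round-robin schedule, so that the rotations of one round touch different rows and columns and could be executed simultaneously. Only the two affected rows and columns are updated, rather than forming the sparse matrix J. The scheduling scheme, function names and tolerances are assumptions made for this example, and the code runs sequentially on the CPU. It is not the authors' GPU kernel.

```python
import math


def parallel_rounds(n):
    """Round-robin schedule: every round holds disjoint (p, q) pairs."""
    m = n + (n % 2)  # pad odd n with a dummy index m - 1
    idx = list(range(m))
    rounds = []
    for _ in range(m - 1):
        pairs = [tuple(sorted((idx[i], idx[m - 1 - i]))) for i in range(m // 2)]
        rounds.append([(p, q) for p, q in pairs if q < n])  # drop dummy pairs
        idx = [idx[0], idx[-1]] + idx[1:-1]  # rotate all but the first index
    return rounds


def jacobi_eigen(a, tol=1e-12, max_sweeps=50):
    """Eigenvalues and eigenvectors (columns of v) of a symmetric matrix."""
    n = len(a)
    a = [[float(x) for x in row] for row in a]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    rounds = parallel_rounds(n)
    for _ in range(max_sweeps):
        off = math.sqrt(sum(a[i][j] ** 2
                            for i in range(n) for j in range(n) if i != j))
        if off < tol:
            break
        for pairs in rounds:
            for p, q in pairs:  # independent within one round
                if a[p][q] == 0.0:
                    continue
                d = a[q][q] - a[p][p]
                if d == 0.0:
                    theta = math.copysign(math.pi / 4, a[p][q])
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / d)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # A <- A J (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # A <- J^T A (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
                for k in range(n):  # V <- V J accumulates eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p] = c * vkp - s * vkq
                    v[k][q] = s * vkp + c * vkq
    return [a[i][i] for i in range(n)], v
```

For a matrix of even size N, `parallel_rounds(N)` yields N − 1 rounds of N/2 pairs each, which matches the N/2 simultaneous rotations mentioned above.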


Decorrelation: Iterative Algorithm

• Another alternative to the eigenvalue problem is to avoid its computation altogether. Algorithm 4 converges to the decorrelation expression presented earlier.
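
Algorithm 4 is not reproduced in the extracted text. An iteration commonly cited in the FastICA literature for this purpose first scales W by the square root of a matrix norm of W Wᵀ and then repeats W ← 3/2 W − 1/2 W Wᵀ W until W Wᵀ is close to the identity. The sketch below follows that scheme and is assumed, not confirmed, to correspond to the algorithm on the slide. The helper names, the choice of the maximum absolute row sum as the norm, and the stopping tolerance are illustrative.

```python
def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def symmetric_decorrelation(w, tol=1e-12, max_iter=100):
    """Iteratively orthonormalize the rows of a full-rank matrix w
    without computing any eigenvalues."""
    n = len(w)
    wwt = matmul(w, transpose(w))
    # Initial scaling by the max absolute row sum of W W^T, so that the
    # iteration starts inside its region of convergence.
    scale = max(sum(abs(x) for x in row) for row in wwt) ** 0.5
    w = [[x / scale for x in row] for row in w]
    for _ in range(max_iter):
        wwt = matmul(w, transpose(w))
        err = max(abs(wwt[i][j] - (1.0 if i == j else 0.0))
                  for i in range(n) for j in range(n))
        if err < tol:
            break
        wwtw = matmul(wwt, w)
        # W <- 3/2 W - 1/2 W W^T W
        w = [[1.5 * w[i][j] - 0.5 * wwtw[i][j] for j in range(len(w[0]))]
             for i in range(n)]
    return w
```

The loop consists only of matrix products and additions, which is the kind of work the earlier slides describe as well suited to CUBLAS.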


Decorrelation: Comparison of Decorrelation Algorithms

• Experimental results show that the proposed GPU-based Jacobi eigenvalue algorithm is outperformed by a CPU-based LAPACK eigenvalue algorithm using multiple relatively robust representations (MRRR).

• However, avoiding the explicit computation of the eigenvalues is still the fastest approach.


Experimental Results: Experimental Setup

• Experimental Setup

– A hyperbolic tangent was chosen as a typical non-linear function g.

– The iterative decorrelation algorithm that avoids the explicit computation of the eigenvalues is used in the decorrelation step.

                   CPU                GPU
Model              AMD Opteron 170    NVIDIA GeForce 8800 GTX
Number of cores    2                  128
Clock frequency    2 GHz              1.35 GHz
Main memory        2 GB               768 MB


Experimental Results: Single Core CPU vs GPU

• The accelerated portion of the algorithm (the loop) is sped up by up to 110x when estimating 256 ICs with 10 000 samples.

• As the accelerated portion gets faster, the influence of the unaccelerated part of the algorithm (the preprocessing stage) grows. This noticeably reduces the global speedup.
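
This effect is the one described by Amdahl's law. The small sketch below shows how the global speedup depends on the fraction of the original run time spent in the accelerated loop. The fractions used in the usage note are hypothetical values chosen for illustration, not measurements from the talk.

```python
def overall_speedup(accelerated_fraction, loop_speedup):
    """Amdahl's law: global speedup when only a fraction of the original
    run time is accelerated by the factor loop_speedup."""
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / loop_speedup)
```

For example, if the loop accounted for 99% of the original run time (an assumed figure), a 110x loop speedup would give `overall_speedup(0.99, 110.0)`, roughly 53x overall, because the remaining 1% of unaccelerated work then takes about as long as the accelerated loop.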


Experimental Results: Single Core CPU vs GPU

• The accelerated loop component ceases to be the bottleneck.

• The additional penalty of transferring data to and from the GPU is negligible.


Experimental Results: Multicore CPU vs GPU

• The parallelized GPU algorithm was also tested on a more powerful GeForce GTX 285, with 240 cores. This implementation was compared with a CPU-based implementation on an Intel Core 2 Quad Q9950 (@2.83GHz) using Intel’s high performance MKL library.

• It was possible to attain a speedup of around 12x.


Conclusions

• By using a GPU it was possible to speed up the FastICA algorithm by 55x when estimating 256 ICs with 1000 samples each, in comparison with a serial version running on a single core of a CPU.

• These results can be further improved as the current bottleneck lies in the preprocessing stage, which is still done on the CPU.
