
environment where all the data is gathered prior to optimisation of the filter coefficients.

These methods are then used to derive the two recursive algorithms implemented.

    Minimum Mean Square Error (MMSE)

The idea of the LMS algorithm is to minimise the Mean-Square Error (MSE) resulting from the estimation of the desired response. The cost function, J, is therefore [5]:

$J = E\left[e^2(n)\right]$  (1)

Where E is the expectation. It can be shown that the optimal solution for the tap coefficients in the MMSE sense is [5]:

$H = R^{-1}P$  (2)

Where R and P are respectively the auto- and cross-correlation matrices. A recursive method for optimally solving Equation 2 is presented in the form of the LMS algorithm.
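Expanding the cost function with $e(n) = d(n) - H^T X_n$ shows why (a standard step, following [5]): $J = E\left[d^2(n)\right] - 2H^T P + H^T R H$, which is quadratic in $H$; setting the gradient $\nabla J = -2P + 2RH$ to zero gives $RH = P$ and hence Equation 2.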

The LMS algorithm is based on the method of steepest descent. The basic aim is to minimise the MSE by adjusting the tap weights based on the gradient of the error surface, an example of which is shown in Figure 2. (This also shows why FIR filters are always used, so that there is only one well-defined minimum.)

[Figure 2 - Error Surface of an FIR and an IIR Adaptive Filter: error plotted against tap weight; the FIR surface has a single well-defined minimum, whereas the IIR surface exhibits local minima.]

The tap weights are therefore updated recursively using the following equation:

$H_{n+1} = H_n - \mu \nabla$  (3)

(Where $\mu$ is a constant that controls the rate and stability of convergence.)

The steepest descent algorithm requires prior knowledge of both the auto- and cross-correlation matrices, R and P, which is not available in a real-time adapting situation. The LMS overcomes this by using instantaneous estimates based on the currently available data. This difference is highlighted in Equation 4, which shows the different gradient calculations.

$\nabla_{\text{Steepest Descent}} = -2P + 2RH$
$\nabla_{\text{LMS}} = -2d(n)X_n + 2X_n X_n^T H = -2e(n)X_n$  (4)

    Least Squares Error (LSE)

The idea of Least Squares (LS) filtering is to minimise the sum of all the squared estimation errors, with the following cost function [5]:

$J = \sum_{i=M}^{N} e^2(i)$  (5)

It can be shown that the optimal solution for the tap coefficients in the LSE sense is [5]:

$H = R^{-1}P$  (6)

Although we are effectively trying to solve the same equation as for the MMSE case, we are using two different methods. In the LSE method the definitions are based on time summations, whereas for the MMSE case they were based on expectations.

The Recursive Least Squares (RLS) algorithm computes updated estimates of the tap weight vector based on the least-squares error and is a special case of a Kalman filter. The RLS algorithm exploits the matrix inversion lemma to overcome the need to compute the lengthy matrix inversions required to update the tap weight vector.

First consider the update equation for the auto-correlation matrix:

$R_n = R_{n-1} + X_n X_n^T$  (7)

A recursion for the inverse of the auto-correlation matrix can now be derived by using Equation 7 and the matrix inversion lemma, as given below:

$A^{-1} = B - BC\left(D + C^H B C\right)^{-1} C^H B$  (8)

Where: $A = R_n$, $B = R_{n-1}^{-1}$, $C = X_n$ and $D = 1$.
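Substituting these terms into Equation 8 gives the recursion for the inverse auto-correlation matrix that appears in Table 1c: $R_n^{-1} = R_{n-1}^{-1} - \left(R_{n-1}^{-1} X_n X_n^T R_{n-1}^{-1}\right) / \left(1 + X_n^T R_{n-1}^{-1} X_n\right)$.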

Using the recursive form for the auto-correlation matrix, a similar recursion for the cross-correlation matrix and Equation 6, the update equation for the tap weights can be shown to be [5]:

$H_n = H_{n-1} + R_n^{-1} X_n e(n)$  (9)

    Summary of Recursive Adaptive Algorithms

Table 1 summarises the LMS and RLS in terms of the equations and initialisation conditions required. It also shows the equations required for a complex version of the LMS.


Update equation:  $H_{n+1} = H_n + \mu\, e[n]\, X_n$
Error:            $e[n] = d[n] - y[n]$
Output:           $y[n] = H_n^T X_n$
Initialisation:   $H = 0$

Table 1a Summary of the LMS Algorithm

Update equations: $H_I(n+1) = H_I(n) + \mu\left(e_I X_I + e_Q X_Q\right)$
                  $H_Q(n+1) = H_Q(n) + \mu\left(e_Q X_I - e_I X_Q\right)$
                  where $H = H_I + jH_Q$
Error:            $e[n] = e_I[n] + j\,e_Q[n] = \left(d_I[n] - y_I[n]\right) + j\left(d_Q[n] - y_Q[n]\right)$
Output:           $y[n] = H^H X$, where $X = X_I + jX_Q$
Initialisation:   $H = 0$

Table 1b Summary of the Complex LMS Algorithm
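As an illustration of the Table 1b update equations (not code from the paper), a minimal C sketch of the complex LMS coefficient update is given below; the names clms_update, N and MU and the step-size value are illustrative:

    /* Complex LMS tap update (Table 1b), with the filter state split
       into in-phase (I) and quadrature (Q) arrays. */
    #define N  8        /* eight taps, as in the implemented filters */
    #define MU 0.01f    /* illustrative step size */

    void clms_update(float hI[N], float hQ[N],
                     const float xI[N], const float xQ[N],
                     float eI, float eQ)
    {
        int k;
        for (k = 0; k < N; k++) {
            hI[k] += MU * (eI * xI[k] + eQ * xQ[k]);  /* real part      */
            hQ[k] += MU * (eQ * xI[k] - eI * xQ[k]);  /* imaginary part */
        }
    }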

Coefficient update equation:    $H(n) = H(n-1) + R_{XX}^{-1}(n)\, X(n)\, e(n)$
Inverse matrix update equation: $R_{XX}^{-1}(n) = R_{XX}^{-1}(n-1) - \dfrac{R_{XX}^{-1}(n-1)\, X(n)\, X^T(n)\, R_{XX}^{-1}(n-1)}{1 + X^T(n)\, R_{XX}^{-1}(n-1)\, X(n)}$
Error:            $e[n] = d[n] - y[n]$
Output:           $y[n] = H^T X_n$
Initialisation:   $H = 0$, $R_{XX}^{-1}(0) = \delta^{-1} I$ (I is the identity matrix, $\delta$ is a small constant)

Table 1c Summary of the RLS Algorithm

    3 SOFTWARE OPTIMISATION TECHNIQUES

The software tools supplied with the TMS320C62x allow code to be written in three different formats: C, linear assembly and assembly. Texas Instruments quote that the optimising C compiler is about 80% efficient compared to hand-optimised assembly [1] and that linear assembly is 95-100% efficient.

However, writing code directly in assembly is often difficult and time consuming. Hence the following optimisation procedure is recommended [1][2][3][4][11] and was followed in the development in this paper.

[Figure 3 (flow chart): Recommended optimisation procedure. Optimise the algorithm; program in 'C' and compile without any optimisation; make the necessary correction(s) until the code functions, then profile the code. If the result is not satisfactory, compile with the -On option (starting from n = 0), correct and profile again; if the result is still not satisfactory, use intrinsics and profile once more.]


If the C code is written with the architecture of the processor in mind, then improvements can be made. Intrinsics can also be used to improve C code.

To summarise, the following overall procedure for developing a highly optimised algorithm was adopted:

[1] MATLAB
[2] C for PC
[3] C using fixed-point arithmetic (including intrinsics)
[4] Linear Assembly
[5] Assembly (Un-pipelined)
[6] Assembly (Pipelined)
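As an illustrative sketch of step [3] (not code taken from the paper), a Q15 fixed-point multiply-accumulate loop might use the C6000 compiler intrinsics _smpy (saturating 16-bit multiply returning a Q31 result) and _sadd (saturating 32-bit add); the function and variable names are assumptions, and the code assumes Q15 data and the TI C6000 tools:

    /* Q15 FIR inner loop using C6000 intrinsics; the compiler maps
       _smpy and _sadd onto single SMPY and SADD instructions. */
    #define N 8

    int fir_q15(const short *x, const short *h)
    {
        int acc = 0;
        int k;
        for (k = 0; k < N; k++)
            acc = _sadd(acc, _smpy(x[k], h[k]));  /* saturating Q31 MAC */
        return acc >> 16;                         /* rescale to Q15     */
    }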

    4 IMPLEMENTATION AND OPTIMISATION

To allow implementation of the algorithms developed in C, the matrix-based equations were converted into normal equations.

$y(n) = \sum_{k=0}^{N-1} h_n(k)\, x(n-k)$  (10)

    LMS Equations:

$h_{n+1}(k) = h_n(k) + \mu\, e(n)\, x(n-k), \quad k = 0, 1, 2, \ldots, N-1$  (11)

    RLS Equations:

$R^{-1}_{l,m}(n) = R^{-1}_{l,m}(n-1) - \dfrac{\left[\sum_{i=1}^{N} R^{-1}_{l,i}(n-1)\, x_i(n)\right]\left[\sum_{j=1}^{N} x_j(n)\, R^{-1}_{j,m}(n-1)\right]}{1 + \sum_{i=1}^{N}\sum_{j=1}^{N} x_i(n)\, R^{-1}_{i,j}(n-1)\, x_j(n)}$  (12)

$H_k(n) = H_k(n-1) + e(n) \sum_{i=1}^{N} R^{-1}_{k,i}(n)\, x_i(n)$  (13)
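To make the structure of Equations 12 and 13 concrete, a floating-point C sketch of one RLS update is given below. This is an illustration under stated assumptions (an N-tap filter, Rinv initialised to $\delta^{-1}I$ as in Table 1c), not the paper's fixed-point implementation, and all names are illustrative:

    /* One RLS update: Equation 12 (inverse matrix), then Equation 13
       (tap weights). Rinv holds the inverse auto-correlation matrix. */
    #define N 8

    static float Rinv[N][N];    /* initialise to (1/delta) * identity */

    void rls_update(float h[N], const float x[N], float e)
    {
        float u[N], v[N], denom = 1.0f;
        int i, j;

        /* The two bracketed sums in the numerator of Equation 12:
           u = Rinv(n-1) x(n) and v = x(n)^T Rinv(n-1). */
        for (i = 0; i < N; i++) {
            u[i] = v[i] = 0.0f;
            for (j = 0; j < N; j++) {
                u[i] += Rinv[i][j] * x[j];
                v[i] += x[j] * Rinv[j][i];
            }
        }

        /* Denominator of Equation 12: 1 + x^T Rinv(n-1) x. */
        for (i = 0; i < N; i++)
            denom += x[i] * u[i];

        /* Equation 12: rank-one correction of the inverse matrix. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                Rinv[i][j] -= (u[i] * v[j]) / denom;

        /* Equation 13: tap-weight update using the updated inverse. */
        for (i = 0; i < N; i++) {
            float g = 0.0f;
            for (j = 0; j < N; j++)
                g += Rinv[i][j] * x[j];
            h[i] += e * g;
        }
    }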

The flow charts shown in Figure 4 were then used to outline the required logical flow of both algorithms [8].

The two algorithms were implemented within an interrupt function so that whenever a new value was sampled by the CODEC the function was called, allowing the tap weights to be updated and the filtering to be performed. All the filters implemented comprised eight taps.

[Figure 4 - LMS and RLS Algorithm Flow-Charts. LMS: initialise the elements of H and X to zero; get the new value of X and d(n); filter X; calculate the error; update the tap coefficients; shift the elements of X in time, discarding the oldest element. RLS: initialise the elements of H and X; get the new value of X and d(n); filter X; calculate the error; update the inverse of the auto-correlation matrix; update the tap coefficients; shift the elements of X in time, discarding the oldest element.]
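As an illustration of the LMS flow chart (not the paper's optimised code), a minimal floating-point C version of the per-sample routine, implementing Equations 10 and 11, could be sketched as follows; lms_iteration, N, MU and the step-size value are illustrative:

    #define N  8        /* eight taps, as implemented */
    #define MU 0.01f    /* illustrative step size */

    static float h[N];  /* tap weights, initialised to zero */
    static float x[N];  /* delay line; x[0] holds the newest sample */

    /* One LMS iteration, following the Figure 4 flow chart. */
    float lms_iteration(float input, float desired)
    {
        int k;
        float y = 0.0f, e;

        x[0] = input;                   /* get new value of X          */

        for (k = 0; k < N; k++)         /* filter X (Equation 10)      */
            y += h[k] * x[k];

        e = desired - y;                /* calculate the error         */

        for (k = 0; k < N; k++)         /* update taps (Equation 11)   */
            h[k] += MU * e * x[k];

        for (k = N - 1; k > 0; k--)     /* shift X in time, discarding */
            x[k] = x[k - 1];            /* the oldest element          */

        return y;
    }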

The procedures for optimising the code highlighted earlier were followed, and benchmarking was performed at each stage. The results of this benchmarking are summarised in Table 2, followed by an evaluation of the efficiency of the C compiler in Table 3.

Optimisation Level             LMS    Complex LMS   RLS
C                              1020   -             44035
C + o3                         196    -             7947
C + intrinsics                 979    2261          38417
C + intrinsics + o3            197    495           10149
Linear Assembly                302    -             24732
Linear Assembly + Pipelining   76     -             12626
Assembly                       42     -             -
Fully Optimised Assembly       16     -             -

Table 2 Benchmarking (cycles) for one Iteration of each Eight-Tap Algorithm

Comparison                     LMS    RLS
Linear Assembly + Pipelining   39%    159%
Assembly                       21%    -
Fully Optimised Assembly       8%     -

Table 3 Evaluation of the Efficiency of the Optimising C Compiler


5 TESTING OF THE IMPLEMENTED ALGORITHMS

A simple set-up where the desired signal was also the input to the filter was used. The implemented algorithms were tested with a variety of input and initialisation conditions. Figures 5(a) and 5(b) show the LMS and RLS algorithms adapting, and Figure 5(c) shows the output from the complex LMS adaptive filter in the form of an adapting I-Q constellation diagram. Note that the testing environments were effectively noise-free.

[Figure 5(a) Estimation Error (% of desired signal, 0-100) as a Function of Iteration Number (0-2000) for the LMS Algorithm]

[Figure 5(b) Estimation Error (% of desired signal, 0-120) as a Function of Iteration Number (0-500) for the RLS Algorithm]

[Figure 5(c) Adapting constellation diagram for the Complex LMS Adaptive Filter: quadrature-phase component against in-phase component, both as % of desired signal (-100 to +100)]

6 COMPARISON OF IMPLEMENTED LMS AND RLS ALGORITHMS

Besides the usual comparison made in various adaptive signal processing books [5][6][11][15], the results found allow the LMS and RLS algorithms to be compared in terms of speed, total time to converge and code size (based on a clock speed of 200 MHz).

Algorithm and Implementation   Max Sample Rate   Total Time to Converge   Code Size (bytes)   Data Mem. (bytes)
LMS (C)                        1.02 MHz          980 µs                   1120                36
LMS (Linear Assembly)          2.63 MHz          380 µs                   192                 70
LMS (Optimised Assembly)       12.5 MHz          80 µs                    288                 0*
RLS (C)                        25.2 kHz          1192 µs                  2720                164
RLS (Linear Assembly)          15.8 kHz          1894 µs                  864                 324

*Only registers were used and these were not saved or restored between interrupts as the processor was idle.

Table 4 Quantitative Comparison of the LMS and RLS Algorithms

The total time to converge is effectively the amount of processing time required to achieve a pre-defined error level. The values given in Table 4 are based on the algorithms working at the maximum possible sampling rate, and on the LMS taking 1000 iterations to converge and the RLS 30, both of which are typical values in moderate noise conditions [10].
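For example, the fully optimised assembly LMS takes 16 cycles per iteration (Table 2), so 1000 iterations at 200 MHz give $1000 \times 16 / (200 \times 10^6) = 80\ \mu s$; similarly, the RLS in C (with -o3) gives $30 \times 7947 / (200 \times 10^6) \approx 1192\ \mu s$, the figures quoted in Table 4.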

In the linear assembly implementations the required data memory was larger, as integers were used instead of shorts for X and H (and R in the RLS) to reduce the number of shift instructions required in fixed-point arithmetic.
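A small sketch of the saving referred to (illustrative code, not from the paper): with Q15 data held in shorts, each multiply-accumulate needs a renormalising shift, whereas with the data pre-scaled into 32-bit integers the per-MAC shift disappears at the cost of doubled storage:

    /* Q15 data in shorts: one shift per multiply-accumulate. */
    int mac_short(const short *x, const short *h, int n)
    {
        int acc = 0, k;
        for (k = 0; k < n; k++)
            acc += ((int)x[k] * h[k]) >> 15;  /* shift every MAC */
        return acc;
    }

    /* Pre-scaled 32-bit data: no per-MAC shift instruction. */
    int mac_int(const int *x, const int *h, int n)
    {
        int acc = 0, k;
        for (k = 0; k < n; k++)
            acc += x[k] * h[k];
        return acc;
    }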

A comparison of the level of optimisation achieved showed that the LMS was highly optimised by following the procedures outlined in Section 3. However, the procedures did not increase the optimisation of the RLS in terms of cycles required to complete one iteration.

To achieve higher optimisation for the RLS, the linear assembly code would have to be rewritten with more intricate pre-planning.

The investigations also indicate that, for the case of the LMS, the optimising C compiler was only 8% efficient compared to hand-optimised assembly.

    7 CONCLUSION

The results presented in this paper give an insight into some of the implementation factors that need to be considered when using the TMS320C62x.


Although the actual processing time to converge for both of the algorithms implemented in C is very similar, in general the RLS algorithm is preferred as it is more reliable under changes in the noise environment, although it can suffer blow-up due to several factors. The RLS algorithm also requires fewer iterations to converge, hence allowing shorter training sequences than those required for the LMS.

The LMS does have the advantage that it will always be able to complete one iteration faster than the RLS, as shown. Hence in high bit-rate systems it may be the only algorithm that can be used. It has also proved to be much simpler to implement, requiring fewer CPU resources and less program and data memory than the RLS algorithm.

However, the LMS algorithm presented has been implemented in a highly optimised form, whereas there is still scope for optimising the RLS further; it is therefore possible that the implemented RLS could outperform the LMS in terms of time to converge.

    REFERENCES

[1] Dahnoun, N., DSP Implementation using the TMS320C62x Processors, Addison Wesley, to be published late 1999.

[2] Dahnoun, N., Tong, H.K., Software Optimisation for Modems using TMS320C62xx Digital Signal Processor (DSP), 9th International Conference on Signal Processing Applications and Technology (ICSPAT), Toronto, 1998.

[3] Dahnoun, N., Tong, H.K., Implementation of JPEG Image Coding Standard on the TMS320C62xx, Texas Instruments Second European DSP Education and Research Conference, ESIEE, Paris, September 1998.

[4] Dahnoun, N., Tong, H.K., Viterbi Decoder Design and Implementation using TMS320C62xx Digital Signal Processor (DSP), Texas Instruments Second European DSP Education and Research Conference, ESIEE, Paris, September 1998.

[5] Haykin, S., Adaptive Filter Theory, Prentice-Hall, 1991.

    [6] Haykin, S., Adaptive Signal Processing, Prentice-Hall, 1986.

[7] Haykin, S., Communication Systems, Wiley, 3rd Edition, 1994.

[8] Ifeachor, E., Jervis, B., Digital Signal Processing: A Practical Approach, Addison Wesley, 1998.

[9] McClellan, J.H., Schafer, R.W., Yoder, M.A., DSP First: A Multimedia Approach, Prentice-Hall, 1998.

[10] Mulgrew, B., Grant, G., Thompson, J., Digital Signal Processing: Concepts and Applications, Macmillan Press Ltd., 1999.

[11] Texas Instruments, TMS320C62x/C67x Programmer's Guide, Texas Instruments (SPRU198B), 1998.

[12] Texas Instruments, TMS320C62x/C67x CPU and Instruction Set, Texas Instruments (SPRU189C), 1998.

[13] Texas Instruments, TMS320C6x Evaluation Module, Texas Instruments (SPRU269), 1998.

[14] Texas Instruments, TMS320C6x Optimizing C Compiler, Texas Instruments (SPRU187C), 1998.

[15] Widrow, B., Adaptive Signal Processing, Prentice-Hall, 1985.

[16] Widrow, B., McCool, J., Ball, M., The Complex LMS Algorithm, Proceedings of the IEEE, pp. 719-720, 1975.

    ACKNOWLEDGEMENTS

The authors would like to thank Robert Owen, Hans Peter Blaettel, Neville Bulsara, Greg Peake and Maria Ho of Texas Instruments for their continuous help and support.

M. Hart would also like to thank everyone concerned at Fujitsu Telecom R&D Centre, especially Sunil Vadgama, for sponsoring his MSc at the University of Bristol.