Akesson Rantzow Kandidatarbete 20100531

8/12/2019 Akesson Rantzow Kandidatarbete 20100531

1/42

Bachelor thesis inComputer ScienceMay 2010

______________________________________Performance evaluation of multithreading in a

Diameter Credit Control Application

______________________________________

Pontus RantzowGustav kesson

School of ComputingBle inge !nstitute of "echnologyS#eden


2/42

Performance evaluation of multithreading in a Diameter Credit Control Application

"his thesis is su$mitted to the School of Computing at Ble inge !nstitute of "echnology in partialfulfillment of the re%uirements for the Bachelor degree in Computer Science& "he thesis is e%uivalen2 ' 10 #ee s of full time studies&

Contact Information:Author(s)*Pontus +ant,o#-ustav . esson/ mail* -ustav . esson gustava esson hotmail&com3 4 Pontus +ant,o# p&rant,o# gmail&com3

University Advisor:Professor 56 an -rahnSchool of ComputingPhone* 789 8:: ;< :< 08

School of ComputingBle inge !nstitute of "echnologyS/ ;=1 => ?arls rona!nternet *http*@@###&$th&se@comPhone * 789 8:: ;< :0 00

a'* 789 8:: ;< :0 :=

2
http://www.bth.se/comhttp://www.bth.se/com


3/42


AcknowledgementsWe would like to thank Professor Hkan Grahn for contributing with invaluable knowledge and at atimes a positive attitude

We would also like to thank !apgemini "arlskrona for taking the time to give continuous feedback onthe #iameter !redit !ontrol Application protot$pe %hat made this thesis become even better

-ustav . esson Pontus +ant,o#

;


4/42


ABS"+AC"Moore s la# states that the amount of computational po#er availa$le at a given cost dou$les every 1months and indeed4 for the past 20 years there has $een a tremendous development in microprocessors

5o#ever4 for the last fe# years4 Moore s la# has $een su$ ect for de$ate4 since to manage heat issu processor manufacturers have $egun favoring multicore processors4 #hich means parallel computatiohas $ecome necessary to fully utili,e the hard#are& "his also means that soft#are has to $e #ritten #ithmultiprocessing in mind to ta e full advantage of the hard#are4 and #riting parallel soft#areintroduces a #hole ne# set of pro$lems&

or the last couple of years4 the demands on telecommunication systems have increased and to manathe increasing demands4 multiprocessor servers have $ecome a necessity& Applications must fuutili,e the hard#are and such an application is the Diameter Credit Control Application (DCCA)& "heDCCA uses the Diameter net#or ing protocol and the DCCA s purpose is to provide a frame#or foreal time charging& "his could4 for instance4 $e to grant or deny a user s re%uest of a specific neactivity and to account for the eventual use of that net#or resource&

"his thesis investigates #hether it is possi$le to develop a Diameter Credit Control Application thatachieves linear scaling and the eventual pitfalls that e'ist #hen developing a scala$le DCCA server&"he assumption is $ased on the o$servation that the DCCA server s connections have little to nothing common (i&e& little or no synchroni,ation)4 and introducing more processors should therefore give linscaling&

"o investigate #hether a DCCA server s performance scales linearly4 a prototype has $een developeAlong #ith the development of the prototype4 constant performance analysis #as conducted to see#hat affected performance and server scala$ility in a multiprocessor DCCA environment& As the resulsho#4 %uite a fe# factors $esides synchroni,ation and independent connections affected scala$ility othe DCCA prototype&

"he results sho# that the DCCA prototype did not al#ays achieve linear scaling& 5o#ever4 even if it#as not linear4 certain design decisions gave considera$le performance increase #hen more processor#ere introduced&

KEYWORDS

#iameter& #!!A& 'ultiprocessing& (erver& Performance& !ache coherence& #$namic memor$

8


5/42


"AB / E CEF"/F"S ABS"+AC"&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& ?/GHE+DS&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& !F"+EDIC"!EF&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

1&1 BAC?-+EIFD&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 1&2 +/S/A+C5 EBJ/C"!K/S&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 1&; 5GPE"5/S!S&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 1&8 M/"5EDE E-G&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 1&: SCEP/ AFD !M!"A"!EFS&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 1&9 MA!F CEF"+!BI"!EF AFD +/SI "S&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

BAC?-+EIFD&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 2&1 MI "!P+EC/SSE+ SGS"/MS AFD PA+A / CEMPI"!F-&&&&&&&&&&&&&&&&&&

2&2 D!AM/"/+ AFD DCCA&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 2&; P/+ E+MAFC/ /KA IA"!EF&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& /LP/+!M/F"A /FK!+EFM/F"&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

;&1 "5/ DCCA P+E"E"GP/&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& ;&2 P A" E+M&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

/LP/+!M/F"A +/SI "S&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 8&1 EK/+A P/+ E+MAFC/ E "5/ DCCA P+E"E"GP/&&&&&&&&&&&&&&&&&&&&& 8&2 DGFAM!C M/ME+G AFD A ECA"E+S&&&&&&&&&&&&&&&&&&&&&&&&&&&

8&2&1 DCCA AFA GS!S&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 8&2&2 -/F/+A CAS/ AFA GS!S&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

8&; "+AFSPE+" P+E"ECE S&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

8&8 "5/ !MPAC" E EF- S/+!A ! A"!EF PE!F"S&&&&&&&&&&&&&&&&&&&&&&&&& 8&: CAC5/ CE5/+/FC/&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 8&:&1 A S/ S5A+!F-&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 8&:&2 P+EC/SSE+ A !F!"G&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

8&9 M/ME+G B EC? CEPG!F-&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& D!SCISS!EF&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& CEFC IS!EFS AFD I"I+/ HE+?&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

9&1 CEFC IS!EFS&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 9&2 I"I+/ HE+?&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

+/ /+/FC/S&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& B!B !E-+AP5G&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

H/B +/SEI+C/S&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& - ESSA+G&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& APP/FD!L A S/+!A ! A"!EF PE!F"S&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& APP/FD!L B M/ME+G B EC? CEPG!F-&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

:


6/42


Chapter

1 !F"+EDIC"!EF

1.1 BACK ROU!D

Diameter is a net#or ing protocol referred to as AAA Authentication@Authori,ation@Accounting is used in the telecommunication industry for controlling access and the charging of services such amo$ile !nternet& Diameter is a successor to +AD!IS4 and in contrast to +AD!IS4 Diameter runs overelia$le transport protocols ("CP and SC"P)4 ena$les net#or layer security (such as !PSec) and has $etter support for roaming& "hese are only a fe# features that differ the t#o protocols& Diameter $a protocol standard is defined in the + C;:


7/42


1.' #Y(O%#ESIS

!n a DCCA4 the connected nodes have little in common& "hat is4 the connections #ill not need asignificant seriali,ation points $et#een them& "herefore4 the hypothesis is that a DCCA #ill $e a$le toscale near linearly #ith respect to the num$er of processors&

5o#ever4 load im$alance could $e a potential pro$lem in a DCCA& Ene connected node may sensignificantly more re%uests than another node& "his means that the connection itself must paralleli,ed to fully utili,e the processors& Hhat performance issues #ill arise in such a scenario4 andho# can those issues $e resolved

1.) *E%#ODO+O Y

"o $e a$le to dra# conclusions from the performance evaluation of a DCCA4 a prototype has to $edeveloped& "his prototype #ill $e the $asis of performance measurements in this thesis& Briefdescri$ed4 the implemented prototype #ill $e a$le to connect up to :0 simultaneous client nodes& "hesnodes can send Credit Control +e%uests (CC+s) to the DCCA and the server responds #ith CrediControl Ans#ers (CCAs)& "he node s re%uest is al#ays granted (since this is a prototype)4 $ut $efthe DCCA ans#ers the node4 a simulation is performed of the #or a DCCA does in the case of a CC+

Along #ith the prototype server4 a complete data structure for handling Diameter messages (includinall supported Attri$ute Kalue Pairs4 AKPs4 #hich are an essential part of a Diameter message) #ill $developed& "his ma es it easier to manage the $yte se%uence that is received from the net#or & !tma es it easier to handle and send specific Diameter messages over the net#or & inally4 a DCclient #ill $e developed as a helping tool in the performance evaluations and measurements& "hclient s only purpose is to $enchmar the capacity of the server&

"he #or descri$ed in this thesis is done in the form of e'ploratory approach N+o$son 02O& Ence tDCCA prototype is complete4 performance analysis tools are used to find $ottlenec s and hotspot arein a credit control scenario& !nformation #ill also $e gathered to find out e&g& #hat C77 specoptimi,ations can $e done for a DCCA4 in respect to multithreading and scala$ility&

"he evaluations #ill $e conducted $y $enchmar ing the server in one scenario :0 client nodes send atotal of three million CC+4 and receives as many CCA& "he idea is that if a value is e'tracted from onmeasurement4 #ill there $e any improvements if the server is modified and measured under the e'actsame scenario "here #ill also $e general case analysis of certain scenarios& During the #or of thithesis4 the follo#ing saying #ill constantly $e present*

)*f $ou can+t measure it& $ou can+t control it And if $ou can+t control it& $ou can+t improve it )

1., SCO(E A!D +I*I%A%IO!S

"he #ays to increase performance are many there s hard#are4 programming languages4 compileoperating systems and many other aspects that all have different impact on an application

=


8/42


performance& "his thesis #ill4 ho#ever4 try to limit the performance optimi,ations and scala$ilitevaluations to the soft#are in a Diameter Credit Control Application server architecture& or instancan !ntelQ processor using !ntelQ compiler could get a performance increase in comparison to the g77compiler& Aspects such as different compilers #ill not $e ta en into consideration in this thesis&

"his thesis #ill focus on evaluating performance and scala$ility in a Diameter Credit ControlApplication server& Any other Diameter $ased server applications are not covered& "hat also means performance evaluations of DCCA clients #ill not $e covered&

"o fully understand this thesis4 no#ledge a$out programming4 memory4 net#or ing and a certaidegree of parallel computing is re%uired&

1.- *AI! CO!%RIBU%IO! A!D RESU+%S

"his research thesis has evaluated a multithreaded Diameter Credit Control Application (DCCA)4 in

respect to multiprocessing performance and scala$ility& "o $e a$le to properly evaluate DCCA4 prototype that mimics the real #orld as close as possi$le #as developed& "he primary goal #as toevaluate scala$ility4 and #hat affects multiprocessing performance of the DCCA prototype& "he resulof this thesis should aid multithreaded server development #ithin the telecommunication industry4#here Diameter is used as the net#or ing protocol& !t should particularly aid the development omultithreaded DCCA servers4 $ut could also $e of interest for any development of multithreadeapplications&

"he main results sho# that the DCCA prototype did not al#ays achieve linear scaling& Aspects such asmemory management4 synchroni,ation and cache coherence have significant impact on servescala$ility& 5o#ever4 even if scala$ility #as not linear4 certain design decisions gave considera$

performance increase #hen the DCCA prototype #as employed on a multiprocessor system&


9/42


Chapter

2 BAC?-+EIFD

".1 *U+%I(ROCESSOR SYS%E*S A!D (ARA++E+ CO*(U%I!

A multiprocessor system is a computer system that consist of more than one processor that can #orsimultaneously& An e'ample of such a system is Symmetric (shared memory) MultiProcessing& SMPa computer system that consists of multiple identical cache processor su$systems that share the sam physical memory and connect via a $us& "his type of centrali,ed shared memory is $y far the mo popular organi,ation N5ennessey 0;O&

"he term multiprocessing refers to the possi$ility of running a computer system #ith more than one processor& "he idea $ehind multiprocessing is that performance can $e dou$led $y employing t#ice amany processors& But the reality is that this is rarely the case4 even though multiprocessing undcertain circumstances can result in increased performance& Along #ith multiprocessing comes a seriof pro$lems one must consider&

As multiprocessor systems more or less have $ecome the norm of today s computer systems4 it masense to use parallel computing techni%ues to increase performance& or multiprocessor systems tconsist of a single computer (such as SMP)4 parallelism is achieved $y using threads N-er$er 0:O&thread is a single stream of control in the flo# of a program& A thread does not necessarily follo# thse%uential flo# of an application4 #hich means that on a SMP system4 independent threads can e'ecutsimultaneously4 i&e& in parallel&

-aining performance increase $y using threads can $e accomplished either $y a clever compiler thatcan determine #here it s appropriate tothread (i&e& e'tract parallelism)4 $ut in most cases th programmer must e'plicitly e'press a parallel formulation of the application& "hat is4 the programmemust #rite code #here threading is appropriate&

"o $e a$le to #rite a multithreaded application the programmer must identify the tas s than can $ee'ecuted concurrently& Ence identified4 the programmer can use an AP! such as PES!L thread(pthreads) to e'press the identified concurrency&

Pthreads defines an e'tensive AP! for creating and manipulating threads& Pthreads is a so called lo#level thread AP!4 #hich means that the programmer has a greater responsi$ility of the threads than if a

>


10/42


AP! such as EpenMP (#hich is considered high level) is used& "he differences $et#een lo# level anhigh level thread AP! is similar to the differences $et#een assem$ly language and C77 a classtradeoff in control and comple'ity&

!n order to achieve a performance increase4 there are common threading goals to consider N-er$er 0:O ocusing on hotspots "here is no point in optimi,ing the part of an application that consumes

an insignificant part of the e'ecution time& Minimi,e synchroni,ation "hreads that re%uire little synchroni,ation #ill in most cases result

in higher performance4 in contrast to those threads that re%uire more synchroni,ation& Minimi,e memory sharing Multiple processors do not li e to share memory4 since memor

sharing results in e'tra $us transactions and usually additional synchroni,ation&

5o#ever4 there are not only threading goals that the programmer must consider& Along #ith parallecomputing and threading4 there are pitfalls and other issues*

Short loops "hread creation4 termination and synchroni,ation in a paralleli,ed short loop could prove to $e more e'pensive than se%uentially e'ecuting the loop&

alse sharing R +esults in unnecessary $us transactions $y sending a cache line $ac and for $et#een the processors&

Processor affinity Descri$es the specific processor(s) that a thread e'ecutes on& A potenti performance issue is #hen threads are constantly moving $et#een the different processors&

"otal memory $and#idth "he processors collectively share the fi'ed memory $and#idth& ormemory intensive applications4 the total $and#idth can $e a pro$lem even if more processorare introduced&

Synchroni,ation overhead Synchroni,ation may force threads to #ait for each other& "hismeans that threads #ait #ithout doing any #or 4 #ith the conse%uence of reducing parallelismof the application&

As presented4 multiprocessing introduces a #hole ne# set of possi$ilities and pro$lems& "hreadingoals and pitfalls descri$ed earlier #ill $e discussed and evaluated in this thesis& !n particular in regarto the Diameter Credit Control Application server&

"." DIA*E%ER A!D DCCA

Diameter is a net#or ing protocol that controls communication $et#een connected nodes& or instancthis could $e $et#een an authenticator (such as a Diameter $ased server) and a net#or nodere%uesting authentication& Diameter can $e used in the telecommunication industry&

Diameter is a successor to the +AD!IS protocol (the name itself is a pun of +AD!IS4 Diameter ist#ice #hat +AD!IS is) and offers improvements& Amongst those improvements are relia$le transport(either "CP or SC"P)4 error handling4 net#or layer security and a more e'tensive roaming support&

Diameter is thought to $e the ne't generation AAA Authentication@Authori,ation@Accountin

10


11/42


protocol& Authentication refers to authenticate an entity s identity4 authori,ation determines #hethespecific entity is allo#ed to perform the re%uested activity4 and accounting refers to pursuing the useaccess to net#or resources& Since the emergence of ne# technologies such as mo$ile !nternet4 tre%uirements and demands of authentication and authori,ation have increased& "his is #here +AD!IS

is considered to $e insufficient N+ C;:


12/42


"he $enchmar should $e easy to run and produce results that are simple to interpret&

"his definition of a $enchmar and ey attri$utes have $een adopted #hen creating $enchmarapplications and conducting $enchmar s on the DCCA prototype&

igure 2&1 "he soft#are optimi,ation process&

inding hotspots in the DCCA prototype #ill mainly $e done $y using !ntelQ K"une Performanceanaly,er inu' tools& or instance4 using !ntelQ K"une Call -raph analy,er aids in pinpointinghotspot areas during e'ecution&

!n this thesis4 performance evaluation refers to evaluating different solutions in respect to performanc"he evaluations #ill $e $ased on $enchmar s4 #here timing mechanism (stop#atch) and throughputare the main performance measuring tools&

"hroughout this thesis4 certain performance evaluation metrics #ill $e used& "hese are*

Speedup R refers to ho# much faster a parallel algorithm is in contrast to the se%uential versionor instance4 if a se%uential algorithm ta es 19 seconds to e'ecute4 $ut the parallel algorithe'ecutes in 8 seconds4 the speedup is 8 (19@8 T 8)&

Parallel efficiency R is a metric of the time that a processor is fully utili,ed& or instance4 ise%uential algorithm ta es 19 seconds to e'ecute4 #hereas on < processors the parallealgorithm e'ecutes in 8 seconds4 the parallel efficiency is 0&: (19@8@< T 0&:)& "hat is4 the pefficiency is :0U& "he higher efficiency4 the $etter utili,ation of the processors& !f speedup not superlinear (#hich it in some cases may $e)4 ma'imum value is 100U utili,ation of the processors&

Scala$ility R a parallel system is considered scala$le if the speedup increases asymptoticall#ith the num$er of processors&

12


13/42


Chapter

3 /LP/+!M/F"A /FK!+EFM/F"

'.1 %#E DCCA (RO%O%Y(E

"he DCCA prototype has a fi'ed num$er of threads (:0 for this thesis)4 referred to as connectionthreads& "he amount of connection threads (or at least the ma'imum num$er) can $e determined prioto the server e'ecution& Due to the nature of a DCCA4 the different nodes that may connect are usual

no#n& As mentioned4 :0 connection threads #ill $e used during measurements4 #hich ena$les :clients to connect simultaneously&

"hese connection threads listen for incoming connections& Ence a client #ants to connect4 a threa pic s up the re%uest and handles only that single connection& After that4 the client sends a given numof message re%uests to the server4 #hereas the server ans#ers each message re%uest& Ence tconnection is terminated4 the connection thread #ill return to a listening state&

Hhen a connection is esta$lished4 a fi'ed num$er of #or er threads are created (20 for this thesis)&"hese threads #ill #or #ith the specific connection and it s Diameter messages& "hat is4 the #orthreads #ill read from and #rite to the net#or connection4 $ut also simulate DCCA #or and saveeach received message to a %ueue& "his %ueue #ill later $e saved to disc&

or each esta$lished connection4 a single thread is created that handles saving data (Diametemessages) to disc& "his thread sleeps for a given amount of time $efore #a ing up and starting to #oron the message %ueue& "he #or is4 in this case4 to empty the message %ueue and convert the messto one $ig LM chun 4 and save that chun &

5o#ever4 the part of the actual disc access has $een e'cluded& !t #ould have $een too difficult to dra#any conclusions regarding other performance aspects since accessing the disc #ould $e the single

$iggest $ottlenec & "he preparation of disc access (see Appendi' A4 igure A&2) should thereforeconsidered as simulated #or of a real DCCA server&

A $rief use case scenario of the prototype is*

1;


14/42


1& Main application thread creates a given num$er of connection threads&2& /ach connection thread #aits until a client node connects& "here can $e as many simultaneous

connected nodes as the amount of created connection threads&

;& Hhen a client has connected4 the connection thread creates a given num$er of #or er threads&thread for file handling is also created&8& "he #or er threads deal #ith the connected node R receive message (re%uest)4 handle t

re%uest4 save re%uest to a connection shared %ueue and ans#er the node s re%uest&:& A file thread created $y the connection thread #a es up after a given time and handles th

connection s message %ueue& /ach message is de%ueued and merged into a single LM chAfter the %ueue has $een emptied4 the LM chun is meant to $e saved to disc&

9& Step 8 : is repeated until the connection is closed& Ence that happens4 the #or er threads file thread terminate4 and the connection thread returns to a listening state (step 2)&

igure ;&1 is a general illustration of the DCCA prototype&

igure ;&1 A general description of the DCCA prototype& Fote that for simplicity4 this only illustrates one connection&5o#ever4 the illustration can easily $e translated to any given num$er of connections&

"he data structure for Diameter messages is represented $y a num$er of classes ( igure ;&2)& E$ eoriented inheritance is used to represent AKPs4 since an AKP can $e one of < different AKP datatyp(for instance4 ;2 $it integer or an octet string) N+ C;:


15/42


igure ;&2 "he Diameter message data structure& EctetString and Insigned;2 (#hich are 2 out of < e'isting datatypes)inherit from AKP& Fote that su$classes data differs4 in respect to datatypes&

A Diameter message structure may contain any given num$er of AKP o$ ects& A Diameter mesonly has one Message5eader& "he Diameter message design does not only ma e it easy to managDiameter messages4 $ut the data structure has several appropriate mem$er functions for handlindifferent DCCA needs& or instance4 it is possi$le to turn a properly concatenated $yte array (i&e&has $een received from a connected client) into a easily managea$le Diameter message& !t is als possi$le to turn the data structure o$ ect itself into a $yte array that can $e sent to the connected client

Besides the Diameter message structure4 several other user defined classes are commonly used in DCCA prototype& or instance4 a class handling character arrays (string) have $een used& !t is similS" string4 $ut the string class used for the DCCA prototype has features that S" string could not provide& !t is also easier to integrate the user defined string class into other user defined classes& A for net#or handling has also $een used4 #hich ma es net#or management easier& igure ;&; sho#an e'ample of the user defined String4 Fet#or and og classes&

1:


16/42


igure ;&; An e'ample of three commonly used classes R String4 og and Fet#or &"he e'ample is a simple server that a client can connect to&

'." (+A% OR*

"he evaluations #ill only $e conducted on the follo#ing platform*

!ntel(+) Leon(+) CPI /:;;:4 consisting of totally < processors (2 physical processor chips4#ith 8 cores each)& /ach processor core is running at 2&00-5,&

"otal memory capacity is 19 -B& Fet#or is -iga$it /thernet& Eperating system is '98 -FI@ inu' 2&9&28 2; server& Programming language is C774 and applications are compiled using g77 8&2&8 and gli$c 2

During compilation4 optimi,ation level E; is used& Fo other optimi,ation flags are set&

"he thread AP! used #ill $e PES!L threads (pthreads)&

19


17/42


Chapter

4 /LP/+!M/F"A +/SI "S

).1 O&ERA++ (ER OR*A!CE O %#E DCCA (RO%O%Y(E

"he final result of this thesis is sho#n in igure 8&1 as the total throughput the DCAA prototype cahandle& "he throughput has $een measured using 14 24 8 and < processors& igure 8&1 is the result final conclusions and evaluations of this thesis& !n the follo#ing su$ chapters4 the analysis that lead these results are outlined&

igure 8&1 inal result of DCCA throughput&

1=


18/42


)." DY!A*IC *E*ORY A!D A++OCA%ORS

Dynamic memory ena$les an application to allocate memory #hen needed4 and free it #hen it s nlonger needed& or system design and fle'i$ility4 this could often $e a desira$le approach due maintaina$ility and design reasons& 5o#ever4 along #ith the fle'i$ility of dynamic memory comes acost in performance NSmith 01O& "his su$ chapter focuses on dynamic memory4 and the goal is toout ho# dynamic memory affects the DCCA prototype and find possi$le solutions on ho# to addresseventual performance hits&

).".1 DCCA A!A+YSIS

According to the !ntelQ K"une Call -raph4 a$out 10U of the function calls in the DCCA developedfor this thesis are to either C malloc() or free()& re%uent dynamic memory management can plasignificant role in an application s performance4 and the performance penalty often $oils do#n to tdefault memory manager NBerger 00O& "he performance penalty $ecomes even more significant multithreaded environment due to seriali,ation loc s used $y the runtime li$rary to access the sharedmemory heap N!ntelO&

E$viously4 a performance optimi,ation #ould $e to reduce the num$er of dynamic memory calls4 $ut certain degree of dynamic memory is still needed& Due to that reason4 measurements #ere docomparing gli$c $uiltin memory manager and -oogle "CMalloc ("hread Caching Malloc)&

"he default memory manager is $y nature general purpose and cannot ma e any simplifyingassumptions a$out the application& An application may use dynamic memory in a very specific #ay ancould therefore pay penalty for functionality it does not need& "he default memory manager could alhave features that penali,es performance4 such as fre%uent system calls for every ne#() and deleteand seriali,ation points (such as contention for the memory heap) that reduce scala$ility NBul a 0;Ising other allocators than the default could address these issues $y having memory pools and threadcaches& Fot only does this aim to reduce system calls and seriali,ation points4 $ut also aims to reducache coherence pro$lems NBerger 00O& A specially designed memory manager for a multithreenvironment (such as "CMalloc) could therefore $e $eneficial for the performance in the DCCA prototype&

igure 8&2 and igure 8&; sho#s a $enchmar of the DCCA using gli$c malloc and "CMalloc& ig8&2 sho#s throughput4 #hereas igure 8&; sho#s parallel efficiency $ased on throughput&

1


19/42


20/42


"CMalloc #or s $y assigning thread local cache from the central memory heap to each thread& !n thcache4 smaller allocations are done #ithout any seriali,ation& "CMalloc uses gar$age collection fothread local storage that is considered redundant& "his redundant memory migrates from the loccache to the central heap& "hread caching could mean a higher memory use4 and "CMalloc may

more memory hungry than other malloc implementations N-hema#atO&A num$er of measurements #ere conducted $oth on the DCCA prototype4 $ut also in more isolatedsituations& Besides DCCA $enchmar ing4 integer allocation as #ell as Diameter message handling #emeasured&

)."." E!ERA+ CASE A!A+YSIS

5oard is another allocator developed for multithreaded environments& 5oard uses mmap e'clusively $ut handles memory in 98 ilo$yte chun s referred to as superblocks& 5oard s heap is logically dividedinto a common4 glo$al heap $ut also a given num$er of per processor heaps& As #ith "CMalloc4 5oa

also employs thread local cache that holds a limited num$er of super$loc s& "he developers of 5oaclaims the allocator achieves almost linear scala$ility #ith the num$er of threads NBerger 00O&

"he Diameter message $enchmar #as conducted $y letting a given num$er of threads (


21/42


igure 8&8 Diameter message handling $enchmar 4 using different allocators&

!n igure 8&84 5oard performs #ell at < cores4 #ith little respect to the num$er of threads used& Ethreads 5oard achieves a 2&; speedup compared to gli$c malloc and 5oard performs slightly $etter tha"CMalloc& "CMalloc performs similar at ;2 threads4 $ut "CMalloc does not perform as #ell as 5oard#ith < threads&

"he integer $enchmar igure 8&: #as conducted $y letting a given num$er of threads (


22/42


igure 8&: Allocation@Deallocation $enchmar 4 using different allocators&

!n igure 8&:4 5oard performed surprisingly $ad4 as the allocator on one processor is V;0U #orse thagli$c malloc& 5o#ever4 as the num$er of cores increase as do 5oard s performance& At < process5oard outperforms gli$c malloc& /specially #hen using only < threads4 as 5oard has a speedup of 2&compared to gli$c malloc& "CMalloc performs #ell right from the start4 and scales #ell on ;2 threads on < cores4 2< seconds elapsed #ith < threads4 $ut only < seconds #hen using ;2 threads& !ncreasing num$er of threads $eyond ;2 did not give any further performance increase&

!t #as not possi$le to conduct any measurements #ith 5oard on the DCCA prototype& /ven though it#as confirmed the shared li$rary #as properly lin ed4 the server had net#or issues #hen using 5oard&5o#ever4 5oard could still $e of interest on another platform as it supposedly has a $ig theoretical performance increase in multithreaded environments N in 00O& ailing to use 5oard is also a remithat using allocators other than the $uilt in is not only a$out performance increase the implementatiitself must $e relia$le4 something 5oard could not provide for the DCCA prototype&

!t is important to note that free() and malloc() are primitive system calls and in a reasona$lemeasurement should not $e directly compared to "CMalloc or 5oard& "hese allocators reduce calls tfree() and malloc()4 structures memory in an efficient #ay $y logically separating the memory heap anare designed to give performance gains in multithreaded environments& 5o#ever4 this is not primarilythesis on comparing allocators4 $ut as the measurements sho#4 there is a significant performancdifference for a practical implementation& Hhen developing a DCCA4 or for that matter an

22


23/42


multithreaded server that in hotspot areas allocate dynamic memory4 using different allocators can hava significant performance impact&

).' %RA!S(OR% (RO%OCO+S

Inli e it s predecessor +AD!IS4 Diameter operates over relia$le transport protocols "CP or SC"P&"his comes #ith the e'pense of speed4 $ut on the other hand it ena$les flo# control and retransmissionsof lost pac ets& 5aving a relia$le transport protocol is immensely important for a chargintelecommunication system&

"CP is #ell no#n #hereas Stream Control "ransmission Protocol (SC"P) first $ecame a standard in2000 N+ C2>90O& SC"P is similar to "CP4 such as $oth protocols support full duple'4 is connectioriented4 provides relia$le transmission and employs a flo# control #indo#ing mechanism& 5o#ever4SC"P provides capa$ilities that "CP does not&

SC"P ena$les and uses the notion of multihoming& A multihomed host is a host #ith more than onnet#or interface& At connection initiali,ation4 the SC"P peers e'change information a$out theiinterface(s) and if a SC"P message re%uires retransmission4 that message can $e sent to an alternainterface& "his increases the relia$ility of the SC"P session& /ven though this feature primarily is forelia$ility and not for load sharing4 load sharing could still $e a potential side effect of $einmultihomed&

"CP is $yte oriented4 meaning that the upper layer protocols have to count and accumulate $ytes oeach Diameter message& SC"P on the other hand is message oriented4 meaning that SC"P maintamessage $oundaries and delivers complete messages (called chun s) to the upper layer protocolsDealing #ith "CP s $yte oriented nature could introduce certain net#or overhead4 and potential

more system calls due to read and #rite operations of $yte se%uences&Perhaps the most important difference is that SC"P provides multiple data streams $et#een t#o peers4#hereas "CP only allo#s one $yte stream& "his means that independent data e'changes may $edelivered over different streams messages lost in one stream does not affect other streams& !n "Clost segments #ould delay delivery of all su$se%uent messages& "his is referred to head of l $loc ing4 #hich is a phenomenon limiting overall throughput e&g& the total performance scala$ility of the DCCA&

Measurements in igure 8&9 #ere done comparing "CP and SC"P on the prototype developed for thithesis& "he usage of SC"P is rather primitive4 i&e& parallel message streams and multihoming #ere

included&

2;


24/42


igure 8&9 +esult of DCCA throughput using "CP and SC"P4

Ether research measurements have $een done on ho# SC"P impacts server scala$ility NEno 0


25/42


no matter ho# many processors4 only one at a time can access the critical region& !t can $e crucial performance to try and avoid long mutual e'clusions in hotspot areas&

"he DCCA has to save each message in a container4 to later $e #ritten to disc $y another thread&

ma or $ottlenec #as found #hen a S" vector #as used as a container4 $ecause every time a thread#anted to access the vector4 mutual e'clusion had to $e claimed& "his #ould not $e a pro$lem if onlythe threads that inserted messages had to deal #ith the mutual e'clusion& "he $ottlenec occurred #henthe thread dedicated to emptying this vector had to eep claiming mutual e'clusion until every messagehad $een handled and removed (see Appendi' igure A&;)& "his meant no other threads could inseany messages until the emptying operation #as complete& "he result of this solution is sho#n in igur8&=&

igure 8&= +esult of DCCA throughput4 using a loc ed S" vector&

"he results in igure 8&= sho# no increase at all $et#een 8 and < processors& "o counter this pro$lema lin ed list #as used instead of a S" vector& "he graph in igure 8&< demonstrates throughput4 ifaccess of the list #as done during mutual e'clusion&

2:


26/42


igure 8&< +esult of DCCA throughput4 using a loc ed lin ed list&

"he results are #orse almost across the $oard& "hese measurements rule out that the choice of datastructure (vector or list) #as the cause of the pro$lem& 5o#ever4 $y using a specially designed lin elist data structure gives the possi$ility of inserting and removing messages simultaneously (eventhough those respective operations have to $e done during mutual e'clusion)& "he seriali,ation point oemptying the list can therefore $e alleviated4 ma ing it possi$le for the application to still insemessages even if the container is currently $eing emptied& See Appendi' igure A&2 for the complsource code of the emptying operations&

"he list is accessed in t#o #ays one access is to the front of the list4 #hile the other is accessed in therear& !f the application inserts messages at the rear of the list #hile the thread that empties the list tamessages from the front4 $oth can #or on the list simultaneously& Accessing the container alleviating a ma or part of the synchroni,ation in the common case fulfills an important threading goaN-er$er0:O& Minimi,ing synchroni,ation is a ey to gain performance from an SMP environment& impact of minimi,ing synchroni,ation in this hotspot area is sho#n in igure 8&>&

29


27/42


igure 8&> +esult of DCCA throughput4 using an unloc ed lin ed list&

Comparing #ith igure 8& sho# up to 1:0U increase in throughp performance #ith "CMalloc on t#o or more processors& 5o#ever4 despite the minimi,edsynchroni,ation4 gli$c malloc sho#s a performance decrease on t#o or more processors&

)., CAC#E CO#ERE!CE

A common pro$lem in a SMP environment is cache coherence4 since each processor has it s o#

e'clusive cache& !f t#o processors #or on the same memory $loc 4 a protocol to eep the data in thcaches consistent has to $e employed& or instance4 if t#o threads on t#o different processors #or oshared data4 the data must travel $et#een the different processor caches to stay coherent& "he need fcache coherence protocols gives more transactions to the shared memory $us&

).,.1 A+SE S#ARI!

or the DCCA prototype there are three mutual e'clusions that each connection share4 one for readinfrom the net#or connection4 one for #riting to the net#or connection and one for saving the readDiameter message in a container& "he mutual e'clusions are contained inside the shared data structufor each connection (see igure 8&18)& "hese loc s are constantly accessed and changed& !n cescenarios4 this memory setup can cause a performance hit in the form of false sharing&

alse sharing occurs #hen several processors #rite to different data in the same cache line4 #hichmeans that the cache $loc #ill ping pong $et#een the different caches N-rama 0;O4 N-er$er 0:Othe DCCA prototype4 this could for instance $e the mute' loc s& "o demonstrate this pro$lem4 a sm $enchmar application #as developed that creates a given num$er of threads (


28/42


5alf of the amount of threads loc s and unloc s one of the them4 #hile the other half does the same onthe other mute' loc & "he results are sho#n in igure 8&11&

igure 8&10 Data structure #ithout cache padding&

igure 8&11 alse sharing #ithout using cache padding& Min refers to the minimum e'ecution time during themeasurements4 Ma' to the ma'imum and Average to the average e'ecution time&

As sho#n in igure 8&114 there is no difference on a single processor4 since the cache line does not hto ping pong any#here& 5o#ever4 as the num$er of cores increase4 the effects of cache line ping p

$ecomes more visi$le& !t ta es longer to do the same amount of iterations the more processors that aintroduced& Also note that the results can vary $y up to ;00U on t#o or more processors4 #hichhappens since there is a random amount of ho# many cache line ping pongs occur& "o confirm ththere is a false sharing pro$lem a test #ith cache padding is performed& !n this case4 the definitioncache padding in this thesis is that enough data is added $et#een the loc s to push the loc s intdifferent cache lines (see igure 8&12)& Y/nough dataY depends on the platform s cache line si,e&

2


29/42


igure 8&12 Data structure #ith cache padding&

"he result if adding cache padding $et#een the loc s in the data structure is sho#n in igure 8&1;&

igure 8&1; alse sharing using cache padding&

As the results in igure 8&1; sho#4 on average there is an almost dou$le speedup across the $oard ot#o or more processors& Also to note is the results are not fluctuating as much anymore4 at ma'imum:0U& Compared to the non padding in igure 8&114 it is clear that there #as a false sharing pro$l"he cache line ping pong caused $y false sharing is gone4 $ut ping pong still occurs since the lostill have to move $et#een the different processors cache4 as the results sho# an increase incomputation time the more processors are used&

"he DCCA has a shared data structure for each connection #ith a si,e of ;>2 $yte over 11 different

varia$les& !n this case there might not $e a need for padding4 $ut instead the loc s can $e positionedthe data structure so they are not in the same cache line (see igure 8&18)& +esults sho#ed a increase in performance after the ne# positioning of the shared mute' loc s #as performed&

2>


30/42


igure 8&18 /'ample of restructuring of shared resource& Structure to the right is restructured&

).,." (ROCESSOR A I!I%Y

!n inu' there is t#o types of processor affinity& Ene called soft affinity4 #hich means that thescheduler #ill try to eep threads on the same processor as long as possi$le& 5o#ever4 the schedulecould onlytr$ to achieve this $ecause at some point a long lived thread #ill most certainly migrate toanother processor& !n more recent versions of the inu' ernel4 there is support for a second type h processor affinity& 5ard processor affinity is a ind of scheduler property that forces a thread to a give processor set& "he operating system s scheduler #ill then honor this rule $y running the thread on tspecified processor(s)&

"he most o$vious $enefit of processor affinity in a multithreaded server application is optimi,ing thecache performance& Multiprocessor computers go through a lot of trou$le eeping the processocaches valid N-er$er 0:O& !f a thread adds a line of data to it s processor s cache4 coherence action $e ta en to eep the other processors caches coherent& And if threads constantly $ounce $et#different processors there is another t#ist is the data the thread needs availa$le in the specific processor Perhaps not4 as the data in the cache is data that $elongs to another thread4 and not trunning one& !f this occurs in a common case of the application4 it could result in a significant cacmiss ratio&

"o further investigate the impact of processor affinity and cache performance on the DCCA4 eachconnection and it s associated threads #ere forced onto one processor& +esults are sho#n in igu8&1:&

;0


31/42


igure 8&1: "hroughput of DCCA4 using processor affinity

As the results in igure 8&1: sho#4 using processor affinity on 2 and 8 processors had a sligh performance decrease& 5o#ever4 #hen using < processors there #as instead a performance gain #henusing processor affinity&

A third measurement #as conducted4 $ut not included in igure 8&1: allo#ing the connectionassociated threads to use a set of processors (instead of ust one processor)& 5o#ever4 this e'perimenhad no performance impact&

orcing a specific connection4 and all it s associated threads4 to one processor has t#o madisadvantages& !f there is load im$alance4 i&e& if one connected node sends more credit control re%than the other nodes4 that specific hotspot connection #ill not $e a$le to fully utili,e the SMP&

Hith processor affinity there is also the potential ris that t#o hotspot connections end up #anting touse the same processor& Such a scenario could mean an even greater performance decrease& !maginDCCA scenario #here there is only t#o active connections& "hose connections happen to $e undeheavy load4 and $oth end up $eing dedicated to the same processor& "his turns a paralleli,ed applicatiinto a completely se%uential oneZ

A potential solution to the issue #ith hotspot connections #anting the same processor is to design aforced thread migration& !f a processor has a significant amount of #or to do4 #hereas anoth processor does not4 one connection could migrate to that availa$le processor& "his alleviates t pro$lem #ith hotspot connections #anting the same processor4 $ut #ould still optimi,e cache performance due to the reason that the connection s associated threads #ill not run on any othe processor& !t could4 ho#ever4 introduce ne# comple' issues such as an efficient implementation of sucforced thread migration&

;1


32/42


!f processor affinity is used4 careful consideration must $e ta en in order to properly map connectionto processors& As the measurements sho#ed on the DCCA4 affinity could have $oth an increase anddecrease in performance depending on the num$er of processors& "he potential ris s of this solutiomust also $e considered&

).- *E*ORY B+OCK CO(YI!

Another performance optimi,ation #as to use the C function memcpy()4 instead of manually copying $ytes $y iterating through an array& !ntelQ K"une Call -raph pointed out that the DCCA prototype#as using this ind of $yte concatenation a significant part of the e'ecution time& /ach time a Diametermessage #as received4 handled or sent4 character $yte se%uences #ere put together $y concatenatiarrays&

"he memcpy(void[ destination4 const void[ source4 si,e_t num) function copies the values ofnum $ytes from source directly to the memory pointed $ydestination& 5o#ever4 memcpy() does not only

copy characters& !n fact4 memcpy() does not care #hat ind of data is copied4 since the result is a $incopy of the data NCplusplusO& memcpy() can result in a performance increase4 since it is alroptimi,ed and may use special assem$ly instructions to copy $loc s of data NBul a 0;O& "he c $loc s in igure 8&19 are t#o simple e'amples of ho# to concatenate t#o character arrays& "he firse'ample uses memcpy()4 and the other uses a user defined iteration to copy character $y character&

igure 8&19 /'ample of character concatenation using either memcpy() or an iteration&

By using memcpy() the overall performance of the DCCA #as increased $y 10U4 $ut only up to 8 processors& Fo improvement #as o$served #hen using < processors& 5o# memcpy() is primarily usein the DCCA prototype can $e seen in Appendi' B4 igure B&1&

;2


33/42


Chapter

5 D!SCISS!EF"he main focus of this thesis has $een to evaluate the scala$ility of a Diameter Credit ControlApplication& "he hypothesis #as $ased on the o$servation that the connections #or independent oeach other4 and therefore the DCCA prototype #ould scale near linear in respect to the num$er o

processors&As the results sho#4 the DCCA prototype #as not scala$le in all cases and especially not consideringall the aspects (#hich #ere presented in Chapter 8) that ended up affecting scala$ility& Synchroni,ationand independent connections #ere only parts of #hat #ould affect the performance throughput& !sloppy design decisions #ere ta en4 there #as an insignificant performance increase #hen adding more processors&

"he parallel efficiency of 2 processors are =:U4 #hich indicates that the performance does not scalelinearly& 5o#ever4 the DCCA is considered linear scala$le #hen comparing 2 and 8 processors\ $othave a parallel efficiency of =:U&

Ene interesting note is that gli$c malloc performed $etter #ithmore synchroni,ation than #ith less (seedifferences $et#een igure 8&= and igure 8&


34/42


of inheritance and dynamic $inding&

Ether C77 specific optimi,ations #ere also evaluated and em$raced& Most nota$ly the use ofinitiali,ation lists and call $y reference (instead of call $y value) proved to $e good optimi,ati

techni%ues& !n order to ma e the common case fast4 it #as important to reduce the creation adestruction of temporary o$ ects&

"he DCCA prototype #as only measured and tested in a simulated environment& Besides th $enchmar ing client4 the Diameter traffic generator tool Seagull has $een used as a client to generacredit control scenarios& A simulated environment is not an ideal methodology4 as in real #orld ot performance $ottlenec s and relia$ility considerations #ill most certainly arise& Infortunately4 due different reasons4 Capgemini could not provide a real #orld testing environment& 5o#ever4 even if#ould have $een preferred to have tested the prototype in a real #orld environment4 the conclusiondra#n from this thesis should not $e affected $y #hether the environment is simulated or not&

Ene could also argue that the DCCA prototype is simply ust a prototype4 and that it is nrepresentative of a real #orld DCCA server& 5o#ever4 even if it is a prototype4 the development of tapplication has $een done in close contact #ith developers at Capgemini ?arls rona& "hey have givecontinuous feed$ac in order to aid the development of an application that is as representative of threal #orld as possi$le& "he evaluations and conclusions from this thesis #ill $e an integral part of otheir future use of scala$le Diameter $ased server applications&

!f possi$le4 the evaluations and conclusions #ould pro$a$ly have $enefited from including othe platforms than the one used in this thesis (see Chapter ;)4 since e&g& allocators differ& 5o#ever4 it not possi$le #ith the time given&

or performance critical applications4 C77 may $e considered a poor choice in contrast to e&g& C& each statement corresponds to a relatively small num$er of assem$ler instructions& A good generali,ation is that each statement is five to eight instructions& C77 is not that uniform& "he cost C77 statements can fluctuate #idely R one statement can generate three instructions4 #hereas anothecan generate ;00 NBul a 0;O& Ene must navigate through the performance minefield4 trying to avoiroutes that contain the ;00 instruction mines&

So #hy have C77 $een the programming language for the DCCA prototype4 one may as & Mainly $oils do#n to fle'i$ility and maintenance reasons& C77 has o$ ect oriented features that C canno provide& !f used cautiously4 these features can give a system design that is sta$le and that alleviamaintenance efforts& Programming carefully #ith C774 avoiding heavy C77 construct overheads4 coutherefore provide the $est of the C and C77 #orlds R performance4 fle'i$ility4 sta$ility and a goosystem design&

"he prototype itself is meant to mimic an actual real #orld application4 and C77 is commonly used#ithin the telecommunication industry4 due to the mentioned fle'i$ility and maintenance reasons& #ould not $e #ise to develop a prototype that is hard or even impossi$le to maintain4 as that #ould notmimic the real #orld& C77 is also the programming language that Capgemini re%uested&

;8


35/42


A part of this thesis #or and measurements involved different memory allocators& Ene of those #"CMalloc4 #hich is an allocator that assigns each thread a cache of it s o#n and if an allocation needto $e done that is small enough ("CMalloc treats o$ ects #ith a si,e of T;2?B as small)4 then noloc s #ill need to $e ac%uired& "hat can save appro'imately 100 nanoseconds for eac

allocation@deallocation N-hema#atO&"CMalloc therefore scales #ell #ith small o$ ect allocations4 especially up to 1?B N-hema#atO& "hcan $e o$served on the DCCA prototype4 since most allocations are less than 1?B& "his results inma or performance increase #ith "CMalloc compared to gli$c malloc (as can $e seen in Chapter 8&2"CMalloc s performance increase compared to gli$c malloc decreases the larger the allocations ar#hich could mean the performance increase o$served on the DCCA may not occur #ith anotherapplication& 5o#ever the si,e of an allocation in a DCCA conversation rarely e'ceeds 900 $yte andmost allocations are much smaller&

;:


36/42


Chapter

6 CEFC IS!EFS AFD I"I+/ HE+?

-.1 CO!C+USIO!S

"his thesis investigates #hether it is possi$le to develop a Diameter Credit Control Application thatachieves linear scaling and the eventual pitfalls that e'ist #hen developing a scala$le DCCA server& othis reason4 a DCCA prototype #as developed and evaluated $ased on throughput& "he assumption achieving linear scala$ility on the prototype is $ased on the o$servation that the different connectionhave little communication $et#een each other and could therefore utili,e every processor to theirma'imum potential& 5o#ever4 during the #or of this thesis it soon $ecame clear that it #as not thateasy to get a linear speedup $y simply adding more processors& "here are many factors $eside ra processor po#er that have a ma or impact on performance& !ntroducing multiprocessing alintroduces a #hole set of ne# pro$lems&

Memory plays a ma or role and the choice of allocators can play a $ig part in the performance ofmultiprocessor system4 as can $e seen in Chapter 8&2 #here the difference $et#een "CMalloc and gli$malloc #as up to a staggering ;00U $etter performance #ith "CMalloc& -eneral memory issues is a pro$lem& /ven small changes li e moving varia$les around in a shared data structure could givnoticea$le performance increases4 (see Chapter 8&:&1) or employing static processor affinity (Chapter 8&:&2)& Due to the impact memory can have on a multithreaded application4 it is fundamen pay attention on ho# the application handles the shared memory& "his is indeed an important lesson t $e learned #hen developing any high performance application&

Careless use of mutual e'clusions can also ma e a huge impact if used improperly4 as can $e seen iChapter 8&8 #here a different approach increased the performance $y up to 1:0U& Synchroni,ation is pro$lem4 and it is important to $e attentive of #hich parts of the application that have to $e e'ecuted#ith mutual e'clusion4 and #hich parts that do not&

Infortunately4 even #ith optimi,ations in place4 the DCCA prototype #as not capa$le of linearscala$ility4 since it had several issues that affected parallel performance (see Chapter 2)& "hevaluations also point out that careless system design could lead to an insignificant performanceincrease& "hat is #hy a#areness of $ottlenec s in a parallel server system (such as the DCCA) ifundamental&

;9


37/42


!t is not #ise to ta e scala$ility for granted4 $ut instead the parallel performance should constantly $measured and evaluated to see #here further optimi,ations can $e achieved& Ether#ise4 the servecould prove to $e a potential pro$lem for a telecommunication system4 as the demands of applicationand services such as the DCCA #ill most li ely increase& !n particular since more and more fully

featured cellular phones that have need of AAA hit the mar et&

-." U%URE WORK

During the #or of this thesis4 ne# %uestions arose that can $e of interest& Ene of those %uestions further investigation of the SC"P protocol& "his thesis did not focus on all of SC"P s features4 $ut theare features that could gain a performance increase& /ven if multihoming is not thought of as a load $alance feature4 it could still $e a desira$le side effect& An evaluation of the multiple message strea#ould also $e of interest& Perhaps such investigation #ould also lead to $etter net#or relia$ility&

"he DCCA prototype #as developed #ith the standard soc et AP! that comes #ith inu'& 5o#ever4

there are other soc et AP!s that could $e more performance friendly& AP!s that alleviates some of tissues associated #ith !@E operations4 and is designed for a multiprocessor environment& !nvestigaand evaluation of such AP!s could give another lift to the server s scala$ility&

Ene last interesting %uestion that came up is the potential use of several physical servers that cooperaas if there is only one& A replicated server approach to gain performance $y letting different servehandle different connections& "his does not necessarily increase the scala$ility of a specific server4 $it could give $etter throughput in a DCCA system&

;=


38/42


+/ /+/FC/S

BIB+IO RA(#Y

N-er$er 0:O -er$er +&4 Bi A& J& C&4 Smith ?&4 "ian L&4%he (oftware ,ptimization !ookbook(econd -dition High Performance Recipes for *A ./ Platforms4 !ntel Press4 200:

NBul a 0;O Bul a &D4 Mayhe# D&4 -fficient !00 4 Addison Hesley4 200;N in 00O in C&&4 Principles of Parallel programming 4 Pearson /ducation4 200


40/42


APP/FD!L A S/+!A ! A"!EF PE!F"SSeriali,ation point measurements from the DCCA prototype& igure A&1 represents ho# a messafrom a connection is saved in a %ueue& igure A&2 represents a solution #here the %usynchroni,ation has $een alleviated& igure A&; represents a solution #ith a higher degree synchroni,ation (all of the connections threads must #ait until fileEutput"hread is done)&

"he main difference $et#een igure A&2 and igure A&; is the degree of synchroni,ation&

igure A&1

igure A&2

80


41/42


igure A&;

81


42/42


APP/FD!L B M/ME+G B EC? CEPG!F-Ising memcpy() function may have a significant performance impact4 as discussed in Chapter 8&

igure B&1 sho#s ho# memcpy is primarily used in the DCCA prototype& String is a user defined clfor handling a char array4 i&e& a string& "he Add mem$er function addslen characters from source s4 todestinationm3data&

Ising an iteration instead of memcpy is included #ithin C77 comments&

igure B&1

Documents

Akesson Rantzow Kandidatarbete 20100531