CA Lec03-Chapter 2-Appendix B-Memory Hierarchy Design


  • Computer Architecture Lecture 3: Memory Hierarchy Design (Chapter 2, Appendix B)

    Chih-Wei Liu, National Chiao Tung University, [email protected]

  • Since 1980, CPU has outpaced DRAM

    Gap grew 50% per year
    CPU: 60% per year (2X in 1.5 years)
    DRAM: 9% per year (2X in 10 years)

  • Introduction

    Programmers want unlimited amounts of memory with low latency
    Fast memory technology is more expensive per bit than slower memory
    Solution: organize the memory system into a hierarchy
    Entire addressable memory space available in the largest, slowest memory
    Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
    Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
    Gives the illusion of a large, fast memory being presented to the processor

  • Memory Hierarchy Design

    Memory hierarchy design becomes more crucial with recent multicore processors:
    Aggregate peak bandwidth grows with # cores:
    Intel Core i7 can generate two references per core per clock
    Four cores and 3.2 GHz clock
    25.6 billion 64-bit data references/second + 12.8 billion 128-bit instruction references = 409.6 GB/s! (see the calculation sketch below)
    DRAM bandwidth is only 6% of this (25 GB/s)
    Requires:
    Multi-port, pipelined caches
    Two levels of cache per core
    Shared third-level cache on chip
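
    The peak-bandwidth figure above follows directly from the core count, clock rate, and reference widths. A minimal calculation sketch (not from the original slides), assuming one 128-bit instruction fetch per core per clock, which matches the 12.8-billion figure:

    #include <stdio.h>

    int main(void) {
        double cores = 4, clock_hz = 3.2e9;
        double data_refs = cores * clock_hz * 2;   /* two data references per core per clock */
        double inst_refs = cores * clock_hz;       /* one 128-bit instruction reference per core per clock (assumed) */
        double bytes_per_s = data_refs * 8 + inst_refs * 16;  /* 64-bit data, 128-bit instructions */
        printf("data refs/s: %.1f billion\n", data_refs / 1e9);        /* 25.6 */
        printf("inst refs/s: %.1f billion\n", inst_refs / 1e9);        /* 12.8 */
        printf("peak demand: %.1f GB/s\n", bytes_per_s / 1e9);         /* 409.6 */
        printf("25 GB/s of DRAM covers %.0f%%\n", 25e9 / bytes_per_s * 100);  /* about 6% */
        return 0;
    }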

  • Memory Hierarchy

    Take advantage of the principle of locality to:
    Present as much memory as is available in the cheapest technology
    Provide access at the speed offered by the fastest technology

    [Figure: memory hierarchy pyramid - Processor (Registers, Control, Datapath), On-Chip Cache, Second-Level Cache (SRAM), Main Memory (DRAM/FLASH/PCM), Secondary Storage (Disk/FLASH/PCM), Tertiary Storage (Tape/Cloud Storage). Speed ranges from about 1 ns near the processor to 10s of ms for secondary storage and 10s of seconds for tertiary storage; size ranges from Ks-Ms of bytes in the caches up to 100s of GBs and TBs in storage]

  • Multicore Architecture

    [Figure: several processing nodes, each consisting of a CPU and its own local memory hierarchy (optimal fixed size), connected through an interconnection network]

  • The Principle of Locality

    The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
    Two different types of locality:
    Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
    Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
    HW relies on locality for speed

  • Memory Hierarchy Basics

    When a word is not found in the cache, a miss occurs:
    Fetch the word from a lower level in the hierarchy, requiring a higher-latency reference
    The lower level may be another cache or the main memory
    Also fetch the other words contained within the block
    Takes advantage of spatial locality
    Place the block into the cache in any location within its set, determined by the address:
    (block address) MOD (number of sets)

  • Hit and Miss

    Hit: data appears in some block in the upper level (e.g., Block X)
    Hit Rate: the fraction of memory accesses found in the upper level
    Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
    Miss: data needs to be retrieved from a block in the lower level (Block Y)
    Miss Rate = 1 - (Hit Rate)
    Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
    Hit Time << Miss Penalty

  • Cache Performance Formulas

    T_acc = T_hit + f_miss x T_miss
    (Average memory access time) = (Hit time) + (Miss rate)(Miss penalty)
    The times T_acc, T_hit, and T_miss can all be either:
    Real time (e.g., nanoseconds)
    Or, number of clock cycles, in contexts where the cycle time is known to be a constant
    Important: T_miss means the extra (not total) time for a miss, in addition to T_hit, which is incurred by all accesses

    [Figure: CPU <-> Cache <-> lower levels of the hierarchy; the hit time is the cache access time, the miss penalty is the cost of going to the lower levels]
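
    As a small illustration of the formula above (a sketch, not from the original slides), the average memory access time can be computed directly from the three quantities:

    #include <stdio.h>

    /* Average memory access time: T_acc = T_hit + f_miss * T_miss.
       Times may be in ns or in clock cycles, as long as the units are consistent. */
    double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Hypothetical values: 1-cycle hit, 5% miss rate, 100-cycle extra miss time. */
        printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 100.0));   /* 6.0 */
        return 0;
    }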

  • Four Questions for Memory Hierarchy

    Consider any level in a memory hierarchy.
    Remember: a block is the unit of data transfer between the given level and the levels below it.

    The level design is described by four behaviors:
    Block Placement: where could a new block be placed in the level?
    Block Identification: how is a block found if it is in the level?
    Block Replacement: which existing block should be replaced if necessary?
    Write Strategy: how are writes to the block handled?

  • Q1: Where can a block be placed in the upper level?

    Block 12 placed in an 8-block cache:
    Fully associative, direct mapped, 2-way set associative
    S.A. mapping = block number modulo number of sets

    [Figure: memory blocks 0-31 and an 8-block cache]
    Fully associative (full mapped): block 12 can go into any of the 8 cache blocks
    Direct mapped: (12 mod 8) = 4, so block 12 goes into cache block 4
    2-way set associative: (12 mod 4) = 0, so block 12 goes into either block of set 0
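
    A brief sketch (not from the original slides) of how the mapping rule above picks the set for block 12 under each associativity:

    #include <stdio.h>

    /* Set index for a block address: (block address) mod (number of sets),
       where number of sets = number of blocks / associativity. */
    unsigned set_index(unsigned block_addr, unsigned num_blocks, unsigned assoc) {
        return block_addr % (num_blocks / assoc);
    }

    int main(void) {
        printf("direct mapped: set %u\n", set_index(12, 8, 1));  /* 12 mod 8 = 4 */
        printf("2-way assoc:   set %u\n", set_index(12, 8, 2));  /* 12 mod 4 = 0 */
        printf("fully assoc:   set %u\n", set_index(12, 8, 8));  /* only one set */
        return 0;
    }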

  • Q2: How is a block found if it is in the upper level?

    Index used to look up candidates
    The index identifies the set in the cache
    The tag is used to identify the actual copy
    If no candidates match, then declare a cache miss
    The block is the minimum quantum of caching
    A data select field is used to select data within the block
    Many caching applications don't have a data select field
    A larger block size has distinct hardware advantages:
    Less tag overhead
    Exploits fast burst transfers from DRAM / over wide busses
    Disadvantages of a larger block size?
    Fewer blocks => more conflicts. Can waste bandwidth

    Address fields: Block Address = Tag + Index (set select); Block Offset (data select)

  • Review: Direct Mapped Cache

    Direct mapped 2^N byte cache:
    The uppermost (32 - N) bits are always the Cache Tag
    The lowest M bits are the Byte Select (Block Size = 2^M)
    Example: 1 KB direct mapped cache with 32 B blocks
    Index chooses the potential block
    Tag is checked to verify the block
    Byte select chooses the byte within the block

    [Figure: 32-bit address split into Cache Tag (bits 31-10, e.g. 0x50), Cache Index (bits 9-5, e.g. 0x01), and Byte Select (bits 4-0, e.g. 0x00); each cache entry holds a valid bit, a cache tag, and 32 bytes of cache data (Byte 0 ... Byte 31, Byte 32 ... Byte 63, ..., Byte 992 ... Byte 1023)]
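
    A minimal sketch (not from the original slides) of how a 32-bit address splits into tag, index, and byte select for the 1 KB / 32 B-block example above:

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK_SIZE 32u                        /* 2^5 bytes per block */
    #define CACHE_SIZE 1024u                      /* 1 KB, direct mapped */
    #define NUM_BLOCKS (CACHE_SIZE / BLOCK_SIZE)  /* 32 blocks -> 5 index bits */

    int main(void) {
        uint32_t addr = 0x00014020u;  /* picked so that tag = 0x50, index = 0x01, byte = 0x00 as in the example */
        uint32_t byte_select = addr & (BLOCK_SIZE - 1);           /* bits 4..0   */
        uint32_t index = (addr / BLOCK_SIZE) & (NUM_BLOCKS - 1);  /* bits 9..5   */
        uint32_t tag = addr / (BLOCK_SIZE * NUM_BLOCKS);          /* bits 31..10 */
        printf("tag = 0x%x, index = 0x%x, byte = 0x%x\n", tag, index, byte_select);
        return 0;
    }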

  • Direct Mapped Cache Architecture

    [Figure: the address is split into Tag, Frame #, and Offset; decode & row select picks one entry from the tag and block-frame arrays, the stored tag is compared against the address tag to produce Hit, and a mux selected by the offset delivers the data word]

  • Review: Set Associative Cache

    N-way set associative: N entries per Cache Index
    N direct mapped caches operate in parallel
    Example: two-way set associative cache
    Cache Index selects a set from the cache
    The two tags in the set are compared to the input in parallel
    Data is selected based on the tag comparison result

    [Figure: address split into Cache Tag (bits 31-9), Cache Index (bits 8-5), and Byte Select (bits 4-0); two ways, each with valid bit, cache tag, and cache data (Cache Block 0), are read in parallel, both tags are compared, the compare results are ORed into Hit, and Sel1/Sel0 drive a mux that selects the cache block]

  • Review: Fully Associative Cache

    Fully associative: every block can hold any line
    The address does not include a cache index
    Compare the Cache Tags of all cache entries in parallel
    Example: Block Size = 32 B blocks
    We need N 27-bit comparators
    Still have byte select to choose from within the block

    [Figure: address split into Cache Tag (27 bits, bits 31-5) and Byte Select (bits 4-0, e.g. 0x01); every entry's valid bit, cache tag, and 32 bytes of cache data are checked, with one tag comparator per entry operating in parallel]

  • Concluding Remarks

    Direct mapped cache = 1-way set associative cache
    Fully associative cache: there is only 1 set

  • Cache Size Equation

    Simple equation for the size of a cache:
    (Cache size) = (Block size) x (Number of sets) x (Set associativity)
    Can relate to the size of various address fields:
    (Block size) = 2^(# of offset bits)
    (Number of sets) = 2^(# of index bits)
    (# of tag bits) = (# of memory address bits) - (# of index bits) - (# of offset bits)

    Memory address fields: Tag | Index | Offset
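
    A small sketch (not from the original slides) that applies the relations above to a hypothetical 32 KB, 4-way set associative cache with 64-byte blocks and 32-bit addresses:

    #include <stdio.h>

    /* log2 for exact powers of two */
    unsigned log2u(unsigned x) { unsigned n = 0; while (x > 1) { x >>= 1; n++; } return n; }

    int main(void) {
        unsigned cache_size = 32 * 1024, block_size = 64, assoc = 4, addr_bits = 32;
        unsigned num_sets = cache_size / (block_size * assoc);        /* 128 sets */
        unsigned offset_bits = log2u(block_size);                     /* 6  */
        unsigned index_bits = log2u(num_sets);                        /* 7  */
        unsigned tag_bits = addr_bits - index_bits - offset_bits;     /* 19 */
        printf("sets = %u, offset = %u, index = %u, tag = %u bits\n",
               num_sets, offset_bits, index_bits, tag_bits);
        return 0;
    }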

  • Q3: Which block should be replaced on a miss?

    Easy for a direct mapped cache: only one choice
    Set associative or fully associative:
    LRU (least recently used): appealing, but hard to implement for high associativity
    Random: easy, but how well does it work?
    First in, first out (FIFO)

  • Q4: What happens on a write?

    Policy:
    Write-Through: data written to the cache block is also written to lower-level memory
    Write-Back: write data only to the cache; update the lower level when the block falls out of the cache
    Debugging:
    Write-Through: easy; Write-Back: hard
    Do read misses produce writes?
    Write-Through: no; Write-Back: yes
    Do repeated writes make it to the lower level?
    Write-Through: yes; Write-Back: no

    Additional option -- let writes to an un-cached address allocate a new cache line (write-allocate).
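
    For concreteness, a simplified sketch (not from the lecture) of how the two policies differ on a store that hits in the cache; write_lower_level is a hypothetical stand-in for the next level of the hierarchy:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint32_t tag;
        bool     valid;
        bool     dirty;        /* only meaningful for write-back */
        uint8_t  data[64];
    } cache_line_t;

    /* Hypothetical stand-in for the next level of the hierarchy. */
    void write_lower_level(uint32_t addr, uint8_t byte) {
        printf("lower level: mem[0x%x] <- 0x%x\n", addr, byte);
    }

    /* Write-through: the lower level is updated on every store
       (in practice via a write buffer so the CPU does not stall). */
    void store_write_through(cache_line_t *line, uint32_t addr, unsigned off, uint8_t byte) {
        line->data[off] = byte;
        write_lower_level(addr, byte);
    }

    /* Write-back: only the cache is updated; the block is marked dirty
       and written back later, when it is evicted. */
    void store_write_back(cache_line_t *line, unsigned off, uint8_t byte) {
        line->data[off] = byte;
        line->dirty = true;
    }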

  • Write Buffers

    [Figure: Processor <-> Cache, with a Write Buffer between the cache and lower-level memory; the buffer holds data awaiting write-through to lower-level memory]

    Q. Why a write buffer?
    A. So the CPU doesn't stall.
    Q. Why a buffer, why not just one register?
    A. Bursts of writes are common.
    Q. Are Read After Write (RAW) hazards an issue for the write buffer?
    A. Yes! Drain the buffer before the next read, or check the write buffer for a match on reads.

  • More on Cache Performance Metrics

    Can split access time into instructions & data:
    Avg. mem. access time = (% instruction accesses) x (inst. mem. access time) + (% data accesses) x (data mem. access time)

    Another formula from Chapter 1:
    CPU time = (CPU execution clock cycles + Memory stall clock cycles) x cycle time
    Useful for exploring ISA changes

    Can break stalls into reads and writes:
    Memory stall cycles = (Reads x read miss rate x read miss penalty) + (Writes x write miss rate x write miss penalty)
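
    A short sketch (not from the original slides) combining the stall breakdown with the CPU time formula above; all the input values are hypothetical:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical program and machine parameters */
        double reads = 2e9, writes = 1e9;                 /* memory accesses */
        double read_miss_rate = 0.04, write_miss_rate = 0.08;
        double read_penalty = 100, write_penalty = 100;   /* clock cycles    */
        double exec_cycles = 6e9, cycle_time = 0.5e-9;    /* 2 GHz clock     */

        double stall_cycles = reads * read_miss_rate * read_penalty
                            + writes * write_miss_rate * write_penalty;
        double cpu_time = (exec_cycles + stall_cycles) * cycle_time;
        printf("stall cycles = %.3g, CPU time = %.3g s\n", stall_cycles, cpu_time);
        return 0;
    }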

  • Sources of Cache Misses

    Compulsory (cold start or process migration, first reference): first access to a block
    Cold fact of life: not a whole lot you can do about it
    Note: if you are going to run billions of instructions, compulsory misses are insignificant
    Capacity:
    The cache cannot contain all the blocks accessed by the program
    Solution: increase cache size
    Conflict (collision):
    Multiple memory locations mapped to the same cache location
    Solution 1: increase cache size
    Solution 2: increase associativity
    Coherence (invalidation): another process (e.g., I/O) updates memory

  • Memory Hierarchy Basics

    Six basic cache optimizations:
    Larger block size: reduces compulsory misses; increases capacity and conflict misses, increases miss penalty
    Larger total cache capacity to reduce miss rate: increases hit time, increases power consumption
    Higher associativity: reduces conflict misses; increases hit time, increases power consumption
    Higher number of cache levels: reduces overall memory access time
    Giving priority to read misses over writes: reduces miss penalty
    Avoiding address translation in cache indexing: reduces hit time

  • 1. Larger Block Sizes

    Larger block size => fewer blocks in the cache
    Obvious advantage: reduces compulsory misses
    The reason is spatial locality
    Obvious disadvantage:
    Higher miss penalty: a larger block takes longer to move
    May increase conflict misses and capacity misses if the cache is small
    Don't let the increase in miss penalty outweigh the decrease in miss rate

  • 2. Large Caches

    Larger cache size => lower miss rate, but longer hit time
    Helps with both conflict and capacity misses
    May need a longer hit time AND/OR higher HW cost
    Popular in off-chip caches

  • 3. Higher Associativity

    Reduces conflict misses
    2:1 cache rule of thumb on miss rate:
    A 2-way set associative cache of size N/2 has about the same miss rate as a direct mapped cache of size N (held for cache sizes less than 128 KB)

  • 4. Multi-Level Caches

    2-level cache example:
    AMAT_L1 = Hit time_L1 + Miss rate_L1 x Miss penalty_L1
    AMAT_L2 = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2)
    Probably the best miss-penalty reduction method
    Definitions:
    Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate_L2)
    Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate_L1 x Miss rate_L2)
    Global miss rate is what matters

  • Multi-Level Caches (Cont.)

    Advantages:
    Capacity misses in L1 end up with a significant penalty reduction
    Conflict misses in L1 similarly get supplied by L2
    Holding the size of the 1st-level cache constant:
    Decreases the miss penalty of the 1st-level cache
    Or, increases the average global hit time a bit: hit time_L1 + miss rate_L1 x hit time_L2, but decreases the global miss rate
    Holding the total cache size constant:
    Global miss rate and miss penalty are about the same
    Decreases the average global hit time significantly!
    The new L1 is much smaller than the old L1

  • Miss Rate Example

    Suppose that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache
    Miss rate for the first-level cache = 40/1000 (4%)
    Local miss rate for the second-level cache = 20/40 (50%)
    Global miss rate for the second-level cache = 20/1000 (2%)

    Assume miss penalty_L2 is 200 CC, hit time_L2 is 10 CC, hit time_L1 is 1 CC, and there are 1.5 memory references per instruction. What is the average memory access time and the average number of stall cycles per instruction? Ignore the impact of writes.

    AMAT = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2) = 1 + 4% x (10 + 50% x 200) = 5.4 CC

    Average memory stalls per instruction = Misses per instruction_L1 x Hit time_L2 + Misses per instruction_L2 x Miss penalty_L2 = (40 x 1.5/1000) x 10 + (20 x 1.5/1000) x 200 = 6.6 CC
    Or (5.4 - 1.0) x 1.5 = 6.6 CC
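
    A tiny sketch (not from the original slides) that reproduces the numbers in this example from the two-level AMAT formula:

    #include <stdio.h>

    int main(void) {
        double hit_l1 = 1, hit_l2 = 10, penalty_l2 = 200;     /* clock cycles */
        double miss_l1 = 40.0 / 1000, local_miss_l2 = 20.0 / 40;
        double refs_per_inst = 1.5;

        double amat = hit_l1 + miss_l1 * (hit_l2 + local_miss_l2 * penalty_l2);
        double stalls_per_inst = (amat - hit_l1) * refs_per_inst;
        printf("AMAT = %.1f CC, stalls/instruction = %.1f CC\n", amat, stalls_per_inst);  /* 5.4, 6.6 */
        return 0;
    }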

  • 5. Giving Priority to Read Misses over Writes

    In write-through, write buffers complicate memory access in that they might hold the updated value of a location needed on a read miss
    RAW conflicts with main memory reads on cache misses
    If the read miss waits until the write buffer is empty, the read miss penalty increases
    Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue
    Write-back?
    Read miss replacing a dirty block
    Normal: write the dirty block to memory, and then do the read
    Instead, copy the dirty block to a write buffer, then do the read, and then do the write
    The CPU stalls less since it can restart as soon as the read is done

    Example (all three accesses map to cache index 0):
    SW R3, 512(R0)   ; cache index 0
    LW R1, 1024(R0)  ; cache index 0
    LW R2, 512(R0)   ; cache index 0
    R2 = R3? This is the hazard addressed by giving read misses priority over writes.

  • 6. Avoiding Address Translation during Indexing of the Cache

    Virtually addressed caches

    [Figure: three organizations ($ means cache)]
    Conventional organization: CPU -> VA -> TLB -> PA -> cache -> PA -> memory; address translation happens before the physically addressed cache is indexed
    Virtually addressed cache: CPU -> VA -> cache, translate only on a miss (VA -> TLB -> PA -> memory); the cache uses VA tags, which raises the synonym (alias) problem
    Overlapped organization: the cache and TLB are accessed in parallel with the VA, tags are compared using the PA, with a physically addressed L2 cache behind; overlapping cache access with VA translation requires the cache index to remain invariant across translation

  • Why not a Virtual Cache?

    A task switch causes the same VA to refer to different PAs
    Hence, the cache must be flushed
    Huge task-switch overhead
    Also creates huge compulsory miss rates for the new process
    The synonym or alias problem: different VAs map to the same PA
    Two copies of the same data in a virtual cache
    An anti-aliasing HW mechanism is required (complicated)
    SW can help
    I/O (always uses PAs)
    Requires mapping to VAs to interact with a virtual cache

  • Advanced Cache Optimizations

    Reducing hit time
    1. Small and simple caches
    2. Way prediction

    Increasing cache bandwidth
    3. Pipelined caches
    4. Multibanked caches
    5. Nonblocking caches

    Reducing miss penalty
    6. Critical word first
    7. Merging write buffers

    Reducing miss rate
    8. Compiler optimizations

    Reducing miss penalty or miss rate via parallelism
    9. Hardware prefetching
    10. Compiler prefetching

  • 1. Small and Simple L1 Caches

    Critical timing path in a cache: addressing the tag memory, then comparing tags, then selecting the correct set
    Indexing the tag memory and then comparing takes time
    Direct-mapped caches can overlap the tag compare and the transmission of data, since there is only one choice
    Lower associativity reduces power because fewer cache lines are accessed

  • L1 Size and Associativity

    [Figure: access time vs. size and associativity]

  • L1 Size and Associativity

    [Figure: energy per read vs. size and associativity]

  • 2. Fast Hit Times via Way Prediction

    How to combine the fast hit time of a direct mapped cache with the lower conflict misses of a 2-way set associative cache?
    Way prediction: keep extra bits in the cache to predict the way, or block within the set, of the next cache access
    The multiplexor is set early to select the desired block; only 1 tag comparison is performed that clock cycle, in parallel with reading the cache data
    Miss => first check the other blocks for matches in the next clock cycle
    Accuracy ~ 85%
    Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
    Used for instruction caches vs. data caches

    [Timing: hit time < way-miss hit time < miss penalty]

  • Way Prediction

    To improve hit time, predict the way to preset the mux
    A misprediction gives a longer hit time
    Prediction accuracy:
    > 90% for two-way
    > 80% for four-way
    The I-cache has better accuracy than the D-cache
    First used on the MIPS R10000 in the mid-90s
    Used on the ARM Cortex-A8
    Extend to predict the block as well ("way selection")
    Increases the misprediction penalty

  • 3. Increasing Cache Bandwidth by Pipelining

    Pipeline cache access to improve bandwidth
    Examples:
    Pentium: 1 cycle
    Pentium Pro - Pentium III: 2 cycles
    Pentium 4 - Core i7: 4 cycles
    Makes it easier to increase associativity
    But, pipelining the cache increases the access latency
    More clock cycles between the issue of the load and the use of the data
    Also increases the branch misprediction penalty

  • 4. Increasing Cache Bandwidth: Non-Blocking Caches

    A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
    Requires Full/Empty bits on registers, or out-of-order execution
    Requires multi-bank memories
    "Hit under miss" reduces the effective miss penalty by working during the miss instead of ignoring CPU requests
    "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
    Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
    Requires multiple memory banks (otherwise it cannot be supported)
    The Pentium Pro allows 4 outstanding memory misses

  • Non-blocking Cache Performance

    L2 must support this
    In general, processors can hide the L1 miss penalty but not the L2 miss penalty

  • 6: Increasing Cache Bandwidth via Multiple Banks

    Rather than treat the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
    E.g., the T1 (Niagara) L2 has 4 banks
    Banking works best when the accesses naturally spread themselves across the banks; the mapping of addresses to banks affects the behavior of the memory system
    A simple mapping that works well is sequential interleaving:
    Spread block addresses sequentially across the banks
    E.g., if there are 4 banks, bank 0 has all blocks whose address modulo 4 is 0, bank 1 has all blocks whose address modulo 4 is 1, and so on

  • 5. Increasing Cache Bandwidth via Multibanked Caches

    Organize the cache as independent banks to support simultaneous access (rather than as a single monolithic block)
    The ARM Cortex-A8 supports 1-4 banks for L2
    The Intel i7 supports 4 banks for L1 and 8 banks for L2
    Banking works best when the accesses naturally spread themselves across the banks
    Interleave banks according to block address

  • 6. Reduce Miss Penalty: Critical Word First and Early Restart

    The processor usually needs just one word of the block at a time
    Do not wait for the full block to be loaded before restarting the processor
    Critical word first: request the missed word first from memory and send it to the processor as soon as it arrives; let the processor continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
    Early restart: as soon as the requested word of the block arrives, send it to the processor and let the processor continue execution
    The benefits of critical word first and early restart depend on:
    Block size: generally useful only with large blocks
    The likelihood of another access to the portion of the block that has not yet been fetched
    Spatial locality problem: the processor tends to want the next sequential word, so it is not clear how much benefit there is

  • 7. Merging Write Buffer to Reduce Miss Penalty

    The write buffer allows the processor to continue while waiting for a write to complete to memory
    If the buffer contains modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry
    Increases the effective block size of a write for write-through caches on writes to sequential words/bytes, since multiword writes are more efficient to memory

  • Merging Write Buffer

    When storing to a block that is already pending in the write buffer, update the write buffer entry
    Reduces stalls due to a full write buffer

    [Figure: write buffer contents without write merging vs. with write merging]

  • 8. Reducing Misses by Compiler Optimizations

    McFarling [1989] reduced cache misses by 75% on an 8 KB direct mapped cache with 4-byte blocks, in software

    Instructions:
    Reorder procedures in memory so as to reduce conflict misses
    Profiling to look at conflicts (using tools they developed)

    Data:
    Loop Interchange: swap nested loops to access data in the order it is stored in memory (in sequential order)
    Loop Fusion: combine 2 independent loops that have the same looping and some variable overlap
    Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows
    Instead of accessing entire rows or columns, subdivide matrices into blocks
    Requires more memory accesses but improves the locality of the accesses

  • Loop Interchange Example

    /* Before */
    for (k = 0; k < 100; k = k+1)
      for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
          x[i][j] = 2 * x[i][j];

    /* After */
    for (k = 0; k < 100; k = k+1)
      for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
          x[i][j] = 2 * x[i][j];

    Sequential accesses instead of striding through memory every 100 words; improved spatial locality

  • Loop Fusion Example

    /* Before */
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

    /* After */
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
      }

    2 misses per access to a & c vs. one miss per access; improves spatial locality
    Perform different computations on the common data in two loops => fuse the two loops

  • Blocking Example

    /* Before */
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = r;
      }

    Two inner loops:
    Read all NxN elements of z[]
    Read N elements of 1 row of y[] repeatedly
    Write N elements of 1 row of x[]

    Capacity misses are a function of N and cache size:
    2N^3 + N^2 words accessed (assuming no conflict misses; otherwise it is worse)

    Idea: compute on a BxB submatrix that fits in the cache

  • Snapshot of x, y, z when N = 6, i = 1

    [Figure: access pattern before blocking. White: not yet touched; light: older access; dark: newer access]

  • Blocking Example

    /* After */
    for (jj = 0; jj < N; jj = jj+B)
      for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
          for (j = jj; j < min(jj+B,N); j = j+1) {
            r = 0;
            for (k = kk; k < min(kk+B,N); k = k+1)
              r = r + y[i][k]*z[k][j];
            x[i][j] = x[i][j] + r;
          }

    B is called the Blocking Factor
    Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
    Conflict misses too?

  • The Age of Accesses to x, y, z when B = 3

    [Figure: access pattern after blocking. Note, in contrast to the previous figure, the smaller number of elements accessed]

  • 9. Hardware Prefetching of Instructions and Data

    Prefetching relies on having extra memory bandwidth that can be used without penalty

    Instruction prefetching:
    Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
    The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into an instruction stream buffer

    Data prefetching:
    The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages
    Prefetching is invoked if there are 2 successive L2 cache misses to a page, and if the distance between those cache blocks is ...

  • 10. Reducing Miss Penalty or Miss Rate by Compiler-Controlled Prefetching of Data

    A prefetch instruction is inserted before the data is needed

    Data prefetch:
    Register prefetch: load data into a register (HP PA-RISC loads)
    Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
    Special prefetching instructions cannot cause faults; a form of speculative execution

    Issuing prefetch instructions takes time
    Is the cost of prefetch issues ...

  • Summary


  • Memory Technology

    Performance metrics:
    Latency is the concern of caches
    Bandwidth is the concern of multiprocessors and I/O
    Access time: time between a read request and when the desired word arrives
    Cycle time: minimum time between unrelated requests to memory

    DRAM is used for main memory, SRAM is used for caches

  • Memory Technology

    SRAM: static random access memory
    Requires low power to retain bits, since no refresh is needed
    But requires 6 transistors/bit (vs. 1 transistor/bit for DRAM)

    DRAM:
    One transistor/bit
    Must be re-written after being read
    Must also be periodically refreshed
    Every ~8 ms
    Each row can be refreshed simultaneously
    Address lines are multiplexed:
    Upper half of address: row access strobe (RAS)
    Lower half of address: column access strobe (CAS)

  • DRAM Technology

    Emphasis is on cost per bit and capacity
    Multiplex the address lines, cutting the number of address pins in half
    Row access strobe (RAS) first, then column access strobe (CAS)
    Memory is organized as a 2D matrix; rows go to a buffer
    A subsequent CAS selects a sub-row
    Use only a single transistor to store a bit
    Reading that bit can destroy the information
    Refresh each bit periodically (e.g., every 8 milliseconds) by writing it back
    Keep the refresh time to less than 5% of the total time
    DRAM capacity is 4 to 8 times that of SRAM

  • DRAM Logical Organization (4 Mbit)

    [Figure: a 2,048 x 2,048 memory array; address pins A0-A10 (11 bits) feed the row select and, via the sense amps & I/O, the column decoder that drives the D/Q data pins; one word line selects a row of storage cells. The row and column addresses are each the square root of the number of bits per RAS/CAS]

  • DRAM Technology (cont.)

    DIMM: dual inline memory module
    DRAM chips are commonly sold on small boards called DIMMs
    DIMMs typically contain 4 to 16 DRAMs

    Slowdown in DRAM capacity growth:
    Four times the capacity every three years, for more than 20 years
    New chips only double capacity every two years, since 1998

    DRAM performance is growing at a slower rate:
    RAS (related to latency): 5% per year
    CAS (related to bandwidth): 10%+ per year

  • RAS Improvement

  • Quest for DRAM Performance

    1. Fast Page Mode
    Add timing signals that allow repeated accesses to the row buffer without another row access time
    Such a buffer comes naturally, as each array buffers 1,024 to 2,048 bits for each access
    2. Synchronous DRAM (SDRAM)
    Add a clock signal to the DRAM interface, so that the repeated transfers do not bear the overhead of synchronizing with the DRAM controller
    3. Double Data Rate (DDR SDRAM)
    Transfer data on both the rising edge and the falling edge of the DRAM clock signal, doubling the peak data rate
    DDR2 lowers power by dropping the voltage from 2.5 to 1.8 volts and offers higher clock rates: up to 400 MHz
    DDR3 drops to 1.5 volts and has higher clock rates: up to 800 MHz
    DDR4 drops to 1.2 volts, with clock rates up to 1600 MHz

    Improved bandwidth, not latency

  • DRAM name based on peak chip transfers/sec; DIMM name based on peak DIMM MBytes/sec

    (M transfers/second = 2 x clock rate; MBytes/s per DIMM = 8 x M transfers/second)

    Standard  Clock Rate (MHz)  M transfers/second  DRAM Name   MBytes/s/DIMM  DIMM Name
    DDR       133               266                 DDR266      2128           PC2100
    DDR       150               300                 DDR300      2400           PC2400
    DDR       200               400                 DDR400      3200           PC3200
    DDR2      266               533                 DDR2-533    4264           PC4300
    DDR2      333               667                 DDR2-667    5336           PC5300
    DDR2      400               800                 DDR2-800    6400           PC6400
    DDR3      533               1066                DDR3-1066   8528           PC8500
    DDR3      666               1333                DDR3-1333   10664          PC10700
    DDR3      800               1600                DDR3-1600   12800          PC12800

    Fastest for sale 4/06 ($125/GB)
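
    A small sketch (not from the original slides) of the "x 2" and "x 8" relationships behind the table, deriving the transfer rate and DIMM bandwidth from the clock rate:

    #include <stdio.h>

    int main(void) {
        unsigned clock_mhz = 400;                 /* e.g., DDR2 at 400 MHz           */
        unsigned mtransfers = clock_mhz * 2;      /* data moves on both clock edges  */
        unsigned mbytes_per_s = mtransfers * 8;   /* 8-byte (64-bit) DIMM data path  */
        printf("DDR2-%u, PC%u\n", mtransfers, mbytes_per_s);   /* DDR2-800, PC6400 */
        return 0;
    }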

  • DRAM Performance

  • Graphics Memory

    GDDR5 is graphics memory based on DDR3
    Graphics memory:
    Achieves 2-5X bandwidth per DRAM vs. DDR3
    Wider interfaces (32 vs. 16 bit)
    Higher clock rate
    Possible because they are attached via soldering instead of socketed DIMM modules

  • Memory Power Consumption

  • SRAM Technology

    Caches use SRAM: Static Random Access Memory
    SRAM uses six transistors per bit to prevent the information from being disturbed when read
    No need to refresh
    SRAM needs only minimal power to retain the charge in standby mode
    Good for embedded applications
    No difference between access time and cycle time for SRAM
    Emphasis is on speed and capacity
    SRAM address lines are not multiplexed
    SRAM speed is 8 to 16x that of DRAM

  • ROM and Flash

    Embedded processor memory
    Read-only memory (ROM):
    Programmed at the time of manufacture
    Only a single transistor per bit to represent 1 or 0
    Used for the embedded program and for constants
    Non-volatile and indestructible

    Flash memory:
    Must be erased (in blocks) before being overwritten
    Non-volatile, but allows the memory to be modified
    Reads at almost DRAM speeds, but writes are 10 to 100 times slower
    DRAM capacity per chip and MB per dollar is about 4 to 8 times greater than flash
    Cheaper than SDRAM, more expensive than disk
    Slower than SRAM, faster than disk

  • Memory Dependability

    Memory is susceptible to cosmic rays
    Soft errors: dynamic errors
    Detected and fixed by error correcting codes (ECC)
    Hard errors: permanent errors
    Use spare rows to replace defective rows
    Chipkill: a RAID-like error recovery technique

  • Virtual Memory?

    The limits of physical addressing:
    All programs share one physical address space
    Machine language programs must be aware of the machine organization
    No way to prevent a program from accessing any machine resource
    Recall: many processes use only a small portion of their address space
    Virtual memory divides physical memory into blocks (called pages or segments) and allocates them to different processes
    With virtual memory, the processor produces virtual addresses that are translated by a combination of HW and SW to physical addresses (called memory mapping or address translation)

  • Virtual Memory: Add a Layer of Indirection

    [Figure: CPU (A0-A31, D0-D31) -> Address Translation -> Memory (A0-A31, D0-D31); virtual addresses on the CPU side, physical addresses on the memory side]

    User programs run in a standardized virtual address space
    Address translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory
    Hardware supports modern OS features: protection, translation, sharing

  • Virtual Memory

    [Figure: mapping of virtual addresses to physical memory by a page table]

  • Virtual Memory (cont.)

    Permits applications to grow bigger than main memory size
    Helps with multiple process management:
    Each process gets its own chunk of memory
    Permits protection of one process's chunks from another
    Mapping of multiple chunks onto shared physical memory
    Mapping also facilitates relocation (a program can run in any memory location, and can be moved during execution)
    The application and CPU run in virtual space (logical memory, 0 - max)
    Mapping onto physical space is invisible to the application

    Cache vs. virtual memory:
    A block becomes a page or segment
    A miss becomes a page fault or address fault

  • 3 Advantages of VM

    Translation:
    A program can be given a consistent view of memory, even though physical memory is scrambled
    Makes multithreading reasonable (now used a lot!)
    Only the most important part of a program (the working set) must be in physical memory
    Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later

    Protection:
    Different threads (or processes) are protected from each other
    Different pages can be given special behavior (read only, invisible to user programs, etc.)
    Kernel data is protected from user programs
    Very important for protection from malicious programs

    Sharing:
    Can map the same physical page to multiple users (shared memory)

  • Virtual Memory

    Protection via virtual memory:
    Keeps processes in their own memory space

    Role of the architecture:
    Provide user mode and supervisor mode
    Protect certain aspects of CPU state
    Provide mechanisms for switching between user mode and supervisor mode
    Provide mechanisms to limit memory accesses
    Provide a TLB to translate addresses

  • Page Tables Encode Virtual Address Spaces

    A virtual address space is divided into blocks of memory called pages
    A machine usually supports pages of a few sizes (MIPS R4000)
    A page table is indexed by a virtual address
    A valid page table entry codes the physical memory frame address for the page
    The OS manages the page table for each ASID (address space ID)

    [Figure: a virtual address indexes the page table, whose valid entries point to frames in the physical memory space]

  • Details of Page Table

    The page table maps virtual page numbers to physical frames (PTE = Page Table Entry)
    Virtual memory => treat main memory as a cache for disk

    [Figure: the Page Table Base Register plus the virtual page number index into the page table, which is located in physical memory; each entry holds a valid bit (V), access rights, and a physical address (PA). Virtual address = virtual page number + 12-bit offset; physical address = physical page number + the same 12-bit offset]
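
    A minimal sketch (not the lecture's code) of the lookup the figure describes, assuming 4 KB pages (12-bit offset) and a simple one-level table indexed by the virtual page number:

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_OFFSET_BITS 12u                 /* 4 KB pages (assumed) */

    typedef struct {
        bool     valid;                          /* V bit: page is in physical memory */
        uint32_t access_rights;                  /* e.g., read/write permissions      */
        uint32_t phys_frame;                     /* physical page (frame) number      */
    } pte_t;

    /* Translate a virtual address with a one-level page table.
       Returns true on success; false means a page fault the OS must handle. */
    bool translate(const pte_t *page_table, uint32_t vaddr, uint32_t *paddr) {
        uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;
        uint32_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1);
        const pte_t *pte = &page_table[vpn];
        if (!pte->valid)
            return false;                        /* V=0: on disk or not yet allocated */
        *paddr = (pte->phys_frame << PAGE_OFFSET_BITS) | offset;
        return true;
    }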

  • Page Table Entry (PTE)?

    What is in a Page Table Entry (or PTE)?
    A pointer to the next-level page table or to the actual page
    Permission bits: valid, read-only, read-write, write-only

    Example: Intel x86 architecture PTE:
    Address has the same format as on the previous slide (10, 10, 12-bit offset)
    Intermediate page tables are called Directories

    Bit layout (bits 31-12: page frame number / physical page number; bits 11-9: free for the OS):
    P: Present (same as the valid bit in other architectures)
    W: Writeable
    U: User accessible
    PWT: Page write transparent: external cache write-through
    PCD: Page cache disabled (page cannot be cached)
    A: Accessed: page has been accessed recently
    D: Dirty (PTE only): page has been modified recently
    L: L=1 means a 4 MB page (directory entry only); the bottom 22 bits of the virtual address serve as the offset
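
    A brief sketch (not from the original slides) of how the flag bits listed above could be tested with masks, following the x86 bit positions P=0 through L=7:

    #include <stdint.h>
    #include <stdbool.h>

    /* Flag bits of an x86 page table / page directory entry, per the layout above. */
    #define PTE_P    (1u << 0)   /* Present                              */
    #define PTE_W    (1u << 1)   /* Writeable                            */
    #define PTE_U    (1u << 2)   /* User accessible                      */
    #define PTE_PWT  (1u << 3)   /* Page write transparent               */
    #define PTE_PCD  (1u << 4)   /* Page cache disabled                  */
    #define PTE_A    (1u << 5)   /* Accessed                             */
    #define PTE_D    (1u << 6)   /* Dirty                                */
    #define PTE_L    (1u << 7)   /* 1 = 4 MB page (directory entry only) */

    /* The page frame number lives in bits 31-12. */
    static inline uint32_t pte_frame(uint32_t pte)   { return pte >> 12; }
    static inline bool     pte_present(uint32_t pte) { return (pte & PTE_P) != 0; }
    static inline bool     pte_user_writable(uint32_t pte) {
        return (pte & (PTE_P | PTE_U | PTE_W)) == (PTE_P | PTE_U | PTE_W);
    }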

  • Cache vs. Virtual Memory

    Replacement:
    A cache miss is handled by hardware
    A page fault is usually handled by the OS

    Addresses:
    The virtual memory space is determined by the address size of the CPU
    The cache size is independent of the CPU address size

    Lower-level memory:
    For caches, the main memory is not shared by something else
    For virtual memory, most of the disk contains the file system
    The file system is addressed differently, usually in I/O space
    The virtual memory lower level is usually called SWAP space

  • The Same 4 Questions for Virtual Memory

    Block Placement:
    Choice: lower miss rates and complex placement, or vice versa
    The miss penalty is huge, so choose a low miss rate => place anywhere
    Similar to a fully associative cache model

    Block Identification: both use an additional data structure
    Fixed-size pages: use a page table
    Variable-sized segments: use a segment table

    Block Replacement: LRU is the best
    However, true LRU is a bit complex, so use an approximation:
    The page table contains a use tag, and on an access the use tag is set
    The OS checks the tags every so often, records what it sees in a data structure, then clears them all
    On a miss the OS decides which page has been used the least and replaces that one

    Write Strategy: always write back
    Due to the access time of the disk, write-through is silly
    Use a dirty bit to write back only pages that have been modified

  • Techniques for Fast Address Translation

    The page table is kept in main memory (kernel memory)
    Each process has a page table
    Every data/instruction access requires two memory accesses:
    One for the page table and one for the data/instruction
    Can be solved by the use of a special fast-lookup hardware cache called associative registers or translation look-aside buffers (TLBs)
    If locality applies, then cache the recent translations
    TLB = translation look-aside buffer
    TLB entry: virtual page number, physical page number, protection bit, use bit, dirty bit

  • Translation Look-Aside Buffers (TLB)

    A cache on translations
    Fully associative, set associative, or direct mapped

    TLBs are:
    Small: typically not more than 128-256 entries
    Fully associative

    [Figure: translation with a TLB - the CPU sends the VA to the TLB; on a TLB hit, the PA goes to the cache, and on a cache miss to main memory; on a TLB miss, the translation is performed via the page table before the access continues]

  • The TLB Caches Page Table Entries

    The TLB caches page table entries (for an ASID)
    V=0 pages either reside on disk or have not yet been allocated
    The OS handles V=0: page fault
    Physical and virtual pages must be the same size!

    [Figure: the virtual address (page number, offset) is looked up in the TLB; a matching entry supplies the physical frame address, which is combined with the offset to form the physical address; on a TLB miss, the page table supplies the mapping]

  • Caching Applied to Address Translation

    [Figure: the CPU issues a virtual address; if the translation is cached in the TLB, the physical address is produced directly, otherwise the MMU translates it; the resulting physical address is then used for the (untranslated) data read or write to physical memory]

  • Virtual Machines

    Support isolation and security
    Sharing a computer among many unrelated users
    Enabled by the raw speed of processors, making the overhead more acceptable

    Allow different ISAs and operating systems to be presented to user programs
    System Virtual Machines
    SVM software is called a virtual machine monitor or hypervisor
    Individual virtual machines that run under the monitor are called guest VMs

  • Impact of VMs on Virtual Memory

    Each guest OS maintains its own set of page tables
    The VMM adds a level of memory between physical and virtual memory, called "real memory"
    The VMM maintains a shadow page table that maps guest virtual addresses to physical addresses
    Requires the VMM to detect a guest's changes to its own page table
    Occurs naturally if accessing the page table pointer is a privileged operation