CA Lec03-Chapter 2-Appendix B-Memory Hierarchy Design


  • Computer Architecture Lecture 3: Memory Hierarchy Design (Chapter 2, Appendix B)

    Chih-Wei Liu, National Chiao Tung University, [email protected]

  • Since 1980, CPU has outpaced DRAM

    Gap grew 50% per year
    CPU: 60% per year (2X in 1.5 years)
    DRAM: 9% per year (2X in 10 years)

  • Introduction

    Programmers want unlimited amounts of memory with low latency
    Fast memory technology is more expensive per bit than slower memory
    Solution: organize the memory system into a hierarchy
    Entire addressable memory space available in the largest, slowest memory
    Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
    Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
    Gives the illusion of a large, fast memory being presented to the processor

  • Memory Hierarchy Design

    Memory hierarchy design becomes more crucial with recent multicore processors:
    Aggregate peak bandwidth grows with # cores:
    Intel Core i7 can generate two references per core per clock
    Four cores and 3.2 GHz clock
    25.6 billion 64-bit data references/second + 12.8 billion 128-bit instruction references = 409.6 GB/s! (see the calculation sketch below)
    DRAM bandwidth is only 6% of this (25 GB/s)
    Requires:
    Multi-port, pipelined caches
    Two levels of cache per core
    Shared third-level cache on chip
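
    The peak-bandwidth figure above follows directly from the core count, clock rate, and reference widths. A minimal calculation sketch (not from the original slides), assuming one 128-bit instruction fetch per core per clock, which matches the 12.8-billion figure:

    #include <stdio.h>

    int main(void) {
        double cores = 4, clock_hz = 3.2e9;
        double data_refs = cores * clock_hz * 2;   /* two data references per core per clock */
        double inst_refs = cores * clock_hz;       /* one 128-bit instruction reference per core per clock (assumed) */
        double bytes_per_s = data_refs * 8 + inst_refs * 16;  /* 64-bit data, 128-bit instructions */
        printf("data refs/s: %.1f billion\n", data_refs / 1e9);        /* 25.6 */
        printf("inst refs/s: %.1f billion\n", inst_refs / 1e9);        /* 12.8 */
        printf("peak demand: %.1f GB/s\n", bytes_per_s / 1e9);         /* 409.6 */
        printf("25 GB/s of DRAM covers %.0f%%\n", 25e9 / bytes_per_s * 100);  /* about 6% */
        return 0;
    }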

  • Memory Hierarchy

    Take advantage of the principle of locality to:
    Present as much memory as is available in the cheapest technology
    Provide access at the speed offered by the fastest technology

    [Figure: memory hierarchy pyramid - Processor (Registers, Control, Datapath), On-Chip Cache, Second-Level Cache (SRAM), Main Memory (DRAM/FLASH/PCM), Secondary Storage (Disk/FLASH/PCM), Tertiary Storage (Tape/Cloud Storage). Speed ranges from about 1 ns near the processor to 10s of ms for secondary storage and 10s of seconds for tertiary storage; size ranges from Ks-Ms of bytes in the caches up to 100s of GBs and TBs in storage]

  • Multicore Architecture

    [Figure: several processing nodes, each consisting of a CPU and its own local memory hierarchy (optimal fixed size), connected through an interconnection network]

  • The Principle of Locality

    The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
    Two different types of locality:
    Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
    Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
    HW relies on locality for speed

  • Memory Hierarchy Basics

    When a word is not found in the cache, a miss occurs:
    Fetch the word from a lower level in the hierarchy, requiring a higher-latency reference
    The lower level may be another cache or the main memory
    Also fetch the other words contained within the block
    Takes advantage of spatial locality
    Place the block into the cache in any location within its set, determined by the address:
    (block address) MOD (number of sets)

  • Hit and Miss

    Hit: data appears in some block in the upper level (e.g., Block X)
    Hit Rate: the fraction of memory accesses found in the upper level
    Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
    Miss: data needs to be retrieved from a block in the lower level (Block Y)
    Miss Rate = 1 - (Hit Rate)
    Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
    Hit Time << Miss Penalty

  • Cache Performance Formulas

    T_acc = T_hit + f_miss x T_miss
    (Average memory access time) = (Hit time) + (Miss rate)(Miss penalty)
    The times T_acc, T_hit, and T_miss can all be either:
    Real time (e.g., nanoseconds)
    Or, number of clock cycles, in contexts where the cycle time is known to be a constant
    Important: T_miss means the extra (not total) time for a miss, in addition to T_hit, which is incurred by all accesses

    [Figure: CPU <-> Cache <-> lower levels of the hierarchy; the hit time is the cache access time, the miss penalty is the cost of going to the lower levels]
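
    As a small illustration of the formula above (a sketch, not from the original slides), the average memory access time can be computed directly from the three quantities:

    #include <stdio.h>

    /* Average memory access time: T_acc = T_hit + f_miss * T_miss.
       Times may be in ns or in clock cycles, as long as the units are consistent. */
    double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Hypothetical values: 1-cycle hit, 5% miss rate, 100-cycle extra miss time. */
        printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 100.0));   /* 6.0 */
        return 0;
    }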

  • Four Questions for Memory Hierarchy

    Consider any level in a memory hierarchy.
    Remember: a block is the unit of data transfer between the given level and the levels below it.

    The level design is described by four behaviors:
    Block Placement: where could a new block be placed in the level?
    Block Identification: how is a block found if it is in the level?
    Block Replacement: which existing block should be replaced if necessary?
    Write Strategy: how are writes to the block handled?

  • Q1: Where can a block be placed in the upper level?

    Block 12 placed in an 8-block cache:
    Fully associative, direct mapped, 2-way set associative
    S.A. mapping = block number modulo number of sets

    [Figure: memory blocks 0-31 and an 8-block cache]
    Fully associative (full mapped): block 12 can go into any of the 8 cache blocks
    Direct mapped: (12 mod 8) = 4, so block 12 goes into cache block 4
    2-way set associative: (12 mod 4) = 0, so block 12 goes into either block of set 0
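
    A brief sketch (not from the original slides) of how the mapping rule above picks the set for block 12 under each associativity:

    #include <stdio.h>

    /* Set index for a block address: (block address) mod (number of sets),
       where number of sets = number of blocks / associativity. */
    unsigned set_index(unsigned block_addr, unsigned num_blocks, unsigned assoc) {
        return block_addr % (num_blocks / assoc);
    }

    int main(void) {
        printf("direct mapped: set %u\n", set_index(12, 8, 1));  /* 12 mod 8 = 4 */
        printf("2-way assoc:   set %u\n", set_index(12, 8, 2));  /* 12 mod 4 = 0 */
        printf("fully assoc:   set %u\n", set_index(12, 8, 8));  /* only one set */
        return 0;
    }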

  • Q2: How is a block found if it is in the upper level?

    Index used to look up candidates
    The index identifies the set in the cache
    The tag is used to identify the actual copy
    If no candidates match, then declare a cache miss
    The block is the minimum quantum of caching
    A data select field is used to select data within the block
    Many caching applications don't have a data select field
    A larger block size has distinct hardware advantages:
    Less tag overhead
    Exploits fast burst transfers from DRAM / over wide busses
    Disadvantages of a larger block size?
    Fewer blocks => more conflicts. Can waste bandwidth

    Address fields: Block Address = Tag + Index (set select); Block Offset (data select)

  • Review: Direct Mapped Cache

    Direct mapped 2^N byte cache:
    The uppermost (32 - N) bits are always the Cache Tag
    The lowest M bits are the Byte Select (Block Size = 2^M)
    Example: 1 KB direct mapped cache with 32 B blocks
    Index chooses the potential block
    Tag is checked to verify the block
    Byte select chooses the byte within the block

    [Figure: 32-bit address split into Cache Tag (bits 31-10, e.g. 0x50), Cache Index (bits 9-5, e.g. 0x01), and Byte Select (bits 4-0, e.g. 0x00); each cache entry holds a valid bit, a cache tag, and 32 bytes of cache data (Byte 0 ... Byte 31, Byte 32 ... Byte 63, ..., Byte 992 ... Byte 1023)]
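
    A minimal sketch (not from the original slides) of how a 32-bit address splits into tag, index, and byte select for the 1 KB / 32 B-block example above:

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK_SIZE 32u                        /* 2^5 bytes per block */
    #define CACHE_SIZE 1024u                      /* 1 KB, direct mapped */
    #define NUM_BLOCKS (CACHE_SIZE / BLOCK_SIZE)  /* 32 blocks -> 5 index bits */

    int main(void) {
        uint32_t addr = 0x00014020u;  /* picked so that tag = 0x50, index = 0x01, byte = 0x00 as in the example */
        uint32_t byte_select = addr & (BLOCK_SIZE - 1);           /* bits 4..0   */
        uint32_t index = (addr / BLOCK_SIZE) & (NUM_BLOCKS - 1);  /* bits 9..5   */
        uint32_t tag = addr / (BLOCK_SIZE * NUM_BLOCKS);          /* bits 31..10 */
        printf("tag = 0x%x, index = 0x%x, byte = 0x%x\n", tag, index, byte_select);
        return 0;
    }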

  • Direct Mapped Cache Architecture

    [Figure: the address is split into Tag, Frame #, and Offset; decode & row select picks one entry from the tag and block-frame arrays, the stored tag is compared against the address tag to produce Hit, and a mux selected by the offset delivers the data word]

  • Review: Set Associative Cache

    N-way set associative: N entries per Cache Index
    N direct mapped caches operate in parallel
    Example: two-way set associative cache
    Cache Index selects a set from the cache
    The two tags in the set are compared to the input in parallel
    Data is selected based on the tag comparison result

    [Figure: address split into Cache Tag (bits 31-9), Cache Index (bits 8-5), and Byte Select (bits 4-0); two ways, each with valid bit, cache tag, and cache data (Cache Block 0), are read in parallel, both tags are compared, the compare results are ORed into Hit, and Sel1/Sel0 drive a mux that selects the cache block]

  • Review: Fully Associative Cache

    Fully associative: every block can hold any line
    The address does not include a cache index
    Compare the Cache Tags of all cache entries in parallel
    Example: Block Size = 32 B blocks
    We need N 27-bit comparators
    Still have byte select to choose from within the block

    [Figure: address split into Cache Tag (27 bits, bits 31-5) and Byte Select (bits 4-0, e.g. 0x01); every entry's valid bit, cache tag, and 32 bytes of cache data are checked, with one tag comparator per entry operating in parallel]

  • Concluding Remarks

    Direct mapped cache = 1-way set associative cache
    Fully associative cache: there is only 1 set

  • Cache Size Equation

    Simple equation for the size of a cache:
    (Cache size) = (Block size) x (Number of sets) x (Set associativity)
    Can relate to the size of various address fields:
    (Block size) = 2^(# of offset bits)
    (Number of sets) = 2^(# of index bits)
    (# of tag bits) = (# of memory address bits) - (# of index bits) - (# of offset bits)

    Memory address fields: Tag | Index | Offset
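
    A small sketch (not from the original slides) that applies the relations above to a hypothetical 32 KB, 4-way set associative cache with 64-byte blocks and 32-bit addresses:

    #include <stdio.h>

    /* log2 for exact powers of two */
    unsigned log2u(unsigned x) { unsigned n = 0; while (x > 1) { x >>= 1; n++; } return n; }

    int main(void) {
        unsigned cache_size = 32 * 1024, block_size = 64, assoc = 4, addr_bits = 32;
        unsigned num_sets = cache_size / (block_size * assoc);        /* 128 sets */
        unsigned offset_bits = log2u(block_size);                     /* 6  */
        unsigned index_bits = log2u(num_sets);                        /* 7  */
        unsigned tag_bits = addr_bits - index_bits - offset_bits;     /* 19 */
        printf("sets = %u, offset = %u, index = %u, tag = %u bits\n",
               num_sets, offset_bits, index_bits, tag_bits);
        return 0;
    }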

  • Q3: Which block should be replaced on a miss?

    Easy for a direct mapped cache: only one choice
    Set associative or fully associative:
    LRU (least recently used): appealing, but hard to implement for high associativity
    Random: easy, but how well does it work?
    First in, first out (FIFO)

  • Q4: What happens on a write?

    Policy:
    Write-Through: data written to the cache block is also written to lower-level memory
    Write-Back: write data only to the cache; update the lower level when the block falls out of the cache
    Debugging:
    Write-Through: easy; Write-Back: hard
    Do read misses produce writes?
    Write-Through: no; Write-Back: yes
    Do repeated writes make it to the lower level?
    Write-Through: yes; Write-Back: no

    Additional option -- let writes to an un-cached address allocate a new cache line (write-allocate).
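
    For concreteness, a simplified sketch (not from the lecture) of how the two policies differ on a store that hits in the cache; write_lower_level is a hypothetical stand-in for the next level of the hierarchy:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint32_t tag;
        bool     valid;
        bool     dirty;        /* only meaningful for write-back */
        uint8_t  data[64];
    } cache_line_t;

    /* Hypothetical stand-in for the next level of the hierarchy. */
    void write_lower_level(uint32_t addr, uint8_t byte) {
        printf("lower level: mem[0x%x] <- 0x%x\n", addr, byte);
    }

    /* Write-through: the lower level is updated on every store
       (in practice via a write buffer so the CPU does not stall). */
    void store_write_through(cache_line_t *line, uint32_t addr, unsigned off, uint8_t byte) {
        line->data[off] = byte;
        write_lower_level(addr, byte);
    }

    /* Write-back: only the cache is updated; the block is marked dirty
       and written back later, when it is evicted. */
    void store_write_back(cache_line_t *line, unsigned off, uint8_t byte) {
        line->data[off] = byte;
        line->dirty = true;
    }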

  • Write Buffers

    [Figure: Processor <-> Cache, with a Write Buffer between the cache and lower-level memory; the buffer holds data awaiting write-through to lower-level memory]

    Q. Why a write buffer?
    A. So the CPU doesn't stall.
    Q. Why a buffer, why not just one register?
    A. Bursts of writes are common.
    Q. Are Read After Write (RAW) hazards an issue for the write buffer?
    A. Yes! Drain the buffer before the next read, or check the write buffer for a match on reads.

  • More on Cache Performance Metrics

    Can split access time into instructions & data:
    Avg. mem. access time = (% instruction accesses) x (inst. mem. access time) + (% data accesses) x (data mem. access time)

    Another formula from Chapter 1:
    CPU time = (CPU execution clock cycles + Memory stall clock cycles) x cycle time
    Useful for exploring ISA changes

    Can break stalls into reads and writes:
    Memory stall cycles = (Reads x read miss rate x read miss penalty) + (Writes x write miss rate x write miss penalty)
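
    A short sketch (not from the original slides) combining the stall breakdown with the CPU time formula above; all the input values are hypothetical:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical program and machine parameters */
        double reads = 2e9, writes = 1e9;                 /* memory accesses */
        double read_miss_rate = 0.04, write_miss_rate = 0.08;
        double read_penalty = 100, write_penalty = 100;   /* clock cycles    */
        double exec_cycles = 6e9, cycle_time = 0.5e-9;    /* 2 GHz clock     */

        double stall_cycles = reads * read_miss_rate * read_penalty
                            + writes * write_miss_rate * write_penalty;
        double cpu_time = (exec_cycles + stall_cycles) * cycle_time;
        printf("stall cycles = %.3g, CPU time = %.3g s\n", stall_cycles, cpu_time);
        return 0;
    }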

  • Sources of Cache Misses

    Compulsory (cold start or process migration, first reference): first access to a block
    Cold fact of life: not a whole lot you can do about it
    Note: if you are going to run billions of instructions, compulsory misses are insignificant
    Capacity:
    The cache cannot contain all the blocks accessed by the program
    Solution: increase cache size
    Conflict (collision):
    Multiple memory locations mapped to the same cache location
    Solution 1: increase cache size
    Solution 2: increase associativity
    Coherence (invalidation): another process (e.g., I/O) updates memory

  • Memory Hierarchy Basics

    Six basic cache optimizations:
    Larger block size: reduces compulsory misses; increases capacity and conflict misses, increases miss penalty
    Larger total cache capacity to reduce miss rate: increases hit time, increases power consumption
    Higher associativity: reduces conflict misses; increases hit time, increases power consumption
    Higher number of cache levels: reduces overall memory access time
    Giving priority to read misses over writes: reduces miss penalty
    Avoiding address translation in cache indexing: reduces hit time

  • 1. Larger Block Sizes

    Larger block size => fewer blocks in the cache
    Obvious advantage: reduces compulsory misses
    The reason is spatial locality
    Obvious disadvantage:
    Higher miss penalty: a larger block takes longer to move
    May increase conflict misses and capacity misses if the cache is small
    Don't let the increase in miss penalty outweigh the decrease in miss rate

  • 2. Large Caches

    Larger cache size => lower miss rate, but longer hit time
    Helps with both conflict and capacity misses
    May need a longer hit time AND/OR higher HW cost
    Popular in off-chip caches

  • 3. Higher Associativity

    Reduces conflict misses
    2:1 cache rule of thumb on miss rate:
    A 2-way set associative cache of size N/2 has about the same miss rate as a direct mapped cache of size N (held for cache sizes less than 128 KB)

  • 4. Multi-Level Caches

    2-level cache example:
    AMAT_L1 = Hit time_L1 + Miss rate_L1 x Miss penalty_L1
    AMAT_L2 = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2)
    Probably the best miss-penalty reduction method
    Definitions:
    Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate_L2)
    Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate_L1 x Miss rate_L2)
    Global miss rate is what matters

  • Multi-Level Caches (Cont.)

    Advantages:
    Capacity misses in L1 end up with a significant penalty reduction
    Conflict misses in L1 similarly get supplied by L2
    Holding the size of the 1st-level cache constant:
    Decreases the miss penalty of the 1st-level cache
    Or, increases the average global hit time a bit: hit time_L1 + miss rate_L1 x hit time_L2, but decreases the global miss rate
    Holding the total cache size constant:
    Global miss rate and miss penalty are about the same
    Decreases the average global hit time significantly!
    The new L1 is much smaller than the old L1

  • Miss Rate Example

    Suppose that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache
    Miss rate for the first-level cache = 40/1000 (4%)
    Local miss rate for the second-level cache = 20/40 (50%)
    Global miss rate for the second-level cache = 20/1000 (2%)

    Assume miss penalty_L2 is 200 CC, hit time_L2 is 10 CC, hit time_L1 is 1 CC, and there are 1.5 memory references per instruction. What is the average memory access time and the average number of stall cycles per instruction? Ignore the impact of writes.

    AMAT = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2) = 1 + 4% x (10 + 50% x 200) = 5.4 CC

    Average memory stalls per instruction = Misses per instruction_L1 x Hit time_L2 + Misses per instruction_L2 x Miss penalty_L2 = (40 x 1.5/1000) x 10 + (20 x 1.5/1000) x 200 = 6.6 CC
    Or (5.4 - 1.0) x 1.5 = 6.6 CC
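
    A tiny sketch (not from the original slides) that reproduces the numbers in this example from the two-level AMAT formula:

    #include <stdio.h>

    int main(void) {
        double hit_l1 = 1, hit_l2 = 10, penalty_l2 = 200;     /* clock cycles */
        double miss_l1 = 40.0 / 1000, local_miss_l2 = 20.0 / 40;
        double refs_per_inst = 1.5;

        double amat = hit_l1 + miss_l1 * (hit_l2 + local_miss_l2 * penalty_l2);
        double stalls_per_inst = (amat - hit_l1) * refs_per_inst;
        printf("AMAT = %.1f CC, stalls/instruction = %.1f CC\n", amat, stalls_per_inst);  /* 5.4, 6.6 */
        return 0;
    }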

  • 5. Giving Priority to Read Misses over Writes

    In write-through, write buffers complicate memory access in that they might hold the updated value of a location needed on a read miss
    RAW conflicts with main memory reads on cache misses
    If the read miss waits until the write buffer is empty, the read miss penalty increases
    Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue
    Write-back?
    Read miss replacing a dirty block
    Normal: write the dirty block to memory, and then do the read
    Instead, copy the dirty block to a write buffer, then do the read, and then do the write
    The CPU stalls less since it can restart as soon as the read is done

    Example (all three accesses map to cache index 0):
    SW R3, 512(R0)   ; cache index 0
    LW R1, 1024(R0)  ; cache index 0
    LW R2, 512(R0)   ; cache index 0
    R2 = R3? This is the hazard addressed by giving read misses priority over writes.

  • 6. Avoiding Address Translation during Indexing of the Cache

    Virtually addressed caches

    [Figure: three organizations ($ means cache)]
    Conventional organization: CPU -> VA -> TLB -> PA -> cache -> PA -> memory; address translation happens before the physically addressed cache is indexed
    Virtually addressed cache: CPU -> VA -> cache, translate only on a miss (VA -> TLB -> PA -> memory); the cache uses VA tags, which raises the synonym (alias) problem
    Overlapped organization: the cache and TLB are accessed in parallel with the VA, tags are compared using the PA, with a physically addressed L2 cache behind; overlapping cache access with VA translation requires the cache index to remain invariant across translation

  • Why not a Virtual Cache?

    A task switch causes the same VA to refer to different PAs
    Hence, the cache must be flushed
    Huge task-switch overhead
    Also creates huge compulsory miss rates for the new process
    The synonym or alias problem: different VAs map to the same PA
    Two copies of the same data in a virtual cache
    An anti-aliasing HW mechanism is required (complicated)
    SW can help
    I/O (always uses PAs)
    Requires mapping to VAs to interact with a virtual cache

  • Advanced Cache Optimizations

    Reducing hit time
    1. Small and simple caches
    2. Way prediction

    Increasing cache bandwidth
    3. Pipelined caches
    4. Multibanked caches
    5. Nonblocking caches

    Reducing miss penalty
    6. Critical word first
    7. Merging write buffers

    Reducing miss rate
    8. Compiler optimizations

    Reducing miss penalty or miss rate via parallelism
    9. Hardware prefetching
    10. Compiler prefetching

  • 1. Small and Simple L1 Caches

    Critical timing path in a cache: addressing the tag memory, then comparing tags, then selecting the correct set
    Indexing the tag memory and then comparing takes time
    Direct-mapped caches can overlap the tag compare and the transmission of data, since there is only one choice
    Lower associativity reduces power because fewer cache lines are accessed

  • L1 Size and Associativity

    [Figure: access time vs. size and associativity]

  • L1 Size and Associativity

    [Figure: energy per read vs. size and associativity]

  • 2. Fast Hit Times via Way Prediction

    How to combine the fast hit time of a direct mapped cache with the lower conflict misses of a 2-way set associative cache?
    Way prediction: keep extra bits in the cache to predict the way, or block within the set, of the next cache access
    The multiplexor is set early to select the desired block; only 1 tag comparison is performed that clock cycle, in parallel with reading the cache data
    Miss => first check the other blocks for matches in the next clock cycle
    Accuracy ~ 85%
    Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
    Used for instruction caches vs. data caches

    [Timing: hit time < way-miss hit time < miss penalty]

  • Way Prediction

    To improve hit time, predict the way to preset the mux
    A misprediction gives a longer hit time
    Prediction accuracy:
    > 90% for two-way
    > 80% for four-way
    The I-cache has better accuracy than the D-cache
    First used on the MIPS R10000 in the mid-90s
    Used on the ARM Cortex-A8
    Extend to predict the block as well ("way selection")
    Increases the misprediction penalty

  • 3. Increasing Cache Bandwidth by Pipelining

    Pipeline cache access to improve bandwidth
    Examples:
    Pentium: 1 cycle
    Pentium Pro - Pentium III: 2 cycles
    Pentium 4 - Core i7: 4 cycles
    Makes it easier to increase associativity
    But, pipelining the cache increases the access latency
    More clock cycles between the issue of the load and the use of the data
    Also increases the branch misprediction penalty

  • 4. Increasing Cache Bandwidth: Non-Blocking Caches

    A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
    Requires Full/Empty bits on registers, or out-of-order execution
    Requires multi-bank memories
    "Hit under miss" reduces the effective miss penalty by working during the miss instead of ignoring CPU requests
    "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
    Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
    Requires multiple memory banks (otherwise it cannot be supported)
    The Pentium Pro allows 4 outstanding memory misses

  • Non-blocking Cache Performance

    L2 must support this
    In general, processors can hide the L1 miss penalty but not the L2 miss penalty

  • 6: Increasing Cache Bandwidth via Multiple Banks

    Rather than treat the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
    E.g., the T1 (Niagara) L2 has 4 banks
    Banking works best when the accesses naturally spread themselves across the banks; the mapping of addresses to banks affects the behavior of the memory system
    A simple mapping that works well is sequential interleaving:
    Spread block addresses sequentially across the banks
    E.g., if there are 4 banks, bank 0 has all blocks whose address modulo 4 is 0, bank 1 has all blocks whose address modulo 4 is 1, and so on

  • 5. Increasing Cache Bandwidth via Multibanked Caches

    Organize the cache as independent banks to support simultaneous access (rather than as a single monolithic block)
    The ARM Cortex-A8 supports 1-4 banks for L2
    The Intel i7 supports 4 banks for L1 and 8 banks for L2
    Banking works best when the accesses naturally spread themselves across the banks
    Interleave banks according to block address

  • 6. Reduce Miss Penalty: Critical Word First and Early Restart

    The processor usually needs just one word of the block at a time
    Do not wait for the full block to be loaded before restarting the processor
    Critical word first: request the missed word first from memory and send it to the processor as soon as it arrives; let the processor continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
    Early restart: as soon as the requested word of the block arrives, send it to the processor and let the processor continue execution
    The benefits of critical word first and early restart depend on:
    Block size: generally useful only with large blocks
    The likelihood of another access to the portion of the block that has not yet been fetched
    Spatial locality problem: the processor tends to want the next sequential word, so it is not clear how much benefit there is

  • 7. Merging Write Buffer to Reduce Miss Penalty

    The write buffer allows the processor to continue while waiting for a write to complete to memory
    If the buffer contains modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry
    Increases the effective block size of a write for write-through caches on writes to sequential words/bytes, since multiword writes are more efficient to memory

  • Merging Write Buffer

    When storing to a block that is already pending in the write buffer, update the write buffer entry
    Reduces stalls due to a full write buffer

    [Figure: write buffer contents without write merging vs. with write merging]

  • 8. Reducing Misses by Compiler Optimizations

    McFarling [1989] reduced cache misses by 75% on an 8 KB direct mapped cache with 4-byte blocks, in software

    Instructions:
    Reorder procedures in memory so as to reduce conflict misses
    Profiling to look at conflicts (using tools they developed)

    Data:
    Loop Interchange: swap nested loops to access data in the order it is stored in memory (in sequential order)
    Loop Fusion: combine 2 independent loops that have the same looping and some variable overlap
    Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows
    Instead of accessing entire rows or columns, subdivide matrices into blocks
    Requires more memory accesses but improves the locality of the accesses

  • Loop Interchange Example

    /* Before */
    for (k = 0; k < 100; k = k+1)
      for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
          x[i][j] = 2 * x[i][j];

    /* After */
    for (k = 0; k < 100; k = k+1)
      for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
          x[i][j] = 2 * x[i][j];

    Sequential accesses instead of striding through memory every 100 words; improved spatial locality

  • Loop Fusion Example

    /* Before */
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

    /* After */
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
      }

    2 misses per access to a & c vs. one miss per access; improves spatial locality
    Perform different computations on the common data in two loops => fuse the two loops

  • Blocking Example

    /* Before */
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = r;
      }

    Two inner loops:
    Read all NxN elements of z[]
    Read N elements of 1 row of y[] repeatedly
    Write N elements of 1 row of x[]

    Capacity misses are a function of N and cache size:
    2N^3 + N^2 words accessed (assuming no conflict misses; otherwise it is worse)

    Idea: compute on a BxB submatrix that fits in the cache

  • Snapshot of x, y, z when N = 6, i = 1

    [Figure: access pattern before blocking. White: not yet touched; light: older access; dark: newer access]

  • Blocking Example

    /* After */
    for (jj = 0; jj < N; jj = jj+B)
      for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
          for (j = jj; j < min(jj+B,N); j = j+1) {
            r = 0;
            for (k = kk; k < min(kk+B,N); k = k+1)
              r = r + y[i][k]*z[k][j];
            x[i][j] = x[i][j] + r;
          }

    B is called the Blocking Factor
    Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
    Conflict misses too?

  • The Age of Accesses to x, y, z when B = 3

    [Figure: access pattern after blocking. Note, in contrast to the previous figure, the smaller number of elements accessed]

  • 9. Hardware Prefetching of Instructions and Data

    Prefetching relies on having extra memory bandwidth that can be used without penalty

    Instruction prefetching:
    Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
    The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into an instruction stream buffer

    Data prefetching:
    The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages
    Prefetching is invoked if there are 2 successive L2 cache misses to a page, and if the distance between those cache blocks is ...

  • 10. Reducing Miss Penalty or Miss Rate by Compiler-Controlled Prefetching of Data

    A prefetch instruction is inserted before the data is needed

    Data prefetch:
    Register prefetch: load data into a register (HP PA-RISC loads)
    Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
    Special prefetching instructions cannot cause faults; a form of speculative execution

    Issuing prefetch instructions takes time
    Is the cost of prefetch issues ...

  • Summary


  • Memory Technology

    Performance metrics:
    Latency is the concern of caches
    Bandwidth is the concern of multiprocessors and I/O
    Access time: time between a read request and when the desired word arrives
    Cycle time: minimum time between unrelated requests to memory

    DRAM is used for main memory, SRAM is used for caches

  • Memory Technology

    SRAM: static random access memory
    Requires low power to retain bits, since no refresh is needed
    But requires 6 transistors/bit (vs. 1 transistor/bit for DRAM)

    DRAM:
    One transistor/bit
    Must be re-written after being read
    Must also be periodically refreshed
    Every ~8 ms
    Each row can be refreshed simultaneously
    Address lines are multiplexed:
    Upper half of address: row access strobe (RAS)
    Lower half of address: column access strobe (CAS)

  • DRAM Technology

    Emphasis is on cost per bit and capacity
    Multiplex the address lines, cutting the number of address pins in half
    Row access strobe (RAS) first, then column access strobe (CAS)
    Memory is organized as a 2D matrix; rows go to a buffer
    A subsequent CAS selects a sub-row
    Use only a single transistor to store a bit
    Reading that bit can destroy the information
    Refresh each bit periodically (e.g., every 8 milliseconds) by writing it back
    Keep the refresh time to less than 5% of the total time
    DRAM capacity is 4 to 8 times that of SRAM

  • DRAM Logical Organization (4 Mbit)

    [Figure: a 2,048 x 2,048 memory array; address pins A0-A10 (11 bits) feed the row select and, via the sense amps & I/O, the column decoder that drives the D/Q data pins; one word line selects a row of storage cells. The row and column addresses are each the square root of the number of bits per RAS/CAS]

  • DRAM Technology (cont.)

    DIMM: dual inline memory module
    DRAM chips are commonly sold on small boards called DIMMs
    DIMMs typically contain 4 to 16 DRAMs

    Slowdown in DRAM capacity growth:
    Four times the capacity every three years, for more than 20 years
    New chips only double capacity every two years, since 1998

    DRAM performance is growing at a slower rate:
    RAS (related to latency): 5% per year
    CAS (related to bandwidth): 10%+ per year

  • RAS Improvement

  • Quest for DRAM Performance

    1. Fast Page Mode
    Add timing signals that allow repeated accesses to the row buffer without another row access time
    Such a buffer comes naturally, as each array buffers 1,024 to 2,048 bits for each access
    2. Synchronous DRAM (SDRAM)
    Add a clock signal to the DRAM interface, so that the repeated transfers do not bear the overhead of synchronizing with the DRAM controller
    3. Double Data Rate (DDR SDRAM)
    Transfer data on both the rising edge and the falling edge of the DRAM clock signal, doubling the peak data rate
    DDR2 lowers power by dropping the voltage from 2.5 to 1.8 volts and offers higher clock rates: up to 400 MHz
    DDR3 drops to 1.5 volts and has higher clock rates: up to 800 MHz
    DDR4 drops to 1.2 volts, with clock rates up to 1600 MHz

    Improved bandwidth, not latency

  • DRAM name based on peak chip transfers/sec; DIMM name based on peak DIMM MBytes/sec

    (M transfers/second = 2 x clock rate; MBytes/s per DIMM = 8 x M transfers/second)

    Standard  Clock Rate (MHz)  M transfers/second  DRAM Name   MBytes/s/DIMM  DIMM Name
    DDR       133               266                 DDR266      2128           PC2100
    DDR       150               300                 DDR300      2400           PC2400
    DDR       200               400                 DDR400      3200           PC3200
    DDR2      266               533                 DDR2-533    4264           PC4300
    DDR2      333               667                 DDR2-667    5336           PC5300
    DDR2      400               800                 DDR2-800    6400           PC6400
    DDR3      533               1066                DDR3-1066   8528           PC8500
    DDR3      666               1333                DDR3-1333   10664          PC10700
    DDR3      800               1600                DDR3-1600   12800          PC12800

    Fastest for sale 4/06 ($125/GB)
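
    A small sketch (not from the original slides) of the "x 2" and "x 8" relationships behind the table, deriving the transfer rate and DIMM bandwidth from the clock rate:

    #include <stdio.h>

    int main(void) {
        unsigned clock_mhz = 400;                 /* e.g., DDR2 at 400 MHz           */
        unsigned mtransfers = clock_mhz * 2;      /* data moves on both clock edges  */
        unsigned mbytes_per_s = mtransfers * 8;   /* 8-byte (64-bit) DIMM data path  */
        printf("DDR2-%u, PC%u\n", mtransfers, mbytes_per_s);   /* DDR2-800, PC6400 */
        return 0;
    }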

  • DRAM Performance

  • Graphics Memory

    GDDR5 is graphics memory based on DDR3
    Graphics memory:
    Achieves 2-5X bandwidth per DRAM vs. DDR3
    Wider interfaces (32 vs. 16 bit)
    Higher clock rate
    Possible because they are attached via soldering instead of socketed DIMM modules

  • Memory Power Consumption

  • SRAM Technology

    Caches use SRAM: Static Random Access Memory
    SRAM uses six transistors per bit to prevent the information from being disturbed when read
    No need to refresh
    SRAM needs only minimal power to retain the charge in standby mode
    Good for embedded applications
    No difference between access time and cycle time for SRAM
    Emphasis is on speed and capacity
    SRAM address lines are not multiplexed
    SRAM speed is 8 to 16x that of DRAM

  • ROM and Flash

    Embedded processor memory
    Read-only memory (ROM):
    Programmed at the time of manufacture
    Only a single transistor per bit to represent 1 or 0
    Used for the embedded program and for constants
    Non-volatile and indestructible

    Flash memory:
    Must be erased (in blocks) before being overwritten
    Non-volatile, but allows the memory to be modified
    Reads at almost DRAM speeds, but writes are 10 to 100 times slower
    DRAM capacity per chip and MB per dollar is about 4 to 8 times greater than flash
    Cheaper than SDRAM, more expensive than disk
    Slower than SRAM, faster than disk

  • Memory Dependability

    Memory is susceptible to cosmic rays
    Soft errors: dynamic errors
    Detected and fixed by error correcting codes (ECC)
    Hard errors: permanent errors
    Use spare rows to replace defective rows
    Chipkill: a RAID-like error recovery technique

  • Virtual Memory?

    The limits of physical addressing:
    All programs share one physical address space
    Machine language programs must be aware of the machine organization
    No way to prevent a program from accessing any machine resource
    Recall: many processes use only a small portion of their address space
    Virtual memory divides physical memory into blocks (called pages or segments) and allocates them to different processes
    With virtual memory, the processor produces virtual addresses that are translated by a combination of HW and SW to physical addresses (called memory mapping or address translation)

  • Virtual Memory: Add a Layer of Indirection

    [Figure: CPU (A0-A31, D0-D31) -> Address Translation -> Memory (A0-A31, D0-D31); virtual addresses on the CPU side, physical addresses on the memory side]

    User programs run in a standardized virtual address space
    Address translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory
    Hardware supports modern OS features: protection, translation, sharing

  • Virtual Memory

    [Figure: mapping of virtual addresses to physical memory by a page table]

  • Virtual Memory (cont.)

    Permits applications to grow bigger than main memory size
    Helps with multiple process management:
    Each process gets its own chunk of memory
    Permits protection of one process's chunks from another
    Mapping of multiple chunks onto shared physical memory
    Mapping also facilitates relocation (a program can run in any memory location, and can be moved during execution)
    The application and CPU run in virtual space (logical memory, 0 - max)
    Mapping onto physical space is invisible to the application

    Cache vs. virtual memory:
    A block becomes a page or segment
    A miss becomes a page fault or address fault

  • 3 Advantages of VM

    Translation:
    A program can be given a consistent view of memory, even though physical memory is scrambled
    Makes multithreading reasonable (now used a lot!)
    Only the most important part of a program (the working set) must be in physical memory
    Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later

    Protection:
    Different threads (or processes) are protected from each other
    Different pages can be given special behavior (read only, invisible to user programs, etc.)
    Kernel data is protected from user programs
    Very important for protection from malicious programs

    Sharing:
    Can map the same physical page to multiple users (shared memory)

  • Virtual Memory

    Protection via virtual memory:
    Keeps processes in their own memory space

    Role of the architecture:
    Provide user mode and supervisor mode
    Protect certain aspects of CPU state
    Provide mechanisms for switching between user mode and supervisor mode
    Provide mechanisms to limit memory accesses
    Provide a TLB to translate addresses

  • Page Tables Encode Virtual Address Spaces

    A virtual address space is divided into blocks of memory called pages
    A machine usually supports pages of a few sizes (MIPS R4000)
    A page table is indexed by a virtual address
    A valid page table entry codes the physical memory frame address for the page
    The OS manages the page table for each ASID (address space ID)

    [Figure: a virtual address indexes the page table, whose valid entries point to frames in the physical memory space]

  • Details of Page Table

    The page table maps virtual page numbers to physical frames (PTE = Page Table Entry)
    Virtual memory => treat main memory as a cache for disk

    [Figure: the Page Table Base Register plus the virtual page number index into the page table, which is located in physical memory; each entry holds a valid bit (V), access rights, and a physical address (PA). Virtual address = virtual page number + 12-bit offset; physical address = physical page number + the same 12-bit offset]
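
    A minimal sketch (not the lecture's code) of the lookup the figure describes, assuming 4 KB pages (12-bit offset) and a simple one-level table indexed by the virtual page number:

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_OFFSET_BITS 12u                 /* 4 KB pages (assumed) */

    typedef struct {
        bool     valid;                          /* V bit: page is in physical memory */
        uint32_t access_rights;                  /* e.g., read/write permissions      */
        uint32_t phys_frame;                     /* physical page (frame) number      */
    } pte_t;

    /* Translate a virtual address with a one-level page table.
       Returns true on success; false means a page fault the OS must handle. */
    bool translate(const pte_t *page_table, uint32_t vaddr, uint32_t *paddr) {
        uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;
        uint32_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1);
        const pte_t *pte = &page_table[vpn];
        if (!pte->valid)
            return false;                        /* V=0: on disk or not yet allocated */
        *paddr = (pte->phys_frame << PAGE_OFFSET_BITS) | offset;
        return true;
    }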

  • Page Table Entry (PTE)?

    What is in a Page Table Entry (or PTE)?
    A pointer to the next-level page table or to the actual page
    Permission bits: valid, read-only, read-write, write-only

    Example: Intel x86 architecture PTE:
    Address has the same format as on the previous slide (10, 10, 12-bit offset)
    Intermediate page tables are called Directories

    Bit layout (bits 31-12: page frame number / physical page number; bits 11-9: free for the OS):
    P: Present (same as the valid bit in other architectures)
    W: Writeable
    U: User accessible
    PWT: Page write transparent: external cache write-through
    PCD: Page cache disabled (page cannot be cached)
    A: Accessed: page has been accessed recently
    D: Dirty (PTE only): page has been modified recently
    L: L=1 means a 4 MB page (directory entry only); the bottom 22 bits of the virtual address serve as the offset
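
    A brief sketch (not from the original slides) of how the flag bits listed above could be tested with masks, following the x86 bit positions P=0 through L=7:

    #include <stdint.h>
    #include <stdbool.h>

    /* Flag bits of an x86 page table / page directory entry, per the layout above. */
    #define PTE_P    (1u << 0)   /* Present                              */
    #define PTE_W    (1u << 1)   /* Writeable                            */
    #define PTE_U    (1u << 2)   /* User accessible                      */
    #define PTE_PWT  (1u << 3)   /* Page write transparent               */
    #define PTE_PCD  (1u << 4)   /* Page cache disabled                  */
    #define PTE_A    (1u << 5)   /* Accessed                             */
    #define PTE_D    (1u << 6)   /* Dirty                                */
    #define PTE_L    (1u << 7)   /* 1 = 4 MB page (directory entry only) */

    /* The page frame number lives in bits 31-12. */
    static inline uint32_t pte_frame(uint32_t pte)   { return pte >> 12; }
    static inline bool     pte_present(uint32_t pte) { return (pte & PTE_P) != 0; }
    static inline bool     pte_user_writable(uint32_t pte) {
        return (pte & (PTE_P | PTE_U | PTE_W)) == (PTE_P | PTE_U | PTE_W);
    }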

  • Cache vs. Virtual Memory

    Replacement:
    A cache miss is handled by hardware
    A page fault is usually handled by the OS

    Addresses:
    The virtual memory space is determined by the address size of the CPU
    The cache size is independent of the CPU address size

    Lower-level memory:
    For caches, the main memory is not shared by something else
    For virtual memory, most of the disk contains the file system
    The file system is addressed differently, usually in I/O space
    The virtual memory lower level is usually called SWAP space

  • The Same 4 Questions for Virtual Memory

    Block Placement:
    Choice: lower miss rates and complex placement, or vice versa
    The miss penalty is huge, so choose a low miss rate => place anywhere
    Similar to a fully associative cache model

    Block Identification: both use an additional data structure
    Fixed-size pages: use a page table
    Variable-sized segments: use a segment table

    Block Replacement: LRU is the best
    However, true LRU is a bit complex, so use an approximation:
    The page table contains a use tag, and on an access the use tag is set
    The OS checks the tags every so often, records what it sees in a data structure, then clears them all
    On a miss the OS decides which page has been used the least and replaces that one

    Write Strategy: always write back
    Due to the access time of the disk, write-through is silly
    Use a dirty bit to write back only pages that have been modified

  • Techniques for Fast Address Translation

    The page table is kept in main memory (kernel memory)
    Each process has a page table
    Every data/instruction access requires two memory accesses:
    One for the page table and one for the data/instruction
    Can be solved by the use of a special fast-lookup hardware cache called associative registers or translation look-aside buffers (TLBs)
    If locality applies, then cache the recent translations
    TLB = translation look-aside buffer
    TLB entry: virtual page number, physical page number, protection bit, use bit, dirty bit

  • Translation Look-Aside Buffers (TLB)

    A cache on translations
    Fully associative, set associative, or direct mapped

    TLBs are:
    Small: typically not more than 128-256 entries
    Fully associative

    [Figure: translation with a TLB - the CPU sends the VA to the TLB; on a TLB hit, the PA goes to the cache, and on a cache miss to main memory; on a TLB miss, the translation is performed via the page table before the access continues]

  • The TLB Caches Page Table Entries

    The TLB caches page table entries (for an ASID)
    V=0 pages either reside on disk or have not yet been allocated
    The OS handles V=0: page fault
    Physical and virtual pages must be the same size!

    [Figure: the virtual address (page number, offset) is looked up in the TLB; a matching entry supplies the physical frame address, which is combined with the offset to form the physical address; on a TLB miss, the page table supplies the mapping]

  • Caching Applied to Address Translation

    [Figure: the CPU issues a virtual address; if the translation is cached in the TLB, the physical address is produced directly, otherwise the MMU translates it; the resulting physical address is then used for the (untranslated) data read or write to physical memory]

  • Virtual Machines

    Support isolation and security
    Sharing a computer among many unrelated users
    Enabled by the raw speed of processors, making the overhead more acceptable

    Allow different ISAs and operating systems to be presented to user programs
    System Virtual Machines
    SVM software is called a virtual machine monitor or hypervisor
    Individual virtual machines that run under the monitor are called guest VMs

  • Impact of VMs on Virtual Memory

    Each guest OS maintains its own set of page tables
    The VMM adds a level of memory between physical and virtual memory, called "real memory"
    The VMM maintains a shadow page table that maps guest virtual addresses to physical addresses
    Requires the VMM to detect a guest's changes to its own page table
    Occurs naturally if accessing the page table pointer is a privileged operation