Lois Orosa and Rodolfo Azevedo
University of Campinas
Revisiting Load Value Speculation: an Approach to Mainstream Processors
2
Projects
  Load Value Speculation
  Frequent Value Locality
  On-chip photonics (Jorge González, PhD candidate)
3
Introduction to Value Speculation (I)
Proposed in the 1990s
Improves ILP by breaking true data dependencies (RAW)
Speculation on all instructions
The prediction is written to the output register
Predictors are indexed by PC (at fetch time)
The original proposals were very complex for their time (many changes to the OoO engine)
Recently, Perais and Seznec revisited the topic [HPCA'13] [ISCA'14] [HPCA'15]
  They propose simplifications to the implementation
  They propose new predictors
4
Introduction to Value Speculation (II)
Confidence counters [Perais'13] (per instruction) to increase precision
  Speculate only when the confidence is high
  Reduce mispredictions
  Decrease coverage
  Incremented when a prediction is correct, reset on a misprediction
Precision
  If the misprediction penalty is low, the system can tolerate low precision
  If the misprediction penalty is high, precision should be high (99% or more)
The prediction has to be available before dispatch time
  Available cycles: from fetch to dispatch
  The predictor delay is not critical
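The counter policy above (increment on a correct prediction, reset on a misprediction, speculate only at high confidence) can be sketched as follows. This is a minimal illustration, not the implementation from the talk: the 3-bit width, the "saturated means confident" threshold, and the names ConfidenceCounter and run are assumptions made for the example.

```python
class ConfidenceCounter:
    """Saturating counter: +1 on a correct prediction, reset to 0 on a miss."""

    def __init__(self, bits=3):
        # Assumed width; the slides do not state the counter size.
        self.max_value = (1 << bits) - 1
        self.value = 0

    def is_confident(self):
        # Assumed threshold: speculate only when the counter is saturated.
        return self.value == self.max_value

    def update(self, prediction_was_correct):
        if prediction_was_correct:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = 0


def run(outcomes, bits=3):
    """Replay a list of booleans (was the raw prediction correct?).

    Returns (used, correct_used): how many predictions were actually
    used for speculation, and how many of those were correct.
    """
    counter = ConfidenceCounter(bits)
    used = 0
    correct_used = 0
    for ok in outcomes:
        if counter.is_confident():
            used += 1
            correct_used += int(ok)
        counter.update(ok)
    return used, correct_used


if __name__ == "__main__":
    outcomes = [True] * 10 + [False] + [True] * 10
    used, correct_used = run(outcomes)
    print("coverage:", used / len(outcomes))
    print("precision:", correct_used / used)
```

In this run the raw predictor is right 20 times out of 21, but only 7 predictions are used. This is the trade-off listed on the slide: gating on confidence reduces the number of mispredictions that reach the pipeline, and it also decreases coverage.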
5
Introduction to Value Speculation (III)
Validation
  At execution time (OoO changes, small misprediction penalty)
  At commit time (no OoO changes, higher misprediction penalty)
Recovering from a misprediction
  Selective reissue: faster, more complex (validation at execution time)
  Pipeline squashing: slower, simpler
Two main problems
  Register port pressure
    New extra ports (extra writes for predictions, extra reads for validations and predictor updates)
  Back-to-back predictions
    Predictors may depend on previous values
6
Contributions
Analysis of the potential of Value Speculation in a narrow processor for different types of instructions
Reducing complexity in narrow-width-issue processors by speculating only on load instructions
AV predictor: two-phase value predictor with prediction of addresses
XLStride predictor: multilevel stride predictor
7
Baseline Processor & Benchmarks
Baseline: real narrow-width-issue processor
ZSIM simulator
  Westmere OoO x86-64, 4-issue, 2-level branch predictor
  128-entry ROB, 32-entry load queue, 32-entry store queue
  L1I & L1D: 32KB, 4-way, LRU, 4-cycle latency
  L2 cache: 256KB, 8-way, LRU, 12-cycle latency
  Pipeline squashing, validation at commit
Benchmarks
  Splash2, Parsec, SPEC2000, SPEC2006
8
Potential of Value Speculation (I)
Six categories of instructions
Loads are 25% of all dynamic micro-instructions
High-latency micro-instructions (more than 5 cycles) are not representative (included in "Other")
9
Potential of Value Speculation (III)
Oracle predictor (no mispredictions)
Value speculation in each category of instruction
Loads have almost the same potential as speculating on all instructions
[Figure: series ALL, LOADS, NOTLOADS]
Loads (25%) have more potential gains than all the other instructions together (75%)
10
Advantages of Speculating only in Loads in a narrow processor
Value speculation in narrow-issue processors
  Reduced back-to-back prediction: fewer in-flight instructions
  Approach to mainstream processors
  Reduced misprediction penalty (smaller pipeline)
Speculation on ¼ of the instructions (loads), with almost the same potential gains
  Reduced register port pressure
  Reduced back-to-back prediction
Still need confidence counters to increase precision
  "mcf" minimum precision: 76.7%
  "tonto" minimum precision: 99.6%
11
State of the Art Predictors
Last Value Predictor (LVP)
Stride predictor
  {1,2,3,4,5}
  Variants: 2D-Stride
FCM
VTAGE
DVTAGE, DFCM
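The first two predictors in this list are simple enough to sketch. The code below is a hedged illustration: the unbounded dictionary tables, the omission of confidence counters, and the helper count_hits are assumptions of the example, not details from the talk.

```python
class LastValuePredictor:
    """Predicts that a load returns the same value as last time."""

    def __init__(self):
        self.last = {}  # pc -> last observed value

    def predict(self, pc):
        return self.last.get(pc)  # None when the PC has not been seen

    def update(self, pc, value):
        self.last[pc] = value


class StridePredictor:
    """Predicts last value + last observed stride."""

    def __init__(self):
        self.table = {}  # pc -> (last_value, stride)

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry is None:
            return None
        last_value, stride = entry
        return last_value + stride

    def update(self, pc, value):
        entry = self.table.get(pc)
        # The first observation has no predecessor: start with stride 0.
        stride = value - entry[0] if entry is not None else 0
        self.table[pc] = (value, stride)


def count_hits(predictor, values, pc=0x400):
    """Feed a value sequence for one (hypothetical) PC; count correct predictions."""
    hits = 0
    for value in values:
        if predictor.predict(pc) == value:
            hits += 1
        predictor.update(pc, value)
    return hits


if __name__ == "__main__":
    sequence = [1, 2, 3, 4, 5]  # the stride example from the slide
    print("LVP hits:", count_hits(LastValuePredictor(), sequence))
    print("Stride hits:", count_hits(StridePredictor(), sequence))
```

On the slide's sequence {1,2,3,4,5} the last value predictor never hits, while the stride predictor is correct from the third value onward, once it has observed one stride.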
12
XLStride Predictor
It detects strides between consecutive values, and also between alternating values
  Examples: {2,1,1,4,4,3,6,6,7,8}, {4,0,4,9,4,1,4}
It can have several levels
  X histories, each one containing stride information about the last X occurrences of the instruction
  It requires X^2 strides + last value
  16-bit strides
  X predictions: selection by confidence counters
We implemented a 2LStride predictor (good performance/cost trade-off)
Example: 2LStride, 1-bit confidence counter
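To make the idea of strides between consecutive and alternating values concrete, here is a simplified two-level sketch for a single instruction. It is not the authors' exact design: the state layout (one consecutive stride plus one distance-2 stride per phase), the priority given to level 1, unbounded Python integers in place of 16-bit strides, and all names are assumptions of this example.

```python
class TwoLevelStrideEntry:
    """Simplified 2-level stride state for one static load (illustrative)."""

    def __init__(self):
        self.history = []            # last two observed values, oldest first
        self.seen = 0                # number of values observed so far
        self.stride1 = None          # stride between consecutive values
        self.conf1 = False           # 1-bit confidence for level 1
        self.stride2 = [None, None]  # distance-2 strides, one per phase
        self.conf2 = [False, False]  # 1-bit confidence per phase

    def predict(self):
        """Return a prediction only if some level is confident, else None."""
        if self.conf1:
            return self.history[-1] + self.stride1
        phase = self.seen % 2
        if self.conf2[phase]:
            return self.history[-2] + self.stride2[phase]
        return None

    def update(self, value):
        # Level 1: confidence is set when the new stride repeats the old one.
        if len(self.history) >= 1:
            new_stride = value - self.history[-1]
            self.conf1 = new_stride == self.stride1
            self.stride1 = new_stride
        # Level 2: same rule, against the value two occurrences back.
        if len(self.history) >= 2:
            phase = self.seen % 2
            new_stride = value - self.history[-2]
            self.conf2[phase] = new_stride == self.stride2[phase]
            self.stride2[phase] = new_stride
        self.history = (self.history + [value])[-2:]
        self.seen += 1


def confident_predictions(values):
    """Return (predicted, actual) pairs for every confident prediction."""
    entry = TwoLevelStrideEntry()
    pairs = []
    for value in values:
        predicted = entry.predict()
        if predicted is not None:
            pairs.append((predicted, value))
        entry.update(value)
    return pairs


if __name__ == "__main__":
    # Two interleaved streams: 10, 12, 14, ... and 100, 200, 300, ...
    print(confident_predictions([10, 100, 12, 200, 14, 300, 16, 400, 18]))
    # Second example sequence from the slide.
    print(confident_predictions([4, 0, 4, 9, 4, 1, 4]))
```

A plain stride predictor never becomes confident on the interleaved sequence because consecutive strides keep changing, while the distance-2 level locks onto each stream separately. On the slide's sequence {4,0,4,9,4,1,4} this sketch makes a single confident prediction, the final 4.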
13
AV Predictor
Some benchmarks exhibit patterns in the addresses, not in the values
Address Table (AT): indexed by PC, result: predicted address
  Implemented with a state-of-the-art predictor
Value Table (VT): indexed by predicted address, result: predicted value
  Implemented with a last value predictor
  VT is also updated on stores
Detects patterns in the addresses: results are totally different from traditional predictors
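The two-phase lookup (PC to predicted address, then predicted address to predicted value) can be sketched as below. This is an illustration under stated assumptions: a simple stride predictor stands in for the unspecified "state-of-the-art" address predictor, the tables are unbounded dictionaries, and the names AVPredictor, on_load, on_store and demo are hypothetical.

```python
class AVPredictor:
    """Two-phase sketch: Address Table (AT) lookup, then Value Table (VT) lookup."""

    def __init__(self):
        # AT: pc -> (last_address, stride). A stride predictor is an
        # assumption here; the slides only say "state-of-the-art predictor".
        self.address_table = {}
        # VT: address -> last value seen at that address (last value predictor).
        self.value_table = {}

    def predict(self, pc):
        entry = self.address_table.get(pc)
        if entry is None:
            return None
        last_address, stride = entry
        predicted_address = last_address + stride
        # Returns None when the predicted address has no VT entry.
        return self.value_table.get(predicted_address)

    def on_load(self, pc, address, value):
        entry = self.address_table.get(pc)
        stride = address - entry[0] if entry is not None else 0
        self.address_table[pc] = (address, stride)
        self.value_table[address] = value

    def on_store(self, address, value):
        # The VT is also updated on stores, as stated on the slide.
        self.value_table[address] = value


def demo():
    """Walk an array of irregular values with a regular address stride."""
    predictor = AVPredictor()
    data = [13, 7, 42, 5, 99, 1]   # no pattern in the values
    base, pc = 0x1000, 0x400       # hypothetical base address and load PC
    for i, value in enumerate(data):
        predictor.on_store(base + 8 * i, value)
    hits = 0
    for i, value in enumerate(data):
        if predictor.predict(pc) == value:
            hits += 1
        predictor.on_load(pc, base + 8 * i, value)
    return hits


if __name__ == "__main__":
    print("correct predictions:", demo())
```

The values in this walk have no stride or repetition, so a value-indexed predictor would find nothing to learn, but the addresses advance by a constant 8 bytes and the VT already holds what the earlier stores wrote. That is the case the slide describes as patterns in the addresses, not in the values.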
14
Evaluation
Load speculation
  7 state-of-the-art predictors
  2LStride predictor
  3 AV predictors
  Several hybrid predictors
Uses half of the entries of the state-of-the-art predictors [Perais and Seznec, HPCA'13]
15
Results (I)
Individual results:
Hybrid results:
Best of the single predictors
Always better than the best of the single predictors
16
Results (II)
Multicore experiments with 24 cores
To check the influence of shared memory on the precision
Precision on the value table => No changes in shared memory by remote processors
17
Conclusions
We simulate a real processor (Intel Westmere) to bring Value Prediction closer to general-purpose processors (narrow-issue processors)
Speculating on loads has a better cost/benefit ratio than speculating on all instructions (in narrow processors)
We propose the XLStride predictor
  Detects more complex stride patterns
We propose the AV predictor
  Complementary to the traditional predictors: ideal for hybrid predictors
Speed-up of up to 33% (average 10%)
Shared memory in multicore processors barely affects the precision of predictors
19
Potential of Value Speculation (II)
20