
Software Tuning Fundamentals. Chen Jian, 2003/3.


Page 1: Software Tuning Fundamentals

Chen Jian, 2003/3

Page 2: Why Tune?

Same code >> different performance

Version:  SELF     RELEASE  OPT:4   IMSL     CXML    ATLAS   MKL50   MKL51
Time:     16.676s  5.445s   5.457s  10.996s  3.328s  0.762s  0.848s  0.738s

/* i-j-k order: the inner loop walks down a column of b */
for (i = 0; i < NUM; i++) {
    for (j = 0; j < NUM; j++) {
        for (k = 0; k < NUM; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}

/* i-k-j order: the inner loop walks along rows of b and c */
for (i = 0; i < NUM; i++) {
    for (k = 0; k < NUM; k++) {
        for (j = 0; j < NUM; j++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}
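The two loop nests above compute the same product; only the traversal order differs. The following is a minimal sketch that wraps each nest in a function so the results can be compared. The function names, the `fill_inputs` and `check_same` helpers, and the value of `NUM` are assumptions made for illustration and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define NUM 64  /* hypothetical size, kept small so the check is quick */

/* i-j-k order: for each c[i][j], the inner loop strides down a column of b. */
void matmul_ijk(double a[NUM][NUM], double b[NUM][NUM], double c[NUM][NUM])
{
    for (size_t i = 0; i < NUM; i++)
        for (size_t j = 0; j < NUM; j++)
            for (size_t k = 0; k < NUM; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* i-k-j order: the inner loop reads b[k][*] and updates c[i][*] contiguously. */
void matmul_ikj(double a[NUM][NUM], double b[NUM][NUM], double c[NUM][NUM])
{
    for (size_t i = 0; i < NUM; i++)
        for (size_t k = 0; k < NUM; k++)
            for (size_t j = 0; j < NUM; j++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Hypothetical helper: fill a and b with small integer values. */
void fill_inputs(double a[NUM][NUM], double b[NUM][NUM])
{
    for (size_t i = 0; i < NUM; i++)
        for (size_t j = 0; j < NUM; j++) {
            a[i][j] = (double)((i + j) % 7);
            b[i][j] = (double)((3 * i + j) % 5);
        }
}

/* Hypothetical helper: abort if the two result matrices differ anywhere. */
void check_same(double c1[NUM][NUM], double c2[NUM][NUM])
{
    for (size_t i = 0; i < NUM; i++)
        for (size_t j = 0; j < NUM; j++)
            assert(c1[i][j] == c2[i][j]);
}
```

Both functions accumulate into `c`, so the caller is expected to zero `c` first. Each `c[i][j]` receives its terms in the same `k` order in both versions, so the results should match exactly; any difference in run time comes from the memory access pattern.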

Page 5: Objectives

Identify the main tasks of performance tuning

Define some important performance tuning terms

Use Intel tools to help

Page 6: Agenda

Performance Cycle Overview
– The Performance Cycle
– When to Start
– Performance Gains
– When to Stop
– Putting it into Perspective

Performance Cycle Details

Summary

Page 7: The Tuning Cycle

Start here: gather performance data
– Analyze the data and draw conclusions
– Decide on changes that resolve the issues
– Modify the code to implement the optimizations
– Test the results, then return to gathering data

Page 8: When (Why) to Start

User Requirement?

Software Vendor Requirement?

Put the Performance Requirement into the Requirements Document

Performance should be considered at every stage of the product life cycle (Requirements Gathering, Design, and Testing)

Exception: Do "code tuning" after the simple/readable non-optimized version of the application exists.

Page 9: Effort vs. Results

[Figure: Performance (vertical axis) plotted against Effort (horizontal axis), with lines for Theoretical Performance, Required Performance, Performance Attained with Tools, and Performance Attained without Tools]

Page 10: When to Stop

Architecture is at Maximum Efficiency?
– Be sure you know what this is: Calculate Theoretical Maximum

Performance Requirement is satisfied
– Incrementally do Wide Mesh Optimizations until done
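The two stopping conditions can be expressed as a small calculation. The sketch below assumes the common estimate that theoretical peak equals cores times clock rate times floating-point operations per cycle; the function names, parameters, and the `near_peak_fraction` threshold are hypothetical and not taken from the slides.

```c
#include <assert.h>

/* Theoretical peak in GFLOPS, assuming peak = cores * clock (GHz) *
   floating-point operations per cycle per core. */
double peak_gflops(int cores, double clock_ghz, double flops_per_cycle)
{
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_cycle > 0.0);
    return cores * clock_ghz * flops_per_cycle;
}

/* Fraction of the theoretical peak that the application achieves. */
double efficiency(double achieved_gflops, double peak)
{
    assert(peak > 0.0);
    return achieved_gflops / peak;
}

/* Returns 1 when either stopping condition holds: the requirement is
   met, or the code already runs close to the theoretical maximum.
   near_peak_fraction is a hypothetical threshold chosen by the team. */
int can_stop(double achieved_gflops, double required_gflops,
             double peak, double near_peak_fraction)
{
    return achieved_gflops >= required_gflops
        || efficiency(achieved_gflops, peak) >= near_peak_fraction;
}
```

For example, with a hypothetical single core at 2.0 GHz issuing 2 floating-point operations per cycle, `peak_gflops(1, 2.0, 2.0)` gives 4.0, and code achieving 1.0 GFLOPS is running at 25% efficiency.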

Page 11: Tuning Principles

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
Donald Knuth

Quality Code is:
– Portable
– Readable
– Maintainable
– Reliable

Intelligently Sacrifice Quality for Performance

Page 12: Agenda

Performance Cycle Overview

Performance Cycle Details
– Gather Performance Data
– Analyze Data and Identify Issues
– Generate Alternatives to Resolve Issues
– Implement Enhancements

Summary

Page 13: Gather Performance Data

Timer
– Use to get wall clock time
– Accuracy, Low Overhead

Use Intel® VTune™ Performance Analyzer
– Profiler: Gather Information about Code Usage
– Performance Monitor: Gather Information about System Resource Usage
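A wall-clock timer of the kind described above can be sketched with the C11 `timespec_get` function. The wrapper names `wall_seconds` and `time_call`, and the `sample_work` workload, are hypothetical; the slides do not say which timer API was used.

```c
#include <assert.h>
#include <time.h>

/* Current wall-clock time in seconds, read with C11 timespec_get. */
double wall_seconds(void)
{
    struct timespec ts;
    int base = timespec_get(&ts, TIME_UTC);
    assert(base == TIME_UTC);  /* timespec_get returns the base on success */
    (void)base;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Elapsed wall-clock seconds for a single call of work(arg). */
double time_call(void (*work)(void *), void *arg)
{
    double start = wall_seconds();
    work(arg);
    return wall_seconds() - start;
}

/* Hypothetical workload: sum the integers 0..n-1, where arg points to n.
   The volatile accumulator keeps the compiler from removing the loop. */
void sample_work(void *arg)
{
    long n = *(long *)arg;
    volatile long sum = 0;
    for (long i = 0; i < n; i++)
        sum += i;
}
```

Calling `time_call(sample_work, &n)` returns the elapsed seconds for one run. Timing the same workload several times and keeping the minimum or median is a common way to reduce noise from other activity on the machine.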

Page 14: Workload

A good workload should have these characteristics:
– measurable
– reproducible
– static
– representative

Page 15: Analyze Data and Draw Conclusions

Baseline Current Performance

Examine Hot Spots

Identify Bottlenecks

Calculate Potential Maximum Performance

Page 16: Examine Hot Spots

The Pareto Principle, a.k.a. the 80/20 Rule
– Concentrate on the vital few vs. the trivial many

Hot Spot: the part of an application or system that accounts for most of the computation
– Generally consists of a Loop

For applications that don't have hot spots, examine:
– Memory Layout
– Exceptions
– Effective Compiler Usage
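The 80/20 rule can be checked against profile data. The sketch below assumes a hypothetical array of per-function times, such as a profiler might report, and counts how many of the largest entries are needed to cover a chosen fraction of the total; the function name and interface are illustrative only.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator that orders doubles from largest to smallest. */
static int cmp_desc(const void *x, const void *y)
{
    double a = *(const double *)x;
    double b = *(const double *)y;
    return (a < b) - (a > b);
}

/* Sorts times[] in place in descending order and returns how many of
   the largest entries are needed to reach `fraction` of the total. */
size_t hot_spot_count(double *times, size_t n, double fraction)
{
    assert(times != NULL && n > 0 && fraction > 0.0 && fraction <= 1.0);
    qsort(times, n, sizeof times[0], cmp_desc);

    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += times[i];

    double running = 0.0;
    size_t count = 0;
    while (count < n && running < fraction * total) {
        running += times[count];
        count++;
    }
    return count;
}
```

With hypothetical times of 1, 60, 2, 25, 3, 4, and 5 seconds across seven functions, two functions (60 and 25) already cover 80% of the total, so those two are the ones worth examining first.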

Page 17: Additional Topics

Big O

Utilization, Efficiency, Throughput, Latency

Bottlenecks
– I/O, Memory, CPU

MIPS/FLOPS/CPI

Concurrency, Parallelism

Scalability

Loads/Stores per Calculation
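Several of these terms are simple ratios. The sketch below uses the standard definitions of CPI (cycles per instruction), MIPS (millions of instructions per second), and MFLOPS (millions of floating-point operations per second), plus a memory-operations-per-calculation ratio; the function names and the idea of feeding them raw counter values are assumptions for illustration.

```c
#include <assert.h>

/* Cycles per instruction: lower means more work per clock cycle. */
double cpi(double cycles, double instructions)
{
    assert(instructions > 0.0);
    return cycles / instructions;
}

/* Millions of instructions executed per second. */
double mips(double instructions, double seconds)
{
    assert(seconds > 0.0);
    return instructions / seconds / 1e6;
}

/* Millions of floating-point operations per second. */
double mflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / seconds / 1e6;
}

/* Loads plus stores issued for each arithmetic operation. */
double mem_ops_per_calc(double loads, double stores, double calcs)
{
    assert(calcs > 0.0);
    return (loads + stores) / calcs;
}
```

As an example, the statement `c[i][j] = c[i][j] + a[i][k] * b[k][j]` performs 2 floating-point operations and, if nothing is kept in registers, 3 loads and 1 store, giving `mem_ops_per_calc(3, 1, 2)` = 2.0 memory operations per calculation.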

Page 18: Agenda

Performance Cycle Overview

Performance Cycle Details
– Gather Performance Data
– Analyze Data and Identify Issues
– Generate Alternatives to Resolve Issues
– Implement Enhancements

Summary

Page 19: Levels of Optimization Design

– Problem definition
– System architecture
– Algorithms and data structures
– Code tuning
– System software
– System hardware

Page 20: Code Tuning

Hardest to develop and maintain
– Assembly instruction level
– Intrinsics
– C++ vector class library
– Multithreading
– Loop transformations
– Compiler and its options
– Performance libraries
Easiest to develop, port and maintain

Page 21: Code Tuning

If Parallel Processing
– Break Algorithm up across Clusters (Distributed Memory)
– Single Node Optimization
– Break Algorithm up across Processors (SMP)

Page 22: Modify Code to Implement Optimizations

Use Intel® Libraries

Use Various Compiler Switches

Check whether the compiler or hardware already applies an enhancement automatically before implementing it yourself

Modify Source (e.g. Loop Transformations, SWP, SIMD, OpenMP, Intrinsics, Assembly)
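As one illustration of a source-level change from the list above, the sketch below adds an OpenMP work-sharing directive to a dot-product loop. The `dot` function is a hypothetical example, not code from the slides; the directive is guarded by `_OPENMP`, so the same file also builds as ordinary serial C when OpenMP is not enabled in the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of x and y, each of length n. When the compiler is run
   with OpenMP enabled, loop iterations are divided among threads and
   the per-thread partial sums are combined by the reduction clause. */
double dot(const double *x, const double *y, long n)
{
    assert(x != NULL && y != NULL && n >= 0);
    double sum = 0.0;
    long i;
#ifdef _OPENMP
#pragma omp parallel for reduction(+:sum)
#endif
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

A parallel reduction may add the terms in a different order than the serial loop, so floating-point results can differ slightly in the last digits for general inputs. That is one reason the regression test after each change matters.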

Page 23: Test!

Make sure the application still runs correctly (Regression Testing)

Make sure the enhancement actually increases performance

Calculate Speed-up

Decide if you're done optimizing

Page 24: Speed-Up

The Two Basic Formulas

Speed-Up = Baseline Time / Optimized Time

Speed-Up = Optimized Throughput / Baseline Throughput
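Both formulas express the same idea: a speed-up greater than 1 means the optimized version is faster. A minimal sketch follows; the function names are hypothetical.

```c
#include <assert.h>

/* Speed-up from run times: baseline time divided by optimized time. */
double speedup_from_time(double baseline_seconds, double optimized_seconds)
{
    assert(baseline_seconds > 0.0 && optimized_seconds > 0.0);
    return baseline_seconds / optimized_seconds;
}

/* Speed-up from throughput: optimized divided by baseline. */
double speedup_from_throughput(double optimized_throughput,
                               double baseline_throughput)
{
    assert(optimized_throughput > 0.0 && baseline_throughput > 0.0);
    return optimized_throughput / baseline_throughput;
}
```

Taking the first and last timings listed on the "Why tune?" slide as inputs, `speedup_from_time(16.676, 0.738)` comes to roughly 22.6.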

Page 25: Summary

Optimization Tasks
– Gather Performance Data
– Analyze Data & Identify Issues
– Generate Alternatives to Resolve Issues
– Implement Enhancements
– Test Results

Use Intel® Software Development Tools for every step in the process