Software Tuning Fundamentals

February 23, 2004

Why Is Tuning Needed?
Same code, different performance

Version    Run time
SELF       16.676 s
RELEASE     5.445 s
OPT:4       5.457 s
IMSL       10.996 s
CXML        3.328 s
ATLAS       0.762 s
MKL50       0.848 s
MKL51       0.738 s

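/* i-j-k order: the inner loop walks down a column of b (stride of NUM elements) */
for (i = 0; i < NUM; i++) {
    for (j = 0; j < NUM; j++) {
        for (k = 0; k < NUM; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}

/* i-k-j order: the j and k loops are interchanged, so the inner loop
   walks along rows of b and c (unit stride) */
for (i = 0; i < NUM; i++) {
    for (k = 0; k < NUM; k++) {
        for (j = 0; j < NUM; j++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}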

Objectives
– Identify the main tasks of performance tuning
– Define some important performance tuning terms
– Use Intel tools to help

Agenda
– Performance Cycle Overview
  – The Performance Cycle
  – When to Start
  – Performance Gains
  – When to Stop
  – Putting it into Perspective
– Performance Cycle Details
– Summary

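The Tuning Cycle
– Gather performance data (start here)
– Analyze the data and draw conclusions
– Determine changes that resolve the issues
– Modify the code to implement the optimizations
– Test the results, then repeat the cycle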

When (Why) to Start
– User requirement?
– Software vendor requirement?
– Put the performance requirement into the requirements document
– Performance should be considered at every stage of the product life cycle (requirements gathering, design, and testing)
– Exception: do "code tuning" after the simple, readable, non-optimized version of the application exists

Effort vs. Results

[Figure: performance plotted against effort, with curves for Theoretical Performance, Required Performance, Performance Attained w/ Tools, and Performance Attained w/o Tools]

When to Stop
– Architecture is at maximum efficiency?
  Be sure you know what this is: calculate the theoretical maximum
– Performance requirement is satisfied
– Incrementally do wide mesh optimizations until done

Tuning Principles

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
Donald Knuth

Quality code is:
– Portable
– Readable
– Maintainable
– Reliable

Intelligently sacrifice quality for performance

Agenda
– Performance Cycle Overview
– Performance Cycle Details
  – Gather Performance Data
  – Analyze Data and Identify Issues
  – Generate Alternatives to Resolve Issues
  – Implement Enhancements
– Summary

Gather Performance Data
– Timer
  – Use to get wall clock time
  – Accuracy, low overhead
– Use Intel® VTune™ Performance Analyzer
  – Profiler: gather information about code usage
  – Performance Monitor: gather information about system resource usage

Workload
A good workload should have these characteristics:
– Measurable
– Reproducible
– Static
– Representative

Analyze Data and Draw Conclusions
– Baseline current performance
– Examine hot spots
– Identify bottlenecks
– Calculate potential maximum performance

Examine Hot Spots
– The Pareto Principle, a.k.a. the 80/20 rule
  – Concentrate on the vital few vs. the trivial many
– Hot spot: the part of an application or system that accounts for most of the computation
  – Generally consists of a loop
– For applications that don't have hot spots, examine:
  – Memory layout
  – Exceptions
  – Effective compiler usage

Additional Topics
– Big O
– Utilization, efficiency, throughput, latency
– Bottlenecks
  – I/O, memory, CPU
– MIPS/FLOPS/CPI
– Concurrency, parallelism
– Scalability
– Loads/stores per calculation


Levels of Optimization
– Problem definition
– System architecture
– Algorithms and data structures
– Code tuning
– System software
– System hardware

Code Tuning
Ordered from hardest to develop and maintain (top) to easiest to develop, port and maintain (bottom):
– Assembly instruction level
– Intrinsics
– C++ vector class library
– Multithreading
– Loop transformations
– Compiler and compiler options
– Performance libraries

Code Tuning
– If parallel processing
– Break the algorithm up across clusters (distributed memory)
– Single-node optimization
– Break the algorithm up across processors (SMP)

Modify the Code to Implement Optimizations
– Use Intel® libraries
– Use various compiler switches
– Find out whether the compiler or hardware already does the enhancement automatically before implementing it yourself
– Modify the source (e.g. loop transformations, SWP, SIMD, OpenMP, intrinsics, assembly)

Test!
– Make sure the application still runs correctly (regression testing)
– Make sure the enhancement actually increases performance
– Calculate the speed-up
– Decide whether you are done optimizing

Speed-Up
The two basic formulas:

Speed-Up = Baseline Time / Optimized Time

Speed-Up = Optimized Throughput / Baseline Throughput

Summary
Optimization tasks:
– Gather performance data
– Analyze data and identify issues
– Generate alternatives to resolve issues
– Implement enhancements
– Test results

Use Intel® Software Development Tools for every step in the process