Optimizing Android Performance with GCC Compiler
Mar-12-2010, Fri
• Name - Geunsik Lim
• e-Mail - leemgs.at.gmail.com
• Nick - invain (인베인)
• Blog - http://blog.naver.com/invain/
This document may be freely modified and redistributed, but when the material is reused, the "source" must be credited at the bottom right.
CONTENTS
Optimization Strategies for the lightweight android
Android Toolchain Roadmap
Building Android Toolchain
GA Search For Compiler Options
Thoughtful abstraction & specifications
Profile-Guided Optimization
FDO Illustration & Performance
Lightweight IPO (LIPO)
Redundancy Elimination
Optimizing Dalvik Memory Management
Observation of WebView Bench & Fhourstones
Experimental Result
Systematic Optimizations
Android Technology Session
Reference: GCC internals manual, Shih-wei Liao’s Paper, Dan Kegel’s crosstool, Fedora11 documentation(SMP)
http://leemgs.fedorapeople.org
5th Korea Android Conference
Performance Optimization
What is Optimization?
• In mathematics and computer science, optimization (also called mathematical programming) refers to choosing the best element from some set of available alternatives.
• This means solving problems in which one seeks to minimize or maximize a real function by systematically choosing the values of real or integer variables from within an allowed set.
• The first optimization technique, known as steepest descent, goes back to Gauss (mathematician and scientist).
• Studies in optimizing: code size, performance, power
[Figure: embedded s/w size over time; x-axis 2000 to 2030]
Where is a Hole for Optimization?
• Application? (Application framework, Application)
• Middleware? (Dalvik, Core/Func lib)
• OS Kernel? (Linux)
• Hardware? (Snapdragon, S5PC1XX)
7 Optimization Strategies 1/2
1) Data-driven tool deployment: Regularly evaluate, then leverage the winner among optimizing toolchains
2) Judicious abstraction & specifications: A fundamental methodology
   - Visibility of a function should match the API spec in the programmer's design
   - Tradeoff in splitting into Java and native code: this interface affects performance
   - PacketVideo (= OpenCORE/OpenMAX; multimedia framework): the semiconductor industry looks for APIs to differentiate
3) Systematic parameter setting: A key driver in performance/size
7 Optimization Strategies 2/2
4) Profile-guided optimizations: A useful methodology
   - Feedback-Directed Optimizations (FDO): Build-Run-Build with our arm-xxx-eabi-gcc
   - Class loading profiler (aka Preload profiler): Zygote's preloading. Trade-off between boot-up time and app init time.
5) Scope-enhancing optimizations: Interprocedural optimizations via arm-xxx-eabi-gcc -fripa
   - In the current implementation, -fripa only turns on cross-module inlining analysis.
6) Redundancy elimination: Identical Code Folding (ICF)
7) Memory management optimization in Dalvik in the interest of time.
Data-driven tool deployment
• Analyze the tool candidates:
  gcc-4.4.1 (open source): Without Code Sourcery 2009Q3
  gcc-4.4.0 (google): Android's toolchain for eclair
  gcc-4.3.3 (cs): Code Sourcery 2009Q1
  gcc-4.3.3 (open source): Without Code Sourcery 2009Q1
  gcc-4.3.1 (google): Android toolchain for eclair
  gcc-4.2.1 (google): Android toolchain for Donut
[Figures: Size improvement on Dream phone; Speedup on Dream phone (Run 100X)]
Google tracks 13 numbers daily; there is space to show 4 here.
Cost competitiveness, product differentiation
Source: google
Analyze 6 Toolchains
• Based on Google Android perflab benchmark results:
  - Baseline: Donut (ver 1.6)'s toolchain: gcc-4.2.1
  - Size:
      Both gcc-4.4.X: 17.8% improvement
      Both gcc-4.3.3 and gcc-4.3.3 Code Sourcery version: 15% better
      gcc-4.3.1: 3% improvement
  - Performance: No significant variance among the 6 toolchains; gcc-4.4.3's size benefit comes with no performance penalty
• Code Sourcery for ARM doesn't have a significant performance/size benefit over Android's version of gcc.
  - Code Sourcery's strength: Addressing ARM's hardware errata early. We have to port the fixes to gcc-4.4.3
• gcc-4.4.3 wins: Toolchain moved to 4.4.3; skipping 4.3
Android Toolchain Roadmap
• All pieces from open source: GCC, binutils, gdb, gmp, mpfr
  - Patch for bug fixing and optimization
• Take patches from upstream
• Submit our patches to upstream
• Also, native developers can use Android NDK
  http://developer.android.com/sdk/android-2.1.html (API Level 7, Jan 2010)
S/W by branch: cupcake donut eclair (armv7) kandroid
gcc 4.2.1 4.2.1 4.4.0 4.4.3
binutils 2.17 2.17 2.19 2.20
gdb 6.6 6.6 6.6 7.0.1
gmp 4.2.2 4.2.2 4.2.4 4.3.2
mpfr 2.3.0 2.3.0 2.4.1 2.4.2
Latest Android Toolchain
• Google changed the default cross-compiler on Nov-16-2009.
• Default architecture is still armv5te for compatibility.
Building Android Toolchain (1/2)
• Android uses the Bionic C library
  - BSD license: Keeps GPL out of the user's sphere for Android Market.
  - Smaller and faster than glibc and uClibc:
      glibc 2.11: /lib/libc.so 1,208,224 bytes
      uClibc 0.9.30: /lib/libc.so 424,235 bytes
      Bionic (eclair): /system/lib/libc.so 243,948 bytes
  - Bionic has built-in support for important Android-specific services, e.g., system properties, logging
  - Very limited support for POSIX, C++, etc.
• If you need libstdc++-v3:
  - Enable libstdc++-v3 when configuring the toolchain.
  - Statically link in the necessary components.
  - /system/lib/libstdc++.so (5,124 bytes): size is reduced dramatically.
Building Android Toolchain (2/2)
• Barebone-style building: Inside the Android tree
  - Specify all system and Bionic header file paths, shared library paths, libgcc.a, crtbegin_*.o, crtend_*.o, etc.
• Standalone-style building: Latest prebuilt gcc-4.4.0 toolchain
  - Convenient for native developers:
    arm-xxx-eabi-gcc -mandroid --sysroot=<path-to-sysroot> hello.c -o hello
    (<path-to-sysroot> is a pre-compiled copy of Bionic)
  - Download:
    Old) http://android.git.kernel.org/?p=platform/prebuilt.git;a=tree;f=linux-x86/toolchain;h=1cf27fca792be850f7b18e0c76762787c7b5c8c9;hb=4b06260a916be762d0dd1b93e97306f1b90e3889
    Now) http://android.git.kernel.org/pub/?C=M;O=D
Thread API List
• The Bionic library includes the POSIX C thread library in the /system/lib/libc.so file (./bionic/libc/include/pthread.h).
• Android's POSIX thread API doesn't support pthread_rwlock_***, pthread_rwlockattr_***, pthread_barrier_***, pthread_barrierattr_***, or pthread_spin_*** from the POSIX 1003.1j-2000 standard.
• The Android toolchain includes a GDB utility that uses /system/lib/libthread_db.so for thread debugging of Android applications.
[Figure: Thread functions according to Bionic (eclair)]
How to compile android source faster 1/2
• Utilize your multi-core Linux desktop to build Android.
• The purpose of the "make" utility (GNU make, maintained by Paul Smith) is to determine automatically which pieces of a large program need to be recompiled, and to issue the commands to recompile them.
• The `-j' or `--jobs' option tells make to execute many commands simultaneously.
• This is a Bash shell script for compiling the Android sources quickly.
F11-invain#> vi build-android-kernel.sh
#!/bin/bash
# created by invain for the best performance when compiling kernel source.
# Count logical CPUs (one "processor" line each in /proc/cpuinfo).
realnum=$(grep -c '^processor' /proc/cpuinfo)
# Add about 20% more jobs, rounded to the nearest integer (4 CPUs -> 5 jobs).
bestnum=$(( realnum + (realnum * 2 + 5) / 10 ))
# Run make under SCHED_BATCH (-B) at nice level 1 (-n 1).
schedtool -B -n 1 -e make -j "$bestnum" uImage
How to compile android source faster 2/2
• Evaluation when compiling the full Android sources.
Tested on Intel Core i5 Lynnfield 750 (Quad @2.66GHz) by DeolPool
  time make -j4 : 19m 10s
  time make -j5 : 18m 52s (Recommendation)
  time make -j8 : 19m 15s
  time make -j64 : 19m 54s
Tested on Intel Core2 Quad Yorkfield Q9400 (Quad @2.66GHz) by invain
  time make -j4 : 22m 49s
  time make -j5 : 22m 31s (Recommendation)
  time make -j8 : 28m 47s
  time make -j64 : 51m 19s
How to confirm 32bit/64bit about CPU & Linux
• CPU Core Specification
[invain@fedora11 ~]$ grep flag /proc/cpuinfo
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
• The lm flag is an abbreviation of "Long Mode (64bit)".
• Linux Kernel Information
[invain@fedora11 ~]$ uname -a
Linux invain 2.6.33-rt4-smp #1 SMP Tue Feb 26 23:11:04 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
Thoughtful abstraction & specifications
1. Goal: Visibility of a function should match the API spec in the programmer's design.
2. Solution: First, systematically apply the 5 steps. Fundamentally, go through the APIs of each library and consciously decide what should be "public" and what shouldn't.
3. Result: ~500 KB savings for OpenCORE libs
4. Key: Whole hidden functions can be garbage collected if unused locally.
5. Toolchain's options: -ffunction-sections, -Wl,--gc-sections, -fvisibility=hidden
[Figure: the 5 steps]
  1) Add -fvisibility=hidden in Linux-arm.mk and Android.mk
  2) In *.h, add __attribute__((visibility("default"))) to each public function declaration
  3) invain@fedora11$> make -j <???>
  4) On a link error such as
       /tmp/GoOgLe.o: In function foo
       Bar.c: undefined reference to "baz"
     mark the missing symbol: __attribute__((visibility("default"))) int baz;
  5) Repeat until no failure
(GCC spells public visibility as "default"; "public" is not an accepted value.)
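As a rough illustration of the visibility workflow on this slide, the sketch below shows one library source file compiled with -fvisibility=hidden. The file name, the DEMO_API macro, and all function names are hypothetical and not taken from the Android tree. GCC's attribute value for an exported ("public") symbol is "default".

```c
/* libdemo.c: hypothetical library source; every name here is illustrative.
 * Assumed build line (host gcc shown; the slides use arm-xxx-eabi-gcc):
 *   gcc -fPIC -shared -fvisibility=hidden -ffunction-sections \
 *       -Wl,--gc-sections libdemo.c -o libdemo.so
 */
#include <assert.h>

/* Hypothetical helper macro marking a symbol as part of the public API. */
#define DEMO_API __attribute__((visibility("default")))

/* No attribute: hidden under -fvisibility=hidden, so it is not exported,
 * but it stays in the binary because demo_volume_percent() calls it. */
int demo_clamp(int value, int lo, int hi)
{
    assert(lo <= hi);
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}

/* Hidden and never referenced: with -ffunction-sections it sits in its own
 * section, which makes it a candidate for removal by --gc-sections. */
int demo_unused_helper(int x)
{
    return x * 2;
}

/* Exported: the only symbol other modules are meant to link against. */
DEMO_API int demo_volume_percent(int raw)
{
    return demo_clamp(raw, 0, 100);
}
```

If another module later fails to link with an undefined reference to a hidden symbol, that symbol is either promoted with the macro or the caller is changed, which corresponds to the "until no failure" loop.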
Parameter Setting
• Parameter setting is a key driver in performance/size optimizations
• Case study: For the Android tree, find the best:
  - Compiler parameters
  - Compiler options
• Parameter space exploration via genetic algorithm (GA)
  - Genetic algorithm (GA): a search technique used in computing to find exact or approximate solutions to optimization and search problems. Ref: http://www.genetic-programming.com
GA Search For Compiler Options
Optimization target    Fitness function
Performance            Inverse of execution time
Size                   Inverse of code size
[Figure: GA loop]
  Initialization: Initialize a population of randomly generated option sets
  Selection: Drop a portion of the option sets that build binaries with lower fitness values
  Reproduction: Produce new option sets by crossover and mutation of the remaining ones
  Termination: An expected result is reached, or we don't have enough time for searching
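The four GA stages (initialization, selection, reproduction, termination) can be sketched as follows. This is a toy model under loud assumptions: each bit stands for one hypothetical on/off compiler option, and the per-option size deltas are made-up numbers that replace the real "build the tree and measure the image" step. It is not the search tool used for the results in these slides.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_OPTIONS 8    /* each bit = one hypothetical on/off compiler option */
#define POPULATION  16
#define GENERATIONS 30   /* termination: fixed search budget */

/* Made-up size effect of each option in bytes (stand-in for a real build). */
static const int size_delta[NUM_OPTIONS] = { -400, 250, -120, 90, -300, 60, -50, 500 };

/* Small deterministic generator so that runs are repeatable. */
static uint32_t rng_state = 12345u;
static uint32_t next_random(void)
{
    rng_state = rng_state * 1664525u + 1013904223u;
    return rng_state >> 16;
}

long modeled_size(unsigned options)
{
    long size = 10000;
    for (int bit = 0; bit < NUM_OPTIONS; bit++)
        if (options & (1u << bit))
            size += size_delta[bit];
    return size;
}

/* Size target: fitness is the inverse of code size. */
double fitness(unsigned options)
{
    long size = modeled_size(options);
    assert(size > 0);
    return 1.0 / (double)size;
}

unsigned ga_search(void)
{
    const unsigned all_bits = (1u << NUM_OPTIONS) - 1u;
    unsigned pop[POPULATION];

    /* Initialization: the baseline (no options) plus random option sets. */
    pop[0] = 0u;
    for (int i = 1; i < POPULATION; i++)
        pop[i] = next_random() & all_bits;

    for (int gen = 0; gen < GENERATIONS; gen++) {
        /* Selection: sort by descending fitness, keep the better half. */
        for (int i = 1; i < POPULATION; i++) {
            unsigned key = pop[i];
            int j = i - 1;
            while (j >= 0 && fitness(pop[j]) < fitness(key)) {
                pop[j + 1] = pop[j];
                j--;
            }
            pop[j + 1] = key;
        }
        /* Reproduction: refill the worse half by crossover and mutation. */
        for (int i = POPULATION / 2; i < POPULATION; i++) {
            unsigned a = pop[next_random() % (POPULATION / 2)];
            unsigned b = pop[next_random() % (POPULATION / 2)];
            unsigned mask = (1u << (next_random() % NUM_OPTIONS)) - 1u;
            unsigned child = (a & mask) | (b & ~mask);     /* one-point crossover */
            child ^= 1u << (next_random() % NUM_OPTIONS);  /* flip one bit */
            pop[i] = child & all_bits;
        }
    }

    /* Termination: budget spent, return the fittest option set seen. */
    unsigned best = pop[0];
    for (int i = 1; i < POPULATION; i++)
        if (fitness(pop[i]) > fitness(best))
            best = pop[i];
    return best;
}
```

Because the better half always survives and the baseline is seeded into the first population, the returned option set is never worse than the baseline in this model. In a real search, `modeled_size` would be replaced by an actual build and measurement, which is what makes each generation expensive.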
Options That Control Optimization
These options control various sorts of optimizations.
• “-O0”: Reduce compilation time and make debugging produce the expected results. This is the default.
• “-O1”: Optimizing compilation takes somewhat more time, and a lot more memory for a large function.
• “-O2”: Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. Used for kernel/app builds.
• “-O3”: Turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload and -ftree-vectorize options.
• “-Os”: Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
Reduce Code Size by Option Search
• We search for the configuration that reduces size the most, using the compiler option search approach
• Android default inline options:
  -finline-functions
  -fno-inline-functions-called-once
• Options that we found:
  -finline
  -fno-inline-functions
  -finline-functions-called-once
  --param max-inline-insns-auto=62
  --param inline-unit-growth=0
  --param large-unit-insns=0
  --param inline-call-cost=4
Native system image size (unit: byte)
GCC-4.2.1                          23,839,291
GCC-4.4.3                          23,027,032
GCC-4.4.3 (tuned inline options)   22,087,436
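To put the three image sizes in relative terms, a small helper like the one below can be used. The function name is hypothetical; the inputs in the note are the byte counts reported on this slide.

```c
#include <assert.h>

/* Percentage by which `after` is smaller than `before`. */
double percent_reduction(long before, long after)
{
    assert(before > 0);
    return 100.0 * (double)(before - after) / (double)before;
}
```

With the slide's numbers, `percent_reduction(23839291, 23027032)` is about 3.41 (GCC-4.4.3 versus GCC-4.2.1), `percent_reduction(23839291, 22087436)` is about 7.35 (tuned GCC-4.4.3 versus GCC-4.2.1), and `percent_reduction(23027032, 22087436)` is about 4.08, which is the share attributable to the tuned inline options alone.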
Profile-Guided Optimization: Toolchain enables FDO (Feedback-Directed Optimization)
[Figure: register allocation example. tmp1 and tmp2 are defined, then tmp3 is defined, and tmp1 and tmp2 are both used afterwards. The compiler must spill tmp1 or tmp2 before defining tmp3.]
Instrumentation Based FDO
1. Build twice.
2. Find representative input.
3. Instrumentation run: 2~3X slower, but this perturbation is OK because threading in Android is not that time sensitive (after all, ARM11 or Cortex-A8 core).
4. 1 profile per file, dumped at application exit.
[Figure: FDO flow]
  1) arm-xxx-eabi-gcc -fprofile-generate=./profile . . .  -> Instrumented binary
  2) Run the instrumented binary with representative input data -> Profile.zip
  3) arm-xxx-eabi-gcc -fprofile-use=./profile.zip . . .  -> Optimized binary with FDO
http://gcc.gnu.org/onlinedocs/gcc-4.4.3/gcc.pdf (Page 102)
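As a minimal sketch of what the build-run-build cycle acts on, consider a function whose branch is rarely taken for typical input. The file name, the function, and the claim about the workload are assumptions made for illustration; the command lines in the comment use stock host GCC flags rather than the arm-xxx-eabi-gcc setup from the slides.

```c
/* hot_path.c: hypothetical example, not taken from the Android tree.
 * Assumed workflow with stock GCC:
 *   1. gcc -O2 -fprofile-generate hot_path.c main.c -o app   (instrumented build)
 *   2. ./app < representative_input                          (writes .gcda profiles at exit)
 *   3. gcc -O2 -fprofile-use hot_path.c main.c -o app        (optimized build)
 */
#include <assert.h>
#include <stddef.h>

/* Counts characters that would need escaping. For the assumed workload
 * (mostly plain text) the branch is rarely taken; the training run records
 * that fact so the compiler does not have to guess it. */
size_t count_escapes(const char *text, size_t length)
{
    assert(text != NULL || length == 0);
    size_t escapes = 0;
    for (size_t i = 0; i < length; i++) {
        if (text[i] == '<' || text[i] == '&')   /* cold in the assumed workload */
            escapes++;
    }
    return escapes;
}
```

The quality of the result depends on step 2: if the training input is not representative, the recorded branch counts will steer the optimized build toward the wrong paths.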
FDO Performance
• Global hotness for ARM (HOT_BB_COUNT_FRACTION in the branch prediction routines for the GNU compiler, gcc-4.4.x/gcc/predict.c)
• 1% improvement on Android's skia library, as shown below. Smaller effects on smaller Android benchmarks.
Content / Work            default      fdo-default   fdo-modified
Size of libskia (bytes)   7,879,646    7,396,032     7,319,668
Size reduction            0.00%        6.14%         7.11%
Stdev (over 100 runs)     0.28         0.63          0.26
Speedup                   1            0.98          0.97
Source: google
Scope-Enhancing Optimization: Inter-Procedural Optimizations (IPO)
• Optimization opportunity: decided by the scope of the code the compiler can see
• Scope is limited mainly by artificial source boundaries
• IPO enhances the scope
parent.c:
int foo(int i, int j)
{
    return bar(i, j) + bar(j, i);
}
child.c:
int bar(int i, int j)
{
    return i - j;
}
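To make the benefit concrete, the sketch below places both functions from parent.c and child.c in one file and writes out by hand what cross-module inlining makes derivable. The name `foo_after_inlining` and the `check_equivalence` helper are hypothetical; whether a given GCC version performs this exact simplification is not something the slides state.

```c
#include <assert.h>

/* child.c */
int bar(int i, int j)
{
    return i - j;
}

/* parent.c: when bar() lives in another module, the compiler sees two
 * opaque calls and has to emit both. */
int foo(int i, int j)
{
    return bar(i, j) + bar(j, i);
}

/* With both bodies in scope, foo becomes (i - j) + (j - i), which is 0
 * whenever the subtraction does not overflow. Written out by hand here. */
int foo_after_inlining(int i, int j)
{
    (void)i;
    (void)j;
    return 0;
}

/* Spot check that the hand-simplified form agrees with the original. */
void check_equivalence(int i, int j)
{
    assert(foo(i, j) == foo_after_inlining(i, j));
}
```

Without IPO, the source file boundary hides `bar`'s body from `foo`, so neither the inlining nor the follow-up simplification is available.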
Problem with Traditional IPO
• Parameter setting is a key driver in performance/size optimizations
• Case study: For the Android tree, find the best: Compiler parameters
CMI: Cross Module Inlining
Solution: Profile Feedback Based Lightweight IPO (LIPO)
• To get the best potential out of IPO:
  - Integrate IPO with FDO, seamlessly!
  - perf (IPO + FDO) > perf (IPO) + perf (FDO)
• Move Inter-Procedural Analysis (IPA) to the end of the training run execution, into the binary: make global decisions earlier!
• Write IPA results into the profile
• During profile-use compilation:
  - Compile each file, as usual, with the augmented profile
  - Read additional IPA results
  - Pull in auxiliary modules and extend scope
☞ Memo http://gcc.gnu.org/wiki/LightweightIpo
LIPO Improves Performance: Use -fripa
• LIPO targets C/C++: Android uses C/C++ (except for some assembly code)
• Baseline: FDO enabled
• Degradations are in noise range.
We just got the ARM version of LIPO to work:
  Run: f11#> arm-xxx-eabi-gcc -fprofile-generate=/data/local/profile -fripa -mandroid
  Replace: -fprofile-generate with -fprofile-use for the final optimized build
Performance Evaluation Result
SPEC2000 on x86   Improvement   SPEC2006 on x86   Improvement
177.mesa          1.33%         433.milc          1.75%
164.gzip          3.83%         447.dealII        2.03%
175.vpr           3.43%         453.povray        12.74%
253.perlbmk       1.94%         445.gobmk         1.26%
254.gap           3.76%         458.sjeng         5.45%
255.vortex        21.12%        464.h264ref       8.51%
252.eon           3.42%         473.astar         0.72%
The Standard Performance Evaluation Corporation (SPEC) is a non-profit corporation formed to establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers. (http://www.spec.org/)
Redundancy Elimination: Identical Code Folding (ICF)
• Identify identical functions and merge them at link time.
• Implemented in the binutils gold linker. Triggered with option --icf.
• Debug support available through call tables.
• ICF on gold yields a 5% size reduction on x86-64 binaries
• We are still getting the gold linker to work with Android ARM. We estimate ~5% further Android size reduction on top of garbage collection. Stay tuned.
Optimizing Memory Management
• Each Dalvik (by Dan Bornstein) virtual machine has its own heap
• Dalvik uses the dlmalloc API to manage its heap
  - Allocate memory with mspace_calloc
  - Release memory with mspace_free
[Figure: Dalvik and the Dalvik heap. New object: mspace_calloc. Release object: mspace_free.]
Various Headrooms for Memory Management Optimizations
• Some of the allocated objects have the same size
High-ratio object sizes in WebViewBench
Size    Count     Ratio
24      16,435    34.40%
20      5,464     11.40%
36      4,474     9.40%
. . .   . . .     . . .
Objects allocation log in WebViewBench
. . .
[Ljava/util/HashMap$Entry;:24
Ljava/util/HashMap$Entry;:24
Landroid/webkit/PerfChecker;:16
Landroid/webkit/LoadListener;:156
Landroid/webkit/ByteArrayBuilder;:20
Ljava/util/LinkedList;:20
Ljava/util/LinkedList$Link;:20
Ljava/util/LinkedList;:20
Ljava/util/LinkedList$Link;:20
Ljava/lang/String;:24
Ljava/lang/String;:24
Landroid/webkit/FrameLoader;:48
Ljava/lang/String;:24
. . .
Observation of WebView Bench
• The size ratio between allocation and release is almost the same
Observation of Fhourstones (FreeBSD benchmarks)
• This integer benchmark solves positions in the game of connect-4, as played on a vertical 7x6 board.
• The ratio of size = 44 is extremely high in this case
• http://homepages.cwi.nl/~tromp/c4/Fhourstones.tar.gz
Many Objects Alloc/Released in Short Time
• Optimization: Add a buffer cache of memory chunks
Buffer Cache: Release
[Figure: Dalvik releases a String object (size = 24); the memory chunk (size = 24) is kept in the buffer cache in front of the Dalvik heap.]
Buffer Cache: Allocate
[Figure: Dalvik needs a String object (size = 24) and asks the buffer cache, "Do you have a memory chunk whose size = 24?"; the buffer cache hands back a memory chunk (size = 24).]
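The release and allocate paths of the buffer cache can be sketched as below. This is a minimal model under stated assumptions, not Dalvik's implementation: the standard `calloc` and `free` stand in for `mspace_calloc` and `mspace_free`, the limits `MAX_CACHED_SIZE` and `SLOTS_PER_SIZE` are invented, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CACHED_SIZE 64  /* hypothetical: only small chunks are cached */
#define SLOTS_PER_SIZE  4   /* hypothetical: slots per distinct size */

/* One small stack of parked chunks per object size. */
static void *cache[MAX_CACHED_SIZE + 1][SLOTS_PER_SIZE];
static int   cached_count[MAX_CACHED_SIZE + 1];

/* Release: park the chunk in the cache if there is a free slot for its
 * size, otherwise hand it back to the allocator. */
void cache_release(void *chunk, size_t size)
{
    assert(chunk != NULL);
    if (size <= MAX_CACHED_SIZE && cached_count[size] < SLOTS_PER_SIZE)
        cache[size][cached_count[size]++] = chunk;
    else
        free(chunk);                 /* stand-in for mspace_free */
}

/* Allocate: "do you have a memory chunk whose size = N?" If so, reuse it;
 * if not, fall through to the allocator. */
void *cache_alloc(size_t size)
{
    if (size <= MAX_CACHED_SIZE && cached_count[size] > 0) {
        void *chunk = cache[size][--cached_count[size]];
        memset(chunk, 0, size);      /* keep the zero-filled contract of calloc */
        return chunk;
    }
    return calloc(1, size);          /* stand-in for mspace_calloc */
}
```

The design relies on the observation from the allocation logs: a few sizes (24, 20, 36 bytes) dominate, and objects of those sizes are released and requested again within a short time, so an exact-size lookup hits often.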
Experimental Result 1/2
• Release Performance Improvement in Fhourstones
• Allocation Performance Improvement in Fhourstones
[Figures: both charts compare buffer cache slots: No Pool, 16,384, 65,536]
Source: google
Experimental Result 2/2
• Release Performance Improvement in WebViewBench
• Allocation Performance Improvement in WebViewBench
[Figures: both charts compare buffer cache slots: No Pool, 16,384, 65,536]
Source: google
Summary
• Systematic Optimizations
1. Toolchain: Regularly evaluate and leverage. E.g., leverage the newest lightweight IPO and ICF
2. There is no substitute for thoughtful abstraction & specifications
3. Systematic parameter setting: A key driver to performance
4. Data-driven: Profile it
5. Optimizing memory management for Android/Dalvik is important.
THANKS
Quiz#1) Throughput according to /init daemon
./android-2.1/system/core/sh/init.c
(minimal bootable environment)
• For the /init executable, the ancestor of all processes on the Android platform, which of the following gives the most ideal QuickBoot at power-on?
1) Running an init that was built statically
2) Running an init that was built as a shared (dynamic) executable
3) Running an init that was built shared and then had pre-link applied
4) Running an init built statically when the Toolbox source size is small, and built shared when the source is large
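Quiz#2) License Issue of C++ standard lib
• The C++ standard library used in the Android platform's rootfs (/system/lib/libstdc++.so) is under the GPL license. If so, must the source of userspace code (e.g., *.apk) that links against and runs with functions in this library be fully disclosed when a customer requests it?
1) Of course. If a customer requests it, the commercial application must be disclosed.
2) Officially, Android is under the Apache License, so it does not have to be disclosed.
3) Even though the C++ standard lib is GPL, the application's source does not have to be disclosed to customers if the full text of the exception clause is printed in the product manual.
4) There is an obligation to disclose to purchasers of the application, but requests from non-purchasers do not have to be honored.
5) The application seller can quickly change their phone number and go into hiding for a while.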
Quiz#3) How to maximize free memory
• Look at the available RAM in the figure below. Before it is 54MB and after it is 93MB, roughly a twofold difference. What is the cause?
[Figure: Before / After]