Copyright 2013, Toshiba Corporation.
DAC2013 Designer/User Track
Scalability Achievement by Low-Overhead, Transparent Threads on an Embedded Many-Core Processor
Takeshi Kodaka, Akira Takeda, Shunsuke Sasaki, Akira Yokosawa, Toshiki Kizu, Takahiro Tokuyoshi, Hui Xu, Toru Sano, Hiroyuki Usui, Jun Tanabe, Takashi Miyamori and Nobu Matsumoto
Center for Semiconductor Research and Development, Toshiba Corporation
Background
• Requirements for embedded processors
  – Various types of processing
    • Video codecs (HEVC, H.264, MPEG-2, WMV, ...)
    • Face detection/recognition, audio/video playback, mobile TV
  – Wide range of required processing performance
    • Should cover various types of products, from mobile phones to tablets or more
      – Example: video decoding from QVGA 15fps to 1080p 60fps or more
  – Low-cost, short-time development that meets market requirements
    • Reuse existing software to reduce development cost
Challenges
• What kind of hardware architecture to employ?
  – The number of cores should be easy to increase or decrease
• How can we realize scalable performance?
  – A parallelized application program that uses multiple cores efficiently
• How can we realize transparency?
  – Hiding the number of cores from the application program
Answers: a multiple-core architecture [xu2012low] and our proposed scheduler
[xu2012low] H. Xu et al., "A low power many-core SoC with two 32-core clusters connected by tree based NoC for multimedia applications," VLSI Symposium 2012.
Our approach
A simple multiple-core architecture
+ An application program independent of the number of cores
+ An efficient parallel processing scheme
→ Achieving scalable performance
Strategy to realize our approach
• Strategy
  – Develop an application independent of the number of cores → transparency
  – Run the developed application on a multiple-core processor and achieve performance proportional to the number of cores → scalable performance
• Scheme
  – Designed an efficient thread scheduler
    • Efficient management of threads may achieve scalable performance
    • The number of cores may be hidden if a thread scheduler abstracts the cores
• Challenges
  – Minimizing execution overheads
  – Hiding the number of cores from the application program
How to minimize overheads
• Defined unique properties for threads
  – A thread never suspends to wait for data
    • Eliminates the overhead of thread switching
  – A thread becomes ready to run when all necessary data are available
• Managed thread status using simple counters
  – Simplify the dependencies into "the number of dependencies"
    • This can be realized by simple operations
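A minimal sketch of the counter idea above, assuming each thread stores only the number of inputs it is still missing. The names `Task` and `notify` are hypothetical and are not taken from the presented scheduler.

```python
class Task:
    """A run-to-completion thread with a dependency counter (hypothetical names)."""

    def __init__(self, name, num_deps, body):
        self.name = name
        self.remaining = num_deps  # number of inputs still missing
        self.body = body           # runs without ever waiting for data


def notify(task, ready_queue):
    """Record that one input of `task` became available."""
    task.remaining -= 1            # a single decrement per arriving input
    if task.remaining == 0:        # all necessary data are available
        ready_queue.append(task)   # only now does the thread become ready


ready = []
log = []
c = Task("C", 2, lambda: log.append("C ran"))
notify(c, ready)      # first input arrives: counter 2 -> 1, not ready yet
notify(c, ready)      # second input arrives: counter 1 -> 0, now ready
ready.pop().body()    # runs to completion because its data already exist
```

Because a thread is queued only after its counter reaches zero, it has nothing to wait for once it starts, which is what removes the need for thread switching.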
How to hide the number of cores
• Designed a distributed scheduler with a shared queue
  – ONLY ready threads are placed in the shared queue
  – A thread dispatcher runs on each core
  – The dispatcher fetches a thread from the shared queue and executes it
• To reduce access conflicts on the shared queue
  – We use a CAS (Compare And Swap) instruction
[Figure: the thread dispatcher on each core searches the shared queue of threads, then fetches and executes one.]
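The following is a sketch of how per-core dispatchers could claim work from one shared queue with compare-and-swap. It rests on several assumptions: Python has no hardware CAS, so the `AtomicInt` class emulates one with a lock; the fixed-array queue layout and every name here are hypothetical and do not describe the actual implementation.

```python
import threading


class AtomicInt:
    """Stand-in for a hardware CAS word; the lock only emulates atomicity."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def dispatcher(queue, head, results):
    """Runs on each core: claim the next ready thread with CAS, then execute it."""
    while True:
        i = head.load()
        if i >= len(queue):
            return                           # no ready threads left
        if head.compare_and_swap(i, i + 1):  # claim slot i; on conflict, retry
            results[i] = queue[i]()


def run(num_cores, queue):
    """Start one dispatcher per core and collect each thread's result."""
    head = AtomicInt(0)
    results = [None] * len(queue)
    cores = [threading.Thread(target=dispatcher, args=(queue, head, results))
             for _ in range(num_cores)]
    for core in cores:
        core.start()
    for core in cores:
        core.join()
    return results


work = [lambda n=n: n * n for n in range(16)]  # 16 ready threads
one_core = run(1, work)
four_cores = run(4, work)
```

The list `work` never mentions a core count. Only `run` does, so the same work produces the same results on one dispatcher or four.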
Implemented thread scheduler
• Our thread scheduler consists of three components
  – Dependency Controller, Thread Pool, and Thread Dispatcher
• Our thread scheduler ...
  – has low overhead, for scalable performance
  – hides the number of cores from the application, for transparency
[Figure: the application registers threads with the Dependency Controller, which keeps a counter per thread. When the necessary data become available, a thread becomes ready and moves to the Thread Pool. The Thread Dispatcher on each core fetches and executes threads from the pool.]
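To make the flow between the three components concrete, here is a single-dispatcher sketch under stated assumptions. The `Scheduler` class, its `register` and `dispatch` methods, and the decoder-like stage names are all hypothetical; the real scheduler runs one dispatcher per core, whereas this sketch runs one loop to keep the data flow easy to follow.

```python
from collections import deque


class Scheduler:
    def __init__(self):
        self.counters = {}    # Dependency Controller: name -> missing inputs
        self.successors = {}  # name -> threads that consume its output
        self.bodies = {}
        self.pool = deque()   # Thread Pool: holds ready threads only

    def register(self, name, body, after=()):
        """Called by the application; it never mentions the number of cores."""
        self.bodies[name] = body
        self.counters[name] = len(after)
        self.successors.setdefault(name, [])
        for producer in after:
            self.successors.setdefault(producer, []).append(name)
        if not after:
            self.pool.append(name)  # no dependencies: ready immediately

    def dispatch(self):
        """Thread Dispatcher loop: fetch a ready thread and run it to completion."""
        order = []
        while self.pool:
            name = self.pool.popleft()
            self.bodies[name]()
            order.append(name)
            for consumer in self.successors[name]:
                self.counters[consumer] -= 1      # one more input available
                if self.counters[consumer] == 0:  # all data present: ready
                    self.pool.append(consumer)
        return order


s = Scheduler()
s.register("parse", lambda: None)
s.register("predict", lambda: None, after=["parse"])
s.register("residual", lambda: None, after=["parse"])
s.register("reconstruct", lambda: None, after=["predict", "residual"])
order = s.dispatch()
```

Here `reconstruct` enters the pool only after both `predict` and `residual` have finished, so the dispatcher never starts a thread that would have to wait.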
Designing a many-core processor
• Design goals for a many-core processor
  – Achieve scalable performance
  – Reuse existing software for a multi-core processor
    • A many-core processor has to execute existing software efficiently
    • Knowledge of the software is absolutely necessary
Software engineers and hardware engineers collaborated closely to design a many-core processor
• Design cycles
  – Use a "Plan – Evaluate – Analyze – Improve" cycle
  – Existing software is used throughout evaluation
  – In the 1st cycle: detect issues in the existing architecture
  – In the 2nd cycle: improve and optimize
• Main design features from our development cycle
  – CAS instruction, multi-bank L2 cache, tree-based network on chip, ...
[Figure: design cycle: Plan → Evaluate using simulation → Analyze → Improve]
Evaluation results
• Used the SAME application binary even when the number of cores is changed
These results confirm that the proposed thread scheduler achieves scalable performance with transparency!
[Figure: evaluation results for H.264 decoding (1080p) and super resolution (full HD to 4K2K), each annotated "Scalable performance". A further annotation marks a lack of READY threads: # of ready threads < # of MPEs.]
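The "lack of READY threads" annotation can be illustrated with a toy model. This is not the measured data: the wave widths below are hypothetical, every thread is assumed to take one time step, and scheduling overhead is ignored.

```python
import math


def steps(widths, cores):
    """Time steps needed when wave i exposes widths[i] ready unit-time threads."""
    return sum(math.ceil(w / cores) for w in widths)


widths = [8, 8, 8, 8]  # hypothetical: never more than 8 ready threads at once
base = steps(widths, 1)
speedup = {p: base / steps(widths, p) for p in (1, 2, 4, 8, 16, 32)}
```

In this model the speedup grows in proportion to the core count up to 8 cores and then stays flat, because beyond that point there are fewer ready threads than cores.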
Conclusions
• Proposed a low-overhead thread scheduler
  – It achieves scalable performance and transparency
  – Reduces thread execution overheads
    • Defined unique properties for a thread
      – A thread never suspends
      – A thread becomes ready when all necessary data are available
    • Managed thread status by the number of dependencies
  – Hides the number of cores
    • Designed a distributed scheduler with a shared queue
• Confirmed performance scalability and transparency
  – Evaluated on a real 32-core many-core processor
  – Scalable performance is achieved without modifying the application program
Our scheduler contributes to the reduction of software development cost