NCKU
Chen, Pin Chieh SoC & ASIC Lab 1
VLSI CAD
Dataflow Execution of Sequential Imperative Programs on Multicore Architectures
Chen, Pin Chieh
Department of Electrical Engineering, National Cheng Kung University
Tainan, Taiwan, R.O.C.
Introduction
Proposes a novel execution model that enables statically-sequential programs to execute in parallel.
Programs for the model are easy to develop yet expose parallelism: written in an ordinary imperative language (e.g., C++), a statically sequential program can achieve parallel dataflow execution.
The sequential program is dynamically parallelized and executed in dataflow fashion on multiple processing cores.
Dataflow Execution of Sequential Imperative Programs (1/6)
In the dataflow model, program execution is data-driven: whereas control flow executes instructions sequentially, here an instruction executes as soon as its operands become available. Data-dependent instructions thus execute in order automatically, while independent instructions can execute in parallel.
To adopt a similar model successfully in a multicore environment, several known and new challenges must be addressed:
a practical programming paradigm; dependences; resource management; principles for applying the model on multicores.
Dataflow Execution of Sequential Imperative Programs (2/6)
Dataflow on Multicores
Traditional dataflow machines exploit instruction-level parallelism (ILP). Here the granularity is raised to functions, which have already been used for task-level computing, promote code reuse, and better match the scale of multicores.
Executing functions on individual cores in dataflow fashion achieves Function-Level Parallelism (FLP).
Dataflow Execution of Sequential Imperative Programs (3/6)
Dataflow on Multicores
At run time, the sequential program is processed one function at a time (rather than one instruction at a time). A function's operands must be identified before it executes; operands are extended from individual registers or memory addresses to objects.
A function's input operands form its read set, its output operands form its write set, and together they are called its data set.
The sets are built with the C++ STL.
Dataflow Execution of Sequential Imperative Programs (4/6)
Dataflow on Multicores
Every object in a data set has an identity, used to establish the data dependences between functions. The runtime determines whether the function about to execute depends on any currently-executing function(s):
If not dependent, it is submitted (or delegated) to a core for execution.
If dependent, the function is shelved until the dependences are resolved.
In either case, the runtime moves on to the next function of the sequential program.
Dataflow Execution of Sequential Imperative Programs (5/6)
Handling Dependences
Tokens are used to handle the dependences in the dataflow mechanism, adopting techniques similar to ordinary dataflow machines but with two key refinements:
1. Traditionally, tokens are associated with individual memory addresses; instead, tokens are associated with objects, achieving data abstraction.
2. Each object is given multiple read tokens but only a single write token, which governs how data is created and consumed.
Dataflow Execution of Sequential Imperative Programs (6/6)
Handling Dependences
When a function is to run under dataflow execution:
The function requests read (write) tokens for the objects in its read (write) set.
Once the function has acquired all required tokens, it is ready and may execute.
When a function finishes, it relinquishes all of its tokens to shelved functions that need them.
When a shelved function acquires all of its tokens, it is unshelved and executed.
The model can also execute a given function sequentially: the function runs only after all operations before it in program order have completed, and subsequent operations must wait for the function to finish.
Dataflow Execution of Sequential Imperative Programs (1/7)
Model Overview - Example
Figure 1. (a) Example pseudocode that invokes functions T and T'. T: {write set} {read set} modifies (reads) objects in its write set (read set). The data set of T' is unknown.
Figure 1. (b) Dynamic invocations of the functions T and T', in program order, and the data set of each invocation.
Dataflow Execution of Sequential Imperative Programs (2/7)
Model Overview – Data dependence between the functions
Figure 1. (c) Dataflow graph of the dynamic function stream: nodes T1-T6, with edges labeled by the shared objects (A, B, D) and dependence types (RAW, WAR, WAW).
Dataflow Execution of Sequential Imperative Programs (3/7)
Model Overview – Execution of the code as per model – t1
Figure 1. (d) Dataflow execution schedule of the function stream: processors execute T1-T6 over times t1-t4, with a barrier before T'.
At t1:
T1 successfully acquires the read token for object A and the write tokens for B & C → submitted for execution.
T2 successfully acquires the read token for object A and the write token for D → submitted for execution.
T6 successfully acquires the read token for object H and the write token for G → submitted for execution.
Dataflow Execution of Sequential Imperative Programs (4/7)
Model Overview – Execution of the code as per model – t2
T1 completes execution → releases the write tokens for B & C → T4 acquires write token B but still lacks read token D.
Dataflow Execution of Sequential Imperative Programs (5/7)
Model Overview – Execution of the code as per model – t3
T2 completes execution → releases write token D and read token A → T3 acquires write token A and starts execution → T4 acquires read token D and starts execution.
Dataflow Execution of Sequential Imperative Programs (6/7)
Model Overview – Execution of the code as per model – t4
T4 completes execution → releases write token B and read token D → T5 acquires write token B and starts execution.
Dataflow Execution of Sequential Imperative Programs (7/7)
Model Overview – Execution of the code as per model – after t4
T' will be submitted for execution after all previous functions complete.
Dataflow Execution of Sequential Imperative Programs (1/1)
Deadlock Avoidance
If two or more functions form a circular token dependence, the token mechanism of an ordinary dataflow model can deadlock. For example, invocations T4 and T5 may create the request order:
T4: acquires B → T5: waits for B → T5: acquires D → T4: waits for D → deadlock.
To avoid token deadlocks, ensure that (i) a token is requested by only one function at a time, and (ii) an object grants its tokens in the order in which functions request them (first requested, first granted).
Hence T5's tokens can be claimed only after T4's.
Prototype Implementation (1/4)
A software prototype of the execution model is implemented as a C++ runtime library.
Static Sequential Program
To express imperative languages fully, the model lets a program switch between dataflow and sequential execution. Programs for the model are currently written in C++, like conventional sequential programs. In particular, the user must identify:
the potential parallelism between functions; the objects shared between functions; the read set and the write set.
Prototype Implementation (2/4)
Static Sequential Program
Dataflow Functions: the library provides the df_execute interface, through which program functions execute in parallel.
df_execute is a runtime function implemented with C++ templates; statements outside df_execute execute in their given order.
Shared data, in the form of global, passed-by-reference objects or pointers
to them, that are accessed by a function are passed to it as arguments.
Users group them into two sets, one that may be modified (write set) and another that is only read (read set).
The C++ STL-based set data structure of the token base class is used to create them.
Figure 2. Example program in the proposed model.
Prototype Implementation (3/4)
Static Sequential Program
Serial Segments/Functions: the user can return to sequential execution through the df_end interface. df_end is similar to a barrier, taking program execution out of dataflow mode: once the code before df_end has finished executing, the code that follows runs sequentially.
T' in our example.
Figure 2. Example program in the proposed model.
Prototype Implementation (4/4)
Static Sequential Program
Serial Segments/Functions: the df_seq interface is provided so that a shared object can be operated on in program order from within the main program.
df_seq accepts the object instance, the function (object method) pointer, and any arguments to it.
df_seq waits for every earlier function that uses the object to finish before it executes, and the code that follows waits for df_seq to complete. That is, df_seq causes the runtime to suspend the main program context
until the associated function finishes operating on the specified object.
Execution will proceed from line 6 only after print finishes, but potentially in parallel with other (prior) functions
(that are not accessing G).
Figure 2. Example program in the proposed model.
Runtime Mechanics (1/10)
The mechanism is implemented with multiple threads; parallel execution is realized with the Pthread API. Thread management is abstracted away, so users never deal with it directly and need not understand the mechanism's underlying architecture.
Executing Functions on Processing Cores
At the start of a program, the runtime creates threads, usually one per
hardware context available to it. A double-ended work queue (deque) is then assigned to each thread
in the system. Computations are scheduled for execution by a thread by queuing
them in the corresponding work deque.
Runtime Mechanics (2/10)
Discovering Functions for Parallel Execution
At the start of execution, only one processor waits for work to arrive in its deque; all other processors are idle. This initial phase resembles sequential execution; the runtime is activated when a df_execute call is encountered.
Runtime Mechanics (3/10)
Discovering Functions for Parallel Execution
The runtime processes a dataflow function in three decoupled phases: prelude, execute, and postlude.
Figure 3. Logical view of runtime operations to process a dataflow function.
Runtime Mechanics (4/10)
Discovering Functions for Parallel Execution
In the prelude phase, the runtime dereferences pointers to objects in the read/write sets, if need be, and attempts to acquire the tokens.
Execute phase:
Successful acquisition of tokens leads to the execute phase (Figure 3: 2), in which the function is delegated for (potentially parallel) execution
Specifically, the runtime pushes the program continuation (remainder of the program past the df_execute call) onto the thread's work deque, and executes the function on the same thread.
Runtime Mechanics (5/10)
Discovering Functions for Parallel Execution
A task-stealing scheduler running on each hardware context causes an idle processor to steal the program continuation and continue its execution until it encounters the next df_execute, repeating the process of delegation and pushing the program continuation onto its own work deque.
Thus the execution of the program unravels in parallel with the executing functions, possibly across different hardware contexts rather than on a single one.
Runtime Mechanics (6/10)
Tokens and Dependency Tracking
During program execution, each allocated object has:
one write token;
an unlimited number of read tokens (limited only by the number of bits used to represent tokens);
one wait list.
Tokens are acquired for the objects that dataflow functions operate on, and released when the functions complete.
Runtime Mechanics (7/10)
Tokens and Dependency Tracking
A token may be granted only if it is available.
Figure 4a gives the definition of availability of read and write tokens,
and Figure 4b shows the token acquisition protocol.
The wait list is used to track functions to which the token could not be granted at the time of their requests.
A non-empty wait list signifies pending requests, in the enlisted order.
An available token is not granted if an earlier function enqueued in the wait list is waiting to acquire it (Figure 4b: 1).
Runtime Mechanics (8/10)
Tokens and Dependency Tracking
Figure 4. The token protocol: (a) Definition of availability
Figure 4. The token protocol: (b) Read/Write token acquisition
Runtime Mechanics (9/10)
Shelving Functions/Program Continuations
If a function's tokens cannot be acquired, the function is enqueued in the wait lists of all the objects for which tokens could not be acquired (Figure 4b: 4 or 5), and subsequently shelved (Figure 3: 1.2).
While the shelved function waits for its dependences to resolve, the runtime looks for other independent work to perform from the program continuation.
Figure 4. The token protocol: (b) Read/Write token acquisition
Figure 3.
Runtime Mechanics (10/10)
Completion of Function Execution
Figure 4. The token protocol: (c) Token release.
Figure 3. Logical view of runtime operations to process a dataflow function.
Example Execution (1/15)
Figure 5. Example execution: initial state; every object's read-token count is 0 and CPU0-CPU2 are idle.
Example Execution (2/15)
Figure 5. Example execution: T1 starts on a CPU (AR = 1; two objects' write tokens are held by T1).
Example Execution (3/15)
Figure 5. Example execution: a second CPU steals the continuation and T2 starts (AR = 2; T2 holds a write token).
Example Execution (4/15)
Figure 5. Example execution: another CPU steals the continuation and T3 starts (FR = 1; T3 holds a write token).
Example Execution (5/15)
Figure 5. Example execution: T4 is shelved.
Example Execution (6/15)
Figure 5. Example execution: T5 is shelved.
Example Execution (7/15)
Figure 5. Example execution: T6 starts (HR = 1; T6 holds a write token).
Example Execution (8/15)
Figure 5. Example execution: T1 completes (AR = 1); T4 acquires a write token but can't execute yet.
Example Execution (9/15)
Figure 5. Example execution: the main program reaches df_seq.
df_seq causes the runtime to shelve the program continuation beyond df_seq in G's wait list and await the completion of all functions accessing G.
Example Execution (10/15)
Figure 5. Example execution: AR = 0 (the read tokens on A have been returned).
Example Execution (11/15)
Figure 5. Example execution: two read tokens are out for D (DR = 2).
Example Execution (12/15)
Figure 5. Example execution: T4 runs (DR = 2).
T5 will be scheduled for execution once T4 completes
Example Execution (13/15)
Figure 5. Example execution: G.print() runs via df_seq.
After print completes, the runtime schedules the program continuation for execution.
Example Execution (14/15)
Figure 5. Example execution: T5 runs (DR = 1).
The continuation is shelved again, preventing further processing of the program until all in-flight functions finish.
Example Execution (15/15)
Figure 5. Example execution: all functions have completed (all token counts are back to 0); T' executes.
Conclusion
Presented a novel execution model that achieves function-level parallel execution of statically-sequential imperative programs on multicore processors.
Parallel tasks (program functions) are dynamically extracted from a sequential program and executed in dataflow fashion on multiple processing cores, using tokens associated with shared data objects and a token protocol to manage the dependences between tasks.
This combines the benefits of sequential programming and dataflow execution.