NCKU
Chen, Pin Chieh SoC & ASIC Lab 1
VLSI CAD
Dataflow Execution of Sequential Imperative Programs on Multicore Architectures
Chen, Pin Chieh
Department of Electrical Engineering, National Cheng Kung University
Tainan, Taiwan, R.O.C.
Introduction
Proposes a novel execution model that enables statically-sequential programs to execute in parallel.
Programs for the model are easy to develop yet expose parallelism: written in an ordinary imperative language (e.g., C++), a statically sequential program can achieve parallel dataflow execution.
The sequential program is dynamically parallelized and executed in dataflow fashion on multiple processing cores.
Dataflow Execution of Sequential Imperative Programs (1/6)
In the dataflow model, program execution is data-driven: whereas control flow executes instructions sequentially, here an instruction executes as soon as its operands become available. Data-dependent instructions thus execute in order automatically, while independent instructions can execute in parallel.
To adopt a similar model successfully in a multicore environment, several known and new challenges must be addressed:
a practical programming paradigm; dependences; resource management; principles for applying the model on multicores.
Dataflow Execution of Sequential Imperative Programs (2/6)
Dataflow on Multicores
Traditional dataflow machines exploit instruction-level parallelism (ILP). Here the granularity is raised to functions, which have already been used for task-level computing, promote code reuse, and better match the scale of multicores.
Executing functions on individual cores in dataflow fashion achieves Function-Level Parallelism (FLP).
Dataflow Execution of Sequential Imperative Programs (3/6)
Dataflow on Multicores
At run time, the sequential program is processed one function at a time (rather than one instruction at a time). A function's operands must be identified before it executes; operands are extended from individual registers or memory addresses to objects.
A function's input operands form its read set, its output operands form its write set, and together they are called its data set.
The sets are built with the C++ STL.
Dataflow Execution of Sequential Imperative Programs (4/6)
Dataflow on Multicores
Every object in a data set has an identity, used to establish the data dependences between functions. The runtime determines whether the function about to execute depends on any currently-executing function(s):
If not dependent, it is submitted (or delegated) to a core for execution.
If dependent, the function is shelved until the dependences are resolved.
In either case, the runtime moves on to the next function of the sequential program.
Dataflow Execution of Sequential Imperative Programs (5/6)
Handling Dependences
Tokens are used to handle the dependences in the dataflow mechanism, adopting techniques similar to ordinary dataflow machines but with two key refinements:
1. Traditionally, tokens are associated with individual memory addresses; instead, tokens are associated with objects, achieving data abstraction.
2. Each object is given multiple read tokens but only a single write token, which governs how data is created and consumed.
Dataflow Execution of Sequential Imperative Programs (6/6)
Handling Dependences
When a function is to run under dataflow execution:
The function requests read (write) tokens for the objects in its read (write) set.
Once the function has acquired all required tokens, it is ready and may execute.
When a function finishes, it relinquishes all of its tokens to shelved functions that need them.
When a shelved function acquires all of its tokens, it is unshelved and executed.
The model can also execute a given function sequentially: the function runs only after all operations before it in program order have completed, and subsequent operations must wait for the function to finish.
Dataflow Execution of Sequential Imperative Programs (1/7)
Model Overview - Example
Figure 1. (a) Example pseudocode that invokes functions T and T'. T: {write set} {read set} modifies (reads) objects in its write set (read set). The data set of T' is unknown.
Figure 1. (b) Dynamic invocations of the functions T and T', in program order, and the data set of each invocation.
Dataflow Execution of Sequential Imperative Programs (2/7)
Model Overview – Data dependence between the functions
Figure 1. (c) Dataflow graph of the dynamic function stream: nodes T1-T6, with edges labeled by the shared objects (A, B, D) and dependence types (RAW, WAR, WAW).
Dataflow Execution of Sequential Imperative Programs (3/7)
Model Overview – Execution of the code as per model – t1
Figure 1. (d) Dataflow execution schedule of the function stream: processors execute T1-T6 over times t1-t4, with a barrier before T'.
At t1:
T1 successfully acquires the read token for object A and the write tokens for B & C → submitted for execution.
T2 successfully acquires the read token for object A and the write token for D → submitted for execution.
T6 successfully acquires the read token for object H and the write token for G → submitted for execution.
Dataflow Execution of Sequential Imperative Programs (4/7)
Model Overview – Execution of the code as per model – t2
T1 completes execution → releases the write tokens for B & C → T4 acquires write token B but still lacks read token D.
Dataflow Execution of Sequential Imperative Programs (5/7)
Model Overview – Execution of the code as per model – t3
T2 completes execution → releases write token D and read token A → T3 acquires write token A and starts execution → T4 acquires read token D and starts execution.
Dataflow Execution of Sequential Imperative Programs (6/7)
Model Overview – Execution of the code as per model – t4
T4 completes execution → releases write token B and read token D → T5 acquires write token B and starts execution.
Dataflow Execution of Sequential Imperative Programs (7/7)
Model Overview – Execution of the code as per model – after t4
T' will be submitted for execution after all previous functions complete.
Dataflow Execution of Sequential Imperative Programs (1/1)
Deadlock Avoidance
If two or more functions form a circular token dependence, the token mechanism of an ordinary dataflow model can deadlock. For example, invocations T4 and T5 may create the request order:
T4: acquires B → T5: waits for B → T5: acquires D → T4: waits for D → deadlock.
To avoid token deadlocks, ensure that (i) a token is requested by only one function at a time, and (ii) an object grants its tokens in the order in which functions request them (first requested, first granted).
Hence T5's tokens can be claimed only after T4's.
Prototype Implementation (1/4)
A software prototype of the execution model is implemented as a C++ runtime library.
Static Sequential Program
To express imperative languages fully, the model lets a program switch between dataflow and sequential execution. Programs for the model are currently written in C++, like conventional sequential programs. In particular, the user must identify:
the potential parallelism between functions; the objects shared between functions; the read set and the write set.
Prototype Implementation (2/4)
Static Sequential Program
Dataflow Functions: the library provides the df_execute interface, through which program functions execute in parallel.
df_execute is a runtime function implemented with C++ templates; statements outside df_execute execute in their given order.
Shared data, in the form of global, passed-by-reference objects or pointers
to them, that are accessed by a function are passed to it as arguments.
Users group them into two sets, one that may be modified (write set) and another that is only read (read set).
The C++ STL-based set data structure of the token base class is used to create them.
Figure 2. Example program in the proposed model.
Prototype Implementation (3/4)
Static Sequential Program
Serial Segments/Functions: the user can return to sequential execution through the df_end interface. df_end is similar to a barrier, taking program execution out of dataflow mode: once the code before df_end has finished executing, the code that follows runs sequentially.
T' in our example.
Figure 2. Example program in the proposed model.
Prototype Implementation (4/4)
Static Sequential Program
Serial Segments/Functions: the df_seq interface is provided so that a shared object can be operated on in program order from within the main program.
df_seq accepts the object instance, the function (object method) pointer, and any arguments to it.
df_seq waits for every earlier function that uses the object to finish before it executes, and the code that follows waits for df_seq to complete. That is, df_seq causes the runtime to suspend the main program context
until the associated function finishes operating on the specified object.
Execution will proceed from line 6 only after print finishes, but potentially in parallel with other (prior) functions
(that are not accessing G).
Figure 2. Example program in the proposed model.
Runtime Mechanics (1/10)
The mechanism is implemented with multiple threads; parallel execution is realized with the Pthread API. Thread management is abstracted away, so users never deal with it directly and need not understand the mechanism's underlying architecture.
Executing Functions on Processing Cores
At the start of a program, the runtime creates threads, usually one per
hardware context available to it. A double-ended work queue (deque) is then assigned to each thread
in the system. Computations are scheduled for execution by a thread by queuing
them in the corresponding work deque.
Runtime Mechanics (2/10)
Discovering Functions for Parallel Execution
At the start of execution, only one processor waits for work to arrive in its deque; all other processors are idle. This initial phase resembles sequential execution; the runtime is activated when a df_execute call is encountered.
Runtime Mechanics (3/10)
Discovering Functions for Parallel Execution
The runtime processes a dataflow function in three decoupled phases: prelude, execute, and postlude.
Figure 3. Logical view of runtime operations to process a dataflow function.
Runtime Mechanics (4/10)
Discovering Functions for Parallel Execution
In the prelude phase, the runtime dereferences pointers to objects in the read/write sets, if need be, and attempts to acquire the tokens.
Execute phase:
Successful acquisition of tokens leads to the execute phase (Figure 3: 2), in which the function is delegated for (potentially parallel) execution
Specifically, the runtime pushes the program continuation (remainder of the program past the df_execute call) onto the thread's work deque, and executes the function on the same thread.
Runtime Mechanics (5/10)
Discovering Functions for Parallel Execution
A task-stealing scheduler running on each hardware context causes an idle processor to steal the program continuation and continue its execution until it encounters the next df_execute, repeating the process of delegation and pushing the program continuation onto its own work deque.
Thus the execution of the program unravels in parallel with the executing functions, possibly across different hardware contexts rather than on a single one.
Runtime Mechanics (6/10)
Tokens and Dependency Tracking
During program execution, each allocated object has:
one write token;
an unlimited number of read tokens (limited only by the number of bits used to represent tokens);
one wait list.
Tokens are acquired for the objects that dataflow functions operate on, and released when the functions complete.
Runtime Mechanics (7/10)
Tokens and Dependency Tracking
A token may be granted only if it is available.
Figure 4a gives the definition of availability of read and write tokens,
and Figure 4b shows the token acquisition protocol.
The wait list is used to track functions to which the token could not be granted at the time of their requests.
A non-empty wait list signifies pending requests, in the enlisted order.
An available token is not granted if an earlier function enqueued in the wait list is waiting to acquire it (Figure 4b: 1).
Runtime Mechanics (8/10)
Tokens and Dependency Tracking
Figure 4. The token protocol: (a) Definition of availability
Figure 4. The token protocol: (b) Read/Write token acquisition
Runtime Mechanics (9/10)
Shelving Functions/Program Continuations
If a function's tokens cannot be acquired, the function is enqueued in the wait lists of all the objects for which tokens could not be acquired (Figure 4b: 4 or 5), and subsequently shelved (Figure 3: 1.2).
While the shelved function waits for its dependences to resolve, the runtime looks for other independent work to perform from the program continuation.
Figure 4. The token protocol: (b) Read/Write token acquisition
Figure 3.
Runtime Mechanics (10/10)
Completion of Function Execution
Figure 4. The token protocol: (c) Token release.
Figure 3. Logical view of runtime operations to process a dataflow function.
Example Execution (1/15)
Figure 5. Example execution: initial state; every object's read-token count is 0 and CPU0-CPU2 are idle.
Example Execution (2/15)
Figure 5. Example execution: T1 starts on a CPU (AR = 1; two objects' write tokens are held by T1).
Example Execution (3/15)
Figure 5. Example execution: a second CPU steals the continuation and T2 starts (AR = 2; T2 holds a write token).
Example Execution (4/15)
Figure 5. Example execution: another CPU steals the continuation and T3 starts (FR = 1; T3 holds a write token).
Example Execution (5/15)
Figure 5. Example execution: T4 is shelved.
Example Execution (6/15)
Figure 5. Example execution: T5 is shelved.
Example Execution (7/15)
Figure 5. Example execution: T6 starts (HR = 1; T6 holds a write token).
Example Execution (8/15)
Figure 5. Example execution: T1 completes (AR = 1); T4 acquires a write token but can't execute yet.
Example Execution (9/15)
Figure 5. Example execution: the main program reaches df_seq.
df_seq causes the runtime to shelve the program continuation beyond df_seq in G's wait list and await the completion of all functions accessing G.
Example Execution (10/15)
Figure 5. Example execution: AR = 0 (the read tokens on A have been returned).
Example Execution (11/15)
Figure 5. Example execution: two read tokens are out for D (DR = 2).
Example Execution (12/15)
Figure 5. Example execution: T4 runs (DR = 2).
T5 will be scheduled for execution once T4 completes
Example Execution (13/15)
Figure 5. Example execution: G.print() runs via df_seq.
After print completes, the runtime schedules the program continuation for execution.
Example Execution (14/15)
Figure 5. Example execution: T5 runs (DR = 1).
The continuation is shelved again, preventing further processing of the program until all in-flight functions finish.
Example Execution (15/15)
Figure 5. Example execution: all functions have completed (all token counts are back to 0); T' executes.
Conclusion
Presented a novel execution model that achieves function-level parallel execution of statically-sequential imperative programs on multicore processors.
Parallel tasks (program functions) are dynamically extracted from a sequential program and executed in dataflow fashion on multiple processing cores, using tokens associated with shared data objects and a token protocol to manage the dependences between tasks.
This combines the benefits of sequential programming and dataflow execution.