[F6]sooin lang vlab

[email protected]

파이어 [email protected]

2.1 Execution management

2.2 Asynchronous event queue

2.3 Garbage collection and memory management

2.4 List, Hash, and Dictionary – added bonus

3.1 Current status

3.2 Future plans

Language specific layer

Parser, Scanner

Middle-end layer

Control Flow Analysis, Data Flow Analysis, Constant folding, …

Back-end layer

GPU

Execution layer

Translation

Interpretation

JIT Compiler

Run-time Optimizer

Symbol table

Event loop

Garbage collector

Reference graph serializer

Mark phase, Sweep phase

• Initial implementation in IDL: took more than 24 hours

• IDL + C library (single CPU): took ~4 hours

• Pthread?

• MPI?

• A cheaper and faster way?

Deploy

Easy, Easy

Modern syntax

GPU utilization: Fast, Fast

Standalone model

• Everything on GPU

• VM self-contained in GPU

Cooperative model

• CPU delegates every arithmetic operation to the GPU

• Variables persist in GPU RAM

Plugin model

• Some highly mathematical computations are done on GPU

• Variables are ping-ponged to GPU on demand

Execution management

• A new execution kernel cannot be launched from inside the GPU -> the VM cannot dispatch there

• Decoding is branch-heavy, and on a GPU branches have a severe impact on speed

• Compilation and JIT are essential

• Execution flow control is handled by the CPU

• The CPU's event loop launches one kernel per block to achieve non-blocking parallelism
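The last point can be pictured with a small host-side sketch. It is only an illustration under stated assumptions: a thread pool stands in for the GPU command queue, the "kernels" are ordinary Python callables, and `run_blocks` is a hypothetical name that does not come from the project.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def run_blocks(blocks):
    """Launch every block without waiting, then collect completions in an event loop."""
    events = Queue()
    results = {}
    # The thread pool is a stand-in for a GPU command queue (assumption).
    with ThreadPoolExecutor(max_workers=4) as device:
        for name, kernel in blocks:
            future = device.submit(kernel)          # returns immediately
            # The callback only posts a completion event; name is bound per block.
            future.add_done_callback(
                lambda f, name=name: events.put((name, f.result())))
        # Event loop: waits only when no completion event is ready.
        while len(results) < len(blocks):
            name, value = events.get()
            results[name] = value
    return results

blocks = [("block1", lambda: 1 + 2), ("block2", lambda: 3 * 4)]
results = run_blocks(blocks)
```

The control flow stays on the CPU side, and each block is handed off as one unit of work, which is the division of labour the slide describes.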

Memory management

• A symbol table cannot be implemented inside the GPU -> local and global memory cannot be synchronized

• Memory cannot be allocated dynamically inside the GPU -> there is no malloc for global memory -> we tried to reserve a large region up front and implement malloc in software, but the main board cannot do partial reads/writes

• The symbol table is managed on the CPU

• A variable's info goes to main memory and its content goes to the GPU (under the redundancy option, a copy of the content is kept on both sides)

• malloc is implemented in local memory

• The mark phase of garbage collection is GPU-accelerated
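As a rough picture of the info/content split, the sketch below keeps variable metadata in a host-side table and the values in a separate store. Everything here is an assumption made for illustration: `VarInfo`, `SymbolTable`, the integer `handle`, and the dictionary that stands in for GPU buffers are hypothetical and not taken from the project.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VarInfo:
    dtype: str
    length: int
    handle: int                        # hypothetical id of the device buffer
    host_copy: Optional[list] = None   # filled only under the redundancy option

class SymbolTable:
    def __init__(self, redundancy=False):
        self.redundancy = redundancy
        self.info = {}       # variable info, kept in main memory
        self.device = {}     # stand-in for GPU buffers (assumption)
        self._next_handle = 0

    def store(self, name, dtype, values):
        handle = self._next_handle
        self._next_handle += 1
        self.device[handle] = list(values)            # content goes to the GPU side
        copy = list(values) if self.redundancy else None
        self.info[name] = VarInfo(dtype, len(values), handle, copy)

    def load(self, name):
        entry = self.info[name]
        if entry.host_copy is not None:               # served without a device read
            return list(entry.host_copy)
        return list(self.device[entry.handle])

table = SymbolTable(redundancy=True)
table.store("x", "int", [1, 2, 3])
```

With `redundancy=False` only the metadata lives on the host, so every `load` has to go through the device store.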




I/O

• Software DMA (Direct Memory Access) cannot be implemented -> global memory cannot be read while a kernel is running

• Software interrupts cannot be implemented -> we tried saving the kernel's context and escaping mid-execution with a trap, but leaving through a trap corrupts global memory

• A kernel that contains I/O instructions is split at the points where an I/O operation is needed, and the pieces are uploaded separately
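The splitting rule in the last bullet might look like the sketch below. It is a minimal illustration, assuming instructions are plain strings and that `READ`, `WRITE`, and `PRINT` are the I/O opcodes; those opcode names and `split_at_io` are hypothetical.

```python
IO_OPS = {"READ", "WRITE", "PRINT"}   # hypothetical I/O opcodes

def split_at_io(instructions):
    """Split an instruction list into compute segments separated by I/O steps."""
    segments, current = [], []
    for ins in instructions:
        opcode = ins.split()[0]
        if opcode in IO_OPS:
            if current:                        # close the compute segment so far
                segments.append(("kernel", current))
                current = []
            segments.append(("io", [ins]))     # the I/O step runs on the CPU side
        else:
            current.append(ins)
    if current:
        segments.append(("kernel", current))
    return segments

program = ["PUSHI2 1, 2", "ADD", "STORE result",
           "PRINT result",
           "PUSHI2 3, 4", "ADD"]
segments = split_at_io(program)
```

Each `"kernel"` segment can then be compiled and launched on its own, so no kernel ever has to pause in the middle for I/O.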





The project's focus is building a high-speed execution environment for a modern language.

For the initial proof of concept, we needed to pick an existing language that is reasonably modern and specialized for technical computing.

We chose the Julia language and implemented a subset of it -> Sooin Lang

Figure: Original code, Basic block 1, …; Reduced block 1, Reduced block 2, Reduced block 3, …; Out 1, Out 2, Out 3, …

Stage labels: JIT and Dependency analysis; Promise delay chain; Reduction by partial evaluation

Figure: I/O, Event Queue, Event Loop, Thread Pool (File I/O, Network I/O, Other), GPU Dispatch

Figure: GPU and Main memory timelines alternating between occupied and free. Event labels: I/O Request, Memory full, Write request, Garbage collection, Read/Write request, Read/Write age calibration

For a script language to be worthy of the name …

GPU global memory appears to have no addressable space … there is no malloc

We decided to port data packing (flattening) to the GPU and use that

• The modules presented here have been implemented

• ~20,000 lines of code, more to go

• In initial tests, a ~200x speed boost compared with Python (CPU only)

• However, at present it works only some of the time

• The three developers work on Windows, Mac, and Linux machines, so portability has been secured out of necessity

• Planned for open-source release after the alpha stage

• Debugging, debugging …

• More code …

• More test cases …

• More performance tests …

Parallelization and acceleration at the IR -> OpenCL level

Implement on-site writes for List, Hash, and Dictionary on the GPU

Implement concurrent garbage collection

Code refactoring and structural refinement

Figure: Block 1, Block 2, Block 3, Block 4, …; Runtime Optimizer; Just-in-time compilation; Execution on GPU; Partial Evaluation

Just-in-time compilation

PUSHI2 1, 2
ADD
STORE result

Opcode          OpenCL C code
PUSHI2 O1, O2   stack[sp++] = O1; stack[sp++] = O2;
ADD             sp -= 2; stack[sp] = stack[sp] + stack[sp + 1]; sp++;
STORE O1        O1 = stack[--sp];

Compiled Block 1 (OpenCL C translation)

__kernel void block1 (__global struct var *result);
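The opcode table above suggests a template-based translator. The sketch below is one way such a step could work, not the project's actual compiler. The names `TEMPLATES`, `translate`, and `simulate` are hypothetical, operands are assumed to be comma-separated, and the ADD template avoids changing `sp` inside an expression that also reads it.

```python
# One OpenCL C statement template per opcode, following the table above.
TEMPLATES = {
    "PUSHI2": "stack[sp++] = {0}; stack[sp++] = {1};",
    "ADD": "sp -= 2; stack[sp] = stack[sp] + stack[sp + 1]; sp++;",
    "STORE": "{0} = stack[--sp];",
}

def parse(ins):
    """Split 'OPCODE a, b' into ('OPCODE', ['a', 'b'])."""
    opcode, _, rest = ins.partition(" ")
    operands = [o.strip() for o in rest.split(",")] if rest else []
    return opcode, operands

def translate(bytecode):
    """Emit one line of OpenCL C source per bytecode instruction."""
    lines = []
    for ins in bytecode:
        opcode, operands = parse(ins)
        lines.append(TEMPLATES[opcode].format(*operands))
    return lines

def simulate(bytecode):
    """Interpret the same bytecode in Python to check what the emitted code computes."""
    stack, env = [], {}
    for ins in bytecode:
        opcode, operands = parse(ins)
        if opcode == "PUSHI2":
            stack.extend(int(o) for o in operands)
        elif opcode == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "STORE":
            env[operands[0]] = stack.pop()
    return env

block1 = ["PUSHI2 1, 2", "ADD", "STORE result"]
body = translate(block1)
```

The emitted lines would form the body of the compiled kernel. How `result` maps onto the `__global struct var *result` parameter is not shown in the slides, so the sketch treats it as a plain name.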

Promise delay chain: f = 2 * Promise(x1) + Promise(x2)

f(x1, x2) = 2 * x1 + x2

x1 fulfilled with 10

Residualize: f*(x2) = 20 + x2
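The residualization step can be reproduced with a tiny expression tree. This is a minimal sketch that assumes expressions are nested tuples of the form `(op, left, right)`, with only `+` and `*` supported; `residualize` is a hypothetical name.

```python
def residualize(expr, env):
    """Fold every subexpression whose inputs are known; keep the rest symbolic."""
    if isinstance(expr, (int, float)):
        return expr
    if isinstance(expr, str):              # a variable, i.e. a pending promise
        return env.get(expr, expr)         # fulfilled -> its value, else unchanged
    op, left, right = expr
    left = residualize(left, env)
    right = residualize(right, env)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return left * right if op == "*" else left + right
    return (op, left, right)

# f(x1, x2) = 2 * x1 + x2
f = ("+", ("*", 2, "x1"), "x2")

# x1 is fulfilled with 10, x2 is still pending.
f_star = residualize(f, {"x1": 10})

# Later, x2 arrives and the residual expression collapses to a number.
value = residualize(f_star, {"x2": 5})
```

Here `f_star` is the tree for `20 + x2`, matching the slide, and only the cheap residual remains to be evaluated once `x2` is fulfilled.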

Multidimensional data -> Msgpack-CL -> one-dimensional data

{'a': 1, 'b': 2} -> "\x82\xa1a\x01\xa1b\x02"
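The byte string above is the standard MessagePack encoding of that dictionary. The sketch below reproduces it on the CPU with a deliberately tiny encoder that handles only the fixmap, fixstr, and positive fixint formats. It is not Msgpack-CL itself, and `pack` is a hypothetical helper.

```python
def pack(obj):
    """Flatten small ints, short strings, and small dicts into MessagePack bytes."""
    if isinstance(obj, int) and 0 <= obj <= 0x7F:
        return bytes([obj])                          # positive fixint: 0x00-0x7f
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) <= 31:
            return bytes([0xA0 | len(raw)]) + raw    # fixstr: 0xa0 + length
    if isinstance(obj, dict) and len(obj) <= 15:
        out = bytes([0x80 | len(obj)])               # fixmap: 0x80 + entry count
        for key, value in obj.items():
            out += pack(key) + pack(value)
        return out
    raise ValueError(f"not supported by this sketch: {obj!r}")

flat = pack({"a": 1, "b": 2})
```

The nested structure becomes one contiguous byte array, which is the form that can be copied to the GPU as a single buffer.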

Memory allocation

malloc(sizeof(char)*5): search the free bitmap; a table holds the offset and length of each allocated block

Free bitmap: 0 1 1 0 0 0 0 0 0 1 1 1 0
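A first-fit search over the bitmap shown above can be sketched as follows. This is an illustration only: `BitmapAllocator` is a hypothetical name, first-fit is an assumed policy, and the cells that are already occupied in the starting bitmap have no table entries here.

```python
class BitmapAllocator:
    def __init__(self, bitmap):
        self.bitmap = list(bitmap)   # 1 = occupied cell, 0 = free cell
        self.table = {}              # offset -> length of each live allocation

    def malloc(self, size):
        """Return the offset of the first free run of `size` cells, or None."""
        run = 0
        for i, used in enumerate(self.bitmap):
            run = 0 if used else run + 1
            if run == size:
                offset = i - size + 1
                self.bitmap[offset:i + 1] = [1] * size
                self.table[offset] = size
                return offset
        return None                  # no free run is long enough

    def free(self, offset):
        size = self.table.pop(offset)
        self.bitmap[offset:offset + size] = [0] * size

heap = BitmapAllocator([0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
ptr = heap.malloc(5)   # corresponds to malloc(sizeof(char)*5)
```

Keeping the offset/length table separate from the bitmap means `free` needs only the offset, as with an ordinary `free(ptr)`.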

Object layout: header (marked?, # of refs), followed by reference 1, reference 2, …, reference n

Queues: input queue, output queue

Mark phase kernel

• Fetch the mark flag and the number of references of the corresponding object and write them to local memory.

• Compute the offset to write to in the output queue using a prefix sum.

• Write the objects that the object references to the output queue.

Local memory:

nrefs:                2 0 1 0 0 3 4 1

prefix_sum (offset):  0 2 2 3 3 3 6 10

Marked objects
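The three kernel steps can be followed in a sequential sketch. It makes several assumptions: objects are dictionaries with `marked` and `refs` fields, an already marked object contributes zero references, and the per-slot loops stand in for parallel work items. `mark_step`, `mark`, and `exclusive_prefix_sum` are hypothetical names.

```python
def exclusive_prefix_sum(values):
    """out[i] is the sum of values[0..i-1], i.e. where slot i starts writing."""
    total, out = 0, []
    for v in values:
        out.append(total)
        total += v
    return out

def mark_step(objects, input_queue):
    """One pass: mark the queued objects and emit their references."""
    # Step 1: read the mark flag and reference count of each queued object.
    nrefs = [0 if objects[o]["marked"] else len(objects[o]["refs"])
             for o in input_queue]
    # Step 2: the prefix sum gives every slot a private range in the output queue.
    offsets = exclusive_prefix_sum(nrefs)
    output_queue = [None] * sum(nrefs)
    # Step 3: each slot writes its references at its own offset, then marks.
    for slot, obj_id in enumerate(input_queue):
        obj = objects[obj_id]
        start, count = offsets[slot], nrefs[slot]
        output_queue[start:start + count] = obj["refs"][:count]
        obj["marked"] = True
    return output_queue

def mark(objects, roots):
    queue = list(roots)
    while queue:                      # the output queue feeds the next pass
        queue = mark_step(objects, queue)

objects = {
    "A": {"marked": False, "refs": ["B", "C"]},
    "B": {"marked": False, "refs": ["C"]},
    "C": {"marked": False, "refs": []},
    "D": {"marked": False, "refs": ["A"]},   # unreachable from the root
}
mark(objects, ["A"])
```

Because the offsets are computed before any write happens, no two slots write to the same position, so the pass needs no locking on the output queue.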
