An Introduction to FPGAs with Python for Researchers

有限会社シンビー, 2017/4/10

Agenda
• FPGA and HPC (High-Performance Computing)
• FPGA and Deep Learning
• FPGA and HDL (Hardware Description Language)
• Introduction to Python (a scripting language)
• FPGA and Jupyter
• Introduction to Polyphony (high-level synthesis)

Introduction

Research
• Theme
• Tools
• Personal network

About me
• @ryos36
• Hashtag: #polyphony

FPGA and HPC

What tool does your own research need? A CPU? A GPU? An FPGA?

FPGA=Field-Programmable Gate Array

(Figure from http://www.ni.com/)

FPGA

(Image from Wikipedia)

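FPGA characteristics

Good at:
• Parallel computation
• Meeting latency requirements
• Bit-level computation
• Flexibility

Not good at:
• High-speed processing?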

Clock comparison

PYNQ-Z1: ~200M
GTX950-2G: 1024M
Tesla P100: 1328M
Core i7-7700K: 4200M

This is a pointless comparison.

Images are quoted from various places. Prices are for reference only, as far as I could find them.

Looking at price alone

6,556,568 yen!! (from Xilinx)

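Comparison

Item | FPGA (PYNQ) | GPU (GTX 950) | HPC GPU (Tesla P100) | CPU (i7 7700K) | HITACHI SR16000
Clock | ~200M | 1024M | 1328M | 4200M | 4000M
FLOPS (single precision) | - | 1317M | 9600G | 232G? | 8200G x N
FLOPS (double precision) | - | 41M | 4800G | 116G | 4100G x N
Cells | 85,000 | - | - | - | -
SM | - | 6 | 24 | - | -
CUDA cores | - | 768 | 3584 | - | -
CPU cores | - | - | - | 4(8) | 8 x 24 x N
Instruction set extensions | - | - | - | SSE4.1/4.2, AVX 2.0 | Altivec VSX
Process node (?) | 28nm | 28nm | 16nm | 14nm | 45nm
Strengths | Flexible parallel computation | Specialized for multiply-accumulate | Specialized for multiply-accumulate | Sequential processing | Sequential processing

SR16000: the latest(?) supercomputer as of 2011; number of nodes unknown.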


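FPGAs offer a lot of freedom and are easy to get hold of, so they are ideal for research!!

What matters is whether the idea scales.

Build it on an FPGA and check it quickly.

With papers, it is first come, first served!!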

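FPGA characteristics (recap)

Good at:
• Parallel computation
• Meeting latency requirements
• Bit-level computation
• Flexibility

Not good at:
• High-speed processing?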

Parallel processing on an FPGA

[Figure: a layout of many processing blocks]

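Pipelining on an FPGA

[Figure: processing blocks chained one after another]

Splitting the processing into finer stages makes it possible to go faster.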
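The effect of splitting work into stages can be sketched with a small cycle-by-cycle model in plain Python. The function names and the three-stage example are illustrative assumptions, not taken from the slides. The point is only that once the pipeline is full, one result completes per clock.

```python
def sequential_cycles(n_items, n_stages):
    """Cycles needed when each item must finish every stage before the next starts."""
    return n_items * n_stages


def simulate_pipeline(items, stages):
    """Shift items through one-cycle stages, one clock at a time.

    regs[i] models the register holding the output of stage i.
    Returns the results in order and the number of clock cycles used.
    """
    regs = [None] * len(stages)
    pending = list(items)
    results = []
    cycles = 0
    while len(results) < len(items):
        # Update from the last stage backwards so that every stage
        # reads the value its predecessor held before this clock edge.
        for i in range(len(stages) - 1, 0, -1):
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stages[0](pending.pop(0)) if pending else None
        cycles += 1
        if regs[-1] is not None:
            results.append(regs[-1])
    return results, cycles


# Hypothetical three-stage computation: (v + 1) * 2 - 3
stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
results, cycles = simulate_pipeline([1, 2, 3, 4], stages)
print(results, cycles, sequential_cycles(4, len(stages)))
```

With four items and three stages the pipelined model needs 3 + (4 - 1) = 6 clocks, against 12 when the items are processed one after another.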

Parallel processing on an FPGA

[Figure: processing blocks placed side by side]

Processing in parallel makes it possible to go faster.

FPGA flexibility (variable bit widths)

• 64 bit, 32 bit: typical register sizes
• 70 bit, 49 bit, 18 bit: other sizes you can build

Register sizes can be changed (and different sizes can be mixed).

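You can use your own number representation

Format | Exponent bits | Fraction bits
float (common floating point) | 8 | 23
double (common floating point) | 11 | 52
Custom representation | 14 | 55

Floating point: reference (figure from Wikipedia)

Securing the precision your research needs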
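The field widths on the slide (8/23 for float, 11/52 for double, 14/55 for the custom format) can be explored with a few lines of Python. This is a minimal sketch that assumes one sign bit in front of the exponent and fraction fields, as in the IEEE 754 formats. The helper names are hypothetical, and the slides do not say how the custom format is actually encoded.

```python
import struct


def float32_bits(value):
    """Return the 32-bit IEEE 754 bit pattern of value as an integer."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def split_fields(bits, exp_bits, frac_bits):
    """Split a bit pattern into (sign, exponent, fraction) for the given widths."""
    frac = bits & ((1 << frac_bits) - 1)
    exp = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    sign = bits >> (exp_bits + frac_bits)
    return sign, exp, frac


def total_width(exp_bits, frac_bits):
    """Register width needed: one sign bit plus the two fields (assumed layout)."""
    return 1 + exp_bits + frac_bits


print(split_fields(float32_bits(-1.5), 8, 23))
print(total_width(8, 23), total_width(11, 52), total_width(14, 55))
```

Under that assumption the 14/55 custom format needs a 70-bit register, a width that a CPU does not offer but an FPGA can.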

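Reference: inside Maxwell (GPU)

[Figure labels: units for transcendental functions; only one 64-bit floating-point unit!!; 32 32-bit floating-point units]

On readily available GPUs, 64-bit floating-point performance is 1/32 of the 32-bit figure!! The catalog performance is reached only when load/store works well and the 32-bit floating-point multiply-accumulate units are fully used.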

http://pc.watch.impress.co.jp/docs/column/kaigai/752331.html

Reference: supercomputers

See http://www.hitachi.co.jp/Prod/comp/hpc/SR_series/sr16000/feature.html for details.

Apparently the compiler does a lot of the work.

Multiply-accumulate itself is not especially fast. What you can expect is parallel operation across as many CPUs as there are. The SR16000 has 24 CPUs per node, and each CPU has eight POWER7 cores.

Comparison (recap)

• CPU (supercomputer): sequential processing, parallelized across the number of CPUs. Suited to things like distributed processing for shogi.
• GPU: multiply-accumulate can be done extremely fast. Suited to 3D rendering, convolution, and the like.

Summary
• CPUs (supercomputers) are not all-purpose
• GPUs are not all-purpose either (they fit particular fields)
• FPGAs are highly flexible, so they are useful for research

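FPGA and Deep Learning

Deep learning

[Figure: a multi-layer network whose outputs are per-digit probabilities: probability of 0, ..., probability of 3, ..., probability of 8, probability of 9]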

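Perceptron

[Figure: the inputs 1, x_1, x_2, weighted by b, w_1, w_2, are summed into a, and y = h(a)]

$a = b + w_1 x_1 + w_2 x_2$

A typical choice for h is the sigmoid function.

Sigmoid function

$h(a) = \dfrac{1}{1 + e^{-a}}$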

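An actual calculation

[Figure: the same perceptron annotated with the values 1.0, 0.1, 0.5, 0.1 and 0.2, giving a = 0.45 and y = h(a) = 0.610639234]

These are floating-point values. How much precision is really needed is unclear.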
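The calculation on this slide can be reproduced in a few lines of Python. The extracted slide text does not preserve which number belongs to which input or weight, so the values below are illustrative assumptions chosen to give a = 0.45. Only a and y are taken from the slide.

```python
import math


def sigmoid(a):
    """Sigmoid activation h(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def perceptron(x, w, b):
    """Return (a, y) with a = b + sum(w_i * x_i) and y = h(a)."""
    a = b + sum(wi * xi for wi, xi in zip(w, x))
    return a, sigmoid(a)


# Assumed example values (not the slide's own assignment): they give a = 0.45.
a, y = perceptron(x=[1.0, 0.5], w=[0.1, 0.5], b=0.1)
print(round(a, 6), round(y, 9))
```

Rounded to nine decimals, y matches the 0.610639234 shown on the slide.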

Convolution

Input (4×4):
1 2 3 0
0 1 2 3
3 0 1 2
2 3 0 1

Filter (3×3):
2 0 1
0 1 2
1 0 2

Output (2×2):
15 16
6 15
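The numbers on this slide can be checked with a direct sliding-window multiply-accumulate. This is a plain-Python sketch with no padding and a stride of 1. The function name is hypothetical, and a real framework would use an optimized routine instead.

```python
def conv2d(image, kernel):
    """2-D sliding-window multiply-accumulate (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            # One output element is a sum of kh * kw products.
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


image = [[1, 2, 3, 0],
         [0, 1, 2, 3],
         [3, 0, 1, 2],
         [2, 3, 0, 1]]
kernel = [[2, 0, 1],
          [0, 1, 2],
          [1, 0, 2]]
print(conv2d(image, kernel))  # [[15, 16], [6, 15]]
```

Each output element costs nine multiply-accumulate operations here, and none depends on another, so the elements can be computed in parallel.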

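This is matrix arithmetic, which GPUs are good at.

Supercomputers are (presumably) not good at it. They (presumably) cope by decomposing the domain to increase parallelism.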

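What do researchers actually do?

• Design the network with TensorFlow or similar
• Prepare the data, then compute and verify on a rented server
• Generate C source from the resulting network and parameters
• Compile it and run it on your own PC or server

(How it is really done is unknown to me.)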

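Using an FPGA

• Design the network with TensorFlow or similar
• Prepare the data, then compute and verify on a rented server
• Compile it and run it on an FPGA, instead of on your own PC or server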

TensorFlow

• Design networks in Python. http://qiita.com/icoxfog417/items/fb5c24e35a849f8e2c5d is a useful reference.

$y = x^2 + b$

import tensorflow as tf

def x2_plus_b(x, b):
    _x = tf.constant(x)
    _b = tf.constant(b)
    result = tf.square(_x)
    result = tf.add(result, _b)
    return result

[Figure: computation graph in which x goes through square, and the result goes through add together with b]

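The problem for FPGAs

[Figure: a perceptron in which the inputs 1, x_1, ..., x_n, weighted by b, w_1, ..., w_n, are summed into a, and y = h(a)]

$a = b + \sum_{i=1}^{n} w_i x_i$

Matrix arithmetic = multiply-accumulate, and in floating point at that.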

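Binarized NN

[Figure: the same perceptron (1, b, w_1 x_1, ..., w_n x_n summed into a, y = h(a)) with inputs and weights restricted to two values]

Product of two ±1 values:
x  w  x·w
-1 -1 +1
-1 +1 -1
+1 -1 -1
+1 +1 +1

The same table with -1 encoded as 0 (this is XNOR):
x w out
0 0 1
0 1 0
1 0 0
1 1 1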
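The two truth tables above describe the same operation, so a dot product of ±1 vectors can be computed with XNOR and a bit count instead of multiplications. The following is a minimal sketch of that idea. The function names and the bit encoding (+1 as 1, -1 as 0) are assumptions made for illustration and are not taken from the cited papers.

```python
def encode(values):
    """Pack a list of +1/-1 values into an int, one bit each (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def binary_dot(x_bits, w_bits, n):
    """Dot product of two +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(x_bits ^ w_bits) & mask   # XNOR: 1 where the two signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # (+1 per match) + (-1 per mismatch)


x = [1, -1, -1, 1, 1]
w = [-1, -1, 1, 1, -1]
reference = sum(xi * wi for xi, wi in zip(x, w))
print(binary_dot(encode(x), encode(w), len(x)), reference)
```

On an FPGA the XNOR and the bit count map onto simple logic, which is why no floating-point multiplier is needed for this part of the network.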

Binarized Neural Networks

• Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1

• XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

Someone has written an explanation in Japanese on Qiita: http://qiita.com/supersaiakujin/items/6adaf9731c9475891911

Summary
• With Binary Neural Networks, even an FPGA can compute fast, at speeds that do not lose to a GPU. Accuracy (reportedly) does not drop much either.
• Python is used in all sorts of places

FPGA and HDL

Let's implement something in HDL!?

• Languages used
– VHDL
– Verilog HDL

Blinking an LED ("L-chika"): the hardware equivalent of Hello World

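A quiz in VHDL

-- needs ieee.numeric_std; the slide declared std_logic, which cannot hold the value 3
signal x : unsigned(7 downto 0);
…
process (clk)
  variable y : unsigned(7 downto 0);
begin
  if clk'event and clk = '1' then
    y := x + 1;
    x <= x + 1;
  end if;
end process;

Q: Suppose x is currently 3. What values do the y and x on the left-hand sides take?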

A: y is 4, and x is still 3. It becomes 4 at the next clock.
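The difference between the variable assignment (:=) and the signal assignment (<=) can be mimicked in Python. This is only a toy model under simplifying assumptions: the Signal class and clock_edge function are hypothetical names, and real VHDL simulation semantics (delta cycles, sensitivity lists) are richer than this.

```python
class Signal:
    """Toy model of a VHDL signal: a write becomes visible only after update()."""

    def __init__(self, value):
        self.value = value
        self._next = value

    def assign(self, value):
        """Models `x <= value`: schedule the new value, do not apply it yet."""
        self._next = value

    def update(self):
        """Models the point where the process suspends and signals take new values."""
        self.value = self._next


def clock_edge(x):
    """Body of the process for one rising clock edge."""
    y = x.value + 1          # y := x + 1  (takes effect immediately)
    x.assign(x.value + 1)    # x <= x + 1  (only scheduled)
    x_seen = x.value         # inside the process x still has its old value
    x.update()
    return y, x_seen


x = Signal(3)
y, x_seen = clock_edge(x)
print(y, x_seen, x.value)  # 4 3 4
```

The old value of x stays readable for the whole clock, so a value and its successor exist side by side, which is exactly what a pipeline register needs.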

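What does this mean?

Pipelining on an FPGA (recap)

[Figure: processing blocks chained one after another, with x and x + 1 held in adjacent stages]

Splitting the processing into finer stages makes it possible to go faster.

You have to design while keeping the overlapping timing in your head!!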

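To begin with, about 15 years ago...

Even today, verification means looking at waveforms.

I wrote a book for HDL developers
• ARM Cortex-A9×2! Zynq でワンチップ Linux on FPGA

Published on 2014/11/25

It was number 1 on Amazon, if only for a moment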

Using an FPGA (recap)

• Design the network with TensorFlow or similar
• Prepare the data, then compute and verify on a rented server
• Compile it and run it on an FPGA, instead of on your own PC or server

A technique called HLS can turn C/C++ into HDL.

This is (I think) the research of Prof. Nakahara at Tokyo Institute of Technology. It is my own guess based on https://www.slideshare.net/HirokiNakahara1/cnn-on-fpgagpu

HLS(High Level Synthesis)

• Vivado HLS/SDSoC– C/C++

• Intel FPGA SDK for OpenCL
– OpenCL (C-like)

• Polyphony
– Python

I wrote a magazine article about this.

Summary
• Use HLS for FPGA development!!
– The rest of this talk is about Polyphony (Python based)

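Introduction to Python

Easy Python with Jupyter

Things to watch out for when using Python
• For some reason there are both Python 2 and Python 3
– They differ subtly between versions
– Version 3.X is recommended here

Version 2.X (feels like a scripting language):
>>> print 1, 2, 3
1 2 3

Version 3.X (feels like an ordinary language):
>>> print (1, 2, 3)
1 2 3

Recommended book for beginners

List price: 3,800 yen + tax = 4,104 yen

Note that it covers Version 2, but as far as I tried, it works for Version 3 too.

The next step after the introduction

There are many Python books of this kind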

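Is Python strong at math?

BLAS (Basic Linear Algebra Subprograms)
Basic linear algebra operations on vectors and matrices

LAPACK (Linear Algebra PACKage)
A numerical analysis software library for linear computations

These can be used as back ends:
• OpenBLAS
• ATLAS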

Actually trying it: using Jupyter

Try it!!

http://jupyter.org/

Will it be all right if everyone accesses it at once?

Create a new notebook (Python3)

Hello World

Run

Hello World function

The whitespace here really matters

Python program example

g_v0 = 20150529
g_v1 = 20170406

def func(n):
    if n % 2:
        return g_v0
    else:
        return g_v1

The whitespace here really matters. TAB = 4 characters is recommended.

Python program example

def func(x):
    ans = 0
    iterLeft = x
    while ( iterLeft != 0 ):
        ans = ans + x
        iterLeft = iterLeft - 1
        if True :
            print("x = ", x, ", ans = ", ans, ", iterLeft = ", iterLeft)
    return ans

Variables need no declaration. A variable that is assigned to is local.

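What are the rules for local and global variables in Python?

In Python, variables that are only referenced inside a function are implicitly global. If a variable is assigned a value anywhere within the function's body, it is assumed to be local unless explicitly declared as global.

Though a bit surprising at first, a moment's consideration explains this. On one hand, requiring global for assigned variables provides a bar against unintended side effects. On the other hand, if global were required for all global references, you would be using global all the time. You would have to declare as global every reference to a built-in function or to a component of an imported module. This clutter would defeat the usefulness of the global declaration for identifying side effects.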

http://docs.python.jp/3/faq/programming.html#what-are-the-rules-for-local-and-global-variables-in-python

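Global and Local

• Variables that are only referenced inside a function are implicitly global.
• If a variable is assigned a value anywhere in the function body, it is treated as local.

def func(x):
    ans = g_v + x
    return ans

Rules:
ans: local, because it is assigned
g_v: global, because it is only referenced
x: local, because it is an argument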
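The rules above can be tried directly. The following sketch uses hypothetical names (g_v, counter and the four functions) to show a read-only global, a local that shadows a global, an explicit global declaration, and the error raised when a function assigns to a name it also reads first.

```python
g_v = 10
counter = 0


def read_only(x):
    ans = g_v + x       # g_v is only referenced, so the global is used
    return ans


def shadow():
    g_v = 99            # assignment makes g_v local; the global is untouched
    return g_v


def bump():
    global counter      # explicit declaration is needed to rebind the global
    counter = counter + 1
    return counter


def broken():
    counter = counter + 1   # counter is local here, so reading it first fails
    return counter


print(read_only(1), shadow(), g_v, bump())
try:
    broken()
except UnboundLocalError as exc:
    print("UnboundLocalError:", exc)
```

The last case is the one that usually surprises newcomers: the assignment anywhere in the body makes the name local for the whole function, including the read on the right-hand side.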

Fibonacci Number

#from polyphony import testbench

def fib(n):
    if n <= 0: return 0
    if n == 1: return 1
    r0 = 0
    r1 = 1
    for i in range(n-1):
        prev_r1 = r1
        r1 = r0 + r1
        r0 = prev_r1
    return r1

#@testbench
def test():
    expect = [0,1,1,2,3,5,8,13,21,34,55,89,144,233,377,610]
    for i in range(len(expect)):
        result = fib(i)
        assert expect[i] == result
        print(i, "=>", result)

test()

Summary
• Use Python!!
– Watch the whitespace
– Watch out for global versus local
– Jupyter makes things convenient

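FPGA and Jupyter

Here I simply show an example of combining an FPGA with Python

PYNQ: Python Productivity For Zynq

An academic effort, (probably) out of Columbia University, to get the most out of Zynq, an ARM+FPGA board, using Python

Who is the target?

Software developers

Key technologies

Live demos
• Light an LED
• Display something on the OLED
• Use a sensor

Binary NN demo
• Cifar10
– A demo that classifies images into 10 categories

Bonus: apparently IkaLog can analyze Splatoon

https://www.nintendo.co.jp

Quoted without permission from @hasegaw. It apparently uses k-nearest neighbors.

Summary
• With PYNQ you can use the hardware without knowing FPGAs
• Binary Neural Networks can be used too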

Introduction to Polyphony (high-level synthesis)

Programming FPGAs in Python

First, LED blinking...
• It can be done like this

That is what I would like to say, but I will not demo it, because the direction is probably different from what you are after.

from polyphony import testbench, module, is_worker_running
from polyphony.io import Bit

@module
class Blink:
    def __init__(self):
        self.led = Bit(0)
        # main() takes no extra arguments, so only the worker itself is passed
        self.append_worker(self.main)

    def main(self):
        led = 1
        while is_worker_running():
            self.led.wr(led)
            led = ~led
            self._wait(10000000)

    def _wait(self, interval):
        for i in range(interval):
            pass

blink = Blink()

IoT-style things you can do with Polyphony
• SPI access
• I2C access

Acquire data from an A/D converter over SPI, apply an LPF (low-pass filter), and send it to the CPU
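As an illustration of the filtering step, here is a minimal sketch of a first-order low-pass filter that uses only integer additions and shifts, a style that maps easily to hardware. The slides do not say which filter was used, so the filter type, the function name and the shift parameter are all assumptions.

```python
def low_pass(samples, shift=2):
    """First-order IIR low-pass: y += (x - y) / 2**shift, in integer arithmetic."""
    y = 0
    out = []
    for x in samples:
        # An arithmetic shift replaces the division, as it would in hardware.
        y = y + ((x - y) >> shift)
        out.append(y)
    return out


# A step input: the output approaches the input level gradually.
print(low_pass([16, 16, 16, 16]))  # [4, 7, 9, 10]
```

A larger shift gives stronger smoothing at the cost of a slower response.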

You can do this with an Arduino or a Raspberry Pi too. When it comes to A/D, an Arduino is a better choice than a Raspberry Pi.

If you have the grit, use an FPGA

Not covered this time

Setup (1/4)

• Install Python3
> python3 --version
Python 3.4.5

sudo apt-get install python3-dev

Setup (2/4)

• Install pip3
> pip3 --version
pip 8.1.1 from /lib/python3.4/site-packages (python 3.4)

sudo apt-get install python3-pip

Setup (3/4)

• Install polyphony
> pip3 install polyphony
> polyphony -V
Polyphony 0.3.0

Setup (4/4)

• Install iverilog
sudo apt-get install iverilog

> iverilog -V
Icarus Verilog version 11.0 (devel) (s20150603-148-g24d1f49)

Copyright 1998-2015 Stephen Williams

Hello World

hello.py:

from polyphony import testbench

def hello():
    print("Hello World.")

@testbench
def test():
    hello()

test()

Running Hello World with Python3

> python3 hello.py
Hello World.

Important!! With HLS, squash the bugs in software first

Compiling and running Hello World

> polyphony hello.py
> ls *.v
hello.v polyphony_out.v test.v
> iverilog -o hello polyphony_out.v test.v
> ./hello
 0:Hello World.Hello World.Hello World.Hello World. 150:finishHello World.

• polyphony: compile
• ls: check the .v files
• iverilog: compile with iverilog

There is no need to look inside *.v, and no need to look at waveforms either

Download simu.py
• A Python script that performs the previous steps in one go
https://github.com/ktok07b6/polyphony/blob/master/simu.py

Click to download

Using simu.py

> ../bin/simu.py hello.py
 0:Hello World.Hello World.Hello World.Hello World. 150:finishHello World.
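To make the idea of doing the steps in one go concrete, the three commands from the earlier slide can be chained from Python. This is a hedged sketch and not how simu.py itself is implemented. The function names are hypothetical, and the generated file names (polyphony_out.v, test.v) are simply those seen in the hello.py run above.

```python
import os
import subprocess


def build_commands(source, top="hello"):
    """Return the three commands from the slide as argument lists."""
    return [
        ["polyphony", source],                                  # Python -> Verilog
        ["iverilog", "-o", top, "polyphony_out.v", "test.v"],   # Verilog -> simulator
        [os.path.join(".", top)],                               # run the simulation
    ]


def run_all(source, top="hello"):
    """Run each step in order and stop at the first failure."""
    for cmd in build_commands(source, top):
        subprocess.run(cmd, check=True)


for cmd in build_commands("hello.py"):
    print(" ".join(cmd))
```

Calling run_all("hello.py") requires polyphony and iverilog to be installed. The loop above only prints the commands.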

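Hello World (recap)

hello.py:

from polyphony import testbench

def hello():
    print("Hello World.")

@testbench
def test():
    hello()

test()

• hello(): the code that will presumably run on the FPGA side
• print cannot be used on real hardware

Key points
• Code that runs on the FPGA side is a function
• print is not executed on real hardware
– It can be executed in simulation
• Terminology
– testbench: in short, a program for testing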

Slightly more practical code

from polyphony import testbench

def mul_plus(a, b, c, d):
    return a * b + c * d

@testbench
def test():
    assert 17 == mul_plus(1, 2, 3, 4)
    assert 62 == mul_plus(4, 5, 6, 7)

test()

assert: a consistency check

Running with Python3

> python3 mul_plus.v
Traceback (most recent call last):
  File "mul_plus.v", line 11, in <module>
    test()
  File "…./__init__.py", line 30, in _testbench_decorator
    func()
  File "mul_plus.v", line 8, in test
    assert 17 == mul_plus(1, 2, 3, 4)
AssertionError

Error!!!

Important!! With HLS, squash the bugs in software first

Fixing the bug

from polyphony import testbench

def mul_plus(a, b, c, d):
    return a * b + c * d

@testbench
def test():
    assert 14 == mul_plus(1, 2, 3, 4)  # bug fixed: 1 * 2 + 3 * 4 is 14
    assert 62 == mul_plus(4, 5, 6, 7)

test()

Running with Python3

> python3 mul_plus.v

Nothing is printed, but that is the correct result

Important!! With HLS, squash the bugs in software first

Running with simu.py

> ../bin/simu.py mul_plus.v
  0:mul_plus_0_in_a= x, mul_plus_0_in_b= x, mul_plus_0_in_c= x, mul_plus_0_in_d= x, mul_plus_0_out_0= x
 10:mul_plus_0_in_a= 0, mul_plus_0_in_b= 0, mul_plus_0_in_c= 0, mul_plus_0_in_d= 0, mul_plus_0_out_0= 0
110:mul_plus_0_in_a= 1, mul_plus_0_in_b= 2, mul_plus_0_in_c= 3, mul_plus_0_in_d= 4, mul_plus_0_out_0= 0
130:mul_plus_0_in_a= 1, mul_plus_0_in_b= 2, mul_plus_0_in_c= 3, mul_plus_0_in_d= 4, mul_plus_0_out_0= 14
160:mul_plus_0_in_a= 4, mul_plus_0_in_b= 5, mul_plus_0_in_c= 6, mul_plus_0_in_d= 7, mul_plus_0_out_0= 14
180:mul_plus_0_in_a= 4, mul_plus_0_in_b= 5, mul_plus_0_in_c= 6, mul_plus_0_in_d= 7, mul_plus_0_out_0= 62
220:finish

The number at the start of each line is the execution time

Fibonacci Number

from polyphony import testbench

def fib(n):
    if n <= 0: return 0
    if n == 1: return 1
    r0 = 0
    r1 = 1
    for i in range(n-1):
        prev_r1 = r1
        r1 = r0 + r1
        r0 = prev_r1
    return r1

@testbench
def test():
    expect = [0,1,1,2,3,5,8,13,21,34,55,89,144,233,377,610]
    for i in range(len(expect)):
        result = fib(i)
        assert expect[i] == result
        print(i, "=>", result)

test()

Run it in Jupyter

Q&A
• @ryos36
• Hashtag: #polyphony

Looking for people to write papers with me

The future of Polyphony
• Building a CPU
• Deep Learning
• Bayes
• Numerical computation
– With guaranteed accuracy?

Software
