Machine Learning at PFN and HPC
Keisuke Fukuda
Preferred Networks, Inc.
Who we are? Preferred Networks, Inc. (PFN): a Tokyo-based Deep Learning & IoT company
Our Strategic Partners and Collaborators
2015: OPTIMIZATION OF BIN-PICKING FANUC ROBOTS
• Picking random objects is a typical task that is "easy for humans, hard for robots".
@CES 2016: CARS THAT DON’T CRASH
• Car positions are tracked from a ceiling camera and each car is controlled individually.
• White cars are autonomous.
• The red car is a manually controlled "evil" car, trying to disrupt the other cars.
"Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions", arXiv:1710.06280
@ICRA 2017 VOICE RECOGNITION + OBJECT PICKING
• ICRA is a top-tier conference on robotics
• Best Paper Award on Human-Robot Interaction
• Technologies:
  • Visual recognition
  • Natural language processing (NLP)
• The robot can understand ambiguous words:
  • "The Teddy bear"
  • "The brown fluffy stuff"
@CEATEC JAPAN 2018 Autonomous Tidying-up Robot System
https://projects.preferred.jp/tidying-up-robot/
Technologies behind the demos: Distributed Deep Learning
MN-1: An in-house supercomputer
• MN-1a (Sep. '17~)
  – 1024 NVIDIA Tesla P100
  – InfiniBand FDR
  – Peak 9.3 PetaFLOPS (SP)
  – #227 in Top500, Nov. 2018
• MN-1b (July '18~)
  – 512 NVIDIA Tesla V100 32GB
  – InfiniBand EDR
  – Peak 56 Peta (tensor) FLOPS
• Targeting ExaFLOPS by 2020
Chainer: A Flexible Deep Learning Framework
• Define-and-Run (Caffe2, TensorFlow, etc.): the model definition is first compiled into a computational graph and gradient functions; the training data is then run through that fixed graph.
• Define-by-Run (Chainer, PyTorch, TensorFlow Eager Execution, etc.): the computational graph and gradient functions are built on the fly while the model definition runs on the training data.
[Diagram: data flow of Define-and-Run vs. Define-by-Run]
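The Define-by-Run idea can be sketched in a few lines. This is a minimal, illustrative autodiff, not Chainer's actual implementation: each operation records its backward function while the forward pass executes, so the recorded graph always matches the Python control flow that actually ran.

```python
# Minimal define-by-run sketch (illustrative only): every arithmetic op
# records its backward function as it executes, building the graph on
# the fly rather than compiling it ahead of time.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # (parent, local_backward_fn) pairs

    def __add__(self, other):
        return Var(self.value + other.value,
                   parents=((self, lambda g: g), (other, lambda g: g)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   parents=((self, lambda g: g * other.value),
                            (other, lambda g: g * self.value)))

    def backward(self, g=1.0):
        self.grad += g
        for parent, fn in self.parents:
            parent.backward(fn(g))

x = Var(3.0)
y = x * x + x        # the graph is built by running this very line
y.backward()         # x.grad == 2*3 + 1 == 7.0
```

Because the graph is rebuilt on every forward pass, ordinary `if`/`for` statements can change the network shape per input, which is what makes dynamic NNs straightforward in this style.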
ChainerMN: Distributed Training with Chainer
• Add-on package for Chainer
• Enables multi-node distributed deep learning using NVIDIA NCCL2
Features:
• Scalable: near-linear scaling with hundreds of GPUs
• Flexible: even GANs, dynamic NNs, and RL are applicable
Distributed Training with ChainerMN
[Diagram: each worker runs Forward and Backward on its own minibatch, the gradients are combined with an All-Reduce, and every worker then applies the same Optimize step]
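One synchronous data-parallel step can be simulated in a few lines of NumPy. This is a toy sketch, not ChainerMN's actual code: each worker computes a gradient on its own minibatch, the gradients are averaged (the Allreduce), and all replicas apply the identical update, which keeps the model copies in sync.

```python
import numpy as np

N_WORKERS = 4
replicas = [np.zeros(6) for _ in range(N_WORKERS)]  # identical model copies

def local_gradient(w, worker_id):
    # stand-in for Forward + Backward on this worker's own minibatch
    shard_rng = np.random.default_rng(worker_id)
    return w + shard_rng.normal(size=w.shape)

grads = [local_gradient(w, i) for i, w in enumerate(replicas)]
avg_grad = np.mean(grads, axis=0)                   # the All-Reduce step
replicas = [w - 0.1 * avg_grad for w in replicas]   # the Optimize step

# replicas stay bit-identical because all saw the same averaged gradient
assert all(np.array_equal(replicas[0], w) for w in replicas)
```

The averaged gradient is mathematically equivalent to the gradient of one large minibatch, which is why this scheme scales the effective batch size with the number of workers.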
Achievement on MN-1a: ImageNet in 15 minutes
[Chart: training time in minutes of ResNet-50 (90 epochs) on ImageNet, showing the reduction from Nov. 2017 to Nov. 2018]
arXiv:1711.04325: "Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes"
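Some rough arithmetic behind the 15-minute run helps see why extremely large minibatches matter. arXiv:1711.04325 reports a minibatch of 32,768 images across the 1,024 P100 GPUs of MN-1a (32 per GPU); with the standard ImageNet-1k training-set size, 90 epochs then need only a few thousand parameter updates:

```python
# Back-of-the-envelope iteration count for the 15-minute ImageNet run
# (minibatch size 32,768 as reported in arXiv:1711.04325).
gpus = 1024
global_batch = 32 * gpus             # 32 images per GPU -> 32,768
train_images = 1_281_167             # ImageNet-1k training-set size
epochs = 90
iterations = epochs * train_images // global_batch
print(iterations)                    # 3518 parameter updates in total
```

So few updates make optimizer tricks (learning-rate warm-up, scaling) essential, which is the core subject of that paper.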
Achievement on MN-1b: PFDet in OIC 2018
• Google AI Open Images – Object Detection Track
  – Competition using one of the largest image datasets of its class
  – 12 million bounding boxes, 1.7 million images
  – 454 competitors
  – Approx. 500 GB (annotated subset)
• Object detection is a much harder task than object recognition
Achievement on MN-1b: PFDet in OIC 2018
Object Detection: detecting objects in an image by predicting...
• bounding boxes that contain them
• the category of each object
• We won 2nd place (0.023% behind the 1st)
Achievement on MN-1b: PFDet in OIC 2018
Open Sourcing PFDet:
We may make the implementation public
The technical report is already on arXiv: arXiv:1809.00778
Computation resources used in PFDet
• A single training run of 16 epochs takes 33 hours with 512 V100 GPUs of MN-1b
• Repeated model development & parameter tuning
Timeline:
• June 3rd: first commit
• June 29th: FPN added
• Aug. 31st: finish
Autonomous Tidying-up Robot
Integration of a wide range of DL:
• Object detection (based on PFDet)
• Audio recognition
• NLP
• Picking planning
Technical topics
1. Communication & Fault tolerance
2. Storage
3. Resource Management
Communication
Allreduce always matters
• Critical component for distributed deep learning
• NVIDIA NCCL is the de-facto standard communication library
• Typical usage in ChainerMN:
  – MPI for process coordination
  – NCCL for actual communication
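A back-of-the-envelope cost model shows why allreduce dominates. For the ring algorithm NCCL uses on large buffers, each rank sends about 2*(N-1)/N * M bytes per allreduce of an M-byte buffer, nearly independent of N. The parameter count below is an assumption (roughly ResNet-50 sized):

```python
# Per-rank traffic of one ring allreduce (the large-buffer algorithm in
# NCCL): ~2*(N-1)/N * M bytes for an M-byte buffer.
params = 25_600_000                 # ~25.6M parameters (assumed, ~ResNet-50)
buf = params * 4                    # float32 gradients -> ~102.4 MB
n = 512                             # number of GPUs / ranks
per_rank_bytes = 2 * (n - 1) / n * buf
print(per_rank_bytes / 1e6)         # ~204.4 MB sent per rank per iteration
```

Roughly 200 MB of traffic per rank on every iteration is why interconnect bandwidth (InfiniBand, NVLink) directly bounds scaling efficiency.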
Communication
From our research blog: https://bit.ly/2Pt2Gcg
[Chart: allreduce performance comparison (lower is better) of Open MPI 2.1.3 (configured as CUDA-aware), NCCL, and our demonstration-purpose prototype library; NCCL is the fastest production-ready library]
Communication
NCCL is now open-sourced 🎉
• Very interesting internal architecture
• NCCL still has some bugs in large-scale execution... but we know what to do with OSS software :)
Communication: open questions (1)
• NCCL is optimized for bandwidth-bound (i.e., large-buffer) communication
• What about short messages?
  – Inter-process Batch Normalization
  – Model parallelism
• Fault tolerance?
  – NCCL now supports communication timeouts...
  – But the MPI standard does not support FT (yet?)
Communication: open questions (2)
• Do we still use MPI...?
  – Legacy
  – The software stack is huge: installation and maintenance is hard for non-HPC users
  – No fault tolerance
  – No separation of process coordination and communication
  – Hard to use with Kubernetes
Resource Management
• We deploy Kubernetes on our computing cluster.
• Why Kubernetes?
  – Advanced features from the cloud-computing community
    • Fault tolerance, dynamic job size management, preemption, etc.
  – We also deploy server-based services (such as JupyterHub, CI, etc.) on the same cluster
• Still many challenges
  – Batch job scheduling, etc.
Diverse experimental environments
• Still images, video, audio, and text
• Deep learning, distributed deep learning, and reinforcement learning
• Containers, with a private Docker registry
• NFS for both large-scale and small-scale data
• InfiniBand exposed as a schedulable resource via a Device Plugin
• CPU/GPU selection via Node Selector
• Pod Affinity places pods on physical servers as few network hops apart as possible
k8s-host-device-plugin - https://github.com/everpeace/k8s-host-device-plugin
Icon pack by Icons8 - https://icons8.com; icons by iconspng - https://www.iconspng.com
Efficient scheduling
• Hundreds to thousands of jobs every day
  – Trial-and-error experiments
  – Hyperparameter tuning
• kube-throttler: limits how many Pods each user may run per PriorityClass (e.g., high: 1, medium: 5, low: unlimited)
• Preemption based on PriorityClass
• Custom scheduler built with Operators/CRDs and webhooks
  – Scheduling tailored to distributed deep learning
  – Priority weighting by waiting time
kube-throttler - https://github.com/everpeace/kube-throttler
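The per-PriorityClass throttling policy mentioned above (at most 1 high-, 5 medium-, and unlimited low-priority pods per user) can be sketched as an admission check. This is a toy model, not kube-throttler's actual implementation:

```python
import math

# Toy sketch of per-PriorityClass throttling (illustrative, not
# kube-throttler's real code): limits per user per priority class.
LIMITS = {"high": 1, "medium": 5, "low": math.inf}

def can_schedule(running, priority):
    """running: priorities of the user's currently running pods."""
    return running.count(priority) < LIMITS[priority]

running = ["high", "medium", "medium", "low", "low", "low"]
assert not can_schedule(running, "high")    # already at the limit of 1
assert can_schedule(running, "medium")      # 2 of 5 slots used
assert can_schedule(running, "low")         # never throttled
```

Capping high-priority pods this tightly keeps preemption-capable jobs rare, so one user cannot starve the rest of the cluster.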
Multi-tenancy
• Per-user and per-project environments (e.g., Namespace User A, Namespace User B, Namespace Project X)
• Supports confidential projects
• Access control to each Namespace via RBAC
• Traffic control via Network Policy
• NFS with Kerberos authentication; access governed by LDAP user IDs
• LDAP user IDs are set via PSP and Admission Webhooks
• Per-Namespace metrics
Ensuring flexibility
• Researchers with widely varying infrastructure literacy
• No lock-in to any specific tool
• Free customization of the experiment environment
• Debugging directly inside the execution environment
• Usable from various tools: in-house experiment tools, kubectl
• NFS mounted as the home directory, so nodes can be used like physical servers
• "I just want to use GPUs easily"
• kubectl exec to enter a running Pod and debug
Grand challenges
• Experiments that use all resources
  – ImageNet in 15 minutes (Tesla P100 x 1024)
  – PFDet in the Kaggle 2018 Google AI Open Images (Tesla V100 x 512)
• Everyday jobs run under the assumption that they can be preempted
• Grand-challenge jobs get the highest priority
• Experiments using all GPUs can therefore start at any time
Resource Management: Open Questions
• Preemption is a strong tool for efficient resource management
  • How should the DL framework and the scheduler cooperate?
  • Job preemption/restarting & flexible resource (re-)allocation
• Resource isolation
  • Bandwidth isolation for InfiniBand?
  • NUMA- / topology-aware pod placement?
NOTE: We operate an in-house cluster, so all jobs are our own.
The relationship between HPC and AI
• HPC for AI
• AI for HPC
• Convergence of HPC and AI (generalization of AI technology)
HPC for AI
• HPC's contribution to AI
  – DL/ML is one more HPC application; without HPC there is no AI
• Challenges ahead
  – Compute (FLOPS): more
  – Memory: more
  – Storage
    • Security, access control, and encryption
  – Evolution and integration of schedulers and computing environments
    • Containers / guaranteed reproducibility / coexistence with services (JupyterHub etc.) / effective use of idle compute resources
AI for HPC
• HPC as a target for AI
  – Parameter surveys
    • Meta-parameter tuning matured, driven by the sheer number of DL hyperparameters
    • → now flowing back into HPC
  – Power efficiency, failure detection, job scheduling, etc.
    • Reinforcement learning is a particular key
    • Li, "Transforming Cooling Optimization for Green Data Center via Deep Reinforcement Learning", arXiv:1709.05077v4
    • Mirhoseini et al., "Device Placement Optimization with Reinforcement Learning", ICML 2017
Convergence of HPC and AI (generalization of AI technology)
• Beyond HPC, AI is becoming a core technology of computer science as a whole
  – c.f. "From Deduction to Induction: A New System Development Paradigm", Hiroshi Maruyama, PPL2018 invited talk
  – Rather than something special, it may come to be used widely as just another implementation technique
• Many HPC applications may be a good fit for AI:

Where AI fits:
• Data is abundant (or can be generated)
• Errors are tolerable
• The phenomenon is complex / its principles are unknown
• The governing laws and principles are constant
• The goal is prediction

Where AI does not fit:
• Data is scarce
• Rigor is required
• Numerical analysis is possible
• The future cannot be predicted from the past
• The goal is understanding the mechanism
Thank you!