Machine Learning at PFN and HPC
Keisuke Fukuda
Preferred Networks, Inc.
Who we are? Preferred Networks, Inc. (PFN): a Tokyo-based Deep Learning & IoT company
Our Strategic Partners and Collaborators
2015: OPTIMIZATION OF BIN-PICKING FANUC ROBOTS
• Picking random objects is a typical task that is "easy for humans, hard for robots".
@CES 2016: CARS THAT DON’T CRASH
• Car positions are tracked from a ceiling camera and each car is controlled individually.
• White cars are autonomous.
• The red car is a manually controlled "evil" car, trying to disrupt the other cars.
"Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions", arXiv:1710.06280
@ICRA 2017 VOICE RECOGNITION + OBJECT PICKING
• ICRA is a top-tier conference on robotics
• Best Paper Award on Human-Robot Interaction
• Technologies:
  • Visual recognition
  • Natural language processing (NLP)
• The robot can understand ambiguous words:
  • "The Teddy bear"
  • "The brown fluffy stuff"
@CEATEC JAPAN 2018 Autonomous Tidying-up Robot System
https://projects.preferred.jp/tidying-up-robot/
Technologies behind the demos: Distributed Deep Learning
MN-1: An in-house supercomputer
• MN-1a (Sep. '17~)
  – 1024 NVIDIA Tesla P100
  – InfiniBand FDR
  – Peak 9.3 PetaFLOPS (SP)
  – #227 in Top500, Nov. 2018
• MN-1b (July '18~)
  – 512 NVIDIA Tesla V100 32GB
  – InfiniBand EDR
  – Peak 56 Peta (tensor) FLOPS
• Targeting ExaFLOPS by 2020
Chainer: A Flexible Deep Learning Framework
• Define-and-Run (Caffe2, TensorFlow, etc.): the model definition is first compiled into a computational graph and gradient functions; the training data is then run through that fixed graph.
• Define-by-Run (Chainer, PyTorch, TensorFlow Eager Execution, etc.): the computational graph and gradient functions are built on the fly while the model definition runs on the training data.
[Diagram: data flow of Define-and-Run vs. Define-by-Run]
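The Define-by-Run idea can be sketched in a few lines. This is a minimal, illustrative autodiff, not Chainer's actual implementation: each operation records its backward function while the forward pass executes, so the recorded graph always matches the Python control flow that actually ran.

```python
# Minimal define-by-run sketch (illustrative only): every arithmetic op
# records its backward function as it executes, building the graph on
# the fly rather than compiling it ahead of time.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # (parent, local_backward_fn) pairs

    def __add__(self, other):
        return Var(self.value + other.value,
                   parents=((self, lambda g: g), (other, lambda g: g)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   parents=((self, lambda g: g * other.value),
                            (other, lambda g: g * self.value)))

    def backward(self, g=1.0):
        self.grad += g
        for parent, fn in self.parents:
            parent.backward(fn(g))

x = Var(3.0)
y = x * x + x        # the graph is built by running this very line
y.backward()         # x.grad == 2*3 + 1 == 7.0
```

Because the graph is rebuilt on every forward pass, ordinary `if`/`for` statements can change the network shape per input, which is what makes dynamic NNs straightforward in this style.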
ChainerMN: Distributed Training with Chainer
• Add-on package for Chainer
• Enables multi-node distributed deep learning using NVIDIA NCCL2
Features:
• Scalable: near-linear scaling with hundreds of GPUs
• Flexible: even GANs, dynamic NNs, and RL are applicable
Distributed Training with ChainerMN
[Diagram: each worker runs Forward and Backward on its own minibatch, the gradients are combined with an All-Reduce, and every worker then applies the same Optimize step]
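One synchronous data-parallel step can be simulated in a few lines of NumPy. This is a toy sketch, not ChainerMN's actual code: each worker computes a gradient on its own minibatch, the gradients are averaged (the Allreduce), and all replicas apply the identical update, which keeps the model copies in sync.

```python
import numpy as np

N_WORKERS = 4
replicas = [np.zeros(6) for _ in range(N_WORKERS)]  # identical model copies

def local_gradient(w, worker_id):
    # stand-in for Forward + Backward on this worker's own minibatch
    shard_rng = np.random.default_rng(worker_id)
    return w + shard_rng.normal(size=w.shape)

grads = [local_gradient(w, i) for i, w in enumerate(replicas)]
avg_grad = np.mean(grads, axis=0)                   # the All-Reduce step
replicas = [w - 0.1 * avg_grad for w in replicas]   # the Optimize step

# replicas stay bit-identical because all saw the same averaged gradient
assert all(np.array_equal(replicas[0], w) for w in replicas)
```

The averaged gradient is mathematically equivalent to the gradient of one large minibatch, which is why this scheme scales the effective batch size with the number of workers.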
Achievement on MN-1a: ImageNet in 15 minutes
[Chart: training time in minutes of ResNet-50 (90 epochs) on ImageNet, showing the reduction from Nov. 2017 to Nov. 2018]
arXiv:1711.04325: "Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes"
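Some rough arithmetic behind the 15-minute run helps see why extremely large minibatches matter. arXiv:1711.04325 reports a minibatch of 32,768 images across the 1,024 P100 GPUs of MN-1a (32 per GPU); with the standard ImageNet-1k training-set size, 90 epochs then need only a few thousand parameter updates:

```python
# Back-of-the-envelope iteration count for the 15-minute ImageNet run
# (minibatch size 32,768 as reported in arXiv:1711.04325).
gpus = 1024
global_batch = 32 * gpus             # 32 images per GPU -> 32,768
train_images = 1_281_167             # ImageNet-1k training-set size
epochs = 90
iterations = epochs * train_images // global_batch
print(iterations)                    # 3518 parameter updates in total
```

So few updates make optimizer tricks (learning-rate warm-up, scaling) essential, which is the core subject of that paper.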
Achievement on MN-1b: PFDet in OIC 2018
• Google AI Open Images – Object Detection Track
  – Competition using one of the largest image datasets of its class
  – 12 million bounding boxes, 1.7 million images
  – 454 competitors
  – Approx. 500 GB (annotated subset)
• Object detection is a much harder task than object recognition
Achievement on MN-1b: PFDet in OIC 2018
Object Detection: detecting objects in an image by predicting...
• bounding boxes that contain them
• the category of each object
• We won 2nd place (0.023% behind the 1st)
Achievement on MN-1b: PFDet in OIC 2018
Open Sourcing PFDet:
We may make the implementation public
The technical report is already on arXiv: arXiv:1809.00778
Computation resources used in PFDet
• A single training run of 16 epochs takes 33 hours with 512 V100 GPUs of MN-1b
• Repeated model development & parameter tuning
Timeline:
• June 3rd: first commit
• June 29th: FPN added
• Aug. 31st: finish
Autonomous Tidying-up Robot
Integration of a wide range of DL:
• Object detection (based on PFDet)
• Audio recognition
• NLP
• Picking planning
Technical topics
1. Communication & Fault tolerance
2. Storage
3. Resource Management
Communication
Allreduce always matters
• Critical component for distributed deep learning
• NVIDIA NCCL is the de-facto standard communication library
• Typical usage in ChainerMN:
  – MPI for process coordination
  – NCCL for actual communication
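A back-of-the-envelope cost model shows why allreduce dominates. For the ring algorithm NCCL uses on large buffers, each rank sends about 2*(N-1)/N * M bytes per allreduce of an M-byte buffer, nearly independent of N. The parameter count below is an assumption (roughly ResNet-50 sized):

```python
# Per-rank traffic of one ring allreduce (the large-buffer algorithm in
# NCCL): ~2*(N-1)/N * M bytes for an M-byte buffer.
params = 25_600_000                 # ~25.6M parameters (assumed, ~ResNet-50)
buf = params * 4                    # float32 gradients -> ~102.4 MB
n = 512                             # number of GPUs / ranks
per_rank_bytes = 2 * (n - 1) / n * buf
print(per_rank_bytes / 1e6)         # ~204.4 MB sent per rank per iteration
```

Roughly 200 MB of traffic per rank on every iteration is why interconnect bandwidth (InfiniBand, NVLink) directly bounds scaling efficiency.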
Communication
From our research blog: https://bit.ly/2Pt2Gcg
[Chart: allreduce performance comparison (lower is better) of Open MPI 2.1.3 (configured as CUDA-aware), NCCL, and our demonstration-purpose prototype library; NCCL is the fastest production-ready library]
Communication
NCCL is now open-sourced 🎉
• Very interesting internal architecture
• NCCL still has some bugs in large-scale execution... but we know what to do with OSS software :)
Communication: open questions (1)
• NCCL is optimized for bandwidth-bound (i.e., large-buffer) communication
• What about short messages?
  – Inter-process Batch Normalization
  – Model parallelism
• Fault tolerance?
  – NCCL now supports communication timeouts...
  – But the MPI standard does not support FT (yet?)
Communication: open questions (2)
• Do we still use MPI...?
  – Legacy
  – The software stack is huge: installation and maintenance is hard for non-HPC users
  – No fault tolerance
  – No separation of process coordination and communication
  – Hard to use with Kubernetes
Resource Management
• We deploy Kubernetes on our computing cluster.
• Why Kubernetes?
  – Advanced features from the cloud-computing community
    • Fault tolerance, dynamic job size management, preemption, etc.
  – We also deploy server-based services (such as JupyterHub, CI, etc.) on the same cluster
• Still many challenges
  – Batch job scheduling, etc.
Diverse experimental environments
• Still images, video, audio, and text
• Deep learning, distributed deep learning, and reinforcement learning
• Containers, with a private Docker registry
• NFS for both large-scale and small-scale data
• InfiniBand exposed as a schedulable resource via a Device Plugin
• CPU/GPU selection via Node Selector
• Pod Affinity places pods on physical servers as few network hops apart as possible
k8s-host-device-plugin - https://github.com/everpeace/k8s-host-device-plugin
Icon pack by Icons8 - https://icons8.com; icons by iconspng - https://www.iconspng.com
Efficient scheduling
• Hundreds to thousands of jobs every day
  – Trial-and-error experiments
  – Hyperparameter tuning
• kube-throttler: limits how many Pods each user may run per PriorityClass (e.g., high: 1, medium: 5, low: unlimited)
• Preemption based on PriorityClass
• Custom scheduler built with Operators/CRDs and webhooks
  – Scheduling tailored to distributed deep learning
  – Priority weighting by waiting time
kube-throttler - https://github.com/everpeace/kube-throttler
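The per-PriorityClass throttling policy mentioned above (at most 1 high-, 5 medium-, and unlimited low-priority pods per user) can be sketched as an admission check. This is a toy model, not kube-throttler's actual implementation:

```python
import math

# Toy sketch of per-PriorityClass throttling (illustrative, not
# kube-throttler's real code): limits per user per priority class.
LIMITS = {"high": 1, "medium": 5, "low": math.inf}

def can_schedule(running, priority):
    """running: priorities of the user's currently running pods."""
    return running.count(priority) < LIMITS[priority]

running = ["high", "medium", "medium", "low", "low", "low"]
assert not can_schedule(running, "high")    # already at the limit of 1
assert can_schedule(running, "medium")      # 2 of 5 slots used
assert can_schedule(running, "low")         # never throttled
```

Capping high-priority pods this tightly keeps preemption-capable jobs rare, so one user cannot starve the rest of the cluster.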
Multi-tenancy
• Per-user and per-project environments (e.g., Namespace User A, Namespace User B, Namespace Project X)
• Supports confidential projects
• Access control to each Namespace via RBAC
• Traffic control via Network Policy
• NFS with Kerberos authentication; access governed by LDAP user IDs
• LDAP user IDs are set via PSP and Admission Webhooks
• Per-Namespace metrics
Ensuring flexibility
• Researchers with widely varying infrastructure literacy
• No lock-in to any specific tool
• Free customization of the experiment environment
• Debugging directly inside the execution environment
• Usable from various tools: in-house experiment tools, kubectl
• NFS mounted as the home directory, so nodes can be used like physical servers
• "I just want to use GPUs easily"
• kubectl exec to enter a running Pod and debug
Grand challenges
• Experiments that use all resources
  – ImageNet in 15 minutes (Tesla P100 x 1024)
  – PFDet in the Kaggle 2018 Google AI Open Images (Tesla V100 x 512)
• Everyday jobs run under the assumption that they can be preempted
• Grand-challenge jobs get the highest priority
• Experiments using all GPUs can therefore start at any time
Resource Management: Open Questions
• Preemption is a strong tool for efficient resource management
  • How should the DL framework and the scheduler cooperate?
  • Job preemption/restarting & flexible resource (re-)allocation
• Resource isolation
  • Bandwidth isolation for InfiniBand?
  • NUMA- / topology-aware pod placement?
NOTE: We operate an in-house cluster, so all jobs are our own.
The relationship between HPC and AI
• HPC for AI
• AI for HPC
• Convergence of HPC and AI (generalization of AI technology)
HPC for AI
• HPC's contribution to AI
  – DL/ML is one more HPC application; without HPC there is no AI
• Challenges ahead
  – Compute (FLOPS): more
  – Memory: more
  – Storage
    • Security, access control, and encryption
  – Evolution and integration of schedulers and computing environments
    • Containers / guaranteed reproducibility / coexistence with services (JupyterHub etc.) / effective use of idle compute resources
AI for HPC
• HPC as a target for AI
  – Parameter surveys
    • Meta-parameter tuning matured, driven by the sheer number of DL hyperparameters
    • → now flowing back into HPC
  – Power efficiency, failure detection, job scheduling, etc.
    • Reinforcement learning is a particular key
    • Li, "Transforming Cooling Optimization for Green Data Center via Deep Reinforcement Learning", arXiv:1709.05077v4
    • Mirhoseini et al., "Device Placement Optimization with Reinforcement Learning", ICML 2017
Convergence of HPC and AI (generalization of AI technology)
• Beyond HPC, AI is becoming a core technology of computer science as a whole
  – c.f. "From Deduction to Induction: A New System Development Paradigm", Hiroshi Maruyama, PPL2018 invited talk
  – Rather than something special, it may come to be used widely as just another implementation technique
• Many HPC applications may be a good fit for AI:

Where AI fits:
• Data is abundant (or can be generated)
• Errors are tolerable
• The phenomenon is complex / its principles are unknown
• The governing laws and principles are constant
• The goal is prediction

Where AI does not fit:
• Data is scarce
• Rigor is required
• Numerical analysis is possible
• The future cannot be predicted from the past
• The goal is understanding the mechanism
Thank you!