[db tech showcase Tokyo 2017] D33: Accelerating Deep Learning and Analytics Workloads - TensorFlow/VGG/Caffe/Spark, by Pure Storage Japan K.K.

1| © 2017 Pure Storage Inc.

ACCELERATING DEEP LEARNING AND ANALYTICS WORKLOADS - TENSORFLOW/VGG/CAFFE/SPARK

September 2017, Pure Storage Japan K.K.

Jotaro Oura (大浦譲太郎), FlashBlade Sales Lead

All-flash storage with a perpetual warranty that never becomes obsolete


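About the speaker: Jotaro Oura (大浦譲太郎), Twitter: @JOOOURA

Interests: raising two children, gadgets, health (lost 9 kg on a low-carb diet)

After working at a major global vendor, he helped launch the Japan office of the flash memory storage company Fusion-io, where he worked in sales, public relations, and evangelism.

He then contributed to expanding the Japanese market for the Big Data platform company Hortonworks. He has since joined Pure Storage to launch a new flash product built for the AI/Big Data era, and works on evangelism, enterprise sales, and partner support.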


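Rated a Leader (a company that drives the industry) for four consecutive years: growing market presence and recognition

Positioned as a Leader for four consecutive years in Gartner's Magic Quadrant for Solid-State Arrays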

http://www.purestorage.com/microsites/gartner-mq-2016.html

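Announced July 2017

Leaders
• Better than other vendors
• Excellent technology
• Strong sales, marketability, and customer satisfaction

Pure Storage understands customers' pain points and builds its business around them through its pricing, its controller upgrade guarantee program, its SSD warranty, and its maintenance pricing.

Pure Storage continues to maintain and grow its market share by developing new products, sustaining its mindshare, and expanding SSA use cases.

Quadrant labels (presenter's paraphrase):
• Products that are trying hard right now
• Niche products
• Products with a good vision
• Leaders (technology leaders: strong sales, marketability, and customer satisfaction)

Source: Gartner, Magic Quadrant for Solid State Arrays, 13 July 2017.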


The transformation of enterprise storage infrastructure

Which vendors will you consider for your next storage refresh?

[Bar chart; values as labeled on the chart, in chart order]
Other: 13
Infinidat: 2
Huawei: 2
Tintri: 4
DataDirect Networks: 4
Tegile: 6
SimpliVity: 6
EMC: 8
Oracle: 10
NetApp: 10
IBM: 10
Hitachi Data Systems: 10
Hewlett Packard…: 13
Dell: 13
Nimble Storage: 17
VMware (VSAN): 19
Nutanix: 19
Pure Storage: 23
Amazon Web Services: 23
Microsoft Azure: 28

Chart annotation: traditional storage infrastructure

Source: 451 Research, Voice of the Enterprise: Storage, Vendor Evaluations 2016


AI is being used across industries

Smart Kitchen (Innit): Identifies food in the refrigerator, notifies when food will expire, and recommends recipes

Brain Cancer MRI (Mayo Clinic): Finds genetic markers in images to avoid surgery for tumor samples & recommend treatments

Farming (Blue River): 10% of lettuce in the US is harvested by LettuceBot, using AI to maximize crop yield & minimize chemicals

Fraud Detection (Capital One): Industry loses $20B annually in fraud; Capital One detects suspicious activities in real-time

Crowd-Source Reviews (Yelp): Helps users discover new experiences with targeted recommendations while filtering suspicious content

Self-Driving Air Taxi (Airbus): By 2020, Airbus A3 plans to fly autonomously in San Francisco Bay Area's skies for commuters


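The third AI boom

With statistical learning and Deep Learning, AI has become more general-purpose and is expected to be practical. OSS-based frameworks and libraries have matured, lowering the barrier to entry.


The widening reach of machine learning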

https://www.slideshare.net/TakeshiHasegawa1/20151016ssmjpikalog


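CREATING INTELLIGENCE: FUELED BY PARALLEL COMPUTE, NEW ALGORITHMS, AND BIG DATA

New algorithms: accuracy beyond human capability through massive parallelization

Today's compute model: a massively parallel architecture that maximizes performance
CPU: tens of cores
GPU: thousands of cores

Big Data: "Data is the new oil field"; 50 ZB by 2020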


NEW DEMANDS FOR PUTTING DATA TO USE: LEGACY, RETROFIT STORAGE BUILT ON SERIAL TECHNOLOGIES, PERFORMANCE GAP GROWING

The storage performance gap: a gap that keeps widening

[Chart: performance over time, 2015 to 2017]
Compute required for Deep Learning: 15x in two years
Compute capability delivered: 10x in two years
SSD/disk performance: up only 18% in two years

Legacy storage architecture: Built on Decade-Old Serial Technology
Diagram labels: Disk Emulation Software; SAS (Serial Attached SCSI); SATA; NFS Software Stack; Object Translation Layer; Decade-old Protocol & SW; Newer Technologies Retrofitted; GAP
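The two-year growth figures on this slide can be converted into per-year rates to make the gap easier to compare. The following is a minimal sketch, assuming each figure compounds evenly across the two years (the slide does not state this); the names `TWO_YEAR_GROWTH` and `annual_factor` are illustrative only.

```python
# Two-year growth multiples as stated on the slide.
TWO_YEAR_GROWTH = {
    "deep_learning_demand": 15.0,   # compute required for deep learning: 15x
    "compute_capability": 10.0,     # compute capability delivered: 10x
    "ssd_disk_performance": 1.18,   # SSD/disk performance: +18%
}


def annual_factor(multiple: float, years: float = 2.0) -> float:
    """Per-year growth factor, assuming even compounding (an assumption)."""
    return multiple ** (1.0 / years)


annual = {name: annual_factor(m) for name, m in TWO_YEAR_GROWTH.items()}

# How far delivered compute pulled ahead of storage over the same two years.
compute_vs_storage = (
    TWO_YEAR_GROWTH["compute_capability"] / TWO_YEAR_GROWTH["ssd_disk_performance"]
)

for name, factor in annual.items():
    print(f"{name}: about {factor:.2f}x per year")
print(f"compute vs. storage after two years: about {compute_vs_storage:.1f}x")
```

Under that assumption, storage improves by under 9% per year while delivered compute roughly triples, which is the widening gap the slide describes.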


The "mysterious AI semiconductor maker": a new style of compute


http://www.nvidia.co.jp/object/volvo-autoliv-select-drive-px-self-driving-cars-20170628-jp.html


Facebook's presentation at GTC 2017 (quoted)

http://on-demand.gputechconf.com/gtc/2017/presentation/s7815-soumith-chintala-building-scale-out-deep-learning-infrastructure-lessons-learned-facebook-ai-research.pdf








Enlarged view


MEGA-SCALE AI SUPERCOMPUTER

POWERED BY FLASHBLADE


FLASHBLADE

BLADE: SCALE-OUT PROCESSING + FLASH
ELASTICITY: SCALE-OUT STORAGE SOFTWARE
ELASTIC FABRIC: LOW-LATENCY, SW-DEFINED ETHERNET INTERCONNECT


THE BIG LEAP BEHIND MODERN ANALYTICS

MODERN ANALYTICS: Improves Linearly with Growing Data
[Chart: accuracy vs. amount of data, comparing older learning algorithms with Deep Learning. Deep learning chart courtesy of Andrew Ng.]

FLASHBLADE: Improves Linearly with Growing Data
[Chart: throughput in GB/s (axis 0 to 80) vs. number of blades (15, 30, 45, 60, 75), with capacity labels 1.6PB and 8.0PB.]
IO sizes, 16 load generators (48-core CPUs each with 2x10GbE), 256 containers total, NFSv3. Data capacity assumes 3:1 compression. The 75 blade feature is subject to GA.

PERFORMANCE OF 20 RACKS: Power of Purpose-Built vs Legacy
Leading Information Services Company: 20 racks of disk compared with 4U


A "TAILOR-MADE FOR EVERYTHING" DESIGN: BIG DATA IS UNPREDICTABLE DATA - FLASHBLADE DELIVERS PERFORMANCE FOR ANY DATA

From tiny to huge files: Designed to deliver maximum performance, from small & metadata-heavy to large streaming files

Elastic performance: Delivers linear scaling performance that grows with your data, from TBs to PBs, to thousands of clients

Fast random I/O: Offers predictable, ultra-fast performance for any access pattern, random or sequential


Training ImageNet in 1 Hour

Facebook's paper

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf


FlashBlade performance

High throughput with linear scale-out. Note: 512KB IO sizes, 16 load generators (48-core CPUs each with 2x10GbE), 256 containers total, NFSv3

Throughput (GB/sec) by number of blades:

Blades   7     8     9     10    11    12    13    14    15
Read     7.4   8.4   9.3   10.3  11.2  12.2  13.1  14.1  15.0
Write    2.2   2.5   2.8   3.1   3.4   3.6   3.9   4.2   4.5
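The linear scale-out claim can be checked against the published figures by fitting a straight line to them. The following is a minimal sketch using an ordinary least-squares slope; the read and write values are copied from the slide, while the helper name `least_squares_slope` is illustrative only.

```python
# Throughput figures (GB/sec) from the slide, for 7 to 15 blades.
BLADES = [7, 8, 9, 10, 11, 12, 13, 14, 15]
READ_GBPS = [7.4, 8.4, 9.3, 10.3, 11.2, 12.2, 13.1, 14.1, 15.0]
WRITE_GBPS = [2.2, 2.5, 2.8, 3.1, 3.4, 3.6, 3.9, 4.2, 4.5]


def least_squares_slope(xs, ys):
    """Slope of the best-fit line through (xs, ys): GB/sec gained per blade."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


read_per_blade = least_squares_slope(BLADES, READ_GBPS)
write_per_blade = least_squares_slope(BLADES, WRITE_GBPS)

print(f"read:  about {read_per_blade:.2f} GB/sec per added blade")
print(f"write: about {write_per_blade:.2f} GB/sec per added blade")
```

With these numbers each added blade contributes a little under 1 GB/sec of read throughput and a little under 0.3 GB/sec of write throughput across the measured range.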


Direct access to FlashBlade can significantly shorten data preparation time.


NVIDIA Test Case: Machine Learning

▪ 20-CPU-wide run

▪ Machine learning storage test program from NVIDIA

▪ 7+ GB/s Read at Peak

▪ 1-4 GB/s Write

▪ "Fastest we have ever seen" (compared to NFS and InfiniBand connected storage), NVIDIA storage team


DELIVERING THE DATA THROUGHPUT AI NEEDS: Deep Learning Needs Maximum Read Performance, Mostly Small Files, To Keep Training Computers Busy

DGX-1: 13K Images/Sec for each DGX-1

Assume 115KB on average for images

For DGX-1 13K images per second performance: http://files.shareholder.com/downloads/AMDA-1XAJD4/4389242263x0x918093/50C3BC56-468D-4A02-941B-C0599570915A/JHH_SC16_FINAL_PUBLISHED.pdf

[Diagram: FlashBlade with ten links labeled 1.5GB/s]

FlashBlade: 1.5GB/Sec of Throughput to Keep Each DGX-1 Busy
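The 1.5GB/s figure follows from the two inputs on the slide: 13K images per second per DGX-1 and an assumed 115KB average image. The following is a minimal sketch of that arithmetic, assuming decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes); the function name `required_read_gbps` is illustrative only.

```python
# Inputs taken from the slide.
IMAGES_PER_SEC_PER_DGX1 = 13_000
AVG_IMAGE_BYTES = 115 * 1_000  # 115KB average image; decimal KB is an assumption


def required_read_gbps(num_systems: int) -> float:
    """Read throughput (GB/sec) needed to keep `num_systems` DGX-1 systems fed."""
    bytes_per_sec = num_systems * IMAGES_PER_SEC_PER_DGX1 * AVG_IMAGE_BYTES
    return bytes_per_sec / 1e9


one_system = required_read_gbps(1)
ten_systems = required_read_gbps(10)

print(f"1 DGX-1:   {one_system:.3f} GB/sec")
print(f"10 DGX-1s: {ten_systems:.2f} GB/sec")
```

One system needs just under 1.5 GB/sec, so ten systems need just under 15 GB/sec of sustained read throughput on small files.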


DOES SPARK GET FASTER TOO?


FLASHBLADE use case in software development

33% additional build time reduction with 15 Blades

More clients

15X Faster Build Time for same # clients

– 6 concurrent builds per minute

Linear scalability of Builds

– Add more Blades and Clients to increase Build rate

– Boost performance SW Dev/Build

Do more Builds with Less Storage

– Minimize concerns with Storage bottlenecking

– Consolidate multiple workloads and Spark Environment


Debug analysis pipeline in software development

[Diagram of the pipeline, including rsyslog, shown at three scales]

10 FB, 20 clients, 100+ tests

100 FB, 200 clients, 1,000+ tests

1,000+ VMs, 120+ FBs, 20+ Jenkins, 400+ clients, 20,000+ tests

Custom code

✓ Duplicate bug

✓ Infrastructure failure

✓ Performance regression


Building scalable genomics tools with ADAM

⎯ ADAM is an open source, high performance, distributed library for genomic analysis

⎯ ADAM defines a:

⎯ Data schema and layout on disk

⎯ Programming interface for distributed processing of genomic data using Spark + Scala

⎯ Goal is to enable both batch and exploratory analysis of all types of genomic data


APACHE SPARK MAPS WELL TO GENOMICS

Apache Spark

⎯ An in-memory data parallel computing framework

⎯ Optimized for iterative jobs → unlike Hadoop

⎯ Provides an easy to use programming model (Resilient Distributed Dataset → parallel array over cluster) + Python/R/SQL support

Question is: how can we make a next-gen map-reduce platform like Apache Spark easy and efficient to use for processing genomic data?

// Count 21-mers across all reads; `sc` is a SparkContext with ADAM's implicits in scope.
val kmers = sc.loadAlignments("/path/to/my/reads.sam")
  .flatMap(_.getSequence.sliding(21).map(k => (k, 1L)))
  .reduceByKey(_ + _)
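The Scala snippet above emits a (k-mer, 1) pair for every 21-base window of every read and then sums the counts per k-mer. The following is a minimal single-machine sketch of the same counting logic in plain Python, with no Spark or ADAM involved; the function name `count_kmers` and the sample reads are illustrative only. One difference is an assumption of this sketch: reads shorter than `k` contribute nothing here, whereas Scala's `sliding` yields one shorter window for them.

```python
from collections import Counter
from typing import Iterable


def count_kmers(reads: Iterable[str], k: int = 21) -> Counter:
    """Count every length-k substring (k-mer) across all read sequences."""
    if k <= 0:
        raise ValueError("k must be positive")
    counts: Counter = Counter()
    for sequence in reads:
        # Equivalent of the flatMap step: one entry per sliding window.
        for start in range(len(sequence) - k + 1):
            # Equivalent of reduceByKey(_ + _): sum the 1s per k-mer.
            counts[sequence[start:start + k]] += 1
    return counts


# Tiny made-up reads, with k=3 so the result is easy to inspect.
example_counts = count_kmers(["ACGTAC", "CGTA"], k=3)
for kmer, count in sorted(example_counts.items()):
    print(kmer, count)
```

In the Spark version the same two steps run in parallel across the cluster, with `reduceByKey` combining partial counts from each partition.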


Example cluster configuration

FlashBlade

Switch

…64 node Hadoop YARN/HDFS cluster: 16 cores, 256GB RAM, 4TB per node

Running Spark on NFS


HIGHER LEVEL PRIMITIVES ENABLE OPTIMIZATIONS...

⎯ Maintain sort order across runs and optimize to reduce data skew

⎯ Leverage indices/sort orders

⎯ Push down join/filter queries into storage

⎯ Use join optimizations to develop BEDtools equivalent


A platform that delivers scalability

⎯ 30–50x speedup over traditional implementations

⎯ Speedup extends to O(16MB data / core)

⎯ 3x improvement in analysis cost


FLASHBLADE

Performance: 1M+ IOPS and >18 GB/s*, >75 GB/s
Purity: NFSv3, Object/S3, and N+2 redundancy; 1.1 PBs (2:1)*, 5.3 PBs (2:1)
Plus: Pure1
Blades: 8TB & 17TB or 52TB
Power: max 1850 Watt, fully loaded


WATCH FLASHBLADE SCALE-OUT INSTANTLY

Linear expansion: add each blade instantly
Capacity, IOPS, metadata, NVRAM, and bandwidth

Blade counts shown: 7, 9, 15, and 30 (preview)
Mix/Match 8.8TB, 52.8TB, or Future Blades

8.8TB Blades
7 Blades: 56 TBs Raw, 66 TBs Effective*
72 TBs Raw, 95 TBs Effective*

52.8TB Blades
364 TBs Raw, 394 TBs Effective*
468 TBs Raw, 570 TBs Effective*
780 TBs Raw, 1,072 TBs Effective*
30 Blades (Preview): 1,560 TBs Raw, 2,144 TBs Effective*; up to 30 GB/sec, over 1M IOPS

17TB Blades
91.8 TBs Raw, 183.6 TBs Effective*
172.6 TBs Raw, 345 TBs Effective*

*Effective capacity with compression is for reference only and is not guaranteed.



FlashBlade Hardware Designed for High Concurrency and High Performance Environments

Blades
• Capacity & Performance
• Embedded NVRAM

FLASHBLADE Chassis
• Up to 15 Blades
• 4RU Height
• N+2 Redundant, Heals in Place

Fabric Module
• 8 x 40GbE External ports

System Resources (15x52)
• >200 (x86+ARM) cores
• ~2 TB RAM
• 780TB NAND Flash
• 8x40GbE Ports

System Power ~ 2KW


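FLASHBLADE

Blade

INTEL XEON SoC: compute + networking + chipset; low-power, low-cost design; 8 full XEON cores

DRAM memory

Programmable processor: 1 FPGA, 2 ARM cores

ELASTIC FABRIC connector

NAND flash: 17TB or 52TB

PURITY FB software: runs distributed across all processors

Integrated NV-RAM: write buffer backed by supercapacitors

PCIe connectivity: the CPU and flash communicate over PCIe using a proprietary protocol

All FLASHBLADE specifications, features, and pricing are preliminary and may change at general availability. Effective capacity assumes all overhead and a 3:1 data reduction ratio.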


REAL RESILIENCY: DESIGNED FOR 99.9999% AVAILABILITY

N+2: Data, Metadata, and NV-RAM all protected with N+2 redundancy

1/N Loss on Failure: Blade failure results in predictable 1/N loss in IO and metadata performance

Rebuilds in Place: Heals around blade failure to return the array to full parity within hours

Advanced ECC: Software-based Flash ECC protects against flash aging and bit errors over time

Multi-Layer Integrity: Multiple layers of checksums and protection for both data and metadata ensure integrity
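The "1/N loss on failure" point can be made concrete with a small calculation. The following is a minimal sketch, assuming IO and metadata performance are spread evenly across blades so that each failed blade removes an equal 1/N share; applying the same rule to more than one failed blade is an assumption of this sketch, not a claim from the slide. The function name `remaining_performance` is illustrative only.

```python
from fractions import Fraction


def remaining_performance(total_blades: int, failed_blades: int = 1) -> Fraction:
    """Fraction of full performance left after blade failures.

    Assumes each blade contributes an equal 1/N share (an assumption).
    """
    if total_blades <= 0:
        raise ValueError("total_blades must be positive")
    if not 0 <= failed_blades <= total_blades:
        raise ValueError("failed_blades must be between 0 and total_blades")
    return Fraction(total_blades - failed_blades, total_blades)


# Chassis sizes mentioned in the deck: 7 blades and a full 15-blade chassis.
for blades in (7, 15):
    left = remaining_performance(blades)
    print(f"{blades} blades, 1 failed: {float(left):.1%} of full performance")
```

Under this assumption a larger chassis loses a smaller share per failure: one failed blade costs 1/15 of performance in a full chassis but 1/7 in a 7-blade system.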


Summary

FLASHBLADE

INDUSTRY'S FIRST CLOUD-ERA FLASH PURPOSE-BUILT FOR MODERN ANALYTICS

SIMPLE: Evergreen; No Manual Tuning; Just Add Blades for Performance

BIG: Tens of Thousands of Clients; Tens of Billions of Objects & Files; 8 Petabytes with Single IP

FAST: Elastic Performance Up to 75 GB/s; Always-Fast, Small to Large Files; Massively Parallel from SW to Flash

75 blade feature is subject to GA release

For Deep Learning and Analytics environments, there is a data platform solution that takes a new approach to improving workloads.

We welcome inquiries for more details and requests to discuss an evaluation.

Thank you very much.