DMTK: An Ultra-Large-Scale Deep Learning Framework and Its Application to Text Understanding
Taifeng Wang
Lead Researcher
Machine Learning Group, MSRA
2016 GTC China
Microsoft Research Lab Locations
• Redmond, Washington, USA (Sep 1991)
• Cambridge, UK (July 1997)
• Beijing, China (Nov 1998)
• Bangalore, India (Jan 2005)
• Cambridge, Massachusetts, USA (2008)
• New York, USA (May 2012)

Microsoft Research Asia: technologies transferred into all major Microsoft products.
About DMTK (Distributed Machine Learning Toolkit)
http://dmtk.io
Released on GitHub on Nov 9, 2015 by the Machine Learning Group of MSRA: https://github.com/Microsoft/DMTK
We focus on providing distributed machine learning infrastructure and algorithms to handle big-data and big-model learning tasks.
DMTK User Engagement
• Within just one week after release (2015.11.10):
  • 1000+ stars and 200+ forks on GitHub
  • 1M+ visits to the DMTK homepage
  • 300K+ downloads of binary executables
• Major upgrade (2016.9)
www.dmtk.io

About DMTK: Development Timeline

2015.11
• Parameter server 1.0
• Distributed LightLDA
• Distributed Word2Vec

2016.3
• Parameter server 2.0
• Simpler SDK usage
• System performance enhancement, e.g. memory & network cost reduction

2016.6
• Enriched programming language support for the parameter server: Python, Lua
• Connect with Torch/Caffe/Theano

2016.9
• Deep integration with CNTK
• Distributed logistic regression with online update (FTRL)
• Distributed gradient boosting decision tree (GBDT)

2016.12
• Innovation on distributed optimization: DC-ASGD, ensemble models, accelerated optimization methods
• Model parallelism on deep learning models
• 2C-RNN for text understanding
• Graph embedding
Execution Engines

(Architecture diagram: the Microsoft Distributed Machine Learning Toolkit (DMTK) runs on top of execution engines such as YARN.)
Multiverso Parameter Server
• Rich communication interface: MPI, ZeroMQ, RDMA, GPU Direct
• Distributed synchronization mechanisms: MA / ADMM / BMUF, ASGD / DC-ASGD
• Hybrid model store: matrix / tensor, hash table, tree; model slicing
• Distributed machine learning algorithms: 2C-RNN, LightGBM, LightLDA, Distributed Word Embedding, Logistic Regression
• Parallelizes different machine learning toolkits: AzureML, CNTK, and other single-node DNN tools (Theano/Caffe/Torch)
Workload supported

LightLDA
• Model: 20M vocab, 1M topics (largest topic model)
• Data: 200B tokens (Bing web chunk)
• Training time: 60 hrs on 24 machines (nearly linear speed-up)

Word2Vec
• Model: 10M vocab, 1000 dims (largest word embedding)
• Data: 200B samples (Bing web chunk)
• Training time: 40 hrs on 8 machines (nearly linear speed-up)

GBDT
• Model: 3000 trees (120 nodes each)
• Data: 7M records (Bing HRS data)
• Training time: 3 hrs on 8 machines (4x speed-up)

LSTM
• Model: 20M parameters (4 hidden layers)
• Data: 1570 hrs of speech data (Windows Phone data)
• Training time: 1 day on 16 GPUs (15.9x speed-up)

CNN
• Model: 41M parameters (GoogLeNet)
• Data: 2M images (ImageNet 1K dataset)
• Training time: 30 hrs on 16 GPUs (10x speed-up)

Online FTRL
• Model: 800M parameters (logistic regression)
• Data: 6.4B impressions (Bing Ads click log)
• Training time: 2400 s on 24 machines (12x speed-up)
Forward looking: the Microsoft Cognitive Toolkit (CNTK)
How to Advance Large-Scale Machine Learning
Algorithmic Innovation
• Machine learning algorithms themselves need to have sufficiently high efficiency and throughput.
• Existing designs/implementations of machine learning algorithms might not have considered this requirement; redesign/re-implementation might be needed.
System Innovation
• One needs to leverage the full power of distributed systems, and pursue near-linear scale-out/speed-up.
• New distributed training paradigms need to be invented in order to resolve the bottlenecks of existing distributed machine learning systems.
Evolution of Distributed ML

(Diagram: Iterative MapReduce (LDA, LR; synchronous, data parallelism) → Parameter Server (deep learning, LDA, GBDT, LR; synchronous and asynchronous, data and model parallelism) → Dataflow (deep learning; irregular parallelism).)
Evolution of Distributed Machine Learning
Iterative MapReduce
• Uses MapReduce / AllReduce to sync parameters among workers: local computation followed by a synchronous update
• Only synchronous updates
• Example: Spark and other derived systems
Evolution of Distributed Machine Learning
Iterative MapReduce → Parameter Server
• Parameter server (PS) based solutions were proposed to support:
  • Asynchronous updates
  • Different mechanisms for model aggregation, especially in an asynchronous manner
  • Model parallelism
• Examples: Google's DistBelief, Petuum, Multiverso PS (see the sketch below)
+ NIPS'12 DistBelief (Google), NIPS'13 Petuum (Eric Xing), OSDI'14 Parameter Server (Mu Li), Multiverso PS, etc.
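To make the asynchronous PS pattern concrete, here is a minimal in-process sketch (plain Python with threads; all names are illustrative, and this is not the Multiverso API):

```python
import threading
import numpy as np

class ParameterServer:
    """Toy in-process parameter server: workers push gradients and
    pull the latest model without waiting for each other (async)."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        with self.lock:
            self.w -= self.lr * grad   # apply each update as soon as it arrives

def worker(ps, data, labels, steps=100):
    for _ in range(steps):
        w = ps.pull()                       # may be stale by a few updates
        grad = data.T @ (data @ w - labels) / len(labels)
        ps.push(grad)                       # no barrier: asynchronous update

# Toy linear-regression workload split across 4 workers.
rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(400, 8)), rng.normal(size=8)
y = X @ true_w
ps = ParameterServer(dim=8)
threads = [threading.Thread(target=worker, args=(ps, X[i::4], y[i::4]))
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("error:", np.linalg.norm(ps.pull() - true_w))
```

Because workers never wait for one another, the gradient a worker pushes may have been computed on a stale model; this is exactly the delayed-gradient problem that DC-ASGD (below) addresses.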
Evolution of Distributed Machine Learning
Iterative MapReduce → Parameter Server → Dataflow
• Dataflow-based solutions were proposed to support:
  • Irregular parallelism (e.g., hybrid data- and model-parallelism), particularly in deep learning
  • Both high-level abstraction and low-level flexibility in implementation
• Example: Google's TensorFlow
• Task scheduling & execution based on: 1. data dependency; 2. resource availability
+ TensorFlow, EuroSys'07 Dryad (Microsoft), NSDI'12 Spark (AMP Lab)
Delay-Compensated ASGD: Our Work on System Innovation
Delayed Gradients
• Sequential SGD: $w_{t+\tau+1} = w_{t+\tau} - \eta \, g(w_{t+\tau})$
• Async SGD: $w_{t+\tau+1} = w_{t+\tau} - \eta \, g(w_t)$

The two updates differ because $g(w_t) \neq g(w_{t+\tau})$; a Taylor expansion relates them:

$$g(w_{t+\tau}) = g(w_t) + \nabla g(w_t)\,(w_{t+\tau} - w_t) + O(\|w_{t+\tau} - w_t\|^2)$$

where $\nabla g(w_t)$ corresponds to the Hessian matrix.
Unbiased Efficient Approximation of Hessian Matrix
Theorem: Assume that $Y$ is a discrete random variable with $P(Y = k \mid X = x, w) = \sigma_k(x; w)$, where $\sigma_k(x; w) < 1$ for all $x, w$ and $k = 1, \dots, K$. Let $L(x, y; w) = -\sum_{k=1}^{K} I_{y=k} \log \sigma_k(x; w)$. Then we can prove that there exists a function $\phi$ such that:

$$E_{Y \mid x, w}\left[\frac{\partial^2}{\partial w^2} L(X, Y; w)\right] = E_{Y \mid x, w}\left[\phi\!\left(\frac{\partial}{\partial w} L(X, Y; w)\right)\right]$$

For cross-entropy loss, the second-order derivatives can therefore be derived from the first-order derivatives in an unbiased manner.
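Concretely, following the DC-ASGD paper (a step not spelled out on the slide): for cross-entropy loss the expected Hessian equals the expected outer product of the gradient with itself, so $\phi$ can be taken to be the outer product, with a cheap diagonal variant used in practice:

$$E\left[\frac{\partial^2}{\partial w^2} L\right] = E\left[\frac{\partial L}{\partial w}\left(\frac{\partial L}{\partial w}\right)^{\!\top}\right], \qquad \phi(g) = g g^{\top} \approx \mathrm{diag}(g \odot g).$$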
Delay Compensated ASGD (DC-ASGD)

ASGD: $w_{t+\tau+1} = w_{t+\tau} - \eta \, g(w_t)$

DC-ASGD: $w_{t+\tau+1} = w_{t+\tau} - \eta \left[ g(w_t) + \lambda \, \phi(g(w_t)) \odot (w_{t+\tau} - w_t) \right]$
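A minimal server-side sketch of this update (illustrative NumPy, not the Multiverso implementation; it uses the diagonal approximation $\phi(g) = g \odot g$):

```python
import numpy as np

class DCASGDServer:
    """Server-side DC-ASGD sketch: besides the global model w, keep one
    backup per worker holding the snapshot w_t that worker last pulled."""
    def __init__(self, dim, n_workers, eta=0.1, lam=0.04):
        self.w = np.zeros(dim)
        self.backup = [self.w.copy() for _ in range(n_workers)]
        self.eta, self.lam = eta, lam

    def pull(self, worker_id):
        # Remember w_t so we can later measure how far the model drifted.
        self.backup[worker_id] = self.w.copy()
        return self.backup[worker_id]

    def push(self, worker_id, g):
        # g was computed at the stale snapshot w_t; the global model has
        # meanwhile moved to w_{t+tau}. Compensate the delay with
        # lambda * (g ⊙ g) ⊙ (w_{t+tau} - w_t), the diagonal surrogate for
        # the Hessian-vector term in the Taylor expansion.
        delta = self.w - self.backup[worker_id]
        self.w -= self.eta * (g + self.lam * g * g * delta)
```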
Experimental Results (based on ResNet)

(Plots: results on CIFAR and ImageNet.)
2C-RNN: A Super-Efficient and Scalable Deep Algorithm for Text Understanding (our work on algorithm innovation, published in NIPS 2016)
Recurrent Neural Networks for text applications
• A widely used model for sequence representation and learning:
  • Language modeling
  • Machine translation
  • Conversation bots
  • Image/video captioning
• Major challenges: efficiency and scalability
Language modeling

$$h_t = \sigma(U x_t + W h_{t-1} + b), \qquad o_t = V h_t, \qquad y_t = \mathrm{softmax}(o_t)$$

Symbol | Definition | Dimension
$x_t$ | (Input) embedding vector of the word at position $t$ | $w$
$U$ | Parameter matrix: input → hidden state | $h \times w$
$W$ | Parameter matrix: hidden state → hidden state | $h \times h$
$V$ | Output embedding matrix: hidden state → output | $|V| \times h$
$y_t$ | Predicted probability for each word | $|V|$
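For concreteness, here is one step of this recurrence in plain NumPy (a minimal sketch; the sizes are illustrative, and $\sigma$ is taken to be tanh):

```python
import numpy as np

h_dim, w_dim, vocab = 256, 256, 10_000     # illustrative sizes
rng = np.random.default_rng(0)
U = rng.normal(0, 0.01, (h_dim, w_dim))    # input -> hidden
W = rng.normal(0, 0.01, (h_dim, h_dim))    # hidden -> hidden
V = rng.normal(0, 0.01, (vocab, h_dim))    # hidden -> output: the big matrix
b = np.zeros(h_dim)

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def step(x_t, h_prev):
    """One language-model step: new hidden state plus a distribution
    over the entire vocabulary (the |V| x h matmul is the bottleneck)."""
    h_t = np.tanh(U @ x_t + W @ h_prev + b)
    o_t = V @ h_t
    return h_t, softmax(o_t)

h, y = step(rng.normal(size=w_dim), np.zeros(h_dim))
print(y.shape)   # (10000,): one probability per vocabulary word
```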
Challenge in text applications: model size
• Large scale
• Model size too large for current GPU memory
http://www.dmtk.io/word2vec.html

Symbol | Definition | Dimension | Memory size
$x$ | (Input) word embedding table | $|V| \times w$ | 10M * 1024 * 4B = 40 GB
$U$ | Parameter matrix: input → hidden state | $h \times w$ | 1024 * 1024 * 4B = 4 MB
$W$ | Parameter matrix: hidden state → hidden state | $h \times h$ | 1024 * 1024 * 4B = 4 MB
$V$ | Output embedding matrix: hidden state → output | $|V| \times h$ | 10M * 1024 * 4B = 40 GB
$y_t$ | Predicted probability for each word | $|V|$ | 10M * 4B = 40 MB

Dataset | #tokens | Vocab
ClueWeb09 (en) | 143,820,387,816 | 10,784,180
Challenge in text applications: running time
• Large scale
• Huge computational complexity: to choose one word, we need to go through all the words in the vocabulary
http://www.dmtk.io/word2vec.html

$$h_t = \sigma(U x_t + W h_{t-1} + b), \qquad o_t = V h_t, \qquad y_t = \mathrm{softmax}(o_t)$$

Term | #operations | Operation unit
$h_t$ | 2 million | float operations
$o_t$ | 10 billion | float operations

Dataset | #tokens | Vocab
ClueWeb09 (en) | 143,820,387,816 | 10,784,180
Training time estimation on mainstream hardware

Estimated running time = #tokens × #operations per token × #epochs × 2 (forward and backward propagation) / #FLOPS.

Device | Computation (#cores, FLOPS) | Global mem cap./BW | Running time
Xeon Broadwell 14nm (V16) | 2*20 cores, 0.736 TFLOPS | 8x32GB DDR4 (256GB) / 95 GB/s | 0.143T*10G*10*2 / 0.736T / 3600/24/365 = 1232 years
GPU (K40) 28nm | 5.0 TFLOPS (float32), 2880 cores | 12GB GDDR5 / 288 GB/s | 0.143T*10G*10*2 / 5T / 3600/24/365 = 181 years
GPU (M40) 28nm | 6.8 TFLOPS (float32), 3072 cores | 24GB GDDR5 / 288 GB/s | 0.143T*10G*10*2 / 6.8T / 3600/24/365 = 133 years
GPU (P100) 16nm | 10.6 TFLOPS (float32), 21.2 TFLOPS (float16) | 16GB HBM2 / 720 GB/s | 0.143T*10G*10*2 / 10.6T / 3600/24/365 = 85 years

Dataset | #tokens | Vocab
ClueWeb09 (en) | 143,820,387,816 | 10,784,180
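The back-of-the-envelope arithmetic behind this table, as a small sketch in Python:

```python
SECONDS_PER_YEAR = 3600 * 24 * 365
tokens = 0.143e12        # ~143B tokens (ClueWeb09 en)
ops_per_token = 10e9     # dominated by the |V| x h softmax matmul
epochs = 10
fwd_bwd = 2              # forward + backward pass

def years(flops):
    return tokens * ops_per_token * epochs * fwd_bwd / flops / SECONDS_PER_YEAR

for name, tflops in [("Xeon Broadwell", 0.736), ("K40", 5.0),
                     ("M40", 6.8), ("P100 fp32", 10.6)]:
    print(f"{name}: {years(tflops * 1e12):.0f} years")
# -> ~1232, ~181, ~133 and ~86 years (the slide truncates the last to 85)
```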
Big challenge to algorithm innovation and hardware manufacturers
• Key problem: the huge vocabulary. In the model-size table above, the input embedding table ($|V| \times w$) and output embedding matrix ($|V| \times h$) each take 40 GB, dwarfing the 4 MB recurrent matrices, and the $|V| \times h$ softmax drives the 10-billion-operation-per-token cost.
Our proposal: 2-Component (2C) shared embedding (accepted by NIPS 2016)

Current practice: one embedding vector per word, $|V|$ vectors in total (e.g. $x_1$ = January, $x_2$ = February, …, $x_{15}$ = one, $x_{16}$ = two).

Our approach: arrange the vocabulary in a 2D word table with row vectors $x_1, x_2, \dots$ and column vectors $y_1, y_2, \dots$ For example, January sits at cell $(x_1, y_1)$, February at $(x_1, y_2)$, one at $(x_2, y_1)$, two at $(x_2, y_2)$.

2C: each word is partitioned and represented by two vectors $(x, y)$.
Shared embedding: $x$ is shared within the same row, $y$ is shared within the same column.
Storage drops from $|V|$ vectors to $2\sqrt{|V|}$ vectors.
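A sketch of the bookkeeping (illustrative Python, not the paper's code): map each word id to a (row, column) pair in a ⌈√|V|⌉ × ⌈√|V|⌉ table, so only the shared row and column vectors are stored:

```python
import math
import numpy as np

vocab, dim = 10_000_000, 1024
side = math.isqrt(vocab - 1) + 1       # ceil(sqrt(|V|)) rows and columns

# 2 * sqrt(|V|) shared vectors instead of |V| per-word vectors:
row_emb = np.zeros((side, dim), dtype=np.float32)   # x vectors, one per row
col_emb = np.zeros((side, dim), dtype=np.float32)   # y vectors, one per column

def cell(word_id):
    """Word id -> (row, column) cell in the word table."""
    return divmod(word_id, side)

def vectors(word_id):
    """The two shared vectors (x, y) representing this word; 2C-RNN feeds
    them into the recurrence at alternating steps."""
    r, c = cell(word_id)
    return row_emb[r], col_emb[c]

print(f"{side} x {side} table;",
      f"{2 * side * dim * 4 / 1e6:.0f} MB of embeddings",
      f"instead of {vocab * dim * 4 / 1e9:.0f} GB")   # ~26 MB vs ~41 GB
```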
2C-RNN

(Diagram: a standard RNN unrolled over words $w_{t-1}, w_t, w_{t+1}$, compared with the 2C-RNN, which alternates row and column steps: consuming the column vector $x^c_{t-1}$ of the previous word yields the row distribution $P_r(w_t)$ through output matrix $Y_r$, and consuming the row vector $x^r_t$ of the current word yields the column distribution $P_c(w_t)$ through $Y_c$; the matrices $U$ and $W$ are shared across steps. Previous word → predicted current word; current word → predicted next word.)
Analysis on model size

Symbol | Definition | Dimension | Memory size
$x, y$ | (Input) row/column embedding vectors of the word at position $t$ | $2\sqrt{|V|} \times w$ | 2*(10M)^{1/2}*1024*4B = 25 MB
$U_x, U_y$ | Parameter matrices: input → hidden state | $2 \times h \times w$ | 2*1024*1024*4B = 8 MB
$W_x, W_y$ | Parameter matrices: hidden state → hidden state | $2 \times h \times h$ | 2*1024*1024*4B = 8 MB
$V_x, V_y$ | Output embedding matrices: hidden state → output | $2\sqrt{|V|} \times h$ | 2*(10M)^{1/2}*1024*4B = 25 MB
$y_t$ | Predicted row/column probabilities | $2\sqrt{|V|}$ | < 1 MB

Total model size: 80 GB → ~70 MB, with the recurrence unchanged in form: $h_t = \sigma(U x_t + W h_{t-1} + b)$, $o_t = V h_t$, $y_t = \mathrm{softmax}(o_t)$.
Analysis on computational complexity

#operations per token: 10 G → 10 M.

Term | #operations per token | Operation unit
$h^x_t, h^y_t$ | 4M | float operations
$o^x_t, o^y_t$ | $2 \times \sqrt{10\mathrm{M}} \times 1024 \approx 6$M | float operations

GPU (K40) 28nm | 5.0 TFLOPS (float32), 2880 cores | 12GB GDDR5 / 288 GB/s | 0.143T*10M*10*2 / 5T / 3600/24/365 = 0.18 years

Training time estimation: 181 years → 0.18 years.

With the parameter server framework this now parallelizes easily: on 20 machines, training takes about 2-3 days.
How to allocate words into the 2D table
• Cold start (see the sketch below):
  • Row partition according to prefix
  • Column partition according to suffix
• Bootstrap:
  • Train with the current partition for several iterations
  • Adjust the partition based on training loss
  • Continue training

Examples: Billion, Million, Trillion (shared suffix "-llion") fall in the same column; sure, prepare, gre… (shared suffix "-re") fall in the same column; react, real, return (shared prefix "re-") fall in the same row.
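A cold-start allocation sketch (illustrative Python; the paper's exact grouping rule may differ): sort the vocabulary by prefix to assign rows and by reversed word (suffix) to assign columns:

```python
import math

def cold_start_allocate(vocab_words):
    """Toy cold-start partition: rows follow prefix order, columns follow
    suffix order. Illustrative only; it does not enforce one word per cell,
    which the bootstrap reallocation step then refines."""
    n = len(vocab_words)
    side = math.isqrt(n - 1) + 1                     # table is side x side
    by_prefix = sorted(vocab_words)                  # lexicographic ~ prefix
    by_suffix = sorted(vocab_words, key=lambda w: w[::-1])  # reversed ~ suffix
    row = {w: i * side // n for i, w in enumerate(by_prefix)}
    col = {w: i * side // n for i, w in enumerate(by_suffix)}
    return {w: (row[w], col[w]) for w in vocab_words}

words = ["billion", "million", "trillion", "react", "real",
         "return", "sure", "prepare"]
for w, rc in sorted(cold_start_allocate(words).items()):
    print(w, rc)
# react/real/return share a row (prefix "re-"); billion/million/trillion
# share a column (suffix "-llion"); sure/prepare share a column ("-re").
```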
Experimental results
Middle-sized dataset: 2013 ACL Workshop dataset

Model | PPL on test (ACLW-Spanish) | PPL on test (ACLW-French) | Model size
KN-4 [1] | 219 | 243 | --
MLBL [1] | 203 | 227 | --
LSTM word-in/word-out | 186 | 202 | 61 M
LSTM char-CNN-in/word-out [2] | 169 | 190 | 45 M
Our 2C-RNN [cold start] | 184 | 210 | 17 M
Our 2C-RNN [bootstrap] | 157 | 181 | 17 M

[1] Non-LSTM RNN baselines: http://jmlr.org/proceedings/papers/v32/botha14.pdf
[2] Previous state-of-the-art method using character-CNN input: Character-Aware Neural Language Models, http://arxiv.org/abs/1508.06615
Method (1 GPU) | Runtime (hours) | Reallocation/Training
HSM | 168 | --
2C-RNN | 82 | 0.19%
(Runtime to reach the same PPL as the HSM baseline.)
Experimental results
Large-scale dataset: One Billion Word benchmark [1]

Model | PPL on test | Model size
KN-5 [1] | 68 | 2 G
HSM [2] | 85 | 1.6 G
Blackout-RNN [3] | 68 | 4.1 G
Our 2C-RNN [cold start] | 78 | 41 M
Our 2C-RNN [bootstrap] | 66 | 41 M
KN + HSM [2] | 56 | --
KN + Blackout-RNN [3] | 47 | --
KN + 2C-RNN | 43 | --

Method (1 GPU) | Runtime (hours) | Reallocation/Training
HSM | 168 | --
2C-RNN | 70 | 2.36%
(Runtime to reach the same PPL as the HSM baseline.)

[1] One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling, https://arxiv.org/abs/1312.3005
[2] Strategies for Training Large Vocabulary Neural Language Models, http://arxiv.org/abs/1512.04906
[3] BlackOut: Speeding up Recurrent Neural Network Language Models with Very Large Vocabularies, https://arxiv.org/pdf/1511.06909v7.pdf
Summary and forward looking
• DMTK includes innovations in both systems and algorithms:
  • Excellent speed-up and widely available system integration
  • Advanced distributed optimization methods
  • Many world-leading algorithms
• Machine learning for distributed deep learning:
  • Learning how to acquire, select, and partition the data
  • Learning the optimal network structure
  • Learning how to perform model updates
  • Learning how to tune the hyper-parameters
  • Learning how to aggregate local models
• Create an AI that can automatically create new AI!
Thanks! taifengw@microsoft.com
https://www.microsoft.com/en-us/research/people/taifengw/
DMTK-related materials: dmtk@Microsoft.com, http://www.dmtk.io, https://github.com/Microsoft/multiverso/wiki
Welcome to join our WeChat group, 分布式机器学习联盟 (Distributed Machine Learning Alliance), to discuss big data and AI together.
Bootstrap: bipartite graph matching
Comparison

(Diagram: class-based softmax, with classes $c_1, c_2, \dots, c_k$ and words $w_{1,1}, \dots, w_{1,k}, \dots, w_{k,1}, \dots, w_{k,k}$ under them.)

Method | Model size | Training time | Test time | Generalization time
Standard softmax | $O(|V| \times w)$ | $O(|V| \times w)$ | $O(|V| \times w)$ | $O(|V| \times w)$
Tree-based softmax | $O(|V| \times w)$ | $O(\log|V| \times w)$ | $O(\log|V| \times w)$ | $O(|V| \times w)$
Class-based softmax | $O(|V| \times w)$ | $O(\sqrt{|V|} \times w)$ | $O(\sqrt{|V|} \times w)$ | $O(|V| \times w)$
Our 2C | $O(\sqrt{|V|} \times w)$ | $O(\sqrt{|V|} \times w)$ | $O(\sqrt{|V|} \times w)$ | $O(\sqrt{|V|} \times w)$