2020. 04. 17 Data Mining & Quality Analytics Lab. 목충협


Page 1

2020. 04. 17

Data Mining & Quality Analytics Lab.

목 충 협

Page 2

• Introduction

• Optimization

• Problem description

• Learning to learn

• Appendix

Page 3

• Deep learning

Deep learning is one of the methodologies for implementing artificial intelligence and belongs to the familiar field of machine learning.

Artificial intelligence ⊃ machine learning ⊃ artificial neural networks ⊃ deep learning

Source: https://brunch.co.kr/@gdhan/7

Page 4

• Deep learning

Application areas: speech recognition, image processing, and text.

Source: https://brunch.co.kr/@gdhan/7

Page 5

• Deep learning

As computing power has grown, deep learning has come to outperform existing machine learning methods.

Its high performance comes from fitting a very large number of weights (parameters) using large amounts of data.

The optimizer learns all of these weights at once using gradient descent.

The optimizer solving this optimization problem is what trains the deep learning model.

Page 6

• Optimization problem

Given a function $f: A \to \mathbb{R}$ from some set $A$ to the real numbers,

Minimization: find an element $x_0 \in A$ such that $f(x_0) \le f(x)$ for all $x \in A$.

Page 7

• Optimization problem in deep learning

[Figure: an artificial neural network with weights $w_1, w_2, \dots$ that classifies pictures as cat (1) or dog (0).]

Given training data $D$, find the optimal parameters $\theta^*$ that minimize the loss function $f$.

$D = \{(x_1, y_1), (x_2, y_2), \dots, (x_k, y_k)\}$, e.g. $(x_1, y_1) = (\text{image}, \text{cat})$, $(x_2, y_2) = (\text{image}, \text{dog})$

$\theta = (w_1, w_2, w_3, \dots)$
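To make the objective concrete, here is a minimal sketch of such a loss $f(\theta; D)$ and its gradient for a toy linear cat-vs-dog classifier (the logistic form and all names here are illustrative assumptions, not from the slides):

```python
import numpy as np

def loss(theta, X, y):
    """Binary cross-entropy of a toy linear classifier (cat = 1, dog = 0)."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))            # sigmoid of the scores
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def grad_loss(theta, X, y):
    """Gradient of the loss above with respect to theta."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (p - y) / len(y)
```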

Page 8

• Stochastic gradient descent (SGD)

Move in a direction that gives a smaller loss value than at the previous step!

$\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)$

$\nabla f(\theta_t)$: direction

$\alpha$: step size

Page 9

• Stochastic gradient descent (SGD)

[Figure: on the loss curve $f$, one step moves from $\theta_t$ to $\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)$.]
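A minimal sketch of this update rule; `grad_f(theta, batch)` is an assumed helper that returns the mini-batch gradient:

```python
def sgd_step(theta, grad_f, batch, alpha=0.01):
    """One SGD step: theta_{t+1} = theta_t - alpha * grad f(theta_t)."""
    return theta - alpha * grad_f(theta, batch)
```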

Page 10

• Stochastic gradient descent (SGD)

$\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)$

[Figure: loss curves over a weight, showing a local minimum and a saddle point where plain SGD can get stuck.]

Possible remedies:

1. Modify the direction
2. Modify the step size

Page 11

• Momentum

Searching for a better direction

• Uses inertia to escape local optima, saddle points, and similar traps.

SGD: $\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)$ → Momentum: $\theta_{t+1} = \theta_t - \alpha v_{t+1}$, where $v_{t+1} = \rho v_t + \nabla f(\theta_t)$

[Figure: loss curve with a saddle point.]

At the saddle point $\nabla f(\theta_t) = 0$, but $v_{t+1} > 0$, so the update can pass through.

Page 12

• Momentum

Momentum leverages past gradient information: the velocity $v$ accumulates previous gradients, so the update keeps moving even where the current gradient vanishes.
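A minimal sketch of the momentum update, with the same assumed `grad_f` helper:

```python
def momentum_step(theta, v, grad_f, batch, alpha=0.01, rho=0.9):
    """v_{t+1} = rho * v_t + grad f(theta_t);  theta_{t+1} = theta_t - alpha * v_{t+1}."""
    v = rho * v + grad_f(theta, batch)
    return theta - alpha * v, v
```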

Page 13

• Nesterov momentum

Searching for a better direction

• Move in the inertia direction first, then compute a more accurate gradient.

Momentum: $\theta_{t+1} = \theta_t - \alpha v_{t+1}$, $v_{t+1} = \rho v_t + \nabla f(\theta_t)$

Nesterov momentum: $\theta_{t+1} = \theta_t - \alpha v_{t+1}$, $v_{t+1} = \rho v_t + \nabla f(\theta_t + \rho v_t)$
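A minimal sketch of the look-ahead idea, written in the common form where the velocity is stored as a displacement (so the signs differ slightly from the slide's notation); `grad_f` is the same assumed helper:

```python
def nesterov_step(theta, v, grad_f, batch, alpha=0.01, rho=0.9):
    """Evaluate the gradient at the look-ahead point theta + rho * v, then update."""
    v = rho * v - alpha * grad_f(theta + rho * v, batch)
    return theta + v, v
```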

Page 14

• Adaptive gradient (Adagrad)

Searching for a better step size

• Applying the same step size $\alpha$ regardless of the gradient $\nabla f$ is inefficient.

$\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)$ → $\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \nabla f(\theta_t)$

[Figure: one loss curve where a small $\alpha$ is needed and one where a large $\alpha$ is needed.]

Page 15

• Adaptive gradient (Adagrad)

Searching for a better step size

• Parameters whose gradients $\nabla f$ have been large are updated less; parameters whose gradients have been small are updated more.

$G_t = \nabla f(\theta_t)^2 + \nabla f(\theta_{t-1})^2 + \nabla f(\theta_{t-2})^2 + \cdots$, i.e. $G_t = G_{t-1} + \nabla f(\theta_t)^2$

$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \nabla f(\theta_t)$
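A minimal sketch of the Adagrad update; `G` is the per-parameter running sum of squared gradients and `grad_f` is the same assumed helper:

```python
import numpy as np

def adagrad_step(theta, G, grad_f, batch, alpha=0.01, eps=1e-8):
    """G_t = G_{t-1} + g**2;  theta_{t+1} = theta_t - alpha / sqrt(G_t + eps) * g."""
    g = grad_f(theta, batch)
    G = G + g ** 2                                    # parameters with large past gradients...
    return theta - alpha / np.sqrt(G + eps) * g, G    # ...receive smaller steps
```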

Page 16

• RMSProp

Searching for a better step size

• In Adagrad, $G_t$ grows so quickly that the effective step size $\frac{\alpha}{\sqrt{G_t + \epsilon}}$ can collapse to 0 early in training.

Decay $G_t$ over time so that old gradients $\nabla f$ are gradually forgotten:

$G_t = \gamma G_{t-1} + (1 - \gamma)\nabla f(\theta_t)^2 = (1-\gamma)\nabla f(\theta_t)^2 + \gamma(1-\gamma)\nabla f(\theta_{t-1})^2 + \gamma^2(1-\gamma)\nabla f(\theta_{t-2})^2 + \cdots$

$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \nabla f(\theta_t)$
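A minimal sketch of the RMSProp update, identical to Adagrad except that `G` is an exponential moving average rather than a running sum:

```python
import numpy as np

def rmsprop_step(theta, G, grad_f, batch, alpha=0.001, gamma=0.9, eps=1e-8):
    """G_t = gamma * G_{t-1} + (1 - gamma) * g**2, so the step size no longer collapses to 0."""
    g = grad_f(theta, batch)
    G = gamma * G + (1 - gamma) * g ** 2
    return theta - alpha / np.sqrt(G + eps) * g, G
```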

Page 17

• Adaptive moment estimation (Adam)

Combining the advantages of several methods:

[Diagram: SGD → Momentum (better direction); Adagrad → RMSProp (better step size); Momentum + RMSProp → Adam.]

Page 18

• Adaptive moment estimation (Adam)

Combining the advantages of several methods:

Good direction (Momentum): $\theta_t = \theta_{t-1} - \alpha v_t$, $v_t = \rho v_{t-1} + \nabla f(\theta_t)$

Good step size (RMSProp): $\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{G_t + \epsilon}} \nabla f(\theta_t)$, $G_t = \gamma G_{t-1} + (1 - \gamma)\nabla f(\theta_t)^2$

Page 19

• Adaptive moment estimation (Adam)

Combining the two, the momentum direction $v_t$ is used with the RMSProp step size:

$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{G_t + \epsilon}} v_t$

Page 20

• Adaptive moment estimation (Adam)

Both moments are maintained as exponential moving averages:

$v_t = \beta_1 v_{t-1} + (1 - \beta_1)\nabla f(\theta_t)$ (good direction, from Momentum)

$G_t = \beta_2 G_{t-1} + (1 - \beta_2)\nabla f(\theta_t)^2$ (good step size, from RMSProp)

$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{G_t + \epsilon}} v_t$

Page 21

• Adaptive moment estimation (Adam)

Because $v_t$ and $G_t$ are initialized near 0, the early gradients are under-reflected; bias correction rescales them:

$\hat{v}_t = \frac{v_t}{1 - \beta_1^t}$, $\hat{G}_t = \frac{G_t}{1 - \beta_2^t}$

Adam: $v_t = \beta_1 v_{t-1} + (1 - \beta_1)\nabla f(\theta_t)$, $G_t = \beta_2 G_{t-1} + (1 - \beta_2)\nabla f(\theta_t)^2$, $\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{\hat{G}_t + \epsilon}} \hat{v}_t$
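A minimal sketch of one Adam step with bias correction, following the published update rule (the denominator uses $\sqrt{\hat{G}_t} + \epsilon$; `grad_f` is the same assumed helper and $t$ starts at 1):

```python
import numpy as np

def adam_step(theta, v, G, t, grad_f, batch,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Momentum direction + RMSProp-style step size, with bias correction."""
    g = grad_f(theta, batch)
    v = beta1 * v + (1 - beta1) * g           # first moment: good direction
    G = beta2 * G + (1 - beta2) * g ** 2      # second moment: good step size
    v_hat = v / (1 - beta1 ** t)              # correct the bias toward 0
    G_hat = G / (1 - beta2 ** t)
    return theta - alpha * v_hat / (np.sqrt(G_hat) + eps), v, G
```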

Page 22

• Is Adam always the best?

Simulation: a 2-D loss surface where brighter regions have lower loss; it contains one global optimum and one local optimum.

https://emiliendupont.github.io/2018/01/24/optimization-visualization/

Page 23

• Is Adam always the best?

Simulation: in one case, only Adam and RMSProp reach the global optimum; in another, none of the optimizers reach it.

https://emiliendupont.github.io/2018/01/24/optimization-visualization/

Page 24

• Is Adam always the best?

Simulation: in one case, every optimizer reaches the global optimum; in another, only Adam fails to reach it.

https://emiliendupont.github.io/2018/01/24/optimization-visualization/

Page 25

• Is Adam always the best?

Adam is the most widely used optimizer, but it is not always the best.

The outcome depends on the starting point and on the data.

Luo, L., Xiong, Y., Liu, Y., & Sun, X. (2019). Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843.

Page 26

• Which optimizer should we use?

No single algorithm guarantees the best performance in every setting.

Page 27

• Which optimizer should we use?

No single algorithm guarantees the best performance in every setting.

So let's build and train an optimizer suited to the data!

The paper introduced here builds the optimizer as an LSTM model and trains it.

Page 28

• Learning to learn by gradient descent by gradient descent

Google DeepMind

29th Neural Information Processing Systems (NIPS), 2016

Highly cited (696 citations)

Page 29

Learning to learn by gradient descent by gradient descent

Learning to learn (meta-learning): training the optimizer that, in turn, trains an ordinary model.

An optimizer (LSTM) that predicts gradient-descent updates is itself trained by gradient descent.

Page 30

• Long Short-Term Memory (LSTM)

A model in the recurrent neural network (RNN) family.

It captures the sequential structure of data (sequence data).

It was designed to address the problems that arise as sequences grow longer.

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 31

• Hand-designed optimizer vs. model-based optimizer

Hand-designed optimizer (SGD, Adam, …)

• Hyper-parameters must be re-tuned by hand whenever the data or the model changes.

• No single algorithm fits every dataset.

Model-based optimizer (proposed)

• Train an optimizer suited to each dataset (a meta-learning view).

• Training the optimizer on similar data and reusing it enables better training on the original data (a transfer-learning view).

Page 32

• Hand-designed optimizer vs. model-based optimizer

Hand-designed optimizer (SGD), rule-based: the direction $\nabla f(\theta_t)$ is scaled by the distance $-\alpha$, giving $\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)$.

Model-based optimizer (proposed), data-driven: a model outputs direction and distance together as a single update $g_t$, giving $\theta_{t+1} = \theta_t + g_t$.

Page 33

• LSTM optimizer

[Figure: the loss function $f$, its gradient $\nabla f(\theta_t)$, and the optimizer $m$ (an LSTM) whose output is added to the parameters.]

1. Compute the gradient $\nabla f(\theta_{t-2})$ of $f(\theta_{t-2})$.
2. Feed $\nabla f(\theta_{t-2})$ into the optimizer (LSTM) as input.
3. Compute the optimizer's output $g_{t-2}$.
4. Add $g_{t-2}$ to $\theta_{t-2}$ to obtain the next parameters $\theta_{t-1}$.

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., ... & De Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems (pp. 3981-3989).
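A rough PyTorch sketch of this loop; the class name, hidden size, and shapes are illustrative assumptions rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class LSTMOptimizer(nn.Module):
    """m: maps the gradient of each parameter to its update g_t."""
    def __init__(self, hidden_size=20):
        super().__init__()
        self.cell = nn.LSTMCell(input_size=1, hidden_size=hidden_size)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, grad, state):
        # grad: (n_params, 1) -- each parameter is treated as one coordinate
        h, c = self.cell(grad, state)
        return self.head(h), (h, c)            # update g_t and new hidden state

def learned_update(theta, grad, optimizer, state):
    """theta_{t+1} = theta_t + g_t, where g_t = m(grad f(theta_t))."""
    g, state = optimizer(grad.unsqueeze(1), state)
    return theta + g.squeeze(1), state
```

Here `state` can start as `None`, in which case the LSTM cell uses zero-initialized hidden states.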


Page 38

• LSTM optimizer

Training the optimizer

The optimizer is trained so that the model it trains performs well.

$\theta^*(f, \phi)$: the optimal parameters obtained when the optimizer $\phi$ trains the model $f$ to completion, giving the objective $L(\phi) = \mathbb{E}[f(\theta^*(f, \phi))]$.

In practice it is hard to obtain a fully trained $\theta^*$, so instead of the optimal parameters $\theta^*$ the parameters $\theta_t$ from every training step are combined with weights:

$L(\phi) = \mathbb{E}\left[\sum_{t=1}^{T} w_t\, f(\theta_t)\right]$
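A minimal sketch of this unrolled objective, reusing the `learned_update` sketch from Page 33; the unroll length, the weight handling, and feeding detached gradients to the LSTM are illustrative assumptions:

```python
import torch

def meta_loss(f, theta0, optimizer, T=20, weights=None):
    """L(phi) = sum_t w_t * f(theta_t) over T unrolled learned-update steps."""
    theta, state = theta0, None                  # theta0 must require gradients
    weights = weights if weights is not None else [1.0] * T
    total = 0.0
    for t in range(T):
        grad, = torch.autograd.grad(f(theta), theta, retain_graph=True)
        theta, state = learned_update(theta, grad, optimizer, state)
        total = total + weights[t] * f(theta)    # accumulate f(theta_1), ..., f(theta_T)
    return total                                 # minimized w.r.t. the optimizer's parameters phi
```

An outer loop would repeatedly compute this meta-loss, backpropagate it, and step an ordinary optimizer on the LSTM's own parameters — gradient descent by gradient descent.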

Page 39

• Parameter sharing (Coordinatewise LSTM optimizer)

[Figure: the LSTM optimizer unrolled over time, with inputs $\theta_1, \theta_2, \theta_3, \dots, \theta_T$ and outputs $g_1, g_2, g_3, \dots, g_T$.]

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 40

• Parameter sharing (Coordinatewise LSTM optimizer)

[Figure: a neural network whose weights $w_1, \dots, w_n$ form the parameter vector at each step, e.g. $\theta_2 = (w_1^2, w_2^2, w_3^2, w_4^2, \dots, w_n^2)$.]

Here $n$ ranges from tens of thousands up to millions of parameters.

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 41

• Parameter sharing (Coordinatewise LSTM optimizer)

[Figure: the grid of parameter coordinates over time, $w_1^t, w_2^t, w_3^t, \dots, w_n^t$ for $t = 1, \dots, T$, as inputs, with the LSTM outputs $g_1, g_2, g_3, \dots, g_T$.]

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 42

• Parameter sharing (Coordinatewise LSTM optimizer)

[Figure: the same grid, highlighting in turn the update trajectory of each individual coordinate $w_1, w_2, w_3, \dots, w_n$ across steps.]

https://colah.github.io/posts/2015-08-Understanding-LSTMs/


Page 46

• Parameter sharing (Coordinatewise LSTM optimizer)

Because the input/output dimensionality is small, the optimizer (LSTM) is easy to train.

A small input size with many training examples is more favorable for LSTM training than a large input size with few examples.

It can learn the different update patterns of every individual parameter ($w_1, w_2, \dots$).

Because it keeps each parameter's past information, it can predict updates accurately (exploiting the strength of the LSTM) — the same principle as Momentum and Adam.

Because learning is done per parameter, the optimizer can still be used when the model architecture (hidden layers, units, etc.) changes.

Knowledge transfer is only possible within the same layer type (Conv / FC / LSTM).
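A minimal sketch of the coordinatewise trick: one small LSTM is shared across all $n$ coordinates by treating them as the batch dimension, so every parameter keeps its own hidden state while the optimizer's size stays independent of $n$ (the sizes and names below are illustrative):

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=1, hidden_size=20)   # one shared LSTM for all coordinates
head = nn.Linear(20, 1)

n = 100_000                        # number of model parameters (coordinates)
grads = torch.randn(n, 1)          # per-coordinate gradients at one step
h = torch.zeros(n, 20)             # each coordinate carries its own hidden state
c = torch.zeros(n, 20)

h, c = cell(grads, (h, c))         # same LSTM weights applied to every coordinate
updates = head(h).squeeze(1)       # g_t: one predicted update per coordinate
```

Because the LSTM only ever sees a one-dimensional input per coordinate, the same trained optimizer can be reused when the number of hidden layers or units in the base model changes.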

Page 47

• Results

Architecture used when training the optimizer (LSTM): 1 hidden layer, 20 units, sigmoid activation.

Experiments are run on quadratic functions and on the MNIST dataset.

Page 48

• Results

Architecture used when training the optimizer (LSTM): 1 hidden layer, 20 units, sigmoid activation.

Experiments vary the number of units, the number of hidden layers, and the activation function.

When the activation is switched from sigmoid to ReLU, training actually fails.

Page 49

• Results

Architecture used when training the optimizer (LSTM): 1 hidden layer, 20 units, sigmoid activation.

Experiments are run on the CIFAR-10, CIFAR-5, and CIFAR-2 datasets.

For LSTM-sub, the optimizer is trained on images from classes that are not used at test time.

Page 50

• Results

To demonstrate practical usefulness, the learned optimizer is applied to a computer vision task (style transfer between a content photo and a style painting).

[Figure: photo (original) and painting (style) pairs.]

Page 51

• Conclusion

An LSTM optimizer is used that predicts, in a single output, both the direction and the distance by which the parameters should move.

It is possible to learn an efficient optimizer from data.

The learned optimizer can still be used when the model architecture changes.

$\theta_{new} = \theta_{old} - \alpha \nabla f(\theta_t)$: the term $-\alpha \nabla f(\theta_t)$ is replaced by the LSTM optimizer's predicted update.

Page 52

Thank you.