Learning Coordination Strategies Using Reinforcement Learning -- Myriam Z. Abramson, dissertation, 2003. Presented by 張景照 (dorgon chang), 2012-06-14.




Page 1

Learning Coordination strategies using reinforcement learning

-- Myriam Z. Abramson, dissertation, 2003

張景照 (dorgon chang)

2012-06-14

Page 2

Index

• Coordination problem (the problem to be solved)
• Evaluation of Go
• Reinforcement learning
• Temporal Difference learning (using Sarsa)
• Learning Vector Quantization (LVQ)
• Sarsa LVQ (SLVQ) <= the method proposed by the author

Page 3

Coordination problem

• Simply put, the coordination strategy problem is an action selection problem.

• When we only know the local situation, how do we choose a correct action, without relying on the end-game state, so that it can be combined with the other actions?
• How do local tactics affect the overall strategy?

Page 4

Evaluation of Go

This method conveys the spatial connectivity between the stones.

ε is a user-defined threshold; when a point's influence exceeds ε, the influence keeps spreading outward.

Summing all the numbers on the board gives an evaluation value for the position.

Black stones spread +1 outward; white stones spread -1.

This evaluation value is used as the reward in the methods that follow. (A small sketch of such an influence evaluation is given below.)
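A minimal sketch of this kind of influence evaluation, under the assumption that each stone's influence halves at every step outward and stops spreading once its magnitude falls below ε (the exact decay rule is an assumption for illustration, not the dissertation's specification):

```python
from collections import deque

def evaluate_board(board, eps=0.1):
    """Influence-style board evaluation sketch.

    board: 2D list with +1 (black stone), -1 (white stone), 0 (empty).
    Each stone radiates its value outward; the influence halves per step
    (assumed decay) and stops spreading once below eps. The evaluation is
    the sum of the influence over all points (positive favors black).
    """
    n = len(board)
    influence = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            if board[r][c] == 0:
                continue
            seen = {(r, c)}
            frontier = deque([(r, c, float(board[r][c]))])
            while frontier:                               # breadth-first spread
                y, x, val = frontier.popleft()
                influence[y][x] += val
                nxt = val / 2.0                           # assumed halving per step
                if abs(nxt) <= eps:                       # stop once below threshold
                    continue
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < n and 0 <= nx < n and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        frontier.append((ny, nx, nxt))
    return sum(sum(row) for row in influence)
```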

Page 5

Reinforcement learning: Introduction

• The goal of machine learning is to produce agents; RL is one such method, characterized by trial-and-error search and delayed reward.

(Figure annotations, the RL loop applied to Go: the action is playing the next stone; the reward is, e.g., a win, a loss, or a drawn position; the agent predicts the board several moves ahead.)

Page 6

Reinforcement learning: Value Function

• π = the policy the agent uses to select actions.
• s = the current state.
• $V^{\pi}(s)$: the expected reward obtained from state s under policy π.
• $Q^{\pi}(s, a)$: the expected reward obtained by taking action a in state s under policy π.
• The most common policy is ε-greedy (others include greedy, ε-soft, softmax, ...).

• ε lies between 0 and 1; the higher its value, the more exploration is encouraged (exploration vs. exploitation).

• ε-greedy: most of the time choose the action with the highest estimated reward; with a small probability ε choose an action at random. (A minimal sketch follows.)
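A minimal ε-greedy selection sketch over a tabular Q estimate (the dictionary layout and tie-breaking are illustrative assumptions):

```python
import random

def epsilon_greedy(q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit
    (pick the action with the highest estimated value for this state)."""
    if random.random() < epsilon:
        return random.choice(actions)                                # exploration
    return max(actions, key=lambda a: q.get((state, a), 0.0))        # exploitation
```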

Page 7

Temporal Difference learning

• TD learning is the method used to estimate the value function in RL.

DP: the current estimate is based on previously learned estimates (bootstrapping).
MC: play out random games and use the statistics of their outcomes to handle situations that may arise later.
The TD method combines the two:
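For concreteness, a one-step tabular TD(0) update, which bootstraps like DP but learns from sampled experience like MC (variable names and defaults are illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    v_s = V.get(s, 0.0)
    target = r + gamma * V.get(s_next, 0.0)   # sampled reward + learned estimate
    V[s] = v_s + alpha * (target - v_s)
```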

Page 8

Temporal Difference learning: Forward View of TD(λ) (1)

• Monte Carlo: observes the rewards of all steps in an episode.

• TD(0): observes one step only; the two-step return observes two steps.

• TD(λ) is a method for averaging over all the n-step returns:

$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
$R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$
$R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$

Value update: $\Delta V_t(s_t) = \alpha \big[ R_t^{\lambda} - V_t(s_t) \big]$

Set λ = 0 to get TD(0); set λ = 1 to get Monte Carlo.

r = the reward at a given time step; γ = the discount rate applied to future rewards.
$R_t^{(n)}$ = the total reward observed n steps ahead from time t (up to the end of the game at T); it is a scalar.

Page 9

Temporal Difference learning: Forward View of TD(λ) (2)

(Same n-step returns, λ-return, and value update as the previous slide.)

Substituting λ = 0, only the one-step return keeps any weight:
$R_t^{\lambda} = R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
which is exactly the TD(0) target.

Page 10

Temporal Difference learning: Forward View of TD(λ) (3)

(Same n-step returns, λ-return, and value update as the previous slide.)

Substituting λ = 1, all of the weight moves to the complete return:
$R_t^{\lambda} = R_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{T-t-1} r_T$
which is exactly the Monte Carlo target.

Page 11

Temporal Difference learning: Forward View of TD(λ) (4)

T is the total number of steps in a game; t is the index of the current step within that game.

Worked example with λ = 0.5, t = 0, T = 3 (figure: states S0 → S1 → S2 → S3):
w1 = (1-λ) = 0.5, the weight on the one-step return $R_t^{(1)} = r_{t+1} + \gamma V(s_{t+1})$
w2 = (1-λ)λ = 0.25, the weight on the two-step return
w3 = λ^{T-t-1} = 0.25, the weight on the complete return up to the terminal state
Normalization ensures the weights sum to 1: $\sum_n w_n = 1$.
(A short sketch reproducing these weights follows.)
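A small sketch that reproduces these weights for the finite-episode λ-return; the formula follows the standard Sutton & Barto treatment, and the values λ = 0.5, t = 0, T = 3 are the ones from the slide:

```python
def lambda_return_weights(lam, t, T):
    """Weights the lambda-return places on the n-step returns R_t^(n).

    The n-step returns for n = 1 .. T-t-1 get weight (1-lam) * lam**(n-1);
    the complete return up to the terminal state gets the remaining weight
    lam**(T-t-1), so all weights sum to 1.
    """
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, T - t)]
    weights.append(lam ** (T - t - 1))        # weight of the complete return
    return weights

print(lambda_return_weights(0.5, 0, 3))       # [0.5, 0.25, 0.25]
```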

Page 12

Temporal Difference learning: Forward View of TD(λ) (5)

The weight λ^{T-t-1} placed on the final (complete) return equals the sum of all the weights that would have come after termination.

The lower λ, the faster the weights fall off, so the near-term returns dominate; the higher λ, the slower the weights fall off, so the later (longer) returns carry more weight.

To summarize the role of λ:
1. It bridges the TD and MC methods.
2. It determines how an action with no immediate effect is punished or rewarded.

(Figure: weight-decay curves and eligibility traces. If λ = 0.1, then 1-λ = 0.9; if λ = 0.9, then 1-λ = 0.1. Also shown: the result of setting λ = 0.5, t = 0, T = 3.)
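A quick check of this decay behavior, printing the first few n-step weights (1-λ)λ^(n-1) for a low and a high λ:

```python
for lam in (0.1, 0.9):
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 6)]
    print(lam, [round(w, 4) for w in weights])
# 0.1 -> [0.9, 0.09, 0.009, 0.0009, 0.0001]  (falls off fast: near-term returns dominate)
# 0.9 -> [0.1, 0.09, 0.081, 0.0729, 0.0656]  (falls off slowly: later returns matter more)
```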

Page 13

Temporal Difference learning: Backward View of TD(λ) (1)

• Eligibility traces (recursive definition):
$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$

• Reinforcing event (TD error):
$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$

• Value update (for all states s):
$\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$

Non-recursive definition of the trace:
$e_t(s) = \sum_{k=0}^{t} (\gamma\lambda)^{t-k} I_{s s_k}$, where $I_{s s_k} = 1$ if $s = s_k$ and 0 otherwise.

The reinforcing events are used to propagate the update backwards, one step at a time. (A tabular sketch of this update follows.)
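A tabular backward-view TD(λ) step with accumulating traces, following the equations above (data structures and default parameters are illustrative assumptions):

```python
def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One backward-view TD(lambda) step with accumulating eligibility traces.

    V: dict state -> value estimate, e: dict state -> eligibility trace.
    """
    e[s] = e.get(s, 0.0) + 1.0                               # bump trace of the visited state
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # reinforcing event (TD error)
    for state, trace in list(e.items()):
        V[state] = V.get(state, 0.0) + alpha * delta * trace # update every traced state
        e[state] = gamma * lam * trace                       # then decay its trace
```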

Page 14

Temporal Difference learning: Backward View of TD(λ) (2)

(Same eligibility-trace, reinforcing-event, and value-update equations as the previous slide, including the non-recursive trace definition.)

Setting λ = 0, every trace except the current state's drops to zero, so the update reduces to TD(0), whose target is the one-step return $R_t^{(1)} = r_{t+1} + \gamma V(s_{t+1})$.

Page 15

Temporal Difference learning: Why the Backward View?

• Forward view
– theoretical view: conceptually easier to understand
– not directly implementable: the information it needs must still be obtained by looking ahead (simulation)

• Backward view
– mechanistic view: easier to implement
– simple conceptually and computationally
– in the offline case it achieves the same result as the forward view (provable)

Page 16

Temporal Difference learning : Equivalence of the Forward and Backward Views

$\sum_{t=0}^{T-1} \Delta V_t^{b}(s) = \sum_{t=0}^{T-1} \Delta V_t^{f}(s_t)\, I_{s s_t}$,
where $I_{s s_t} = 1$ if $s = s_t$ and 0 otherwise.

Ref: Section 7.4, Equivalence of the Forward and Backward Views, http://www.cs.ualberta.ca/~sutton/book/7/node1.html (proof that the two are equal in the offline case).

The summed value updates of the backward view (left-hand side) and of the forward view (right-hand side) are equal. (Figure: the sum of the forward-view weights when λ = 1, i.e., MC, and T = 3.)

Page 17

Temporal Difference learning: the Sarsa algorithm

(Sarsa pseudocode figure, with annotations marking the behavior policy and the estimation policy; the outer loop runs once for each game, the inner loop once for each stone played.)

How many steps ahead R_t looks when updating depends on the variant used, e.g., Sarsa(λ); in the one-step case, $R_t^{(1)} = r_{t+1} + \gamma V(s_{t+1})$. (A minimal tabular Sarsa loop is sketched below.)
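A minimal tabular one-step Sarsa episode loop; env.reset()/env.step() are an assumed generic environment interface and the names are hypothetical, not the dissertation's code:

```python
import random

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
    """One episode of tabular Sarsa: pick a' with the same policy being
    estimated, then move Q(s, a) toward r + gamma * Q(s', a')."""
    def policy(s):  # epsilon-greedy over the current Q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))

    s = env.reset()
    a = policy(s)
    done = False
    while not done:                                # one iteration per stone played
        s_next, r, done = env.step(a)
        a_next = policy(s_next)
        target = r if done else r + gamma * Q.get((s_next, a_next), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        s, a = s_next, a_next
```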

Page 18

Learning Vector Quantization

• Main purpose: data compression.
• Basic idea: represent the whole input sample space with a smaller number of clusters, i.e., find a prototype (representative point) for each class.

VQ: suitable for data without class information. LVQ: suitable for data with class information.

(Figure: M = 3 prototypes; O = prototype vector, + = input data; prototypes m1, m2, m3 each represent one region of the input space.)
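A sketch of the basic LVQ1 update rule: find the nearest prototype, pull it toward the sample when the class labels match, and push it away otherwise (the numpy layout and learning rate are assumptions for illustration):

```python
import numpy as np

def lvq1_step(prototypes, proto_labels, x, label, lr=0.05):
    """One LVQ1 update: the winning (nearest) prototype moves toward x if its
    class matches the sample's label, and away from x otherwise."""
    dists = np.linalg.norm(prototypes - x, axis=1)   # geometric (Euclidean) distance
    w = int(np.argmin(dists))                        # index of the winner
    direction = 1.0 if proto_labels[w] == label else -1.0
    prototypes[w] += direction * lr * (x - prototypes[w])
    return w
```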

Page 19

SLVQ: architecture (1)

(Figure: an idea of what a SOM looks like; the marked points are the prototypes, initially scattered at random over the board.)

Create n agents = the pattern database.

The SOM algorithm can be used to decide dynamically how many prototypes M are needed, i.e., the number of patterns can grow and shrink dynamically.

The agent records the values of the state/action pairs it has tried; through the LVQ algorithm the mapping Q(s, a) => Q(m, a) is used, so the size of the state space is compressed dramatically. (A sketch of this prototype-indexed lookup follows.)
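A sketch of the state-space compression idea: the Q-table is indexed by the nearest prototype m rather than by the raw board state. How the dissertation encodes boards and measures similarity is not spelled out here, so the vector encoding and Euclidean distance are assumptions:

```python
import numpy as np

def nearest_prototype(prototypes, state_vec):
    """Map a (vectorized) board state onto its closest prototype index m."""
    dists = np.linalg.norm(prototypes - state_vec, axis=1)
    return int(np.argmin(dists))

def q_lookup(Q, prototypes, state_vec, action):
    """Q(s, a) => Q(m, a): the Q-table is keyed by prototype, not raw state."""
    m = nearest_prototype(prototypes, state_vec)
    return Q.get((m, action), 0.0)
```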

Page 20

SLVQ: architecture (2)

(Figure, with M = 3 prototypes m1, m2, m3: the prototype weights are initialized at random; at the end of each game the prototypes are updated using LVQ.)

The more games of training, the more representative the prototypes become, and they gradually converge.

When updating a prototype, a similarity computation (geometric distance) is used to find the matching pattern. Ref: S. Santini and R. Jain, Similarity Measures, IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9), 1999.

The update itself uses the backward view.

Page 21

Candidate Moves (1)

• Empirically, a move is better if it serves multiple purposes. The following are move features in Go (see the liberty-count sketch after this list):
• Attack: reduce the opponent's liberties
• Defend: increase one's own liberties
• Claim: increase one's own influence
• Invade: decrease the opponent's influence
• Connect: join two groups
• Conquer: enclose liberties
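For instance, the Attack and Defend features reduce to counting the liberties of a group; a standard flood-fill liberty count might look like this (a generic Go utility sketch, not code from the dissertation):

```python
def liberties(board, r, c):
    """Count the liberties (empty adjacent points) of the group containing (r, c).

    board: 2D list with +1 (black), -1 (white), 0 (empty).
    """
    color = board[r][c]
    if color == 0:
        return 0
    n = len(board)
    group, libs, stack = {(r, c)}, set(), [(r, c)]
    while stack:                                   # flood fill over same-colored stones
        y, x = stack.pop()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if not (0 <= ny < n and 0 <= nx < n):
                continue
            if board[ny][nx] == 0:
                libs.add((ny, nx))                 # empty neighbor = a liberty
            elif board[ny][nx] == color and (ny, nx) not in group:
                group.add((ny, nx))
                stack.append((ny, nx))
    return len(libs)
```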

Page 22

Candidate Moves (2)

Attack: A, B, C, D, E, F => reduce the opponent's liberties (Black is the attacking side).

Defend: N, O, P, G, Q => increase one's own liberties.

No use: M, L, K, J, I, H => removed from the candidate-move list.

(Figure: the possible attack and defense points of one agent in the pattern database; the local board position is matched against a pattern m.)

Page 23

Reference (1)

• English:
• Myriam Z. Abramson, Learning Coordination Strategies Using Reinforcement Learning, dissertation, George Mason University, Fairfax, VA, 2003.

• Shin Ishii, Control of Exploitation-Exploration Meta-Parameter in Reinforcement Learning, Nara Institute of Science and Technology, Neural Networks 15(4-6), pp. 665-687, 2002.

• Simon Haykin, Neural Networks and Learning Machines, 3rd Edition, Chapter 12, Pearson Education.

• Richard S. Sutton, A Convergent O(n) Algorithm for Off-Policy Temporal-Difference Learning with Linear Function Approximation, Reinforcement Learning and Artificial Intelligence Laboratory, Department of Computing Science, University of Alberta.

Page 24

Reference (2)

• Chinese:
• 陳漢鴻, Self-Learning of Computer Chinese Chess (電腦象棋的自我學習), master's thesis, Department of Computer Science and Information Engineering, 國立雲林科技大學 (National Yunlin University of Science and Technology), June 2006.

Page 25

Reference (3)

• Web:
• Reinforcement Learning, http://www.cse.unsw.edu.au/~cs9417ml/RL1/index.html, 2009.12.03

• Cyber Rodent Project, http://www.cns.atr.jp/cnb/crp/, 2009.12.03

• Off-Policy Learning, http://rl.cs.mcgill.ca/Projects/off-policy.html, 2009.12.03

• [MATH] Monte Carlo Method 蒙地卡羅法則 , http://www.wretch.cc/blog/glCheng/3431370, 2009.12.03

• Intelligent agent, http://en.wikipedia.org/wiki/Intelligent_agent, 2009.12.03

• Simple Competitive Learning , http://www.willamette.edu/~gorr/classes/cs449/Unsupervised/competitive.html, 2009.12.12

• Eligibility Traces, http://www.cs.ualberta.ca/~sutton/book/7/node1.html, 2009.12.12

• Tabu search, http://sjchen.im.nuu.edu.tw/Project_Courses/ML/Tabu.pdf, 2009.12.12

• Self Organizing Maps, http://davis.wpi.edu/~matt/courses/soms/, 2009.12.16

• Reinforcement Learning, http://www.informatik.uni-freiburg.de/~ki/teaching/ws0607/advanced/recordings/reinforcement.pdf, 2009.12.25