Learning Coordination Strategies Using Reinforcement Learning -- Myriam Z. Abramson, dissertation, 2003. Presented by 張景照 (dorgon chang), 2012-06-14.




Page 1

Learning Coordination strategies using reinforcement learning

-- Myriam Z. Abramson, dissertation, 2003

張景照 (dorgon chang)

2012-06-14

Page 2

Index

• Coordination problem (the problem to be solved)
• Evaluation of Go
• Reinforcement learning
• Temporal Difference learning (using Sarsa)
• Learning Vector Quantization (LVQ)
• Sarsa LVQ (SLVQ) <= the method proposed by the author

Page 3

Coordination problem

• Simply put, the coordination strategy problem is an action selection problem.

• When we only know the local situation, how do we choose a correct action, without relying on the end-game state, so that it can be combined with the other actions?
• How do local tactics affect the overall strategy?

Page 4

Evaluation of Go

This method conveys the spatial connectivity between the stones.

ε is a user-defined threshold; when a point's influence exceeds ε, the influence keeps spreading outward.

Summing all the numbers on the board gives an evaluation value for the position.

Black stones spread +1 outward; white stones spread -1.

This evaluation value is used as the reward in the methods that follow. (A small sketch of such an influence evaluation is given below.)
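A minimal sketch of this kind of influence evaluation, under the assumption that each stone's influence halves at every step outward and stops spreading once its magnitude falls below ε (the exact decay rule is an assumption for illustration, not the dissertation's specification):

```python
from collections import deque

def evaluate_board(board, eps=0.1):
    """Influence-style board evaluation sketch.

    board: 2D list with +1 (black stone), -1 (white stone), 0 (empty).
    Each stone radiates its value outward; the influence halves per step
    (assumed decay) and stops spreading once below eps. The evaluation is
    the sum of the influence over all points (positive favors black).
    """
    n = len(board)
    influence = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            if board[r][c] == 0:
                continue
            seen = {(r, c)}
            frontier = deque([(r, c, float(board[r][c]))])
            while frontier:                               # breadth-first spread
                y, x, val = frontier.popleft()
                influence[y][x] += val
                nxt = val / 2.0                           # assumed halving per step
                if abs(nxt) <= eps:                       # stop once below threshold
                    continue
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < n and 0 <= nx < n and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        frontier.append((ny, nx, nxt))
    return sum(sum(row) for row in influence)
```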

Page 5

Reinforcement learning: Introduction

• The goal of machine learning is to produce agents; RL is one such method, characterized by trial-and-error search and delayed reward.

(Figure annotations, the RL loop applied to Go: the action is playing the next stone; the reward is, e.g., a win, a loss, or a drawn position; the agent predicts the board several moves ahead.)

Page 6

Reinforcement learning: Value Function

• π = the policy the agent uses to select actions.
• s = the current state.
• $V^{\pi}(s)$: the expected reward obtained from state s under policy π.
• $Q^{\pi}(s, a)$: the expected reward obtained by taking action a in state s under policy π.
• The most common policy is ε-greedy (others include greedy, ε-soft, softmax, ...).

• ε lies between 0 and 1; the higher its value, the more exploration is encouraged (exploration vs. exploitation).

• ε-greedy: most of the time choose the action with the highest estimated reward; with a small probability ε choose an action at random. (A minimal sketch follows.)
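A minimal ε-greedy selection sketch over a tabular Q estimate (the dictionary layout and tie-breaking are illustrative assumptions):

```python
import random

def epsilon_greedy(q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit
    (pick the action with the highest estimated value for this state)."""
    if random.random() < epsilon:
        return random.choice(actions)                                # exploration
    return max(actions, key=lambda a: q.get((state, a), 0.0))        # exploitation
```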

Page 7

Temporal Difference learning

• TD learning is the method used to estimate the value function in RL.

DP: the current estimate is based on previously learned estimates (bootstrapping).
MC: play out random games and use the statistics of their outcomes to handle situations that may arise later.
The TD method combines the two:
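For concreteness, a one-step tabular TD(0) update, which bootstraps like DP but learns from sampled experience like MC (variable names and defaults are illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    v_s = V.get(s, 0.0)
    target = r + gamma * V.get(s_next, 0.0)   # sampled reward + learned estimate
    V[s] = v_s + alpha * (target - v_s)
```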

Page 8

Temporal Difference learning: Forward View of TD(λ) (1)

• Monte Carlo: observes the rewards of all steps in an episode.

• TD(0): observes one step only; the two-step return observes two steps.

• TD(λ) is a method for averaging over all the n-step returns:

$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
$R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$
$R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$

Value update: $\Delta V_t(s_t) = \alpha \big[ R_t^{\lambda} - V_t(s_t) \big]$

Set λ = 0 to get TD(0); set λ = 1 to get Monte Carlo.

r = the reward at a given time step; γ = the discount rate applied to future rewards.
$R_t^{(n)}$ = the total reward observed n steps ahead from time t (up to the end of the game at T); it is a scalar.

Page 9

Temporal Difference learning: Forward View of TD(λ) (2)

(Same n-step returns, λ-return, and value update as the previous slide.)

Substituting λ = 0, only the one-step return keeps any weight:
$R_t^{\lambda} = R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
which is exactly the TD(0) target.

Page 10

Temporal Difference learning: Forward View of TD(λ) (3)

(Same n-step returns, λ-return, and value update as the previous slide.)

Substituting λ = 1, all of the weight moves to the complete return:
$R_t^{\lambda} = R_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{T-t-1} r_T$
which is exactly the Monte Carlo target.

Page 11

Temporal Difference learning: Forward View of TD(λ) (4)

T is the total number of steps in a game; t is the index of the current step within that game.

Worked example with λ = 0.5, t = 0, T = 3 (figure: states S0 → S1 → S2 → S3):
w1 = (1-λ) = 0.5, the weight on the one-step return $R_t^{(1)} = r_{t+1} + \gamma V(s_{t+1})$
w2 = (1-λ)λ = 0.25, the weight on the two-step return
w3 = λ^{T-t-1} = 0.25, the weight on the complete return up to the terminal state
Normalization ensures the weights sum to 1: $\sum_n w_n = 1$.
(A short sketch reproducing these weights follows.)
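A small sketch that reproduces these weights for the finite-episode λ-return; the formula follows the standard Sutton & Barto treatment, and the values λ = 0.5, t = 0, T = 3 are the ones from the slide:

```python
def lambda_return_weights(lam, t, T):
    """Weights the lambda-return places on the n-step returns R_t^(n).

    The n-step returns for n = 1 .. T-t-1 get weight (1-lam) * lam**(n-1);
    the complete return up to the terminal state gets the remaining weight
    lam**(T-t-1), so all weights sum to 1.
    """
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, T - t)]
    weights.append(lam ** (T - t - 1))        # weight of the complete return
    return weights

print(lambda_return_weights(0.5, 0, 3))       # [0.5, 0.25, 0.25]
```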

Page 12

Temporal Difference learning: Forward View of TD(λ) (5)

The weight λ^{T-t-1} placed on the final (complete) return equals the sum of all the weights that would have come after termination.

The lower λ, the faster the weights fall off, so the near-term returns dominate; the higher λ, the slower the weights fall off, so the later (longer) returns carry more weight.

To summarize the role of λ:
1. It bridges the TD and MC methods.
2. It determines how an action with no immediate effect is punished or rewarded.

(Figure: weight-decay curves and eligibility traces. If λ = 0.1, then 1-λ = 0.9; if λ = 0.9, then 1-λ = 0.1. Also shown: the result of setting λ = 0.5, t = 0, T = 3.)
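A quick check of this decay behavior, printing the first few n-step weights (1-λ)λ^(n-1) for a low and a high λ:

```python
for lam in (0.1, 0.9):
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 6)]
    print(lam, [round(w, 4) for w in weights])
# 0.1 -> [0.9, 0.09, 0.009, 0.0009, 0.0001]  (falls off fast: near-term returns dominate)
# 0.9 -> [0.1, 0.09, 0.081, 0.0729, 0.0656]  (falls off slowly: later returns matter more)
```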

Page 13

Temporal Difference learning: Backward View of TD(λ) (1)

• Eligibility traces (recursive definition):
$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$

• Reinforcing event (TD error):
$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$

• Value update (for all states s):
$\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$

Non-recursive definition of the trace:
$e_t(s) = \sum_{k=0}^{t} (\gamma\lambda)^{t-k} I_{s s_k}$, where $I_{s s_k} = 1$ if $s = s_k$ and 0 otherwise.

The reinforcing events are used to propagate the update backwards, one step at a time. (A tabular sketch of this update follows.)
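A tabular backward-view TD(λ) step with accumulating traces, following the equations above (data structures and default parameters are illustrative assumptions):

```python
def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One backward-view TD(lambda) step with accumulating eligibility traces.

    V: dict state -> value estimate, e: dict state -> eligibility trace.
    """
    e[s] = e.get(s, 0.0) + 1.0                               # bump trace of the visited state
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # reinforcing event (TD error)
    for state, trace in list(e.items()):
        V[state] = V.get(state, 0.0) + alpha * delta * trace # update every traced state
        e[state] = gamma * lam * trace                       # then decay its trace
```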

Page 14

Temporal Difference learning: Backward View of TD(λ) (2)

(Same eligibility-trace, reinforcing-event, and value-update equations as the previous slide, including the non-recursive trace definition.)

Setting λ = 0, every trace except the current state's drops to zero, so the update reduces to TD(0), whose target is the one-step return $R_t^{(1)} = r_{t+1} + \gamma V(s_{t+1})$.

Page 15

Temporal Difference learning: Why the Backward View?

• Forward view
– theoretical view: conceptually easier to understand
– not directly implementable: the information it needs must still be obtained by looking ahead (simulation)

• Backward view
– mechanistic view: easier to implement
– simple conceptually and computationally
– in the offline case it achieves the same result as the forward view (provable)

Page 16

Temporal Difference learning : Equivalence of the Forward and Backward Views

$\sum_{t=0}^{T-1} \Delta V_t^{b}(s) = \sum_{t=0}^{T-1} \Delta V_t^{f}(s_t)\, I_{s s_t}$,
where $I_{s s_t} = 1$ if $s = s_t$ and 0 otherwise.

Ref: Section 7.4, Equivalence of the Forward and Backward Views, http://www.cs.ualberta.ca/~sutton/book/7/node1.html (proof that the two are equal in the offline case).

The summed value updates of the backward view (left-hand side) and of the forward view (right-hand side) are equal. (Figure: the sum of the forward-view weights when λ = 1, i.e., MC, and T = 3.)

Page 17

Temporal Difference learning: the Sarsa algorithm

(Sarsa pseudocode figure, with annotations marking the behavior policy and the estimation policy; the outer loop runs once for each game, the inner loop once for each stone played.)

How many steps ahead R_t looks when updating depends on the variant used, e.g., Sarsa(λ); in the one-step case, $R_t^{(1)} = r_{t+1} + \gamma V(s_{t+1})$. (A minimal tabular Sarsa loop is sketched below.)
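A minimal tabular one-step Sarsa episode loop; env.reset()/env.step() are an assumed generic environment interface and the names are hypothetical, not the dissertation's code:

```python
import random

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
    """One episode of tabular Sarsa: pick a' with the same policy being
    estimated, then move Q(s, a) toward r + gamma * Q(s', a')."""
    def policy(s):  # epsilon-greedy over the current Q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))

    s = env.reset()
    a = policy(s)
    done = False
    while not done:                                # one iteration per stone played
        s_next, r, done = env.step(a)
        a_next = policy(s_next)
        target = r if done else r + gamma * Q.get((s_next, a_next), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        s, a = s_next, a_next
```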

Page 18

Learning Vector Quantization

• Main purpose: data compression.
• Basic idea: represent the whole input sample space with a smaller number of clusters, i.e., find a prototype (representative point) for each class.

VQ: suitable for data without class information. LVQ: suitable for data with class information.

(Figure: M = 3 prototypes; O = prototype vector, + = input data; prototypes m1, m2, m3 each represent one region of the input space.)
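A sketch of the basic LVQ1 update rule: find the nearest prototype, pull it toward the sample when the class labels match, and push it away otherwise (the numpy layout and learning rate are assumptions for illustration):

```python
import numpy as np

def lvq1_step(prototypes, proto_labels, x, label, lr=0.05):
    """One LVQ1 update: the winning (nearest) prototype moves toward x if its
    class matches the sample's label, and away from x otherwise."""
    dists = np.linalg.norm(prototypes - x, axis=1)   # geometric (Euclidean) distance
    w = int(np.argmin(dists))                        # index of the winner
    direction = 1.0 if proto_labels[w] == label else -1.0
    prototypes[w] += direction * lr * (x - prototypes[w])
    return w
```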

Page 19

SLVQ: architecture (1)

(Figure: an idea of what a SOM looks like; the marked points are the prototypes, initially scattered at random over the board.)

Create n agents = the pattern database.

The SOM algorithm can be used to decide dynamically how many prototypes M are needed, i.e., the number of patterns can grow and shrink dynamically.

The agent records the values of the state/action pairs it has tried; through the LVQ algorithm the mapping Q(s, a) => Q(m, a) is used, so the size of the state space is compressed dramatically. (A sketch of this prototype-indexed lookup follows.)
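A sketch of the state-space compression idea: the Q-table is indexed by the nearest prototype m rather than by the raw board state. How the dissertation encodes boards and measures similarity is not spelled out here, so the vector encoding and Euclidean distance are assumptions:

```python
import numpy as np

def nearest_prototype(prototypes, state_vec):
    """Map a (vectorized) board state onto its closest prototype index m."""
    dists = np.linalg.norm(prototypes - state_vec, axis=1)
    return int(np.argmin(dists))

def q_lookup(Q, prototypes, state_vec, action):
    """Q(s, a) => Q(m, a): the Q-table is keyed by prototype, not raw state."""
    m = nearest_prototype(prototypes, state_vec)
    return Q.get((m, action), 0.0)
```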

Page 20

SLVQ: architecture (2)

(Figure, with M = 3 prototypes m1, m2, m3: the prototype weights are initialized at random; at the end of each game the prototypes are updated using LVQ.)

The more games of training, the more representative the prototypes become, and they gradually converge.

When updating a prototype, a similarity computation (geometric distance) is used to find the matching pattern. Ref: S. Santini and R. Jain, Similarity Measures, IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9), 1999.

The update itself uses the backward view.

Page 21

Candidate Moves (1)

• Empirically, a move is better if it serves multiple purposes. The following are move features in Go (see the liberty-count sketch after this list):
• Attack: reduce the opponent's liberties
• Defend: increase one's own liberties
• Claim: increase one's own influence
• Invade: decrease the opponent's influence
• Connect: join two groups
• Conquer: enclose liberties
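For instance, the Attack and Defend features reduce to counting the liberties of a group; a standard flood-fill liberty count might look like this (a generic Go utility sketch, not code from the dissertation):

```python
def liberties(board, r, c):
    """Count the liberties (empty adjacent points) of the group containing (r, c).

    board: 2D list with +1 (black), -1 (white), 0 (empty).
    """
    color = board[r][c]
    if color == 0:
        return 0
    n = len(board)
    group, libs, stack = {(r, c)}, set(), [(r, c)]
    while stack:                                   # flood fill over same-colored stones
        y, x = stack.pop()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if not (0 <= ny < n and 0 <= nx < n):
                continue
            if board[ny][nx] == 0:
                libs.add((ny, nx))                 # empty neighbor = a liberty
            elif board[ny][nx] == color and (ny, nx) not in group:
                group.add((ny, nx))
                stack.append((ny, nx))
    return len(libs)
```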

Page 22

Candidate Moves (2)

Attack: A, B, C, D, E, F => reduce the opponent's liberties (Black is the attacking side).

Defend: N, O, P, G, Q => increase one's own liberties.

No use: M, L, K, J, I, H => removed from the candidate-move list.

(Figure: the possible attack and defense points of one agent in the pattern database; the local board position is matched against a pattern m.)

Page 23

Reference (1)

• English:
• Myriam Z. Abramson, Learning Coordination Strategies Using Reinforcement Learning, dissertation, George Mason University, Fairfax, VA, 2003.

• Shin Ishii, Control of Exploitation-Exploration Meta-Parameter in Reinforcement Learning, Nara Institute of Science and Technology, Neural Networks 15(4-6), pp. 665-687, 2002.

• Simon Haykin, Neural Networks and Learning Machines, 3rd Edition, Chapter 12, Pearson Education.

• Richard S. Sutton, A Convergent O(n) Algorithm for Off-Policy Temporal-Difference Learning with Linear Function Approximation, Reinforcement Learning and Artificial Intelligence Laboratory, Department of Computing Science, University of Alberta.

Page 24

Reference (2)

• Chinese:
• 陳漢鴻, Self-Learning of Computer Chinese Chess (電腦象棋的自我學習), master's thesis, Department of Computer Science and Information Engineering, 國立雲林科技大學 (National Yunlin University of Science and Technology), June 2006.

Page 25

Reference (3)

• Web:
• Reinforcement Learning, http://www.cse.unsw.edu.au/~cs9417ml/RL1/index.html, 2009.12.03

• Cyber Rodent Project, http://www.cns.atr.jp/cnb/crp/, 2009.12.03

• Off-Policy Learning, http://rl.cs.mcgill.ca/Projects/off-policy.html, 2009.12.03

• [MATH] Monte Carlo Method 蒙地卡羅法則 , http://www.wretch.cc/blog/glCheng/3431370, 2009.12.03

• Intelligent agent, http://en.wikipedia.org/wiki/Intelligent_agent, 2009.12.03

• Simple Competitive Learning , http://www.willamette.edu/~gorr/classes/cs449/Unsupervised/competitive.html, 2009.12.12

• Eligibility Traces, http://www.cs.ualberta.ca/~sutton/book/7/node1.html, 2009.12.12

• Tabu search, http://sjchen.im.nuu.edu.tw/Project_Courses/ML/Tabu.pdf, 2009.12.12

• Self Organizing Maps, http://davis.wpi.edu/~matt/courses/soms/, 2009.12.16

• Reinforcement Learning, http://www.informatik.uni-freiburg.de/~ki/teaching/ws0607/advanced/recordings/reinforcement.pdf, 2009.12.25