Chapter 13: Reinforcement Learning

Chapter 13: Reinforcement Learning

INTRODUCTION (1)

agent, state, actions, policy 주제

이러한 agents 가 그들이 처한 환경에서 행동함으로써 어떻게 성공적인 제어 정책을 학습할 수 있는가

agent 의 목표는 reward 함수에 의하여 정의 됨 제어 정책

어떤 초기 상태로부터 최대의 누적 보상이 얻어지는 행동을 선택

예 : manufacturing optimization problems, sequential scheduling problems

INTRODUCTION (2)

INTRODUCTION (3)

Function approximation problems : S A, a = (s)Delayed reward

• training set 의 형태가 <s, (s)> 가 아니고 행동의 sequence 에 대한 reward 이므로 temporal credit assignment 문제 발생

Exploration• 모르는 states 와 actions 의 exploration ( 새 정보 획득 )• 이미 학습한 states 와 actions 의 exploitation ( 최대의 누적

reward)

Partially observable states• 실제로 환경에 대한 전체 정보를 알 수 없으므로 , 행동을

선택함에 있어 전 단계에서 관찰된 것도 고려해야 함Life-long learning

• 몇 개의 관련된 작업도 학습하는 것이 요구 됨

THE LEARNING TASK(1)

The problem of learning sequential control strategiesagent’s action : deterministic nondeterministictrained : expert itself

Markov decision process(MDP)agent 는 S 를 인식할 수 있고 , A 를 가지고 있다 .st+1 = (st , at), rt = r(st , at)

와 r 은 환경에 따르며 , agent 가 알 필요가 없다 .(st , at), r(st , at) 은 현재 state 와 action 에만

의존하며 , 이전 states 나 actions 과는 상관 없다 .


Task of the agentlearn a policy, : S A, (st) = at

discounted cumulative reward

optimal policy

02

21)(

iit

itttt rrrrsV

)( ),(maxarg* ssV

)()( **

sVsV


Q LEARNING (1)

Training example 의 형태가 <s, a> 가 아니고 r(si, ai) 이므로 직접적으로 : S A 를 학습하기는 어렵다 .

Evaluation function

와 r 이 완벽하게 알려져 있을 때만 사용 가능실제 문제에서는 이러한 함수에 대한 결과의 정확한

예측이 불가능 ( 예 : robot control)

))],((),([maxarg)( ** asVasrsa

Q LEARNING (2)

The Q Function evaluation function

optimal action

와 r 을 모르는 경우에라도 optimal action 을 선택할 수 있다 .

)),((),(),( * asVasrasQ

),( maxarg)(* asQsa

Q LEARNING (3)

An Algorithm for Learning Q iterative approximation

training rule

),( max)(* asQsVa

)),,(( max),(),( aasQasrasQa

),(ˆ max),(ˆ asQrasQa

Q LEARNING (4)

Q learning algorithmFor each s,a initialize the table entry to zero

Observe the current state s

Do forever:

•Select an action a and execute it

•Receive immediate reward r

•Observe the new state s’

•Update the table entry for as follows:

• s s’

),(ˆ asQ

),(ˆ asQ

),(ˆ max),(ˆ asQrasQa

Q LEARNING (5)

An Illustrative Example

두 가지 특성90

}100,81,63max{9.00

),(ˆ max),(ˆ21

asQrasQa

right

),(ˆ),(ˆ ),,( 1 asQasQnas nn

),(),(ˆ0 ),,( asQasQnas n

Q LEARNING (6)

Convergence 수렴 조건

The system is a deterministic MDP.The immediate reward values are bounded.

The agent selects actions in such a fashion that it visits every possible state-action pair infinitely often.

Theorem converges to as n , for all s, a.

casras ),( ),(

),(ˆ asQn ),( asQ

Q LEARNING (7)

Proof.

|),(),(ˆ|max,

asQasQnas

n

|),(ˆ),(ˆ|max

|),(ˆ),(ˆ|max

|),(ˆmax),(ˆmax|

|)),(ˆmax()),(ˆmax(| ),(),(ˆ

,

1

asQasQ

asQasQ

asQasQ

asQrasQrasQasQ

nas

na

an

a

an

an

nn asQasQ |),(),(ˆ| 1

01 |),(),(ˆ| n

n asQasQ

Q LEARNING (8)

Experimentation Strategies agent 가 action 을 선택하는 방법

를 최대로 하는 action a 를 선택• 다른 action 을 이용하지 않게 됨

확률적인 방법

• k 가 크면 exploit, 작으면 explore

• 반복 회수에 따라 k 를 변화 시킬 수도 있음

),(ˆ asQ

j

asQ

asQ

ij

i

k

ksaP

),(ˆ

),(ˆ

)|(

Q LEARNING (9)

Updating Sequence training example 의 순서를 바꿈에 의하여

training 효율을 향상 시킬 수 있다 .역순으로 update 목표 지점에 도달하는

경로상의 모든 transition 에 대하여 한번에 update 가능 ( 추가의 저장 공간 필요 )

이전의 state-action transition 과 reward 를 저장해 둔 후 , 주기적으로 retrain 한다 .

NONDETERMINISTIC REWARDS AND ACTIONS (1)

(s, a) 와 r(s, a) 은 확률 분포에 의하여 결정 nondeterministic MDP

s, a 에만 의존하고 이전 s, a 에는 무관expected value

0

)(i

iti

t rEsV

s

sVassPasrE

asVEasrE

asVasrEasQ

)(),|()],([

))],(([)],([

))],((),([),(

*

*

*

s

aasassPasrEasQ ),(max),|()],([),(

NONDETERMINISTIC REWARDS AND ACTIONS (2)

Training rule

Theorem

converges to as n , for all s, a.

)],(ˆ max[),(ˆ)1(),(ˆ11 asQrasQasQ n

annnn

),(1

1

asvisitsnn

1

2),,(

1),,( ][ ,

iasin

iasin

),(ˆ asQn ),( asQ

TEMPORAL DIFFERENCE LEARNING

Q learning is a special case of a general class of temporal difference algorithms.

TD() by Sutton (1988)

),(ˆ max),( 1)1( asQrasQ t

attt

),(ˆ max),( 22

1)2( asQrrasQ t

atttt

),(ˆ max),( 1)1(

1)( asQrrrasQ nt

a

nnt

ntttt

n

),(),(),()1(),( )3(2)2()1(tttttttt asQasQasQasQ

)],(),(ˆmax)1[(),( 11 tttta

ttt asQasQrasQ

GENERALIZING FROM EXAMPLES

Target function 이 명확한 lookup table 로 표현된다는 가정과 모든 가능한 state-action pair 가 방문 되어야 한다는 가정은 비현실적이다 . ( 무한 공간이나 , action 의 수행 비용이 큰 경우 ) 다른 방법과의 통합이 필요하다 .Lookup table 대신 neural network 을 사용하여 Q

learning algorithm 에 BACK PROPAGATION 과 같은 function approximation algorithm 을 통합한다 .

RELATIONSHIP TO DYNAMIC PROGRAMMING

오래 전부터 연구되어 온 dynamic programming 과 밀접한 관련이 있다 .완벽한 배경 지식계산을 최소화하는 것이 가장 큰 목표직접적 상호 작용이 없는 내부적 simulation

(offline)Bellman’s equation

)))](,(())(,([)()( ** ssVssrEsVSs

SUMMARY

Reinforcement learning addresses the problem of learning control strategies for autonomous agents.

The reinforcement learning algorithms addressed in this chapter fit a problem setting known as a Markov decision process.

Q learning is one form of reinforcement learning in which the agent learns an evaluation function over states and actions.

Q learning can be proven to converge to the correct Q function under certain assumptions.

Q learning is a member of a more general class of algorithms, called temporal difference algorithms.

Reinforcement learning is closely related to dynamic programming approaches to Markov decision processes.

Documents

Chapter 13: Reinforcement Learning