DESCRIPTION
In this talk I will describe our current project at the Interaction Lab, School of Mathematical and Computer Sciences, Heriot-Watt University, Scotland. Our research focuses on developing spoken interactive systems that can interact with people effectively and adaptively. Such systems often use Reinforcement Learning, a computational model that learns complex behaviours by trial and error. A drawback of such systems is limited scalability, i.e. difficulty coping with large spaces of possibilities and with parallel tasks. I will describe three possible solutions to this problem: the use of prior knowledge, the reuse of learned policies, and flexible interaction. All three approaches will be illustrated with working systems that have been tested with real users. Finally, I will discuss possible directions for future work aimed at deploying Reinforcement Learning systems in real-world (non-experimental) settings.
Hierarchical Reinforcement Learning for Interactive Systems and Robots
Heriberto Cuayáhuitl
Interaction Lab, School of Mathematical & Computer Sciences
Heriot-Watt University, Edinburgh, UK
[email protected]
AINL, Moscow, 12-13 September 2014
Mary Ellen Foster
Simon Keizer
Zhuoran Wang
Srini Janarthanam
Xingkun Liu
Helen Hastie Oliver Lemon
Verena Rieser
Dimitra Gkatzia
Nina Dethlefs Arash Eshghi
Heriberto Cuayahuitl
Ioannis Efstathiou
Katrin Lohan
Wenshuo Tang
Reinforcement Learning Projects
Interactive Learning System/Robot
• Interactive learning machine: an entity that improves its performance by interacting with other machines, its physical world and/or humans.
(Cuayáhuitl, H., et al., 2013, IJCAI-MLIS)
A Motivating Scenario
A robot learning to play multiple games from interaction
Outline
1. Reinforcement Learning (RL)
2. Hierarchical RL
3. Applications
4. Related Work
5. Future Directions
6. Summary
Interaction as a Markov Decision Process (MDP)
● The environment is described as an MDP:
  ● A set of states S;
  ● A set of actions A;
  ● A state transition function T;
  ● A reward function R.
● The MDP solution (the policy, i.e. the interaction manager) decides what to do, and is learned with reinforcement learning
[Figure: interaction as a sequence of choice points, with state transition probabilities Pr(s2|s1,a1)]
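The four MDP components above can be sketched as plain Python data structures. This is a two-state toy example of my own (states/actions are illustrative, not the slides' dialogue domain), with the transition probability Pr(s2|s1,a1) = 0.8 borrowed from the slide's figure:

```python
import random

S = ["s1", "s2"]          # set of states
A = ["a1", "a2"]          # set of actions
T = {                      # T[(s, a)] -> {next_state: probability}
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},   # Pr(s2|s1,a1) = 0.8
    ("s1", "a2"): {"s1": 1.0},
    ("s2", "a1"): {"s2": 1.0},
    ("s2", "a2"): {"s1": 0.5, "s2": 0.5},
}
R = {("s1", "a1"): 0.0, ("s1", "a2"): -1.0,
     ("s2", "a1"): 1.0, ("s2", "a2"): 0.0}  # reward function

def sample_next(s, a):
    """Draw a successor state according to Pr(s'|s,a)."""
    dist = T[(s, a)]
    return random.choices(list(dist), weights=list(dist.values()))[0]
```

A policy then simply maps each state in S to an action in A; learning one is the subject of the next slides.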
Reinforcement Learning is not Trivial
[Plot: state space growth, from 10^0 up to about 10^30 states, against the number of binary state variables (1 to 100)]
Known Issues: Scalability and Partial Observability
The Goal of Reinforcement Learners
The goal is to find an optimal policy:
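The equation on this slide did not survive extraction; the standard statement of the objective, consistent with the MDP notation above (a reconstruction, not necessarily the exact formula shown in the talk), is:

```latex
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right],
\qquad
\pi^{*}(s) = \arg\max_{a \in A} Q^{*}(s, a)
```

where γ ∈ [0, 1) is a discount factor and Q*(s, a) is the optimal action-value function.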
How to Represent the Agent's Policy?
● Tabular representations
● Tree-based representations
● Function approximation
  ● Linear
  ● Non-linear
Reinforcement Learning Algorithms
● Q-Learning
● Q-Learning with Linear Function Approximation
(Sutton & Barto, MIT Press, 1998; Szepesvári, Morgan & Claypool, 2010)
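A minimal tabular Q-learning sketch, to make the update rule concrete. The domain is a toy 5-state corridor of my own (start at state 0, goal at state 4, actions 0=left / 1=right, reward +1 on reaching the goal), not any of the systems in the talk:

```python
import random

random.seed(0)
N, GOAL = 5, 4
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1
Q = [[0.0, 0.0] for _ in range(N)]         # tabular Q-values

def greedy(qs):
    """Argmax with random tie-breaking."""
    m = max(qs)
    return random.choice([a for a in (0, 1) if qs[a] == m])

for _ in range(500):                        # training episodes
    s, steps = 0, 0
    while s != GOAL and steps < 100:
        a = random.randrange(2) if random.random() < EPS else greedy(Q[s])
        s2 = max(0, min(N - 1, s + (1 if a == 1 else -1)))
        r = 1.0 if s2 == GOAL else 0.0
        target = r + (0.0 if s2 == GOAL else GAMMA * max(Q[s2]))
        Q[s][a] += ALPHA * (target - Q[s][a])   # temporal-difference update
        s, steps = s2, steps + 1

policy = [greedy(Q[s]) for s in range(N - 1)]
print(policy)   # the learned policy goes right toward the goal
```

Q-learning with linear function approximation (the second bullet) replaces the table `Q[s][a]` with a weighted sum of state features and applies the same TD error to the weights.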
Illustrative Example: The Interactive Taxi
• State transitions: probability 0.8 of correct navigation/recognition
• Reward: +100 for reaching the goal, 0 otherwise
• Size of state-action space: |S×A| = 50 × 5^4 × 3 × 4 × 16 = 6M state-actions
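The slide's count can be reproduced directly (the individual factors are taken from the slide; their exact variable meanings are abbreviated there):

```python
# 50 grid cells x four 5-valued variables x 3- x 4- x 16-valued variables
state_actions = 50 * 5**4 * 3 * 4 * 16
print(f"{state_actions:,}")   # 6,000,000
```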
Hierarchical Reinforcement Learning
• Why? To learn system behaviours to carry out multiple tasks jointly (not separately)
"I know how to do that, from playing the other game"
Interaction as a Semi-Markov Decision Process (SMDP)
● Environment modelled as an SMDP:
  ● S: set of states
  ● A: set of (complex) actions
  ● T: state transition function
  ● R: reward function
● One SMDP for each task or subtask
● Hierarchical reinforcement learning algorithms to solve SMDPs (e.g. HSMQ, MAXQ)
[Diagram: a set of tasks (Task 1 … Task N), each decomposed into sub-tasks, with one SMDP per task or sub-task]
Conceptual SMDP for Interactive Systems
The goal is to find an optimal policy for each task and sub-task in the hierarchy.
Benefits: quicker learning, more scalability, behaviour reuse
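The slide's equation is likewise missing from the extraction; in the hierarchical setting the goal is usually written as one optimal policy per SMDP in the hierarchy (the i, j indexing over hierarchy levels and tasks is an assumption, following the style of the cited papers):

```latex
\pi_{ij}^{*}(s) = \arg\max_{a \in A_{ij}} Q_{ij}^{*}(s, a)
\quad \text{for each SMDP model } M_{ij}
```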
Hierarchical Reinforcement Learning Algorithms
● HSMQ-Learning
● HSMQ-Learning with Linear Function Approximation
● Other HRL algorithms: MAXQ, HAMQ
● Algorithms for structure learning: HEXQ, VISA, HI-MAT
(Barto & Mahadevan, 2003; Hengst, 2010)
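To illustrate the HSMQ idea, here is a toy sketch of my own construction (not the authors' implementation): a 7-state corridor where the root task must reach state 6 by invoking two subtasks, "mid" (terminates at state 3) and "goal" (terminates at state 6). Each (sub)task learns its own Q-table; subtasks learn from internal pseudo-rewards, while the root is updated, HSMQ-style, with the discounted real return accumulated while the chosen subtask ran:

```python
import random

random.seed(1)
GAMMA, ALPHA, EPS = 0.95, 0.5, 0.2
PRIMITIVES = (-1, +1)                      # left / right
SUBTASKS = {"mid": 3, "goal": 6}           # subtask name -> terminal state
NAMES = list(SUBTASKS)
q_sub = {name: {} for name in SUBTASKS}    # per-subtask Q-tables
q_root = {}                                # root Q over subtask choices

def step(s, a):
    return max(0, min(6, s + a))

def applicable(s):                          # subtasks not already terminated at s
    return [n for n in NAMES if SUBTASKS[n] != s]

def choose(q, s, actions):                  # epsilon-greedy
    if random.random() < EPS or s not in q:
        return random.choice(actions)
    return max(actions, key=lambda a: q[s].get(a, 0.0))

def run_subtask(name, s):
    """Q-learn inside one subtask; return (s', discounted real return, steps)."""
    goal, q = SUBTASKS[name], q_sub[name]
    ret, disc, steps = 0.0, 1.0, 0
    while s != goal and steps < 50:
        a = choose(q, s, PRIMITIVES)
        s2 = step(s, a)
        r_env = 10.0 if s2 == 6 else -0.1           # real (root-level) reward
        r_pseudo = 1.0 if s2 == goal else -0.1      # subtask pseudo-reward
        best2 = 0.0 if s2 == goal else max(
            q.get(s2, {}).get(b, 0.0) for b in PRIMITIVES)
        qs = q.setdefault(s, {})
        qs[a] = qs.get(a, 0.0) + ALPHA * (r_pseudo + GAMMA * best2 - qs.get(a, 0.0))
        ret += disc * r_env
        disc *= GAMMA
        s, steps = s2, steps + 1
    return s, ret, steps

for _ in range(2000):                       # training episodes
    s = random.randint(0, 5)
    for _ in range(10):                     # a few root decisions per episode
        if s == 6:
            break
        t = choose(q_root, s, applicable(s))
        s2, ret, k = run_subtask(t, s)
        best2 = 0.0 if s2 == 6 else max(
            q_root.get(s2, {}).get(n, 0.0) for n in applicable(s2))
        qr = q_root.setdefault(s, {})
        qr[t] = qr.get(t, 0.0) + ALPHA * (ret + (GAMMA ** k) * best2 - qr.get(t, 0.0))
        s = s2
```

The key HSMQ ingredient is the root update: the chosen subtask runs to completion over k primitive steps, and the root backs up the return it accumulated, discounted by γ^k. MAXQ and HAMQ differ in how they decompose the value function and constrain the policies, respectively.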
Illustrative Example: The Interactive Taxi
• State transitions: probability 0.8 of correct navigation/recognition
• Reward: +100 for reaching the goal, 0 otherwise
• State-action space: |S×A| = 10.7K state-actions (vs. 6M for the flat model)
Speech-Based Human-Machine Communication
HRL Agents
Application 1: Travel Planning
● HRL without prior knowledge (HSMQ-Learning)
● HRL with prior knowledge (HAM+HSMQ-Learning)
● Training with simulated interactions
● Testing with real users
(Cuayáhuitl et al., Computer Speech & Language, 2010)
W = joint state (SMDP + HAM)
Travel Planning Spoken Dialogue System
(Cuayáhuitl et al., Computer Speech & Language, 2010)
Results in the Travel Planning Domain
• HRL finds solutions faster than flat learning
• HRL is more scalable than flat learning
• Learnt policies outperform hand-coded ones
(Cuayáhuitl et al., Computer Speech & Language, 2010)
Application 2: Indoor Wayfinding
● HRL without policy reuse (HSMQ-Learning)
● HRL with policy reuse (HSMQ_PR-Learning)
  ● Detect situations where the system knows how to act
  ● Action selection using an optimal policy (if reuse=true) or an exploratory policy (if reuse=false)
● Training with simulated interactions
● Testing with real users
(Cuayáhuitl & Dethlefs, ACM Trans. Speech & Lang. Proc., 2011)
Indoor Wayfinding Dialogue System
(Cuayáhuitl & Dethlefs, ACM Trans. Speech & Lang. Proc., 2011)
Infokiosk & mobile phone interfaces
Results in the Indoor Wayfinding Domain
• Policy reuse finds solutions faster than without it
• Adaptive route instructions are more efficient
(Cuayáhuitl & Dethlefs, ACM Trans. Speech & Lang. Proc., 2011)
Application 3: Human-Robot Interaction
● HSMQ vs. FlexHSMQ Learning with linear function approximation
● Training with simulated interactions
● Testing with real users
(Cuayáhuitl et al., ACM Trans. Interactive Intelligent Sys., 2014)
Robot Dialogue System (Quiz Game)
Interaction Manager
(Cuayáhuitl et al., ACM Trans. Interactive Intelligent Sys., 2014)
Results in the Quiz Domain
• Non-strict HRL leads to more natural interactions
• Non-strict HRL is preferred by human users
(Cuayáhuitl et al., ACM Trans. Interactive Intelligent Sys., 2014)
Robot Asking and Answering Questions
(Belpaeme et al., 2012, Intl. Journal of HRI)
Learning with Large State Spaces
Learning under Uncertainty
Spectrum of Markov Process Models
Promising for multi-task learning systems
(Mahadevan, S. et al., 2004, Handbook of Learning and Approx. Dyn. Prog.)
Issues that Might Lead to Future Interactive Learning Systems
1. Big effort to make the system perform similar tasks
2. Simulations may not represent the real world
3. It is often hard to specify the reward function
4. The real world is partially known and dynamic
5. Poor spatial cognition will affect real-world impact
6. Small vocabularies discourage talking to machines
7. Lack of interactive learning systems in the real world
Towards Autonomous Interactive Systems and Robots
[Diagram: degree of autonomy vs. amount of tasks. Current interactive systems require a lot of human intervention; future interactive systems should be more autonomous. How do we get here?]
A holistic perspective on language, vision and robotics
Summary
• Machines can be programmed to behave just as expected, but the physical world and humans demand systems that can learn
• Hierarchical learning plays an important role for multi-tasked interactive systems and robots
• More autonomy is needed if systems are to learn new skills with little human intervention
• A holistic, interdisciplinary perspective is needed for intelligent interactive robots
References
• Cuayáhuitl, H., Dethlefs, N., Kruijff-Korbayová, I. (2014). Non-Strict Hierarchical Reinforcement Learning for Interactive Systems and Robots. To appear in ACM Transactions on Interactive Intelligent Systems, vol. 4, no. 3.
• Cuayáhuitl, H., Dethlefs, N. (2011). Spatially-Aware Dialogue Control Using Hierarchical Reinforcement Learning. ACM Transactions on Speech and Language Processing, vol. 7, no. 3, pp. 5:1-5:26.
• Cuayáhuitl, H., Renals, S., Lemon, O., Shimodaira, H. (2010). Evaluation of a Hierarchical Reinforcement Learning Spoken Dialogue System. Computer Speech and Language, vol. 24, no. 2, pp. 395-429.
E-Mail: [email protected]