Heriberto Cuayáhuitl, "Adaptive Learning with...


DESCRIPTION

In my talk I will describe our current project at the Interaction Lab, School of Mathematical and Computer Sciences, Heriot-Watt University, Scotland. Our research focuses on developing a voice-based interactive system that can interact with people effectively and adaptively. Such systems often use Reinforcement Learning, a computational model that learns complex behaviours by trial and error. A drawback of such systems is limited scalability, i.e. difficulty in handling large spaces of possibilities and parallel tasks. I will describe three possible solutions to this problem: the use of prior knowledge, the reuse of learnt policies, and flexible interaction. All three approaches will be illustrated with working systems that have been tested with real users. Finally, I will discuss possible directions for future work aimed at deploying Reinforcement Learning systems in real-world (non-experimental) settings.


Hierarchical Reinforcement Learning for Interactive Systems and Robots

Heriberto Cuayáhuitl
Interaction Lab
School of Mathematical & Computer Sciences
Heriot-Watt University, Edinburgh, UK
hc213@hw.ac.uk

AINL, Moscow, 12-13 September 2014

Mary Ellen Foster

Simon Keizer

Zhuoran Wang

Srini Janarthanam

Xingkun Liu

Helen Hastie

Oliver Lemon

Verena Rieser

Dimitra Gkatzia

Nina Dethlefs

Arash Eshghi


Heriberto Cuayáhuitl

Ioannis Efstathiou

Katrin Lohan

Wenshuo Tang


Reinforcement Learning Projects

Interactive Learning System/Robot

• Interactive learning machine: an entity which improves its performance through interacting with other machines, its physical world and/or humans.

(Cuayáhuitl, H., et al., 2013, IJCAI-MLIS)

A Motivating Scenario

A robot learning to play multiple games from interaction

Outline

1. Reinforcement Learning (RL)
2. Hierarchical RL
3. Applications
4. Related Work
5. Future Directions: Interactive Learning Systems
6. Summary


Interaction as a Markov Decision Process (MDP)

● The environment is described as an MDP:
  ● A set of states S;
  ● A set of actions A;
  ● A state transition function T;
  ● A reward function R.

● The MDP solution (policy or interaction manager) decides what to do using reinforcement learning

(Figure: interaction as a sequence of choice points, with transition probabilities such as Pr(s2|s1,a1))
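The MDP tuple above can be sketched as a small data structure. The tiny two-state, two-action interaction below is purely illustrative; all state and action names and probabilities are invented for the example:

```python
from dataclasses import dataclass

# Minimal MDP sketch: states S, actions A, transition function T, reward R.
@dataclass
class MDP:
    states: list
    actions: list
    T: dict  # (state, action) -> {next_state: probability}
    R: dict  # (state, action) -> reward

mdp = MDP(
    states=["asking", "done"],
    actions=["ask", "confirm"],
    T={("asking", "ask"): {"asking": 0.2, "done": 0.8},
       ("asking", "confirm"): {"asking": 1.0}},
    R={("asking", "ask"): 0.0, ("asking", "confirm"): -1.0},
)

# Sanity check: outgoing transition probabilities sum to one.
assert all(abs(sum(p.values()) - 1.0) < 1e-9 for p in mdp.T.values())
```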

Reinforcement Learning is not Trivial

(Figure: state-space growth. The number of states grows exponentially with the number of binary state variables, reaching about 10^30 states at 100 variables.)

Known Issues: Scalability and Partial Observability

The Goal of Reinforcement Learners

The goal is to find an optimal policy, which selects in each state the action with the highest expected long-term reward: π*(s) = arg max_a Q*(s, a)

How to Represent the Agent's Policy?

● Tabular representations

● Tree-based representations

● Function approximation
  ● Linear
  ● Non-linear

Reinforcement Learning Algorithms

● Q-Learning

● Q-Learning with Linear Function Approximation

(Sutton & Barto, MIT Press, 1998; Szepesvári, Morgan & Claypool Pub., 2010)
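As a concrete sketch of tabular Q-Learning, the toy example below learns to walk right along a five-cell corridor toward a goal that pays +100, echoing the reward scheme of the taxi example that follows. The environment and all hyperparameter values are invented for illustration, not taken from the talk:

```python
import random
from collections import defaultdict

# Tabular Q-Learning on a toy corridor: cells 0..4, goal at cell 4.
# Actions: 0 = left, 1 = right. Reward +100 on reaching the goal, 0 otherwise.
def q_learning(episodes=500, alpha=0.5, gamma=0.95, epsilon=0.1, n=5):
    Q = defaultdict(float)  # (state, action) -> estimated value
    for _ in range(episodes):
        s = 0
        while s != n - 1:
            if random.random() < epsilon:
                a = random.choice([0, 1])
            else:  # greedy action with random tie-breaking
                a = max([0, 1], key=lambda x: (Q[(s, x)], random.random()))
            s2 = max(0, s - 1) if a == 0 else min(n - 1, s + 1)
            r = 100.0 if s2 == n - 1 else 0.0
            # Q-Learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
            s = s2
    return Q

random.seed(0)
Q = q_learning()
# The learnt greedy policy moves right in every non-goal state.
assert all(Q[(s, 1)] > Q[(s, 0)] for s in range(4))
```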

Illustrative Example: The Interactive Taxi

• State transitions: probability 0.8 of correct navigation/recognition

• Reward: +100 for reaching the goal, 0 otherwise

• Size of state-action space: |S×A| = 50 × 5^4 × 3 × 4 × 16 = 6M state-actions
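The factored count on the slide can be checked directly (the factors are quoted from the slide; their individual meanings are not spelled out there):

```python
# Product of the factor sizes quoted on the slide.
flat_state_actions = 50 * 5**4 * 3 * 4 * 16
assert flat_state_actions == 6_000_000  # matches the "6M state-actions" figure
```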


Hierarchical Reinforcement Learning

• Why? To learn system behaviours to carry out multiple tasks jointly (not separately)


(Figure caption: "I know how to do that, from playing the other game")

Interaction as a Semi-Markov Decision Process (SMDP)

● Environment as an SMDP:
  ● S: set of states
  ● A: set of (complex) actions
  ● T: state transition function
  ● R: reward function

● One SMDP for each task or subtask

● Hierarchical reinforcement learning algorithms to solve SMDPs (e.g. HSMQ, MAXQ)

(Figure: a hierarchy of SMDPs, with tasks Task 1 … Task N each decomposed into sub-tasks)

The goal is to find an optimal policy for each SMDP in the hierarchy.

Conceptual SMDP for Interactive Systems

Benefits: quicker learning, more scalability, behaviour reuse

Hierarchical Reinforcement Learning Algorithms

● HSMQ-Learning

● HSMQ-Learning with Linear Function Approximation

● Other HRL algorithms: MAXQ, HAMQ

● Algorithms for structure learning: HEXQ, VISA, HI-MAT

(Barto & Mahadevan, 2003; Hengst, 2010)
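The following is a runnable sketch in the spirit of HSMQ-Learning: every subtask keeps its own Q-table, and invoking a child subtask runs it to termination before the parent updates, discounting by gamma to the power of the k steps the child consumed. The corridor world, the two child subtasks, the per-subtask pseudo-reward of +100, and the hyperparameters are all invented for illustration; this is not any system from the talk:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPS, N = 0.5, 0.95, 0.2, 5

def move(s, a):                         # state s = (position, reached_far_end)
    pos = min(N - 1, max(0, s[0] + a))  # primitive actions: a in {-1, +1}
    return (pos, s[1] or pos == N - 1)

def key_of(a):
    return a.name if isinstance(a, Subtask) else a

class Subtask:
    def __init__(self, name, actions, done):
        self.name, self.actions, self.done = name, actions, done
        self.Q = defaultdict(float)     # one Q-table per subtask

    def choose(self, s):                # epsilon-greedy with random tie-breaking
        if random.random() < EPS:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: (self.Q[(s, key_of(a))],
                                                random.random()))

def run(task, s):
    """Execute `task` from state s until its termination condition holds."""
    steps = 0
    while not task.done(s):
        a = task.choose(s)
        if isinstance(a, Subtask):      # composite action: run child to its end
            k, s2 = run(a, s)
            k = max(k, 1)
        else:                           # primitive action: one time step
            k, s2 = 1, move(s, a)
        r = 100.0 if task.done(s2) else 0.0   # per-subtask pseudo-reward
        nxt = 0.0 if task.done(s2) else max(task.Q[(s2, key_of(b))]
                                            for b in task.actions)
        key = (s, key_of(a))
        # HSMQ-style update: the child's duration k appears as gamma**k
        task.Q[key] += ALPHA * (r + GAMMA**k * nxt - task.Q[key])
        steps += k
        s = s2
    return steps, s

go_right = Subtask("go_right", [-1, +1], lambda s: s[1])      # reach cell 4
go_left = Subtask("go_left", [-1, +1], lambda s: s[0] == 0)   # return to cell 0
root = Subtask("root", [go_right, go_left], lambda s: s == (0, True))

random.seed(1)
for _ in range(300):
    run(root, (0, False))

# Children learn their own goals; the root learns the value of invoking them.
assert go_right.Q[((2, False), 1)] > go_right.Q[((2, False), -1)]
assert go_left.Q[((2, True), -1)] > go_left.Q[((2, True), 1)]
assert root.Q[((4, True), "go_left")] > 50.0
```

The design point mirrored here is the one the slides emphasise: each task or subtask is its own (S)MDP solved by Q-learning, so the 6M-entry flat table is replaced by several small tables.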

Illustrative Example: The Interactive Taxi

• State transitions: probability 0.8 of correct navigation/recognition

• Reward: +100 for reaching the goal, 0 otherwise

• State-action space: |S×A| = 10.7K state-actions


Speech-Based Human-Machine Communication

HRL Agents

Application 1: Travel Planning

● HRL without prior knowledge (HSMQ-Learning)

● HRL with prior knowledge (HAM+HSMQ-Learning)

● Training with simulated interactions

● Testing with real users

(Cuayáhuitl et al., Computer Speech & Language, 2010)

(Figure annotation: W = joint state (SMDP + HAM))

Travel Planning Spoken Dialogue System

(Cuayáhuitl et al., Computer Speech & Language, 2010)

Results in the Travel Planning Domain


• HRL finds solutions faster than flat learning

• HRL is more scalable than flat learning

• Learnt policies outperform hand-coded ones

(Cuayáhuitl et al., Computer Speech & Language, 2010)

Application 2: Indoor Wayfinding

● HRL without policy reuse (HSMQ-Learning)

● HRL with policy reuse (HSMQ_PR-Learning)
  ● Detect situations where the system knows how to act
  ● Action selection using an optimal policy (if reuse=true) or an exploratory policy (if reuse=false)

● Training with simulated interactions

● Testing with real users
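The reuse test described above can be sketched as a simple action-selection rule: act greedily from previously learnt Q-values when the current state is one they cover, and fall back to exploration otherwise. The state and action names and the dictionary-based policy store are assumptions for the example, not the system's actual interfaces:

```python
import random

def select_action(state, actions, learnt_Q):
    """Policy-reuse action selection: greedy where the learnt policy applies,
    exploratory elsewhere. learnt_Q maps (state, action) -> value."""
    known = [a for a in actions if (state, a) in learnt_Q]
    if known:                        # reuse=true: the system knows how to act
        return max(known, key=lambda a: learnt_Q[(state, a)])
    return random.choice(actions)    # reuse=false: exploratory policy

# Hypothetical learnt values carried over from a previous task.
learnt = {("greet", "ask"): 1.0, ("greet", "confirm"): 2.0}
assert select_action("greet", ["ask", "confirm"], learnt) == "confirm"
assert select_action("unseen", ["ask", "confirm"], learnt) in {"ask", "confirm"}
```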

(Cuayáhuitl & Dethlefs, ACM Trans. Speech & Lang. Proc., 2011)

Indoor Wayfinding Dialogue System

(Cuayáhuitl & Dethlefs, ACM Trans. Speech & Lang. Proc., 2011)

Infokiosk & mobile phone interfaces

Results in the Indoor Wayfinding Domain


• Policy reuse finds solutions faster than without it

• Adaptive route instructions are more efficient

(Cuayáhuitl & Dethlefs, ACM Trans. Speech & Lang. Proc., 2011)

Application 3: Human-Robot Interaction

● HSMQ vs. FlexHSMQ Learning with linear function approximation
● Training with simulated interactions
● Testing with real users

(Cuayáhuitl et al., ACM Trans. Interactive Intelligent Sys., 2014)

Robot Dialogue System (Quiz Game)


Interaction Manager

(Cuayáhuitl et al., ACM Trans. Interactive Intelligent Sys., 2014)

Results in the Quiz Domain


• Non-strict HRL leads to more natural interactions

• Non-strict HRL is preferred by human users

(Cuayáhuitl et al., ACM Trans. Interactive Intelligent Sys., 2014)

Robot Asking and Answering Questions

(Belpaeme et al., 2012, Intl. Journal of HRI)


Learning with Large State Spaces


Learning under Uncertainty


Spectrum of Markov Process Models

Promising for multi-task learning systems


(Mahadevan, S. et al., 2004, Handbook of Learning and Approx. Dyn. Prog.)


Issues that Might Lead to Future Interactive Learning Systems

1. Big effort to make the system perform similar tasks
2. Simulations may not represent the real world
3. It is often hard to specify the reward function
4. The real world is partially known and dynamic
5. Poor spatial cognition will affect real-world impact
6. Small vocabularies discourage talking to machines
7. Lack of interactive learning systems in the real world


Towards Autonomous Interactive Systems and Robots

(Figure: degree of autonomy vs. amount of tasks. Current interactive systems require a lot of human intervention; future interactive systems should be more autonomous. How do we get here?)

Wholistic perspective for language, vision and robotics


Summary

• Machines can be programmed to behave just as expected, but the physical world and humans demand systems that can learn

• Hierarchical learning plays an important role for multi-tasked interactive systems and robots

• More autonomy is needed if systems are to learn new skills with little human intervention

• A wholistic interdisciplinary perspective is needed for intelligent interactive robots


References

• Cuayáhuitl, H., Dethlefs, N., Kruijff-Korbayová, I. (2014). Non-Strict Hierarchical Reinforcement Learning for Interactive Systems and Robots. To appear in ACM Transactions on Interactive Intelligent Systems, vol. 4, no. 3.

• Cuayáhuitl, H., Dethlefs, N. (2011). Spatially-Aware Dialogue Control Using Hierarchical Reinforcement Learning. ACM Transactions on Speech and Language Processing, vol. 7, no. 3, pp. 5:1-5:26.

• Cuayáhuitl, H., Renals, S., Lemon, O., Shimodaira, H. (2010). Evaluation of a Hierarchical Reinforcement Learning Spoken Dialogue System. Computer Speech and Language, vol. 24, no. 2, pp. 395-429.

E-Mail: hc213@hw.ac.uk
