
Multi Armed Bandit Algorithms

By Shrinivas Vasala


Overview

- K Slot Machines
- Multi Armed Bandit Problem
- A/B Testing
- MAB Algorithms
- Summary


K Slot Machines

- Choose a machine and receive a reward
- T turns (chances)
- What is your goal?
  - Maximize the cumulative reward
- How do you choose the machines (arms)? (See the sketch below.)
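
As a minimal sketch of this setting (the arm means, noise level, and the purely random policy are illustrative assumptions, not values from the slides), K slot machines can be simulated like this:

    import random

    class SlotMachines:
        """K slot machines; each pull pays a noisy reward around an unknown mean."""
        def __init__(self, means, sigma=1.0):
            self.means = means            # hidden from the player
            self.sigma = sigma

        def pull(self, arm):
            # Reward for playing one machine (arm) on one turn.
            return random.gauss(self.means[arm], self.sigma)

    # T = 1000 turns with a purely random choice of arm (no learning yet).
    machines = SlotMachines([0.2, 0.5, 0.9])
    total_reward = sum(machines.pull(random.randrange(3)) for _ in range(1000))
    print(total_reward)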


Multi Armed Bandit Problem (MAB)

- Goal: two-fold
  - Try different arms (Exploration)
  - Play the seemingly most rewarding arm (Exploitation)
- Explore-exploit trade-off: the problem multi-armed bandit algorithms address
- Reward distribution (unknown)
  - Mean rewards: <µ1, . . . , µK>
  - Standard deviations: <σ1, . . . , σK>
- Regret (to be minimized): maximizing the cumulative reward = minimizing regret (defined below)
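
In the usual notation (a sketch of the standard definition, with a_t the arm played at turn t and µ* the best arm's mean), the regret after T turns is

    R(T) = T\,\mu^{*} - \sum_{t=1}^{T} \mu_{a_t}, \qquad \mu^{*} = \max_{1 \le i \le K} \mu_i

so maximizing the expected cumulative reward over T turns is the same as minimizing R(T).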


A/B Testing

- Advertisement selection for a request from a pool of advertisements
  - Rewards: CTR/AR or CPM
- Recommendation of news articles to users
- Product pricing and promotional offers
- MAB is used to measure the performance of A/B testing experiments


MAB Algorithms

- Epsilon-greedy
- Softmax
- Pursuit
- Upper Confidence Bound (UCB1)
- UCB1-Tuned


Epsilon-greedy Algorithm

- Choose epsilon (ε): the exploration factor
- Play the best arm with probability (1 - ε): exploitation
- Play a random arm with probability ε: exploration

Note: a typical value of ε is 0.10 (10%)
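
A minimal sketch of the rule above (the incremental-mean bookkeeping is an assumption about how the value estimates are maintained):

    import random

    def epsilon_greedy_select(values, epsilon=0.10):
        """Exploit the best estimated arm with probability 1 - epsilon."""
        if random.random() < epsilon:
            return random.randrange(len(values))                    # exploration
        return max(range(len(values)), key=lambda i: values[i])     # exploitation

    def update_estimate(counts, values, arm, reward):
        # Incremental update of the chosen arm's average reward.
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]

    # Example bookkeeping for K = 3 arms.
    counts, values = [0, 0, 0], [0.0, 0.0, 0.0]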


Softmax Algorithm
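
The slide's formula did not survive extraction; the sketch below assumes the standard softmax (Boltzmann) rule, which plays arm i with probability proportional to exp(value_i / τ). The temperature τ is the tuning parameter: small τ is nearly greedy, large τ is nearly uniform.

    import math
    import random

    def softmax_select(values, tau=0.1):
        """Sample an arm with probability proportional to exp(value / tau)."""
        top = max(values)                                   # subtract max for numerical stability
        prefs = [math.exp((v - top) / tau) for v in values]
        total = sum(prefs)
        weights = [p / total for p in prefs]
        return random.choices(range(len(values)), weights=weights, k=1)[0]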


Pursuit Algorithm

- Update rule combines an exploitation term and an exploration term
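
A hedged sketch of the standard pursuit method (the learning rate β and the bookkeeping are assumptions): keep an explicit probability over arms and nudge it toward the arm with the best current estimate, so the greedy arm is pursued (exploitation) while the other arms keep some probability (exploration).

    import random

    def pursuit_step(probs, values, beta=0.01):
        """Move the selection probabilities toward the greedy arm, then sample."""
        best = max(range(len(values)), key=lambda i: values[i])
        for i in range(len(probs)):
            target = 1.0 if i == best else 0.0      # exploitation target
            probs[i] += beta * (target - probs[i])  # slow drift keeps some exploration
        return random.choices(range(len(probs)), weights=probs, k=1)[0]

    # probs should start uniform, e.g. [1.0 / K] * K for K arms.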


Upper Confidence Bound 1 (UCB1)

- Score of arm i: the arm's average reward (exploitation term) plus a confidence width sqrt(2 ln n / n_i) (exploration term), where n_i is the number of times arm i has been played and n the total number of plays
- At each iteration, choose the arm with the maximum score
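
A minimal sketch of the selection rule (each arm is forced to be played once before its score is defined):

    import math

    def ucb1_select(counts, values):
        """Play each arm once, then maximize mean + sqrt(2 ln n / n_i)."""
        for arm, c in enumerate(counts):
            if c == 0:
                return arm                          # initial round of forced exploration
        n = sum(counts)
        scores = [values[i] + math.sqrt(2 * math.log(n) / counts[i])
                  for i in range(len(values))]
        return max(range(len(scores)), key=lambda i: scores[i])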


UCB1-Tuned

- Same exploitation and exploration structure as UCB1, but the exploration term is scaled by an estimate of the variance of the reward
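
The slide's formula was not preserved; the sketch below follows the usual UCB1-Tuned score from Auer et al. (2002), in which UCB1's constant 2 is replaced by an upper confidence bound on the arm's reward variance, capped at 1/4. The sq_sums bookkeeping (running sums of squared rewards) is an assumption.

    import math

    def ucb1_tuned_select(counts, values, sq_sums):
        """Like UCB1, but the exploration width is scaled by a variance estimate."""
        for arm, c in enumerate(counts):
            if c == 0:
                return arm                          # play every arm once first
        n = sum(counts)
        scores = []
        for i in range(len(values)):
            # Empirical variance of arm i plus a confidence term, capped at 1/4.
            variance_ucb = (sq_sums[i] / counts[i] - values[i] ** 2
                            + math.sqrt(2 * math.log(n) / counts[i]))
            width = math.sqrt((math.log(n) / counts[i]) * min(0.25, variance_ucb))
            scores.append(values[i] + width)
        return max(range(len(scores)), key=lambda i: scores[i])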


Advanced Bandits

- Adversarial Bandits
- Contextual Bandits
- Infinite Armed Bandits
- Thompson Sampling Bandits


Summary

- Each algorithm has an upper bound on regret
  - The bound is a function of the arms' average reward distribution
- Each algorithm has a tuning parameter
  - Parameter tuning depends on the reward distribution
- Choose the right MAB algorithm based on simulations/historical data
- All these algorithms keep learning automatically throughout their lifetime


Thank You