Nonstochastic Multi-Armed Bandits With Graph-Structured Feedback Noga Alon, TAU Nicolo Cesa-Bianchi, Milan Claudio Gentile, Insubria Shie Mannor, Technion

Nonstochastic Multi-Armed Bandits With Graph-Structured Feedback

Noga Alon, TAU
Nicolo Cesa-Bianchi, Milan
Claudio Gentile, Insubria
Shie Mannor, Technion
Yishay Mansour, TAU and MSR
Ohad Shamir, Weizmann

Nonstochastic sequential decision-making

• K actions and T time steps
• ℓ_t(a) – loss of action a at time t
• At time t
– player picks action X_t
– incurs loss ℓ_t(X_t)
– observes feedback on losses
• Multi-armed bandit: only ℓ_t(X_t)
• Experts (full information): ℓ_t(j) for any j

Nonstochastic sequential decision-making

• Goal:
– minimize losses
– benchmark: the best single action
• the action j that minimizes the loss
– no stochastic assumptions on losses

• Regret (player loss minus loss of the best action):
$R_T = E\left[\sum_{t=1}^{T} \ell_t(X_t)\right] - \min_j \sum_{t=1}^{T} \ell_t(j)$

• Known regret bounds:
– MAB: $\sqrt{TK}$
– Experts: $\sqrt{T \ln K}$
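As a small illustration of the regret definition, the sketch below computes the realized (not expected) regret of one sequence of plays against the best single action in hindsight. The function name `realized_regret` and the toy loss table are made up for this example.

```python
def realized_regret(losses, plays):
    """Player's cumulative loss minus the loss of the best single action.

    losses[t][a] is the loss of action a at time t (a T x K table);
    plays[t] is the action X_t the player picked at time t.
    """
    T = len(losses)
    K = len(losses[0])
    player_loss = sum(losses[t][plays[t]] for t in range(T))
    best_action_loss = min(sum(losses[t][j] for t in range(T)) for j in range(K))
    return player_loss - best_action_loss


# Toy instance with made-up numbers: K = 2 actions, T = 3 steps.
toy_losses = [[0.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]]
toy_plays = [1, 0, 0]
toy_regret = realized_regret(toy_losses, toy_plays)
```

The regret R_T on the slide is the expectation of this quantity over the player's randomization.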

Motivation – observability

[Figure: undirected and directed observation graphs; each node is labeled with a loss value or "?"]

• MAB: no edges
• Experts: clique


Modeling

Directed vs Undirected
• Different types of dependencies
• Different measures
– Independent set
– Dominating set
– Max Acyclic Subgraph

Informed vs Uninformed
• When does the learner observe the graph?
– Before
– After
• only the neighbors

Our Results

Uninformed setting
• Undirected graph
• Uninformed setting
– Only the neighbors of the node
– Independent sets
• Directed graph
– Max Acyclic Subgraph (not tight)
– Random Erdős-Rényi graphs

Informed setting
• Directed graphs
• Regret characterization
– dominating sets and ind. set
• Both expectation and high prob.

$\tilde{O}\left(\sqrt{\alpha(G)\, T \ln K}\right)$

EXP3-SET

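• Online Algorithm
$\Pr[X_t = a] \propto \exp\left(-\eta \sum_{s=1}^{t-1} \hat{\ell}_s(a)\right)$
where
$\hat{\ell}_t(a) = \frac{\ell_t(a)}{\Pr[\ell_t(a)\text{ is observed}]}$ if $\ell_t(a)$ is observed, and $\hat{\ell}_t(a) = 0$ otherwise,
so that $E[\hat{\ell}_t(a)] = \ell_t(a)$

• Theorem
$R_T \le \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \sum_{a \in G} \Pr[X_t = a \mid \ell_t(a)\text{ is observed}]$
(in an undirected graph, playing a always reveals $\ell_t(a)$, so this conditional probability equals $\Pr[X_t = a] / \Pr[\ell_t(a)\text{ is observed}]$)
For a suitably tuned η: $R_T = O\left(\sqrt{\alpha(G)\, T \ln K}\right)$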
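The update above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `exp3_set` and `observation_probs` are hypothetical, the graph is assumed fixed and undirected with every action observing itself, losses are assumed to lie in [0, 1], and subtracting the minimum cumulative estimate is a numerical-stability step added here (it leaves the probabilities unchanged).

```python
import math
import random


def observation_probs(p, neighbors):
    """Pr[loss of a is observed] = total probability of actions that reveal a."""
    K = len(p)
    return [sum(p[j] for j in range(K) if a in neighbors[j]) for a in range(K)]


def exp3_set(losses, neighbors, eta, rng):
    """Sketch of EXP3-SET on a fixed observation graph.

    losses[t][a]: loss of action a at time t (assumed in [0, 1]).
    neighbors[a]: set of actions observed when a is played (assumed to contain a).
    Returns the list of actions played.
    """
    T, K = len(losses), len(losses[0])
    cum_est = [0.0] * K  # cumulative estimated losses, one entry per action
    plays = []
    for t in range(T):
        # Pr[X_t = a] proportional to exp(-eta * cumulative estimated loss of a).
        m = min(cum_est)  # shift for numerical stability only
        w = [math.exp(-eta * (c - m)) for c in cum_est]
        total = sum(w)
        p = [x / total for x in w]
        q = observation_probs(p, neighbors)
        x_t = rng.choices(range(K), weights=p)[0]
        plays.append(x_t)
        # Importance-weighted estimates for every loss revealed by x_t;
        # unobserved actions get estimate 0, so their sums do not change.
        for a in neighbors[x_t]:
            cum_est[a] += losses[t][a] / q[a]
    return plays
```

With `neighbors[a] = {a}` for every action this reduces to a bandit-style update, and with `neighbors[a]` equal to the full action set every loss is observed with probability 1, as in the experts setting.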

EXP3-Set Regret – key lemma

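• Lemma
$Q = \sum_{a \in G} \frac{\Pr[X_t = a]}{\Pr[a\text{ observed}]} \le \alpha(G)$

Note:
MAB: Q=K
Full info.: Q=1

• Proof: Build an i.s. S
– consider action a with minimal Pr[a observed]
– Add a to S
– Delete a and its neighbors

• Note (N(a) is the neighborhood of a, including a itself)
$\sum_{j \in N(a)} \frac{\Pr[X_t = j]}{\Pr[j\text{ observed}]} \le \sum_{j \in N(a)} \frac{\Pr[X_t = j]}{\Pr[a\text{ observed}]} = 1$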
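The proof's construction can be run directly. The sketch below, with hypothetical names `q_value` and `greedy_independent_set`, computes Q for an undirected graph and builds the independent set S by repeatedly taking the remaining action with the smallest observation probability. It assumes every neighborhood contains the action itself and that all probabilities are positive. Each deleted neighborhood contributes at most 1 to Q, so Q ≤ |S| ≤ α(G).

```python
def q_value(p, neighbors):
    """Return (Q, q), where q[a] = Pr[a observed] and Q = sum_a p[a] / q[a].

    neighbors[a] is the set N(a) of an undirected graph, assumed to contain a.
    """
    K = len(p)
    q = [sum(p[j] for j in neighbors[a]) for a in range(K)]
    Q = sum(p[a] / q[a] for a in range(K))
    return Q, q


def greedy_independent_set(q, neighbors):
    """Build S as in the proof: pick the remaining action with minimal
    observation probability, add it to S, delete it and its neighbors."""
    remaining = set(range(len(q)))
    S = []
    while remaining:
        a = min(remaining, key=lambda i: q[i])
        S.append(a)
        remaining -= neighbors[a]  # removes a as well, since a is in N(a)
    return S


# Illustrative instance (made-up numbers): a 5-cycle with a skewed distribution.
cycle = [{(a - 1) % 5, a, (a + 1) % 5} for a in range(5)]
p = [0.5, 0.1, 0.1, 0.1, 0.2]
Q, q = q_value(p, cycle)
S = greedy_independent_set(q, cycle)
```

On this instance S has two elements, matching the independence number of the 5-cycle, and Q stays below that value.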

Dominating set – directed graph

[Figure: directed observation graph; nodes marked "?"]


EXP3-DOM

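• Simplified version
– fixed graph G
– D is a dominating set
• log approx
• Main modification
– add probabilities to D
• induce observability
• probabilities:
$p_{a,t} = (1-\gamma)\frac{w_{a,t}}{W_t} + \frac{\gamma}{|D|}\, I[a \in D]$, with $W_t = \sum_a w_{a,t}$
• Select X_t using p_t
• Observe ℓ_t(a) for a in S_{X_t,t}
• weights:
$w_{a,t+1} = w_{a,t} \exp\left(-\gamma\, \hat{\ell}_t(a) / |D|\right)$
$\hat{\ell}_t(a) = \frac{\ell_t(a)}{\Pr[\text{observe } a]}\, I[a \in S_{X_t,t}]$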
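One round of the simplified algorithm can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the names are hypothetical, `out_neighbors[a]` plays the role of the observed set S_{a,t} for a fixed graph and is assumed to contain a, and the dominating set is found with the standard greedy set-cover heuristic as one way to obtain a logarithmic approximation.

```python
import math


def greedy_dominating_set(out_neighbors):
    """Greedy set cover: repeatedly add the action that reveals the most
    still-uncovered actions. Assumes a is in out_neighbors[a]."""
    K = len(out_neighbors)
    uncovered = set(range(K))
    D = []
    while uncovered:
        best = max(range(K), key=lambda a: len(out_neighbors[a] & uncovered))
        D.append(best)
        uncovered -= out_neighbors[best]
    return D


def exp3_dom_probs(w, D, gamma):
    """p_a = (1 - gamma) * w_a / W + (gamma / |D|) * I[a in D]."""
    W = sum(w)
    in_D = set(D)
    return [(1 - gamma) * w_a / W + (gamma / len(D) if a in in_D else 0.0)
            for a, w_a in enumerate(w)]


def exp3_dom_update(w, p, out_neighbors, D, gamma, x_t, loss_t):
    """Multiplicative update using importance-weighted loss estimates."""
    K = len(w)
    new_w = list(w)
    for a in out_neighbors[x_t]:  # only revealed losses have nonzero estimates
        prob_observed = sum(p[j] for j in range(K) if a in out_neighbors[j])
        estimate = loss_t[a] / prob_observed
        new_w[a] = w[a] * math.exp(-gamma * estimate / len(D))
    return new_w


# Illustrative graph (made up): action 0 reveals every loss, so D = [0].
out_neighbors = [{0, 1, 2, 3}, {1, 2, 3}, {2, 3}, {3}]
D = greedy_dominating_set(out_neighbors)
p = exp3_dom_probs([1.0] * 4, D, gamma=0.2)
```

Because every action is revealed by some member of D, and each member of D is played with probability at least γ/|D|, every loss is observed with probability at least γ/|D|, which keeps the estimates bounded.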

EXP3-DOM

• Simple example
• Transitive observability
– tournament
• action 1 observes all actions
– D={1}
• EXP3-DOM
• Sample action 1 with prob γ
– action 1 is the exploration
• Otherwise run a MAB
– specifically EXP3-SET
• Intuition
– action 1 replaces mixture with uniform

Conclusion

• Observability model
– Between MAB and Experts
• more work to be done
• Uninformed setting
– Undirected graph
• Informed setting
– Directed graph
• [Kocák, Neu, Valko and R. Munos] improved the uninformed setting

Thank You

Outline

• Model and motivation
• symmetric observability
• non-symmetric observability

EXP3-DOM: key lemma

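• Lemma
– G directed graph,
– d_i^- indegree of i,
– α=α(G)
$\sum_{i=1}^{K} \frac{1}{1 + d_i^-} \le 2\alpha \ln\left(1 + \frac{K}{\alpha}\right)$

• Turán’s Theorem
– undirected graph G(V,E)
$\alpha(G) \ge \frac{|V|}{\frac{2|E|}{|V|} + 1}$

• Proof: high level
– shrink graph
• G_K, G_{K-1}, …
– delete nodes
• step s:
– delete max indegree node
• From Turán’s theorem (V_s and D_s are the node set and arc set of G_s)
$\max_i d_{i,s}^- \ge \frac{|D_s|}{|V_s|} \ge \frac{|V_s|}{2\alpha} - \frac{1}{2}$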

EXP3-DOM: key lemma (proof)

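• Completing the proof (i_K is the max indegree node deleted from G_K)
$\sum_{i=1}^{K} \frac{1}{1+d_{i,K}^-} = \frac{1}{1+d_{i_K,K}^-} + \sum_{i \ne i_K} \frac{1}{1+d_{i,K}^-} \le \frac{2\alpha}{\alpha+K} + \sum_{i \ne i_K} \frac{1}{1+d_{i,K}^-}$
$\le \frac{2\alpha}{\alpha+K} + \sum_{i=1}^{K-1} \frac{1}{1+d_{i,K-1}^-} \le \dots \le 2\alpha \sum_{i=1}^{K} \frac{1}{\alpha+i} \le 2\alpha \ln\left(1 + \frac{K}{\alpha}\right)$

• Note, due to edge elimination
$d_{i,K-1}^- \le d_{i,K}^-$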
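The inequality of the key lemma, with the sum of 1/(1 + indegree) on the left and 2α ln(1 + K/α) on the right, can be sanity-checked numerically on tiny graphs. The sketch below uses hypothetical helper names and a brute-force independence number, so it is only meant for a handful of nodes. Arcs are assumed to be pairs (i, j) with i ≠ j, and two nodes count as adjacent if an arc joins them in either direction.

```python
import math
from itertools import combinations


def independence_number(K, arcs):
    """Brute force: size of the largest node set with no arc between any
    two of its nodes. Exponential in K, for tiny graphs only."""
    adjacent = {frozenset(arc) for arc in arcs}
    for size in range(K, 0, -1):
        for S in combinations(range(K), size):
            if all(frozenset(pair) not in adjacent for pair in combinations(S, 2)):
                return size
    return 0


def lemma_sides(K, arcs):
    """Return (lhs, rhs) with lhs = sum_i 1 / (1 + indegree of i) and
    rhs = 2 * alpha * ln(1 + K / alpha)."""
    indegree = [0] * K
    for (_, j) in arcs:  # arc i -> j contributes to the indegree of j
        indegree[j] += 1
    alpha = independence_number(K, arcs)
    lhs = sum(1.0 / (1 + d) for d in indegree)
    rhs = 2 * alpha * math.log(1 + K / alpha)
    return lhs, rhs


# Illustrative graphs: a directed 5-cycle and a transitive tournament on 4 nodes.
directed_cycle = [(i, (i + 1) % 5) for i in range(5)]
tournament = [(i, j) for i in range(4) for j in range(i + 1, 4)]
cycle_lhs, cycle_rhs = lemma_sides(5, directed_cycle)
tour_lhs, tour_rhs = lemma_sides(4, tournament)
```

In the tournament every pair of nodes is joined by an arc, so α = 1 and the left-hand side is a harmonic sum, which is where the logarithm in the bound comes from.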

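EXP3-DOM: key lemma (modified)

• Lemma (what we really need!)
• G(V,E) directed graph
– IN_i in-neighborhood of i
– r size of a dominating set; α size of a max. ind. set
– p distribution over V
• p_i ≥ β for each i in the dominating set
$Q = \sum_{i=1}^{K} \frac{p_i}{p_i + \sum_{j \in IN_i} p_j} \le 2\alpha \ln\left(1 + \frac{\lceil K^2/(r\beta) \rceil + K}{\alpha}\right) + 2r$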

EXP3-DOM: changing graphs

• Simple
– all dom. sets same size
– approx. same size
• Problem
– different size dom. sets
• can be 1 or K
• Solution
– keep log levels
• depend on log_2 |D_t|
– algorithm per level
• Complications
– parameters depend on level
– setting the learning rate
• need a delicate doubling
• Main tech. challenge
– handle dynamic adversary.

EXP3-DOM

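• receive obs. graph
– find dominating set D_t
• logarithmic approximation
• Run the right copy
– Let $b_t = \lfloor \log_2 |D_t| \rfloor$
– run copy b_t
• log copies
• For copy b_t
– param. depend on b_t
• probabilities:
$p_{a,t} = (1-\gamma_{b_t})\frac{w_{a,t}}{W_t} + \frac{\gamma_{b_t}}{|D_t|}\, I[a \in D_t]$
• Select X_t using p_t
• Observe ℓ_t(a) for a in S_{X_t,t}
• weights:
$w_{a,t+1} = w_{a,t} \exp\left(-\gamma_{b_t}\, \hat{\ell}_t(a) / 2^{b_t}\right)$
$\hat{\ell}_t(a) = \frac{\ell_t(a)}{\Pr[\text{observe } a]}\, I[a \in S_{X_t,t}]$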
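The bookkeeping for "run the right copy" can be sketched as below. The names `copy_index` and `make_copies` are hypothetical, and the sketch assumes the level is log2 of the dominating set size rounded down to an integer, so that sizes 1 to K map to about log2 K levels, each with its own weight vector.

```python
def copy_index(dominating_set_size):
    """Level b = floor(log2(size)), computed exactly with integer arithmetic."""
    if dominating_set_size < 1:
        raise ValueError("a dominating set has at least one element")
    return dominating_set_size.bit_length() - 1


def make_copies(K):
    """One weight vector per level b = 0, ..., floor(log2 K), all initialized to 1."""
    return {b: [1.0] * K for b in range(copy_index(K) + 1)}


# Example: with K = 8 actions there are 4 levels; a dominating set of size 5
# is handled by the copy at level 2, together with sizes 4, 6 and 7.
copies = make_copies(8)
weights_for_round = copies[copy_index(5)]
```

Grouping rounds by level means that within one copy all dominating sets have sizes within a factor of two of each other, which is the "approx. same size" case.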

EXP3-DOM – main Theorem

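• Theorem (T^{(b)} is the set of rounds t with b_t = b):
$R_T \le \sum_{b=0}^{\lfloor \log_2 K \rfloor} \left( \frac{2^b \ln K}{\gamma_b} + \gamma_b\, E\left[ \sum_{t \in T^{(b)}} \left( 1 + \frac{Q_t^{(b)}}{2^{b+1}} \right) \right] \right)$

• tuning γ_b
$R_T = O\left( \ln K \sqrt{\ln(KT) \sum_{t=1}^{T} E\left[ 4|D_t| + Q_t^{(b_t)} \right]} + \ln K \ln(KT) \right)$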

Independent set

• Independent set α(G)
• [Mannor & Shamir 2012]
• Tight regret: $\sqrt{\alpha(G)\, T \ln K}$
– α(G) “replaces” K
• Cons:
– requires observing G
– solves an LP at each step

[Figure: observation graph; nodes marked "?"]
