
Thompson Sampling on Symmetric α-Stable Bandits

Abhimanyu Dubey and Alex Pentland
Massachusetts Institute of Technology
{dubeya, pentland}@mit.edu

Abstract

Thompson Sampling provides an efficient technique to introduce prior knowledge in the multi-armed bandit problem, along with providing remarkable empirical performance. In this paper, we revisit the Thompson Sampling algorithm under rewards drawn from α-stable distributions, which are a class of heavy-tailed probability distributions utilized in finance and economics, in problems such as modeling stock prices and human behavior. We present an efficient framework for α-stable posterior inference, which leads to two algorithms for Thompson Sampling in this setting. We prove finite-time regret bounds for both algorithms, and demonstrate through a series of experiments the stronger performance of Thompson Sampling in this setting. With our results, we provide an exposition of α-stable distributions in sequential decision-making, and enable sequential Bayesian inference in applications from diverse fields in finance and complex systems that operate on heavy-tailed features.

1 Introduction

The multi-armed bandit (MAB) problem is a fundamental model in understanding the exploration-exploitation dilemma in sequential decision-making. The problem and several of its variants have been studied extensively over the years, and a number of algorithms have been proposed that optimally solve the bandit problem when the reward distributions are well-behaved, i.e. have a finite support, or are sub-exponential.

The most prominently studied class of algorithms are the Upper Confidence Bound (UCB) algorithms, which employ an "optimism in the face of uncertainty" heuristic [ACBF02] and have been shown to be optimal (in terms of regret) in several cases [CGM+13, BCBL13]. Over the past few years, however, there has been a resurgence of interest in the Thompson Sampling (TS) algorithm [Tho33], which approaches the problem from a Bayesian perspective.

Rigorous empirical evidence in favor of TS demonstrated by [CL11] sparked new interest in the theoretical analysis of the algorithm, and the seminal work of [AG12, AG13, RVR14] demonstrated the optimality of TS when rewards are bounded in [0, 1] or are Gaussian. These results were extended in the work of [KKM13] to more general, exponential family reward distributions. The empirical studies, along with theoretical guarantees, have established TS as a powerful algorithm for the MAB problem.

However, when designing decision-making algorithms for complex systems, we see that interactions in such systems often lead to heavy-tailed and power law distributions, in problems such as modeling stock prices [BT03], preferential attachment in social networks [MCM+13], and online behavior on websites [KT10].

Specifically, we consider a family of extremely heavy-tailed reward distributions known as α-stable distributions. This family refers to a class of distributions parameterized by the exponent α, that include the Gaussian (α = 2), Lévy (α = 1/2) and Cauchy (α = 1) distributions, all of which are used extensively in economics [Fra], finance [CW03] and signal processing [SN93].


The primary hurdle in creating machine learning algorithms that account for α-stable distributions, however, is their intractable probability density, which cannot be expressed analytically. This prevents even a direct evaluation of the likelihood under this distribution. Their heavy-tailed nature, additionally, often leads to standard algorithms (such as Thompson Sampling assuming Gaussian rewards) concentrating on incorrect arms.

In this paper, we create two algorithms for Thompson Sampling under symmetric α-stable rewards with finite means. Our contributions can be summarized as follows:

1. Using auxiliary variables, we construct a framework for posterior inference in symmetric α-stable bandits that leads to the first efficient algorithm for Thompson Sampling in this setting, which we call α-TS.

2. To the best of our knowledge, we provide the first finite-time polynomial bound on the Bayesian Regret of Thompson Sampling achieved by α-TS in this setting.

3. We improve on the regret by proposing a modified Thompson Sampling algorithm, called Robust α-TS, that utilizes a truncated mean estimator, and obtain the first $O(N^{\frac{1}{1+\epsilon}})$ Bayesian Regret in the α-stable setting. Our bound matches the optimal bound for α ∈ (1, 2) (within logarithmic factors).

4. Through a series of experiments, we demonstrate the proficiency of our two Thompson Sampling algorithms for α-stable rewards, which consistently outperform all existing benchmarks.

Our paper is organized as follows: we first give a technical overview of the MAB problem, the Thompson Sampling algorithm and α-stable distributions. Next, we provide the central algorithm α-TS and its analysis, followed by the same for the Robust α-TS algorithm. We then provide experimental results on multiple simulation benchmarks and finally, discuss the related work in this area prior to closing remarks.

2 Preliminaries

2.1 Thompson Sampling

The K-Armed Bandit Problem: In any instance of the K-armed bandit problem, there exists an agent with access to a set of K actions (or "arms"). The learning proceeds in rounds, indexed by t ∈ [1, T]. The total number of rounds, known as the time horizon T, is known in advance. The problem is iterative, wherein for each round t ∈ [T]:

1. Agent picks arm a_t ∈ [K].

2. Agent observes reward r_{a_t}(t) from that arm.

For arm k ∈ [K], rewards come from a distribution D_k with mean µ_k = E_{D_k}[r].¹ The largest expected reward is denoted by µ* = max_{k∈[K]} µ_k, and the corresponding arm(s) is denoted as the optimal arm(s) k*. In our analysis, we will focus exclusively on the i.i.d. setting, that is, for each arm, rewards are independently and identically drawn from D_k, every time arm k is pulled.

To measure the performance of any (possibly randomized) algorithm we utilize a measure known as the Regret R(T), which, at any round T, is the difference of the cumulative mean reward of the algorithm against the expected reward of always playing an optimal arm:

$$R(T) = \mu^* T - \sum_{t=1}^{T} \mu_{a_t} \qquad (1)$$

¹ α-stable distributions with α ≤ 1 do not admit a finite first moment. To continue with existing measures of regret, we only consider rewards with α > 1.
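To make Equation (1) concrete, here is a small illustrative sketch (our own, not from the paper) that computes the cumulative regret of a played arm sequence from the true arm means.

```python
import numpy as np

def cumulative_regret(means, arms_played):
    """Cumulative regret after T rounds (Equation 1): T * mu_star minus the sum of
    the true means of the arms actually played."""
    means = np.asarray(means, dtype=float)
    arms_played = np.asarray(arms_played, dtype=int)
    return len(arms_played) * means.max() - means[arms_played].sum()

# Example: 3 arms with means (1.0, 2.0, 0.5); playing arms [1, 0, 1] incurs regret 1.0.
print(cumulative_regret([1.0, 2.0, 0.5], [1, 0, 1]))
```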


Thompson Sampling (TS): Thompson Sampling [Tho33] proceeds by maintaining a posterior distribution over the parameters of the bandit arms. If we assume that for each arm k, the reward distribution D_k is parameterized by a (possibly vector) parameter θ_k that comes from a set Θ with a prior probability distribution p(θ_k) over the parameters, the Thompson Sampling algorithm proceeds by selecting arms based on the posterior probability of the reward under the arms. For each round t ∈ [T], the agent:

1. Draws parameters θ_k(t) for each arm k ∈ [K] from the posterior distribution of parameters, given the previous rewards r_k(t − 1) = {r_k^{(1)}, r_k^{(2)}, ...} till round t − 1 (note that the posterior distribution for each arm only depends on the rewards obtained using that arm). When t = 1, this is just the prior distribution over the parameters.

$$\theta_k(t) \sim p(\theta_k \mid \mathbf{r}_k(t-1)) \propto p(\mathbf{r}_k(t-1) \mid \theta_k)\, p(\theta_k) \qquad (2)$$

2. Given the drawn parameters θ_k(t) for each arm, chooses arm a_t with the largest mean reward over the posterior distribution.

$$a_t = \arg\max_{k \in [K]} \mu_k(\theta_k(t)) \qquad (3)$$

3. Obtains reward r_t after taking action a_t and updates the posterior distribution for arm a_t.

In the Bayesian case, the measure for performance we will utilize in this paper is the Bayes Regret (BR) [RVR14], which is the expected regret over the priors. Denoting the parameters over all arms as θ = {θ_1, ..., θ_K} and their corresponding product distribution as D = ∏_i D_i, for any policy π, the Bayes Regret is given by:

$$\mathrm{BayesRegret}(T, \pi) = \mathbb{E}_{\theta \sim \mathcal{D}}[R(T)]. \qquad (4)$$

While the regret provides a stronger analysis, any bound on the Bayes Regret is essentially a bound on the expected regret, since if an algorithm admits a Bayes Regret of O(g(T)), then its Expected Regret is also stochastically bounded by g(·) [RVR14]. Formally, we have, for constants M, ε:

$$\mathbb{P}\left(\frac{\mathbb{E}[R(T) \mid \theta]}{g(T)} \geq M\right) \leq \epsilon \quad \forall\, T \in \mathbb{N}. \qquad (5)$$

2.2 α-Stable Distributions

α-Stable distributions, introduced by Lévy [Lev25], are a class of probability distributions defined over ℝ whose members are closed under linear transformations.

Definition 1 ([BHW05]). Let X_1 and X_2 be two independent instances of the random variable X. X is stable if, for a_1 > 0 and a_2 > 0, a_1 X_1 + a_2 X_2 follows the same distribution as cX + d for some c > 0 and d ∈ ℝ.

A random variable X ∼ S_α(β, µ, σ) follows an α-stable distribution described by the parameters α ∈ (0, 2] (characteristic exponent) and β ∈ [−1, 1] (skewness), which are responsible for the shape and concentration of the distribution, and parameters µ ∈ ℝ (shift) and σ ∈ ℝ⁺ (scale) which correspond to the location and scale respectively. While it is not possible to analytically express the density function for generic α-stable distributions, they are known to admit the characteristic function φ(x; α, β, σ, µ):

$$\varphi(x; \alpha, \beta, \sigma, \mu) = \exp\left\{ i x \mu - |\sigma x|^{\alpha} \left(1 - i \beta\, \mathrm{sign}(x)\, \Phi_\alpha(x)\right) \right\},$$

where Φ_α(x) is given by

$$\Phi_\alpha(x) = \begin{cases} \tan\left(\frac{\pi\alpha}{2}\right) & \text{when } \alpha \neq 1, \\ -\frac{2}{\pi} \log|x| & \text{when } \alpha = 1. \end{cases}$$

3

Page 4: Thompson Sampling on Symmetric -Stable Banditsweb.mit.edu/dubeya/www/files/alphats_ijcai19.pdfactions in such systems often lead to heavy-tailed and power law distributions, such as

Figure 1: Sample probability density for standard (µ = 0, σ = 1) symmetric α-stable distributions with various values of α [Wik19].

For fixed values of α, β, σ and µ we can recover the density function from φ(·) via the inverse Fourier transform:

$$p(z) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \varphi(x; \alpha, \beta, \sigma, \mu)\, e^{-izx}\, dx$$

Most of the attention in the analysis of α-stable distributions has been focused on the stability parameter α, which is responsible for the tail "fatness". It can be shown that asymptotically, the tail behavior (x → ±∞) of X ∼ S_α(β, µ, σ) follows [BHW05]:

$$f(x) \sim \frac{1}{|x|^{1+\alpha}} \left( \sigma^{\alpha} (1 + \mathrm{sgn}(x)\beta) \sin\left(\frac{\pi\alpha}{2}\right) \frac{\Gamma(\alpha+1)}{\pi} \right)$$

where α < 2 and Γ(·) denotes the Gamma function. The power-law relationship admitted by the density is responsible for the heaviness of said tails.

Fact 1 ([BHW05]). X ∼ S_α(β, µ, σ), α < 2 admits a moment of order λ only when λ ∈ (−∞, α).

From Fact 1 it follows that α-stable random variables only admit a finite mean for α > 1, and admit a finite variance only when α = 2, which corresponds to the family of Gaussian distributions. To continue with existing measures of regret, we hence restrict our analysis to α-stable distributions with α > 1. Note that for all our discussions, 1 < α < 2, hence all distributions examined are heavy-tailed, with infinite variance. Additionally, we restrict ourselves to symmetric (β = 0) distributions: asymmetric distributions do not allow a scaled mixture representation (which is the basis of our framework, see Section 3.1).

2.2.1 Sampling from α-Stable Densities

For general values of α, β, σ, µ, it is not possible to analytically express the density of α-stable distributions, and hence we resort to using auxiliary variables for sampling. The Chambers-Mallows-Stuck [CMS76] algorithm is a straightforward method to generate samples from the density S_α(β, 1, 0) (for α ≠ 1) via a non-linear transformation of a uniform random variable V and an exponential random variable W, which can then be re-scaled to obtain samples from S_α(β, σ, µ) (Algorithm 1).

2.2.2 Products of α-Stable Densities

A central property of α-stable densities that will be utilized in the following section is the behavior of products of independent α-stable variables.


Algorithm 1 Chambers-Mallows-Stuck Generation

Input: V ∼ U(−π/2, π/2), W ∼ E(1)
Output: X ∼ S_α(β, σ, µ)

Set B_{α,β} = arctan(β tan(πα/2)) / α
Set S_{α,β} = (1 + β² tan²(πα/2))^{1/(2α)}
Set Y = S_{α,β} × [sin(α(V + B_{α,β})) / cos(V)^{1/α}] × [cos(V − α(V + B_{α,β})) / W]^{(1−α)/α}
return X = σY + µ.
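For concreteness, the following is a minimal NumPy sketch of the Chambers-Mallows-Stuck generator of Algorithm 1 (valid for α ≠ 1). The function name cms_sample and its interface are our own choices, not part of the paper.

```python
import numpy as np

def cms_sample(alpha, beta=0.0, sigma=1.0, mu=0.0, size=1, rng=None):
    """Draw samples from S_alpha(beta, sigma, mu) via the Chambers-Mallows-Stuck
    transform of V ~ U(-pi/2, pi/2) and W ~ Exp(1); requires alpha != 1."""
    rng = np.random.default_rng() if rng is None else rng
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    B = np.arctan(beta * np.tan(np.pi * alpha / 2)) / alpha                     # B_{alpha,beta}
    S = (1 + beta ** 2 * np.tan(np.pi * alpha / 2) ** 2) ** (1 / (2 * alpha))   # S_{alpha,beta}
    Y = (S * np.sin(alpha * (V + B)) / np.cos(V) ** (1 / alpha)
         * (np.cos(V - alpha * (V + B)) / W) ** ((1 - alpha) / alpha))
    return sigma * Y + mu
```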

Lemma 1 ([BHW05]). Let Z and Y > 0 be independent random variables such that Z ∼ S_γ(0, σ_1, µ_1) and Y ∼ S_δ(1, σ_2, µ_2). Then Z Y^{1/γ} is stable with exponent γδ.

We now begin with the algorithm description and analysis.

3 α-Thompson Sampling

We consider the setting where, for an arm k, the corresponding reward distribution is given by D_k = S_α(0, σ, µ_k), where α ∈ (1, 2), σ ∈ ℝ⁺ are known in advance, and µ_k is unknown.² Note that E[r_k] = µ_k, and hence we set a prior distribution over the variable µ_k, which is the expected reward for arm k. We can see that since the only unknown parameter for the reward distributions is µ_k, D_k is parameterized by θ_k = µ_k. Therefore, the steps for Thompson Sampling can be outlined as follows. For each round t ∈ [T], the agent:

1. Draws parameters θ_k(t) = µ_k(t) for each arm k ∈ [K] from the posterior distribution of parameters, given the previous rewards r_k(t − 1) = {r_k^{(1)}, r_k^{(2)}, ...} till round t − 1.

$$\mu_k(t) \sim p(\mu_k \mid \mathbf{r}_k(t-1)) \propto p(\mathbf{r}_k(t-1) \mid \mu_k)\, p(\mu_k) \qquad (6)$$

2. Given the drawn parameters µ_k(t) for each arm, chooses arm a_t with the largest mean reward over the posterior distribution.

$$a_t = \arg\max_{k \in [K]} \mu_k(t) \qquad (7)$$

3. Obtains reward r_t after taking action a_t and updates the posterior distribution for arm a_t.

In this section, we will derive the form of the prior distribution, and outline an algorithm for Bayesian inference.

3.1 Scale Mixtures of Normals

On setting γ = 2 (Gaussian) and δ = α/2 < 1 in Lemma 1, the product distribution X = Z Y^{1/2} is stable with exponent α. This property is an instance of the general framework of scale mixtures of normals (SMiN) [AM74], which are described by the following:

$$p_X(x) = \int_0^{\infty} \mathcal{N}(x \mid 0, \lambda\sigma^2)\, p_\Lambda(\lambda)\, d\lambda \qquad (8)$$

This framework contains a large class of heavy-tailed distributions which include the exponential power law, Student's-t and symmetric α-stable distributions [GK99]. The precise form of the variable X depends on the mixing distribution p_Λ(λ).

² Typically, more general settings have been explored for TS in the Gaussian case, where the variance is also unknown. Our algorithm can also be extended to an unknown scale (σ) using an inverse Gamma prior, similar to [GK99].


For instance, when p_Λ is the inverted Gamma distribution (the conjugate prior for an unknown-variance Gaussian), the resulting p_X follows a Student's-t distribution.

Bayesian inference directly from S_α(0, σ, µ) is difficult: the non-analytic density prevents a direct evaluation of the likelihood function, and the non-Gaussianity introduces difficulty in its implementation. However, the SMiN representation enables us to draw samples directly from S_α(0, σ, µ) using the auxiliary variable λ:

$$x \sim \mathcal{N}(\mu, \lambda\sigma^2), \qquad \lambda \sim S_{\alpha/2}(1, 1, 0) \qquad (9)$$

This sampling assists in inference since x is Gaussian conditioned on λ: given samples of λ, we can generate x from the induced conditional distribution (which is Gaussian).
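As an illustration of Equation (9), the sketch below draws symmetric α-stable samples through the SMiN representation, reusing the hypothetical cms_sample helper sketched in Section 2.2.1 to generate the totally skewed positive stable mixing variable.

```python
import numpy as np

def smin_stable_sample(alpha, mu, sigma, size=1, rng=None):
    """Sample r ~ S_alpha(0, sigma, mu) as in Equation (9):
    r | lambda ~ N(mu, lambda * sigma^2), with lambda ~ S_{alpha/2}(1, 1, 0)."""
    rng = np.random.default_rng() if rng is None else rng
    # assumes cms_sample from the Section 2.2.1 sketch is in scope
    lam = cms_sample(alpha / 2, beta=1.0, size=size, rng=rng)  # positive stable auxiliary variable
    return rng.normal(loc=mu, scale=sigma * np.sqrt(lam), size=size)
```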

3.2 Posteriors for α-Stable Rewards

Let us examine approximate inference for a particular arm k ∈ [K]. At any time t ∈ [T], assume this arm has been pulled n_k(t) times previously, and hence we have a vector of reward samples r_k(t) = {r_k^{(1)}, ..., r_k^{(n_k(t))}} observed until time t. Additionally, assume we have n_k(t) samples of an auxiliary variable λ_k(t) = {λ_k^{(1)}, ..., λ_k^{(n_k(t))}} where λ_k ∼ S_{α/2}(1, 1, 0).

Recall that r_k ∼ S_α(0, σ, µ_k) for an unknown (but fixed) µ_k. From the SMiN representation, we know that r_k is conditionally Gaussian given the auxiliary variable λ_k, that is p(r_k | λ_k, µ_k) ∼ N(µ_k, λ_k σ²). We can then obtain the conditional likelihood as the following:

$$p(\mathbf{r}_k(t) \mid \boldsymbol{\lambda}_k(t), \mu_k) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n_k(t)} \frac{(r_k^{(i)} - \mu_k)^2}{\lambda_k^{(i)}} \right). \qquad (10)$$

We can now assume a conjugate prior over µ_k, which is a normal distribution with mean µ_k^0 and variance σ². We then obtain the posterior density for µ_k as (full derivation in the Appendix):

$$p(\mu_k \mid \mathbf{r}_k(t), \boldsymbol{\lambda}_k(t)) \propto \mathcal{N}\left(\mu_k(t), \sigma_k^2(t)\right),$$
$$\text{where } \sigma_k^2(t) = \frac{\sigma^2}{\sum_{i=1}^{n_k(t)} \frac{1}{\lambda_k^{(i)}} + 1}, \qquad \mu_k(t) = \frac{\sum_{i=1}^{n_k(t)} \frac{r_k^{(i)}}{\lambda_k^{(i)}} + \mu_k^0}{\sum_{i=1}^{n_k(t)} \frac{1}{\lambda_k^{(i)}} + 1}. \qquad (11)$$

We know that σ_k²(t) > 0 since λ_k^{(i)} > 0 ∀i, as they are samples from a positive stable distribution (β = 1). Given r_k(t) and µ_k, we also know that the individual elements of λ_k(t) are independent, which provides us with the following decomposition for the conditional density of λ_k(t):

$$p(\boldsymbol{\lambda}_k(t) \mid \mathbf{r}_k(t), \mu_k) = \prod_{i=1}^{n_k(t)} p(\lambda_k^{(i)} \mid r_k^{(i)}, \mu_k), \quad \text{where } p(\lambda_k^{(i)} \mid r_k^{(i)}, \mu_k) \propto \mathcal{N}(r_k^{(i)} \mid \mu_k, \lambda_k^{(i)}\sigma^2)\, f_{\alpha/2,1}(\lambda_k^{(i)}). \qquad (12)$$

Here, f_{α,β}(·) is the density of a random variable following S_α(β, 1, 0). Our stepwise posterior sampling routine is hence as follows. At any time t, after arm k is pulled and we receive reward r_k^{(n_k(t))}, we set r_k(t) = [r_k(t − 1), r_k^{(n_k(t))}]. Then for a fixed Q iterations, we repeat:

1. For i ∈ [1, n_k(t)], draw λ_k^{(i)} ∼ p(λ_k^{(i)} | r_k^{(i)}, µ_k(t)).

2. Draw µ_k(t) ∼ p(µ_k | r_k(t), λ_k(t)).

Sampling from the conditional posterior of µ_k is straightforward since it is Gaussian. To sample from the complicated posterior of λ_k^{(i)}, we utilize rejection sampling.
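The µ_k-step of this routine is a plain Gaussian draw; a minimal sketch implementing Equation (11) is given below (the helper name and arguments are ours).

```python
import numpy as np

def draw_mu_posterior(rewards, lambdas, mu0, sigma, rng=None):
    """Draw mu_k from its conditional posterior (Equation 11) given the rewards r_k^(i)
    and the auxiliary samples lambda_k^(i) for one arm."""
    rng = np.random.default_rng() if rng is None else rng
    rewards, lambdas = np.asarray(rewards, float), np.asarray(lambdas, float)
    D = np.sum(1.0 / lambdas) + 1.0          # sum_i 1/lambda_i + 1
    N = np.sum(rewards / lambdas) + mu0      # sum_i r_i/lambda_i + mu_k^0
    return rng.normal(loc=N / D, scale=sigma / np.sqrt(D))
```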


Algorithm 2 α-Thompson Sampling

1: Input: Arms k ∈ [K], priors N(µ_k^0, σ²) for each arm.
2: Set D_k = 1, N_k = 0 for each arm k.
3: for each iteration t ∈ [1, T] do
4:   Draw µ_k(t) ∼ N((µ_k^0 + N_k)/D_k, σ²/D_k) for each arm k.
5:   Choose arm A_t = arg max_{k∈[K]} µ_k(t), and get reward r_t.
6:   for q ∈ [0, Q) do
7:     Calculate v_{A_t}^{(t)} = r_t − µ_{A_t}.
8:     Draw λ_k^{(t)} following Section 3.3.
9:     Set D_q = D_k + 1/λ_k^{(t)}, N_q = N_k + r_t/λ_k^{(t)}.
10:    Draw µ_{A_t} ∼ N((µ_k^0 + N_q)/D_q, σ²/D_q).
11:  end for
12:  Set D_k = D_k + 1/λ_k^{(t)}, N_k = N_k + r_t/λ_k^{(t)}.
13: end for

3.3 Rejection Sampling for λ_k^{(i)}

Sampling directly from the posterior is intractable since it is not analytical. Therefore, to sample λ_k^{(i)} we follow the pipeline described in [GK99]. We note that the likelihood of the mean-normalized reward v_k^{(i)} = r_k^{(i)} − µ_k(t) forms a valid rejection function since it is bounded:

$$p\left(v_k^{(i)} \mid 0, \lambda_k^{(i)}\sigma^2\right) \leq \frac{1}{|v_k^{(i)}|\sqrt{2\pi}} \exp(-1/2), \qquad (13)$$

since v_k^{(i)} ∼ N(0, λ_k^{(i)}σ²). Thus, we get the procedure:

1. Draw λ_k^{(i)} ∼ S_{α/2}(1, 1, 0) (using Algorithm 1).

2. Draw u ∼ U(0, (|v_k^{(i)}|√(2π))^{-1} exp(−1/2)).

3. If u > p(v_k^{(i)} | 0, λ_k^{(i)}σ²), reject λ_k^{(i)} and go to Step 1.
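This procedure can be sketched in code as follows, assuming the hypothetical cms_sample helper from Section 2.2.1 is in scope; this is an illustrative implementation of the three steps above, not the authors' code.

```python
import numpy as np
from scipy.stats import norm

def draw_lambda_posterior(v, alpha, sigma, rng=None):
    """Rejection-sample lambda from p(lambda | v) ∝ N(v | 0, lambda*sigma^2) * f_{alpha/2,1}(lambda),
    where v = r - mu_k(t) is the mean-normalized reward."""
    rng = np.random.default_rng() if rng is None else rng
    v = float(v) if v != 0 else 1e-12                       # guard against a degenerate zero residual
    # Envelope from Equation (13): the Gaussian likelihood, as a function of lambda,
    # is bounded by exp(-1/2) / (|v| * sqrt(2*pi)).
    bound = np.exp(-0.5) / (abs(v) * np.sqrt(2 * np.pi))
    while True:
        lam = cms_sample(alpha / 2, beta=1.0, size=1, rng=rng)[0]  # proposal from the prior
        u = rng.uniform(0.0, bound)
        if u <= norm.pdf(v, loc=0.0, scale=sigma * np.sqrt(lam)):
            return lam
```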

Combining all these steps, we can now outline our algorithm, α-Thompson Sampling (α-TS), as described in Algorithm 2.

It is critical to note that in Algorithm 2, we do not redraw the full vector λ_k(t) at every iteration, but only the component corresponding to the last obtained reward. This is done to accelerate the inference process, and while it leads to a slower convergence of the posterior, we observe that it performs well in practice. Alternatively, one can re-sample λ_k(t) over a fixed window of the previous rewards, to prevent the runtime from increasing linearly while enjoying faster convergence.
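Putting the pieces together, below is a minimal, untested sketch of one round of α-TS (Algorithm 2), using the hypothetical helpers sketched earlier. The arrays D and N hold, per arm, the accumulated sums Σ 1/λ (plus the prior's 1) and Σ r/λ from lines 2, 9 and 12 of the algorithm; initialize D = np.ones(K), N = np.zeros(K), and let mu0 be the array of prior means.

```python
import numpy as np

def alpha_ts_round(D, N, mu0, sigma, alpha, pull_arm, Q=50, rng=None):
    """One round of alpha-TS over K arms; pull_arm(k) returns an observed reward."""
    rng = np.random.default_rng() if rng is None else rng
    mu = rng.normal(loc=(mu0 + N) / D, scale=sigma / np.sqrt(D))  # posterior draw per arm (line 4)
    a = int(np.argmax(mu))                                        # chosen arm (line 5)
    r = pull_arm(a)
    mu_a, lam = mu[a], 1.0
    for _ in range(Q):                                            # inner refinement loop (lines 6-11)
        lam = draw_lambda_posterior(r - mu_a, alpha, sigma, rng=rng)
        Dq, Nq = D[a] + 1.0 / lam, N[a] + r / lam
        mu_a = rng.normal(loc=(mu0[a] + Nq) / Dq, scale=sigma / np.sqrt(Dq))
    D[a] += 1.0 / lam                                             # commit the last sample (line 12)
    N[a] += r / lam
    return a, r
```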

3.4 Regret Analysis

In this section, we derive an upper bound on the finite-time Bayesian Regret (BR) incurred by the α-TS algorithm. We continue with the notation used in previous sections, and assume a K-armed bandit with T maximum trials. Each arm k follows an α-stable reward S_α(0, σ, µ_k), and without loss of generality, let µ* = max_{k∈[K]} µ_k denote the maximum mean reward, attained by the optimal arm(s).


Algorithm 3 Robust α-Thompson Sampling

1: Input: Arms k ∈ [K], priors N(µ_k^0, σ²) for each arm.
2: Set D_k = 1, N_k = 0 for each arm k.
3: for each iteration t ∈ [1, T] do
4:   Draw µ_k(t) ∼ N((µ_k^0 + N_k)/D_k, σ²/D_k) for each arm k.
5:   Choose arm A_t = arg max_k µ_k(t), and get reward r_t. If |r_t| > (H(ε, α, σ) · i / (2 log T))^{1/(1+ε)}, where i is the number of pulls of arm A_t, set r_t = 0.
6:   for q ∈ [0, Q) do
7:     Calculate v_{A_t}^{(t)} = r_t − µ_{A_t}.
8:     Sample λ_k^{(t)} following Section 3.3.
9:     Set D_q = D_k + 1/λ_k^{(t)}, N_q = N_k + r_t/λ_k^{(t)}.
10:    Sample µ_{A_t} ∼ N((µ_k^0 + N_q)/D_q, σ²/D_q).
11:  end for
12:  Set D_k = D_k + 1/λ_k^{(t)}, N_k = N_k + r_t/λ_k^{(t)}.
13:  Recompute robust mean.
14: end for

Theorem 1 (Regret Bound). Let K > 1, α ∈ (1, 2), σ ∈ ℝ⁺, and µ_k ∈ [−M, M] for all k ∈ [K]. For a K-armed bandit with rewards for each arm k drawn from S_α(β, σ, µ_k), we have, asymptotically, for ε chosen a priori such that ε → (α − 1)⁻,

$$\mathrm{BayesRegret}(T, \pi_{TS}) = O\left(K^{\frac{1}{1+\epsilon}}\, T^{\frac{1+\epsilon}{1+2\epsilon}}\right)$$

Proof-sketch. We first utilize the characteristic function representation of the probability density to obtain the centered (1 + ε) moment of the reward distribution for each arm. We use this moment to derive a concentration bound on the deviation of the empirical mean of α-stable densities. Next, we proceed with the decomposition of the Bayesian Regret in terms of upper-confidence bounds, as done in [RVR14]. Using our concentration result, we set an upper confidence bound at any time t for any arm k, and then obtain the result via algebraic manipulation. The complete proof has been omitted here for brevity, but is included in detail in the appendix.

We note the following: First, the only additional assumption we make on the reward distributions is that the true means are bounded, which is a standard assumption [RVR14] and easily realizable in most application cases. Next, given the established polynomial deviations of the empirical mean, obtaining a polylogarithmic regret is not possible, and the selection of ε governs the finite-time regret. As ε → α − 1, we see that while the growth of T^{(1+ε)/(1+2ε)} decreases, the constants in the finite-time expression grow, and are not finite at ε = α − 1. This behavior arises from the non-compactness of the set of finite moments for α-stable distributions (see Appendix for detailed analysis).

Compared to the problem-independent regret bound of $O(\sqrt{KT \log T})$ for Thompson Sampling on the multi-armed Gaussian bandit problem demonstrated by [AG13], our bound differs in two aspects: first, we admit a K^{1/(1+ε)} complexity in the number of arms, which, contrasted with the Gaussian bandit, is identical when ε → 1. Second, we have a superlinear dependence of order T^{(1+ε)/(1+2ε)}. Compared to the Gaussian case, we see that we obtain a polynomial regret instead of a polylogarithmic regret, which can be attributed to the polynomial concentration of heavy-tailed distributions.

In the next section, we address the issue of polynomial concentration by utilizing the more robust, truncated mean estimator instead of the empirical mean, and obtain a modified, robust version of α-TS.


3.5 Robust α-Thompson Sampling

Assume that for all arms k ∈ [K], µ_k ≤ M. Note that this assumption is equivalent to the boundedness assumption in the analysis of α-TS, and is a reasonable assumption to make in any practical scenario with some domain knowledge of the problem. Let δ ∈ (0, 1), ε ∈ (0, α − 1). Now, consider the truncated mean estimator r_k^*(t) given by:

$$r_k^*(t) = \frac{1}{n_k(t)} \sum_{i=1}^{n_k(t)} r_k^{(i)}\, \mathbb{1}\left\{ |r_k^{(i)}| \leq \left( \frac{H(\epsilon, \alpha, \sigma) \cdot i}{2\log(T)} \right)^{\frac{1}{1+\epsilon}} \right\},$$
$$\text{where } H(\epsilon, \alpha, \sigma) = \frac{\epsilon\left( M \cdot \Gamma(-\epsilon/\alpha) + \sigma\alpha\, \Gamma\!\left(1 - \frac{\epsilon+1}{\alpha}\right) \right)}{\sigma\alpha \sin\left(\frac{\pi\epsilon}{2}\right)\Gamma(1-\epsilon)} \qquad (14)$$

r_k^*(t) then describes a truncated mean estimator for an arm k (pulled n_k(t) times), where a reward r_k^{(i)} at any trial i of the arm is discarded if its magnitude is larger than the bound. Intuitively, we see that this truncated mean will prevent outliers from affecting the posterior. We choose such a form of the truncation since it allows us to obtain an exponential concentration for α-stable densities, which will assist us in obtaining polylogarithmic regret.
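A minimal sketch of this truncation rule, following Equation (14), is given below; scipy.special.gamma handles the negative non-integer Gamma arguments, and the function names are our own. The same per-trial threshold is what Robust α-TS (Algorithm 3) applies to each incoming reward before the posterior update.

```python
import numpy as np
from scipy.special import gamma as Gamma

def trunc_threshold(i, T, eps, alpha, sigma, M):
    """Truncation level for the i-th reward of an arm (Equation 14)."""
    H = (eps * (M * Gamma(-eps / alpha) + sigma * alpha * Gamma(1 - (eps + 1) / alpha))) / \
        (sigma * alpha * np.sin(np.pi * eps / 2) * Gamma(1 - eps))
    return (H * i / (2 * np.log(T))) ** (1.0 / (1.0 + eps))

def truncated_mean(rewards, T, eps, alpha, sigma, M):
    """Truncated mean estimator r*_k(t): rewards whose magnitude exceeds the
    per-trial threshold are dropped (treated as zero)."""
    rewards = np.asarray(rewards, dtype=float)
    idx = np.arange(1, len(rewards) + 1)
    keep = np.abs(rewards) <= trunc_threshold(idx, T, eps, alpha, sigma, M)
    return np.sum(rewards * keep) / len(rewards)
```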

As can be seen, the corresponding Robust α-TS algorithm is identical to the basic α-TS algorithm, except for this step of rejecting a reward (and replacing it with 0), and is outlined in Algorithm 3. We now describe the regret incurred by this modified α-TS algorithm.

Theorem 2 (Regret Bound). Let K > 1, α ∈ (1, 2), σ ∈ ℝ⁺, and µ_k ∈ [−M, M] for all k ∈ [K]. For a K-armed bandit with rewards for each arm k drawn from S_α(β, σ, µ_k), we have, for ε chosen a priori such that ε → (α − 1)⁻ and the truncated estimator given in Equation (30),

$$\mathrm{BayesRegret}(T, \pi_{RTS}) = O\left((KT)^{\frac{1}{1+\epsilon}}\right)$$

Proof-sketch. The truncated mean estimator can be used to obtain an exponential concentration bound on the empirical mean. This can be understood intuitively: by rejecting values that are very far away from the mean, our empirical mean is robust to outliers, and requires fewer samples to concentrate around the true mean. Using the concentration result, we follow an analysis identical to Theorem 1, with exponential concentration in the upper confidence bounds. The full proof is deferred to the Appendix for brevity.

We see that this bound is tight: when α = 2 (Gaussian), ε can be set to 1, and we see that this matches the upper bound of $O(\sqrt{KT \log T})$ [RVR14] for the Gaussian case, which has also been shown to be optimal [AG13]. Algorithm 3 is hence more robust, with performance improvements increasing as α decreases: the likelihood of obtaining confounding outliers increases as α → 1, and these can perturb the posterior mean in the naive α-TS algorithm. In the next section, we discuss some experimental results that cement the value of α-TS for α-stable bandits.

4 Experiments

We conduct simulations with the following benchmarks: (i) an ε-greedy agent with linearly decreasing ε, (ii) regular TS with Gaussian priors and a Gaussian assumption on the data (Gaussian-TS), (iii) Robust-UCB [BCBL13] for heavy-tailed distributions using a truncated mean estimator, and (iv) α-TS and (v) Robust α-TS, both with Q (the number of iterations for sampling) set to 50.

Setting Priors for TS: In all the Thompson Sampling benchmarks, setting the priors is crucial to the algorithm's performance. In the competitive benchmarking, we randomly set the priors for each arm from the same range we use for setting the mean rewards.


Figure 2: (A) Competitive benchmarking for α = 1.3, and (B) α = 1.8; (C) ablation studies for varying α, and (D) varying prior strength. Shaded areas denote variance scaled by 0.25 in (A) and (B), and scaled by 0.5 in (C) and (D). Figures are best viewed in color.

4.1 Performance against Competitive Benchmarks

We run 100 MAB experiments each for all 5 benchmarks, for α = 1.8 and α = 1.3, with K = 50 arms; for each arm, the mean is drawn randomly from [0, 2000] for each experiment, and σ = 2500. Each experiment is run for T = 5000 iterations, and we report the regret averaged over time, i.e. R(t)/t at any time t.

In Figures 2A and 2B, we see the results of the regret averaged over all 100 experiments. We observe that for α = 1.3, there are more substantial variations in the regret (low α implies heavy outliers), yet both our algorithms comfortably outperform the other baselines.

In the case of α = 1.8, the variations are not as substantial, but the performance follows the same trend. It is important to note that regular Thompson Sampling (Gaussian-TS) performs competitively, although in our experiments, we observed that when K is large, the algorithm often concentrates on the incorrect arm, and subsequently earns a larger regret.

Intuitively, we can see that whenever arms have mean rewards close to each other (compared to the scale of the rewards), ε-greedy will often converge to the incorrect arm. Robust-UCB, however, makes very weak assumptions on the data distributions, and hence has a much larger exploration phase, leading to larger regret. Compared to α-TS, Robust α-TS is less affected by outlying rewards (as is more evident for α = 1.3 vs. α = 1.8) and hence converges faster.

4.2 Ablation Studies

We additionally run two types of ablation studies. First, we compare the performance of α-TS on the identical setup as before (same K and reward distributions), but with varying values of α ∈ (1, 2). We report the expected time-normalized regret averaged over 100 trials in Figure 2C, and observe that (i) as α increases, the asymptotic regret decreases faster, and (ii) as expected, for lower values of α there is a substantial amount of variation in the regret.

Secondly, we compare the effect of the sharpness of the priors. In the previous experiments, the priors are drawn randomly from the same range as the means, without any additional information. However, by introducing more information about the means through the priors, we can expect better performance. In this experiment, for each mean µ_k, we randomly draw the prior mean µ_k^0 from [µ_k − δ, µ_k + δ], and observe the regret after T = 15K trials for δ from 50 to 1000. The results for this experiment are summarized in Figure 2D for K = 10 and σ = 25, and results are averaged over 25 trials each. We see that with uninformative priors, α-TS performs competitively, and only gets better as the priors get sharper.


5 Related Work

A version of the UCB algorithm [ACBF02] has been proposed in [BCBL13], coupled with several robust mean estimators, to obtain Robust-UCB algorithms with optimal problem-dependent (i.e. dependent on the individual µ_k's) regret when rewards are heavy-tailed. However, the optimal version of their algorithm has a few shortcomings that α-TS addresses: first, their algorithm requires the median-of-means estimator, which has a space complexity of O(log δ⁻¹) and time complexity of O(log log δ⁻¹) per update, where δ is the confidence factor. This makes the algorithm expensive as the confidence is increased (for accurate prediction). Second, their regret bound has a dependence of (µ* − µ_k)^{1/(1−α)}, which can become arbitrarily large for a stable distribution with α close to 1. Finally, there is no mechanism to incorporate prior information, which can be advantageous, as seen even with weak priors. [VLZ13] introduce a deterministic exploration-exploitation algorithm, which achieves the same order of regret as Robust-UCB for heavy-tailed rewards.

There has been work on analysing Thompson Sampling for specific Pareto and Weibull heavy-tailed distributions in [KKM13]; however, they do not provide an efficient posterior update rule and rely on approximate inference under the Jeffreys prior. More importantly, the Weibull and Pareto distributions typically have "lighter" tails owing to the existence of more higher-order moments, and hence cannot typically be used to model very heavy-tailed signals.

In related problems, [YSLK] provide a purely exploratory algorithm for best-arm identification under heavy-tailed rewards, using a finite (1 + ε)-th moment assumption. Similarly, [SYKL18, MY16] explore heavy-tailed rewards in the linear bandit setting.

6 Conclusion

In this paper, we first designed a framework for efficient posterior inference for the α-stable family, which has largely been ignored in the bandits literature owing to its intractable density function. We formulated the first polynomial problem-independent regret bounds for Thompson Sampling with α-stable densities, and subsequently improved the regret bound to achieve the optimal regret identical to the sub-Gaussian case, providing an efficient framework for decision-making for these distributions.

Additionally, our intermediary concentration results provide a starting point for other machine learning problems that may be investigated in α-stable settings. There is ample evidence to support the existence of α-stability in various modeling problems across economics [Fra], finance [BT03] and behavioral studies [MCM+13]. With tools such as ours, we hope to enable scientific conclusions in problems that cannot make sub-Gaussian assumptions, leading to more robust empirical findings. Future work may include viewing more involved decision-making processes, such as MDPs, in the same light, leading to more (distributionally) robust algorithms.

Acknowledgements: The authors would like to sincerely thank Peter Krafft, Yan Leng, Dhaval Adjodah, Ziv Epstein, Michiel Bakker, Ishaan Grover and Spandan Madan for their comments and suggestions for this manuscript, and MIT Quest for Intelligence for their generous support with computational resources. We also would like to acknowledge Tor Lattimore and Csaba Szepesvári's text on bandit algorithms [LS] for its excellent treatment of bandit algorithms.


Appendix

Posterior Distribution of µ_k

At any time t ∈ [T] and arm k ∈ [K], assume this arm has been pulled n_k(t) times previously, and hence we have a vector of reward samples r_k(t) = {r_k^{(1)}, ..., r_k^{(n_k(t))}} observed until time t. Additionally, assume we have n_k(t) samples of an auxiliary variable λ_k(t) = {λ_k^{(1)}, ..., λ_k^{(n_k(t))}} where λ_k ∼ S_{α/2}(1, 1, 0). Recall that r_k ∼ S_α(0, σ, µ_k) for an unknown (but fixed) µ_k.

Figure 3: Directed graphical model for arm k where n_k(t) = 4.

From the SMiN representation, we know that r_k is conditionally Gaussian given the auxiliary variable λ_k, that is p(r_k | λ_k, µ_k) ∼ N(µ_k, λ_k σ²). We can then obtain the conditional likelihood as

$$p(\mathbf{r}_k(t) \mid \boldsymbol{\lambda}_k(t), \mu_k) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n_k(t)} \frac{(r_k^{(i)} - \mu_k)^2}{\lambda_k^{(i)}} \right). \qquad (15)$$

We can now assume a conjugate prior over µ_k, which is a normal distribution with mean µ_k^0 and variance σ². By Bayes' rule, we know that

$$\begin{aligned}
p(\mu_k \mid \boldsymbol{\lambda}_k(t), \mathbf{r}_k(t)) &\propto p(\mathbf{r}_k(t) \mid \mu_k, \boldsymbol{\lambda}_k(t))\, p(\mu_k \mid \boldsymbol{\lambda}_k(t)) && (16) \\
&\overset{(a)}{\propto} p(\mathbf{r}_k(t) \mid \mu_k, \boldsymbol{\lambda}_k(t))\, p(\mu_k) && (17) \\
&\propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n_k(t)} \frac{(r_k^{(i)} - \mu_k)^2}{\lambda_k^{(i)}} \right) \cdot \exp\left( -\frac{(\mu_k - \mu_k^0)^2}{2\sigma^2} \right) && (18) \\
&\propto \exp\left( -\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{n_k(t)} \frac{(r_k^{(i)} - \mu_k)^2}{\lambda_k^{(i)}} + (\mu_k - \mu_k^0)^2 \right] \right) && (19) \\
&\propto \exp\left( -\frac{1}{2\sigma^2} \left[ \mu_k^2 \left( \sum_{i=1}^{n_k(t)} \frac{1}{\lambda_k^{(i)}} + 1 \right) - 2\mu_k \left( \sum_{i=1}^{n_k(t)} \frac{r_k^{(i)}}{\lambda_k^{(i)}} + \mu_k^0 \right) \right] \right) && (20) \\
&\propto \exp\left( -\frac{\mu_k^2 - 2\mu_k \frac{\sum_{i=1}^{n_k(t)} \frac{r_k^{(i)}}{\lambda_k^{(i)}} + \mu_k^0}{\sum_{i=1}^{n_k(t)} \frac{1}{\lambda_k^{(i)}} + 1}}{2\sigma^2 \left( \sum_{i=1}^{n_k(t)} \frac{1}{\lambda_k^{(i)}} + 1 \right)^{-1}} \right) && (21) \\
&\propto \exp\left( -\frac{\left( \mu_k - \frac{\sum_{i=1}^{n_k(t)} \frac{r_k^{(i)}}{\lambda_k^{(i)}} + \mu_k^0}{\sum_{i=1}^{n_k(t)} \frac{1}{\lambda_k^{(i)}} + 1} \right)^2}{2\sigma^2 \left( \sum_{i=1}^{n_k(t)} \frac{1}{\lambda_k^{(i)}} + 1 \right)^{-1}} \right). && (22)
\end{aligned}$$

Here, (a) follows from the independence of λ_k(t) and µ_k. Comparing the above to a standard normal distribution, we then obtain the posterior density for µ_k as

$$p(\mu_k \mid \mathbf{r}_k(t), \boldsymbol{\lambda}_k(t)) \propto \mathcal{N}\left(\mu_k(t), \sigma_k^2(t)\right),$$
$$\text{where } \sigma_k^2(t) = \frac{\sigma^2}{\sum_{i=1}^{n_k(t)} \frac{1}{\lambda_k^{(i)}} + 1}, \qquad \mu_k(t) = \frac{\sum_{i=1}^{n_k(t)} \frac{r_k^{(i)}}{\lambda_k^{(i)}} + \mu_k^0}{\sum_{i=1}^{n_k(t)} \frac{1}{\lambda_k^{(i)}} + 1}. \qquad (23)$$

Regret Analysis: α-Thompson Sampling

To derive a bound on the regret of α-Thompson Sampling, we first state a few elementary properties of α-stable distributions.

Lemma 1 ([BHW05]). If X ∼ S_α(0, σ_1, µ_1) and Y ∼ S_α(0, σ_2, µ_2), then X + Y ∼ S_α(0, σ, µ), where

$$\sigma = (\sigma_1^\alpha + \sigma_2^\alpha)^{1/\alpha}, \qquad \mu = \mu_1 + \mu_2.$$

Lemma 2 ([BHW05]). If X ∼ S_α(0, σ, µ), then for a ≠ 0, b ∈ ℝ,

$$aX + b \sim S_\alpha(0, |a|\sigma, a\mu + b).$$

By composing these two results, we can obtain a representation for the empirical mean of a sequence of i.i.d. samples from an α-stable distribution, as described next.

Lemma 3. Let X_1, X_2, ..., X_n be an i.i.d. sequence of n random variables following S_α(0, σ, µ), and let X̄ denote the empirical mean:

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

Then X̄ ∼ S_α(0, σ^*, µ^*), where

$$\sigma^* = \sigma n^{\left(\frac{1}{\alpha} - 1\right)}, \qquad \mu^* = \mu.$$


Proof. Let Y = ∑_{i=1}^n X_i. Then Y ∼ S_α(0, σn^{1/α}, nµ), by Lemma 1. Then, X̄ = Y/n, and the result follows from Lemma 2.

Next, we state an important result that allows us to obtain the moments of symmetric, zero-mean α-stable distributions.

Lemma 4 (Theorem 4 of [SN93]). For X ∼ S_α(0, σ, 0), p ∈ (0, α),

$$\mathbb{E}[|X|^p] = C(p, \alpha)\, \sigma^{\frac{p}{\alpha}}, \quad \text{where } C(p, \alpha) = \frac{2^{p+1}\,\Gamma\left(\frac{p+1}{2}\right)\Gamma(-p/\alpha)}{\alpha\sqrt{\pi}\,\Gamma(-p/2)}, \quad \text{and } \Gamma(x) = \int_0^\infty t^{x-1}e^{-t}\,dt.$$

The next two results allow us to obtain a concentration bound on the empirical mean of a finite number of samples of a symmetric α-stable distribution.

Lemma 5 (Lemma 3 of [BCBL13]). Let X_1, ..., X_n be an i.i.d. sequence of random variables with finite mean µ, finite (1 + ε) centered moments E[|X − µ|^{1+ε}] ≤ v_ε and finite raw moments E[|X|^{1+ε}] ≤ u_ε, for ε ∈ (0, 1]. Let µ̄ denote the empirical mean:

$$\bar{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

Then, for any δ ∈ (0, 1), we have, with probability at least 1 − δ,

$$\bar{\mu} \leq \mu + \left(\frac{3 v_\epsilon}{\delta n^\epsilon}\right)^{\frac{1}{1+\epsilon}}$$

Lemma 6. Let X_1, ..., X_n be an i.i.d. sequence of random variables following S_α(0, σ, µ) for α ∈ (1, 2). Let µ̄ denote the empirical mean:

$$\bar{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

Then, for any δ ∈ (0, 1) and ε ∈ (0, α − 1), we have, with probability at least 1 − δ,

$$|\bar{\mu} - \mu| \leq \sigma^{\frac{1}{\alpha}} \left(\frac{2\, C(1+\epsilon, \alpha)}{\delta n^\epsilon}\right)^{\frac{1}{1+\epsilon}}$$

Proof. From Lemma 2, we know that if X ∼ S_α(0, σ, µ) then X − µ ∼ S_α(0, σ, 0). Applying this in Lemma 4, we get v_ε = C(1 + ε, α)σ^{(1+ε)/α}. Note that the raw moments of order 1 + ε < α are also finite for α-stable densities, therefore u_ε < ∞ ∀ε ∈ (0, α − 1) and Lemma 5 can be applied. Thus, an application of Lemma 5 for both tails with probability δ/2 gives the desired result.

A few remarks are in order. First, we note that this bound is tight, and while it introduces polynomial deviations for the empirical mean, this is not an artefact of the proof method, but of the heaviness of the tails themselves [BCBL13].

Using this concentration result, we can now proceed to proving the regret bound. Our analysis is inspired by the sub-Gaussian case analysed in [RVR14], and we prove a few important results within the main proof.


Theorem 1 (Regret Bound). Let K > 1, α ∈ (1, 2), σ ∈ ℝ⁺, and µ_i ∈ [−M, M] for all i ∈ [K]. For a K-armed bandit with rewards for each arm drawn from S_α(0, σ, µ_i), we have, for finite time T:

$$\mathrm{BR}(T, \pi_{TS}) \leq \inf_{\epsilon < \alpha - 1} \left\{ 2KMT + 2\left(g(\epsilon, \alpha)\right)^{\frac{1}{1+\epsilon}} K^{\frac{1}{1+\epsilon}}\, T^{\frac{1+\epsilon}{1+2\epsilon}} \right\}$$

Asymptotically, for ε chosen a priori such that ε → (α − 1)⁻,

$$\mathrm{BayesRegret}(T, \pi_{TS}) = O\left(K^{\frac{1}{1+\epsilon}}\, T^{\frac{1+\epsilon}{1+2\epsilon}}\right)$$

Proof. Consider a K-armed bandit with rewards for arm k drawn from S_α(0, σ, µ_k). Let n_k(t) denote the number of times arm k has been pulled until time t. Then t − 1 = ∑_{k=1}^K n_k(t). Let us denote the empirical average reward for arm k up to (and including) time t − 1 as r̄_k(t), and denote the arm pulled at any time t as a_t, and the optimal arm as a_t^*. We then set an upper confidence bound for arm k at any time t as

$$U_k(t) = \mathrm{clip}_{[-M, M]}\left[ \bar{r}_k(t) + \sigma^{\frac{1}{\alpha}} \left( \frac{2C(1+\epsilon, \alpha)}{\delta\, n_k(t)^\epsilon} \right)^{\frac{1}{1+\epsilon}} \right] \qquad (24)$$

for 0 < ε < 1, M > 0. Let E be the event that for all k ∈ [K] arms, over all iterations t ∈ [T], we have:

$$|\bar{r}_k(t) - \mu_k| \leq \sigma^{\frac{1}{\alpha}} \left( \frac{2C(1+\epsilon, \alpha)}{\delta\, n_k(t)^\epsilon} \right)^{\frac{1}{1+\epsilon}}. \qquad (25)$$

Lemma 7. For the setup described above, we have, for event E and δ ∈ (0, 1),

$$\mathbb{P}(E^c) \leq KT\delta.$$

Proof. From Lemma 6, for arm k ∈ [K] at time t ∈ [T]:

$$\mathbb{P}\left( |\bar{r}_k(t) - \mu_k| \leq \sigma^{\frac{1}{\alpha}} \left( \frac{2C(1+\epsilon, \alpha)}{\delta\, n_k(t)^\epsilon} \right)^{\frac{1}{1+\epsilon}} \right) \geq 1 - \delta$$

The event E^c holds whenever the bound is violated for at least one arm k at one instance t. Therefore,

$$\mathbb{P}(E^c) \leq \mathbb{P}\left( \bigcup_{k=1,\, t=1}^{K,\, T} \left\{ |\bar{r}_k(t-1) - \mu_k| > \sigma^{\frac{1}{\alpha}} \left( \frac{2C(1+\epsilon, \alpha)}{\delta\, n_k(t)^\epsilon} \right)^{\frac{1}{1+\epsilon}} \right\} \right) \overset{(a)}{\leq} \sum_{k=1,\, t=1}^{K,\, T} \mathbb{P}\left( |\bar{r}_k(t-1) - \mu_k| > \sigma^{\frac{1}{\alpha}} \left( \frac{2C(1+\epsilon, \alpha)}{\delta\, n_k(t)^\epsilon} \right)^{\frac{1}{1+\epsilon}} \right) \overset{(b)}{\leq} KT\delta,$$

where (a) is an application of the union bound, and (b) is obtained using Lemma 6.

Using this lemma, we can now prove Theorem 1. We begin with the seminal result of [RVR14].


Lemma 8 (Proposition 1 of [RVR14]). Let π_TS be any policy followed by Thompson Sampling. For any sequence of upper confidence bounds {U_t | t ∈ ℕ},

$$\mathrm{BayesRegret}(T, \pi_{TS}) = \mathbb{E}\left[ \sum_{t=1}^{T} \left( U_{a_t}(t) - \mu_{a_t} \right) \right] + \mathbb{E}\left[ \sum_{t=1}^{T} \left( \mu_{a_t^*} - U_{a_t^*}(t) \right) \right]$$

Let us begin from Lemma 8. By the tower rule, we can condition on event E:

$$\mathrm{BayesRegret}(T, \pi_{TS}) = \mathbb{E}\left[ \sum_{t=1}^{T} (U_{a_t}(t) - \mu_{a_t}) + (\mu_{a_t^*} - U_{a_t^*}(t)) \,\middle|\, E \right] \mathbb{P}(E) + \mathbb{E}\left[ \sum_{t=1}^{T} (U_{a_t}(t) - \mu_{a_t}) + (\mu_{a_t^*} - U_{a_t^*}(t)) \,\middle|\, E^c \right] \mathbb{P}(E^c)$$

Since P(E) ≤ 1,

$$\mathrm{BayesRegret}(T, \pi_{TS}) \leq \mathbb{E}\left[ \sum_{t=1}^{T} (U_{a_t}(t) - \mu_{a_t}) + (\mu_{a_t^*} - U_{a_t^*}(t)) \,\middle|\, E \right] + \mathbb{E}\left[ \sum_{t=1}^{T} (U_{a_t}(t) - \mu_{a_t}) + (\mu_{a_t^*} - U_{a_t^*}(t)) \,\middle|\, E^c \right] \mathbb{P}(E^c)$$

When E^c holds, each term in the summation in the conditional expectation is bounded by 4M (by the clipping in Equation (24)). Therefore,

$$\begin{aligned}
\mathrm{BayesRegret}(T, \pi_{TS}) &\leq \mathbb{E}\left[ \sum_{t=1}^{T} (U_{a_t}(t) - \mu_{a_t}) + (\mu_{a_t^*} - U_{a_t^*}(t)) \,\middle|\, E \right] + 4MT \cdot \mathbb{P}(E^c) \\
&\overset{(a)}{\leq} \mathbb{E}\left[ \sum_{t=1}^{T} (U_{a_t}(t) - \mu_{a_t}) + (\mu_{a_t^*} - U_{a_t^*}(t)) \,\middle|\, E \right] + 4KMT^2\delta \\
&\overset{(b)}{\leq} 2\,\mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{T} \mathbb{1}\{A_t = k\} \left( \frac{2C(1+\epsilon, \alpha)}{\delta\, n_k(t)^\epsilon} \right)^{\frac{1}{1+\epsilon}} \right] + 4KMT^2\delta \\
&= 2\left( \frac{2C(1+\epsilon, \alpha)}{\delta} \right)^{\frac{1}{1+\epsilon}} \mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{T} \mathbb{1}\{A_t = k\} \left( \frac{1}{n_k(t)^\epsilon} \right)^{\frac{1}{1+\epsilon}} \right] + 4KMT^2\delta \\
&\overset{(c)}{\leq} 2\left( \frac{2C(1+\epsilon, \alpha)}{\delta} \right)^{\frac{1}{1+\epsilon}} \mathbb{E}\left[ \sum_{k=1}^{K} \int_{s=0}^{n_k(T)} \left( \frac{1}{s^\epsilon} \right)^{\frac{1}{1+\epsilon}} ds \right] + 4KMT^2\delta \\
&= 2(1+\epsilon)\left( \frac{2C(1+\epsilon, \alpha)}{\delta} \right)^{\frac{1}{1+\epsilon}} \mathbb{E}\left[ \sum_{k=1}^{K} n_k(T)^{\frac{1}{1+\epsilon}} \right] + 4KMT^2\delta \\
&\overset{(d)}{\leq} 4\left( \frac{2C(1+\epsilon, \alpha)}{\delta} \right)^{\frac{1}{1+\epsilon}} \mathbb{E}\left[ K^{\frac{1}{1+\epsilon}} \left( \sum_{k=1}^{K} n_k(T) \right)^{\frac{1}{1+\epsilon}} \right] + 4KMT^2\delta \\
&\overset{(e)}{\leq} 4\left( \frac{2C(1+\epsilon, \alpha)}{\delta} \right)^{\frac{1}{1+\epsilon}} (KT)^{\frac{1}{1+\epsilon}} + 4KMT^2\delta.
\end{aligned}$$

Here, (a) follows from Lemma 7, (b) follows from event E: whenever E holds, each term inside the summation is bounded by Equation (25), (c) follows from upper bounding a finite discrete sum with a definite integral, (d) follows from Hölder's inequality of order 1/(1+ε), and (e) follows from T = 1 + ∑_{k=1}^{K} n_k(T), and that 1 + ε < 2. For a finite-time analysis, we can now choose ε and δ such that the right-hand side is minimized. This concludes the proof.

Regret Analysis: Robust α-Thompson Sampling

Bounding the concentration of the robust mean estimator first requires a bound on the raw moments of symmetric α-stable densities, provided in the next two results.

Lemma 9 (Proposition 2.2 of [MP+16]). For any random variable X ∼ S_α(0, σ, 0), ε ∈ (−∞, α − 1) and ν ∈ ℝ,

$$\mathbb{E}[|X - \nu|^{1+\epsilon}] = \frac{\epsilon \cdot \sigma^{1+\epsilon}}{\sin\left(\frac{\pi\epsilon}{2}\right)\Gamma(1-\epsilon)} \left[ \frac{\nu}{\sigma} \int_0^\infty u^{-(1+\epsilon)} e^{-u^\alpha} \sin\left(\frac{\nu u}{\sigma}\right) du + \alpha \int_0^\infty u^{\alpha-\epsilon-2} e^{-u^\alpha} \cos\left(\frac{\nu u}{\sigma}\right) du \right] \qquad (26)$$

Lemma 10. For any random variable X ∼ S_α(0, σ, µ), ε ∈ (−∞, α − 1), µ ≤ M,

$$\mathbb{E}[|X|^{1+\epsilon}] \leq \frac{\epsilon \cdot \sigma^{1+\epsilon}\left( M \cdot \Gamma(-\epsilon/\alpha) + \sigma\alpha\, \Gamma\!\left(1 - \frac{\epsilon+1}{\alpha}\right) \right)}{\sigma\alpha \sin\left(\frac{\pi\epsilon}{2}\right)\Gamma(1-\epsilon)} \qquad (27)$$

Proof. Let X ∼ S_α(0, σ, µ). Then X − µ ∼ S_α(0, σ, 0). Applying Lemma 9 to X − µ with ν = −µ, ε ∈ (−∞, α − 1), we have

$$\begin{aligned}
\mathbb{E}[|X|^{1+\epsilon}] &= \mathbb{E}[|X - \mu - (-\mu)|^{1+\epsilon}] \\
&= \frac{\epsilon \cdot \sigma^{1+\epsilon}}{\sin\left(\frac{\pi\epsilon}{2}\right)\Gamma(1-\epsilon)} \left[ -\frac{\mu}{\sigma} \int_0^\infty u^{-(1+\epsilon)} e^{-u^\alpha} \sin\left(\frac{-\mu u}{\sigma}\right) du + \alpha \int_0^\infty u^{\alpha-\epsilon-2} e^{-u^\alpha} \cos\left(\frac{\mu u}{\sigma}\right) du \right] \\
&\overset{(a)}{=} \frac{\epsilon \cdot \sigma^{1+\epsilon}}{\sin\left(\frac{\pi\epsilon}{2}\right)\Gamma(1-\epsilon)} \left[ \frac{\mu}{\sigma} \int_0^\infty u^{-(1+\epsilon)} e^{-u^\alpha} \sin\left(\frac{\mu u}{\sigma}\right) du + \alpha \int_0^\infty u^{\alpha-\epsilon-2} e^{-u^\alpha} \cos\left(\frac{\mu u}{\sigma}\right) du \right] \\
&\overset{(b)}{\leq} \frac{\epsilon \cdot \sigma^{1+\epsilon}}{\sin\left(\frac{\pi\epsilon}{2}\right)\Gamma(1-\epsilon)} \left[ \frac{\mu}{\sigma} \int_0^\infty u^{-(1+\epsilon)} e^{-u^\alpha}\, du + \alpha \int_0^\infty u^{\alpha-\epsilon-2} e^{-u^\alpha}\, du \right] \\
&\overset{(c)}{=} \frac{\epsilon \cdot \sigma^{1+\epsilon}}{\sin\left(\frac{\pi\epsilon}{2}\right)\Gamma(1-\epsilon)} \left[ \frac{\mu}{\sigma\alpha} \int_0^\infty t^{-1-\epsilon/\alpha} e^{-t}\, dt + \int_0^\infty t^{-\frac{1+\epsilon}{\alpha}} e^{-t}\, dt \right] \\
&\overset{(d)}{\leq} \frac{\epsilon \cdot \sigma^{1+\epsilon}\left( M \cdot \Gamma(-\epsilon/\alpha) + \sigma\alpha\, \Gamma\!\left(1 - \frac{\epsilon+1}{\alpha}\right) \right)}{\sigma\alpha \sin\left(\frac{\pi\epsilon}{2}\right)\Gamma(1-\epsilon)}
\end{aligned}$$

Here, (a) follows from sin(−x) = −sin(x), (b) follows from sin(x) ≤ 1, cos(x) ≤ 1 ∀x, (c) follows by the substitution t = u^α, and (d) follows from µ ≤ M.

Before stating the proof, we first describe the concentration results for the truncated mean estimator, which rely on the raw moment bound above.

Lemma 11 (Lemma 1 of [BCBL13]). Let δ ∈ (0, 1), ε ∈ (0, 1], u > 0. Consider the truncated empirical mean µ̂_T defined as

$$\hat{\mu}_T = \frac{1}{n} \sum_{t=1}^{n} X_t\, \mathbb{1}\left\{ |X_t| \leq \left( \frac{u\, t}{\log(\delta^{-1})} \right)^{\frac{1}{1+\epsilon}} \right\} \qquad (28)$$

If E[|X|^{1+ε}] < u and E[X] = µ, then with probability 1 − δ,

$$\hat{\mu}_T \leq \mu + 4\, u^{\frac{1}{1+\epsilon}} \left( \frac{\log(\delta^{-1})}{n} \right)^{\frac{\epsilon}{1+\epsilon}} \qquad (29)$$

Lemma 12. For any δ ∈ (0, 1), arm k ∈ [K], time t > 0, maximum possible mean M > 0, and ε ∈ (0, α − 1), we have, with probability at least 1 − δ,

$$|r_k^*(t) - \mu_k| \leq 4\sigma \left( \frac{\epsilon\left( M\,\Gamma(-\epsilon/\alpha) + \sigma\alpha\, \Gamma\!\left(1 - \frac{\epsilon+1}{\alpha}\right) \right)}{\sigma\alpha \sin\left(\frac{\pi\epsilon}{2}\right)\Gamma(1-\epsilon)} \right)^{\frac{1}{1+\epsilon}} \left( \frac{\log(\delta^{-1})}{n} \right)^{\frac{\epsilon}{1+\epsilon}}$$

Proof. For any arm k, we know that the rewards follow S_α(0, σ, µ_k). Applying Lemma 10 to the reward distribution for arm k, we obtain

$$u = \frac{\epsilon \cdot \sigma^{1+\epsilon}\left( M \cdot \Gamma(-\epsilon/\alpha) + \sigma\alpha\, \Gamma\!\left(1 - \frac{\epsilon+1}{\alpha}\right) \right)}{\sigma\alpha \sin\left(\frac{\pi\epsilon}{2}\right)\Gamma(1-\epsilon)}.$$

Using this value of u for both tails of r_k^*(t) in Lemma 11 with probability δ/2, we obtain the result.

Using the above concentration result, we can now derive a regret bound on the Robust α-Thompson Sampling algorithm. For convenience, let

$$H(\epsilon, \alpha, \sigma) = \frac{\epsilon\left( M \cdot \Gamma(-\epsilon/\alpha) + \sigma\alpha\, \Gamma\!\left(1 - \frac{\epsilon+1}{\alpha}\right) \right)}{\sigma\alpha \sin\left(\frac{\pi\epsilon}{2}\right)\Gamma(1-\epsilon)}.$$

First, let us state the form of the robust mean estimator.

Definition 1. Let δ ∈ (0, 1), ε ∈ (0, α − 1), k ∈ [K], and let the time horizon be T. The truncated mean estimator r_k^*(t) is given by:

$$r_k^*(t) = \frac{1}{n_k(t)} \sum_{i=1}^{n_k(t)} r_k^{(i)}\, \mathbb{1}\left\{ |r_k^{(i)}| \leq \left( \frac{H(\epsilon, \alpha, \sigma) \cdot i}{2\log(T)} \right)^{\frac{1}{1+\epsilon}} \right\} \qquad (30)$$

Theorem 2 (Robust Regret Bound). Let K > 1, α ∈ (1, 2), σ ∈ ℝ⁺, and µ_k ∈ [−M, M] for all k ∈ [K]. For a K-armed bandit with rewards for each arm k drawn from S_α(β, σ, µ_k), we have, for ε chosen a priori such that ε → (α − 1)⁻ and the truncated estimator given in Equation (30),

$$\mathrm{BayesRegret}(T, \pi_{RTS}) = O\left( (KT)^{\frac{1}{1+\epsilon}} (\log T)^{\frac{\epsilon}{1+\epsilon}} \right)$$

Proof. Consider a K-armed bandit with rewards for arm k drawn from S_α(0, σ, µ_k). Let n_k(t) denote the number of times arm k has been pulled until time t. Then t − 1 = ∑_{k=1}^K n_k(t). Let us denote the empirical average reward for arm k up to (and including) time t − 1 as r̄_k(t), and denote the arm pulled at any time t as a_t, and the optimal arm as a_t^*. We then set an upper confidence bound for arm k at any time t as

$$U_k(t) = \mathrm{clip}_{[-M, M]}\left[ \bar{r}_k(t) + 4\sigma H(\epsilon, \alpha, \sigma)^{\frac{1}{1+\epsilon}} \left( \frac{\log(2/\delta)}{n_k(t)} \right)^{\frac{\epsilon}{1+\epsilon}} \right] \qquad (31)$$

for 0 < ε < 1, M > 0. Let E be the event that for all k ∈ [K] arms, over all iterations t ∈ [T], we have:

$$|\bar{r}_k(t) - \mu_k| \leq 4\sigma H(\epsilon, \alpha, \sigma)^{\frac{1}{1+\epsilon}} \left( \frac{\log(2/\delta)}{n_k(t)} \right)^{\frac{\epsilon}{1+\epsilon}}. \qquad (32)$$


Lemma 13. For the setup described above, we have, for event E and δ ∈ (0, 1),

$$\mathbb{P}(E^c) \leq KT\delta.$$

Proof. The event E^c holds whenever the bound is violated for at least one arm k at one instance t. Therefore,

$$\mathbb{P}(E^c) \leq \mathbb{P}\left( \bigcup_{k=1,\, t=1}^{K,\, T} \left\{ |\bar{r}_k(t-1) - \mu_k| > 4\sigma H(\epsilon, \alpha, \sigma)^{\frac{1}{1+\epsilon}} \left( \frac{\log(2/\delta)}{n_k(t)} \right)^{\frac{\epsilon}{1+\epsilon}} \right\} \right) \overset{(a)}{\leq} \sum_{k=1,\, t=1}^{K,\, T} \mathbb{P}\left( |\bar{r}_k(t-1) - \mu_k| > 4\sigma H(\epsilon, \alpha, \sigma)^{\frac{1}{1+\epsilon}} \left( \frac{\log(2/\delta)}{n_k(t)} \right)^{\frac{\epsilon}{1+\epsilon}} \right) \overset{(b)}{\leq} KT\delta,$$

where (a) is an application of the union bound, and (b) is obtained using Lemma 12.

We now prove Theorem 2 in a similar manner as Theorem 1, via Lemma 8. By the tower rule, we can condition on event E:

$$\mathrm{BayesRegret}(T, \pi_{RTS}) = \mathbb{E}\left[ \sum_{t=1}^{T} (U_{a_t}(t) - \mu_{a_t}) + (\mu_{a_t^*} - U_{a_t^*}(t)) \,\middle|\, E \right] \mathbb{P}(E) + \mathbb{E}\left[ \sum_{t=1}^{T} (U_{a_t}(t) - \mu_{a_t}) + (\mu_{a_t^*} - U_{a_t^*}(t)) \,\middle|\, E^c \right] \mathbb{P}(E^c) \qquad (33)$$

Since P(E) ≤ 1,

$$\mathrm{BayesRegret}(T, \pi_{RTS}) \leq \mathbb{E}\left[ \sum_{t=1}^{T} (U_{a_t}(t) - \mu_{a_t}) + (\mu_{a_t^*} - U_{a_t^*}(t)) \,\middle|\, E \right] + \mathbb{E}\left[ \sum_{t=1}^{T} (U_{a_t}(t) - \mu_{a_t}) + (\mu_{a_t^*} - U_{a_t^*}(t)) \,\middle|\, E^c \right] \mathbb{P}(E^c) \qquad (34)$$

When E^c holds, each term in the summation in the conditional expectation is bounded by 4M (Equation 31). Therefore,

$$\begin{aligned}
\mathrm{BayesRegret}(T, \pi_{RTS}) &\leq 4MT \cdot \mathbb{P}(E^c) + \mathbb{E}\left[ \sum_{t=1}^{T} (U_{a_t}(t) - \mu_{a_t}) + (\mu_{a_t^*} - U_{a_t^*}(t)) \,\middle|\, E \right] \\
&\overset{(a)}{\leq} 4KMT^2\delta + \mathbb{E}\left[ \sum_{t=1}^{T} (U_{a_t}(t) - \mu_{a_t}) + (\mu_{a_t^*} - U_{a_t^*}(t)) \,\middle|\, E \right] \\
&\overset{(b)}{\leq} 4KMT^2\delta + 8\,\mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{T} \mathbb{1}\{A_t = k\}\, \sigma H(\epsilon, \alpha, \sigma)^{\frac{1}{1+\epsilon}} \left( \frac{\log(2/\delta)}{n_k(t)} \right)^{\frac{\epsilon}{1+\epsilon}} \right] \\
&= 4KMT^2\delta + 8\sigma H(\epsilon, \alpha, \sigma)^{\frac{1}{1+\epsilon}} \log(2/\delta)^{\frac{\epsilon}{1+\epsilon}}\, \mathbb{E}\left[ \sum_{k=1}^{K} \sum_{t=1}^{T} \mathbb{1}\{A_t = k\} \left( \frac{1}{n_k(t)^\epsilon} \right)^{\frac{1}{1+\epsilon}} \right] \\
&\overset{(c)}{\leq} 4KMT^2\delta + 8\sigma H(\epsilon, \alpha, \sigma)^{\frac{1}{1+\epsilon}} \log(2/\delta)^{\frac{\epsilon}{1+\epsilon}}\, \mathbb{E}\left[ \sum_{k=1}^{K} \int_{s=0}^{n_k(T)} \left( \frac{1}{s^\epsilon} \right)^{\frac{1}{1+\epsilon}} ds \right] \\
&= 4KMT^2\delta + 8(1+\epsilon)\,\sigma H(\epsilon, \alpha, \sigma)^{\frac{1}{1+\epsilon}} \log(2/\delta)^{\frac{\epsilon}{1+\epsilon}}\, \mathbb{E}\left[ \sum_{k=1}^{K} n_k(T)^{\frac{1}{1+\epsilon}} \right] \\
&\overset{(d)}{\leq} 4KMT^2\delta + 16\sigma H(\epsilon, \alpha, \sigma)^{\frac{1}{1+\epsilon}} \log(2/\delta)^{\frac{\epsilon}{1+\epsilon}}\, \mathbb{E}\left[ K^{\frac{1}{1+\epsilon}} \left( \sum_{k=1}^{K} n_k(T) \right)^{\frac{1}{1+\epsilon}} \right] \\
&\overset{(e)}{\leq} 4KMT^2\delta + 16\sigma H(\epsilon, \alpha, \sigma)^{\frac{1}{1+\epsilon}} \log(2/\delta)^{\frac{\epsilon}{1+\epsilon}}\, (KT)^{\frac{1}{1+\epsilon}} \\
&\overset{(f)}{\leq} 4KM + 16\sigma H(\epsilon, \alpha, \sigma)^{\frac{1}{1+\epsilon}} (\log 2)^{\frac{\epsilon}{1+\epsilon}} (KT)^{\frac{1}{1+\epsilon}} + 32\sigma H(\epsilon, \alpha, \sigma)^{\frac{1}{1+\epsilon}} (\log T)^{\frac{\epsilon}{1+\epsilon}}\, (KT)^{\frac{1}{1+\epsilon}}.
\end{aligned}$$

Here, (a) follows from Lemma 13, (b) follows from event E: whenever E holds, each term inside the summation is bounded by Equation (32), (c) follows from upper bounding a finite discrete sum with a definite integral, (d) follows from Hölder's inequality of order 1/(1+ε), (e) follows from T = 1 + ∑_{k=1}^{K} n_k(T), and that 1 + ε < 2, and (f) is obtained by setting δ = 1/T². Asymptotically, for ε chosen a priori close to α − 1,

$$\mathrm{BayesRegret}(T, \pi_{RTS}) = O\left( (KT)^{\frac{1}{1+\epsilon}} (\log T)^{\frac{\epsilon}{1+\epsilon}} \right). \qquad (35)$$

We see that this bound is tight: when α = 2, ε can be set to 1, and we see that this matches the upper bound of $O(\sqrt{KT \log T})$ [RVR14] for the Gaussian case, which has also been shown to be optimal [AG13]. Hence, this alternative Thompson Sampling can be beneficial, with the performance improvements increasing with α.

References

[ACBF02] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[AG12] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.

[AG13] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.


[AM74] David F. Andrews and Colin L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society: Series B (Methodological), 36(1):99–102, 1974.

[BCBL13] Sébastien Bubeck, Nicolò Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.

[BHW05] Szymon Borak, Wolfgang Härdle, and Rafał Weron. Stable distributions. In Statistical tools for finance and insurance, pages 21–44. Springer, 2005.

[BT03] Brendan O. Bradley and Murad S. Taqqu. Financial risk and heavy tails. In Handbook of heavy tailed distributions in finance, pages 35–103. Elsevier, 2003.

[CGM+13] Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz, et al. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

[CL11] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.

[CMS76] John M. Chambers, Colin L. Mallows, and B. W. Stuck. A method for simulating stable random variables. Journal of the American Statistical Association, 71(354):340–344, 1976.

[CW03] Peter Carr and Liuren Wu. The finite moment log stable process and option pricing. The Journal of Finance, 58(2):753–777, 2003.

[Fra] John C Frain. Studies on the Application of the Alpha-stable Distribution in Economics.

[GK99] Simon Godsill and Ercan E Kuruoglu. Bayesian inference for time series with heavy-tailed symmetric α-stable noise processes. 1999.

[KKM13] Nathaniel Korda, Emilie Kaufmann, and Rémi Munos. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems, pages 1448–1456, 2013.

[KT10] Ravi Kumar and Andrew Tomkins. A characterization of online browsing behavior. In Proceedings of the 19th International Conference on World Wide Web, pages 561–570. ACM, 2010.

[Lev25] P. Lévy. Calcul des probabilités, vol. 9. Gauthier-Villars, Paris, 1925.

[LS] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms.

[MCM+13] Aniket Mahanti, Niklas Carlsson, Anirban Mahanti, Martin Arlitt, and Carey Williamson. A tale of the tails: Power-laws in internet measurements. IEEE Network, 27(1):59–64, 2013.

[MP+16] Muneya Matsui, Zbyněk Pawlas, et al. Fractional absolute moments of heavy tailed distributions. Brazilian Journal of Probability and Statistics, 30(2):272–298, 2016.

[MY16] Andrés Muñoz Medina and Scott Yang. No-regret algorithms for heavy-tailed linear bandits. In International Conference on Machine Learning, pages 1642–1650, 2016.

[RVR14] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

[SN93] Min Shao and Chrysostomos L. Nikias. Signal processing with fractional lower order moments: stable processes and their applications. Proceedings of the IEEE, 81(7):986–1010, 1993.


[SYKL18] Han Shao, Xiaotian Yu, Irwin King, and Michael R. Lyu. Almost optimal algorithms for linear stochastic bandits with heavy-tailed payoffs. In Advances in Neural Information Processing Systems, pages 8430–8439, 2018.

[Tho33] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[VLZ13] Sattar Vakili, Keqin Liu, and Qing Zhao. Deterministic sequencing of exploration and exploitation for multi-armed bandit problems. IEEE Journal of Selected Topics in Signal Processing, 7(5):759–767, 2013.

[Wik19] Wikipedia. Stable distribution — Wikipedia, the free encyclopedia, 2019. [Online; accessed 25-February-2019].

[YSLK] Xiaotian Yu, Han Shao, Michael R Lyu, and Irwin King. Pure exploration of multi-armed bandits with heavy-tailed payoffs.
