Non%26ParaBoot

  • 8/14/2019 Non%26ParaBoot

    Introduction to Monte Carlo Procedures: the Non-parametric and Parametric Bootstrap

    1. Review of the Non-parametric Bootstrap

    Given a data set, say $\{x_1, x_2, \ldots, x_n\}$, and a statistic of interest, say $\hat{\theta}$, the basic algorithm for the non-parametric bootstrap consists of the following:

    1. Resample the data with equal probability and with replacement. That is, each resampling is performed on the entire set of $n$ data points, so that each observation has probability $1/n$ of being sampled at every draw. For example, for an original sample of size 5, one bootstrap sample might be $x^* = \{x_4, x_1, x_3, x_2, x_2\}$.

    2. Calculate the statistic of interest, $\hat{\theta}^* = g(x^*)$, call the $b$th estimate $\hat{\theta}^*_b$, and store the value in a vector.

    3. Repeat (1) and (2) a large number of times.
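
    The three steps above can be sketched as follows. This is a minimal illustration in Python (the listings later in these notes are MATLAB); the function name `bootstrap_replicates`, the fixed seed, and the five data values are assumptions made for the example, not part of the original notes.

```python
import random
import statistics

def bootstrap_replicates(data, statistic, B=1000, seed=0):
    """Return B non-parametric bootstrap replicates of statistic(data)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    n = len(data)
    replicates = []
    for _ in range(B):  # step 3: repeat steps 1 and 2 B times
        # Step 1: draw n observations with replacement, each with probability 1/n
        resample = [data[rng.randrange(n)] for _ in range(n)]
        # Step 2: compute the statistic on the resample and store it
        replicates.append(statistic(resample))
    return replicates

x = [2.1, 3.4, 1.9, 5.0, 4.2]  # hypothetical sample of size 5
theta_star = bootstrap_replicates(x, statistics.median, B=1000)
print(len(theta_star), min(theta_star), max(theta_star))
```

    Passing the statistic as a function keeps the resampling loop unchanged when the median is swapped for a mean, a trimmed mean, or any other $g(x^*)$.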

    The resulting vector of bootstrap statistics then provides an estimate of the distribution of the statistic, by way of

    1. Bootstrap estimate of the expected value:

       $$\bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*_b \tag{1.1}$$

    2. Bootstrap quantiles: Let $\hat{\theta}^*_{[q]}$ represent the $q$th quantile of the bootstrap statistic. That is, take the vector of statistics produced by the bootstrap procedure and rank them from smallest to largest. The ranks of the vector then correspond to the bootstrap estimate of the quantiles of the distribution. For example, if the number of bootstrap iterations was 1000, then the 25th element of the ranked vector of bootstrap statistics is the bootstrap estimate of the 0.025 quantile of the distribution of the statistic.

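    3. The bootstrap estimate of the standard error:

       $$\widehat{se}(\hat{\theta}) = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left(\hat{\theta}^*_b - \bar{\theta}^*\right)^2} \tag{1.2}$$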

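    4. A $100(1-\alpha)\%$ confidence interval is

       $$\left[\hat{\theta}^*_{[B\alpha/2]},\ \hat{\theta}^*_{[B(1-\alpha/2)]}\right],$$

       that is, the elements at positions $B\alpha/2$ and $B(1-\alpha/2)$ of the ranked vector of bootstrap statistics.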
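
    The four summaries above can be computed from a stored vector of bootstrap statistics roughly as follows. This is a sketch in Python under stated assumptions: the function name `bootstrap_summary` is hypothetical, and the stand-in vector 1, 2, ..., 1000 is used only so that the ranks picked out by the interval are easy to see.

```python
import math

def bootstrap_summary(theta_star, alpha=0.05):
    """Mean (1.1), standard error (1.2) and percentile interval of bootstrap statistics."""
    B = len(theta_star)
    mean_star = sum(theta_star) / B  # eq. (1.1)
    se_star = math.sqrt(sum((t - mean_star) ** 2 for t in theta_star) / (B - 1))  # eq. (1.2)
    ranked = sorted(theta_star)
    lo_rank = max(1, round(B * alpha / 2))        # 25 when B = 1000 and alpha = 0.05
    hi_rank = min(B, round(B * (1 - alpha / 2)))  # 975 when B = 1000 and alpha = 0.05
    ci = (ranked[lo_rank - 1], ranked[hi_rank - 1])  # ranks are 1-based, lists 0-based
    return mean_star, se_star, ci

stand_in = list(range(1, 1001))  # hypothetical "bootstrap statistics" 1..1000
mean_star, se_star, ci = bootstrap_summary(stand_in)
print(mean_star, se_star, ci)  # ci is (25, 975) for this stand-in vector
```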

    Hypothesis testing and parameter estimation can then be carried out using the bootstrap estimates. The non-parametric bootstrap will be the best approach to inference when everything we know about the distribution comes from the sample. In the case where we know something about the distribution before we look at the sample, parametric approaches will give us better results.

    1.1. Inference with the Non-parametric Bootstrap

    Inference with the bootstrap is a direct extension of traditional inference.

    Example 1. The table below shows the results of a small experiment in which 7 mice were randomly chosen from 16 to receive a new medical treatment, while the remaining 9 were assigned to the non-treatment group. Investigators wanted to test whether the treatment prolonged life after surgery. The table shows the survival times in days.

    Group       Data (survival time, days)            n   Mean    SE of mean
    Treatment   94, 197, 16, 38, 99, 141, 23          7   86.86   25.24
    Control     52, 104, 146, 10, 51, 30, 40, 27, 46  9   56.22   14.14
    Difference                                            30.63   28.93
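
    The values in the last column of the table match standard errors of the group means, $s/\sqrt{n}$, rather than raw standard deviations (the treatment data have a sample standard deviation of about 66.8, and $66.8/\sqrt{7} \approx 25.24$). Assuming the two groups are independent, the entry for the difference can be checked as:

```latex
\widehat{se}(\bar{x}_T - \bar{x}_C)
  = \sqrt{\widehat{se}(\bar{x}_T)^2 + \widehat{se}(\bar{x}_C)^2}
  = \sqrt{25.24^2 + 14.14^2}
  = \sqrt{637.06 + 199.94}
  \approx 28.93
```

    The difference in means, 30.63, is computed from the unrounded means ($86.857 - 56.222$), which is why it differs in the last digit from $86.86 - 56.22$.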

    Say we wish to test for treatment differences, and know that the median is a better measure of the center of the distribution than the mean.

    How do we make inferences using the bootstrap?

    [Histogram: "1000 bootstrapped Differences of Treatment Medians"; x-axis: median difference; y-axis: frequency]

    Figure 1: Bootstrapped differences in the median lifetimes, in days, of mice receiving two different post-surgery treatments.

    The bootstrap 95% confidence interval was (-29, 101). What is our conclusion?
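
    A rough way to reproduce an interval of this kind, and to check whether it contains zero, is sketched below in Python. The seed is an arbitrary assumption, so the endpoints will vary slightly from run to run and from the interval quoted above; the data are those of Example 1.

```python
import random
import statistics

treatment = [94, 197, 16, 38, 99, 141, 23]
control = [52, 104, 146, 10, 51, 30, 40, 27, 46]

rng = random.Random(2019)  # arbitrary seed, assumed only for reproducibility
B = 1000
diffs = []
for _ in range(B):
    boot_t = rng.choices(treatment, k=len(treatment))  # resample with replacement
    boot_c = rng.choices(control, k=len(control))
    diffs.append(statistics.median(boot_t) - statistics.median(boot_c))

diffs.sort()
lower, upper = diffs[24], diffs[974]  # 25th and 975th ranked values
observed = statistics.median(treatment) - statistics.median(control)
contains_zero = lower <= 0 <= upper
print(observed, (lower, upper), contains_zero)
```

    If zero lies inside the percentile interval, the data do not give evidence at the 5% level that the median survival times differ.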

    2. The Parametric Bootstrap

    Sometimes we know the distribution of the sample, but we cannot derive the distribution of the statistic of interest. Sometimes we can use asymptotic approximations, but if our sample is small these may be grossly inaccurate. Furthermore, there are cases where the non-parametric bootstrap will fail.

    Can you think of one?

    In the case where we know the distribution of the sample, but not of the sample statistic, the parametric bootstrap often provides a powerful approach.

    The basic algorithm for the parametric bootstrap is as follows:

    1. Simulate a random sample of the same size as your original sample of interest, using sample estimates for the parameters.

    2. Calculate the statistic of interest, $\hat{\theta}^*$, from the simulated sample. Save the value in a vector.

    3. Repeat (1) and (2) a large number of times.
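
    As an illustration of these steps, the sketch below assumes a normal model for the data and uses the sample mean and standard deviation as the parameter estimates. It is written in Python; the function name `parametric_bootstrap_normal`, the seed, and the eight data values are hypothetical choices for the example.

```python
import random
import statistics

def parametric_bootstrap_normal(data, statistic, B=1000, seed=0):
    """Parametric bootstrap replicates of statistic under an assumed normal model."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    n = len(data)
    mu_hat = statistics.mean(data)      # sample estimates stand in for the
    sigma_hat = statistics.stdev(data)  # unknown parameters of the model
    replicates = []
    for _ in range(B):  # step 3: repeat B times
        # Step 1: simulate a sample of the original size from the fitted model
        simulated = [rng.gauss(mu_hat, sigma_hat) for _ in range(n)]
        # Step 2: compute the statistic and save it
        replicates.append(statistic(simulated))
    return replicates

data = [9.8, 11.2, 10.4, 12.9, 8.7, 10.1, 11.6, 9.3]  # hypothetical measurements
max_star = parametric_bootstrap_normal(data, max, B=1000)
print(min(max_star), max(max_star))
```

    Only the simulation line depends on the assumed family; a different model would replace `rng.gauss` with a draw from that distribution using its own fitted parameters.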

    The resulting vector then provides an estimate of the distribution of the statistic, just as for the non-parametric case.

    Example 2. Recall that the distribution of $\hat{p}$ is approximately normal with mean $p$ and variance $p(1-p)/n$. Suppose we wish to conduct inference on a population proportion using the exact distribution of the underlying sample from which we calculate $\hat{p}$. We know that the underlying distribution of each of our sample observations is Bernoulli with unknown parameter $p$. How would we conduct the parametric bootstrap?

    1. First calculate the sample estimate of $p$, which is $\hat{p} = \frac{1}{n}\sum_i X_i$.

    2. Then simulate a random sample of $n$ Bernoulli($\hat{p}$) random variables and calculate $\hat{p}^*$.

    3. Repeat (2) a large number of times.

    4. Plot a histogram, compute quantiles and confidence intervals, etc.
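
    The four steps of Example 2 might look like this in Python. The data (12 successes in 40 trials), the seed, and the function name are hypothetical, and the histogram of step 4 is replaced here by a percentile interval.

```python
import random

def bernoulli_parametric_bootstrap(x, B=1000, seed=0):
    """Return p_hat and the sorted parametric bootstrap replicates of p_hat."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    n = len(x)
    p_hat = sum(x) / n  # step 1: sample estimate of p
    p_star = []
    for _ in range(B):  # step 3: repeat step 2 B times
        # Step 2: simulate n Bernoulli(p_hat) variables and recompute the proportion
        sample = [1 if rng.random() < p_hat else 0 for _ in range(n)]
        p_star.append(sum(sample) / n)
    return p_hat, sorted(p_star)

x = [1] * 12 + [0] * 28  # hypothetical data: 12 successes in n = 40 trials
p_hat, p_star = bernoulli_parametric_bootstrap(x)
ci = (p_star[24], p_star[974])  # step 4: 95% percentile interval for B = 1000
print(p_hat, ci)
```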

    Below we consider one example where the non-parametric bootstrap fails and the parametric bootstrap proves to be quite useful.

    2.1. Distribution of the Sample Maximum

    Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables whose probability density function (pdf) is given by $f$ and whose cumulative distribution function (cdf) is given by $F$.
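
    For independent, identically distributed observations, a standard result is that the maximum has cdf $P(\max_i X_i \le x) = F(x)^n$. One commonly cited reason the non-parametric bootstrap struggles with this statistic is that a resample can never exceed the observed maximum, and it reproduces that maximum exactly with probability $1 - (1 - 1/n)^n \approx 0.632$. The Python sketch below checks this numerically; the standard normal data are a hypothetical stand-in (not the data analysed in these notes), and the seed and the choice $n = 163$ are assumptions for the example.

```python
import random

n, B = 163, 2000
rng = random.Random(1)  # arbitrary seed for reproducibility
data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # hypothetical stand-in data
observed_max = max(data)

hits = 0
for _ in range(B):
    resample = rng.choices(data, k=n)  # non-parametric resample with replacement
    if max(resample) == observed_max:
        hits += 1

frac_equal = hits / B              # share of resamples whose maximum is the observed one
theory = 1 - (1 - 1 / n) ** n      # probability the observed maximum is drawn at least once
print(frac_equal, theory)
```

    Because so much of the bootstrap distribution piles up on a single value, its quantiles say little about how large the maximum of a new sample could be.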

    [Histogram: "Radiocesium Tissue Concentration in Bass from PAR Pond"; x-axis: picocuries per gram; y-axis: frequency]

    Figure 2: An approximately normal data set of $^{137}$Cs body burdens

    How do we conduct inference using the parametric bootstrap?

    A parametric bootstrap was performed using the normal distribution for the underlying distribution of the data. A histogram of the bootstrapped maximums is shown below.

    [Histogram: "Bootstrapped Maximum Radiocesium Tissue Concentrations in Bass from PAR Pond"; x-axis: picocuries per gram; y-axis: frequency]

    Figure 3: Bootstrapped maximum body burdens

    There were 17 observations in the bootstrapped maximums that were above 30 picocuries per gram.

    What is the bootstrap estimate of the probability that the maximum body burden in a sample of size 163 will exceed 30 picocuries per gram?
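
    One way to read the count above: the bootstrap estimate of a probability is the fraction of bootstrap statistics that fall in the event of interest. Assuming $B = 1000$ bootstrap samples, as in the code listing for this example, the arithmetic is:

```latex
\hat{P}\left(\max_i X_i > 30\right)
  = \frac{\#\{\, b : \hat{\theta}^*_b > 30 \,\}}{B}
  = \frac{17}{1000}
  = 0.017
```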

    2.2. Code for Non-parametric Bootstrap Two Sample Inference

    % Survival times (days) for the two groups
    treatment = [94,197,16,38,99,141,23];
    control = [52,104,146,10,51,30,40,27,46];

    B = 1000;                       % number of bootstrap resamples
    mediantreat = zeros(B,1);
    mediancontrol = zeros(B,1);
    medianDiff = zeros(B,1);
    boottreat = zeros(length(treatment),1);
    bootcontrol = zeros(length(control),1);

    for b = 1:B
        % Resample each group with replacement
        % (unidrnd is in the Statistics Toolbox; randi is a base alternative)
        for j = 1:length(treatment)
            pick = unidrnd(length(treatment));
            boottreat(j) = treatment(pick);
        end
        for k = 1:length(control)
            pick = unidrnd(length(control));
            bootcontrol(k) = control(pick);
        end
        % Store the medians and their difference for this resample
        mediantreat(b) = median(boottreat);
        mediancontrol(b) = median(bootcontrol);
        medianDiff(b) = mediantreat(b) - mediancontrol(b);
    end

    hist(medianDiff);
    title('1000 bootstrapped Differences of Treatment Medians')
    xlabel('median difference')
    ylabel('frequency')

    % 95% percentile interval: 25th and 975th of the B = 1000 sorted values
    sortmedian = sort(medianDiff);
    BSCI = [sortmedian(25), sortmedian(975)];

    2.3. Code for Parametric Bootstrap of the Sample Maximum

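    % bass: vector of radiocesium body burdens (picocuries per gram),
    % assumed to be loaded into the workspace already
    hist(bass);
    title('Radiocesium Tissue Concentrations in Bass from PAR Pond');
    xlabel('picocuries per gram');
    ylabel('frequency');

    % Sample estimates of the normal parameters
    mu = mean(bass);
    sigma = sqrt(var(bass));

    B = 1000;
    maxbass = zeros(B,1);
    for b = 1:B
        % Simulate a normal sample of the original size and record its maximum
        basspboot = randn(length(bass),1)*sigma + mu;
        maxbass(b) = max(basspboot);
    end

    hist(maxbass);
    title('Bootstrapped Maximum Radiocesium Tissue Concentrations in Bass from PAR Pond');
    xlabel('picocuries per gram');
    ylabel('frequency');

    % Proportion of bootstrapped maximums at or above 30 picocuries per gram
    Count30 = zeros(B,1);
    for j = 1:B
        if maxbass(j) >= 30, Count30(j) = 1;
        end
    end
    p30 = sum(Count30)/B;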