Chapter 6

Introduction

In this second part of the book we discuss statistical methods for the two-sample and the K-sample problems. Whereas in the one-sample problem the objective is to compare the distribution of the sample observations with a hypothesised distribution, we are now concerned with comparing the distributions of two or more populations from which we have observations at our disposal. As both classes of problems are about comparing distributions, many of the methods developed for the former can be easily adapted to the latter. Indeed, many names of tests come back (e.g., the Kolmogorov-Smirnov and the Anderson-Darling tests). It also implies that many of the building blocks of Chapter 2 are useful again.

This part starts with an introductory chapter, followed in Chapter 7 by some extra building blocks that were not needed in Part I. In Chapter 8 we briefly discuss some graphical tools that may be helpful in comparing distributions. Chapters 10 and 11 extend the smooth and EDF tests of Part I to tests for the two- and the K-sample problems. In the last chapter we discuss two final methods, and we conclude with a brief discussion.

We start in Section 6.1 by defining the problem. It becomes clear that the term two-sample problem has many meanings. Understanding the problem in detail will help us later on to interpret the results, so that an informative statistical analysis can be performed. The datasets that are used to demonstrate the statistical techniques are introduced in Section 6.2. The chapter is concluded with a discussion of some important tests that are not true two-sample or K-sample tests, but that are closely related. Some of these test statistics reappear later as components of smooth and EDF statistics.

We continue in the line of the main objective of the book. That is, we focus on classes of tests, we introduce the reader to the basic ideas and theory, and we illustrate how the methods may be used for providing informative statistical analysis. As a consequence not all tests are described. We particularly focus on continuous distributions.

O. Thas, Comparing Distributions, Springer Series in Statistics, DOI 10.1007/978-0-387-92710-7_6, © Springer Science+Business Media, LLC 2010

6.1 The Problem Defined

6.1.1 The Null Hypothesis of the General Two-Sample Problem

In defining the two-sample and the K-sample problems it is important to be very precise about both the null and the alternative hypotheses. We start with the two-sample problem. Suppose we have two independent samples from two populations. Let $X_{11}, \ldots, X_{1n_1}$ and $X_{21}, \ldots, X_{2n_2}$ denote the $n_1$ and $n_2$ sample observations with distribution functions $F_1$ and $F_2$, respectively. Without loss of generality, we consider $F_1$ and $F_2$ to have the same support, say $S$. We further assume that all observations are mutually independent. The notation $X_1$ and $X_2$ is used to denote random variables with distribution functions $F_1$ and $F_2$, respectively. The notation $\mu_s$ and $\sigma_s^2$ ($s = 1, 2$) is used to denote the corresponding means and variances. We define the two-sample problem as the problem concerned with testing the null hypothesis

$$H_0 : F_1(x) = F_2(x) \quad \text{for all } x \in S. \tag{6.1}$$

Sometimes we write $H_0 : F_1 = F_2$ for short. The most general alternative hypothesis is $H_1$: not $H_0$. Tests that are consistent for testing $H_0$ versus $H_1$ are referred to as omnibus consistent tests, and we refer to this testing problem as the general two-sample problem. Sometimes less general alternative hypotheses are considered, leading to directional tests. Just as in the one-sample problem, most smooth tests (Chapter 10) are examples of directional tests. It may be informative to give one well-known example at this point: the two-sample t-test may be considered as a directional two-sample test. It is used to test the null hypothesis (6.1) against the directional alternative $H_1 : \mu_1 \neq \mu_2$.
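The directional nature of the t-test can be made concrete with a small sketch. The code below computes the classical pooled-variance t statistic; the function name and the toy samples are illustrative assumptions and do not come from the datasets of this chapter. The first pair of samples has equal means but clearly different spread, so the two distributions differ while the statistic carries no signal.

```python
import math


def pooled_t_statistic(x, y):
    """Classical two-sample t statistic with a pooled variance estimate."""
    n1, n2 = len(x), len(y)
    mean1, mean2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((v - mean1) ** 2 for v in x)  # sum of squares, sample 1
    ss2 = sum((v - mean2) ** 2 for v in y)  # sum of squares, sample 2
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


# Toy samples (hypothetical): equal means, different variances.
same_mean_x = [-1.0, 0.0, 1.0]
same_mean_y = [-3.0, 0.0, 3.0]
t_same_mean = pooled_t_statistic(same_mean_x, same_mean_y)

# Toy samples (hypothetical): a pure location shift.
shifted_y = [1.0, 2.0, 3.0]
t_shifted = pooled_t_statistic(same_mean_x, shifted_y)
```

With these inputs the statistic is zero for the equal-mean pair and nonzero for the shifted pair, which is what one expects from a test directed at differences in means only.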

We like to stress that the null hypothesis (6.1) is very nonparametric in the sense that the distributions $F_1$ and $F_2$ are not specified. Often some assumptions on $F_1$ and $F_2$ are required for the test statistic to have a proper distribution (e.g., finite first four moments), but we try to avoid these technicalities. Although (6.1) looks very similar to the one-sample null hypothesis, its nonparametric character makes a difference in finding the null distribution of a test statistic. In the one-sample problem, the distribution of the observations is very well defined under the null hypothesis, because this is exactly what is hypothesised in $H_0$. Even with a composite null hypothesis, the distribution is specified up to a very limited number of parameters. This strong distributional restriction implied by $H_0$ makes it possible, for example, to find the exact null distribution of test statistics under a simple null hypothesis, and to use the parametric bootstrap for p-value calculation for composite null hypotheses. For most tests, however, the distribution theory relies on the central limit theorem or the weak convergence of empirical processes. These asymptotic theories will again play a central role in finding the asymptotic null distribution of the two-sample test statistics. A parametric bootstrap procedure no longer applies, as (6.1) does not specify any distribution. Despite the very nonparametric nature of (6.1), we are now often even in a position to obtain the exact null distribution of a test statistic, whatever $F = F_1 = F_2$ may be and whatever the sample size. The reason is that the null hypothesis (6.1) implies an invariance of the null distribution of the test statistic under permutations of the observations over the two samples. This allows for exact p-value calculations, however small the sample sizes are. More details of permutation tests are given in Section 7.1.
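The permutation argument can be illustrated with a minimal sketch. It enumerates every way of splitting the pooled observations into groups of sizes $n_1$ and $n_2$ and reports the fraction of splits whose statistic is at least as large as the observed one. The function names, the choice of the absolute difference in means as statistic, and the toy data are assumptions made for illustration; Section 7.1 gives the actual treatment.

```python
from itertools import combinations


def exact_permutation_p_value(x, y, statistic):
    """Exact p-value by full enumeration of all splits of the pooled sample."""
    pooled = list(x) + list(y)
    n1 = len(x)
    observed = statistic(x, y)
    indices = range(len(pooled))
    count = 0
    total = 0
    for chosen in combinations(indices, n1):
        chosen_set = set(chosen)
        first = [pooled[i] for i in chosen]
        second = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point ties.
        if statistic(first, second) >= observed - 1e-12:
            count += 1
    return count / total


def abs_mean_difference(a, b):
    """Illustrative statistic; any two-sample statistic could be used."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


# Toy data (hypothetical): two completely separated samples of size 3.
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
p_value = exact_permutation_p_value(x, y, abs_mean_difference)
```

Full enumeration is feasible only for small samples, since the number of splits grows combinatorially with $n_1 + n_2$; for larger samples one would typically sample random permutations instead.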

Many of the test statistics for the two-sample problem are very closely related to those discussed in Part I. This is very easy to understand. Consider the simple null hypothesis $F(x) = G(x)$, where $F$ and $G$ represent the true and the hypothesised distributions, respectively. Whereas the latter is completely specified, the former is completely unknown, but can be estimated consistently by the EDF $F_n$. In Section 2.1.2 we gave a very generic form of test statistics in (2.2): $T_n = c(n)\,d(F_n, G)$, where $c(n)$ is a scaling factor, and $d(\cdot, \cdot)$ is a distance or divergence functional. If we apply the same idea here, we now replace the two unknown distribution functions $F_1$ and $F_2$ by their respective EDFs, say $F_{1n}$ and $F_{2n}$. As $\min(n_1, n_2) \to \infty$, both EDFs converge to the true distribution functions (see Section 2.1.1 for more details on the modes of convergence). A general form of a two-sample test statistic may then be represented by

$$T_n = c(n)\,d(F_{1n}, F_{2n}),$$

where $c(\cdot)$ and $d(\cdot, \cdot)$ are as before, thus resulting in test statistics of the same form as for the one-sample problem. Later we come back to the choice of the function $d(\cdot, \cdot)$, and how this relates to the specification of the alternative hypothesis.
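As one concrete instance of this generic form, the sketch below takes $d$ to be the supremum distance between the two EDFs and $c(n) = \sqrt{n_1 n_2 / (n_1 + n_2)}$, which is a common scaling for a Kolmogorov-Smirnov type statistic. Both choices, the function names, and the toy data are assumptions made for illustration and are not necessarily the definitions used later in the book.

```python
import math
from bisect import bisect_right


def edf(sample):
    """Return the empirical distribution function of a sample."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # Proportion of observations less than or equal to x.
        return bisect_right(ordered, x) / n

    return F


def sup_distance(F1n, F2n, points):
    """Supremum distance between two step functions, evaluated at given points."""
    return max(abs(F1n(p) - F2n(p)) for p in points)


def two_sample_statistic(x, y):
    """T_n = c(n) * d(F1n, F2n) with a sup distance and sqrt(n1*n2/(n1+n2)) scaling."""
    F1n, F2n = edf(x), edf(y)
    n1, n2 = len(x), len(y)
    scale = math.sqrt(n1 * n2 / (n1 + n2))
    # Both EDFs only jump at observations, so the pooled sample suffices.
    return scale * sup_distance(F1n, F2n, list(x) + list(y))


# Toy data (hypothetical).
x = [0.1, 0.4, 0.7, 1.2]
y = [0.5, 0.9, 1.5, 2.0]
t_n = two_sample_statistic(x, y)
```

Swapping `sup_distance` for another functional (for example an integrated squared difference) changes which departures from $H_0$ the statistic is most sensitive to, which is the link between $d$ and the alternative hypothesis mentioned above.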

6.1.2 The Null Hypothesis of the General K-Sample Problem

In the K-sample problem we are concerned with testing whether $K$ ($K \geq 2$) independent samples come from the same population. It is thus a generalisation of the two-sample problem to $K$ samples. Denoting the $s$th distribution function by $F_s$ ($s = 1, \ldots, K$), and assuming that all $F_s$ have the same support, we may write the general K-sample null hypothesis as

$$H_0 : F_1(x) = F_2(x) = \cdots = F_K(x) \quad \text{for all } x \in S.$$

Just as with the two-sample problem, we often consider the alternative hypothesis as the negation of $H_0$. Tests that are consistent against this general alternative are omnibus tests; otherwise they are directional.

One may ask why we treat the two- and the K-sample problems separately. We could just as well have introduced only the K-sample problem, leaving K = 2 as a special case. There are several reasons for treating them separately. First, there is the historical argument. Many of the tests were introduced for the two-sample problem; extensions appeared only later in the statistical literature. Second, there are some tests that are only available for the two-sample problem. Third, although many K-sample test statistics reduce to the two-sample statistics, they appear to have a different form. The last argument is basically a didactic one: we believe that many methods and concepts are simply easier to introduce in the two-sample setting.

6.2 Example Datasets

6.2.1 Gene Expression in Colorectal Cancer Patients

In recent years there has been increasing interest in data analysis methods for high-throughput data. A typical example of these huge datasets arises from microarrays or DNA chips. Microarray experiments are used to measure the expression levels of often more than 20,000 genes simultaneously. For each gene, they essentially measure the concentration of gene-specific mRNA, which is a transcription product of the gene that triggers the production of a specific protein. For more details on the statistical analyses of microarray experiments, see, e.g., Speed (2003), Gentleman et al. (2005), or Allison et al. (2006). These experiments are often performed for comparison purposes. For example, gene expression levels in a control group of healthy people and a group of cancer patients are measured with the aim of finding genes that are differentially expressed in the cancer group. These genes may play an important role in the onset or the development of the cancer. The identification of such genes may be helpful in understanding the biology of the disease, or such genes may be used as biomarkers in a diagnostic assay to detect the cancer at an early stage. Because microarray experiments are quite costly, they are typically performed on small groups of people. Having 20 subjects in each of the two groups is considered to be a moderately large experiment. The datasets are thus massive in terms of dimensionality, but not in terms of the number of independent subjects in the sample. However, here we select only a few genes for illustrating the two-sample tests, thus ignoring the problem of multiplicity of tests completely.

Most textbooks on the statistical analysis of microarray experiments advise using the traditional parametric t-test, or the nonparametric Wilcoxon rank sum test. Some specifically designed tests have been suggested (e.g., the SAM method of Tusher et al. (2001)), but most of them are simple modifications of the t-test.

The data that we present here were collected at the VU University Medical Center (VUmc), Amsterdam, The Netherlands. The objective of the study was to find out which genes are involved in the progression from adenoma to carcinoma in colorectal cancer. The microarray experiment was performed on RNA isolated from 68 snap-frozen colorectal tumour samples: 37 nonprogressed adenomas and 31 carcinomas. The microarray measured expression levels of 28,830 unique genes. More details on the study and its conclusions can be found in Carvalho et al. (2008). The paper also gives details on how the expression data were preprocessed (background correction, normalisation, and summarisation). In the next paragraph we give some biological background.

Not all adenomas progress to carcinomas; this happens in only a small subset of tumours. Initiation of genomic instability is a crucial step in this progression and occurs in two ways in colorectal cancer. The first, DNA mismatch repair deficiency leading to microsatellite instability, has been most extensively studied, but it explains only about 15% of adenoma to carcinoma progression. In the other 85% of the cases where colorectal adenomas progress to carcinomas, genomic instability occurs at the chromosomal level, giving rise to aneuploidy. Although for a long time these chromosomal aberrations were regarded as random noise, secondary to cancer development, it has now been well established that these DNA copy number changes occur in specific patterns and are associated with different clinical behaviour. Nevertheless, despite extensive efforts, neither the cause of chromosomal instability in human cancer progression nor its biological consequences have been fully established.

For illustrative purposes we have selected four genes. They have sequence references NM_152299, AK021616, AK0550915, and NM_012469, but we simply refer to them as genes 1, 2, 3, and 4, respectively. Figure 6.1 shows the kernel density estimates of the expression levels.

6.2.2 Travel Times

A taxi company often brings clients from the central railway station to the airport. Because many of these passengers are in a hurry to catch their planes, it is important to guarantee a short travel time. Although there is a highway connection to the airport, there are frequently traffic jams. The owner of the taxi company sets up an experiment to compare five routes from the railway station to the airport:

1. Route 1: this is the route as suggested by the GPS installed in the car.

2. Route 2: this is the preferred route by a local taxi driver who has lived for many years in the area.

3. Route 3: this route has a preference for small roads through a residential area.

4. Route 4: this route has a preference for big roads (i.e. two lanes for each direction), but not the highway.

5. Route 5: the taxi driver first listens to the latest traffic information on the radio, and he decides to take the highway when no problems are reported; otherwise Route 1 is selected.

[Figure 6.1: four panels, gene 1 to gene 4; horizontal axis: expression level; vertical axis: density.]

Fig. 6.1 The kernel density estimates of the four genes. Each plot shows the density estimates of the two patient groups: the full line and the dashed line represent the nonprogressed adenomas and the carcinomas, respectively

As the taxi drivers usually take the routes as suggested by their GPS, route 1 is considered as the reference route. In a time period of one month, 250 taxi rides from the railway station to the airport were randomly assigned to these 5 routes, resulting in a balanced design. The travel times were recorded in seconds and converted to minutes. Boxplots of the data are shown in Figure 6.2. The dataset is referred to as the traffic data.
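A balanced random assignment of this kind can be sketched as follows. The text does not say how the randomisation was actually carried out, so the shuffling scheme, the function names, and the seed are assumptions for illustration only; no travel times are generated.

```python
import random
from collections import Counter


def balanced_assignment(n_rides, n_routes, seed=None):
    """Randomly assign rides to routes with equal numbers of rides per route."""
    if n_rides % n_routes != 0:
        raise ValueError("n_rides must be a multiple of n_routes for a balanced design")
    per_route = n_rides // n_routes
    # One label per ride: route 1 repeated per_route times, then route 2, etc.
    labels = [route for route in range(1, n_routes + 1) for _ in range(per_route)]
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    rng.shuffle(labels)
    return labels


def seconds_to_minutes(seconds):
    """Convert a travel time recorded in seconds to minutes."""
    return seconds / 60.0


assignment = balanced_assignment(250, 5, seed=2019)
rides_per_route = Counter(assignment)
```

Here `assignment[i]` is the route for the `i`-th ride in chronological order, and `rides_per_route` confirms that each of the 5 routes receives 50 rides.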

