Kernel Dimension Reduction in Regression

Kenji Fukumizu
Institute of Statistical Mathematics
4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569, Japan
[email protected]

Francis R. Bach
INRIA - WILLOW Project-Team
Laboratoire d’Informatique de l’Ecole Normale Supérieure
CNRS/ENS/INRIA UMR 8548
45, rue d’Ulm, 75230 Paris, France
[email protected]

Michael I. Jordan
Department of Statistics
Department of Computer Science and Electrical Engineering
University of California, Berkeley, CA 94720, USA
[email protected]

July 17, 2008

Running title. Kernel Dimension Reduction.

AMS 2000 subject classifications. Primary 62H99; secondary 62J02.

Keywords. dimension reduction, regression, positive definite kernel, reproducing kernel, consistency.

Acknowledgements. The authors thank the editor and anonymous referees for their helpful comments. The authors also thank Dr. Yoichi Nishiyama for his helpful comments on the uniform convergence of empirical processes. We would like to acknowledge support from JSPS KAKENHI 15700241, a research grant from the Inamori Foundation, a Scientific Grant from the Mitsubishi Foundation, and Grant 0509559 from the National Science Foundation.

Summary

We present a new methodology for sufficient dimension reduction (SDR). Our methodology derives directly from the formulation of SDR in terms of the conditional independence of the covariate X from the response Y, given the projection of X on the central subspace (cf. [23, 6]). We show that this conditional independence assertion can be characterized in terms of conditional covariance operators on reproducing kernel Hilbert spaces, and we show how this characterization leads to an M-estimator for the central subspace. The resulting estimator is shown to be consistent under weak conditions; in particular, we do not have to impose linearity or ellipticity conditions of the kinds that are generally invoked for SDR methods. We also present empirical results showing that the new methodology is competitive in practice.

1 Introduction

The problem of sufficient dimension reduction (SDR) for regression is that of finding a subspace S such that the projection of the covariate vector X onto S captures the statistical dependency of the response Y on X. More formally, let us characterize a dimension-reduction subspace S in terms of the following conditional independence assertion:

    Y ⊥⊥ X | Π_S X,    (1)

where Π_S X denotes the orthogonal projection of X onto S. It is possible to show that under weak conditions the intersection of dimension reduction subspaces is itself a dimension reduction subspace, in which case the intersection is referred to as a central subspace [6, 5]. As suggested in a seminal paper by Li [23], it is of great interest to develop procedures for estimating this subspace, quite apart from any interest in the conditional distribution P(Y | X) or the conditional mean E(Y | X). Once the central subspace is identified, subsequent analysis can attempt to infer a conditional distribution or a regression function using the (low-dimensional) coordinates Π_S X.

The line of research on SDR initiated by Li is to be distinguished from the large and heterogeneous collection of methods for dimension reduction in regression in which specific modeling assumptions are imposed on the conditional distribution P(Y | X) or the regression E(Y | X). These methods include ordinary least squares, partial least squares, canonical correlation analysis, ACE [4], projection pursuit regression [12], neural networks, and LASSO [29]. These methods can be effective if the modeling assumptions that they embody are met, but if these assumptions do not hold there is no guarantee of finding the central subspace.

Li’s paper not only provided a formulation of SDR as a semiparametric inference problem—with subsequent contributions by Cook and others bringing it to its elegant expression in terms of conditional independence—but also suggested a specific inferential methodology that has had significant influence on the ensuing literature. Specifically, Li suggested approaching the SDR problem as an inverse regression problem. Roughly speaking, the idea is that if the conditional distribution P(Y | X) varies solely along a subspace of the covariate space, then the inverse regression E(X | Y) should lie in that same subspace. Moreover, it should be easier to regress X on Y than vice versa, given that Y is generally low-dimensional (indeed, one-dimensional in the majority of applications) while X is high-dimensional. Li [23] proposed a particularly simple instantiation of this idea—known as sliced inverse regression (SIR)—in which E(X | Y) is estimated as a constant vector within each slice of the response variable Y, and principal component analysis is used to aggregate these constant vectors into an estimate of the central subspace. The past decade has seen a number of further developments in this vein. Some focus on finding a central subspace [e.g., 9, 10], while others aim at finding a central mean subspace, which is a subspace of the central subspace that is effective only for the regression E[Y | X]. The latter include principal Hessian directions (pHd, [24]) and contour regression [22]. A particular focus of these more recent developments has been the exploitation of second moments within an inverse regression framework.

While the inverse regression perspective has been quite useful, it is not without its drawbacks. In particular, performing a regression of X on Y generally requires making assumptions with respect to the probability distribution of X, assumptions that can be difficult to justify. Most of the inverse regression methods make the assumption of linearity of the conditional mean of the covariate along the central subspace (or make a related assumption for the conditional covariance). These assumptions hold in particular if the distribution of X is elliptic. In practice, however, we do not necessarily expect that the covariate vector will follow an elliptic distribution, nor is it easy to assess departures from ellipticity in a high-dimensional setting. In general it seems unfortunate to have to impose probabilistic assumptions on X in the setting of a regression methodology.

Many of the inverse regression methods can also exhibit additional limitations depending on the specific nature of the response variable Y. In particular, pHd and contour regression are applicable only to a one-dimensional response. Also, if the response variable takes its values in a finite set of p elements, SIR yields a subspace of dimension at most p − 1; thus, for the important problem of binary classification SIR yields only a one-dimensional subspace. Finally, in the binary classification setting, if the covariance matrices of the two classes are the same, SAVE and pHd also provide only a one-dimensional subspace [7]. The general problem in these cases is that the estimated subspace is smaller than the central subspace. One approach to tackling these limitations is to incorporate higher-order moments of Y | X [34], but in practice the gains achievable by the use of higher-order moments are limited by robustness issues.

In this paper we present a new methodology for SDR that is rather different from the approaches considered in the literature discussed above. Rather than focusing on a limited set of moments within an inverse regression framework, we focus instead on the criterion of conditional independence in terms of which the SDR problem is defined. We develop a contrast function for evaluating subspaces that is minimized precisely when the conditional independence assertion in Eq. (1) is realized. As befits a criterion that measures departure from conditional independence, our contrast function is not based solely on low-order moments.

Our approach involves the use of conditional covariance operators on reproducing kernel Hilbert spaces (RKHS’s). Our use of RKHS’s is related to their use in nonparametric regression and classification; in particular, the RKHS’s given by some positive definite kernels are Hilbert spaces of smooth functions that are “small” enough to yield computationally-tractable procedures, but are rich enough to capture nonparametric phenomena of interest [32], and this computational focus is an important aspect of our work. On the other hand, whereas in nonparametric regression and classification the role of RKHS’s is to provide basis expansions of regression functions and discriminant functions, in our case the RKHS plays a different role. Our interest is not in the functions in the RKHS per se, but rather in conditional covariance operators defined on the RKHS. We show that these operators can be used to measure departures from conditional independence. We also show that these operators can be estimated from data and that these estimates are functions of Gram matrices. Thus our approach—which we refer to as kernel dimension reduction (KDR)—involves computing Gram matrices from data and optimizing a particular functional of these Gram matrices to yield an estimate of the central subspace.

This approach makes no strong assumptions on either the conditional distribution p_{Y|Π_S X}(y | Π_S x) or the marginal distribution p_X(x). As we show, KDR is consistent as an estimator of the central subspace under weak conditions.

There are alternatives to the inverse regression approach in the literature that have some similarities to KDR. In particular, minimum average variance estimation (MAVE, [33]) is based on nonparametric estimation of the conditional covariance of Y given X, an idea related to KDR. This method explicitly estimates the regressor, however, assuming an additive noise model Y = f(X) + Z, where Z is independent of X. While the purpose of MAVE is to find a central mean subspace, KDR tries to find a central subspace, and does not need to estimate the regressor explicitly. Other related approaches include methods that estimate the derivative of the regression function; these are based on the fact that the derivative of the conditional expectation g(x) = E[y | B^T x] with respect to x belongs to a dimension reduction subspace [27, 18]. The purpose of these methods is again to extract a central mean subspace; this differs from the central subspace, which is the focus of KDR. The difference is clear, for example, if we consider the situation in which a direction b in a central subspace satisfies E[g′(b^T X)] = 0, a condition that occurs if g and the distribution of X exhibit certain symmetries; such a direction cannot be found by methods based on the derivative. There has also been some recent work on nonparametric methods for estimation of central subspaces. One such method estimates the central subspace based on an expected log likelihood [35]. This requires, however, an estimate of the joint probability density, and is limited to single-index regression. Finally, Zhu and Zeng [36] have proposed a method for estimating the central subspace based on the Fourier transform. This method is similar to the KDR method in its use of Hilbert space methods and in its use of a contrast function that can characterize conditional independence under weak assumptions. It differs from KDR, however, in that it requires an estimate of the derivative of the marginal density of the covariate X; in practice this requires assuming a parametric model for the covariate X. In general, we are aware of no practical method that attacks SDR directly by using nonparametric methodology to assess departures from conditional independence.

We presented an earlier kernel dimension reduction method in [13]. The contrast function presented in that paper, however, was not derived as an estimator of a conditional covariance operator, and it was not possible to establish a consistency result for that approach. The contrast function that we present here is derived directly from the conditional covariance perspective; moreover, it is simpler than the earlier estimator and it is possible to establish consistency for the new formulation. We should note, however, that the empirical performance of the earlier KDR method was shown by Fukumizu et al. [13] to yield a significant improvement on SIR and pHd in the case of non-elliptic data, and these empirical results motivated us to pursue the general approach further.

While KDR has advantages over other SDR methods because of its generality and its directness in capturing the semiparametric nature of the SDR problem, it also reposes on a more complex mathematical framework that presents new theoretical challenges. Thus, while consistency for SIR and related methods follows from a straightforward appeal to the central limit theorem (under ellipticity assumptions), more effort is required to study the statistical behavior of KDR theoretically. This effort is of some general value, however; in particular, to establish the consistency of KDR we prove the uniform O(n^{−1/2}) convergence of an empirical process that takes values in a reproducing kernel Hilbert space. This result, which accords with the order of uniform convergence of an ordinary real-valued empirical process, may be of independent theoretical interest.

It should be noted at the outset that we do not attempt to provide distribution theory for KDR in this paper, and in particular we do not address the problem of inferring the dimensionality of the central subspace.

The paper is organized as follows. In Section 2 we show how conditional independence can be characterized by cross-covariance operators on an RKHS and use this characterization to derive the KDR method. Section 3 presents numerical examples of the KDR method. We present a consistency theorem and its proof in Section 4. Section 5 provides concluding remarks. Some of the details in the proof of consistency are provided in the Appendix.

2 Kernel Dimension Reduction for Regression

The method of kernel dimension reduction is based on a characterization of conditional independence using operators on RKHS’s. We present this characterization in Section 2.1 and show how it yields a population criterion for SDR in Section 2.2. This population criterion is then turned into a finite-sample estimation procedure in Section 2.3.

In this paper, a Hilbert space means a separable Hilbert space, and an operator always means a linear operator. The operator norm of a bounded operator T is denoted by ‖T‖. The null space and the range of an operator T are denoted by N(T) and R(T), respectively.

2.1 Characterization of conditional independence

Let (X, B_X) and (Y, B_Y) denote measurable spaces. When the base space is a topological space, the Borel σ-field is always assumed. Let (H_X, k_X) and (H_Y, k_Y) be RKHS’s of functions on X and Y, respectively, with measurable positive definite kernels k_X and k_Y [1]. We consider a random vector (X, Y) : Ω → X × Y with the law P_XY. The marginal distributions of X and Y are denoted by P_X and P_Y, respectively. It is always assumed that the positive definite kernels satisfy

    E_X[k_X(X, X)] < ∞   and   E_Y[k_Y(Y, Y)] < ∞.    (2)

Note that any bounded kernel satisfies this assumption. Also, under this assumption, H_X and H_Y are included in L²(P_X) and L²(P_Y), respectively, where L²(μ) denotes the Hilbert space of square integrable functions with respect to the measure μ, and the inclusions J_X : H_X → L²(P_X) and J_Y : H_Y → L²(P_Y) are continuous, because E_X[f(X)²] = E_X[⟨f, k_X(·, X)⟩²_{H_X}] ≤ ‖f‖²_{H_X} E_X[k_X(X, X)] for f ∈ H_X.

The cross-covariance operator of (X, Y) is the operator Σ_YX from H_X to H_Y such that

    ⟨g, Σ_YX f⟩_{H_Y} = E_XY[(f(X) − E_X[f(X)])(g(Y) − E_Y[g(Y)])]    (3)

holds for all f ∈ H_X and g ∈ H_Y [3, 13]. Obviously, Σ_YX = Σ*_XY, where T* denotes the adjoint of an operator T. If Y is equal to X, the positive self-adjoint operator Σ_XX is called the covariance operator.

For a random variable X : Ω → X, the mean element m_X ∈ H_X is defined as the element that satisfies

    ⟨f, m_X⟩_{H_X} = E_X[f(X)]    (4)

for all f ∈ H_X; that is, m_X = J*_X 1, where 1 is the constant function. The explicit functional form of m_X is given by m_X(u) = ⟨m_X, k(·, u)⟩_{H_X} = E[k(X, u)]. Using the mean elements, Eq. (3), which characterizes Σ_YX, can be written as

    ⟨g, Σ_YX f⟩_{H_Y} = E_XY[⟨f, k_X(·, X) − m_X⟩_{H_X} ⟨k_Y(·, Y) − m_Y, g⟩_{H_Y}].

Let Q_X and Q_Y be the orthogonal projections which map H_X onto the closure of R(Σ_XX) and H_Y onto the closure of R(Σ_YY), respectively. It is known [3, Theorem 1] that Σ_YX has a representation of the form

    Σ_YX = Σ^{1/2}_{YY} V_YX Σ^{1/2}_{XX},    (5)

where V_YX : H_X → H_Y is a unique bounded operator such that ‖V_YX‖ ≤ 1 and V_YX = Q_Y V_YX Q_X.

A cross-covariance operator on an RKHS can be represented explicitly as an integral operator. For arbitrary ϕ ∈ L²(P_X) and y ∈ Y, the integral

    Gϕ(y) = ∫_{X×Y} k_Y(y, ỹ)(ϕ(x) − E_X[ϕ(X)]) dP_XY(x, ỹ)    (6)

always exists, and Gϕ is an element of L²(P_Y). It is not difficult to see that

    S_YX : L²(P_X) → L²(P_Y),   ϕ ↦ Gϕ

is a bounded linear operator with ‖S_YX‖ ≤ E_Y[k_Y(Y, Y)]. If f is a function in H_X, we have for any y ∈ Y

    Gf(y) = ⟨k_Y(·, y), Σ_YX f⟩_{H_Y} = (Σ_YX f)(y),

which implies the following proposition:

Proposition 1. The covariance operator Σ_YX : H_X → H_Y is the restriction of the integral operator S_YX to H_X. More precisely,

    J_Y Σ_YX = S_YX J_X.

Conditional variance can also be represented by covariance operators. Define the conditional covariance operator Σ_{YY|X} by

    Σ_{YY|X} = Σ_YY − Σ^{1/2}_{YY} V_YX V_XY Σ^{1/2}_{YY},

where V_YX is the bounded operator in Eq. (5). For convenience we sometimes write Σ_{YY|X} as

    Σ_{YY|X} = Σ_YY − Σ_YX Σ^{−1}_{XX} Σ_XY,

which is an abuse of notation, because Σ^{−1}_{XX} may not exist.

The following two propositions provide insights into the meaning of a conditional covariance operator. The former relates the operator to the residual error of regression, and the latter expresses the residual error in terms of the conditional variance.

Proposition 2. For any g ∈ H_Y,

    ⟨g, Σ_{YY|X} g⟩_{H_Y} = inf_{f∈H_X} E_XY |(g(Y) − E_Y[g(Y)]) − (f(X) − E_X[f(X)])|².

Proof. Let Σ_YX = Σ^{1/2}_{YY} V_YX Σ^{1/2}_{XX} be the decomposition in Eq. (5), and define E_g(f) = E_YX |(g(Y) − E_Y[g(Y)]) − (f(X) − E_X[f(X)])|². From the equality

    E_g(f) = ‖Σ^{1/2}_{XX} f‖²_{H_X} − 2⟨V_XY Σ^{1/2}_{YY} g, Σ^{1/2}_{XX} f⟩_{H_X} + ‖Σ^{1/2}_{YY} g‖²_{H_Y},

replacing Σ^{1/2}_{XX} f with an arbitrary φ ∈ H_X yields

    inf_{f∈H_X} E_g(f) ≥ inf_{φ∈H_X} { ‖φ‖²_{H_X} − 2⟨V_XY Σ^{1/2}_{YY} g, φ⟩_{H_X} + ‖Σ^{1/2}_{YY} g‖²_{H_Y} }
                      = inf_{φ∈H_X} ‖φ − V_XY Σ^{1/2}_{YY} g‖²_{H_X} + ⟨g, Σ_{YY|X} g⟩_{H_Y}
                      = ⟨g, Σ_{YY|X} g⟩_{H_Y}.

For the opposite inequality, take an arbitrary ε > 0. From the fact that V_XY Σ^{1/2}_{YY} g lies in the closure of R(Σ_XX), which equals the closure of R(Σ^{1/2}_{XX}), there exists f* ∈ H_X such that ‖Σ^{1/2}_{XX} f* − V_XY Σ^{1/2}_{YY} g‖_{H_X} ≤ ε. For such f*,

    E_g(f*) = ‖Σ^{1/2}_{XX} f*‖²_{H_X} − 2⟨V_XY Σ^{1/2}_{YY} g, Σ^{1/2}_{XX} f*⟩_{H_X} + ‖Σ^{1/2}_{YY} g‖²_{H_Y}
            = ‖Σ^{1/2}_{XX} f* − V_XY Σ^{1/2}_{YY} g‖²_{H_X} + ‖Σ^{1/2}_{YY} g‖²_{H_Y} − ‖V_XY Σ^{1/2}_{YY} g‖²_{H_X}
            ≤ ⟨g, Σ_{YY|X} g⟩_{H_Y} + ε².

Because ε is arbitrary, we have inf_{f∈H_X} E_g(f) ≤ ⟨g, Σ_{YY|X} g⟩_{H_Y}.

Proposition 2 is an analog for operators of a well-known result on covariance matrices and linear regression: the conditional covariance matrix C_{YY|X} = C_YY − C_YX C^{−1}_{XX} C_XY expresses the residual error of the least squares regression problem as b^T C_{YY|X} b = min_a E‖b^T Y − a^T X‖².
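As a finite-dimensional sanity check of this analogy, the following minimal numpy sketch (an illustration added here, not part of the original development; the dimensions and sample size are arbitrary) verifies numerically that b^T C_{YY|X} b coincides with the minimal residual error of predicting b^T Y linearly from X.

import numpy as np

rng = np.random.default_rng(0)
n, p, q = 10000, 3, 2                         # sample size, dim(X), dim(Y)
A = rng.normal(size=(q, p))
X = rng.normal(size=(n, p))
Y = X @ A.T + 0.5 * rng.normal(size=(n, q))   # Y depends linearly on X plus noise

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Cxx = Xc.T @ Xc / n
Cxy = Xc.T @ Yc / n
Cyy = Yc.T @ Yc / n
Cyy_x = Cyy - Cxy.T @ np.linalg.solve(Cxx, Cxy)   # conditional covariance matrix

b = rng.normal(size=q)
a_opt = np.linalg.solve(Cxx, Cxy @ b)             # least-squares coefficient for b'Y on X
resid = np.mean((Yc @ b - Xc @ a_opt) ** 2)       # minimal residual error
print(b @ Cyy_x @ b, resid)                       # the two numbers agree up to float error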

To relate the residual error in Proposition 2 to the conditional variance of g(Y) given X, we make the following mild assumption:

(AS) H_X + R is dense in L²(P_X), where H_X + R denotes the direct sum of the RKHS H_X and the RKHS R [1].

As seen later in Section 2.2, there are many positive definite kernels that satisfy the assumption (AS). Examples include the Gaussian radial basis function (RBF) kernel k(x, y) = exp(−‖x − y‖²/σ²) on R^m or on a compact subset of R^m.

Proposition 3. Under the assumption (AS),

    ⟨g, Σ_{YY|X} g⟩_{H_Y} = E_X[Var_{Y|X}[g(Y) | X]]    (7)

for all g ∈ H_Y.

Proof. From Proposition 2, we have

    ⟨g, Σ_{YY|X} g⟩_{H_Y} = inf_{f∈H_X} Var[g(Y) − f(X)]
        = inf_{f∈H_X} { Var_X[E_{Y|X}[g(Y) − f(X) | X]] + E_X[Var_{Y|X}[g(Y) − f(X) | X]] }
        = inf_{f∈H_X} Var_X[E_{Y|X}[g(Y) | X] − f(X)] + E_X[Var_{Y|X}[g(Y) | X]].

Let ϕ(x) = E_{Y|X}[g(Y) | X = x]. Since ϕ ∈ L²(P_X) from Var[ϕ(X)] ≤ Var[g(Y)] < ∞, the assumption (AS) implies that for an arbitrary ε > 0 there exist f ∈ H_X and c ∈ R such that h = f + c satisfies ‖ϕ − h‖_{L²(P_X)} < ε. Because Var[ϕ(X) − f(X)] ≤ ‖ϕ − h‖²_{L²(P_X)} ≤ ε² and ε is arbitrary, we have inf_{f∈H_X} Var_X[E_{Y|X}[g(Y) | X] − f(X)] = 0, which completes the proof.

Proposition 3 improves a result due to Fukumizu et al. [13, Proposition 5], where the much stronger assumption E[g(Y) | X = ·] ∈ H_X was imposed.

Propositions 2 and 3 imply that the operator Σ_{YY|X} can be interpreted as capturing the predictive ability for Y of the explanatory variable X.

2.2 Criterion of kernel dimension reduction

Let M(m × n; R) be the set of real-valued m × n matrices. For a natural number d ≤ m, the Stiefel manifold S^m_d(R) is defined by

    S^m_d(R) = {B ∈ M(m × d; R) | B^T B = I_d},

which is the set of all d-tuples of orthonormal vectors in R^m. It is well known that S^m_d(R) is a compact smooth manifold. For B ∈ S^m_d(R), the matrix BB^T defines an orthogonal projection of R^m onto the d-dimensional subspace spanned by the column vectors of B. Although the Grassmann manifold is often used in the study of sets of subspaces in R^m, we find the Stiefel manifold more convenient, as it allows us to use matrix notation explicitly.

Hereafter, X is assumed to be either a closed ball D^m(r) = {x ∈ R^m | ‖x‖ ≤ r} or the entire Euclidean space R^m, so that the projection BB^T X is included in X for all B ∈ S^m_d(R).

Let B^m_d ⊆ S^m_d(R) denote the subset of matrices whose columns span a dimension reduction subspace; for each B_0 ∈ B^m_d, we have

    p_{Y|X}(y | x) = p_{Y|B_0^T X}(y | B_0^T x),    (8)

where p_{Y|X}(y | x) and p_{Y|B^T X}(y | u) are the conditional probability densities of Y given X, and of Y given B^T X, respectively. The existence and positivity of these conditional probability densities are always assumed hereafter. As we have discussed in the Introduction, under conditions given by [6, Section 6.4] this subset represents the central subspace (under the assumption that d is the minimum dimensionality of the dimension reduction subspaces).

We now turn to the key problem of characterizing the subset B^m_d using conditional covariance operators on reproducing kernel Hilbert spaces. In the following, we assume that k_d is a positive definite kernel on Z = D^d(r) or R^d such that E_X[k_d(B^T X, B^T X)] < ∞ for all B ∈ S^m_d(R), and we let k^B_X denote the positive definite kernel on X given by

    k^B_X(x, x̃) = k_d(B^T x, B^T x̃)    (9)

for each B ∈ S^m_d(R). The RKHS associated with k^B_X is denoted by H^B_X. Note that H^B_X = {f : X → R | there exists g ∈ H_{k_d} such that f(x) = g(B^T x)}, where H_{k_d} is the RKHS given by k_d. As seen later in Theorem 4, if X and Y are subsets of Euclidean spaces and Gaussian RBF kernels are used for k_X and k_Y, under some conditions the subset B^m_d is characterized as the set of solutions of an optimization problem:

    B^m_d = arg min_{B∈S^m_d(R)} Σ^B_{YY|X},    (10)

where Σ^B_{YX} and Σ^B_{XX} denote the (cross-)covariance operators with respect to the kernel k^B_X, and

    Σ^B_{YY|X} = Σ_YY − Σ^B_{YX} (Σ^B_{XX})^{−1} Σ^B_{XY}.

The minimization in Eq. (10) refers to the minimal operators in the partial order of self-adjoint operators.

We use the trace to evaluate the partial order of self-adjoint operators. While other possibilities exist (e.g., the determinant), the trace has the advantage of yielding a relatively simple theoretical analysis, which is conducted in Section 4. The operator Σ^B_{YY|X} is trace class for all B ∈ S^m_d(R), since Σ^B_{YY|X} ≤ Σ_YY and Tr[Σ_YY] < ∞, as is shown in Section 4.2. Henceforth the minimization in Eq. (10) should thus be understood as the minimization of Tr[Σ^B_{YY|X}].

From Propositions 2 and 3, minimization of Tr[Σ^B_{YY|X}] is equivalent to minimization of the sum of the residual errors for the optimal prediction of functions of Y using B^T X, where the sum is taken over a complete orthonormal system {ξ_a}_{a=1}^∞ of H_Y. Thus, the objective of dimension reduction can be rewritten as

    min_{B∈S^m_d(R)} Σ_{a=1}^∞ min_{f∈H^B_X} E|(ξ_a(Y) − E[ξ_a(Y)]) − (f(X) − E[f(X)])|².    (11)

This is intuitively reasonable as a criterion for choosing B, and we will see that under some conditions it is equivalent to finding the central subspace.

We now introduce a class of kernels that characterize conditional independence. Let (Ω, B) be a measurable space, let (H, k) be an RKHS over Ω with the kernel k measurable and bounded, and let S be the set of all probability measures on (Ω, B). The RKHS H is called characteristic (with respect to B) if the map

    S ∋ P ↦ m_P = E_{X∼P}[k(·, X)] ∈ H    (12)

is one-to-one, where m_P is the mean element of a random variable with law P. It is easy to see that H is characteristic if and only if the equality ∫ f dP = ∫ f dQ for all f ∈ H implies P = Q. We also call a positive definite kernel k characteristic if the associated RKHS is characteristic.

It is known that the Gaussian RBF kernel exp(−‖x − y‖²/σ²) and the so-called Laplacian kernel exp(−α Σ_{i=1}^m |x_i − y_i|) (α > 0) are characteristic on R^m, or on a compact subset of R^m, with respect to the Borel σ-field [2, 15, 28].
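To make the notion of a characteristic kernel concrete, the following small numpy sketch (an illustration added here, not taken from the paper) estimates the squared RKHS distance ‖m_P − m_Q‖² between the mean elements of two samples under a Gaussian RBF kernel, using the expansion ‖m_P − m_Q‖² = E[k(X, X′)] − 2E[k(X, Y)] + E[k(Y, Y′)]. Because the kernel is characteristic, this distance vanishes only when P = Q; in the example the two distributions share their first two moments, yet the estimated distance is clearly positive.

import numpy as np

def gauss_gram(A, B, sigma2):
    """Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / sigma2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def mean_embedding_dist2(X, Y, sigma2):
    """Plug-in (V-statistic) estimate of ||m_P - m_Q||_H^2."""
    return (gauss_gram(X, X, sigma2).mean()
            - 2 * gauss_gram(X, Y, sigma2).mean()
            + gauss_gram(Y, Y, sigma2).mean())

rng = np.random.default_rng(0)
n = 2000
P = rng.normal(size=(n, 1))                            # N(0, 1)
Q = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), (n, 1))   # same mean and variance as P
print(mean_embedding_dist2(P, Q, sigma2=1.0))          # clearly positive
print(mean_embedding_dist2(P, rng.normal(size=(n, 1)), sigma2=1.0))  # close to zero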

The following theorem improves Theorem 7 in [13], and is the theoretical basis of kernel dimension reduction. In the following, let P_B denote the probability distribution on X induced from P_X by the projection BB^T : X → X.

Theorem 4. Suppose that the closure of H^B_X in L²(P_X) is included in the closure of H_X in L²(P_X) for any B ∈ S^m_d(R). Then

    Σ^B_{YY|X} ≥ Σ_{YY|X},    (13)

where the inequality refers to the order of self-adjoint operators. If, further, (H_X, P_X) and (H^B_X, P_B) satisfy (AS) for every B ∈ S^m_d(R) and H_Y is characteristic, then the following equivalence holds:

    Σ_{YY|X} = Σ^B_{YY|X}  ⟺  Y ⊥⊥ X | B^T X.    (14)

Proof. The first assertion is obvious from Proposition 2. For the second assertion, let C be an m × (m − d) matrix whose columns span the orthogonal complement of the subspace spanned by the columns of B, and let (U, V) = (B^T X, C^T X) for notational simplicity. Taking the expectation over U of the well-known relation

    Var_{Y|U}[g(Y) | U] = E_{V|U}[Var_{Y|U,V}[g(Y) | U, V]] + Var_{V|U}[E_{Y|U,V}[g(Y) | U, V]]

gives

    E_U[Var_{Y|U}[g(Y) | U]] = E_X[Var_{Y|X}[g(Y) | X]] + E_U[Var_{V|U}[E_{Y|U,V}[g(Y) | U, V]]],

from which Proposition 3 yields

    ⟨g, (Σ^B_{YY|X} − Σ_{YY|X}) g⟩_{H_Y} = E_U[Var_{V|U}[E_{Y|U,V}[g(Y) | U, V]]].

It follows that the left-hand side of the equivalence in Eq. (14) holds if and only if E_{Y|U,V}[g(Y) | U, V] does not depend on V almost surely, for every g ∈ H_Y. This is equivalent to

    E_{Y|X}[g(Y) | X] = E_{Y|U}[g(Y) | U]

almost surely. Since H_Y is characteristic, this means that the conditional probability of Y given X reduces to that of Y given U.

The assumption (AS) and the notion of a characteristic kernel are closely related. In fact, by the following proposition, (AS) is satisfied whenever a characteristic kernel is used. Thus, if Y is Euclidean, the choice of Gaussian RBF kernels for k_d, k_X and k_Y is sufficient to guarantee the equivalence given by Eq. (14).

Proposition 5. Let (Ω, B) be a measurable space, and let (k, H) be a bounded measurable positive definite kernel on Ω and its RKHS. Then k is characteristic if and only if H + R is dense in L²(P) for any probability measure P on (Ω, B).

Proof. For the “if” part, suppose m_P = m_Q for P ≠ Q. Denote the total variation of P − Q by |P − Q|. Since H + R is dense in L²(|P − Q|), for arbitrary ε > 0 and A ∈ B there exists f ∈ H + R such that ∫ |f − I_A| d|P − Q| < ε, where I_A is the indicator function of A. It follows that |(E_P[f(X)] − P(A)) − (E_Q[f(X)] − Q(A))| < ε. Because E_P[f(X)] = E_Q[f(X)] from m_P = m_Q, we have |P(A) − Q(A)| < ε for any ε > 0, which contradicts P ≠ Q.

For the opposite direction, suppose H + R is not dense in L²(P). Then there is a non-zero f ∈ L²(P) such that ∫ f dP = 0 and ∫ f ϕ dP = 0 for any ϕ ∈ H. Let c = 1/‖f‖_{L¹(P)}, and define two probability measures Q_1 and Q_2 by Q_1(E) = c ∫_E |f| dP and Q_2(E) = c ∫_E (|f| − f) dP for any measurable set E. By f ≠ 0, we have Q_1 ≠ Q_2, while E_{Q_1}[k(·, X)] − E_{Q_2}[k(·, X)] = c ∫ f(x) k(·, x) dP(x) = 0, which means k is not characteristic.

2.3 Kernel dimension reduction procedure

We now use the characterization given in Theorem 4 to develop an optimization procedure for estimating the central subspace from an empirical sample (X_1, Y_1), . . . , (X_n, Y_n). We assume that (X_1, Y_1), . . . , (X_n, Y_n) is sampled i.i.d. from P_XY and that there exists B_0 ∈ S^m_d(R) such that p_{Y|X}(y | x) = p_{Y|B_0^T X}(y | B_0^T x).

We define the empirical cross-covariance operator Σ^{(n)}_{YX} by evaluating the cross-covariance operator at the empirical distribution (1/n) Σ_{i=1}^n δ_{X_i} ⊗ δ_{Y_i}. When acting on functions f ∈ H_X and g ∈ H_Y, the operator Σ^{(n)}_{YX} gives the empirical covariance:

    ⟨g, Σ^{(n)}_{YX} f⟩_{H_Y} = (1/n) Σ_{i=1}^n g(Y_i) f(X_i) − ((1/n) Σ_{i=1}^n g(Y_i))((1/n) Σ_{i=1}^n f(X_i)).

Also, for B ∈ S^m_d(R), let Σ^{B(n)}_{YY|X} denote the empirical conditional covariance operator:

    Σ^{B(n)}_{YY|X} = Σ^{(n)}_{YY} − Σ^{B(n)}_{YX} (Σ^{B(n)}_{XX} + ε_n I)^{−1} Σ^{B(n)}_{XY}.    (15)

The regularization term ε_n I (ε_n > 0) is required to enable operator inversion and is thus analogous to Tikhonov regularization [17]. We will see that the regularization term is also needed for consistency.

We now define the KDR estimator B^{(n)} as any minimizer of Tr[Σ^{B(n)}_{YY|X}] on the manifold S^m_d(R); that is, any matrix in S^m_d(R) that minimizes

    Tr[Σ^{(n)}_{YY} − Σ^{B(n)}_{YX} (Σ^{B(n)}_{XX} + ε_n I)^{−1} Σ^{B(n)}_{XY}].    (16)

In view of Eq. (11), this is equivalent to minimizing

    Σ_{a=1}^∞ min_{f∈H^B_X} [ (1/n) Σ_{i=1}^n |(ξ_a(Y_i) − (1/n) Σ_{j=1}^n ξ_a(Y_j)) − (f(X_i) − (1/n) Σ_{j=1}^n f(X_j))|² + ε_n ‖f‖²_{H^B_X} ]

over B ∈ S^m_d(R), where {ξ_a}_{a=1}^∞ is a complete orthonormal system for H_Y.

The KDR contrast function in Eq. (16) can also be expressed in terms of Gram matrices (given a kernel k, the Gram matrix is the n × n matrix whose entries are the evaluations of the kernel on all pairs of the n data points). Let φ^B_i ∈ H^B_X and ψ_i ∈ H_Y (1 ≤ i ≤ n) be the functions defined by

    φ^B_i = k^B_X(·, X_i) − (1/n) Σ_{j=1}^n k^B_X(·, X_j),   ψ_i = k_Y(·, Y_i) − (1/n) Σ_{j=1}^n k_Y(·, Y_j).

Because R(Σ^{B(n)}_{XX}) = N(Σ^{B(n)}_{XX})^⊥ and R(Σ^{(n)}_{YY}) = N(Σ^{(n)}_{YY})^⊥ are spanned by (φ^B_i)_{i=1}^n and (ψ_i)_{i=1}^n, respectively, the trace of Σ^{B(n)}_{YY|X} is equal to that of the matrix representation of Σ^{B(n)}_{YY|X} on the linear hull of (ψ_i)_{i=1}^n. Note that although the vectors (ψ_i)_{i=1}^n are over-complete, the trace of the matrix representation with respect to these vectors is equal to the trace of the operator.

For B ∈ S^m_d(R), the centered Gram matrix G^B_X with respect to the kernel k^B_X is defined by

    (G^B_X)_{ij} = ⟨φ^B_i, φ^B_j⟩_{H^B_X}
                 = k^B_X(X_i, X_j) − (1/n) Σ_{b=1}^n k^B_X(X_i, X_b) − (1/n) Σ_{a=1}^n k^B_X(X_a, X_j) + (1/n²) Σ_{a=1}^n Σ_{b=1}^n k^B_X(X_a, X_b),

and G_Y is defined similarly. By direct calculation, it is easy to obtain

    Σ^{B(n)}_{YY|X} ψ_i = (1/n) Σ_{j=1}^n ψ_j (G_Y)_{ji} − (1/n) Σ_{j=1}^n ψ_j (G^B_X (G^B_X + nε_n I_n)^{−1} G_Y)_{ji}.

It follows that the matrix representation of Σ^{B(n)}_{YY|X} with respect to (ψ_i)_{i=1}^n is (1/n){G_Y − G^B_X (G^B_X + nε_n I_n)^{−1} G_Y}, and its trace is

    Tr[Σ^{B(n)}_{YY|X}] = (1/n) Tr[G_Y − G^B_X (G^B_X + nε_n I_n)^{−1} G_Y] = ε_n Tr[G_Y (G^B_X + nε_n I_n)^{−1}].

Omitting the constant factor, the KDR contrast function in Eq. (16) thus reduces to

    Tr[G_Y (G^B_X + nε_n I_n)^{−1}].    (17)

The KDR method is defined as the optimization of this function over the manifold S^m_d(R).
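For concreteness, the following numpy sketch (an illustration added here; the kernel scales are arbitrary choices, and ε = 0.1 echoes the value used later in the experiments) evaluates the contrast function of Eq. (17) for a candidate matrix B using Gaussian RBF kernels and the centered Gram matrices defined above.

import numpy as np

def centered_gram(Z, sigma2):
    """Centered Gaussian RBF Gram matrix of the points in the rows of Z."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / sigma2)
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return H @ K @ H

def kdr_contrast(B, X, Y, sigma2_x=1.0, sigma2_y=1.0, eps=0.1):
    """KDR contrast Tr[G_Y (G_X^B + n*eps*I)^{-1}] of Eq. (17)."""
    n = X.shape[0]
    GB = centered_gram(X @ B, sigma2_x)          # Gram matrix of the projections B^T X_i
    GY = centered_gram(Y, sigma2_y)
    return np.trace(np.linalg.solve(GB + n * eps * np.eye(n), GY))

# Toy usage: Y depends on X only through its first coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Y = np.sin(X[:, [0]]) + 0.1 * rng.normal(size=(100, 1))
B_good = np.eye(4)[:, [0]]                       # spans the central subspace
B_bad = np.eye(4)[:, [3]]
print(kdr_contrast(B_good, X, Y), kdr_contrast(B_bad, X, Y))  # typically smaller for B_good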

Theorem 4 is the population justification of the KDR method. Note that this derivation imposes no strong assumptions either on the conditional probability of Y given X or on the marginal distributions of X and Y. In particular, it does not require ellipticity of the marginal distribution of X, nor does it require an additive noise model. The response variable Y may be either continuous or discrete. We confirm this general applicability of the KDR method by the numerical results presented in the next section.

Because the contrast function in Eq. (17) is nonconvex, the minimization requires a nonlinear optimization technique; in our experiments we use the steepest descent method with line search. To alleviate potential problems with local optima, we use a continuation method in which the scale parameter σ in the Gaussian RBF kernel exp(−‖x − y‖²/σ²) is gradually decreased during the iterative optimization process. In the numerical examples shown in the next section, we used a fixed number of iterations and decreased σ² linearly from σ² = 100 to σ² = 10 for standardized data with standard deviation 5.0. Since the covariance operator approaches the one given by the linear kernel as σ → ∞, we initialized the matrix B with the solution for the linear kernel, which is solvable as an eigenproblem.
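The following sketch (a simplified illustration added here, not the authors' implementation) shows the overall structure of such a continuation scheme: a gradient step on the objective (approximated by finite differences for brevity, in place of the analytic gradient and line search used in the paper), followed by a QR-based retraction that keeps B on the Stiefel manifold, while σ² is decreased linearly across iterations. The callable objective(B, sigma2) is assumed to return the contrast of Eq. (17), e.g. a wrapper around the kdr_contrast sketch given earlier.

import numpy as np

def retract(B):
    """Map an arbitrary full-rank m x d matrix back to the Stiefel manifold."""
    Q, R = np.linalg.qr(B)
    return Q * np.sign(np.diag(R))                # fix column signs for uniqueness

def numerical_grad(f, B, h=1e-5):
    """Finite-difference gradient of f at B (for illustration only)."""
    G = np.zeros_like(B)
    for idx in np.ndindex(B.shape):
        E = np.zeros_like(B)
        E[idx] = h
        G[idx] = (f(B + E) - f(B - E)) / (2 * h)
    return G

def kdr_descent(objective, B0, sigma2_start=100.0, sigma2_end=10.0,
                n_iter=100, step=0.1):
    """Gradient descent with continuation in the kernel scale sigma^2."""
    B = retract(B0)
    for t in range(n_iter):
        sigma2 = sigma2_start + (sigma2_end - sigma2_start) * t / max(n_iter - 1, 1)
        G = numerical_grad(lambda M: objective(M, sigma2), B)
        B = retract(B - step * G)                 # step, then pull back onto S_d^m(R)
    return B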

In addition to σ, there is another tuning parameter, the regularization coefficient ε_n. As both of these tuning parameters have a similar smoothing effect, it is reasonable to fix one of them and select the other; in our experiments we fixed ε_n = 0.1 as an arbitrary choice and varied σ². While there is no theoretical guarantee for this choice, we observe that the results are generally stable provided the optimization process is successful. There also exist heuristics for choosing kernel parameters in similar RKHS-based dependency analyses; an example is to use the median of the pairwise distances of the data for the parameter σ in the Gaussian RBF kernel [16]. Currently, however, we are not aware of theoretically justified methods for choosing these parameters; this is an important open problem.

The proposed estimator is shown to be consistent as the sample size goes to infinity. We defer the proof to Section 4.

3 Numerical Results

3.1 Simulation studies

In this section we compare the performance of the KDR method with that of several well-known dimension reduction methods. Specifically, we compare to SIR, pHd, and SAVE on synthetic data sets generated by the regressions in Examples 6.2, 6.3, and 6.4 of [22]. The results are evaluated by computing the Frobenius distance between the projection matrix of the estimated subspace and that of the true subspace; this evaluation measure is invariant under change of basis and is equal to

    ‖B_0 B_0^T − B B^T‖_F,

where B_0 and B are matrices in the Stiefel manifold S^m_d(R) representing the true subspace and the estimated subspace, respectively. For the KDR method, a Gaussian RBF kernel exp(−‖z_1 − z_2‖²/c) was used, with c = 2.0 for regression (A) and regression (C) and c = 0.5 for regression (B). The parameter estimate B was updated 100 times by the steepest descent method. The regularization parameter was fixed at ε = 0.1. For SIR and SAVE, we optimized the number of slices for each simulation so as to obtain the best average norm.
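In code, this evaluation measure is a one-line computation; the small helper below (added here for illustration) takes B_0 and B as m × d matrices with orthonormal columns.

import numpy as np

def projection_distance(B0, B):
    """Frobenius distance between the orthogonal projectors onto the two subspaces."""
    return np.linalg.norm(B0 @ B0.T - B @ B.T, ord='fro')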

Regression (A) is given by

    (A)   Y = X_1 / (0.5 + (X_2 + 1.5)²) + (1 + X_2)² + σE,

where X ∼ N(0, I_4) is a four-dimensional explanatory variable, and E ∼ N(0, 1) is independent of X. Thus, the central subspace is spanned by the vectors (1, 0, 0, 0) and (0, 1, 0, 0). For the noise level σ, three different values were used: σ = 0.1, 0.4 and 0.8. We used 100 random replications with 100 samples each. Note that the distribution of the explanatory variable X satisfies the ellipticity assumption, as required by the SIR, SAVE, and pHd methods.
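A data-generating sketch for regression (A) (added here for illustration; it follows the reading of the formula given above, and the function name and default arguments are not from the paper):

import numpy as np

def generate_regression_A(n=100, sigma=0.4, rng=None):
    """One replication of regression (A): X ~ N(0, I_4) with additive noise sigma * E."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.normal(size=(n, 4))
    E = rng.normal(size=n)
    Y = X[:, 0] / (0.5 + (X[:, 1] + 1.5) ** 2) + (1 + X[:, 1]) ** 2 + sigma * E
    return X, Y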

Table 1 shows the mean and the standard deviation of the Frobenius norm over the 100 replications. We see that the KDR method outperforms the other three methods in terms of estimation accuracy. It is also worth noting that in the results presented by Li et al. [22] for their GCR method, the average norm was 0.28, 0.33, and 0.45 for σ = 0.1, 0.4, and 0.8, respectively; again, this is worse than the performance of KDR.

The second regression is given by

    (B)   Y = sin²(πX_2 + 1) + σE,

where X ∈ R⁴ is distributed uniformly on the set [0, 1]⁴ \ {x ∈ R⁴ | x_i ≤ 0.7 (i = 1, 2, 3, 4)}, and E ∼ N(0, 1) is independent noise. The standard deviation σ is fixed at σ = 0.1, 0.2 and 0.3. Note that in this example the distribution of X does not satisfy the ellipticity assumption.

Table 2 shows the results of the simulation experiments for this regression. We see that KDR again outperforms the other methods.

The third regression is given by

    (C)   Y = (1/2)(X_1 − a)² E,

where X ∼ N(0, I_{10}) is a ten-dimensional variable and E ∼ N(0, 1) is independent noise. The parameter a is fixed at a = 0, 0.5 and 1. Note that in this example the conditional probability p(y|x) does not obey an additive noise assumption: the mean of Y is zero and the variance is a quadratic function of X_1. We generated 100 replications of 500 data points each.

The results for KDR and the other methods are shown in Table 3, in which we again confirm that the KDR method yields significantly better performance than the other methods. In this case, pHd fails to find the true subspace; this is due to the fact that pHd is incapable of estimating a direction that appears only in the variance [8]. We note also that the results in [22] show that the contour regression methods SCR and GCR yield average norms larger than 1.3.

Although the estimation of variance structure is generally more difficult than the estimation of mean structure, the KDR method is nonetheless effective at finding the central subspace in this case.
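For completeness, analogous data-generating sketches for regressions (B) and (C) (again added for illustration; the rejection step in (B) reflects one reading of the excluded corner region):

import numpy as np

def generate_regression_B(n=100, sigma=0.1, rng=None):
    """Regression (B): X uniform on [0,1]^4 minus the corner where all x_i <= 0.7."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.empty((0, 4))
    while X.shape[0] < n:
        C = rng.uniform(size=(2 * n, 4))
        C = C[~np.all(C <= 0.7, axis=1)]        # reject points in the excluded corner
        X = np.vstack([X, C])
    X = X[:n]
    Y = np.sin(np.pi * X[:, 1] + 1) ** 2 + sigma * rng.normal(size=n)
    return X, Y

def generate_regression_C(n=500, a=0.5, rng=None):
    """Regression (C): heteroscedastic model Y = 0.5 * (X_1 - a)^2 * E."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.normal(size=(n, 10))
    Y = 0.5 * (X[:, 0] - a) ** 2 * rng.normal(size=n)
    return X, Y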

              KDR            SIR            SAVE           pHd
 σ        NORM    SD     NORM    SD     NORM    SD     NORM    SD
0.1       0.11   0.07    0.55   0.28    0.77   0.35    1.04   0.34
0.4       0.17   0.09    0.60   0.27    0.82   0.34    1.03   0.33
0.8       0.34   0.22    0.69   0.25    0.94   0.35    1.06   0.33

Table 1: Comparison of KDR and other methods for regression (A).

              KDR            SIR            SAVE           pHd
 σ        NORM    SD     NORM    SD     NORM    SD     NORM    SD
0.1       0.05   0.02    0.24   0.10    0.23   0.13    0.43   0.19
0.2       0.11   0.06    0.32   0.15    0.29   0.16    0.51   0.23
0.3       0.13   0.07    0.41   0.19    0.41   0.21    0.63   0.29

Table 2: Comparison of KDR and other methods for regression (B).

              KDR            SIR            SAVE           pHd
 a        NORM    SD     NORM    SD     NORM    SD     NORM    SD
0.0       0.17   0.05    1.83   0.22    0.30   0.07    1.48   0.27
0.5       0.17   0.04    0.58   0.19    0.35   0.08    1.52   0.28
1.0       0.18   0.05    0.30   0.08    0.57   0.20    1.58   0.28

Table 3: Comparison of KDR and other methods for regression (C).

3.2 Applications

We apply the KDR method to two data sets; one is a binary classification problem and the other is a regression with a continuous response variable. These data sets have been used previously in studies of dimension reduction methods.

The first data set that we studied is the Swiss bank notes data, which has been previously studied in the dimension reduction context by Cook and Lee [7], with the data taken from [11]. The problem is that of classifying counterfeit and genuine Swiss bank notes. The data comprise a sample of 100 counterfeit and 100 genuine notes. There are six continuous explanatory variables that represent aspects of the size of a note: length, height on the left, height on the right, distance of the inner frame to the lower border, distance of the inner frame to the upper border, and length of the diagonal. We standardize each of the explanatory variables so that its standard deviation is 5.0.

As we have discussed in the Introduction, many dimension reduction methods (including SIR) are not generally suitable for binary classification problems. Because, among inverse regression methods, the estimated subspace given by SAVE is necessarily larger than that given by pHd and SIR [7], we compared the KDR method only with SAVE for this data set.

Figure 1 shows two-dimensional plots of the data projected onto the subspaces estimated by the KDR method and by SAVE. The figure shows that the results for KDR appear to be robust with respect to the value of the scale parameter a in the Gaussian RBF kernel. (Note that as a goes to infinity, the result approaches that obtained by a linear kernel, since the linear term in the Taylor expansion of the exponential function is dominant.) In the KDR case, using a Gaussian RBF kernel with scale parameter a = 10 and 100, we obtain clear separation of genuine and counterfeit notes. Slightly less separation is obtained for the Gaussian RBF kernel with a = 10,000, for the linear kernel, and for SAVE; in these cases there is an isolated genuine data point that lies close to the class boundary, which is similar to the results using linear discriminant analysis and specification analysis [11]. We see that KDR finds a more effective subspace for separating the two classes than SAVE and the existing analysis. Finally, note that there are two clusters of counterfeit notes in the result of SAVE, while KDR does not show multiple clusters in either class. Although clusters have also been reported in other analyses [11, Section 12], the KDR results suggest that the cluster structure may not be relevant to the classification.

Figure 1: Two-dimensional plots of Swiss bank notes. The crosses and circles show genuine and counterfeit notes, respectively. For the KDR method, the Gaussian RBF kernel exp(−‖z_1 − z_2‖²/a) is used with a = 10, 100 and 10000. For comparison, the plots given by KDR with a linear kernel and by SAVE are shown.

We also analyzed the Evaporation data set, available in the Arc package (http://www.stat.umn.edu/arc/software.html). The data set is concerned with the effect on soil evaporation of various air and soil conditions. The number of explanatory variables is ten: maximum daily soil temperature (Maxst), minimum daily soil temperature (Minst), area under the daily soil temperature curve (Avst), maximum daily air temperature (Maxat), minimum daily air temperature (Minat), average daily air temperature (Avat), maximum daily humidity (Maxh), minimum daily humidity (Minh), area under the daily humidity curve (Avh), and total wind speed in miles/hour (Wind). The response variable is daily soil evaporation (Evap). The data were collected daily during 46 days; thus the number of data points is 46. This data set was studied in the context of contour regression methods for dimension reduction in [22]. We standardize each variable so that the sample variance is equal to 5.0, and use the Gaussian RBF kernel exp(−‖z_1 − z_2‖²/10).

Our analysis yielded an estimated two-dimensional subspace spanned by the vectors:

    KDR1: −0.25 Maxst + 0.32 Minst + 0.00 Avst − 0.28 Maxat − 0.23 Minat − 0.44 Avat + 0.39 Maxh + 0.25 Minh − 0.07 Avh − 0.54 Wind,

    KDR2: 0.09 Maxst − 0.02 Minst + 0.00 Avst + 0.10 Maxat − 0.45 Minat + 0.23 Avat + 0.21 Maxh − 0.41 Minh − 0.71 Avh − 0.05 Wind.

In the first direction, Wind and Avat have large coefficients of the same sign, while both contribute only weakly to the second direction. In the second direction, Avh is dominant.

Figure 2 presents scatter plots of the response Y against each of the first two directions given by the KDR method. Both of these directions show a clear relation with Y. Figure 3 presents the scatter plot of Y versus the two-dimensional subspace found by KDR. The obtained two-dimensional subspace is different from the one given by the existing analysis in [22]; the contour regression method gives a subspace in which the first direction shows a clear monotonic trend, but the second direction suggests a U-shaped pattern. In the result of KDR, we do not see a clear folded pattern. Although without further analysis it is difficult to say which result expresses the statistical dependence more clearly, the plots suggest that the KDR method successfully captured the effective directions for regression.

Figure 2: Two-dimensional representation of the Evaporation data for each of the first two directions.

Figure 3: Three-dimensional representation of the Evaporation data.

4 Consistency of kernel dimension reduction

In this section we prove that the KDR estimator is consistent. Our proof of consistency requires tools from empirical process theory, suitably elaborated to handle the RKHS setting. We establish convergence of the empirical contrast function to the population contrast function under a condition on the regularization coefficient ε_n, and from this result infer the consistency of B^{(n)}.


4.1 Main result

We assume hereafter that Y is a topological space. The Stiefel manifold S^m_d(R) is assumed to be equipped with a distance D which is compatible with the topology of S^m_d(R). It is known that geodesics define such a distance (see, for example, [19, Chapter IV]).

The following technical assumptions are needed to guarantee the consistency of kernel dimension reduction:

(A-1) For any bounded continuous function g on Y, the function

    B ↦ E_X[E_{Y|B^T X}[g(Y) | B^T X]²]

is continuous on S^m_d(R).

(A-2) For B ∈ S^m_d(R), let P_B be the probability distribution of the random variable BB^T X on X. The Hilbert space H^B_X + R is dense in L²(P_B) for any B ∈ S^m_d(R).

(A-3) There exists a measurable function φ : X → R such that E|φ(X)|² < ∞ and the Lipschitz condition

    ‖k_d(B^T x, ·) − k_d(B̃^T x, ·)‖_{H_d} ≤ φ(x) D(B, B̃)

holds for all B, B̃ ∈ S^m_d(R) and x ∈ X.

Theorem 6. Suppose k_d in Eq. (9) is continuous and bounded, and suppose the regularization parameter ε_n in Eq. (15) satisfies

    ε_n → 0,   n^{1/2} ε_n → ∞   (n → ∞).    (18)

Define the set of optimum parameters B^m_d by

    B^m_d = arg min_{B∈S^m_d(R)} Tr[Σ^B_{YY|X}].

Under the assumptions (A-1), (A-2), and (A-3), the set B^m_d is nonempty, and for an arbitrary open set U in S^m_d(R) with B^m_d ⊂ U we have

    lim_{n→∞} Pr(B^{(n)} ∈ U) = 1.

Note that Theorem 6 holds independently of any requirement that the population contrast function characterizes conditional independence. If the additional conditions of Theorem 4 are satisfied, then the estimator converges in probability to the set of sufficient dimension reduction subspaces.

The assumptions (A-1) and (A-2) are used to establish the continuity of Tr[Σ^B_{YY|X}] in Lemma 13, and (A-3) is needed to derive the order of uniform convergence of Σ^{B(n)}_{YY|X} in Lemma 9.

The assumption (A-1) is satisfied in various cases. Let f(x) = E_{Y|X}[g(Y) | X = x], and assume f(x) is continuous; this holds, for example, if the conditional probability density p_{Y|X}(y|x) is bounded and continuous with respect to x. Let C be an element of S^m_{m−d}(R) such that the subspaces spanned by the column vectors of B and C are orthogonal; that is, the m × m matrix (B, C) is an orthogonal matrix. Define random variables U and V by U = B^T X and V = C^T X. If X has the probability density function p_X(x), the probability density function of (U, V) is given by p_{U,V}(u, v) = p_X(Bu + Cv). Consider the situation in which u is given by u = B^T x for B ∈ S^m_d(R) and x ∈ X, and let V_{B,x} = {v ∈ R^{m−d} | BB^T x + Cv ∈ X}. We have

    E[g(Y) | B^T X = B^T x] = ∫_{V_{B,x}} f(BB^T x + Cv) p_X(BB^T x + Cv) dv / ∫_{V_{B,x}} p_X(BB^T x + Cv) dv.

If there exists an integrable function r(v) such that χ_{V_{B,x}}(v) p_X(BB^T x + Cv) ≤ r(v) for all B ∈ S^m_d(R) and x ∈ X, the dominated convergence theorem ensures (A-1). Thus, it is easy to see that a sufficient condition for (A-1) is that X is bounded, p_X(x) is bounded, and p_{Y|X}(y|x) is bounded and continuous in x, conditions that are satisfied by a wide class of distributions.

The assumption (A-2) holds if X is compact and k_d + 1 is a universal kernel on Z. The assumption (A-3) is satisfied by many useful kernels; for example, kernels with the property

    |∂² k_d(z_1, z_2) / ∂z_a ∂z_b| ≤ L‖z_1 − z_2‖   (a, b = 1, 2)

for some L > 0. In particular, Gaussian RBF kernels satisfy this property.

4.2 Proof of the consistency theorem

If the following proposition is shown, Theorem 6 follows straightforwardly by standard arguments establishing the consistency of M-estimators (see, for example, [31, Section 5.2]).

Proposition 7. Under the same assumptions as Theorem 6, the functions Tr[Σ^{B(n)}_{YY|X}] and Tr[Σ^B_{YY|X}] are continuous on S^m_d(R), and

    sup_{B∈S^m_d(R)} |Tr[Σ^{B(n)}_{YY|X}] − Tr[Σ^B_{YY|X}]| → 0   (n → ∞)

in probability.

The proof of Proposition 7 is divided into several lemmas. We decompose sup_B |Tr[Σ^B_{YY|X}] − Tr[Σ^{B(n)}_{YY|X}]| into two parts:

    sup_B |Tr[Σ^B_{YY|X}] − Tr[Σ_YY − Σ^B_{YX} (Σ^B_{XX} + ε_n I)^{−1} Σ^B_{XY}]|

and

    sup_B |Tr[Σ_YY − Σ^B_{YX} (Σ^B_{XX} + ε_n I)^{−1} Σ^B_{XY}] − Tr[Σ^{B(n)}_{YY|X}]|.

Lemmas 8, 9, and 10 establish the convergence of the second part. The convergence of the first part is shown by Lemmas 11–14; in particular, Lemmas 12 and 13 establish the key result that the trace of the population conditional covariance operator is a continuous function of B.

The following lemmas make use of the trace norm and the Hilbert-Schmidt norm of operators. For a discussion of these norms, see [26, Section VI] or [20, Section 30]. Recall that the trace of a positive operator $A$ on a Hilbert space $\mathcal{H}$ is defined by
\[
\mathrm{Tr}[A] = \sum_{i=1}^{\infty} \langle \varphi_i, A\varphi_i \rangle_{\mathcal{H}},
\]
where $\{\varphi_i\}_{i=1}^{\infty}$ is a complete orthonormal system (CONS) of $\mathcal{H}$. A bounded operator $T$ on a Hilbert space $\mathcal{H}$ is called trace class if $\mathrm{Tr}[(T^*T)^{1/2}]$ is finite. The set of all trace class operators on a Hilbert space is a Banach space with the trace norm $\|T\|_{\mathrm{tr}} = \mathrm{Tr}[(T^*T)^{1/2}]$. For a trace class operator $T$ on $\mathcal{H}$, the series $\sum_{i=1}^{\infty} \langle \varphi_i, T\varphi_i \rangle$ converges absolutely for any CONS $\{\varphi_i\}_{i=1}^{\infty}$, and the limit does not depend on the choice of CONS. The limit is called the trace of $T$ and denoted by $\mathrm{Tr}[T]$. It is known that $|\mathrm{Tr}[T]| \le \|T\|_{\mathrm{tr}}$.

A bounded operator $T : \mathcal{H}_1 \to \mathcal{H}_2$, where $\mathcal{H}_1$ and $\mathcal{H}_2$ are Hilbert spaces, is called Hilbert-Schmidt if $\mathrm{Tr}[T^*T] < \infty$, or equivalently, $\sum_{i=1}^{\infty} \|T\varphi_i\|^2_{\mathcal{H}_2} < \infty$ for a CONS $\{\varphi_i\}_{i=1}^{\infty}$ of $\mathcal{H}_1$. The set of all Hilbert-Schmidt operators from $\mathcal{H}_1$ to $\mathcal{H}_2$ is a Hilbert space with the Hilbert-Schmidt inner product
\[
\langle T_1, T_2 \rangle_{\mathrm{HS}} = \sum_{i=1}^{\infty} \langle T_1\varphi_i, T_2\varphi_i \rangle_{\mathcal{H}_2},
\]
where $\{\varphi_i\}_{i=1}^{\infty}$ is a CONS of $\mathcal{H}_1$. Thus, the Hilbert-Schmidt norm $\|T\|_{\mathrm{HS}}$ satisfies $\|T\|^2_{\mathrm{HS}} = \sum_{i=1}^{\infty} \|T\varphi_i\|^2_{\mathcal{H}_2}$.

Obviously, $\|T\| \le \|T\|_{\mathrm{HS}} \le \|T\|_{\mathrm{tr}}$ holds if $T$ is trace class or Hilbert-Schmidt. Recall also that if $A$ is trace class (Hilbert-Schmidt) and $B$ is bounded, then $AB$ and $BA$ are trace class (Hilbert-Schmidt, resp.), with $\|BA\|_{\mathrm{tr}} \le \|B\|\,\|A\|_{\mathrm{tr}}$ and $\|AB\|_{\mathrm{tr}} \le \|B\|\,\|A\|_{\mathrm{tr}}$ ($\|AB\|_{\mathrm{HS}} \le \|B\|\,\|A\|_{\mathrm{HS}}$ and $\|BA\|_{\mathrm{HS}} \le \|B\|\,\|A\|_{\mathrm{HS}}$). If $A : \mathcal{H}_1 \to \mathcal{H}_2$ and $B : \mathcal{H}_2 \to \mathcal{H}_1$ are Hilbert-Schmidt, the product $AB$ is trace class with $\|AB\|_{\mathrm{tr}} \le \|A\|_{\mathrm{HS}}\|B\|_{\mathrm{HS}}$.
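In finite dimensions the three norms above are, respectively, the largest singular value, the Frobenius norm, and the sum of the singular values, so the chain of inequalities can be checked numerically. The following small script is our illustration only, with an arbitrary matrix standing in for the operator:

    import numpy as np

    rng = np.random.default_rng(0)
    T = rng.standard_normal((5, 7))          # a generic finite-dimensional operator
    s = np.linalg.svd(T, compute_uv=False)   # singular values of T

    op_norm = s.max()                        # operator norm  ||T||
    hs_norm = np.sqrt((s ** 2).sum())        # Hilbert-Schmidt (Frobenius) norm ||T||_HS
    tr_norm = s.sum()                        # trace norm ||T||_tr = Tr[(T*T)^{1/2}]

    # The chain ||T|| <= ||T||_HS <= ||T||_tr always holds.
    assert op_norm <= hs_norm + 1e-12 and hs_norm <= tr_norm + 1e-12
    print(op_norm, hs_norm, tr_norm)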

It is known that cross-covariance operators and covariance operators are Hilbert-Schmidt and trace class, respectively, under the assumption Eq. (2) [16, 14]. The Hilbert-Schmidt norm of $\Sigma_{YX}$ is given by
\[
\|\Sigma_{YX}\|^2_{\mathrm{HS}} = \bigl\| E_{YX}\bigl[ (k_{\mathcal{X}}(\cdot, X) - m_X)(k_{\mathcal{Y}}(\cdot, Y) - m_Y) \bigr] \bigr\|^2_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{Y}}}, \tag{19}
\]
where $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{Y}}$ is the direct product of $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$, and the trace norm of $\Sigma_{XX}$ is
\[
\mathrm{Tr}[\Sigma_{XX}] = E_X\bigl[ \| k_{\mathcal{X}}(\cdot, X) - m_X \|^2_{\mathcal{H}_{\mathcal{X}}} \bigr]. \tag{20}
\]
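Both quantities have simple finite-sample analogues in terms of doubly centered Gram matrices: with $\tilde{G} = HGH$ and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$, the standard V-statistic estimates are $\frac{1}{n^2}\mathrm{Tr}[\tilde{G}_X\tilde{G}_Y]$ for $\|\Sigma_{YX}\|^2_{\mathrm{HS}}$ and $\frac{1}{n}\mathrm{Tr}[\tilde{G}_X]$ for $\mathrm{Tr}[\Sigma_{XX}]$ (cf. [16]). The following sketch is our illustration of these formulas with Gaussian kernels and toy data, not code from the paper:

    import numpy as np

    def gram(Z, sigma=1.0):
        # Gaussian RBF Gram matrix k(z_i, z_j) = exp(-||z_i - z_j||^2 / (2 sigma^2))
        sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))

    rng = np.random.default_rng(0)
    n = 200
    X = rng.standard_normal((n, 3))
    Y = np.sin(X[:, :1]) + 0.1 * rng.standard_normal((n, 1))   # toy dependence

    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    GX, GY = H @ gram(X) @ H, H @ gram(Y) @ H     # centered Gram matrices

    hs_norm_sq = np.trace(GX @ GY) / n ** 2       # estimate of ||Sigma_YX||_HS^2, cf. Eq. (19)
    trace_xx   = np.trace(GX) / n                 # estimate of Tr[Sigma_XX], cf. Eq. (20)
    print(hs_norm_sq, trace_xx)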

Lemma 8.
\[
\begin{aligned}
\Bigl| \mathrm{Tr}\bigl[\Sigma^{(n)}_{YY|X}\bigr] - \mathrm{Tr}\bigl[\Sigma_{YY} - \Sigma_{YX}\bigl(\Sigma_{XX} + \varepsilon_n I\bigr)^{-1}\Sigma_{XY}\bigr] \Bigr|
&\le \frac{1}{\varepsilon_n}\Bigl( \bigl\|\Sigma^{(n)}_{YX}\bigr\|_{\mathrm{HS}} + \bigl\|\Sigma_{YX}\bigr\|_{\mathrm{HS}} \Bigr)\bigl\|\Sigma^{(n)}_{YX} - \Sigma_{YX}\bigr\|_{\mathrm{HS}} \\
&\quad + \bigl\|\Sigma_{YY}\bigr\|_{\mathrm{tr}}\,\bigl\|\Sigma^{(n)}_{XX} - \Sigma_{XX}\bigr\|
+ \bigl| \mathrm{Tr}\bigl[\Sigma^{(n)}_{YY} - \Sigma_{YY}\bigr] \bigr|.
\end{aligned}
\]

Proof. Noting that the self-adjoint operator $\Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1}\Sigma_{XY}$ is trace class from $\Sigma_{YX}(\Sigma_{XX} + \varepsilon_n I)^{-1}\Sigma_{XY} \le \Sigma_{YY}$, the left hand side of the assertion is bounded from above by
\[
\bigl| \mathrm{Tr}\bigl[\Sigma^{(n)}_{YY} - \Sigma_{YY}\bigr] \bigr|
+ \bigl| \mathrm{Tr}\bigl[\Sigma^{(n)}_{YX}\bigl(\Sigma^{(n)}_{XX} + \varepsilon_n I\bigr)^{-1}\Sigma^{(n)}_{XY} - \Sigma_{YX}\bigl(\Sigma_{XX} + \varepsilon_n I\bigr)^{-1}\Sigma_{XY}\bigr] \bigr|.
\]
The second term is upper-bounded by
\[
\begin{aligned}
&\bigl| \mathrm{Tr}\bigl[\bigl(\Sigma^{(n)}_{YX} - \Sigma_{YX}\bigr)\bigl(\Sigma^{(n)}_{XX} + \varepsilon_n I\bigr)^{-1}\Sigma^{(n)}_{XY}\bigr] \bigr|
+ \bigl| \mathrm{Tr}\bigl[\Sigma_{YX}\bigl(\Sigma^{(n)}_{XX} + \varepsilon_n I\bigr)^{-1}\bigl(\Sigma^{(n)}_{XY} - \Sigma_{XY}\bigr)\bigr] \bigr| \\
&\qquad + \bigl| \mathrm{Tr}\bigl[\Sigma_{YX}\bigl\{ \bigl(\Sigma^{(n)}_{XX} + \varepsilon_n I\bigr)^{-1} - \bigl(\Sigma_{XX} + \varepsilon_n I\bigr)^{-1} \bigr\}\Sigma_{XY}\bigr] \bigr| \\
&\le \bigl\| \bigl(\Sigma^{(n)}_{YX} - \Sigma_{YX}\bigr)\bigl(\Sigma^{(n)}_{XX} + \varepsilon_n I\bigr)^{-1}\Sigma^{(n)}_{XY} \bigr\|_{\mathrm{tr}}
+ \bigl\| \Sigma_{YX}\bigl(\Sigma^{(n)}_{XX} + \varepsilon_n I\bigr)^{-1}\bigl(\Sigma^{(n)}_{XY} - \Sigma_{XY}\bigr) \bigr\|_{\mathrm{tr}} \\
&\qquad + \Bigl| \mathrm{Tr}\Bigl[ \Sigma_{YX}\bigl(\Sigma_{XX} + \varepsilon_n I\bigr)^{-1/2}
\bigl\{ \bigl(\Sigma_{XX} + \varepsilon_n I\bigr)^{1/2}\bigl(\Sigma^{(n)}_{XX} + \varepsilon_n I\bigr)^{-1}\bigl(\Sigma_{XX} + \varepsilon_n I\bigr)^{1/2} - I \bigr\}
\bigl(\Sigma_{XX} + \varepsilon_n I\bigr)^{-1/2}\Sigma_{XY} \Bigr] \Bigr| \\
&\le \frac{1}{\varepsilon_n}\bigl\|\Sigma^{(n)}_{YX} - \Sigma_{YX}\bigr\|_{\mathrm{HS}}\bigl\|\Sigma^{(n)}_{XY}\bigr\|_{\mathrm{HS}}
+ \frac{1}{\varepsilon_n}\bigl\|\Sigma_{YX}\bigr\|_{\mathrm{HS}}\bigl\|\Sigma^{(n)}_{XY} - \Sigma_{XY}\bigr\|_{\mathrm{HS}} \\
&\qquad + \bigl\| \bigl(\Sigma_{XX} + \varepsilon_n I\bigr)^{1/2}\bigl(\Sigma^{(n)}_{XX} + \varepsilon_n I\bigr)^{-1}\bigl(\Sigma_{XX} + \varepsilon_n I\bigr)^{1/2} - I \bigr\|
\cdot \bigl\| \bigl(\Sigma_{XX} + \varepsilon_n I\bigr)^{-1/2}\Sigma_{XY}\Sigma_{YX}\bigl(\Sigma_{XX} + \varepsilon_n I\bigr)^{-1/2} \bigr\|_{\mathrm{tr}}.
\end{aligned}
\]
In the last line, we use $|\mathrm{Tr}[ABA^*]| \le \|B\|\,\|A^*A\|_{\mathrm{tr}}$ for a Hilbert-Schmidt operator $A$ and a bounded operator $B$. This is confirmed easily by the singular value decomposition of $A$.

Since the spectra of $A^*A$ and $AA^*$ are identical, we have
\[
\begin{aligned}
\bigl\| \bigl(\Sigma_{XX} + \varepsilon_n I\bigr)^{1/2}\bigl(\Sigma^{(n)}_{XX} + \varepsilon_n I\bigr)^{-1}\bigl(\Sigma_{XX} + \varepsilon_n I\bigr)^{1/2} - I \bigr\|
&= \bigl\| \bigl(\Sigma^{(n)}_{XX} + \varepsilon_n I\bigr)^{-1/2}\bigl(\Sigma_{XX} + \varepsilon_n I\bigr)\bigl(\Sigma^{(n)}_{XX} + \varepsilon_n I\bigr)^{-1/2} - I \bigr\| \\
&\le \bigl\| \bigl(\Sigma^{(n)}_{XX} + \varepsilon_n I\bigr)^{-1/2}\bigl(\Sigma_{XX} - \Sigma^{(n)}_{XX}\bigr)\bigl(\Sigma^{(n)}_{XX} + \varepsilon_n I\bigr)^{-1/2} \bigr\| \\
&\le \frac{1}{\varepsilon_n}\bigl\|\Sigma^{(n)}_{XX} - \Sigma_{XX}\bigr\|.
\end{aligned}
\]
The bound $\|(\Sigma_{XX} + \varepsilon_n I)^{-1/2}\Sigma_{XX}^{1/2}V_{XY}\| \le 1$ yields
\[
\bigl\| \bigl(\Sigma_{XX} + \varepsilon_n I\bigr)^{-1/2}\Sigma_{XY}\Sigma_{YX}\bigl(\Sigma_{XX} + \varepsilon_n I\bigr)^{-1/2} \bigr\|_{\mathrm{tr}} \le \bigl\|\Sigma_{YY}\bigr\|_{\mathrm{tr}},
\]
which concludes the proof.

Lemma 9. Under the assumption (A-3),
\[
\sup_{B \in S^m_d(\mathbb{R})} \bigl\|\Sigma^{B(n)}_{XX} - \Sigma^{B}_{XX}\bigr\|_{\mathrm{HS}}, \qquad
\sup_{B \in S^m_d(\mathbb{R})} \bigl\|\Sigma^{B(n)}_{XY} - \Sigma^{B}_{XY}\bigr\|_{\mathrm{HS}}, \qquad \text{and} \qquad
\sup_{B \in S^m_d(\mathbb{R})} \bigl| \mathrm{Tr}\bigl[\Sigma^{B(n)}_{YY} - \Sigma^{B}_{YY}\bigr] \bigr|
\]
are of order $O_p(1/\sqrt{n})$ as $n \to \infty$.

The proof of Lemma 9 is deferred to the Appendix. From Lemmas 8 and

9, the following lemma is obvious.

Lemma 10. If the regularization parameter $(\varepsilon_n)_{n=1}^{\infty}$ satisfies Eq. (18), under the assumption (A-3) we have
\[
\sup_{B \in S^m_d(\mathbb{R})} \Bigl| \mathrm{Tr}\bigl[\Sigma^{B(n)}_{YY|X}\bigr] - \mathrm{Tr}\bigl[\Sigma_{YY} - \Sigma^{B}_{YX}\bigl(\Sigma^{B}_{XX} + \varepsilon_n I\bigr)^{-1}\Sigma^{B}_{XY}\bigr] \Bigr| = O_p\bigl(\varepsilon_n^{-1} n^{-1/2}\bigr),
\]
as $n \to \infty$.

In the next four lemmas, we establish the uniform convergence of $L_{\varepsilon}$ to $L_0$ ($\varepsilon \downarrow 0$), where $L_{\varepsilon}(B)$ is a function on $S^m_d(\mathbb{R})$ defined by
\[
L_{\varepsilon}(B) = \mathrm{Tr}\bigl[\Sigma^{B}_{YX}\bigl(\Sigma^{B}_{XX} + \varepsilon I\bigr)^{-1}\Sigma^{B}_{XY}\bigr]
\]
for $\varepsilon > 0$, and $L_0(B) = \mathrm{Tr}\bigl[\Sigma^{1/2}_{YY}V^{B}_{YX}V^{B}_{XY}\Sigma^{1/2}_{YY}\bigr]$. We begin by establishing pointwise convergence.
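For concreteness, here is a minimal numerical sketch (ours, not the paper's code) of how the population quantity $L_{\varepsilon}(B)$ can be approximated from a sample: plugging empirical covariance operators built from centered feature maps into the definition and using standard Gram-matrix algebra gives $\hat{L}_{\varepsilon}(B) = \frac{1}{n}\mathrm{Tr}\bigl[(\tilde{G}_Z + n\varepsilon I)^{-1}\tilde{G}_Z\tilde{G}_Y\bigr]$ with $Z_i = B^TX_i$; the kernel choice, the toy data, and the function names below are assumptions of the illustration.

    import numpy as np

    def centered_gram(Z, sigma=1.0):
        # Gaussian RBF Gram matrix of Z, doubly centered: H K H with H = I - 11^T/n.
        sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        K = np.exp(-sq / (2 * sigma ** 2))
        H = np.eye(len(Z)) - np.ones((len(Z), len(Z))) / len(Z)
        return H @ K @ H

    def L_eps_hat(B, X, GY, eps, sigma=1.0):
        # Empirical plug-in of L_eps(B) = Tr[Sigma^B_YX (Sigma^B_XX + eps I)^{-1} Sigma^B_XY]:
        # with centered Gram matrices it equals (1/n) Tr[(G_Z + n*eps*I)^{-1} G_Z G_Y].
        n = X.shape[0]
        GZ = centered_gram(X @ B, sigma)
        return np.trace(np.linalg.solve(GZ + n * eps * np.eye(n), GZ @ GY)) / n

    rng = np.random.default_rng(1)
    n, m, d = 100, 4, 1
    X = rng.standard_normal((n, m))
    Y = X[:, :1] ** 2 + 0.1 * rng.standard_normal((n, 1))
    GY = centered_gram(Y)
    B = np.eye(m)[:, :d]                      # a candidate projection matrix
    print(L_eps_hat(B, X, GY, eps=1e-3))

If the empirical conditional covariance operator in Eq. (15) is defined with this regularized inverse, then minimizing $\mathrm{Tr}[\Sigma^{B(n)}_{YY|X}]$ over $B$ corresponds to maximizing $\hat{L}_{\varepsilon_n}(B)$, since $\mathrm{Tr}[\Sigma^{(n)}_{YY}]$ does not depend on $B$.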


Lemma 11. For arbitrary kernels with Eq. (2),
\[
\mathrm{Tr}\bigl[\Sigma_{YX}\bigl(\Sigma_{XX} + \varepsilon I\bigr)^{-1}\Sigma_{XY}\bigr] \to \mathrm{Tr}\bigl[\Sigma^{1/2}_{YY}V_{YX}V_{XY}\Sigma^{1/2}_{YY}\bigr] \qquad (\varepsilon \downarrow 0).
\]

Proof. With a CONS $\{\psi_i\}_{i=1}^{\infty}$ for $\mathcal{H}_{\mathcal{Y}}$, the difference of the right hand side and the left hand side can be written as
\[
\sum_{i=1}^{\infty} \bigl\langle \psi_i, \Sigma^{1/2}_{YY}V_{YX}\bigl\{ I - \Sigma^{1/2}_{XX}(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma^{1/2}_{XX} \bigr\}V_{XY}\Sigma^{1/2}_{YY}\psi_i \bigr\rangle_{\mathcal{H}_{\mathcal{Y}}}.
\]
Since each summand is positive and upper bounded by $\langle \psi_i, \Sigma^{1/2}_{YY}V_{YX}V_{XY}\Sigma^{1/2}_{YY}\psi_i \rangle_{\mathcal{H}_{\mathcal{Y}}}$, and the sum over $i$ is finite, by the dominated convergence theorem it suffices to show
\[
\lim_{\varepsilon \downarrow 0} \bigl\langle \psi, \Sigma^{1/2}_{YY}V_{YX}\bigl\{ I - \Sigma^{1/2}_{XX}(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma^{1/2}_{XX} \bigr\}V_{XY}\Sigma^{1/2}_{YY}\psi \bigr\rangle_{\mathcal{H}_{\mathcal{Y}}} = 0
\]
for each $\psi \in \mathcal{H}_{\mathcal{Y}}$.

Fix arbitrary $\psi \in \mathcal{H}_{\mathcal{Y}}$ and $\delta > 0$. From the fact $\mathcal{R}(V_{XY}) \subset \overline{\mathcal{R}(\Sigma_{XX})}$, there exists $h \in \mathcal{H}_{\mathcal{X}}$ such that $\|V_{XY}\Sigma^{1/2}_{YY}\psi - \Sigma_{XX}h\|_{\mathcal{H}_{\mathcal{X}}} < \delta$. Using the fact $I - \Sigma^{1/2}_{XX}(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma^{1/2}_{XX} = \varepsilon(\Sigma_{XX} + \varepsilon I)^{-1}$, we have
\[
\begin{aligned}
\bigl\| \bigl\{ I - \Sigma^{1/2}_{XX}(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma^{1/2}_{XX} \bigr\}V_{XY}\Sigma^{1/2}_{YY}\psi \bigr\|_{\mathcal{H}_{\mathcal{X}}}
&\le \bigl\| \varepsilon\bigl(\Sigma_{XX} + \varepsilon I\bigr)^{-1}\Sigma_{XX}h \bigr\|_{\mathcal{H}_{\mathcal{X}}}
+ \bigl\| \varepsilon\bigl(\Sigma_{XX} + \varepsilon I\bigr)^{-1}\bigl(V_{XY}\Sigma^{1/2}_{YY}\psi - \Sigma_{XX}h\bigr) \bigr\|_{\mathcal{H}_{\mathcal{X}}} \\
&\le \varepsilon\|h\|_{\mathcal{H}_{\mathcal{X}}} + \delta,
\end{aligned}
\]
which is arbitrarily small if $\varepsilon$ is sufficiently small. This completes the proof.
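In finite dimensions the statement of Lemma 11 reduces to the elementary fact that $\mathrm{Tr}[\Sigma_{YX}(\Sigma_{XX} + \varepsilon I)^{-1}\Sigma_{XY}]$ increases to $\mathrm{Tr}[\Sigma_{YX}\Sigma_{XX}^{+}\Sigma_{XY}]$ as $\varepsilon \downarrow 0$, with $\Sigma_{XX}^{+}$ the pseudo-inverse. The following throwaway check (ours, with an arbitrary joint covariance matrix) illustrates this numerically:

    import numpy as np

    rng = np.random.default_rng(2)
    p, q = 5, 3
    A = rng.standard_normal((p + q, p + q))
    C = A @ A.T                      # a valid joint covariance of (X, Y)
    Cxx, Cxy = C[:p, :p], C[:p, p:]
    Cyx = Cxy.T

    target = np.trace(Cyx @ np.linalg.pinv(Cxx) @ Cxy)   # unregularized limit
    for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
        val = np.trace(Cyx @ np.linalg.solve(Cxx + eps * np.eye(p), Cxy))
        print(eps, val, target)      # val increases toward target as eps decreases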

Lemma 12. Suppose $k_d$ is continuous and bounded. Then, for any $\varepsilon > 0$, the function $L_{\varepsilon}(B)$ is continuous on $S^m_d(\mathbb{R})$.


Proof. By an argument similar to that in the proof of Lemma 11, it suffices to show the continuity of $B \mapsto \langle \psi, \Sigma^{B}_{YX}(\Sigma^{B}_{XX} + \varepsilon I)^{-1}\Sigma^{B}_{XY}\psi \rangle_{\mathcal{H}_{\mathcal{Y}}}$ for each $\psi \in \mathcal{H}_{\mathcal{Y}}$.

Let $J^B_X : \mathcal{H}^B_{\mathcal{X}} \to L^2(P_X)$ and $J_Y : \mathcal{H}_{\mathcal{Y}} \to L^2(P_Y)$ be inclusions. As seen in Proposition 1, the operators $\Sigma^{B}_{YX}$ and $\Sigma^{B}_{XX}$ can be extended to the integral operators $S^{B}_{YX}$ and $S^{B}_{XX}$ on $L^2(P_X)$, respectively, so that $J_Y\Sigma^{B}_{YX} = S^{B}_{YX}J^B_X$ and $J^B_X\Sigma^{B}_{XX} = S^{B}_{XX}J^B_X$. It is not difficult to see also $J^B_X(\Sigma^{B}_{XX} + \varepsilon I)^{-1} = (S^{B}_{XX} + \varepsilon I)^{-1}J^B_X$ for $\varepsilon > 0$. These relations yield
\[
\bigl\langle \psi, \Sigma^{B}_{YX}\bigl(\Sigma^{B}_{XX} + \varepsilon I\bigr)^{-1}\Sigma^{B}_{XY}\psi \bigr\rangle_{\mathcal{H}_{\mathcal{Y}}}
= E_{XY}\Bigl[ \psi(Y)\bigl(\bigl(S^{B}_{XX} + \varepsilon I\bigr)^{-1}S^{B}_{XY}\psi\bigr)(X) \Bigr]
- E_Y[\psi(Y)]\, E_X\Bigl[ \bigl(\bigl(S^{B}_{XX} + \varepsilon I\bigr)^{-1}S^{B}_{XY}\psi\bigr)(X) \Bigr],
\]
where $J_Y\psi$ is identified with $\psi$. The assertion is obtained if we prove that the operators $S^{B}_{XY}$ and $(S^{B}_{XX} + \varepsilon I)^{-1}$ are continuous with respect to $B$ in operator norm. To see this, let $\tilde{X}$ be identically and independently distributed with $X$. We have
\[
\begin{aligned}
\bigl\| \bigl(S^{B}_{XY} - S^{B_0}_{XY}\bigr)\psi \bigr\|^2_{L^2(P_X)}
&= E_X\Bigl[ \mathrm{Cov}_{Y\tilde{X}}\bigl[ k^B_{\mathcal{X}}(X, \tilde{X}) - k^{B_0}_{\mathcal{X}}(X, \tilde{X}),\ \psi(Y) \bigr]^2 \Bigr] \\
&\le E_X\Bigl[ \mathrm{Var}_{\tilde{X}}\bigl[ k_d(B^TX, B^T\tilde{X}) - k_d(B_0^TX, B_0^T\tilde{X}) \bigr]\, \mathrm{Var}_Y[\psi(Y)] \Bigr] \\
&\le E_X E_{\tilde{X}}\Bigl[ \bigl( k_d(B^TX, B^T\tilde{X}) - k_d(B_0^TX, B_0^T\tilde{X}) \bigr)^2 \Bigr]\, \|\psi\|^2_{L^2(P_Y)},
\end{aligned}
\]
from which the continuity of $B \mapsto S^{B}_{XY}$ is obtained by the continuity and boundedness of $k_d$. The continuity of $(S^{B}_{XX} + \varepsilon I)^{-1}$ is shown by $\|(S^{B}_{XX} + \varepsilon I)^{-1} - (S^{B_0}_{XX} + \varepsilon I)^{-1}\| = \|(S^{B}_{XX} + \varepsilon I)^{-1}(S^{B_0}_{XX} - S^{B}_{XX})(S^{B_0}_{XX} + \varepsilon I)^{-1}\| \le \frac{1}{\varepsilon^2}\|S^{B_0}_{XX} - S^{B}_{XX}\|$.


To establish the continuity of $L_0(B) = \mathrm{Tr}\bigl[\Sigma^{B}_{YX}\Sigma^{B\,-1}_{XX}\Sigma^{B}_{XY}\bigr]$, the argument in the proof of Lemma 12 cannot be applied, because $\Sigma^{B\,-1}_{XX}$ is not bounded in general. The assumptions (A-1) and (A-2) are used for the proof.

Lemma 13. Suppose $k_d$ is continuous and bounded. Under the assumptions (A-1) and (A-2), the function $L_0(B)$ is continuous on $S^m_d(\mathbb{R})$.

Proof. By the same argument as in the proof of Lemma 11, it suffices to establish the continuity of $B \mapsto \langle \psi, \Sigma^{B}_{YY|X}\psi \rangle$ for $\psi \in \mathcal{H}_{\mathcal{Y}}$. From Proposition 2, the proof is completed if the continuity of the map
\[
B \mapsto \inf_{f \in \mathcal{H}^B_{\mathcal{X}}} \mathrm{Var}_{XY}\bigl[ g(Y) - f(X) \bigr]
\]
is proved for any continuous and bounded function $g$.

Since $f(x)$ depends only on $B^Tx$ for any $f \in \mathcal{H}^B_{\mathcal{X}}$, under the assumption (A-2), we use the same argument as in the proof of Proposition 3 to obtain
\[
\begin{aligned}
\inf_{f \in \mathcal{H}^B_{\mathcal{X}}} \mathrm{Var}_{XY}\bigl[ g(Y) - f(X) \bigr]
&= \inf_{f \in \mathcal{H}^B_{\mathcal{X}}} \mathrm{Var}_X\bigl[ E_{Y|BB^TX}[g(Y)\mid BB^TX] - f(X) \bigr] + E_X\bigl[ \mathrm{Var}_{Y|BB^TX}[g(Y)\mid BB^TX] \bigr] \\
&= E_Y[g(Y)^2] - E_X\bigl[ E_{Y|B^TX}[g(Y)\mid B^TX]^2 \bigr],
\end{aligned}
\]
which is a continuous function of $B \in S^m_d(\mathbb{R})$ from Assumption (A-1).

Lemma 14. Suppose that $k_d$ is continuous and bounded, and that $\varepsilon_n$ converges to zero as $n$ goes to infinity. Under the assumptions (A-1) and (A-2), we have
\[
\sup_{B \in S^m_d(\mathbb{R})} \Bigl| \mathrm{Tr}\bigl[\Sigma^{B}_{YY|X}\bigr] - \mathrm{Tr}\bigl[\Sigma_{YY} - \Sigma^{B}_{YX}\bigl(\Sigma^{B}_{XX} + \varepsilon_n I\bigr)^{-1}\Sigma^{B}_{XY}\bigr] \Bigr| \to 0 \qquad (n \to \infty).
\]


Proof. From Lemmas 11, 12 and 13, the continuous function $\mathrm{Tr}\bigl[\Sigma_{YY} - \Sigma^{B}_{YX}\bigl(\Sigma^{B}_{XX} + \varepsilon_n I\bigr)^{-1}\Sigma^{B}_{XY}\bigr]$ converges to the continuous function $\mathrm{Tr}\bigl[\Sigma^{B}_{YY|X}\bigr]$ for every $B \in S^m_d(\mathbb{R})$. Because this convergence is monotone and $S^m_d(\mathbb{R})$ is compact, it is necessarily uniform by Dini's theorem.

The proof of Proposition 7 is now easily obtained.

Proof of Proposition 7. Lemmas 12 and 13 show the continuity of $\mathrm{Tr}\bigl[\Sigma^{B(n)}_{YY|X}\bigr]$ and $\mathrm{Tr}\bigl[\Sigma^{B}_{YY|X}\bigr]$. Lemmas 10 and 14 prove the uniform convergence.

5 Conclusions

This paper has presented KDR, a new method for sufficient dimension reduction in regression. The method is based on a characterization of conditional independence using covariance operators on reproducing kernel Hilbert spaces. This characterization is not restricted to first- or second-order conditional moments, but exploits high-order moments in the estimation of the central subspace. The KDR method is widely applicable; in contrast to most of the existing literature on SDR, it does not impose strong assumptions on the probability distribution of the covariate vector $X$. It is also applicable to problems in which the response $Y$ is discrete.

We have developed asymptotic theory for the KDR estimator, resulting in a proof of its consistency under weak conditions. The proof of consistency rests on a result establishing the uniform convergence of an empirical process in a Hilbert space. In particular, we have established the rate $O_p(n^{-1/2})$ for uniform convergence, paralleling the results for ordinary real-valued empirical processes.

We have not yet developed distribution theory for the KDR method, and

have left open the important problem of inferring the dimensionality of the

central subspace. Our proof techniques do not straightforwardly extend to

yield the asymptotic distribution of the KDR estimator, and new techniques

may be required.

It should be noted, however, that inference of the dimensionality of the

central subspace is not necessary for many of the applications of SDR. In

particular, SDR is often used in the context of graphical exploration of data,

where a data analyst may wish to explore views of varying dimensionality.

Also, in high-dimensional prediction problems of the kind studied in statisti-

cal machine learning, dimension reduction may be carried out in the context

of predictive modeling, in which case cross-validation and related techniques

may be used to choose the dimensionality.

Finally, while we have focused our discussion on the central subspace as the object of inference, it is also worth noting that KDR applies even to situations in which a central subspace does not exist. As we have shown, the KDR estimate converges to the set of projection matrices that satisfy Eq. (1); this result holds regardless of the existence of a central subspace. That is, if the intersection of dimension-reduction subspaces is not itself a dimension-reduction subspace, but the dimensionality for KDR is chosen large enough that subspaces satisfying Eq. (1) exist, then KDR will converge to one of those subspaces.

A Uniform convergence of cross-covariance operators

In this appendix we present a proof of Lemma 9. The proof involves the

use of random elements in a Hilbert space [30, 3]. Let H be a Hilbert space

equipped with a Borel σ-field. A random element in the Hilbert space H is

a measurable map F : Ω → H from a measurable space (Ω,S). If H is an

RKHS on a measurable set X with a measurable positive definite kernel k,

a random variable X in X defines a random element in H by k(·, X).

A random element $F$ in a Hilbert space $\mathcal{H}$ is said to have strong order $p$ ($0 < p < \infty$) if $E\|F\|^p$ is finite. For a random element $F$ of strong order one, the expectation of $F$, which is defined as the element $m_F \in \mathcal{H}$ such that $\langle m_F, g \rangle_{\mathcal{H}} = E[\langle F, g \rangle_{\mathcal{H}}]$ for all $g \in \mathcal{H}$, is denoted by $E[F]$. With this notation, the interchange of the expectation and the inner product is justified: $\langle E[F], g \rangle_{\mathcal{H}} = E[\langle F, g \rangle_{\mathcal{H}}]$. Note also that for independent random elements $F$ and $G$ of strong order two, the relation
\[
E[\langle F, G \rangle_{\mathcal{H}}] = \langle E[F], E[G] \rangle_{\mathcal{H}}
\]
holds.

Let $(X, Y)$ be a random vector on $\mathcal{X} \times \mathcal{Y}$ with law $P_{XY}$, and let $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$ be the RKHSs with positive definite kernels $k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$, respectively, which satisfy Eq. (2). The random element $k_{\mathcal{X}}(\cdot, X)$ has strong order two, and $E[k_{\mathcal{X}}(\cdot, X)]$ equals $m_X$, where $m_X$ is given by Eq. (4). The random element $k_{\mathcal{X}}(\cdot, X)k_{\mathcal{Y}}(\cdot, Y)$ in the direct product $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{Y}}$ has strong order one. Define the zero mean random elements $F = k_{\mathcal{X}}(\cdot, X) - E[k_{\mathcal{X}}(\cdot, X)]$ and $G = k_{\mathcal{Y}}(\cdot, Y) - E[k_{\mathcal{Y}}(\cdot, Y)]$.

For an i.i.d. sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ on $\mathcal{X} \times \mathcal{Y}$ with law $P_{XY}$, define random elements $F_i = k_{\mathcal{X}}(\cdot, X_i) - E[k_{\mathcal{X}}(\cdot, X)]$ and $G_i = k_{\mathcal{Y}}(\cdot, Y_i) - E[k_{\mathcal{Y}}(\cdot, Y)]$. Then, $F, F_1, \ldots, F_n$ and $G, G_1, \ldots, G_n$ are zero mean i.i.d. random elements in $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$, respectively. In the following, the notation $\mathcal{F} = \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{Y}}$ is used for simplicity.

As shown in the proof of Lemma 4 in [14], we have
\[
\bigl\|\Sigma^{(n)}_{YX} - \Sigma_{YX}\bigr\|_{\mathrm{HS}}
= \Bigl\| \frac{1}{n}\sum_{i=1}^{n}\Bigl( F_i - \frac{1}{n}\sum_{j=1}^{n}F_j \Bigr)\Bigl( G_i - \frac{1}{n}\sum_{j=1}^{n}G_j \Bigr) - E[FG] \Bigr\|_{\mathcal{F}},
\]
which provides a bound
\[
\sup_{B \in S^m_d(\mathbb{R})} \bigl\|\Sigma^{B(n)}_{YX} - \Sigma^{B}_{YX}\bigr\|_{\mathrm{HS}}
\le \sup_{B \in S^m_d(\mathbb{R})} \Bigl\| \frac{1}{n}\sum_{i=1}^{n}\bigl( F^B_i G_i - E[F^B G] \bigr) \Bigr\|_{\mathcal{F}^B}
+ \sup_{B \in S^m_d(\mathbb{R})} \Bigl\| \frac{1}{n}\sum_{j=1}^{n}F^B_j \Bigr\|_{\mathcal{H}^B_{\mathcal{X}}} \Bigl\| \frac{1}{n}\sum_{j=1}^{n}G_j \Bigr\|_{\mathcal{H}_{\mathcal{Y}}}, \tag{21}
\]
where the $F^B_i$ are defined with the kernel $k^B$, and $\mathcal{F}^B = \mathcal{H}^B_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{Y}}$. Also, Eq. (20) implies
\[
\mathrm{Tr}\bigl[\Sigma^{(n)}_{XX} - \Sigma_{XX}\bigr]
= \frac{1}{n}\sum_{i=1}^{n}\Bigl\| F_i - \frac{1}{n}\sum_{j=1}^{n}F_j \Bigr\|^2_{\mathcal{H}_{\mathcal{X}}} - E\|F\|^2_{\mathcal{H}_{\mathcal{X}}}
= \frac{1}{n}\sum_{i=1}^{n}\|F_i\|^2_{\mathcal{H}_{\mathcal{X}}} - E\|F\|^2_{\mathcal{H}_{\mathcal{X}}} - \Bigl\| \frac{1}{n}\sum_{i=1}^{n}F_i \Bigr\|^2_{\mathcal{H}_{\mathcal{X}}},
\]
from which we have
\[
\sup_{B \in S^m_d(\mathbb{R})} \bigl| \mathrm{Tr}\bigl[\Sigma^{B(n)}_{XX} - \Sigma^{B}_{XX}\bigr] \bigr|
\le \sup_{B \in S^m_d(\mathbb{R})} \Bigl| \frac{1}{n}\sum_{i=1}^{n}\|F^B_i\|^2_{\mathcal{H}^B_{\mathcal{X}}} - E\|F^B\|^2_{\mathcal{H}^B_{\mathcal{X}}} \Bigr|
+ \sup_{B \in S^m_d(\mathbb{R})} \Bigl\| \frac{1}{n}\sum_{i=1}^{n}F^B_i \Bigr\|^2_{\mathcal{H}^B_{\mathcal{X}}}. \tag{22}
\]
It follows that Lemma 9 is proved if all four terms on the right hand side of Eqs. (21) and (22) are of order $O_p(1/\sqrt{n})$.

Hereafter, the kernel $k_d$ is assumed to be bounded. We begin by considering the first term on the right hand side of Eq. (22). This is the supremum of a process which consists of the real-valued random variables $\|F^B_i\|^2_{\mathcal{H}^B_{\mathcal{X}}}$. Let $U^B$ be a random element in $\mathcal{H}_d$ defined by
\[
U^B = k_d(\cdot, B^TX) - E[k_d(\cdot, B^TX)],
\]
and let $C > 0$ be a constant such that $|k_d(z, z)| \le C^2$ for all $z \in \mathcal{Z}$. From $\|U^B\|_{\mathcal{H}_d} \le 2C$, we have for $B, \tilde{B} \in S^m_d(\mathbb{R})$
\[
\bigl| \|F^B\|^2_{\mathcal{H}^B_{\mathcal{X}}} - \|F^{\tilde{B}}\|^2_{\mathcal{H}^{\tilde{B}}_{\mathcal{X}}} \bigr|
= \bigl| \langle U^B - U^{\tilde{B}}, U^B + U^{\tilde{B}} \rangle_{\mathcal{H}_d} \bigr|
\le \|U^B - U^{\tilde{B}}\|_{\mathcal{H}_d}\, \|U^B + U^{\tilde{B}}\|_{\mathcal{H}_d}
\le 4C\|U^B - U^{\tilde{B}}\|_{\mathcal{H}_d}.
\]
The above inequality, combined with the bound
\[
\|U^B - U^{\tilde{B}}\|_{\mathcal{H}_d} \le 2\phi(x)\, D(B, \tilde{B}) \tag{23}
\]
obtained from Assumption (A-3), provides a Lipschitz condition $\bigl| \|F^B\|^2_{\mathcal{H}^B_{\mathcal{X}}} - \|F^{\tilde{B}}\|^2_{\mathcal{H}^{\tilde{B}}_{\mathcal{X}}} \bigr| \le 8C\phi(x)D(B, \tilde{B})$, which works as a sufficient condition for the uniform central limit theorem [31, Example 19.7]. This yields
\[
\sup_{B \in S^m_d(\mathbb{R})} \Bigl| \frac{1}{n}\sum_{i=1}^{n}\|F^B_i\|^2_{\mathcal{H}^B_{\mathcal{X}}} - E\|F^B\|^2_{\mathcal{H}^B_{\mathcal{X}}} \Bigr| = O_p(1/\sqrt{n}).
\]

Our approach to the other three terms is based on a treatment of empirical processes in a Hilbert space. For $B \in S^m_d(\mathbb{R})$, let $U^B_i = k_d(\cdot, B^TX_i) - E[k_d(\cdot, B^TX)]$ be a random element in $\mathcal{H}_d$. Then the relation $\langle k^B(\cdot, x), k^B(\cdot, \tilde{x}) \rangle_{\mathcal{H}^B_{\mathcal{X}}} = k_d(B^Tx, B^T\tilde{x}) = \langle k_d(\cdot, B^Tx), k_d(\cdot, B^T\tilde{x}) \rangle_{\mathcal{H}_d}$ implies
\[
\Bigl\| \frac{1}{n}\sum_{j=1}^{n}F^B_j \Bigr\|_{\mathcal{H}^B_{\mathcal{X}}} = \Bigl\| \frac{1}{n}\sum_{j=1}^{n}U^B_j \Bigr\|_{\mathcal{H}_d}, \tag{24}
\]
\[
\Bigl\| \frac{1}{n}\sum_{j=1}^{n}F^B_j G_j - E[F^B G] \Bigr\|_{\mathcal{H}^B_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{Y}}}
= \Bigl\| \frac{1}{n}\sum_{j=1}^{n}U^B_j G_j - E[U^B G] \Bigr\|_{\mathcal{H}_d \otimes \mathcal{H}_{\mathcal{Y}}}. \tag{25}
\]
Note also that the assumption (A-3) gives
\[
\|U^B G - U^{\tilde{B}} G\|_{\mathcal{H}_d \otimes \mathcal{H}_{\mathcal{Y}}} \le 2\sqrt{k_{\mathcal{Y}}(y, y)}\, \phi(x)\, D(B, \tilde{B}). \tag{26}
\]

From Eqs. (23), (24), (25), and (26), the proof of Lemma 9 is completed by the following proposition:

Proposition 15. Let $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ be a measurable space, let $\Theta$ be a compact metric space with distance $D$, and let $\mathcal{H}$ be a Hilbert space. Suppose that $X, X_1, \ldots, X_n$ are i.i.d. random variables on $\mathcal{X}$, and suppose $F : \mathcal{X} \times \Theta \to \mathcal{H}$ is a Borel measurable map. If $\sup_{\theta \in \Theta}\|F(x; \theta)\|_{\mathcal{H}} < \infty$ for all $x \in \mathcal{X}$ and there exists a measurable function $\phi : \mathcal{X} \to \mathbb{R}$ such that $E[\phi(X)^2] < \infty$ and
\[
\|F(x; \theta_1) - F(x; \theta_2)\|_{\mathcal{H}} \le \phi(x)\, D(\theta_1, \theta_2) \qquad (\forall\, \theta_1, \theta_2 \in \Theta), \tag{27}
\]
then we have
\[
\sup_{\theta \in \Theta} \Bigl\| \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bigl( F(X_i; \theta) - E[F(X; \theta)] \bigr) \Bigr\|_{\mathcal{H}} = O_p(1) \qquad (n \to \infty).
\]


The proof of Proposition 15 is similar to that for a real-valued random process, and is divided into several lemmas.

I.i.d. random variables $\sigma_1, \ldots, \sigma_n$ taking values in $\{+1, -1\}$ with equal probability are called Rademacher variables. The following concentration inequality is known for a Rademacher average in a Banach space:

Proposition 16. Let $a_1, \ldots, a_n$ be elements in a Banach space, and let $\sigma_1, \ldots, \sigma_n$ be Rademacher variables. Then, for every $t > 0$,
\[
\Pr\Bigl( \Bigl\| \sum_{i=1}^{n}\sigma_i a_i \Bigr\| > t \Bigr) \le 2\exp\Bigl( -\frac{t^2}{32\sum_{i=1}^{n}\|a_i\|^2} \Bigr).
\]

Proof. See [21, Theorem 4.7 and the remark thereafter].
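As a quick sanity check (ours, and no substitute for the proof in [21]), the inequality can be compared with an empirical tail probability for Rademacher sums of fixed vectors in a finite-dimensional Euclidean space; the threshold and the data below are arbitrary choices of the illustration:

    import numpy as np

    rng = np.random.default_rng(3)
    n, dim, trials = 50, 10, 20000
    a = rng.standard_normal((n, dim))             # fixed elements a_1, ..., a_n
    t = 8.0 * np.sqrt((a ** 2).sum())             # threshold on the scale of (sum ||a_i||^2)^{1/2}

    sigma = rng.choice([-1.0, 1.0], size=(trials, n))    # Rademacher signs
    norms = np.linalg.norm(sigma @ a, axis=1)            # || sum_i sigma_i a_i || for each trial

    empirical = (norms > t).mean()
    bound = 2 * np.exp(-t ** 2 / (32 * (a ** 2).sum()))
    print(empirical, bound)   # the empirical tail stays far below the (loose) bound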

With Proposition 16, the following exponential inequality is obtained

with a slight modification of the standard symmetrization argument for em-

pirical processes.

Lemma 17. Let $X, X_1, \ldots, X_n$ and $\mathcal{H}$ be as in Proposition 15, and denote $(X_1, \ldots, X_n)$ by $\mathbf{X}_n$. Let $F : \mathcal{X} \to \mathcal{H}$ be a Borel measurable map with $E\|F(X)\|^2_{\mathcal{H}} < \infty$. For a positive number $M$ such that $E\|F(X)\|^2_{\mathcal{H}} < M$, define an event $A_n$ by $\bigl\{ \frac{1}{n}\sum_{i=1}^{n}\|F(X_i)\|^2 \le M \bigr\}$. Then, for every $t > 0$ and sufficiently large $n$,
\[
\Pr\Bigl( \Bigl\{ \mathbf{X}_n \,\Bigm|\, \Bigl\| \frac{1}{n}\sum_{i=1}^{n}\bigl( F(X_i) - E[F(X)] \bigr) \Bigr\|_{\mathcal{H}} > t \Bigr\} \cap A_n \Bigr)
\le 8\exp\Bigl( -\frac{nt^2}{1024M} \Bigr).
\]

Proof. First, note that for any sufficiently large $n$ we have $\Pr(A_n) \ge \frac{3}{4}$ and $\Pr\bigl( \bigl\| \frac{1}{n}\sum_{i=1}^{n}(F(X_i) - E[F(X)]) \bigr\| \le \frac{t}{2} \bigr) \ge \frac{3}{4}$. We consider only such $n$ in the following. Let $\tilde{\mathbf{X}}_n$ be an independent copy of $\mathbf{X}_n$, and let $\tilde{A}_n = \bigl\{ \tilde{\mathbf{X}}_n \,\bigm|\, \frac{1}{n}\sum_{i=1}^{n}\|F(\tilde{X}_i)\|^2 \le M \bigr\}$. The obvious inequality
\[
\begin{aligned}
&\Pr\Bigl( \Bigl\{ \mathbf{X}_n \,\Bigm|\, \Bigl\| \frac{1}{n}\sum_{i=1}^{n}\bigl( F(X_i) - E[F(X)] \bigr) \Bigr\|_{\mathcal{H}} > t \Bigr\} \cap A_n \Bigr)
\times \Pr\Bigl( \Bigl\{ \tilde{\mathbf{X}}_n \,\Bigm|\, \Bigl\| \frac{1}{n}\sum_{i=1}^{n}\bigl( F(\tilde{X}_i) - E[F(X)] \bigr) \Bigr\|_{\mathcal{H}} \le \frac{t}{2} \Bigr\} \cap \tilde{A}_n \Bigr) \\
&\qquad \le \Pr\Bigl( \Bigl\{ (\mathbf{X}_n, \tilde{\mathbf{X}}_n) \,\Bigm|\, \Bigl\| \frac{1}{n}\sum_{i=1}^{n}\bigl( F(X_i) - F(\tilde{X}_i) \bigr) \Bigr\|_{\mathcal{H}} > \frac{t}{2} \Bigr\} \cap A_n \cap \tilde{A}_n \Bigr)
\end{aligned}
\]
and the fact that $B_n := \bigl\{ (\mathbf{X}_n, \tilde{\mathbf{X}}_n) \,\bigm|\, \frac{1}{2n}\sum_{i=1}^{n}\bigl( \|F(X_i)\|^2 + \|F(\tilde{X}_i)\|^2 \bigr) \le M \bigr\}$ includes $A_n \cap \tilde{A}_n$ give a symmetrized bound
\[
\Pr\Bigl( \Bigl\{ \mathbf{X}_n \,\Bigm|\, \Bigl\| \frac{1}{n}\sum_{i=1}^{n}\bigl( F(X_i) - E[F(X)] \bigr) \Bigr\|_{\mathcal{H}} > t \Bigr\} \cap A_n \Bigr)
\le 2\Pr\Bigl( \Bigl\{ (\mathbf{X}_n, \tilde{\mathbf{X}}_n) \,\Bigm|\, \Bigl\| \frac{1}{n}\sum_{i=1}^{n}\bigl( F(X_i) - F(\tilde{X}_i) \bigr) \Bigr\|_{\mathcal{H}} > \frac{t}{2} \Bigr\} \cap B_n \Bigr).
\]
Introducing Rademacher variables $\sigma_1, \ldots, \sigma_n$, the right hand side is equal to
\[
2\Pr\Bigl( \Bigl\{ (\mathbf{X}_n, \tilde{\mathbf{X}}_n, \{\sigma_i\}) \,\Bigm|\, \Bigl\| \frac{1}{n}\sum_{i=1}^{n}\sigma_i\bigl( F(X_i) - F(\tilde{X}_i) \bigr) \Bigr\|_{\mathcal{H}} > \frac{t}{2} \Bigr\} \cap B_n \Bigr),
\]
which is upper-bounded by
\[
4\Pr\Bigl( \Bigl\| \frac{1}{n}\sum_{i=1}^{n}\sigma_i F(X_i) \Bigr\|_{\mathcal{H}} > \frac{t}{4} \ \text{ and } \ \frac{1}{2n}\sum_{i=1}^{n}\|F(X_i)\|^2_{\mathcal{H}} \le M \Bigr)
= 4E_{\mathbf{X}_n}\Bigl[ \Pr\Bigl( \Bigl\| \frac{1}{n}\sum_{i=1}^{n}\sigma_i F(X_i) \Bigr\|_{\mathcal{H}} > \frac{t}{4} \,\Bigm|\, \mathbf{X}_n \Bigr) 1_{\{\mathbf{X}_n \in C_n\}} \Bigr],
\]
where $C_n = \bigl\{ \mathbf{X}_n \,\bigm|\, \frac{1}{n}\sum_{i=1}^{n}\|F(X_i)\|^2_{\mathcal{H}} \le 2M \bigr\}$. From Proposition 16, the last line is upper-bounded by
\[
4\exp\Bigl( -\frac{(nt/4)^2}{32\sum_{i=1}^{n}\|F(X_i)\|^2} \Bigr) \le 4\exp\Bigl( -\frac{nt^2}{1024M} \Bigr).
\]


Let $\Theta$ be a set with semimetric $d$. For any $\delta > 0$, the covering number $N(\delta, d, \Theta)$ is the smallest $m \in \mathbb{N}$ for which there exist $m$ points $\theta_1, \ldots, \theta_m$ in $\Theta$ such that $\min_{1 \le i \le m} d(\theta, \theta_i) \le \delta$ holds for any $\theta \in \Theta$. We write $N(\delta)$ for $N(\delta, d, \Theta)$ if there is no confusion. For $\delta > 0$, the covering integral $J(\delta)$ for $\Theta$ is defined by
\[
J(\delta) = \int_0^{\delta} \bigl( 8\log\bigl( N(u)^2/u \bigr) \bigr)^{1/2} du.
\]
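To see when this integral is finite, consider (as an illustration on our part) a compact $d_0$-dimensional parameter set whose covering numbers grow polynomially, say $N(u) \le (C/u)^{d_0}$ for small $u > 0$; this is the typical situation for the compact manifold $S^m_d(\mathbb{R})$. Then, with $C' = \max(C, 1)$,
\[
\int_0^{\delta} \bigl( 8\log\bigl( N(u)^2/u \bigr) \bigr)^{1/2} du
\le \int_0^{\delta} \bigl( 8(2d_0 + 1)\log(C'/u) \bigr)^{1/2} du < \infty,
\]
since $\sqrt{\log(1/u)}$ is integrable near $u = 0$; hence $J(\delta)$ is finite for every $\delta > 0$.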

The chaining lemma [25], which plays a crucial role in the uniform central

limit theorem, is readily extendable to a random process in a Banach space.

Lemma 18 (Chaining Lemma). Let $\Theta$ be a set with semimetric $d$, and let $\{Z(\theta) \mid \theta \in \Theta\}$ be a family of random elements in a Banach space. Suppose $\Theta$ has a finite covering integral $J(\delta)$ for $0 < \delta < 1$, and suppose there exists a positive constant $R > 0$ such that for all $\theta, \eta \in \Theta$ and $t > 0$ the inequality
\[
\Pr\bigl( \|Z(\theta) - Z(\eta)\| > t\, d(\theta, \eta) \bigr) \le 8\exp\Bigl( -\frac{1}{2R}t^2 \Bigr)
\]
holds. Then, there exists a countable subset $\Theta^*$ of $\Theta$ such that for any $0 < \varepsilon < 1$
\[
\Pr\Bigl( \sup_{\theta, \eta \in \Theta^*,\, d(\theta, \eta) \le \varepsilon} \|Z(\theta) - Z(\eta)\| > 26\, R\, J(\varepsilon) \Bigr) \le 2\varepsilon
\]
holds. If $Z(\theta)$ has continuous sample paths, then $\Theta^*$ can be replaced by $\Theta$.

Proof. The proof of the chaining lemma for a real-valued random process does not use any special properties of the real numbers beyond those of the norm (absolute value) of $Z(\theta)$, so the proof applies directly to a process in a Banach space. See [25, Section VII.2].


Proof of Proposition 15. Note that Eq. (27) means
\[
\Bigl\| \frac{1}{n}\sum_{i=1}^{n}\bigl( F(X_i; \theta_1) - F(X_i; \theta_2) \bigr) \Bigr\|^2_{\mathcal{H}} \le D(\theta_1, \theta_2)^2\, \frac{1}{n}\sum_{i=1}^{n}\phi(X_i)^2.
\]
Let $M > 0$ be a constant such that $E[\phi(X)^2] < M$, and let $A_n = \bigl\{ \mathbf{X}_n \,\bigm|\, \bigl\| \frac{1}{n}\sum_{i=1}^{n}( F(X_i; \theta_1) - F(X_i; \theta_2) ) \bigr\|^2_{\mathcal{H}} \le M D(\theta_1, \theta_2)^2 \text{ for all } \theta_1, \theta_2 \in \Theta \bigr\}$. Since the probability of the complement $A_n^c$ converges to zero as $n \to \infty$, it suffices to show that the probability
\[
P_n = \Pr\Bigl( A_n \cap \Bigl\{ \mathbf{X}_n \,\Bigm|\, \sup_{\theta \in \Theta} \Bigl\| \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bigl( F(X_i; \theta) - E[F(X; \theta)] \bigr) \Bigr\|_{\mathcal{H}} > \delta \Bigr\} \Bigr)
\]
can be made arbitrarily small in the limit $n \to \infty$ by taking $\delta > 0$ sufficiently large.

With the notation $F_{\theta}(x) = F(x; \theta) - E[F(X; \theta)]$, from Lemma 17 we can derive
\[
\Pr\Bigl( A_n \cap \Bigl\{ \mathbf{X}_n \,\Bigm|\, \Bigl\| \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bigl( F_{\theta_1}(X_i) - F_{\theta_2}(X_i) \bigr) \Bigr\|_{\mathcal{H}} > t \Bigr\} \Bigr)
\le 8\exp\Bigl( -\frac{t^2}{512 \cdot 2M D(\theta_1, \theta_2)^2} \Bigr),
\]
for any $t > 0$ and sufficiently large $n$. Because the covering integral $J(\delta)$ with respect to $D$ is finite by the compactness of $\Theta$, and the sample path $\Theta \ni \theta \mapsto \frac{1}{\sqrt{n}}\sum_{i=1}^{n}F_{\theta}(X_i) \in \mathcal{H}$ is continuous, the chaining lemma implies that for any $0 < \varepsilon < 1$
\[
\Pr\Bigl( A_n \cap \Bigl\{ \mathbf{X}_n \,\Bigm|\, \sup_{\theta_1, \theta_2 \in \Theta,\, D(\theta_1, \theta_2) \le \varepsilon} \Bigl\| \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bigl( F_{\theta_1}(X_i) - F_{\theta_2}(X_i) \bigr) \Bigr\|_{\mathcal{H}} > 26 \cdot 512M \cdot J(\varepsilon) \Bigr\} \Bigr) \le 2\varepsilon.
\]
Take an arbitrary $\varepsilon \in (0, 1)$. We can find a finite partition $\Theta = \cup_{a=1}^{\nu(\varepsilon)}\Theta_a$ ($\nu(\varepsilon) \in \mathbb{N}$) so that any two points in each $\Theta_a$ are within the distance $\varepsilon$. Let $\theta_a$ be an arbitrary point in $\Theta_a$. Then the probability $P_n$ is bounded by
\[
\begin{aligned}
P_n &\le \Pr\Bigl( \max_{1 \le a \le \nu(\varepsilon)} \Bigl\| \frac{1}{\sqrt{n}}\sum_{i=1}^{n}F_{\theta_a}(X_i) \Bigr\|_{\mathcal{H}} > \frac{\delta}{2} \Bigr) \\
&\qquad + \Pr\Bigl( A_n \cap \Bigl\{ \mathbf{X}_n \,\Bigm|\, \sup_{\theta, \eta \in \Theta,\, D(\theta, \eta) \le \varepsilon} \Bigl\| \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bigl( F_{\theta}(X_i) - F_{\eta}(X_i) \bigr) \Bigr\|_{\mathcal{H}} > \frac{\delta}{2} \Bigr\} \Bigr). \tag{28}
\end{aligned}
\]
From Chebyshev's inequality the first term is upper-bounded by
\[
\nu(\varepsilon)\, \Pr\Bigl( \Bigl\| \frac{1}{\sqrt{n}}\sum_{i=1}^{n}F_{\theta_a}(X_i) \Bigr\|_{\mathcal{H}} > \frac{\delta}{2} \Bigr) \le \frac{4\nu(\varepsilon)\, E\|F_{\theta_a}(X)\|^2_{\mathcal{H}}}{\delta^2}.
\]
If we take $\delta$ sufficiently large so that $26 \cdot 512M \cdot J(\varepsilon) < \delta/2$ and $4\nu(\varepsilon) E\|F_{\theta_a}(X)\|^2_{\mathcal{H}}/\varepsilon < \delta^2$, the right hand side of Eq. (28) is bounded by $3\varepsilon$, which completes the proof.

References

[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

[2] F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.

[3] C. R. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.

[4] L. Breiman and J. H. Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80:580–598, 1985.

[5] F. Chiaromonte and R. D. Cook. Sufficient dimension reduction and graphics in regression. Annals of the Institute of Statistical Mathematics, 54(4):768–795, 2002.

[6] R. D. Cook. Regression Graphics. Wiley Inter-Science, 1998.

[7] R. D. Cook and H. Lee. Dimension reduction in regression with a binary response. Journal of the American Statistical Association, 94:1187–1200, 1999.

[8] R. D. Cook and B. Li. Dimension reduction for conditional mean in regression. The Annals of Statistics, 30(2):455–474, 2002.

[9] R. D. Cook and S. Weisberg. Discussion of Li (1991). Journal of the American Statistical Association, 86:328–332, 1991.

[10] R. D. Cook and X. Yin. Dimension reduction and visualization in discriminant analysis (with discussion). Australian & New Zealand Journal of Statistics, 43(2):147–199, 2001.

[11] B. Flury and H. Riedwyl. Multivariate Statistics: A Practical Approach. Chapman and Hall, 1988.

[12] J. H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76:817–823, 1981.

[13] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.

[14] K. Fukumizu, F. R. Bach, and A. Gretton. Consistency of kernel canonical correlation analysis. Research Memorandum 942, Institute of Statistical Mathematics, 2005.

[15] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 489–496. MIT Press, Cambridge, MA, 2008.

[16] A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, 16th International Conference on Algorithmic Learning Theory, pages 63–77. Springer, 2005.

[17] C. W. Groetsch. The Theory of Tikhonov Regularization for Fredholm Equations of the First Kind. Pitman, 1984.

[18] M. Hristache, A. Juditsky, J. Polzehl, and V. Spokoiny. Structure adaptive approach for dimension reduction. The Annals of Statistics, 29(6):1537–1566, 2001.

[19] S. Kobayashi and K. Nomizu. Foundations of Differential Geometry, Vol. 1. John Wiley & Sons, 1963.

[20] P. D. Lax. Functional Analysis. Wiley, 2002.

[21] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer-Verlag, 1991.

[22] B. Li, H. Zha, and F. Chiaromonte. Contour regression: A general approach to dimension reduction. The Annals of Statistics, 33(4):1580–1616, 2005.

[23] K.-C. Li. Sliced inverse regression for dimension reduction (with discussion). Journal of the American Statistical Association, 86:316–342, 1991.

[24] K.-C. Li. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87:1025–1039, 1992.

[25] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.

[26] M. Reed and B. Simon. Functional Analysis. Academic Press, 1980.

[27] A. M. Samarov. Exploring regression structure using nonparametric functional estimation. Journal of the American Statistical Association, 88(423):836–847, 1993.

[28] B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), 2008.

[29] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[30] N. N. Vakhania, V. I. Tarieladze, and S. A. Chobanyan. Probability Distributions on Banach Spaces. D. Reidel Publishing Company, 1987.

[31] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.

[32] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, PA, 1990.

[33] Y. Xia, H. Tong, W. Li, and L.-X. Zhu. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society B, 64(3):363–410, 2002.

[34] X. Yin and E. Bura. Moment-based dimension reduction for multivariate response regression. Journal of Statistical Planning and Inference, 136(10):3675–3688, 2006.

[35] X. Yin and R. D. Cook. Direction estimation in single-index regressions. Biometrika, 92(2):371–384, 2005.

[36] Y. Zhu and P. Zeng. Fourier methods for estimating the central subspace and the central mean subspace in regression. Journal of the American Statistical Association, 101(476):1638–1651, 2006.