Biometrika (2012), 99, 1, pp. 29–42. doi: 10.1093/biomet/asr066
© 2012 Biometrika Trust. Advance Access publication 5 January 2012.
Printed in Great Britain
A direct approach to sparse discriminant analysis
in ultra-high dimensions
BY QING MAI, HUI ZOU
School of Statistics, University of Minnesota, Minneapolis, Minnesota 55455, U.S.A.
AND MING YUAN
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology,
Atlanta, Georgia 30332, U.S.A.
SUMMARY
Sparse discriminant methods based on independence rules, such as the nearest shrunken cen-
troids classifier (Tibshirani et al., 2002) and features annealed independence rules (Fan & Fan,
2008), have been proposed as computationally attractive tools for feature selection and classi-
fication with high-dimensional data. A fundamental drawback of these rules is that they ignore
correlations among features and thus could produce misleading feature selection and inferior
classification. We propose a new procedure for sparse discriminant analysis, motivated by the
least squares formulation of linear discriminant analysis. To demonstrate our proposal, we study
the numerical and theoretical properties of discriminant analysis constructed via lasso penalized
least squares. Our theory shows that the method proposed can consistently identify the subset of
discriminative features contributing to the Bayes rule and at the same time consistently estimate
the Bayes classification direction, even when the dimension can grow faster than any polynomial
order of the sample size. The theory allows for general dependence among features. Simulated
and real data examples show that lassoed discriminant analysis compares favourably with other
popular sparse discriminant proposals.
Some key words: Discriminant analysis; Features annealed independence rule; Lasso; Nearest shrunken centroids
classifier; Nonpolynomial-dimension asymptotics.
1. INTRODUCTION
Consider a binary classification problem where x = (x_1, ..., x_p)^T represents the predictor vector and G = 1, 2 denotes the class label. Linear discriminant analysis is perhaps the oldest classification technique that is still being used routinely in applications. The linear discriminant analysis model assumes x | G = g ~ N(μ_g, Σ), pr(G = 1) = π_1, pr(G = 2) = π_2. Then the Bayes rule, which is the theoretically optimal classifier minimizing the 0–1 loss, classifies a data point to class 2 if and only if

{x − (μ_1 + μ_2)/2}^T Σ^{-1}(μ_2 − μ_1) + log(π_2/π_1) > 0.    (1)
Let μ̂_1, n_1 and μ̂_2, n_2 be the sample mean vector and sample size within classes 1 and 2, respectively. Let Σ̂ be the pooled sample covariance estimate of Σ. Linear discriminant analysis sets μ_1 = μ̂_1, μ_2 = μ̂_2, Σ = Σ̂, π_1 = n_1/n, π_2 = n_2/n in (1). Despite its simplicity, it has proved to be a reasonably good classifier in many applications. For example, Michie et al. (1994) and Hand (2006) have shown that linear discriminant analysis has very competitive performance for many real-world benchmark datasets.
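As a concrete illustration of the plug-in rule just described, the following sketch (our own illustration, not code from the paper) computes the linear discriminant direction and intercept in the classical p < n setting:

    # Illustrative sketch: the plug-in linear discriminant rule (1) with
    # sample estimates, valid in the classical p < n setting.
    import numpy as np

    def lda_fit(X, g):
        """X: n x p predictors; g: labels in {1, 2}. Returns (direction w, intercept b)."""
        X1, X2 = X[g == 1], X[g == 2]
        n1, n2 = len(X1), len(X2)
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        # pooled within-class covariance estimate
        S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n1 + n2 - 2)
        w = np.linalg.solve(S, mu2 - mu1)              # estimate of Sigma^{-1}(mu_2 - mu_1)
        b = np.log(n2 / n1) - 0.5 * (mu1 + mu2) @ w    # intercept from (1) with pi_g = n_g / n
        return w, b

    def lda_predict(X, w, b):
        # assign class 2 whenever the discriminant score is positive, as in (1)
        return np.where(X @ w + b > 0, 2, 1)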
With the rapid advance of technology, high-dimensional data appear more and more frequently.
In such data, the dimension, p, can be much larger than the sample size, n. It has been empiri-
cally observed that, for classification problems with high-dimension-and-low-sample-size data,
some simple linear classifiers perform as well as more sophisticated classification algorithms
such as the support vector machine and boosting. See, e.g., the comparison study by Dettling
(2004). Hall et al. (2005) provided some geometric insight into this phenomenon. In recent years,
many papers have considered ways to modify linear discriminant analysis to be suitable for high-
dimensional classification. One approach is to use more sophisticated estimates of the inverse
covariance matrix Σ^{-1} to replace the naive sample estimate. Under sparsity assumptions, one can obtain good estimators of Σ and Σ^{-1} even when p is much larger than n (Bickel & Levina, 2008; Cai et al., 2010; Rothman et al., 2008). However, a better estimate of Σ^{-1} does not necessarily lead to a better classifier. In an ideal scenario where we know that Σ is an identity matrix and π_1 = π_2 = 0·5, we could classify x to class 2 if {x − (μ̂_1 + μ̂_2)/2}^T(μ̂_2 − μ̂_1) > 0. Although this classifier does not suffer from the difficulty of estimating a large covariance matrix, Fan & Fan (2008) showed that it performs no better than random guessing when p is sufficiently large, owing to noise accumulation in estimating μ_1 and μ_2. Therefore, effectively exploiting sparsity is critically important for high-dimensional classification.
Tibshirani et al. (2002) proposed the nearest shrunken centroids classifier for tumour classification and gene selection using microarray data. This classifier is defined as follows. For each variable x_j, we compute d_{jg} = (1/n_g + 1/n)^{-1/2}(μ̂_{gj} − x̄_j)(s_j + s_0)^{-1} (g = 1, 2), where μ̂_{gj} is the within-class sample mean, s_j^2 is the sample estimate of Σ_{jj} and s_0 is a small positive constant added for robustness. For simplicity, we set s_0 = 0. Define the shrunken centroids mean by

μ̂^λ_{gj} = x̄_j + (n_g^{-1} + n^{-1})^{1/2} s_j d^λ_{jg}   (g = 1, 2),

where x̄_j = (n_1 μ̂_{1j} + n_2 μ̂_{2j})/n is the marginal sample mean of x_j, λ is a pre-chosen positive constant and d^λ_{jg} is computed by soft-thresholding d_{jg}: d^λ_{jg} = sign(d_{jg})(|d_{jg}| − λ)_+ (g = 1, 2). The nearest shrunken centroids classifier classifies x to class 2 if

∑_{j=1}^p {x_j − (μ̂^λ_{2j} + μ̂^λ_{1j})/2} s_j^{-2} (μ̂^λ_{2j} − μ̂^λ_{1j}) + log(n_2/n_1) > 0.    (2)

Comparing (2) and (1), we see that this classifier modifies the usual linear discriminant analysis in two ways. First, it uses only the diagonal of the sample covariance matrix to estimate Σ. If λ = 0, this classifier reduces to diagonal linear discriminant analysis, which has been shown in Bickel & Levina (2004) to work much better than linear discriminant analysis in high dimensions. Secondly, this classifier uses the shrunken centroids means to estimate μ_1, μ_2 for feature selection. If we use a sufficiently large λ, then the soft-thresholding operation will force μ̂^λ_{1j} = μ̂^λ_{2j} = x̄_j for some variables, which then make no contribution to the classifier defined in (2). Many experiments have shown that the nearest shrunken centroids classifier is very competitive for high-dimensional classification. More recently, Fan & Fan (2008) proposed features annealed independence rules, in which feature selection is done by hard-thresholding marginal t-statistics for testing whether μ_{1j} = μ_{2j}.
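To make the above construction concrete, the following sketch is an illustration under the conventions of this section, with s_0 = 0; it is not the pamr implementation, and the standardizing factor follows the formula as written above:

    # Illustrative sketch of the nearest shrunken centroids quantities used in (2).
    import numpy as np

    def shrunken_centroids(X, g, lam):
        n, p = X.shape
        xbar = X.mean(axis=0)
        mu, m, s = {}, {}, np.zeros(p)
        for k in (1, 2):
            mu[k] = X[g == k].mean(axis=0)
            s += ((X[g == k] - mu[k]) ** 2).sum(axis=0)
        s = np.sqrt(s / (n - 2))                                  # pooled within-class sd, s_j
        for k in (1, 2):
            nk = (g == k).sum()
            m[k] = (1.0 / nk + 1.0 / n) ** 0.5
            d = (mu[k] - xbar) / (m[k] * s)                       # d_{jg} with s_0 = 0
            d_soft = np.sign(d) * np.maximum(np.abs(d) - lam, 0)  # soft-thresholded d^lam_{jg}
            mu[k] = xbar + m[k] * s * d_soft                      # shrunken centroid mean
        return mu, s

    def nsc_classify(x, mu, s, n1, n2):
        # discriminant score (2): diagonal rule based on the shrunken centroids
        score = ((x - (mu[2] + mu[1]) / 2) / s ** 2) @ (mu[2] - mu[1]) + np.log(n2 / n1)
        return 2 if score > 0 else 1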
Since the goal of sparse discriminant analysis is to find those features that contribute most to
classification, the target of an ideal feature selection should be the discriminative set that contains
all the discriminative features that contribute to the Bayes rule, because we would use the Bayes
rule for classification if it was available. Feature selection is needed when the cardinality of the
discriminative set is much smaller than the total number of features. The performance of feature
selection by a sparse method is measured by its ability to discover the discriminative set. There is
little theoretical work for justifying the nearest shrunken centroids classifier and its variants. To
our knowledge, only Fan & Fan (2008) provided detailed theoretical analysis of features annealed
independence rules, under the fundamental assumption that is a diagonal matrix. However,
such an assumption is too restrictive to hold in applications, because strong correlations can
exist in high-dimensional data, and ignoring them may lead to misleading feature selection. We
argue that independence rules aim to discover the so-called signal set, whose definition is given
explicitly in
§ 2. We further provide a necessary and sufficient condition under which this set is
identical to the discriminative set. This condition can easily be violated and hence independence
rules could be problematic.
In this work, we propose a new procedure for sparse discriminant analysis in high dimen-
sions. Our proposal is motivated by the fact that classical linear discriminant analysis can be
reconstructed exactly via least squares (Hastie et al., 2008). We suggest using penalized sparse
least squares to derive sparse discriminant methods. Our proposal is computationally effi-
cient in high dimensions owing to efficient algorithms for computing penalized least squares.
We further provide theoretical justifications for our proposal. If the Bayes rule has a sparse
representation, our theoretical results show that the proposed sparse method can simultane-
ously identify the discriminative set and estimate the Bayes classification direction consistently.
The theory is valid even when the dimension can grow faster than any polynomial order of
the sample size and does not impose strong assumptions on the correlation structure among
predictors.
2. SIGNAL SET AND THE DISCRIMINATIVE SET
Consider the problem of tumour classification with gene expression arrays. It is an intuitively
sound claim that differentially expressed genes should be responsible for the tumour classifica-
tion and equally expressed genes can safely be discarded. However, we show in this section that
a differentially expressed gene can have no role in classification and an equally expressed gene
can significantly influence classification.
By definition, the discriminative set is equal to A = {j : {Σ^{-1}(μ_2 − μ_1)}_j ≠ 0}, since the Bayes classification direction is Σ^{-1}(μ_2 − μ_1). Variables in A are called discriminative variables. Define the signal set Ã = {j : μ_{1j} ≠ μ_{2j}}; variables in Ã are called signals. Ideally, Ã is the variable selection outcome of an independence rule. Practically, independence rules pick the strongest signals indicated by the data. When Σ is diagonal, A = Ã. For a general covariance matrix, however, the discriminative and the signal sets can be very different, as shown in the following proposition.
PROPOSITION 1. Partition Σ according to A and A^c as Σ = (Σ_{A,A}, Σ_{A,A^c}; Σ_{A^c,A}, Σ_{A^c,A^c}), and according to Ã and Ã^c as Σ = (Σ_{Ã,Ã}, Σ_{Ã,Ã^c}; Σ_{Ã^c,Ã}, Σ_{Ã^c,Ã^c}).

1. If and only if Σ_{Ã^c,Ã} Σ_{Ã,Ã}^{-1}(μ_{2,Ã} − μ_{1,Ã}) = 0, we have A ⊆ Ã.
2. If and only if μ_{2,A^c} = μ_{1,A^c}, or equivalently Σ_{A^c,A} Σ_{A,A}^{-1}(μ_{2,A} − μ_{1,A}) = 0, we have Ã ⊆ A.
Based on Proposition 1 we can construct examples in which a non-signal can be discriminative and a nondiscriminative feature can be a signal. Consider a linear discriminant analysis model with p = 25, μ_1 = 0_p, Σ_{ii} = 1 and Σ_{ij} = 0·5 (i, j = 1, ..., 25; i ≠ j). If μ_2 = (1_5^T, 0_20^T)^T, then Ã = {1, 2, 3, 4, 5} and A = {j : j = 1, ..., 25}, since Σ^{-1}(μ_2 − μ_1) = (1·62 × 1_5^T, −0·38 × 1_20^T)^T. Similarly, if we let μ_2 = (3 × 1_5^T, 2·5 × 1_20^T)^T, then all variables are signals but A = {1, 2, 3, 4, 5}, because Σ^{-1}(μ_2 − μ_1) = (1_5^T, 0_20^T)^T.
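The values quoted in both examples are easy to verify numerically; a minimal check (our own illustration):

    # Minimal numerical check of the two examples above: with equicorrelation 0.5,
    # a mean shift on the first five coordinates makes every variable discriminative,
    # while a suitable shift on all coordinates leaves only the first five discriminative.
    import numpy as np

    p = 25
    Sigma = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)      # Sigma_ii = 1, Sigma_ij = 0.5
    mu2_a = np.r_[np.ones(5), np.zeros(20)]              # first example
    mu2_b = np.r_[3 * np.ones(5), 2.5 * np.ones(20)]     # second example

    print(np.linalg.solve(Sigma, mu2_a).round(2))        # approx (1.62, ..., 1.62, -0.38, ..., -0.38)
    print(np.linalg.solve(Sigma, mu2_b).round(2))        # approx (1, ..., 1, 0, ..., 0)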
The above arguments warn us that independence rules could select a wrong set of features. Different sparse discriminant analysis methods have been proposed based on Fisher's view of linear discriminant analysis: the discriminant direction is obtained by maximizing β^T B̂ β / β^T Σ̂ β, where B̂ = (μ̂_2 − μ̂_1)(μ̂_2 − μ̂_1)^T. Wu et al. (2009) proposed the ℓ_1-constrained Fisher discriminant:

min_β β^T Σ̂ β,   subject to (β^T B̂ β)^{1/2} = 1, ||β||_1 ≤ τ.    (3)

A referee pointed out that a similar estimator was proposed by Trendafilov & Jolliffe (2007). When revising this paper, it came to our attention that Witten & Tibshirani (2011) proposed another ℓ_1-penalized linear discriminant:

max_β { β^T B̂ β − λ ∑_{j=1}^p |s_j β_j| },   subject to β^T Σ̂ β ≤ 1.    (4)

Little is known about the theoretical properties of the estimators defined in (3) and (4), but we include them in our numerical experiments.
3. METHOD AND THEORY
3·1. Sparse discriminant analysis via penalized least squares
Our approach is motivated by the intimate connection between linear discriminant analysis and least squares in the classical p < n setting; see Chapter 4 of Hastie et al. (2008). We code the class labels as y_1 = −n/n_1 and y_2 = n/n_2, where n = n_1 + n_2. Let

(β̂^{ols}, β̂_0^{ols}) = arg min_{β, β_0} ∑_{i=1}^n (y_i − β_0 − x_i^T β)^2.    (5)

Then β̂^{ols} = c Σ̂^{-1}(μ̂_2 − μ̂_1) for some positive constant c. In other words, the least squares formulation in (5) exactly derives the usual linear discriminant analysis direction.
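The proportionality between the least squares direction in (5) and the linear discriminant analysis direction is easy to confirm by simulation; a small check (our own illustration, with an arbitrary toy design):

    # Quick check that the least squares direction in (5) is proportional to the
    # linear discriminant analysis direction when p < n.
    import numpy as np

    rng = np.random.default_rng(0)
    n1, n2, p = 60, 40, 5
    X = np.vstack([rng.normal(0, 1, (n1, p)), rng.normal(0.5, 1, (n2, p))])
    g = np.r_[np.ones(n1), 2 * np.ones(n2)]
    y = np.where(g == 1, -(n1 + n2) / n1, (n1 + n2) / n2)    # coding used in (5)

    Z = np.column_stack([np.ones(n1 + n2), X])               # intercept plus predictors
    beta_ols = np.linalg.lstsq(Z, y, rcond=None)[0][1:]

    mu1, mu2 = X[g == 1].mean(axis=0), X[g == 2].mean(axis=0)
    Xc1, Xc2 = X[g == 1] - mu1, X[g == 2] - mu2
    Sigma_hat = (Xc1.T @ Xc1 + Xc2.T @ Xc2) / (n1 + n2 - 2)  # pooled covariance estimate
    beta_lda = np.linalg.solve(Sigma_hat, mu2 - mu1)

    print(beta_ols / beta_lda)   # entrywise ratios all equal the same positive constant c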
This connection is lost in high-dimensional problems because the sample covariance estimate
is not invertible and the linear discriminant analysis direction is not well defined. However, we
may consider a penalized least squares formulation to produce a classification direction. Let
P_λ(·) be a generic sparsity-inducing penalty, such as the lasso penalty (Tibshirani, 1996) where P_λ(t) = λt (t ≥ 0). We first compute the solution to a penalized least squares problem,

(β̂^λ, β̂_0^λ) = arg min_{β, β_0} { n^{-1} ∑_{i=1}^n (y_i − β_0 − x_i^T β)^2 + ∑_{j=1}^p P_λ(|β_j|) }.    (6)
Then our classification rule is to assign x to class 2 if
x^T β̂^λ + β̂_0 > 0.    (7)
Note that β̂_0 in (7) differs from β̂_0^λ in (6). In the p < n case, consider the ordinary least squares estimator and the usual linear discriminant analysis. Let us write β̂^{ols} = c β̂^{LDA}, where β̂^{LDA} = Σ̂^{-1}(μ̂_2 − μ̂_1). We should use β̂_0 = c β̂_0^{LDA} in (7), where β̂_0^{LDA} = log(n_2/n_1) − {(μ̂_1 + μ̂_2)/2}^T β̂^{LDA}, such that the ordinary least squares classifier and the linear discriminant analysis rule yield an identical classification. If we use β̂_0^{ols} in (7), then these two classifiers are not the same in general. Finding the right intercept is critical for classification but receives little attention in the literature. Hastie et al. (2008) mentioned that one could choose the intercept β̂_0 empirically by minimizing the training error. Fortunately, there is a closed-form formula for computing the optimal intercept.
PROPOSITION 2. Suppose that a linear classifier assigns x to class 2 if x^T β̃ + β̃_0 > 0. Given β̃, if (μ_2 − μ_1)^T β̃ > 0, then the optimal intercept is

β_0^{opt} = −(μ_1 + μ_2)^T β̃/2 + β̃^T Σ β̃ {(μ_2 − μ_1)^T β̃}^{-1} log(π_2/π_1),    (8)

which can be estimated by

β̂_0^{opt} = −(μ̂_1 + μ̂_2)^T β̃/2 + β̃^T Σ̂ β̃ {(μ̂_2 − μ̂_1)^T β̃}^{-1} log(n_2/n_1).    (9)
By Proposition 2, we calculate β̂_0 and then the classifier given in (7) assigns x to class 2 if

{x − (μ̂_1 + μ̂_2)/2}^T β̂^λ + (β̂^λ)^T Σ̂ β̂^λ {(μ̂_2 − μ̂_1)^T β̂^λ}^{-1} log(n_2/n_1) > 0.
Remark 1. The condition (μ_2 − μ_1)^T β̃ > 0 in Proposition 2 is very mild. If the linear classifier actually yields (μ_2 − μ_1)^T β̃ < 0, then we can always use β̃^{new} = −β̃, which obeys (μ_2 − μ_1)^T β̃^{new} > 0.
Remark 2. If π_1 = π_2 = 0·5, then we can take β̂_0^{opt} = −(μ̂_1 + μ̂_2)^T β̃/2. In general, we need to include the second term on the right-hand side of (9). If π_1 ≠ π_2 and without the sparsity condition on β̃, the second term in (9) would not work well when p ≫ n. Fortunately, when β̃ is sparse, we have

β̃^T Σ β̃ = ∑_{i,j: β̃_i ≠ 0, β̃_j ≠ 0} Σ_{ij} β̃_i β̃_j,   β̃^T Σ̂ β̃ = ∑_{i,j: β̃_i ≠ 0, β̃_j ≠ 0} Σ̂_{ij} β̃_i β̃_j.

Thus, even when p ≫ n, as long as ||β̃||_0 ≪ n, β̃^T Σ̂ β̃ is a good estimator of β̃^T Σ β̃. Using a regularized estimate of Σ could provide some further improvement. For example, for banded covariance matrices, the banding estimator (Bickel & Levina, 2008) and the tapering estimator (Cai et al., 2010) are better estimators of Σ than the sample covariance. However, in this work, our primary focus is β̂^λ and we do not want to entangle estimating large covariance matrices with feature selection.
In principle, (6) can work with any sparsity-inducing penalty function. We choose the lasso in this work because it is the most popular penalty in the literature. Other popular penalty functions include the smoothly clipped absolute deviation (Fan & Li, 2001), the elastic net (Zou & Hastie, 2005), the adaptive lasso (Zou, 2006), the fused lasso (Tibshirani et al., 2005), the grouped lasso (Yuan & Lin, 2006) and the minimum concavity penalty (Zhang, 2010). For convenience, we call the resulting classifier lassoed discriminant analysis from now on. We can use either the least angle regression algorithm (Efron et al., 2004) or the coordinate descent algorithm (Friedman et al., 2010) to compute the lasso-penalized least squares estimator.
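The whole procedure can be summarized in a few lines. The sketch below is a minimal illustration of lassoed discriminant analysis rather than the authors' implementation: it solves (10) with scikit-learn's coordinate-descent Lasso (a convenient but arbitrary solver choice), computes the intercept by (9) with the pooled covariance estimate, and classifies by (7).

    # Minimal sketch of lassoed discriminant analysis, assuming labels g in {1, 2}.
    import numpy as np
    from sklearn.linear_model import Lasso

    def lassoed_da_fit(X, g, lam):
        n1, n2 = int((g == 1).sum()), int((g == 2).sum())
        n = n1 + n2
        y = np.where(g == 1, -n / n1, n / n2)          # response coding of Section 3.1
        # scikit-learn's Lasso minimizes (2n)^{-1}||y - b0 - Xb||^2 + alpha ||b||_1,
        # which has the same minimizer as (10) when alpha = lam / 2.
        beta = Lasso(alpha=lam / 2, fit_intercept=True).fit(X, y).coef_
        mu1, mu2 = X[g == 1].mean(axis=0), X[g == 2].mean(axis=0)
        if (mu2 - mu1) @ beta < 0:                     # Remark 1: flip the sign if needed
            beta = -beta
        X1c, X2c = X[g == 1] - mu1, X[g == 2] - mu2
        Sigma_hat = (X1c.T @ X1c + X2c.T @ X2c) / (n - 2)   # pooled covariance estimate
        # intercept formula (9); assumes a nonzero fitted direction
        beta0 = -(mu1 + mu2) @ beta / 2 \
                + (beta @ Sigma_hat @ beta) / ((mu2 - mu1) @ beta) * np.log(n2 / n1)
        return beta, beta0

    def lassoed_da_predict(X, beta, beta0):
        return np.where(X @ beta + beta0 > 0, 2, 1)    # classification rule (7)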
3·2. Theory
We first introduce some necessary notation. For a general m × n matrix M, define ||M||_∞ = max_{i=1,...,m} ∑_{j=1}^n |M_{ij}|. For any vector b, let ||b||_∞ = max_j |b_j| and |b|_min = min_j |b_j|. We let β^{Bayes} = Σ^{-1}(μ_2 − μ_1) represent the Bayes classifier coefficient vector. So A = {j : β_j^{Bayes} ≠ 0}; let s be the cardinality of A. We use C = cov(x) to represent the marginal covariance matrix of the predictors and partition C as C = (C_{AA}, C_{AA^c}; C_{A^cA}, C_{A^cA^c}). We define three quantities that frequently appear in our analysis:

κ = ||C_{A^cA}(C_{AA})^{-1}||_∞,   ϕ = ||(C_{AA})^{-1}||_∞,   Δ = ||μ_{2A} − μ_{1A}||_∞.

Suppose that X is the predictor matrix and let X̃ be the centred predictor matrix, whose column-wise mean is zero. Obviously, C^{(n)} = X̃^T X̃/n is an empirical version of C. Likewise, we can write X̃_A^T X̃_A/n = C^{(n)}_{AA} and X̃_{A^c}^T X̃_A/n = C^{(n)}_{A^cA}.

Denote β* = (C_{AA})^{-1}(μ_{2A} − μ_{1A}). Now we can define β̃^{Bayes} by letting β̃_A^{Bayes} = β* and β̃_{A^c}^{Bayes} = 0. The following proposition shows the equivalence between β̃^{Bayes} and β^{Bayes}.

PROPOSITION 3. The quantities β̃^{Bayes} and β^{Bayes} are equivalent in the sense that β̃^{Bayes} = c β^{Bayes} for some positive constant c, and the Bayes classifier can be written as assigning x to class 2 if

{x − (μ_1 + μ_2)/2}^T β̃^{Bayes} + (β̃^{Bayes})^T Σ β̃^{Bayes} {(μ_2 − μ_1)^T β̃^{Bayes}}^{-1} log(π_2/π_1) > 0.

Proposition 3 tells us that it suffices to show that the proposed sparse discriminant analysis can consistently recover the support of β̃^{Bayes} and estimate β*.
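A quick numerical illustration of Proposition 3 on an assumed toy model, using the identity C = Σ + π_1 π_2 (μ_2 − μ_1)(μ_2 − μ_1)^T for the marginal covariance:

    # Small numerical illustration of Proposition 3: beta*, built from the marginal
    # covariance C restricted to A, is proportional to the Bayes direction on A.
    import numpy as np

    p, pi1, pi2 = 10, 0.3, 0.7
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1)-type Sigma
    beta_bayes = np.r_[1.0, -0.8, 0.6, np.zeros(p - 3)]                   # sparse Bayes direction
    mu1 = np.zeros(p)
    mu2 = Sigma @ beta_bayes                         # so that Sigma^{-1}(mu2 - mu1) = beta_bayes
    A = np.nonzero(beta_bayes)[0]

    C = Sigma + pi1 * pi2 * np.outer(mu2 - mu1, mu2 - mu1)                # marginal covariance of x
    beta_star = np.linalg.solve(C[np.ix_(A, A)], (mu2 - mu1)[A])          # (C_AA)^{-1}(mu_2A - mu_1A)

    print(beta_star / beta_bayes[A])   # constant positive ratio c, as claimed by Proposition 3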
Throughout our analysis we assume that the variance of each variable is bounded by a finite
constant. In practice, one often standardizes the data beforehand. Then the finite constant can be
taken as unity. In this subsection, ε_0 and c_1, c_2 are positive constants.
The lassoed discriminant analysis direction is computed by

(β̂^{lasso}, β̂_0^λ) = arg min_{β, β_0} { n^{-1} ∑_{i=1}^n (y_i − β_0 − x_i^T β)^2 + λ ∑_{j=1}^p |β_j| }.    (10)

If lassoed discriminant analysis finds the support of the Bayes rule, then we should have β̂^{lasso}_{A^c} = 0 and β̂^{lasso}_A should be identical to β̂_A, where

β̂_A = arg min_{β, β_0} { n^{-1} ∑_{i=1}^n (y_i − β_0 − ∑_{j∈A} x_{ij} β_j)^2 + ∑_{j∈A} λ|β_j| }.    (11)

We introduce β̂_A only for mathematical analysis. It is not a real estimator, because its definition depends on knowing A. To ensure that lassoed discriminant analysis has the variable selection consistency property, we impose a condition on the covariance matrix of the predictors:

κ = ||C_{A^cA}(C_{AA})^{-1}||_∞ < 1.    (12)

The condition in (12) is an analogue of the irrepresentability condition for the lasso regression estimator (Meinshausen & Bühlmann, 2006; Zou, 2006; Zhao & Yu, 2006; Wainwright, 2009). This condition can be relaxed by using a concave penalty to derive β̂^λ; see § 5.
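For a given covariance matrix and index set, the quantity κ in (12) is straightforward to evaluate; a small sketch (illustration only):

    # Computing kappa = ||C_{A^c A} (C_{AA})^{-1}||_inf for a given covariance C and
    # index set A, so that condition (12) can be checked numerically.
    import numpy as np

    def irrepresentability_kappa(C, A):
        p = C.shape[0]
        Ac = np.setdiff1d(np.arange(p), A)
        M = C[np.ix_(Ac, A)] @ np.linalg.inv(C[np.ix_(A, A)])
        return np.abs(M).sum(axis=1).max()      # matrix infinity norm: maximum absolute row sum

    # Example: equicorrelation 0.5 with s = 3 relevant variables out of p = 20
    p, A = 20, np.array([0, 1, 2])
    C = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)
    print(irrepresentability_kappa(C, A))       # 0.75 < 1, so (12) holds for this C and A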
THEOREM 1. Pick any λ such that λ < min{|β*|_min/(2ϕ), Δ}. Then:

1. assuming the condition in (12), with probability at least 1 − δ_1, β̂^{lasso}_A = β̂_A and β̂^{lasso}_{A^c} = 0, where

δ_1 = 2ps exp(−c_1 n s^{-2} ε^2) + 2p exp{−c_2 n λ^2 (1 − κ − 2ϕε)^2 (1 + κ)^{-2}/16}    (13)

and ε is any positive constant less than min[ε_0, λ(1 − κ)(4ϕ)^{-1}{λ/2 + Δ(1 + κ)}^{-1}];

2. with probability at least 1 − δ_2, none of the elements of β̂_A is zero, where

δ_2 = 2s^2 exp(−c_1 n s^{-2} ε^2) + 2s exp(−c_2 n ε^2)    (14)

and, writing ζ = |β*|_min/(Δϕ), ε is any positive constant less than min{ε_0, ζϕ^{-1}(3 + ζ)^{-1}, ζΔ(6 + 2ζ)^{-1}};

3. for any positive ε satisfying ε < min{ε_0, λ(2ϕ)^{-1}}, we have

pr(||β̂_A − β*||_∞ ≤ 4ϕλ) ≥ 1 − 2s^2 exp(−c_1 n s^{-2} ε^2) − 2s exp(−c_2 n ε^2).    (15)
The non-asymptotic results in Theorem 1 can easily be translated into asymptotic arguments by allowing the triple (n, s, p) to tend to infinity at suitable rates. To highlight the main points, we assume that Δ, κ, ϕ are constants. In addition, we need the following regularity conditions:

Condition 1. n, p → ∞ and log(ps)s^2/n → 0;

Condition 2. |β*|_min ≫ {log(ps)s^2/n}^{1/2}.

Condition 1 restricts p. Clearly, we cannot expect the proposed method to work for an arbitrarily large p. However, the restriction is rather loose. Consider the case where s = O(n^{1/2−γ}) for some γ < 1/2. Condition 1 holds as long as p ≪ exp(n^{2γ}). Therefore, p is allowed to grow faster than any polynomial order of n, which is referred to as nonpolynomial-dimension asymptotics. Condition 2 requires the nonzero elements of the Bayes rule to be large enough that we can consistently separate them from zero using the observed data. The lower bound actually converges to zero under Condition 1, so Condition 2 is not strong.
THEOREM 2. Let Â = {j : β̂^{lasso}_j ≠ 0}. Under Conditions 1 and 2, if we choose λ = λ_n such that λ_n ≪ |β*|_min and λ_n ≫ {log(ps)s^2/n}^{1/2}, and further assume κ < 1, then pr(Â = A) → 1 and pr(||β̂^{lasso}_A − β*||_∞ ≤ 4ϕλ_n) → 1.

Remark 3. Although we use penalized least squares to estimate the classification direction, there is a fundamental difference between Theorem 1 and the theoretical results derived for lasso-penalized least squares regression (Meinshausen & Bühlmann, 2006; Zhao & Yu, 2006; Wainwright, 2009). The previous work assumes that the data obey a linear regression model with additive noise, which is not true for y and x in (10).
Table 1. Simulation models. The choices of n, p, and β
Bayes
are listed
Model np β
Bayes
1 100 400
ij
= 0·5
|i j|
.0·556 (3, 1·5, 0, 0, 2, 0
p5
)
T
2 100 400
ij
= 0·5
|i j|
.0·582 (3, 2·5, 2·8, 0
p3
)
T
3 400 800
jj
= 1,
ij
= 0·5, i |= j.0·395 (3, 1·7, 2·2, 2·1, 2·55, 0
p5
)
T
4 300 800 = I
5
˜
,I
5
is an identity matrix; 0·916 (1·2, 1·4, 1·15, 1·64, 1·5, 1, 2, 0
p7
)
T
˜
jj
= 1,
˜
ij
= 0·6, i |= j.
5 400 800
jj
= 1,
ij
= 0·5, i |= j.0·551 (3, 1·7, 2·2,2·1, 2·55,(p 5)
1
1
p5
)
T
6 400 800
jj
= 1,
ij
= 0·5, i |= j.0·362 (3, 1·7, 2·2, 2·1, 2·55,(p 5)
1
1
p5
)
T
Remark 4. Our method is also fundamentally different from those based on high-dimensional
covariance estimation. In the current literature on covariance or inverse-covariance matrix
estimation, a commonly used assumption is that the target matrix has some sparsity structure
(Bickel & Levina, 2008; Cai et al., 2010). Such assumptions are not needed in our method.
4. NUMERICAL RESULTS
4·1. Simulation
We use simulated data to demonstrate the good performance of our proposal. For comparison, we included the nearest shrunken centroids classifier (Tibshirani et al., 2002), the features annealed independence rule (Fan & Fan, 2008), the ℓ_1-penalized linear discriminant (Witten & Tibshirani, 2011) and the ℓ_1-constrained Fisher discriminant (Wu et al., 2009). The nearest shrunken centroids classifier is implemented in the R package pamr; see http://cran.r-project.org/web/packages/pamr/index.html. The ℓ_1-penalized linear discriminant is implemented in the R package penalizedLDA; see http://cran.r-project.org/web/packages/penalizedLDA/index.html. We used the code of Wu et al. (2009) to implement the ℓ_1-constrained Fisher discriminant. We also considered the t-test classifier: we first performed Bonferroni-adjusted t-tests with size 0·05 and then did linear discriminant analysis using only those features that passed the t-test.
We randomly generated n class labels such that π_1 = π_2 = 0·5. Conditional on the class label g (g = 1, 2), we generated the p-dimensional predictor x from a multivariate normal distribution with mean vector μ_g and covariance Σ. Without loss of generality, we set μ_1 = 0 and μ_2 = Σβ^{Bayes}. We considered six different simulation models. The choices of n, p, Σ and β^{Bayes} are shown in Table 1. Models 1–4 are sparse discriminant models with different covariance and mean structures; Models 5 and 6 are practically sparse in the sense that their Bayes rules depend on all variables in theory but can be well approximated by sparse discriminant functions. Table 2 summarizes the simulation results based on 2000 replications.
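The data-generating mechanism just described is easy to reproduce; a sketch for a Model 3-type design (our own illustration; the β^{Bayes} entries are those listed in Table 1):

    # Sketch of the simulation design described above, using the Model 3 covariance
    # (unit variances, constant correlation 0.5) as an example.
    import numpy as np

    def simulate(n, p, beta_bayes, rho=0.5, seed=0):
        rng = np.random.default_rng(seed)
        Sigma = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
        mu1, mu2 = np.zeros(p), Sigma @ beta_bayes          # mu_1 = 0, mu_2 = Sigma beta^Bayes
        g = rng.choice([1, 2], size=n)                      # pi_1 = pi_2 = 0.5
        L = np.linalg.cholesky(Sigma)
        X = np.where(g[:, None] == 1, mu1, mu2) + rng.standard_normal((n, p)) @ L.T
        return X, g

    beta_bayes = 0.395 * np.r_[3, 1.7, 2.2, 2.1, 2.55, np.zeros(800 - 5)]   # Model 3
    X, g = simulate(400, 800, beta_bayes)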
Table 2. Simulation results. The methods are named after the first author of the original papers. The reported numbers are medians with their standard errors, obtained by bootstrap, in parentheses. TRUE selection and FALSE selection denote the numbers of selected variables from the discriminative set and its complement, respectively. Fitted model size is the total number of selected variables

                      Bayes   Our method     Wu             Witten          Tibshirani     Fan            t-test classifier
Model 1
  Error (%)           10      10·89 (0·03)   13·71 (0·01)   10·81 (0·01)    10·94 (0·02)   11·47 (0·05)   10·46 (0·01)
  TRUE selection      3       3 (0)          3 (0)          1 (0)           3 (0)          3 (0)          3 (0)
  FALSE selection     0       2 (0·16)       0 (0·49)       26 (0·11)       6 (0·61)       7 (0·66)       1 (0)
Model 2
  Error (%)           10      12·84 (0·05)   14·5 (0·01)    14·25 (0·02)    15·12 (0·05)   15·67 (0·07)   14·12 (0·01)
  TRUE selection      3       3 (0)          1 (0·14)       2 (0)           2 (0·34)       2 (0)          2 (0)
  FALSE selection     0       6 (0·27)       0 (0)          4 (0·61)        9 (0·73)       8 (0·29)       0 (0)
Model 3
  Error (%)           20      21·93 (0·03)   22·37 (0·05)   33·69 (0·01)    27·48 (0·07)   25·69 (0·02)   38·94 (0·03)
  TRUE selection      5       5 (0)          5 (0)          3 (0)           3 (0)          2 (0)          3 (0)
  FALSE selection     0       14 (0·59)      2 (0)          419·5 (10·19)   2 (0·31)       0 (0)          790 (0·42)
Model 4
  Error (%)           10      12·50 (0·02)   13·99 (0·03)   23·90 (0·01)    19·25 (0·04)   18·56 (0·00)   24·59 (0·06)
  TRUE selection      7       7 (0)          6 (0)          4 (0)           4 (0)          3 (0)          6 (0)
  FALSE selection     0       18 (0·70)      2 (0)          35 (4·43)       1 (0·48)       0 (0)          153 (0)
Model 5
  Error (%)           10      11·11 (0·02)   12·07 (0·07)   21·99 (0·01)    14·72 (0·03)   14·27 (0·01)   25·79 (0·04)
  Fitted model size   800     21 (0·65)      7 (0·16)       737 (2·29)      3 (0·46)       3 (0)          800 (0)
Model 6
  Error (%)           20      22·22 (0·03)   23·34 (0·05)   30·43 (0·01)    26·13 (0·07)   24·14 (0)      36·80 (0·04)
  Fitted model size   800     20 (0·53)      5 (0·49)       592·5 (7·46)    8 (0·51)       3 (0)          798 (0)

Our method, lassoed discriminant analysis, is the only one that shows good performance in all six simulation settings. It closely mimics the Bayes rule, regardless of the Bayes error and covariance structure. Tibshirani's method and Fan's method have very comparable performance, but they are much worse than ours except for Model 1. In Model 1, the first five elements of μ_2 − μ_1 are much larger than the rest, which implies that independence rules can include all three discriminative variables. On the other hand, although Model 2 uses the same Σ as in Model 1, it has a very different mean structure: the first two elements of μ_2 − μ_1 are huge while the rest are much smaller. This means that independence rules have difficulty in selecting variable 3, resulting in inferior classification. Wu's method has good classification accuracy overall, but it can often miss some important features. Witten and Tibshirani's method has rather poor performance, which is somewhat surprising because the basic idea is similar to Wu's. Witten and Tibshirani's formulation in (4) is nonconvex while Wu's formulation in (3) is convex, which may help explain their different performances. The t-test classifier is best for Model 1, second best for Model 2 and worst for Models 3–6.
Table 3. Comparing lassoed discriminant analysis with other approaches on the colon and the prostate datasets

                               Our method    Witten        Wu           Tibshirani    Fan           t-test classifier
Colon     Accuracy (%)         86·4 (1·54)   86·4 (0·49)   84·1 (2·17)  86·4 (1·20)   86·4 (0·61)   81·0 (1·67)
          Fitted model size    5 (0·63)      10 (1·39)     1 (0)        89 (29·95)    11 (1·19)     16 (1·15)
Prostate  Accuracy (%)         94·1 (0·55)   91·2 (0·24)   91·2 (0·70)  91·2 (0·96)   76·5 (0·54)   76·5 (1·62)
          Fitted model size    10 (0·77)     18 (4·45)     1 (0)        10 (0·84)     4 (0·40)      58 (2·65)

Table 4. Adjusted classification accuracies of four methods by forcing them to select a similar number of genes as does our method

            Witten        Wu            Tibshirani    Fan
Colon       86·4 (0·51)   86·4 (1·06)   63·6 (0·70)   77·3 (2·16)
Prostate    94·1 (1·25)   91·2 (1·24)   91·2 (1·39)   73·5 (1·11)
4·2. Real data
We further compare the methods on two benchmark datasets: the colon and prostate cancer datasets. The basic task here is to predict whether an observation is tumour or normal tissue. We randomly split the datasets into training and test sets with ratio 2:1. Model fitting was done on the training set and classification accuracy was evaluated on the test set. This procedure was repeated 100 times. Shown in Table 3 are the classification accuracies and the numbers of genes selected by each competitor.
The colon and prostate datasets have been previously used to test classification and feature selection methods. See Alon et al. (1999), Singh et al. (2002) and Dettling (2004). Dettling (2004) reported that BagBoost was the most accurate classifier for the prostate data, with 92·5% classification accuracy, and the nearest shrunken centroids classifier was the most accurate classifier for the colon data. Table 3 shows that our method is as accurate as the nearest shrunken centroids classifier on the colon data and significantly outperforms BagBoost on the prostate data. Since BagBoost does not do gene selection, we do not include it in Table 3. Witten and Tibshirani's method works quite well on these two real datasets. As suggested by a referee, we adjusted the tuning parameters of Witten and Tibshirani's, Wu's, Tibshirani's and Fan's methods such that they selected a similar number of genes to that selected by our method on each dataset. Their adjusted classification accuracies are reported in Table 4. This adjustment helps Witten and Tibshirani's method but degrades Tibshirani's and Fan's methods.
5. DISCUSSION
Sparse discriminant analysis based on independence rules is computationally attractive for
high-dimensional classification. However, independence rules may lead to misleading feature
selection and hence poor classification performance, due to the difference between discrimi-
native and signal variables. When doing feature selection in classification, one should aim to
recover the discriminative set, not the signal set, which is the goal of large-scale hypothesis test-
ing. Discovering the signal set is the fundamental question of research in many scientific studies
(Efron, 2010), but identifying features for classification could be very different from identifying
interesting signals, and hence the statistical tools for data analysis should be carefully chosen.
If one feels that the condition (12) is somewhat strong for establishing the nonpolynomial-
dimension theory, one may use a concave penalty other than the lasso penalty. We have tried
using the smoothly clipped absolute deviation penalty (Fan & Li, 2001) and have shown that
the resulting sparse discriminant algorithm enjoys a strong oracle property without requiring the
condition (12). These results are given in a technical report which is available upon request. We
only focus on lassoed discriminant analysis in the current paper because our primary goal is
to demonstrate the effectiveness of the penalized least squares formulation of sparse discrim-
inant analysis. We do not want to overly emphasize the penalty function. For a given dataset,
one penalty function may be more appropriate than others. For example, in some situations, the
predictors may have a natural ordering, so ordered variable selection is preferred, and the nested
lasso (Levina et al., 2008) is a better choice than the lasso or smoothly clipped absolute devia-
tion penalty. It is straightforward to implement the nested lasso in our proposal, but a detailed
treatment is beyond the scope of this paper.
ACKNOWLEDGEMENT
The authors thank the editor, associate editor and referees for their helpful comments and sug-
gestions. Mai is supported by an Alumni Fellowship from the School of Statistics at the University
of Minnesota. Zou and Yuan are supported by the National Science Foundation, U.S.A.
APPENDIX
Proof of Proposition 1. 1. Let Θ = Σ^{-1} and β^{Bayes} = Θ(μ_2 − μ_1). Note that A ⊆ Ã is equivalent to β^{Bayes}_{Ã^c} = 0, and β^{Bayes}_{Ã^c} = Θ_{Ã^c,Ã}(μ_{2,Ã} − μ_{1,Ã}) because μ_{2,Ã^c} = μ_{1,Ã^c}, where

Θ_{Ã^c,Ã} = −(Σ_{Ã^c,Ã^c} − Σ_{Ã^c,Ã} Σ_{Ã,Ã}^{-1} Σ_{Ã,Ã^c})^{-1} Σ_{Ã^c,Ã} Σ_{Ã,Ã}^{-1}.

Since the leading factor is nonsingular, β^{Bayes}_{Ã^c} = 0 if and only if Σ_{Ã^c,Ã} Σ_{Ã,Ã}^{-1}(μ_{2,Ã} − μ_{1,Ã}) = 0. Therefore, part 1 is proven.
2. By definition, Ã ⊆ A is equivalent to μ_{2,A^c} = μ_{1,A^c}. Now using μ_2 − μ_1 = Σ β^{Bayes} we have μ_{2,A} − μ_{1,A} = Σ_{A,A} β^{Bayes}_A and μ_{2,A^c} − μ_{1,A^c} = Σ_{A^c,A} β^{Bayes}_A. Hence, μ_{2,A^c} − μ_{1,A^c} = Σ_{A^c,A} Σ_{A,A}^{-1}(μ_{2,A} − μ_{1,A}). Thus part 2 is proven.
Proof of Proposition 2. We recode the response variable as y = ±1. Note that β̃_0^{opt} = arg min_{β̃_0} E{y_new ≠ sign(x_new^T β̃ + β̃_0) | training data}. Since (y_new, x_new) is independent of the training data, (y_new, z_new = x_new^T β̃) obeys a one-dimensional linear discriminant analysis model, that is, z_new | y_new = 1 ~ N(β̃^T μ_2, β̃^T Σ β̃), pr(y_new = 1) = π_2 and z_new | y_new = −1 ~ N(β̃^T μ_1, β̃^T Σ β̃), pr(y_new = −1) = π_1. Then a straightforward calculation gives (8).
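For completeness, a brief sketch of that calculation, which is the usual optimal-threshold computation for a univariate normal mixture; here Φ and φ denote the standard normal distribution and density functions. Write a_g = β̃^T μ_g (g = 1, 2) and σ^2 = β̃^T Σ β̃, with a_2 > a_1 by assumption. The rule that assigns class 2 when z_new > t misclassifies with probability

R(t) = π_2 Φ{(t − a_2)/σ} + π_1 [1 − Φ{(t − a_1)/σ}].

Setting R'(t) = 0 gives π_2 φ{(t − a_2)/σ} = π_1 φ{(t − a_1)/σ}, that is, (t − a_1)^2 − (t − a_2)^2 = 2σ^2 log(π_1/π_2), whose solution is

t = (a_1 + a_2)/2 + σ^2 (a_2 − a_1)^{-1} log(π_1/π_2),

and R' changes sign from negative to positive there, so t minimizes R. Since β̃_0^{opt} = −t, this is exactly (8).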
Proof of Proposition 3. By definition we can write C_{AA} = Σ_{AA} + π_1 π_2 (μ_{2A} − μ_{1A})(μ_{2A} − μ_{1A})^T and C_{AA} β̃_A^{Bayes} = μ_{2A} − μ_{1A}. Let c = {1 + π_1 π_2 (μ_{2A} − μ_{1A})^T Σ_{AA}^{-1}(μ_{2A} − μ_{1A})}^{-1} > 0; then, by the Sherman–Morrison formula, β̃_A^{Bayes} = c Σ_{AA}^{-1}(μ_{2A} − μ_{1A}) = c β^{Bayes}_A.
We now prove Theorems 1 and 2. With the class labels coded as −n/n_1 and n/n_2 and the centred predictor matrix X̃, we can write

β̂^{lasso} = arg min_β { n^{-1} β^T (X̃^T X̃) β − 2(μ̂_2 − μ̂_1)^T β + λ ∑_{j=1}^p |β_j| }.

The following two lemmas are used repeatedly in the proof.

LEMMA A1. There exist constants ε_0 and c_1, c_2 such that for any ε ≤ ε_0 we have

pr(|C^{(n)}_{ij} − C_{ij}| ≥ ε) ≤ 2 exp(−n ε^2 c_1)   (i, j = 1, ..., p);    (A1)
pr{|(μ̂_{2j} − μ̂_{1j}) − (μ_{2j} − μ_{1j})| ≥ ε} ≤ 2 exp(−n ε^2 c_2)   (j = 1, ..., p);    (A2)
pr(||C^{(n)}_{AA} − C_{AA}||_∞ ≥ ε) ≤ 2s^2 exp(−n s^{-2} ε^2 c_1);    (A3)
pr(||C^{(n)}_{A^cA} − C_{A^cA}||_∞ ≥ ε) ≤ 2(p − s)s exp(−n s^{-2} ε^2 c_1);    (A4)
pr{||(μ̂_2 − μ̂_1) − (μ_2 − μ_1)||_∞ ≥ ε} ≤ 2p exp(−n ε^2 c_2);    (A5)
pr{||(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})||_∞ ≥ ε} ≤ 2s exp(−n ε^2 c_2).    (A6)
LEMMA A2. There exist constants ε_0, c_1 such that for any ε ≤ min(ε_0, 1), we have

pr{||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} − C_{A^cA}(C_{AA})^{-1}||_∞ ≥ εϕ(κ + 1)(1 − εϕ)^{-1}} ≤ 2ps exp(−n s^{-2} ε^2 c_1).
Proof of Lemma A1. We only prove (A1) and (A2). Inequalities (A3)–(A6) can be obtained from (A1)–(A2) by union bounds. First, pr(|μ̂_{1j} − μ_{1j}| ≥ ε | Y) ≤ 2 exp{−n_1 ε^2/(2σ_j^2)}. Also, n_1 is binomial with parameters (n, π_1) and pr(|n_1 − π_1 n| ≥ εn) ≤ 2 exp(−c n ε^2). Therefore, pr(|μ̂_{1j} − μ_{1j}| ≥ ε) ≤ 2 exp{−n π_1 ε^2/(4σ_j^2)} + 2 exp(−nc π_1^2/4) ≤ 2 exp(−n c^{(1)} ε^2). The same inequality also holds for class 2. Thus (A2) holds.

To prove (A1), note that C^{(n)}_{ij} = n^{-1} ∑_{k=1}^n x_{ki} x_{kj} − x̄_i x̄_j. Since x̄_v = π̂_1 μ̂_{1v} + π̂_2 μ̂_{2v} (v = i, j), by the previous arguments we know that there exists c_1 > 0 such that

pr{| x̄_i x̄_j − E(x_i)E(x_j) | ≥ ε} ≤ 2 exp(−n c_1 ε^2).

We further have that n^{-1} ∑_{k=1}^n x_{ki} x_{kj} − E(x_i x_j) = ∑_{l=1}^2 (n_l/n){n_l^{-1} ∑_{g_k = l} x_{ki} x_{kj} − E(x_i x_j | g = l)} + ∑_{l=1}^2 E(x_i x_j | g = l)(n_l/n − π_l) and E(x_i x_j | g = l) = Σ_{ij} + μ_{li} μ_{lj} for l = 1, 2. Note that

n_l^{-1} ∑_{g_k = l} x_{ki} x_{kj} = n_l^{-1} ∑_{g_k = l} (x_{ki} − μ_{li})(x_{kj} − μ_{lj}) − μ_{li}(μ_{lj} − μ̂_{lj}) − μ_{lj}(μ_{li} − μ̂_{li}) + μ_{li} μ_{lj}.

Bickel & Levina (2008) showed that, for ε < ε_0,

pr{| n_l^{-1} ∑_{g_k = l} (x_{ki} − μ_{li})(x_{kj} − μ_{lj}) − Σ_{ij} | > ε | Y} ≤ 2 exp(−c_3 n ε^2).    (A7)

Combining the concentration results for μ̂_{lν}, n_l and (A7), we have (A1).
Proof of Lemma A2. Let η_1 = ||C_{AA} − C^{(n)}_{AA}||_∞, η_2 = ||C_{A^cA} − C^{(n)}_{A^cA}||_∞ and η_3 = ||(C^{(n)}_{AA})^{-1} − (C_{AA})^{-1}||_∞. First we have

||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} − C_{A^cA}(C_{AA})^{-1}||_∞
  ≤ ||C^{(n)}_{A^cA} − C_{A^cA}||_∞ × ||(C^{(n)}_{AA})^{-1} − (C_{AA})^{-1}||_∞ + ||C^{(n)}_{A^cA} − C_{A^cA}||_∞ × ||(C_{AA})^{-1}||_∞
    + ||C_{A^cA}(C_{AA})^{-1}||_∞ × ||C_{AA} − C^{(n)}_{AA}||_∞ × ||(C_{AA})^{-1}||_∞
    + ||C_{A^cA}(C_{AA})^{-1}||_∞ × ||C_{AA} − C^{(n)}_{AA}||_∞ × ||(C^{(n)}_{AA})^{-1} − (C_{AA})^{-1}||_∞
  ≤ (κη_1 + η_2)(ϕ + η_3).

Moreover, η_3 ≤ ||(C^{(n)}_{AA})^{-1}||_∞ × ||C^{(n)}_{AA} − C_{AA}||_∞ × ||(C_{AA})^{-1}||_∞ ≤ (ϕ + η_3)η_1 ϕ. So, as long as ϕη_1 < 1, we have η_3 ≤ ϕ^2 η_1 (1 − ϕη_1)^{-1} and hence

||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} − C_{A^cA}(C_{AA})^{-1}||_∞ ≤ (κη_1 + η_2)ϕ(1 − ϕη_1)^{-1}.

Then we consider the event max(η_1, η_2) ≤ ε and use Lemma A1 to obtain Lemma A2.
Proof of Theorem 1. We first prove conclusion 1. By (11) we can write β̂_A = (n^{-1} X̃_A^T X̃_A)^{-1}{(μ̂_{2A} − μ̂_{1A}) − λ t_A/2}, where t_A represents the subgradient such that t_j = sign(β̂_j) if β̂_j ≠ 0 and −1 ≤ t_j ≤ 1 if β̂_j = 0. Write

β̂_A = (C_{AA})^{-1}(μ_{2A} − μ_{1A}) + (C^{(n)}_{AA})^{-1}{(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})}
       + {(C^{(n)}_{AA})^{-1} − (C_{AA})^{-1}}(μ_{2A} − μ_{1A}) − λ(C^{(n)}_{AA})^{-1} t_A/2.    (A8)

In order to show that β̂^{lasso} = (β̂_A, 0), it suffices to verify that

||n^{-1} X̃_{A^c}^T X̃_A β̂_A − (μ̂_{2A^c} − μ̂_{1A^c})||_∞ ≤ λ/2.    (A9)

The left-hand side of (A9) is equal to

||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1}(μ̂_{2A} − μ̂_{1A}) − C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} λ t_A/2 − (μ̂_{2A^c} − μ̂_{1A^c})||_∞.    (A10)

Using C_{A^cA} C_{AA}^{-1}(μ_{2A} − μ_{1A}) = μ_{2A^c} − μ_{1A^c}, (A10) is bounded from above by

U_1 = ||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} − C_{A^cA} C_{AA}^{-1}||_∞ Δ + ||(μ̂_{2A^c} − μ̂_{1A^c}) − (μ_{2A^c} − μ_{1A^c})||_∞
      + {||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} − C_{A^cA} C_{AA}^{-1}||_∞ + κ} ||(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})||_∞
      + (||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} − C_{A^cA} C_{AA}^{-1}||_∞ + κ)λ/2.

If ||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} − C_{A^cA} C_{AA}^{-1}||_∞ ≤ ε(κ + 1)ϕ(1 − εϕ)^{-1} and

||(μ̂_2 − μ̂_1) − (μ_2 − μ_1)||_∞ ≤ 4^{-1} λ(1 − κ − 2ϕε)/(1 + κ),

then U_1 ≤ λ/2. Therefore, by Lemmas A1 and A2, we have

pr{||n^{-1} X̃_{A^c}^T X̃_A β̂_A − (μ̂_{2A^c} − μ̂_{1A^c})||_∞ ≤ λ/2} ≥ 1 − 2ps exp(−n s^{-2} ε^2 c_1) − 2p exp[−n c_2 {4^{-1} λ(1 − κ − 2ϕε)/(1 + κ)}^2] = 1 − δ_1.

Thus (13) is proven.
We now prove conclusion 2. Let ζ = |β*|_min/(Δϕ). Write η_1 = ||C_{AA} − C^{(n)}_{AA}||_∞ and η_3 = ||(C^{(n)}_{AA})^{-1} − (C_{AA})^{-1}||_∞. Then for any j ∈ A,

|β̂_j| ≥ ζϕΔ − (η_3 + ϕ){λ/2 + ||(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})||_∞} − η_3 Δ.

When η_1 ϕ < 1 we have shown that η_3 ≤ ϕ^2 η_1 (1 − η_1 ϕ)^{-1}, thus

|β̂_j| ≥ ζϕΔ − (1 − η_1 ϕ)^{-1}{λϕ/2 + ||(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})||_∞ ϕ + ϕ^2 η_1 Δ} ≡ L_1.

Because ||β*||_∞ ≤ ϕΔ, ζ ≤ 1. Hence λ ≤ |β*|_min/(2ϕ) ≤ 2|β*|_min/{(3 + ζ)ϕ} = 2ζΔ(3 + ζ)^{-1}. Under the events η_1 ≤ ε and ||(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})||_∞ ≤ ε we have L_1 > 0. Therefore,

pr(L_1 > 0) ≥ 1 − 2s^2 exp(−c_1 n s^{-2} ε^2) − 2s exp(−n c_2 ε^2).

Thus (14) is proven.
We now prove conclusion 3. By (A8) and η_1 ϕ < 1, we have

||β̂_A − β*||_∞ ≤ (1 − η_1 ϕ)^{-1}{λϕ/2 + ||(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})||_∞ ϕ + ϕ^2 η_1 Δ}.

Under the events η_1 < ε and ||(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})||_∞ ≤ ε we have ||β̂_A − β*||_∞ ≤ 4ϕλ. Thus,

pr(||β̂_A − β*||_∞ ≤ 4ϕλ) ≥ 1 − 2s^2 exp(−c_1 n s^{-2} ε^2) − 2s exp(−n c_2 ε^2).

Thus (15) is proven.

Proof of Theorem 2. Theorem 2 follows directly from Theorem 1, so its proof is omitted.
REFERENCES
ALON, U., BARKAI, N., NOTTERMAN, D. A., GISH, K., MACK, S. & LEVINE, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Nat. Acad. Sci. 96, 6745–50.
BICKEL, P. J. & LEVINA, E. (2004). Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli 10, 989–1010.
BICKEL, P. J. & LEVINA, E. (2008). Regularized estimation of large covariance matrices. Ann. Statist. 36, 199–227.
CAI, T., ZHANG, C. H. & ZHOU, H. (2010). Optimal rates of convergence for covariance matrix estimation. Ann. Statist. 38, 2118–44.
DETTLING, M. (2004). Bagboosting for tumor classification with gene expression data. Bioinformatics 20, 3583–93.
EFRON, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing and Prediction. Cambridge: Cambridge University Press.
EFRON, B., HASTIE, T. J., JOHNSTONE, I. M. & TIBSHIRANI, R. J. (2004). Least angle regression. Ann. Statist. 32, 407–99.
FAN, J. & FAN, Y. (2008). High dimensional classification using features annealed independence rules. Ann. Statist. 36, 2605–37.
FAN, J. & LI, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Assoc. 96, 1348–60.
FRIEDMAN, J. H., HASTIE, T. J. & TIBSHIRANI, R. J. (2010). Regularization paths for generalized linear models via coordinate descent. J. Statist. Software 33, 1–22.
HALL, P., MARRON, J. S. & NEEMAN, A. (2005). Geometric representation of high dimension, low sample size data. J. R. Statist. Soc. B 67, 427–44.
HAND, D. J. (2006). Classifier technology and the illusion of progress. Statist. Sci. 21, 1–14.
HASTIE, T. J., TIBSHIRANI, R. J. & FRIEDMAN, J. H. (2008). Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Berlin: Springer.
LEVINA, E., ROTHMAN, A. J. & ZHU, J. (2008). Sparse estimation of large covariance matrices via a nested lasso penalty. Ann. Appl. Statist. 2, 245–63.
MEINSHAUSEN, N. & BÜHLMANN, P. (2006). High dimensional graphs and variable selection with the lasso. Ann. Statist. 34, 1436–62.
MICHIE, D., SPIEGELHALTER, D. J. & TAYLOR, C. C. (1994). Machine Learning, Neural and Statistical Classification. Chichester: Ellis Horwood.
ROTHMAN, A. J., BICKEL, P., LEVINA, E. & ZHU, J. (2008). Sparse permutation invariant covariance estimation. Electron. J. Statist. 2, 494–515.
SINGH, D., FEBBO, P. G., ROSS, K., JACKSON, D. G., MANOLA, J., LADD, C., TAMAYO, P., RENSHAW, A. A., D'AMICO, A. V., RICHIE, J. P., ET AL. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203–9.
TIBSHIRANI, R. J. (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–88.
TIBSHIRANI, R. J., HASTIE, T. J., NARASIMHAN, B. & CHU, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Nat. Acad. Sci. 99, 6567–72.
TIBSHIRANI, R. J., SAUNDERS, M., ROSSET, S., ZHU, J. & KNIGHT, K. (2005). Sparsity and smoothness via the fused lasso. J. R. Statist. Soc. B 67, 91–108.
TRENDAFILOV, N. T. & JOLLIFFE, I. T. (2007). DALASS: Variable selection in discriminant analysis via the lasso. Comp. Statist. Data Anal. 51, 3718–36.
WAINWRIGHT, M. J. (2009). Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ_1-constrained quadratic programming (lasso). IEEE Trans. Info. Theory 55, 2183–202.
WITTEN, D. M. & TIBSHIRANI, R. J. (2011). Penalized classification using Fisher's linear discriminant. J. R. Statist. Soc. B 73, 753–72.
WU, M. C., ZHANG, L., WANG, Z., CHRISTIANI, D. C. & LIN, X. (2009). Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics 25, 1145–51.
YUAN, M. & LIN, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B 68, 49–67.
ZHANG, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38, 894–942.
ZHAO, P. & YU, B. (2006). On model selection consistency of lasso. J. Mach. Learn. Res. 7, 2541–63.
ZOU, H. (2006). The adaptive lasso and its oracle properties. J. Am. Statist. Assoc. 101, 1418–29.
ZOU, H. & HASTIE, T. J. (2005). Regularization and variable selection via the elastic net. J. R. Statist. Soc. B 67, 301–20.
[Received February 2011. Revised September 2011]