Biometrika (2012), 99, 1, pp. 29–42. doi: 10.1093/biomet/asr066
© 2012 Biometrika Trust. Advance Access publication 5 January 2012.
Printed in Great Britain
A direct approach to sparse discriminant analysis
in ultra-high dimensions
BY QING MAI, HUI ZOU
School of Statistics, University of Minnesota, Minneapolis, Minnesota 55455, U.S.A.
AND MING YUAN
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology,
Atlanta, Georgia 30332, U.S.A.
SUMMARY
Sparse discriminant methods based on independence rules, such as the nearest shrunken cen-
troids classifier (Tibshirani et al., 2002) and features annealed independence rules (Fan & Fan,
2008), have been proposed as computationally attractive tools for feature selection and classi-
fication with high-dimensional data. A fundamental drawback of these rules is that they ignore
correlations among features and thus could produce misleading feature selection and inferior
classification. We propose a new procedure for sparse discriminant analysis, motivated by the
least squares formulation of linear discriminant analysis. To demonstrate our proposal, we study
the numerical and theoretical properties of discriminant analysis constructed via lasso penalized
least squares. Our theory shows that the method proposed can consistently identify the subset of
discriminative features contributing to the Bayes rule and at the same time consistently estimate
the Bayes classification direction, even when the dimension can grow faster than any polynomial
order of the sample size. The theory allows for general dependence among features. Simulated
and real data examples show that lassoed discriminant analysis compares favourably with other
popular sparse discriminant proposals.
Some key words: Discriminant analysis; Features annealed independence rule; Lasso; Nearest shrunken centroids
classifier; Nonpolynomial-dimension asymptotics.
1. INTRODUCTION
Consider a binary classification problem where x = (x_1, ..., x_p)^T represents the predictor vector and G = 1, 2 denotes the class label. Linear discriminant analysis is perhaps the oldest classification technique that is still being used routinely in applications. The linear discriminant analysis model assumes x | G = g ~ N(μ_g, Σ), pr(G = 1) = π_1, pr(G = 2) = π_2. Then the Bayes rule, which is the theoretically optimal classifier minimizing the 0–1 loss, classifies a data point to class 2 if and only if

{x − (μ_1 + μ_2)/2}^T Σ^{-1}(μ_2 − μ_1) + log(π_2/π_1) > 0.    (1)
Let μ̂_1, n_1 and μ̂_2, n_2 be the sample mean vector and sample size within classes 1 and 2, respectively. Let Σ̂ be the pooled sample covariance estimate of Σ. Linear discriminant analysis sets μ_1 = μ̂_1, μ_2 = μ̂_2, Σ = Σ̂, π_1 = n_1/n, π_2 = n_2/n in (1). Despite its simplicity, it has proved to be a reasonably good classifier in many applications. For example, Michie et al. (1994) and Hand (2006) have shown that linear discriminant analysis has very competitive performance for many real-world benchmark datasets.
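As a concrete illustration of the plug-in rule just described, the following sketch (our own illustration, not code from the paper) computes the linear discriminant direction and intercept in the classical p < n setting:

    # Illustrative sketch: the plug-in linear discriminant rule (1) with
    # sample estimates, valid in the classical p < n setting.
    import numpy as np

    def lda_fit(X, g):
        """X: n x p predictors; g: labels in {1, 2}. Returns (direction w, intercept b)."""
        X1, X2 = X[g == 1], X[g == 2]
        n1, n2 = len(X1), len(X2)
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        # pooled within-class covariance estimate
        S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n1 + n2 - 2)
        w = np.linalg.solve(S, mu2 - mu1)              # estimate of Sigma^{-1}(mu_2 - mu_1)
        b = np.log(n2 / n1) - 0.5 * (mu1 + mu2) @ w    # intercept from (1) with pi_g = n_g / n
        return w, b

    def lda_predict(X, w, b):
        # assign class 2 whenever the discriminant score is positive, as in (1)
        return np.where(X @ w + b > 0, 2, 1)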
With the rapid advance of technology, high-dimensional data appear more and more frequently.
In such data, the dimension, p, can be much larger than the sample size, n. It has been empiri-
cally observed that, for classification problems with high-dimension-and-low-sample-size data,
some simple linear classifiers perform as well as more sophisticated classification algorithms
such as the support vector machine and boosting. See, e.g., the comparison study by Dettling
(2004). Hall et al. (2005) provided some geometric insight into this phenomenon. In recent years,
many papers have considered ways to modify linear discriminant analysis to be suitable for high-
dimensional classification. One approach is to use more sophisticated estimates of the inverse
covariance matrix Σ^{-1} to replace the naive sample estimate. Under sparsity assumptions, one can obtain good estimators of Σ and Σ^{-1} even when p is much larger than n (Bickel & Levina, 2008; Cai et al., 2010; Rothman et al., 2008). However, a better estimate of Σ^{-1} does not necessarily lead to a better classifier. In an ideal scenario where we know that Σ is an identity matrix and π_1 = π_2 = 0·5, we could classify x to class 2 if {x − (μ̂_1 + μ̂_2)/2}^T(μ̂_2 − μ̂_1) > 0. Although this classifier does not suffer from the difficulty of estimating a large covariance matrix, Fan & Fan (2008) showed that it performs no better than random guessing when p is sufficiently large, owing to noise accumulation in estimating μ_1 and μ_2. Therefore, effectively exploiting sparsity is critically important for high-dimensional classification.
Tibshirani et al. (2002) proposed the nearest shrunken centroids classifier for tumour classification and gene selection using microarray data. This classifier is defined as follows. For each variable x_j, we compute d_{jg} = (1/n_g + 1/n)^{-1/2}(μ̂_{gj} − x̄_j)(s_j + s_0)^{-1} (g = 1, 2), where μ̂_{gj} is the within-class sample mean, s_j^2 is the sample estimate of Σ_{jj} and s_0 is a small positive constant added for robustness. For simplicity, we set s_0 = 0. Define the shrunken centroids mean by

μ̂^λ_{gj} = x̄_j + (n_g^{-1} + n^{-1})^{1/2} s_j d^λ_{jg}   (g = 1, 2),

where x̄_j = (n_1 μ̂_{1j} + n_2 μ̂_{2j})/n is the marginal sample mean of x_j, λ is a pre-chosen positive constant and d^λ_{jg} is computed by soft-thresholding d_{jg}: d^λ_{jg} = sign(d_{jg})(|d_{jg}| − λ)_+ (g = 1, 2). The nearest shrunken centroids classifier classifies x to class 2 if

∑_{j=1}^p {x_j − (μ̂^λ_{2j} + μ̂^λ_{1j})/2} s_j^{-2} (μ̂^λ_{2j} − μ̂^λ_{1j}) + log(n_2/n_1) > 0.    (2)

Comparing (2) and (1), we see that this classifier modifies the usual linear discriminant analysis in two ways. First, it uses only the diagonal of the sample covariance matrix to estimate Σ. If λ = 0, this classifier reduces to diagonal linear discriminant analysis, which has been shown in Bickel & Levina (2004) to work much better than linear discriminant analysis in high dimensions. Secondly, this classifier uses the shrunken centroids means to estimate μ_1, μ_2 for feature selection. If we use a sufficiently large λ, then the soft-thresholding operation will force μ̂^λ_{1j} = μ̂^λ_{2j} = x̄_j for some variables, which then make no contribution to the classifier defined in (2). Many experiments have shown that the nearest shrunken centroids classifier is very competitive for high-dimensional classification. More recently, Fan & Fan (2008) proposed features annealed independence rules, in which feature selection is done by hard-thresholding marginal t-statistics for testing whether μ_{1j} = μ_{2j}.
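To make the above construction concrete, the following sketch is an illustration under the conventions of this section, with s_0 = 0; it is not the pamr implementation, and the standardizing factor follows the formula as written above:

    # Illustrative sketch of the nearest shrunken centroids quantities used in (2).
    import numpy as np

    def shrunken_centroids(X, g, lam):
        n, p = X.shape
        xbar = X.mean(axis=0)
        mu, m, s = {}, {}, np.zeros(p)
        for k in (1, 2):
            mu[k] = X[g == k].mean(axis=0)
            s += ((X[g == k] - mu[k]) ** 2).sum(axis=0)
        s = np.sqrt(s / (n - 2))                                  # pooled within-class sd, s_j
        for k in (1, 2):
            nk = (g == k).sum()
            m[k] = (1.0 / nk + 1.0 / n) ** 0.5
            d = (mu[k] - xbar) / (m[k] * s)                       # d_{jg} with s_0 = 0
            d_soft = np.sign(d) * np.maximum(np.abs(d) - lam, 0)  # soft-thresholded d^lam_{jg}
            mu[k] = xbar + m[k] * s * d_soft                      # shrunken centroid mean
        return mu, s

    def nsc_classify(x, mu, s, n1, n2):
        # discriminant score (2): diagonal rule based on the shrunken centroids
        score = ((x - (mu[2] + mu[1]) / 2) / s ** 2) @ (mu[2] - mu[1]) + np.log(n2 / n1)
        return 2 if score > 0 else 1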
Since the goal of sparse discriminant analysis is to find those features that contribute most to
classification, the target of an ideal feature selection should be the discriminative set that contains
all the discriminative features that contribute to the Bayes rule, because we would use the Bayes
rule for classification if it was available. Feature selection is needed when the cardinality of the
discriminative set is much smaller than the total number of features. The performance of feature
selection by a sparse method is measured by its ability to discover the discriminative set. There is
little theoretical work for justifying the nearest shrunken centroids classifier and its variants. To
our knowledge, only Fan & Fan (2008) provided detailed theoretical analysis of features annealed
independence rules, under the fundamental assumption that is a diagonal matrix. However,
such an assumption is too restrictive to hold in applications, because strong correlations can
exist in high-dimensional data, and ignoring them may lead to misleading feature selection. We
argue that independence rules aim to discover the so-called signal set, whose definition is given
explicitly in
§ 2. We further provide a necessary and sufficient condition under which this set is
identical to the discriminative set. This condition can easily be violated and hence independence
rules could be problematic.
In this work, we propose a new procedure for sparse discriminant analysis in high dimen-
sions. Our proposal is motivated by the fact that classical linear discriminant analysis can be
reconstructed exactly via least squares (Hastie et al., 2008). We suggest using penalized sparse
least squares to derive sparse discriminant methods. Our proposal is computationally effi-
cient in high dimensions owing to efficient algorithms for computing penalized least squares.
We further provide theoretical justifications for our proposal. If the Bayes rule has a sparse
representation, our theoretical results show that the proposed sparse method can simultane-
ously identify the discriminative set and estimate the Bayes classification direction consistently.
The theory is valid even when the dimension can grow faster than any polynomial order of
the sample size and does not impose strong assumptions on the correlation structure among
predictors.
2. SIGNAL SET AND THE DISCRIMINATIVE SET
Consider the problem of tumour classification with gene expression arrays. It is an intuitively
sound claim that differentially expressed genes should be responsible for the tumour classifica-
tion and equally expressed genes can safely be discarded. However, we show in this section that
a differentially expressed gene can have no role in classification and an equally expressed gene
can significantly influence classification.
By definition, the discriminative set is equal to A = {j : {Σ^{-1}(μ_2 − μ_1)}_j ≠ 0}, since the Bayes classification direction is Σ^{-1}(μ_2 − μ_1). Variables in A are called discriminative variables. Define the signal set Ã = {j : μ_{1j} ≠ μ_{2j}}; variables in Ã are called signals. Ideally, Ã is the variable selection outcome of an independence rule. Practically, independence rules pick the strongest signals indicated by the data. When Σ is diagonal, A = Ã. For a general covariance matrix, however, the discriminative and the signal sets can be very different, as shown in the following proposition.
PROPOSITION 1. Partition Σ according to A and A^c as Σ = (Σ_{A,A}, Σ_{A,A^c}; Σ_{A^c,A}, Σ_{A^c,A^c}), and according to Ã and Ã^c as Σ = (Σ_{Ã,Ã}, Σ_{Ã,Ã^c}; Σ_{Ã^c,Ã}, Σ_{Ã^c,Ã^c}).

1. If and only if Σ_{Ã^c,Ã} Σ_{Ã,Ã}^{-1}(μ_{2,Ã} − μ_{1,Ã}) = 0, we have A ⊆ Ã.
2. If and only if μ_{2,A^c} = μ_{1,A^c}, or equivalently Σ_{A^c,A} Σ_{A,A}^{-1}(μ_{2,A} − μ_{1,A}) = 0, we have Ã ⊆ A.
Based on Proposition 1 we can construct examples in which a non-signal can be discriminative and a nondiscriminative feature can be a signal. Consider a linear discriminant analysis model with p = 25, μ_1 = 0_p, Σ_{ii} = 1 and Σ_{ij} = 0·5 (i, j = 1, ..., 25; i ≠ j). If μ_2 = (1_5^T, 0_20^T)^T, then Ã = {1, 2, 3, 4, 5} and A = {j : j = 1, ..., 25}, since Σ^{-1}(μ_2 − μ_1) = (1·62 × 1_5^T, −0·38 × 1_20^T)^T. Similarly, if we let μ_2 = (3 × 1_5^T, 2·5 × 1_20^T)^T, then all variables are signals but A = {1, 2, 3, 4, 5}, because Σ^{-1}(μ_2 − μ_1) = (1_5^T, 0_20^T)^T.
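The values quoted in both examples are easy to verify numerically; a minimal check (our own illustration):

    # Minimal numerical check of the two examples above: with equicorrelation 0.5,
    # a mean shift on the first five coordinates makes every variable discriminative,
    # while a suitable shift on all coordinates leaves only the first five discriminative.
    import numpy as np

    p = 25
    Sigma = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)      # Sigma_ii = 1, Sigma_ij = 0.5
    mu2_a = np.r_[np.ones(5), np.zeros(20)]              # first example
    mu2_b = np.r_[3 * np.ones(5), 2.5 * np.ones(20)]     # second example

    print(np.linalg.solve(Sigma, mu2_a).round(2))        # approx (1.62, ..., 1.62, -0.38, ..., -0.38)
    print(np.linalg.solve(Sigma, mu2_b).round(2))        # approx (1, ..., 1, 0, ..., 0)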
The above arguments warn us that independence rules could select a wrong set of features. Different sparse discriminant analysis methods have been proposed based on Fisher's view of linear discriminant analysis: the discriminant direction is obtained by maximizing β^T B̂ β / β^T Σ̂ β, where B̂ = (μ̂_2 − μ̂_1)(μ̂_2 − μ̂_1)^T. Wu et al. (2009) proposed the ℓ_1-constrained Fisher discriminant:

min_β β^T Σ̂ β,   subject to (β^T B̂ β)^{1/2} = 1, ||β||_1 ≤ τ.    (3)

A referee pointed out that a similar estimator was proposed by Trendafilov & Jolliffe (2007). When revising this paper, it came to our attention that Witten & Tibshirani (2011) proposed another ℓ_1-penalized linear discriminant:

max_β { β^T B̂ β − λ ∑_{j=1}^p |s_j β_j| },   subject to β^T Σ̂ β ≤ 1.    (4)

Little is known about the theoretical properties of the estimators defined in (3) and (4), but we include them in our numerical experiments.
3. METHOD AND THEORY
3·1. Sparse discriminant analysis via penalized least squares
Our approach is motivated by the intimate connection between linear discriminant analysis and least squares in the classical p < n setting; see Chapter 4 of Hastie et al. (2008). We code the class labels as y_1 = −n/n_1 and y_2 = n/n_2, where n = n_1 + n_2. Let

(β̂^{ols}, β̂_0^{ols}) = arg min_{β, β_0} ∑_{i=1}^n (y_i − β_0 − x_i^T β)^2.    (5)

Then β̂^{ols} = c Σ̂^{-1}(μ̂_2 − μ̂_1) for some positive constant c. In other words, the least squares formulation in (5) exactly derives the usual linear discriminant analysis direction.
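The proportionality between the least squares direction in (5) and the linear discriminant analysis direction is easy to confirm by simulation; a small check (our own illustration, with an arbitrary toy design):

    # Quick check that the least squares direction in (5) is proportional to the
    # linear discriminant analysis direction when p < n.
    import numpy as np

    rng = np.random.default_rng(0)
    n1, n2, p = 60, 40, 5
    X = np.vstack([rng.normal(0, 1, (n1, p)), rng.normal(0.5, 1, (n2, p))])
    g = np.r_[np.ones(n1), 2 * np.ones(n2)]
    y = np.where(g == 1, -(n1 + n2) / n1, (n1 + n2) / n2)    # coding used in (5)

    Z = np.column_stack([np.ones(n1 + n2), X])               # intercept plus predictors
    beta_ols = np.linalg.lstsq(Z, y, rcond=None)[0][1:]

    mu1, mu2 = X[g == 1].mean(axis=0), X[g == 2].mean(axis=0)
    Xc1, Xc2 = X[g == 1] - mu1, X[g == 2] - mu2
    Sigma_hat = (Xc1.T @ Xc1 + Xc2.T @ Xc2) / (n1 + n2 - 2)  # pooled covariance estimate
    beta_lda = np.linalg.solve(Sigma_hat, mu2 - mu1)

    print(beta_ols / beta_lda)   # entrywise ratios all equal the same positive constant c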
This connection is lost in high-dimensional problems because the sample covariance estimate
is not invertible and the linear discriminant analysis direction is not well defined. However, we
may consider a penalized least squares formulation to produce a classification direction. Let
P_λ(·) be a generic sparsity-inducing penalty, such as the lasso penalty (Tibshirani, 1996) where P_λ(t) = λt (t ≥ 0). We first compute the solution to a penalized least squares problem,

(β̂^λ, β̂_0^λ) = arg min_{β, β_0} { n^{-1} ∑_{i=1}^n (y_i − β_0 − x_i^T β)^2 + ∑_{j=1}^p P_λ(|β_j|) }.    (6)
Then our classification rule is to assign x to class 2 if
x^T β̂^λ + β̂_0 > 0.    (7)
Note that β̂_0 in (7) differs from β̂_0^λ in (6). In the p < n case, consider the ordinary least squares estimator and the usual linear discriminant analysis. Let us write β̂^{ols} = c β̂^{LDA}, where β̂^{LDA} = Σ̂^{-1}(μ̂_2 − μ̂_1). We should use β̂_0 = c β̂_0^{LDA} in (7), where β̂_0^{LDA} = log(n_2/n_1) − {(μ̂_1 + μ̂_2)/2}^T β̂^{LDA}, such that the ordinary least squares classifier and the linear discriminant analysis rule yield an identical classification. If we use β̂_0^{ols} in (7), then these two classifiers are not the same in general. Finding the right intercept is critical for classification but receives little attention in the literature. Hastie et al. (2008) mentioned that one could choose the intercept β̂_0 empirically by minimizing the training error. Fortunately, there is a closed-form formula for computing the optimal intercept.
PROPOSITION 2. Suppose that a linear classifier assigns x to class 2 if x^T β̃ + β̃_0 > 0. Given β̃, if (μ_2 − μ_1)^T β̃ > 0, then the optimal intercept is

β_0^{opt} = −(μ_1 + μ_2)^T β̃/2 + β̃^T Σ β̃ {(μ_2 − μ_1)^T β̃}^{-1} log(π_2/π_1),    (8)

which can be estimated by

β̂_0^{opt} = −(μ̂_1 + μ̂_2)^T β̃/2 + β̃^T Σ̂ β̃ {(μ̂_2 − μ̂_1)^T β̃}^{-1} log(n_2/n_1).    (9)
By Proposition 2, we calculate β̂_0 and then the classifier given in (7) assigns x to class 2 if

{x − (μ̂_1 + μ̂_2)/2}^T β̂^λ + (β̂^λ)^T Σ̂ β̂^λ {(μ̂_2 − μ̂_1)^T β̂^λ}^{-1} log(n_2/n_1) > 0.
Remark 1. The condition (μ_2 − μ_1)^T β̃ > 0 in Proposition 2 is very mild. If the linear classifier actually yields (μ_2 − μ_1)^T β̃ < 0, then we can always use β̃^{new} = −β̃, which obeys (μ_2 − μ_1)^T β̃^{new} > 0.
Remark 2. If π_1 = π_2 = 0·5, then we can take β̂_0^{opt} = −(μ̂_1 + μ̂_2)^T β̃/2. In general, we need to include the second term on the right-hand side of (9). If π_1 ≠ π_2 and without the sparsity condition on β̃, the second term in (9) would not work well when p ≫ n. Fortunately, when β̃ is sparse, we have

β̃^T Σ β̃ = ∑_{i,j: β̃_i ≠ 0, β̃_j ≠ 0} Σ_{ij} β̃_i β̃_j,   β̃^T Σ̂ β̃ = ∑_{i,j: β̃_i ≠ 0, β̃_j ≠ 0} Σ̂_{ij} β̃_i β̃_j.

Thus, even when p ≫ n, as long as ||β̃||_0 ≪ n, β̃^T Σ̂ β̃ is a good estimator of β̃^T Σ β̃. Using a regularized estimate of Σ could provide some further improvement. For example, for banded covariance matrices, the banding estimator (Bickel & Levina, 2008) and the tapering estimator (Cai et al., 2010) are better estimators of Σ than the sample covariance. However, in this work, our primary focus is β̂^λ and we do not want to entangle estimating large covariance matrices with feature selection.
In principle, (6) can work with any sparsity-inducing penalty function. We choose the lasso in this work because it is the most popular penalty in the literature. Other popular penalty functions include the smoothly clipped absolute deviation (Fan & Li, 2001), the elastic net (Zou & Hastie, 2005), the adaptive lasso (Zou, 2006), the fused lasso (Tibshirani et al., 2005), the grouped lasso (Yuan & Lin, 2006) and the minimum concavity penalty (Zhang, 2010). For convenience, we call the resulting classifier lassoed discriminant analysis from now on. We can use either the least angle regression algorithm (Efron et al., 2004) or the coordinate descent algorithm (Friedman et al., 2010) to compute the lasso-penalized least squares estimator.
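The whole procedure can be summarized in a few lines. The sketch below is a minimal illustration of lassoed discriminant analysis rather than the authors' implementation: it solves (10) with scikit-learn's coordinate-descent Lasso (a convenient but arbitrary solver choice), computes the intercept by (9) with the pooled covariance estimate, and classifies by (7).

    # Minimal sketch of lassoed discriminant analysis, assuming labels g in {1, 2}.
    import numpy as np
    from sklearn.linear_model import Lasso

    def lassoed_da_fit(X, g, lam):
        n1, n2 = int((g == 1).sum()), int((g == 2).sum())
        n = n1 + n2
        y = np.where(g == 1, -n / n1, n / n2)          # response coding of Section 3.1
        # scikit-learn's Lasso minimizes (2n)^{-1}||y - b0 - Xb||^2 + alpha ||b||_1,
        # which has the same minimizer as (10) when alpha = lam / 2.
        beta = Lasso(alpha=lam / 2, fit_intercept=True).fit(X, y).coef_
        mu1, mu2 = X[g == 1].mean(axis=0), X[g == 2].mean(axis=0)
        if (mu2 - mu1) @ beta < 0:                     # Remark 1: flip the sign if needed
            beta = -beta
        X1c, X2c = X[g == 1] - mu1, X[g == 2] - mu2
        Sigma_hat = (X1c.T @ X1c + X2c.T @ X2c) / (n - 2)   # pooled covariance estimate
        # intercept formula (9); assumes a nonzero fitted direction
        beta0 = -(mu1 + mu2) @ beta / 2 \
                + (beta @ Sigma_hat @ beta) / ((mu2 - mu1) @ beta) * np.log(n2 / n1)
        return beta, beta0

    def lassoed_da_predict(X, beta, beta0):
        return np.where(X @ beta + beta0 > 0, 2, 1)    # classification rule (7)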
3·2. Theory
We first introduce some necessary notation. For a general m × n matrix M, define ||M||_∞ = max_{i=1,...,m} ∑_{j=1}^n |M_{ij}|. For any vector b, let ||b||_∞ = max_j |b_j| and |b|_min = min_j |b_j|. We let β^{Bayes} = Σ^{-1}(μ_2 − μ_1) represent the Bayes classifier coefficient vector. So A = {j : β_j^{Bayes} ≠ 0}; let s be the cardinality of A. We use C = cov(x) to represent the marginal covariance matrix of the predictors and partition C as C = (C_{AA}, C_{AA^c}; C_{A^cA}, C_{A^cA^c}). We define three quantities that frequently appear in our analysis:

κ = ||C_{A^cA}(C_{AA})^{-1}||_∞,   ϕ = ||(C_{AA})^{-1}||_∞,   Δ = ||μ_{2A} − μ_{1A}||_∞.

Suppose that X is the predictor matrix and let X̃ be the centred predictor matrix, whose column-wise mean is zero. Obviously, C^{(n)} = X̃^T X̃/n is an empirical version of C. Likewise, we can write X̃_A^T X̃_A/n = C^{(n)}_{AA} and X̃_{A^c}^T X̃_A/n = C^{(n)}_{A^cA}.

Denote β* = (C_{AA})^{-1}(μ_{2A} − μ_{1A}). Now we can define β̃^{Bayes} by letting β̃_A^{Bayes} = β* and β̃_{A^c}^{Bayes} = 0. The following proposition shows the equivalence between β̃^{Bayes} and β^{Bayes}.

PROPOSITION 3. The quantities β̃^{Bayes} and β^{Bayes} are equivalent in the sense that β̃^{Bayes} = c β^{Bayes} for some positive constant c, and the Bayes classifier can be written as assigning x to class 2 if

{x − (μ_1 + μ_2)/2}^T β̃^{Bayes} + (β̃^{Bayes})^T Σ β̃^{Bayes} {(μ_2 − μ_1)^T β̃^{Bayes}}^{-1} log(π_2/π_1) > 0.

Proposition 3 tells us that it suffices to show that the proposed sparse discriminant analysis can consistently recover the support of β̃^{Bayes} and estimate β*.
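A quick numerical illustration of Proposition 3 on an assumed toy model, using the identity C = Σ + π_1 π_2 (μ_2 − μ_1)(μ_2 − μ_1)^T for the marginal covariance:

    # Small numerical illustration of Proposition 3: beta*, built from the marginal
    # covariance C restricted to A, is proportional to the Bayes direction on A.
    import numpy as np

    p, pi1, pi2 = 10, 0.3, 0.7
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1)-type Sigma
    beta_bayes = np.r_[1.0, -0.8, 0.6, np.zeros(p - 3)]                   # sparse Bayes direction
    mu1 = np.zeros(p)
    mu2 = Sigma @ beta_bayes                         # so that Sigma^{-1}(mu2 - mu1) = beta_bayes
    A = np.nonzero(beta_bayes)[0]

    C = Sigma + pi1 * pi2 * np.outer(mu2 - mu1, mu2 - mu1)                # marginal covariance of x
    beta_star = np.linalg.solve(C[np.ix_(A, A)], (mu2 - mu1)[A])          # (C_AA)^{-1}(mu_2A - mu_1A)

    print(beta_star / beta_bayes[A])   # constant positive ratio c, as claimed by Proposition 3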
Throughout our analysis we assume that the variance of each variable is bounded by a finite
constant. In practice, one often standardizes the data beforehand. Then the finite constant can be
taken as unity. In this subsection, ε_0 and c_1, c_2 are positive constants.
The lassoed discriminant analysis direction is computed by

(β̂^{lasso}, β̂_0^λ) = arg min_{β, β_0} { n^{-1} ∑_{i=1}^n (y_i − β_0 − x_i^T β)^2 + λ ∑_{j=1}^p |β_j| }.    (10)

If lassoed discriminant analysis finds the support of the Bayes rule, then we should have β̂^{lasso}_{A^c} = 0 and β̂^{lasso}_A should be identical to β̂_A, where

β̂_A = arg min_{β, β_0} { n^{-1} ∑_{i=1}^n (y_i − β_0 − ∑_{j∈A} x_{ij} β_j)^2 + ∑_{j∈A} λ|β_j| }.    (11)

We introduce β̂_A only for mathematical analysis. It is not a real estimator, because its definition depends on knowing A. To ensure that lassoed discriminant analysis has the variable selection consistency property, we impose a condition on the covariance matrix of the predictors:

κ = ||C_{A^cA}(C_{AA})^{-1}||_∞ < 1.    (12)

The condition in (12) is an analogue of the irrepresentability condition for the lasso regression estimator (Meinshausen & Bühlmann, 2006; Zou, 2006; Zhao & Yu, 2006; Wainwright, 2009). This condition can be relaxed by using a concave penalty to derive β̂^λ; see § 5.
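For a given covariance matrix and index set, the quantity κ in (12) is straightforward to evaluate; a small sketch (illustration only):

    # Computing kappa = ||C_{A^c A} (C_{AA})^{-1}||_inf for a given covariance C and
    # index set A, so that condition (12) can be checked numerically.
    import numpy as np

    def irrepresentability_kappa(C, A):
        p = C.shape[0]
        Ac = np.setdiff1d(np.arange(p), A)
        M = C[np.ix_(Ac, A)] @ np.linalg.inv(C[np.ix_(A, A)])
        return np.abs(M).sum(axis=1).max()      # matrix infinity norm: maximum absolute row sum

    # Example: equicorrelation 0.5 with s = 3 relevant variables out of p = 20
    p, A = 20, np.array([0, 1, 2])
    C = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)
    print(irrepresentability_kappa(C, A))       # 0.75 < 1, so (12) holds for this C and A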
THEOREM 1. Pick any λ such that λ < min{|β*|_min/(2ϕ), Δ}. Then:

1. assuming the condition in (12), with probability at least 1 − δ_1, β̂^{lasso}_A = β̂_A and β̂^{lasso}_{A^c} = 0, where

δ_1 = 2ps exp(−c_1 n s^{-2} ε^2) + 2p exp{−c_2 n λ^2 (1 − κ − 2ϕε)^2 (1 + κ)^{-2}/16}    (13)

and ε is any positive constant less than min[ε_0, λ(1 − κ)(4ϕ)^{-1}{λ/2 + Δ(1 + κ)}^{-1}];

2. with probability at least 1 − δ_2, none of the elements of β̂_A is zero, where

δ_2 = 2s^2 exp(−c_1 n s^{-2} ε^2) + 2s exp(−c_2 n ε^2)    (14)

and, writing ζ = |β*|_min/(Δϕ), ε is any positive constant less than min{ε_0, ζϕ^{-1}(3 + ζ)^{-1}, ζΔ(6 + 2ζ)^{-1}};

3. for any positive ε satisfying ε < min{ε_0, λ(2ϕ)^{-1}}, we have

pr(||β̂_A − β*||_∞ ≤ 4ϕλ) ≥ 1 − 2s^2 exp(−c_1 n s^{-2} ε^2) − 2s exp(−c_2 n ε^2).    (15)
The non-asymptotic results in Theorem 1 can easily be translated into asymptotic arguments by allowing the triple (n, s, p) to tend to infinity at suitable rates. To highlight the main points, we assume that Δ, κ, ϕ are constants. In addition, we need the following regularity conditions:

Condition 1. n, p → ∞ and log(ps)s^2/n → 0;

Condition 2. |β*|_min ≫ {log(ps)s^2/n}^{1/2}.

Condition 1 restricts p. Clearly, we cannot expect the proposed method to work for an arbitrarily large p. However, the restriction is rather loose. Consider the case where s = O(n^{1/2−γ}) for some γ < 1/2. Condition 1 holds as long as p ≪ exp(n^{2γ}). Therefore, p is allowed to grow faster than any polynomial order of n, which is referred to as nonpolynomial-dimension asymptotics. Condition 2 requires the nonzero elements of the Bayes rule to be large enough that we can consistently separate them from zero using the observed data. The lower bound actually converges to zero under Condition 1, so Condition 2 is not strong.
THEOREM 2. Let Â = {j : β̂^{lasso}_j ≠ 0}. Under Conditions 1 and 2, if we choose λ = λ_n such that λ_n ≪ |β*|_min and λ_n ≫ {log(ps)s^2/n}^{1/2}, and further assume κ < 1, then pr(Â = A) → 1 and pr(||β̂^{lasso}_A − β*||_∞ ≤ 4ϕλ_n) → 1.

Remark 3. Although we use penalized least squares to estimate the classification direction, there is a fundamental difference between Theorem 1 and the theoretical results derived for lasso-penalized least squares regression (Meinshausen & Bühlmann, 2006; Zhao & Yu, 2006; Wainwright, 2009). The previous work assumes that the data obey a linear regression model with additive noise, which is not true for y and x in (10).
Table 1. Simulation models. The choices of n, p, and β
Bayes
are listed
Model np β
Bayes
1 100 400
ij
= 0·5
|i j|
.0·556 (3, 1·5, 0, 0, 2, 0
p5
)
T
2 100 400
ij
= 0·5
|i j|
.0·582 (3, 2·5, 2·8, 0
p3
)
T
3 400 800
jj
= 1,
ij
= 0·5, i |= j.0·395 (3, 1·7, 2·2, 2·1, 2·55, 0
p5
)
T
4 300 800 = I
5
˜
,I
5
is an identity matrix; 0·916 (1·2, 1·4, 1·15, 1·64, 1·5, 1, 2, 0
p7
)
T
˜
jj
= 1,
˜
ij
= 0·6, i |= j.
5 400 800
jj
= 1,
ij
= 0·5, i |= j.0·551 (3, 1·7, 2·2,2·1, 2·55,(p 5)
1
1
p5
)
T
6 400 800
jj
= 1,
ij
= 0·5, i |= j.0·362 (3, 1·7, 2·2, 2·1, 2·55,(p 5)
1
1
p5
)
T
Remark 4. Our method is also fundamentally different from those based on high-dimensional
covariance estimation. In the current literature on covariance or inverse-covariance matrix
estimation, a commonly used assumption is that the target matrix has some sparsity structure
(Bickel & Levina, 2008; Cai et al., 2010). Such assumptions are not needed in our method.
4. NUMERICAL RESULTS
4·1. Simulation
We use simulated data to demonstrate the good performance of our proposal. For comparison, we included the nearest shrunken centroids classifier (Tibshirani et al., 2002), the features annealed independence rule (Fan & Fan, 2008), the ℓ_1-penalized linear discriminant (Witten & Tibshirani, 2011) and the ℓ_1-constrained Fisher discriminant (Wu et al., 2009). The nearest shrunken centroids classifier is implemented in the R package pamr; see http://cran.r-project.org/web/packages/pamr/index.html. The ℓ_1-penalized linear discriminant is implemented in the R package penalizedLDA; see http://cran.r-project.org/web/packages/penalizedLDA/index.html. We used the code of Wu et al. (2009) to implement the ℓ_1-constrained Fisher discriminant. We also considered the t-test classifier: we first performed Bonferroni-adjusted t-tests with size 0·05 and then did linear discriminant analysis using only those features that passed the t-test.
We randomly generated n class labels such that π_1 = π_2 = 0·5. Conditional on the class label g (g = 1, 2), we generated the p-dimensional predictor x from a multivariate normal distribution with mean vector μ_g and covariance Σ. Without loss of generality, we set μ_1 = 0 and μ_2 = Σβ^{Bayes}. We considered six different simulation models. The choices of n, p, Σ and β^{Bayes} are shown in Table 1. Models 1–4 are sparse discriminant models with different covariance and mean structures; Models 5 and 6 are practically sparse in the sense that their Bayes rules depend on all variables in theory but can be well approximated by sparse discriminant functions. Table 2 summarizes the simulation results based on 2000 replications.
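The data-generating mechanism just described is easy to reproduce; a sketch for a Model 3-type design (our own illustration; the β^{Bayes} entries are those listed in Table 1):

    # Sketch of the simulation design described above, using the Model 3 covariance
    # (unit variances, constant correlation 0.5) as an example.
    import numpy as np

    def simulate(n, p, beta_bayes, rho=0.5, seed=0):
        rng = np.random.default_rng(seed)
        Sigma = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
        mu1, mu2 = np.zeros(p), Sigma @ beta_bayes          # mu_1 = 0, mu_2 = Sigma beta^Bayes
        g = rng.choice([1, 2], size=n)                      # pi_1 = pi_2 = 0.5
        L = np.linalg.cholesky(Sigma)
        X = np.where(g[:, None] == 1, mu1, mu2) + rng.standard_normal((n, p)) @ L.T
        return X, g

    beta_bayes = 0.395 * np.r_[3, 1.7, 2.2, 2.1, 2.55, np.zeros(800 - 5)]   # Model 3
    X, g = simulate(400, 800, beta_bayes)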
Table 2. Simulation results. The methods are named after the first author of the original papers. The reported numbers are medians with their standard errors, obtained by bootstrap, in parentheses. TRUE selection and FALSE selection denote the numbers of selected variables from the discriminative set and its complement, respectively. Fitted model size is the total number of selected variables

                      Bayes   Our method     Wu             Witten          Tibshirani     Fan            t-test classifier
Model 1
  Error (%)           10      10·89 (0·03)   13·71 (0·01)   10·81 (0·01)    10·94 (0·02)   11·47 (0·05)   10·46 (0·01)
  TRUE selection      3       3 (0)          3 (0)          1 (0)           3 (0)          3 (0)          3 (0)
  FALSE selection     0       2 (0·16)       0 (0·49)       26 (0·11)       6 (0·61)       7 (0·66)       1 (0)
Model 2
  Error (%)           10      12·84 (0·05)   14·5 (0·01)    14·25 (0·02)    15·12 (0·05)   15·67 (0·07)   14·12 (0·01)
  TRUE selection      3       3 (0)          1 (0·14)       2 (0)           2 (0·34)       2 (0)          2 (0)
  FALSE selection     0       6 (0·27)       0 (0)          4 (0·61)        9 (0·73)       8 (0·29)       0 (0)
Model 3
  Error (%)           20      21·93 (0·03)   22·37 (0·05)   33·69 (0·01)    27·48 (0·07)   25·69 (0·02)   38·94 (0·03)
  TRUE selection      5       5 (0)          5 (0)          3 (0)           3 (0)          2 (0)          3 (0)
  FALSE selection     0       14 (0·59)      2 (0)          419·5 (10·19)   2 (0·31)       0 (0)          790 (0·42)
Model 4
  Error (%)           10      12·50 (0·02)   13·99 (0·03)   23·90 (0·01)    19·25 (0·04)   18·56 (0·00)   24·59 (0·06)
  TRUE selection      7       7 (0)          6 (0)          4 (0)           4 (0)          3 (0)          6 (0)
  FALSE selection     0       18 (0·70)      2 (0)          35 (4·43)       1 (0·48)       0 (0)          153 (0)
Model 5
  Error (%)           10      11·11 (0·02)   12·07 (0·07)   21·99 (0·01)    14·72 (0·03)   14·27 (0·01)   25·79 (0·04)
  Fitted model size   800     21 (0·65)      7 (0·16)       737 (2·29)      3 (0·46)       3 (0)          800 (0)
Model 6
  Error (%)           20      22·22 (0·03)   23·34 (0·05)   30·43 (0·01)    26·13 (0·07)   24·14 (0)      36·80 (0·04)
  Fitted model size   800     20 (0·53)      5 (0·49)       592·5 (7·46)    8 (0·51)       3 (0)          798 (0)

Our method, lassoed discriminant analysis, is the only one that shows good performance in all six simulation settings. It closely mimics the Bayes rule, regardless of the Bayes error and covariance structure. Tibshirani's method and Fan's method have very comparable performance, but they are much worse than ours except for Model 1. In Model 1, the first five elements of μ_2 − μ_1 are much larger than the rest, which implies that independence rules can include all three discriminative variables. On the other hand, although Model 2 uses the same Σ as in Model 1, it has a very different mean structure: the first two elements of μ_2 − μ_1 are huge while the rest are much smaller. This means that independence rules have difficulty in selecting variable 3, resulting in inferior classification. Wu's method has good classification accuracy overall, but it can often miss some important features. Witten and Tibshirani's method has rather poor performance, which is somewhat surprising because the basic idea is similar to Wu's. Witten and Tibshirani's formulation in (4) is nonconvex while Wu's formulation in (3) is convex, which may help explain their different performances. The t-test classifier is best for Model 1, second best for Model 2 and worst for Models 3–6.
Table 3. Comparing lassoed discriminant analysis with other approaches on the colon and the prostate datasets

                               Our method    Witten        Wu           Tibshirani    Fan           t-test classifier
Colon     Accuracy (%)         86·4 (1·54)   86·4 (0·49)   84·1 (2·17)  86·4 (1·20)   86·4 (0·61)   81·0 (1·67)
          Fitted model size    5 (0·63)      10 (1·39)     1 (0)        89 (29·95)    11 (1·19)     16 (1·15)
Prostate  Accuracy (%)         94·1 (0·55)   91·2 (0·24)   91·2 (0·70)  91·2 (0·96)   76·5 (0·54)   76·5 (1·62)
          Fitted model size    10 (0·77)     18 (4·45)     1 (0)        10 (0·84)     4 (0·40)      58 (2·65)

Table 4. Adjusted classification accuracies of four methods by forcing them to select a similar number of genes as does our method

            Witten        Wu            Tibshirani    Fan
Colon       86·4 (0·51)   86·4 (1·06)   63·6 (0·70)   77·3 (2·16)
Prostate    94·1 (1·25)   91·2 (1·24)   91·2 (1·39)   73·5 (1·11)
4·2. Real data
We further compare the methods on two benchmark datasets: the colon and prostate cancer datasets. The basic task here is to predict whether an observation is tumour or normal tissue. We randomly split the datasets into training and test sets with ratio 2:1. Model fitting was done on the training set and classification accuracy was evaluated on the test set. This procedure was repeated 100 times. Shown in Table 3 are the classification accuracies and the numbers of genes selected by each competitor.
The colon and prostate datasets have been previously used to test classification and feature selection methods. See Alon et al. (1999), Singh et al. (2002) and Dettling (2004). Dettling (2004) reported that BagBoost was the most accurate classifier for the prostate data, with 92·5% classification accuracy, and the nearest shrunken centroids classifier was the most accurate classifier for the colon data. Table 3 shows that our method is as accurate as the nearest shrunken centroids classifier on the colon data and significantly outperforms BagBoost on the prostate data. Since BagBoost does not do gene selection, we do not include it in Table 3. Witten and Tibshirani's method works quite well on these two real datasets. As suggested by a referee, we adjusted the tuning parameters of Witten and Tibshirani's, Wu's, Tibshirani's and Fan's methods such that they selected a similar number of genes to that selected by our method on each dataset. Their adjusted classification accuracies are reported in Table 4. This adjustment helps Witten and Tibshirani's method but degrades Tibshirani's and Fan's methods.
5. DISCUSSION
Sparse discriminant analysis based on independence rules is computationally attractive for
high-dimensional classification. However, independence rules may lead to misleading feature
selection and hence poor classification performance, due to the difference between discrimi-
native and signal variables. When doing feature selection in classification, one should aim to
recover the discriminative set, not the signal set, which is the goal of large-scale hypothesis test-
ing. Discovering the signal set is the fundamental question of research in many scientific studies
(Efron, 2010), but identifying features for classification could be very different from identifying
interesting signals, and hence the statistical tools for data analysis should be carefully chosen.
If one feels that the condition (12) is somewhat strong for establishing the nonpolynomial-
dimension theory, one may use a concave penalty other than the lasso penalty. We have tried
using the smoothly clipped absolute deviation penalty (Fan & Li, 2001) and have shown that
the resulting sparse discriminant algorithm enjoys a strong oracle property without requiring the
condition (12). These results are given in a technical report which is available upon request. We
only focus on lassoed discriminant analysis in the current paper because our primary goal is
to demonstrate the effectiveness of the penalized least squares formulation of sparse discrim-
inant analysis. We do not want to overly emphasize the penalty function. For a given dataset,
one penalty function may be more appropriate than others. For example, in some situations, the
predictors may have a natural ordering, so ordered variable selection is preferred, and the nested
lasso (Levina et al., 2008) is a better choice than the lasso or smoothly clipped absolute devia-
tion penalty. It is straightforward to implement the nested lasso in our proposal, but a detailed
treatment is beyond the scope of this paper.
ACKNOWLEDGEMENT
The authors thank the editor, associate editor and referees for their helpful comments and sug-
gestions. Mai is supported by an Alumni Fellowship from the School of Statistics at the University
of Minnesota. Zou and Yuan are supported by the National Science Foundation, U.S.A.
APPENDIX
Proof of Proposition 1. 1. Let Θ = Σ^{-1} and β^{Bayes} = Θ(μ_2 − μ_1). Note that A ⊆ Ã is equivalent to β^{Bayes}_{Ã^c} = 0, and β^{Bayes}_{Ã^c} = Θ_{Ã^c,Ã}(μ_{2,Ã} − μ_{1,Ã}) because μ_{2,Ã^c} = μ_{1,Ã^c}, where

Θ_{Ã^c,Ã} = −(Σ_{Ã^c,Ã^c} − Σ_{Ã^c,Ã} Σ_{Ã,Ã}^{-1} Σ_{Ã,Ã^c})^{-1} Σ_{Ã^c,Ã} Σ_{Ã,Ã}^{-1}.

Since the leading factor is nonsingular, β^{Bayes}_{Ã^c} = 0 if and only if Σ_{Ã^c,Ã} Σ_{Ã,Ã}^{-1}(μ_{2,Ã} − μ_{1,Ã}) = 0. Therefore, part 1 is proven.
2. By definition, Ã ⊆ A is equivalent to μ_{2,A^c} = μ_{1,A^c}. Now using μ_2 − μ_1 = Σ β^{Bayes} we have μ_{2,A} − μ_{1,A} = Σ_{A,A} β^{Bayes}_A and μ_{2,A^c} − μ_{1,A^c} = Σ_{A^c,A} β^{Bayes}_A. Hence, μ_{2,A^c} − μ_{1,A^c} = Σ_{A^c,A} Σ_{A,A}^{-1}(μ_{2,A} − μ_{1,A}). Thus part 2 is proven.
Proof of Proposition 2. We recode the response variable as y = ±1. Note that β̃_0^{opt} = arg min_{β̃_0} E{y_new ≠ sign(x_new^T β̃ + β̃_0) | training data}. Since (y_new, x_new) is independent of the training data, (y_new, z_new = x_new^T β̃) obeys a one-dimensional linear discriminant analysis model, that is, z_new | y_new = 1 ~ N(β̃^T μ_2, β̃^T Σ β̃), pr(y_new = 1) = π_2 and z_new | y_new = −1 ~ N(β̃^T μ_1, β̃^T Σ β̃), pr(y_new = −1) = π_1. Then a straightforward calculation gives (8).
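For completeness, a brief sketch of that calculation, which is the usual optimal-threshold computation for a univariate normal mixture; here Φ and φ denote the standard normal distribution and density functions. Write a_g = β̃^T μ_g (g = 1, 2) and σ^2 = β̃^T Σ β̃, with a_2 > a_1 by assumption. The rule that assigns class 2 when z_new > t misclassifies with probability

R(t) = π_2 Φ{(t − a_2)/σ} + π_1 [1 − Φ{(t − a_1)/σ}].

Setting R'(t) = 0 gives π_2 φ{(t − a_2)/σ} = π_1 φ{(t − a_1)/σ}, that is, (t − a_1)^2 − (t − a_2)^2 = 2σ^2 log(π_1/π_2), whose solution is

t = (a_1 + a_2)/2 + σ^2 (a_2 − a_1)^{-1} log(π_1/π_2),

and R' changes sign from negative to positive there, so t minimizes R. Since β̃_0^{opt} = −t, this is exactly (8).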
Proof of Proposition 3. By definition we can write C_{AA} = Σ_{AA} + π_1 π_2 (μ_{2A} − μ_{1A})(μ_{2A} − μ_{1A})^T and C_{AA} β̃_A^{Bayes} = μ_{2A} − μ_{1A}. Let c = {1 + π_1 π_2 (μ_{2A} − μ_{1A})^T Σ_{AA}^{-1}(μ_{2A} − μ_{1A})}^{-1} > 0; then, by the Sherman–Morrison formula, β̃_A^{Bayes} = c Σ_{AA}^{-1}(μ_{2A} − μ_{1A}) = c β^{Bayes}_A.
We now prove Theorems 1 and 2. With the class labels coded as −n/n_1 and n/n_2 and the centred predictor matrix X̃, we can write

β̂^{lasso} = arg min_β { n^{-1} β^T (X̃^T X̃) β − 2(μ̂_2 − μ̂_1)^T β + λ ∑_{j=1}^p |β_j| }.

The following two lemmas are used repeatedly in the proof.

LEMMA A1. There exist constants ε_0 and c_1, c_2 such that for any ε ≤ ε_0 we have

pr(|C^{(n)}_{ij} − C_{ij}| ≥ ε) ≤ 2 exp(−n ε^2 c_1)   (i, j = 1, ..., p);    (A1)
pr{|(μ̂_{2j} − μ̂_{1j}) − (μ_{2j} − μ_{1j})| ≥ ε} ≤ 2 exp(−n ε^2 c_2)   (j = 1, ..., p);    (A2)
pr(||C^{(n)}_{AA} − C_{AA}||_∞ ≥ ε) ≤ 2s^2 exp(−n s^{-2} ε^2 c_1);    (A3)
pr(||C^{(n)}_{A^cA} − C_{A^cA}||_∞ ≥ ε) ≤ 2(p − s)s exp(−n s^{-2} ε^2 c_1);    (A4)
pr{||(μ̂_2 − μ̂_1) − (μ_2 − μ_1)||_∞ ≥ ε} ≤ 2p exp(−n ε^2 c_2);    (A5)
pr{||(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})||_∞ ≥ ε} ≤ 2s exp(−n ε^2 c_2).    (A6)
LEMMA A2. There exist constants ε_0, c_1 such that for any ε ≤ min(ε_0, 1), we have

pr{||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} − C_{A^cA}(C_{AA})^{-1}||_∞ ≥ εϕ(κ + 1)(1 − εϕ)^{-1}} ≤ 2ps exp(−n s^{-2} ε^2 c_1).
Proof of Lemma A1. We only prove (A1) and (A2). Inequalities (A3)–(A6) can be obtained from (A1)–(A2) by union bounds. First, pr(|μ̂_{1j} − μ_{1j}| ≥ ε | Y) ≤ 2 exp{−n_1 ε^2/(2σ_j^2)}. Also, n_1 is binomial with parameters (n, π_1) and pr(|n_1 − π_1 n| ≥ εn) ≤ 2 exp(−c n ε^2). Therefore, pr(|μ̂_{1j} − μ_{1j}| ≥ ε) ≤ 2 exp{−n π_1 ε^2/(4σ_j^2)} + 2 exp(−nc π_1^2/4) ≤ 2 exp(−n c^{(1)} ε^2). The same inequality also holds for class 2. Thus (A2) holds.

To prove (A1), note that C^{(n)}_{ij} = n^{-1} ∑_{k=1}^n x_{ki} x_{kj} − x̄_i x̄_j. Since x̄_v = π̂_1 μ̂_{1v} + π̂_2 μ̂_{2v} (v = i, j), by the previous arguments we know that there exists c_1 > 0 such that

pr{| x̄_i x̄_j − E(x_i)E(x_j) | ≥ ε} ≤ 2 exp(−n c_1 ε^2).

We further have that n^{-1} ∑_{k=1}^n x_{ki} x_{kj} − E(x_i x_j) = ∑_{l=1}^2 (n_l/n){n_l^{-1} ∑_{g_k = l} x_{ki} x_{kj} − E(x_i x_j | g = l)} + ∑_{l=1}^2 E(x_i x_j | g = l)(n_l/n − π_l) and E(x_i x_j | g = l) = Σ_{ij} + μ_{li} μ_{lj} for l = 1, 2. Note that

n_l^{-1} ∑_{g_k = l} x_{ki} x_{kj} = n_l^{-1} ∑_{g_k = l} (x_{ki} − μ_{li})(x_{kj} − μ_{lj}) − μ_{li}(μ_{lj} − μ̂_{lj}) − μ_{lj}(μ_{li} − μ̂_{li}) + μ_{li} μ_{lj}.

Bickel & Levina (2008) showed that, for ε < ε_0,

pr{| n_l^{-1} ∑_{g_k = l} (x_{ki} − μ_{li})(x_{kj} − μ_{lj}) − Σ_{ij} | > ε | Y} ≤ 2 exp(−c_3 n ε^2).    (A7)

Combining the concentration results for μ̂_{lν}, n_l and (A7), we have (A1).
Proof of Lemma A2. Let η_1 = ||C_{AA} − C^{(n)}_{AA}||_∞, η_2 = ||C_{A^cA} − C^{(n)}_{A^cA}||_∞ and η_3 = ||(C^{(n)}_{AA})^{-1} − (C_{AA})^{-1}||_∞. First we have

||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} − C_{A^cA}(C_{AA})^{-1}||_∞
  ≤ ||C^{(n)}_{A^cA} − C_{A^cA}||_∞ × ||(C^{(n)}_{AA})^{-1} − (C_{AA})^{-1}||_∞ + ||C^{(n)}_{A^cA} − C_{A^cA}||_∞ × ||(C_{AA})^{-1}||_∞
    + ||C_{A^cA}(C_{AA})^{-1}||_∞ × ||C_{AA} − C^{(n)}_{AA}||_∞ × ||(C_{AA})^{-1}||_∞
    + ||C_{A^cA}(C_{AA})^{-1}||_∞ × ||C_{AA} − C^{(n)}_{AA}||_∞ × ||(C^{(n)}_{AA})^{-1} − (C_{AA})^{-1}||_∞
  ≤ (κη_1 + η_2)(ϕ + η_3).

Moreover, η_3 ≤ ||(C^{(n)}_{AA})^{-1}||_∞ × ||C^{(n)}_{AA} − C_{AA}||_∞ × ||(C_{AA})^{-1}||_∞ ≤ (ϕ + η_3)η_1 ϕ. So, as long as ϕη_1 < 1, we have η_3 ≤ ϕ^2 η_1 (1 − ϕη_1)^{-1} and hence

||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} − C_{A^cA}(C_{AA})^{-1}||_∞ ≤ (κη_1 + η_2)ϕ(1 − ϕη_1)^{-1}.

Then we consider the event max(η_1, η_2) ≤ ε and use Lemma A1 to obtain Lemma A2.
Proof of Theorem 1. We first prove conclusion 1. By (11) we can write β̂_A = (n^{-1} X̃_A^T X̃_A)^{-1}{(μ̂_{2A} − μ̂_{1A}) − λ t_A/2}, where t_A represents the subgradient such that t_j = sign(β̂_j) if β̂_j ≠ 0 and −1 ≤ t_j ≤ 1 if β̂_j = 0. Write

β̂_A = (C_{AA})^{-1}(μ_{2A} − μ_{1A}) + (C^{(n)}_{AA})^{-1}{(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})}
       + {(C^{(n)}_{AA})^{-1} − (C_{AA})^{-1}}(μ_{2A} − μ_{1A}) − λ(C^{(n)}_{AA})^{-1} t_A/2.    (A8)

In order to show that β̂^{lasso} = (β̂_A, 0), it suffices to verify that

||n^{-1} X̃_{A^c}^T X̃_A β̂_A − (μ̂_{2A^c} − μ̂_{1A^c})||_∞ ≤ λ/2.    (A9)

The left-hand side of (A9) is equal to

||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1}(μ̂_{2A} − μ̂_{1A}) − C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} λ t_A/2 − (μ̂_{2A^c} − μ̂_{1A^c})||_∞.    (A10)

Using C_{A^cA} C_{AA}^{-1}(μ_{2A} − μ_{1A}) = μ_{2A^c} − μ_{1A^c}, (A10) is bounded from above by

U_1 = ||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} − C_{A^cA} C_{AA}^{-1}||_∞ Δ + ||(μ̂_{2A^c} − μ̂_{1A^c}) − (μ_{2A^c} − μ_{1A^c})||_∞
      + {||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} − C_{A^cA} C_{AA}^{-1}||_∞ + κ} ||(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})||_∞
      + (||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} − C_{A^cA} C_{AA}^{-1}||_∞ + κ)λ/2.

If ||C^{(n)}_{A^cA}(C^{(n)}_{AA})^{-1} − C_{A^cA} C_{AA}^{-1}||_∞ ≤ ε(κ + 1)ϕ(1 − εϕ)^{-1} and

||(μ̂_2 − μ̂_1) − (μ_2 − μ_1)||_∞ ≤ 4^{-1} λ(1 − κ − 2ϕε)/(1 + κ),

then U_1 ≤ λ/2. Therefore, by Lemmas A1 and A2, we have

pr{||n^{-1} X̃_{A^c}^T X̃_A β̂_A − (μ̂_{2A^c} − μ̂_{1A^c})||_∞ ≤ λ/2} ≥ 1 − 2ps exp(−n s^{-2} ε^2 c_1) − 2p exp[−n c_2 {4^{-1} λ(1 − κ − 2ϕε)/(1 + κ)}^2] = 1 − δ_1.

Thus (13) is proven.
We now prove conclusion 2. Let ζ = |β*|_min/(Δϕ). Write η_1 = ||C_{AA} − C^{(n)}_{AA}||_∞ and η_3 = ||(C^{(n)}_{AA})^{-1} − (C_{AA})^{-1}||_∞. Then for any j ∈ A,

|β̂_j| ≥ ζϕΔ − (η_3 + ϕ){λ/2 + ||(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})||_∞} − η_3 Δ.

When η_1 ϕ < 1 we have shown that η_3 ≤ ϕ^2 η_1 (1 − η_1 ϕ)^{-1}, thus

|β̂_j| ≥ ζϕΔ − (1 − η_1 ϕ)^{-1}{λϕ/2 + ||(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})||_∞ ϕ + ϕ^2 η_1 Δ} ≡ L_1.

Because ||β*||_∞ ≤ ϕΔ, ζ ≤ 1. Hence λ ≤ |β*|_min/(2ϕ) ≤ 2|β*|_min/{(3 + ζ)ϕ} = 2ζΔ(3 + ζ)^{-1}. Under the events η_1 ≤ ε and ||(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})||_∞ ≤ ε we have L_1 > 0. Therefore,

pr(L_1 > 0) ≥ 1 − 2s^2 exp(−c_1 n s^{-2} ε^2) − 2s exp(−n c_2 ε^2).

Thus (14) is proven.
We now prove conclusion 3. By (A8) and η_1 ϕ < 1, we have

||β̂_A − β*||_∞ ≤ (1 − η_1 ϕ)^{-1}{λϕ/2 + ||(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})||_∞ ϕ + ϕ^2 η_1 Δ}.

Under the events η_1 < ε and ||(μ̂_{2A} − μ̂_{1A}) − (μ_{2A} − μ_{1A})||_∞ ≤ ε we have ||β̂_A − β*||_∞ ≤ 4ϕλ. Thus,

pr(||β̂_A − β*||_∞ ≤ 4ϕλ) ≥ 1 − 2s^2 exp(−c_1 n s^{-2} ε^2) − 2s exp(−n c_2 ε^2).

Thus (15) is proven.

Proof of Theorem 2. Theorem 2 follows directly from Theorem 1, so its proof is omitted.
REFERENCES
ALON, U., BARKAI, N., NOTTERMAN, D. A., GISH, K., MACK, S. & LEVINE, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Nat. Acad. Sci. 96, 6745–50.
BICKEL, P. J. & LEVINA, E. (2004). Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli 10, 989–1010.
BICKEL, P. J. & LEVINA, E. (2008). Regularized estimation of large covariance matrices. Ann. Statist. 36, 199–227.
CAI, T., ZHANG, C. H. & ZHOU, H. (2010). Optimal rates of convergence for covariance matrix estimation. Ann. Statist. 38, 2118–44.
DETTLING, M. (2004). Bagboosting for tumor classification with gene expression data. Bioinformatics 20, 3583–93.
EFRON, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing and Prediction. Cambridge: Cambridge University Press.
EFRON, B., HASTIE, T. J., JOHNSTONE, I. M. & TIBSHIRANI, R. J. (2004). Least angle regression. Ann. Statist. 32, 407–99.
FAN, J. & FAN, Y. (2008). High dimensional classification using features annealed independence rules. Ann. Statist. 36, 2605–37.
FAN, J. & LI, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Assoc. 96, 1348–60.
FRIEDMAN, J. H., HASTIE, T. J. & TIBSHIRANI, R. J. (2010). Regularization paths for generalized linear models via coordinate descent. J. Statist. Software 33, 1–22.
HALL, P., MARRON, J. S. & NEEMAN, A. (2005). Geometric representation of high dimension, low sample size data. J. R. Statist. Soc. B 67, 427–44.
HAND, D. J. (2006). Classifier technology and the illusion of progress. Statist. Sci. 21, 1–14.
HASTIE, T. J., TIBSHIRANI, R. J. & FRIEDMAN, J. H. (2008). Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Berlin: Springer.
LEVINA, E., ROTHMAN, A. J. & ZHU, J. (2008). Sparse estimation of large covariance matrices via a nested lasso penalty. Ann. Appl. Statist. 2, 245–63.
MEINSHAUSEN, N. & BÜHLMANN, P. (2006). High dimensional graphs and variable selection with the lasso. Ann. Statist. 34, 1436–62.
MICHIE, D., SPIEGELHALTER, D. J. & TAYLOR, C. C. (1994). Machine Learning, Neural and Statistical Classification. Chichester: Ellis Horwood.
ROTHMAN, A. J., BICKEL, P., LEVINA, E. & ZHU, J. (2008). Sparse permutation invariant covariance estimation. Electron. J. Statist. 2, 494–515.
SINGH, D., FEBBO, P. G., ROSS, K., JACKSON, D. G., MANOLA, J., LADD, C., TAMAYO, P., RENSHAW, A. A., D'AMICO, A. V., RICHIE, J. P., ET AL. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203–9.
TIBSHIRANI, R. J. (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–88.
TIBSHIRANI, R. J., HASTIE, T. J., NARASIMHAN, B. & CHU, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Nat. Acad. Sci. 99, 6567–72.
TIBSHIRANI, R. J., SAUNDERS, M., ROSSET, S., ZHU, J. & KNIGHT, K. (2005). Sparsity and smoothness via the fused lasso. J. R. Statist. Soc. B 67, 91–108.
TRENDAFILOV, N. T. & JOLLIFFE, I. T. (2007). DALASS: Variable selection in discriminant analysis via the lasso. Comp. Statist. Data Anal. 51, 3718–36.
WAINWRIGHT, M. J. (2009). Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ_1-constrained quadratic programming (lasso). IEEE Trans. Info. Theory 55, 2183–202.
WITTEN, D. M. & TIBSHIRANI, R. J. (2011). Penalized classification using Fisher's linear discriminant. J. R. Statist. Soc. B 73, 753–72.
WU, M. C., ZHANG, L., WANG, Z., CHRISTIANI, D. C. & LIN, X. (2009). Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics 25, 1145–51.
YUAN, M. & LIN, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B 68, 49–67.
ZHANG, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38, 894–942.
ZHAO, P. & YU, B. (2006). On model selection consistency of lasso. J. Mach. Learn. Res. 7, 2541–63.
ZOU, H. (2006). The adaptive lasso and its oracle properties. J. Am. Statist. Assoc. 101, 1418–29.
ZOU, H. & HASTIE, T. J. (2005). Regularization and variable selection via the elastic net. J. R. Statist. Soc. B 67, 301–20.
[Received February 2011. Revised September 2011]