3 STRATIFIED SIMPLE RANDOM SAMPLING

• Suppose the population is partitioned into disjoint sets of sampling units called strata. If a sample is selected within each stratum, then this sampling procedure is known as stratified sampling.

• If we can assume the strata are sampled independently, then

(i) the estimator of t or \(\bar{y}_U\) can be found by combining stratum sample sums or means using appropriate weights;

(ii) the variances of the estimators associated with the individual strata can be summed to obtain the variance of an estimator associated with the whole population. (Given independence, the variance of a sum equals the sum of the individual variances.)

• (ii) implies that only within-stratum variances contribute to the variance of an estimator. Thus, the basic motivating principle behind using stratification to produce an estimator with small variance is to partition the population so that units within each stratum are as similar as possible. This is known as the stratification principle.

• In ecological studies, it is common to stratify a geographical region into subregions that are

similar with respect to a known variable such as elevation, animal habitat type, vegetation

types, etc. because it is suspected that the y-values may vary greatly across strata while

they will tend to be similar within each stratum. Analogously, when sampling people, it

is common to stratify on variables such as gender, age groups, income levels, education

levels, marital status, etc.

• Sometimes strata are formed based on sampling convenience. For example, suppose a large study region appears to be homogeneous (that is, there are no spatial patterns) and is stratified based on the geographical proximity of sampling units. Taking a stratified sample ensures the sample is spread throughout the study region. It may not, however, lead to any significant reduction in the variance of an estimator.

• But, if the y-values are spatially correlated (y-values tend to be similar for neighboring units), geographically determined strata can improve estimation of population parameters.

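Notation:

H = the number of strata

\(N_h\) = number of population units in stratum h, h = 1, 2, ..., H

\(N = \sum_{h=1}^{H} N_h\) = the number of units in the population

\(n_h\) = number of sampled units in stratum h, h = 1, 2, ..., H

\(n = \sum_{h=1}^{H} n_h\) = the total number of units sampled

\(y_{hj}\) = the y-value associated with unit j in stratum h

\(\bar{y}_h\) = the sample mean for stratum h

\(t_h = \sum_{j=1}^{N_h} y_{hj}\) = stratum h total

\(t = \sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj} = \sum_{h=1}^{H} t_h\) = the population total

\(\bar{y}_{hU} = t_h / N_h\) = stratum h mean

\(\bar{y}_U = \frac{1}{N} \sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj}\) = the population mean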

• If a simple random sample (SRS) is taken within each stratum, then the sampling design is called stratified simple random sampling.

• For stratum h, there are \(\binom{N_h}{n_h}\) possible SRSs of size \(n_h\). Therefore, there are

\[ \binom{N_1}{n_1} \binom{N_2}{n_2} \cdots \binom{N_H}{n_H} \]

possible stratified SRSs for specified stratum sample sizes \(n_1, \ldots, n_H\).

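• If \(S_{strat}\) is a stratified SRS, then the probability of selecting \(S_{strat}\) is

\[ P(S_{strat}) = \prod_{h=1}^{H} \frac{1}{\binom{N_h}{n_h}} = \frac{1}{\binom{N_1}{n_1} \binom{N_2}{n_2} \cdots \binom{N_H}{n_H}} \]

• Thus, every possible stratified SRS having stratum sample sizes \(n_1, \ldots, n_H\) has the same probability of being selected.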
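The counting argument above can be checked numerically. The following is a minimal sketch in base R for a small hypothetical design (two strata with sizes 4 and 3, which are not from these notes); the variable names are illustrative only.

```r
# Hypothetical toy design (not from the notes): two strata
N_h <- c(4, 3)   # stratum sizes
n_h <- c(2, 1)   # stratum sample sizes

# Number of possible SRSs within each stratum: choose(N_h, n_h)
srs_per_stratum <- choose(N_h, n_h)

# Number of possible stratified SRSs is the product over strata
n_strat_samples <- prod(srs_per_stratum)

# Every stratified SRS has the same selection probability
p_each <- 1 / n_strat_samples

# Brute-force check: enumerate the SRSs in each stratum and pair them up
s1 <- combn(N_h[1], n_h[1])   # each column is one SRS from stratum 1
s2 <- combn(N_h[2], n_h[2])   # each column is one SRS from stratum 2
all_pairs <- expand.grid(stratum1 = seq_len(ncol(s1)),
                         stratum2 = seq_len(ncol(s2)))
stopifnot(nrow(all_pairs) == n_strat_samples)

n_strat_samples
p_each
```

The same two lines (`choose` followed by `prod`) apply to any set of stratum sizes, although the count grows very quickly with realistic values of \(N_h\) and \(n_h\).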

3.1 Estimation of \(\bar{y}_U\) and t

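• Because a SRS was taken within each stratum, we can apply the estimator formulas for simple random sampling to each stratum. We can estimate each stratum population mean \(\bar{y}_{hU}\) and each stratum population total \(t_h\). The formulas are:

\[ \hat{\bar{y}}_{hU} = \bar{y}_h = \frac{1}{n_h} \sum_{j=1}^{n_h} y_{hj} \qquad \hat{t}_h = N_h \bar{y}_h = \frac{N_h}{n_h} \sum_{j=1}^{n_h} y_{hj} \tag{24} \]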

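• Because each \(\hat{t}_h\) is an unbiased estimator of the stratum total \(t_h\) for h = 1, 2, ..., H, their sum will be an unbiased estimator of the population total t. That is,

\[ \hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h \]

is an unbiased estimator of t. An unbiased estimator of \(\bar{y}_U\) is a weighted average of the stratum sample means

\[ \hat{\bar{y}}_{str} = \frac{1}{N} \sum_{h=1}^{H} N_h \bar{y}_h \]

or, equivalently,

\[ \hat{\bar{y}}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \]

where \(N_h/N\) is the weighting factor for stratum h.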

• Before we can study \(V(\hat{\bar{y}}_{str})\) and \(V(\hat{t}_{str})\), we need to look at the within-stratum variances.

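• Because a SRS is taken within stratum h, we can apply the results for simple random sampling estimators to each stratum. For stratum h, the variances of the SRS estimators of the mean and total are:

\[ V(\hat{\bar{y}}_{hU}) = \left( \frac{N_h - n_h}{N_h} \right) \frac{S_h^2}{n_h} \qquad V(\hat{t}_h) = N_h (N_h - n_h) \frac{S_h^2}{n_h} \tag{25} \]

where \(S_h^2 = \frac{1}{N_h - 1} \sum_{j=1}^{N_h} (y_{hj} - \bar{y}_{hU})^2\) is the finite population variance for stratum h.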

• Because the simple random samples are independent across the strata, the variance of \(\hat{t}_{str}\) is the sum of the individual stratum variances:

\[ V(\hat{t}_{str}) = \sum_{h=1}^{H} V(\hat{t}_h) = \sum_{h=1}^{H} N_h (N_h - n_h) \frac{S_h^2}{n_h} \tag{26} \]

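• Dividing by \(N^2\) gives \(V(\hat{\bar{y}}_{str})\):

\[ V(\hat{\bar{y}}_{str}) = \frac{1}{N^2} V(\hat{t}_{str}) = \frac{1}{N^2} \sum_{h=1}^{H} N_h (N_h - n_h) \frac{S_h^2}{n_h} \tag{27} \]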

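• Because \(S_h^2\) is unknown, we use \(s_h^2\) to get an unbiased estimator of \(V(\hat{t}_h)\):

\[ \hat{V}(\hat{t}_h) = N_h (N_h - n_h) \frac{s_h^2}{n_h} \tag{28} \]

where \(s_h^2\) is the sample variance of the \(n_h\) y-values sampled from stratum h.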

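• Substitution of (28) into (26) and (27) produces the estimated variances of the stratified SRS estimators:

\[ \hat{V}(\hat{t}_{str}) = \sum_{h=1}^{H} N_h (N_h - n_h) \frac{s_h^2}{n_h} \qquad \hat{V}(\hat{\bar{y}}_{str}) = \frac{1}{N^2} \sum_{h=1}^{H} N_h (N_h - n_h) \frac{s_h^2}{n_h} \tag{29} \]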

• Taking the square root of \(\hat{V}(\hat{t}_{str})\) or \(\hat{V}(\hat{\bar{y}}_{str})\) yields the corresponding standard error. This will be used when generating confidence intervals for t or \(\bar{y}_U\).

• For the estimated variances of the estimators given in (29), we are assuming that all \(n_h > 1\) (because \(s_h^2\) is undefined for \(n_h = 1\)). Cochran (1977, pages 138-140) discusses two potential methods of dealing with the extreme case where all \(n_h = 1\).
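The estimators in (24) and the estimated variances in (29) need only the stratum sizes, sample sizes, sample means, and sample variances. The sketch below is one possible base R implementation; the function name `strat_srs_estimates` is hypothetical, and the input values are the Figure 5a summary statistics reported in the example in this section.

```r
# Summary statistics for the four-stratum design of Figure 5a
N_h    <- c(100, 100, 100, 100)       # stratum sizes
n_h    <- c(5, 5, 5, 5)               # stratum sample sizes
ybar_h <- c(24.8, 31.6, 34.4, 44.6)   # stratum sample means
s2_h   <- c(21.7, 13.3, 45.3, 41.3)   # stratum sample variances

# Hypothetical helper: stratified SRS estimates and standard errors
strat_srs_estimates <- function(N_h, n_h, ybar_h, s2_h) {
  stopifnot(all(n_h > 1))                        # s2_h is undefined when n_h = 1
  N     <- sum(N_h)
  t_hat <- sum(N_h * ybar_h)                     # estimated total
  v_t   <- sum(N_h * (N_h - n_h) * s2_h / n_h)   # estimated variance of total, (29)
  list(t_hat    = t_hat,
       se_t     = sqrt(v_t),
       ybar_hat = t_hat / N,                     # estimated mean
       se_ybar  = sqrt(v_t) / N)                 # sqrt of v_t / N^2
}

est <- strat_srs_estimates(N_h, n_h, ybar_h, s2_h)
est
```

For these inputs the estimated total and its standard error agree with the software output shown in Section 3.2 (13540 and about 480.67).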

Stratification Example with Strong Spatial Correlation

• Abundance counts for the population in Figures 5a and 5b show a strong diagonal spatial correlation. The region has been gridded into a 20 × 20 grid of 10 m × 10 m quadrats. The total abundance is t = 13354. This population was stratified in two different ways:

(i) Into the four 10 × 10 strata shown in Figure 5a.

Stratum sizes are \(N_h = 100\) and stratum sample sizes are \(n_h = 5\) for h = 1, 2, 3, 4.

Stratum sample totals \(\sum_{j=1}^{n_h} y_{hj}\) are 124, 158, 172, and 223 for h = 1, 2, 3, 4.

Stratum sample means \(\bar{y}_h\) are 24.8, 31.6, 34.4, and 44.6 for h = 1, 2, 3, 4.

Stratum sample variances are \(s_1^2 = 21.7\), \(s_2^2 = 13.3\), \(s_3^2 = 45.3\), and \(s_4^2 = 41.3\).

(ii) Into seven unequal-size, diagonally oriented strata shown in Figure 5b.

Stratum sizes are \(N_1 = N_7 = 45\), \(N_2 = N_6 = 60\), \(N_3 = N_5 = 66\), and \(N_4 = 58\).

Stratum sample sizes are \(n_1 = n_7 = 3\), \(n_2 = n_3 = n_5 = n_6 = 5\), and \(n_4 = 4\).

Stratum sample totals \(\sum_{j=1}^{n_h} y_{hj}\) are 65, 122, 153, 143, 178, 203, and 143 for h = 1, 2, 3, 4, 5, 6, 7, respectively.

Stratum sample means \(\bar{y}_h\) are 21.67, 24.4, 30.6, 35.75, 35.6, 40.6, and 47.67 for h = 1, 2, 3, 4, 5, 6, 7, respectively.

Stratum sample variances are \(s_1^2 = 10.33\), \(s_2^2 = 14.8\), \(s_3^2 = 19.3\), \(s_4^2 = 4.25\), \(s_5^2 = 8.3\), \(s_6^2 = 10.8\), and \(s_7^2 = 26.33\), respectively.

• For the stratified SRSs in Figure 5a and Figure 5b:

– Calculate \(\hat{\bar{y}}_{str}\), \(\hat{t}_{str}\), and their standard errors.

– Calculate 95% confidence intervals for t and \(\bar{y}_U\).

3.1.1 Confidence Intervals for \(\bar{y}_U\) and t

• If all of the stratum sample sizes \(n_h\) are sufficiently large (Thompson suggests \(n_h \geq 30\)), approximate 100(1 − α)% confidence intervals for \(\bar{y}_U\) and t are

\[ \hat{\bar{y}}_{str} \pm z^* \sqrt{\hat{V}(\hat{\bar{y}}_{str})} \qquad \hat{t}_{str} \pm z^* \sqrt{\hat{V}(\hat{t}_{str})} \tag{30} \]

where \(z^*\) is the upper α/2 critical value from the standard normal distribution.

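• For smaller sample sizes, the following confidence intervals have been recommended:

\[ \hat{\bar{y}}_{str} \pm t^* \sqrt{\hat{V}(\hat{\bar{y}}_{str})} \qquad \hat{t}_{str} \pm t^* \sqrt{\hat{V}(\hat{t}_{str})} \tag{31} \]

where \(t^*\) is the upper α/2 critical value from the t(d) distribution. In this case, d is Satterthwaite's (1946) approximate degrees of freedom:

\[ d = \frac{\left( \sum_{h=1}^{H} a_h s_h^2 \right)^2}{\sum_{h=1}^{H} (a_h s_h^2)^2 / (n_h - 1)} = \frac{\left( \hat{V}(\hat{t}_{str}) \right)^2}{\sum_{h=1}^{H} (a_h s_h^2)^2 / (n_h - 1)} \tag{32} \]

where \(a_h = N_h (N_h - n_h) / n_h\).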

• Lohr (page 79) mentions that some software packages will use n − H degrees of freedom

(instead of the approximate degrees of freedom). Both R and SAS use n−H as the default

degrees of freedom.

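• If the stratum sample sizes \(n_h\) are all equal, the stratum sizes \(N_h\) are all equal, and the stratum sample variances \(s_h^2\) are also equal, then the degrees of freedom in (32) reduce to d = n − H, where \(n = \sum_{h=1}^{H} n_h\) is the total sample size. With equal \(N_h\) and \(n_h\) but unequal \(s_h^2\), (32) gives d < n − H.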

• One-sided confidence intervals can be generated just like those for a SRS: take \(t^*\) to be the upper α critical value from the t(d) distribution.
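To make the two degrees-of-freedom choices concrete, here is a minimal base R sketch that computes both n − H and the Satterthwaite value in (32), and then the two-sided 95% intervals in (31) for the total. It uses the Figure 5a summary statistics from the example above; the variable names are illustrative.

```r
# Figure 5a summary statistics
N_h    <- rep(100, 4)
n_h    <- rep(5, 4)
ybar_h <- c(24.8, 31.6, 34.4, 44.6)
s2_h   <- c(21.7, 13.3, 45.3, 41.3)

a_h   <- N_h * (N_h - n_h) / n_h    # a_h as defined below (32)
t_hat <- sum(N_h * ybar_h)          # estimated total
v_hat <- sum(a_h * s2_h)            # estimated variance of the total, (29)
se_t  <- sqrt(v_hat)

# Two choices of degrees of freedom
df_simple <- sum(n_h) - length(n_h)                     # n - H
df_satt   <- v_hat^2 / sum((a_h * s2_h)^2 / (n_h - 1))  # Satterthwaite, (32)

# Two-sided 95% confidence intervals for t, as in (31)
alpha     <- 0.05
ci_simple <- t_hat + c(-1, 1) * qt(1 - alpha / 2, df_simple) * se_t
ci_satt   <- t_hat + c(-1, 1) * qt(1 - alpha / 2, df_satt) * se_t

df_simple; df_satt
ci_simple; ci_satt
```

Because `qt` accepts non-integer degrees of freedom, the Satterthwaite value can be used directly. A one-sided bound replaces `1 - alpha / 2` with `1 - alpha` and keeps only one endpoint.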

3.2 Using R and SAS to Analyze a Stratified SRS

Datasets used in the R code

R dataset from Figure 5a      R dataset from Figure 5b
------------------------      ------------------------
count   fpc   stratum         count   fpc   stratum
   25   100         1            18    45         1
   30   100         1            23    45         1
   18   100         1            24    45         1
   28   100         1            23    60         2
   23   100         1            21    60         2
   30   100         2            21    60         2
   26   100         2            28    60         2
   35   100         2            29    60         2
   34   100         2            25    66         3
   33   100         2            32    66         3
   38   100         3            27    66         3
   27   100         3            35    66         3
   30   100         3            34    66         3
   44   100         3            34    58         4
   33   100         3            37    58         4
   36   100         4            38    58         4
   41   100         4            34    58         4
   53   100         4            33    66         5
   47   100         4            38    66         5
   46   100         4            37    66         5
                                 38    66         5
                                 32    66         5
                                 40    60         6
                                 44    60         6
                                 38    60         6
                                 44    60         6
                                 37    60         6
                                 49    45         7
                                 42    45         7
                                 52    45         7

R code for Stratified SRS (Figure 5a)

# confint.t() is not part of the survey package; it is presumably
# defined in the sourced file below
source("c:/courses/st446/rcode/confintt.r")

# t-based confidence intervals for the stratified SRS in Figure 5a
library(survey)

strat5adat <- read.table("c:/courses/st446/rcode/fig5a.txt", header=TRUE)
# strat5adat

# Stratified SRS design: fpc holds N_h and stratum identifies the stratum
strat_design <- svydesign(id=~1, fpc=~fpc, strata=~stratum, data=strat5adat)
strat_design

# Estimated population total with two-sided and one-sided intervals
esttotal <- svytotal(~count,strat_design)
print(esttotal,digits=15)
confint.t(esttotal,degf(strat_design),level=.95)
confint.t(esttotal,degf(strat_design),level=.95,tails='lower')
confint.t(esttotal,degf(strat_design),level=.95,tails='upper')

# Estimated population mean with two-sided and one-sided intervals
estmean <- svymean(~count,strat_design)
print(estmean,digits=15)
confint.t(estmean,degf(strat_design),level=.95)
confint.t(estmean,degf(strat_design),level=.95,tails='lower')
confint.t(estmean,degf(strat_design),level=.95,tails='upper')
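The listing above depends on two local files. As a self-contained alternative, the sketch below enters the Figure 5a data inline and uses the survey package's own `confint()` method with its `df` argument in place of the course helper `confint.t`. It assumes the survey package is installed; the object names `fig5a` and `design5a` are illustrative.

```r
library(survey)

# Figure 5a data entered inline (same values as the R dataset listed above)
fig5a <- data.frame(
  count   = c(25, 30, 18, 28, 23,
              30, 26, 35, 34, 33,
              38, 27, 30, 44, 33,
              36, 41, 53, 47, 46),
  fpc     = 100,                  # N_h for every stratum
  stratum = rep(1:4, each = 5)
)

# Stratified SRS design with finite population correction
design5a <- svydesign(id = ~1, strata = ~stratum, fpc = ~fpc, data = fig5a)

est_total <- svytotal(~count, design5a)
est_mean  <- svymean(~count, design5a)

# Two-sided 95% t-based intervals using n - H degrees of freedom
df_nH <- degf(design5a)
confint(est_total, level = 0.95, df = df_nH)
confint(est_mean,  level = 0.95, df = df_nH)
```

Without the `df` argument, `confint()` on these objects uses normal critical values, so the degrees of freedom are passed explicitly here to obtain t-based intervals.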

R output for Stratified SRS (Figure 5a)

(For the population total)

-------------------------------------------------------------------

mean( count ) = 13540.00000

SE( count ) = 480.66620

Two-Tailed CI for count where alpha = 0.05 with 16 df

2.5 % 97.5 %

12521.03317 14558.96683

-------------------------------------------------------------------

mean( count ) = 13540.00000

SE( count ) = 480.66620

One-Tailed (Lower) CI for count where alpha = 0.05 with 16 df

5 % upper

12700.81272 infinity

-------------------------------------------------------------------

mean( count ) = 13540.00000

SE( count ) = 480.66620

One-Tailed (upper) CI for count where alpha = 0.05 with 16 df

lower 95 %

-infinity 14379.18728

-------------------------------------------------------------------

(For the population mean)

-------------------------------------------------------------------

mean( count ) = 33.85000

SE( count ) = 1.20167

Two-Tailed CI for count where alpha = 0.05 with 16 df

2.5 % 97.5 %

31.30258 36.39742

-------------------------------------------------------------------

mean( count ) = 33.85000

SE( count ) = 1.20167

One-Tailed (Lower) CI for count where alpha = 0.05 with 16 df

5 % upper

31.75203 infinity

-------------------------------------------------------------------

mean( count ) = 33.85000

SE( count ) = 1.20167

One-Tailed (upper) CI for count where alpha = 0.05 with 16 df

lower 95 %

-infinity 35.94797

-------------------------------------------------------------------

R code for Stratified SRS (Figure 5b)

The R code is exactly the same as the R code for the Figure 5a data analysis except that you read in the data file fig5b.txt.

R output for Stratified SRS (Figure 5b)

(For the population total)

-------------------------------------------------------------------

mean( count ) = 13462.70000

SE( count ) = 256.02201

Two-Tailed CI for count where alpha = 0.05 with 23 df

2.5 % 97.5 %

12933.07812 13992.32188

-------------------------------------------------------------------

mean( count ) = 13462.70000

SE( count ) = 256.02201

One-Tailed (Lower) CI for count where alpha = 0.05 with 23 df

5 % upper

13023.91117 infinity

-------------------------------------------------------------------

mean( count ) = 13462.70000

SE( count ) = 256.02201

One-Tailed (upper) CI for count where alpha = 0.05 with 23 df

lower 95 %

-infinity 13901.48883

-------------------------------------------------------------------

(For the population mean)

-------------------------------------------------------------------

mean( count ) = 33.65675

SE( count ) = 0.64006

Two-Tailed CI for count where alpha = 0.05 with 23 df

2.5 % 97.5 %

32.33270 34.98080

-------------------------------------------------------------------

mean( count ) = 33.65675

SE( count ) = 0.64006

One-Tailed (Lower) CI for count where alpha = 0.05 with 23 df

5 % upper

32.55978 infinity

-------------------------------------------------------------------

mean( count ) = 33.65675

SE( count ) = 0.64006

One-Tailed (upper) CI for count where alpha = 0.05 with 23 df

lower 95 %

-infinity 34.75372

-------------------------------------------------------------------

Using Proc Surveymeans in SAS:

• When the stratum unit totals (\(N_h\)) are known, you must create a variable called _total_ that assigns \(N_h\) to each stratum level. It must be called _total_. In the following examples, the stratum variable is called Area.

• You also need to create a weight variable which takes on the value \(N_h/n_h\). In the following examples, the weight variable is called W and it appears in the Weight statement.

• Include the option total=(dataname) in the Proc Surveymeans statement, where (dataname) is the name of the data set. In the first example, the dataname is fig5a. In the second example, the dataname is fig5b.

• Include a Stratum statement that contains the stratum variable.

• In the Var statement, include the response variable y. In these examples, y is Count.

• If you want one-sided confidence intervals for \(\bar{y}_U\) or t, enter lclm or uclm (for \(\bar{y}_U\)) and lclmsum or uclmsum (for t) in the Proc Surveymeans statement. The second example includes all four options.

• The list option in the Stratum statement produces a table containing information about each stratum.

Analysis of the Stratified SRS in Figure 5a

data fig5a;

input Area Count @@;

datalines;

1 18 1 23 1 28 1 25 1 30

2 35 2 30 2 26 2 33 2 34

3 33 3 27 3 30 3 44 3 38

4 47 4 36 4 41 4 53 4 46

;

data fig5a; set fig5a;

if Area = 1 then _total_= 100; *** _total_ = Nh ;

if Area = 2 then _total_= 100;

if Area = 3 then _total_= 100;

if Area = 4 then _total_= 100;

if Area=1 then W = 100/5; *** W = Nh / nh ;

if Area=2 then W = 100/5;

if Area=3 then W = 100/5;

if Area=4 then W = 100/5;

title1 'Analysis of Stratified SRS in Figure 5a';

proc surveymeans data=fig5a total=fig5a mean clm sum clsum df;

Stratum Area / list;

Var Count;

Weight W;

run;

===============================================================

Analysis of Stratified SRS in Figure 5a

The SURVEYMEANS Procedure

Data Summary

Number of Strata 4

Number of Observations 20

Sum of Weights 400

Stratum Information

Stratum Population Sampling

Index Area Total Rate N Obs Variable N

------------------------------------------------------------------------

1 1 100 5.00% 5 Count 5

2 2 100 5.00% 5 Count 5

3 3 100 5.00% 5 Count 5

4 4 100 5.00% 5 Count 5

------------------------------------------------------------------------

Statistics

Std Error

Variable DF Mean of Mean 95% CL for Mean

--------------------------------------------------------------------

Count 16 33.850000 1.201666 31.3025829 36.3974171

--------------------------------------------------------------------

Variable Sum Std Dev 95% CL for Sum

-------------------------------------------------------------------

Count 13540 480.666204 12521.0332 14558.9668

-------------------------------------------------------------------

Analysis of Stratified SRS in Figure 5b

data fig5b;

input Area Count @@;

datalines;

1 18 1 24 1 23

2 28 2 29 2 21 2 21 2 23

3 34 3 27 3 35 3 32 3 25

4 34 4 38 4 37 4 34

5 32 5 38 5 37 5 38 5 33

6 37 6 38 6 44 6 44 6 40

7 42 7 49 7 52

;

data fig5b; set fig5b;

if Area = 1 then _total_= 45; *** _total_ = Nh ;

if Area = 2 then _total_= 60;

if Area = 3 then _total_= 66;

if Area = 4 then _total_= 58;

if Area = 5 then _total_= 66;

if Area = 6 then _total_= 60;

if Area = 7 then _total_= 45;

if Area=1 then W = 45/3; *** W = Nh / nh ;

if Area=2 then W = 60/5;

if Area=3 then W = 66/5;

if Area=4 then W = 58/4;

if Area=5 then W = 66/5;

if Area=6 then W = 60/5;

if Area=7 then W = 45/3;

title1 'Analysis of Stratified SRS in Figure 5b';

proc surveymeans data=fig5b total=fig5b mean clm sum clsum df

lclm uclm lclmsum uclmsum ;

Stratum Area / list;

Var Count;

Weight W;

run;

Analysis of Stratified SRS in Figure 5b

The SURVEYMEANS Procedure

Data Summary

Number of Strata 7

Number of Observations 30

Sum of Weights 400

Stratum Information

Stratum Population Sampling

Index Area Total Rate N Obs Variable N

----------------------------------------------------------------------

1 1 45 6.67% 3 Count 3

2 2 60 8.33% 5 Count 5

3 3 66 7.58% 5 Count 5

4 4 58 6.90% 4 Count 4

5 5 66 7.58% 5 Count 5

6 6 60 8.33% 5 Count 5

7 7 45 6.67% 3 Count 3

----------------------------------------------------------------------

Statistics

Std Error

Variable DF Mean of Mean 95% CL for Mean

-----------------------------------------------------------------------

Count 23 33.656750 0.640055 32.3326953 34.9808047

-----------------------------------------------------------------------

                Lower 95%       Upper 95%
             One-Sided CL    One-Sided CL
Variable         for Mean        for Mean        Sum       Std Dev
-----------------------------------------------------------------------
Count           32.559778       34.753722      13463    256.022011
-----------------------------------------------------------------------

                                       Lower 95%      Upper 95%
                                       One-Sided      One-Sided
Variable        95% CL for Sum        CL for Sum     CL for Sum
-----------------------------------------------------------------
Count      12933.0781  13992.3219          13024          13901
-----------------------------------------------------------------

3.3 Efficiency of Stratified Simple Random Sampling

• Because the variance formulas for \(\hat{\bar{y}}_{str}\) and \(\hat{t}_{str}\) are determined only from within-stratum variances, the precision of the estimators can be improved by forming strata with small \(S_h^2\) values (strata with similar y-values within each stratum). We will compare \(V(\hat{\bar{y}}_U)\) from a SRS to \(V(\hat{\bar{y}}_{str})\) from a stratified SRS.

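• The population variance can be rewritten as the weighted sum of within-stratum and between-stratum variabilities:

\[ S^2 = \frac{1}{N-1} \sum_{h=1}^{H} \sum_{j=1}^{N_h} (y_{hj} - \bar{y}_U)^2 = \frac{1}{N-1} \left[ \sum_{h=1}^{H} (N_h - 1) S_h^2 + \sum_{h=1}^{H} N_h (\bar{y}_{hU} - \bar{y}_U)^2 \right] \]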

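• By substituting this alternative form of \(S^2\) into \(V(\hat{\bar{y}}_U)\) and \(V(\hat{\bar{y}}_{str})\), it can be shown that, when the stratum sample sizes are proportional to the stratum sizes (\(n_h = nN_h/N\)):

\[ V(\hat{\bar{y}}_U) - V(\hat{\bar{y}}_{str}) = \frac{N-n}{Nn(N-1)} \left[ \sum_{h=1}^{H} N_h (\bar{y}_{hU} - \bar{y}_U)^2 - \frac{1}{N} \sum_{h=1}^{H} (N - N_h) S_h^2 \right] \]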

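• If this difference in variances is positive, or, equivalently, if

\[ \sum_{h=1}^{H} N_h (\bar{y}_{hU} - \bar{y}_U)^2 > \frac{1}{N} \sum_{h=1}^{H} (N - N_h) S_h^2 \]

then we say that \(\hat{\bar{y}}_{str}\) is more efficient than \(\hat{\bar{y}}_U\).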

• A stratified SRS estimator will be more efficient than the SRS estimator of \(\bar{y}_U\) or t if the variability between stratum means is sufficiently large relative to the within-stratum variability. This is what happened with the stratification used in Figures 5a and 5b.
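The decomposition of \(S^2\) and the variance comparison can be verified numerically on a small population. The sketch below uses a hypothetical two-stratum population (the values are made up for illustration) and assumes proportional allocation, \(n_h = nN_h/N\), so that the closed-form difference applies.

```r
# Hypothetical population with two strata (values are illustrative only)
pop <- list(c(1, 2, 3, 4), c(10, 11, 12, 13, 14, 15))

N_h     <- lengths(pop)          # stratum sizes: 4 and 6
N       <- sum(N_h)
ybar_hU <- sapply(pop, mean)     # stratum means
S2_h    <- sapply(pop, var)      # stratum variances (divisor N_h - 1)
ybar_U  <- mean(unlist(pop))     # population mean
S2      <- var(unlist(pop))      # population variance (divisor N - 1)

# Within-stratum plus between-stratum decomposition of S^2
between   <- sum(N_h * (ybar_hU - ybar_U)^2)
within    <- sum((N_h - 1) * S2_h)
S2_decomp <- (within + between) / (N - 1)

# Variances of the estimated mean: SRS versus stratified SRS
n     <- 5
n_h   <- n * N_h / N                                    # proportional: 2 and 3
V_srs <- (1 - n / N) * S2 / n
V_str <- sum(N_h * (N_h - n_h) * S2_h / n_h) / N^2      # equation (27)

# Closed-form difference (valid under proportional allocation)
diff_formula <- (N - n) / (N * n * (N - 1)) *
  (between - sum((N - N_h) * S2_h) / N)

c(S2 = S2, S2_decomp = S2_decomp)
c(V_srs = V_srs, V_str = V_str, diff_formula = diff_formula)
```

Here the stratum means differ greatly relative to the within-stratum spread, so the stratified estimator has a much smaller variance than the SRS estimator.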

3.4 Allocation of Sampling Units

• Given that we have enough resources to allocate n units among the H strata, how do we determine the stratum sample sizes \(n_h\)?

• Situation 1: If all strata are the same size and no prior information is available about the population, a reasonable choice would be to assign equal (or nearly equal) sample sizes to the strata. That is, \(n_h \approx n/H\).

– Example: Consider the stratified population in Figure 5a. Suppose there are enough resources to take a sample of size n = 50. How many units should be sampled from each stratum assuming Situation 1?

• Situation 2: If the strata are not all the same size and no prior information is available about the population, a reasonable choice would be to assign sample sizes proportional to the sizes of the strata relative to the population size N. That is, \(n_h \approx nN_h/N\). This is known as proportional allocation.

– Example: Consider the stratified population in Figure 5b. Suppose there are enough resources to take a sample of size n = 50. How many units should be sampled from each stratum assuming proportional allocation?

• Situation 3: The allocation scheme that minimizes \(V(\hat{\bar{y}}_{str})\) for a fixed total sample size n is called optimum allocation and requires

\[ n_h = n \, \frac{N_h S_h}{\sum_{k=1}^{H} N_k S_k} \]

Because the \(S_h\) values are unknown, we would need prior estimates (possibly from past data or published studies) to attempt optimum allocation.

– Example: Consider the stratified population in Figure 5b. Suppose there are enough resources to take a sample of size n = 50 and we have prior estimates of \(s_1 = 3.2\), \(s_2 = 3.8\), \(s_3 = 4.4\), \(s_4 = 2.1\), \(s_5 = 2.9\), \(s_6 = 3.3\), and \(s_7 = 5.1\). How many units should be sampled from each stratum assuming optimum allocation?

• Situation 4: In some cases, if the cost of sampling units varies from stratum to stratum, then the total cost of taking a stratified SRS may determine how to allocate units to strata.

– Let \(c_0\) be the fixed (also called "overhead") cost of the survey that does not depend on which units are in the sample. Let \(c_h\) be the cost to sample a unit from stratum h. The total cost C of the sample will be \(C = c_0 + \sum_{h=1}^{H} c_h n_h\).

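– Case I: For a fixed total cost C, the smallest variance \(V(\hat{\bar{y}}_{str})\) or \(V(\hat{t}_{str})\) is achieved by choosing \(n_h\) such that:

\[ n_h = \frac{(C - c_0) \, N_h S_h / \sqrt{c_h}}{\sum_{k=1}^{H} N_k S_k \sqrt{c_k}} \]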

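– Case II: For a fixed (specified) variance \(V(\hat{\bar{y}}_{str})\), the smallest cost is achieved by first determining the total sample size n such that

\[ n = \frac{\left( \sum_{h=1}^{H} N_h S_h \sqrt{c_h} \right) \left( \sum_{h=1}^{H} N_h S_h / \sqrt{c_h} \right)}{N^2 V + \sum_{h=1}^{H} N_h S_h^2} \]

where V is the fixed variance specified by the researcher. Then, the stratum sample size \(n_h\) for h = 1, 2, ..., H is

\[ n_h = n \, \frac{N_h S_h / \sqrt{c_h}}{\sum_{k=1}^{H} N_k S_k / \sqrt{c_k}} \]

For a fixed \(V(\hat{t}_{str})\), use \(V = V(\hat{t}_{str})/N^2\) in the formula.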

• If all of the costs (\(c_h\)) are the same, then the total sample size formula reduces to

\[ n = \frac{\left( \sum_{h=1}^{H} N_h S_h \right)^2}{N^2 V + \sum_{h=1}^{H} N_h S_h^2} \]

• Because the \(S_h\) values are unknown in either Case I or Case II, we would need prior estimates (possibly from past data or published studies) to attempt optimum allocation.
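The allocation rules in Situations 1 to 3 are simple to compute. The following base R sketch applies them with n = 50, using the Figure 5a and Figure 5b stratum sizes and the prior estimates of \(s_h\) given in the Situation 3 example. The results are left unrounded; any rounding to whole units is a separate choice and should be checked so the total remains n.

```r
n <- 50

# Situation 1: equal allocation for the four equal-size strata of Figure 5a
H_5a    <- 4
n_equal <- rep(n / H_5a, H_5a)

# Figure 5b stratum sizes and prior estimates of the stratum standard deviations
N_h <- c(45, 60, 66, 58, 66, 60, 45)
s_h <- c(3.2, 3.8, 4.4, 2.1, 2.9, 3.3, 5.1)
N   <- sum(N_h)

# Situation 2: proportional allocation, n_h = n * N_h / N
n_prop <- n * N_h / N

# Situation 3: optimum allocation, n_h = n * N_h * S_h / sum(N_k * S_k)
n_opt <- n * N_h * s_h / sum(N_h * s_h)

n_equal
round(cbind(stratum = seq_along(N_h), N_h, n_prop, n_opt), 2)
```

Compared with proportional allocation, optimum allocation shifts sampling effort toward strata with larger prior standard deviations (for example, strata 3 and 7) and away from the more homogeneous stratum 4.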

Example of Situation 4: Case I: Suppose there is a fixed total cost C = $3000 and a fixed overhead cost of \(c_0\) = $500. Consider the stratification used in Figure 5b. The unit sampling costs are

\(c_1\) = $20 per unit from stratum 1

\(c_2 = c_3\) = $25 per unit from stratum 2 or 3

\(c_4\) = $30 per unit from stratum 4

\(c_5 = c_6\) = $35 per unit from stratum 5 or 6

\(c_7\) = $40 per unit from stratum 7

Then, using \(s_h\) as an estimate of \(S_h\):

\[ n_h = \frac{(C - c_0) \, N_h s_h / \sqrt{c_h}}{\sum_{k=1}^{H} N_k s_k \sqrt{c_k}} = \frac{2500 \, N_h s_h / \sqrt{c_h}}{7657.776} \]

                                                                          rounded  projected
Stratum  N_h    s_h   c_h  N_h s_h sqrt(c_h)  N_h s_h/sqrt(c_h)    n_h      n_h       cost
   1      45  3.215    20        647.006             32.350        10.6       11       $220
   2      60  3.847    25       1154.100             46.164        15.1       15       $375
   3      66  4.393    25       1449.690             57.988        18.9       19       $475
   4      58  2.062    30        655.054             21.835         7.1        7       $210
   5      66  2.881    35       1124.919             32.141        10.5       10       $350
   6      60  3.286    35       1166.414             33.326        10.9       11       $385
   7      45  5.132    40       1460.593             36.515        11.9       12       $480

The estimated total cost is $2495 + \(c_0\) = $2495 + $500 = $2995, requiring 85 sampling units.
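The Case I calculation can be reproduced in a few lines of base R. This sketch takes \(s_h\) to be the square roots of the Figure 5b sample variances (entered as exact fractions where the notes show rounded values) and uses the unit costs listed in the example; the variable names are illustrative.

```r
# Figure 5b stratum sizes, sample variances, and unit sampling costs
N_h  <- c(45, 60, 66, 58, 66, 60, 45)
s2_h <- c(31/3, 14.8, 19.3, 4.25, 8.3, 10.8, 79/3)   # 31/3 = 10.33..., 79/3 = 26.33...
s_h  <- sqrt(s2_h)
c_h  <- c(20, 25, 25, 30, 35, 35, 40)

C_total <- 3000   # fixed total cost C
c0      <- 500    # fixed overhead cost c_0

# Case I allocation: n_h = (C - c0) * (N_h s_h / sqrt(c_h)) / sum(N_k s_k sqrt(c_k))
n_h     <- (C_total - c0) * (N_h * s_h / sqrt(c_h)) / sum(N_h * s_h * sqrt(c_h))
n_round <- round(n_h)

projected_cost <- c_h * n_round
total_cost     <- c0 + sum(projected_cost)

round(cbind(N_h, s_h, c_h, n_h, n_round, projected_cost), 3)
c(total_units = sum(n_round), total_cost = total_cost)
```

Because the \(n_h\) are rounded after the allocation, the projected total cost can land slightly below or above C; it is worth recomputing the cost after rounding, as done here.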

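Example of Situation 4: Case II: Suppose there is a fixed variance of \(V = V(\hat{\bar{y}}_{str}) = .30\). Consider the stratification used in Figure 5b. The costs are the same as in Case I.

Then, using \(s_h\) as an estimate of \(S_h\):

\[ n = \frac{\left( \sum_{h=1}^{H} N_h s_h \sqrt{c_h} \right) \left( \sum_{h=1}^{H} N_h s_h / \sqrt{c_h} \right)}{N^2 V + \sum_{h=1}^{H} N_h s_h^2} \approx \frac{(7657.776)(260.319)}{(400^2)(.30) + 5254.1} \approx 37.433 \]

Then, substitution yields

\[ n_h = n \, \frac{N_h s_h / \sqrt{c_h}}{\sum_{k=1}^{H} N_k s_k / \sqrt{c_k}} \approx \frac{(37.433) \, N_h s_h / \sqrt{c_h}}{260.319} \approx 0.1438 \, N_h s_h / \sqrt{c_h} \]

                                                                                      rounded  projected
Stratum  N_h    s_h   c_h  N_h s_h sqrt(c_h)  N_h s_h/sqrt(c_h)  N_h s_h^2    n_h       n_h       cost
   1      45  3.215    20        647.006             32.350          465.0   4.65         5       $100
   2      60  3.847    25       1154.100             46.164          888.0   6.64         7       $175
   3      66  4.393    25       1449.690             57.988         1273.8   8.34         8       $200
   4      58  2.062    30        655.054             21.835          246.5   3.14         3       $ 90
   5      66  2.881    35       1124.919             32.141          547.8   4.62         5       $175
   6      60  3.286    35       1166.414             33.326          648.0   4.79         5       $175
   7      45  5.132    40       1460.593             36.515         1185.0   5.25         5       $200
 Sum                            7657.776            260.319         5254.1

Thus, the minimum cost to achieve V is $1115 + \(c_0\) = $1115 + $500 = $1615, requiring a total of 38 sampling units.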
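The Case II calculation can be sketched in base R in the same way. The tabulated \(n_h\) values are reproduced when V = 0.30 is used for the variance of the estimated mean, so that value is assumed here; \(s_h\) is again the square root of the Figure 5b sample variances, and the variable names are illustrative.

```r
# Figure 5b stratum sizes, sample variances, and unit sampling costs
N_h  <- c(45, 60, 66, 58, 66, 60, 45)
s2_h <- c(31/3, 14.8, 19.3, 4.25, 8.3, 10.8, 79/3)   # 31/3 = 10.33..., 79/3 = 26.33...
s_h  <- sqrt(s2_h)
c_h  <- c(20, 25, 25, 30, 35, 35, 40)
c0   <- 500                  # fixed overhead cost c_0
N    <- sum(N_h)

V <- 0.30                    # assumed target variance for the estimated mean

# Total sample size for Case II
num <- sum(N_h * s_h * sqrt(c_h)) * sum(N_h * s_h / sqrt(c_h))
den <- N^2 * V + sum(N_h * s2_h)
n   <- num / den

# Stratum sample sizes: n_h = n * (N_h s_h / sqrt(c_h)) / sum(N_k s_k / sqrt(c_k))
n_h     <- n * (N_h * s_h / sqrt(c_h)) / sum(N_h * s_h / sqrt(c_h))
n_round <- round(n_h)

total_cost <- c0 + sum(c_h * n_round)

n
round(cbind(N_h, s_h, c_h, n_h, n_round, cost = c_h * n_round), 3)
c(total_units = sum(n_round), total_cost = total_cost)
```

Rounding the \(n_h\) changes the variance actually achieved, so in practice the variance in (27) can be recomputed with the rounded sample sizes to confirm that it is still acceptably close to V.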