This post describes how to use stratifiedNorm()
to create a stratified random sample, given an arbitrary number of factors. The function is available via source_gist()
:
library('devtools')
source_gist("https://gist.github.com/mattsigal/c17650d8a9b0f5b018af")
I will create a small dataset to demonstrate how to use the function:
set.seed(77)
dat <- data.frame(Gender=sample(c("Male", "Female"), size = 1500, replace = TRUE),
AgeGrp=sample(c("18-39", "40-49", "50+"), size = 1500, replace = TRUE),
Relationship=sample(c("Direct", "Manager", "Coworker", "Friend"),
size = 1500, replace = TRUE),
X=rnorm(n=1500, mean=0, sd=1),
Y=rnorm(n=1500, mean=0, sd=1),
Z=rnorm(n=1500, mean=0, sd=1))
str(dat)
## 'data.frame': 1500 obs. of 6 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 1 1 1 2 1 1 1 2 ...
## $ AgeGrp : Factor w/ 3 levels "18-39","40-49",..: 2 2 1 2 1 2 1 3 2 1 ...
## $ Relationship: Factor w/ 4 levels "Coworker","Direct",..: 3 3 3 4 2 2 1 4 4 2 ...
## $ X : num -1.478 0.328 0.149 -0.241 -0.759 ...
## $ Y : num 0.291 -0.403 -0.557 1.615 2.105 ...
## $ Z : num -0.4025 1.1505 -0.0306 0.3641 -1.0688 ...
stratifiedNorm()
has 6 inputs:
stratifiedNorm(dat, strata, observations=0, return.grid=FALSE, full.data=FALSE, full.data.id="sampled")
dat
: a data.frame object.strata
: a character vector indicating the strata variables. These need to match the variable names in the dataset.observations
: a numeric vector indicating how many cases to sample from each strata. If the length of this vector is 1, it will be repeated for each strata group (e.g., enter 5 to sample 5 cases from each combination.)return.grid
: logical, if TRUE will return the strata contingeny table.full.data
: logical, if TRUE will return the full dataset, otherwise will only return the sampled data.full.data.id
: used if full.data = TRUE, indicates the name of the vector added to the data.frame to indicate the observation was sampled.
Using stratifiedNorm()
First, we create our strata variable. For this dataset, the relevant factors are: Gender, AgeGroup, and Relationship. Note: the input order will affect the ordering of the contingency table!
strata = c("Gender", "AgeGrp", "Relationship")
Next, let’s investigate the ordering of the variables:
head(stratifiedNorm(dat, strata, return.grid = TRUE), n = 14)
## Gender AgeGrp Relationship Observations
## 1 Female 18-39 Coworker 0
## 2 Male 18-39 Coworker 0
## 3 Female 40-49 Coworker 0
## 4 Male 40-49 Coworker 0
## 5 Female 50+ Coworker 0
## 6 Male 50+ Coworker 0
## 7 Female 18-39 Direct 0
## 8 Male 18-39 Direct 0
## 9 Female 40-49 Direct 0
## 10 Male 40-49 Direct 0
## 11 Female 50+ Direct 0
## 12 Male 50+ Direct 0
## 13 Female 18-39 Friend 0
## 14 Male 18-39 Friend 0
When Relationship is entered last, it actually is ordered first (e.g., the first 6 rows of the contingency table refer to Relationship - Direct). Of course, the factors can be entered in a different order.
Now that we know the order the variables are entered in, we can define our observations vector, or how many people we want from each combination.
samples <- c(36,34,72,58,47,38,18,18,15,22,17,10,24,28,11,27,15,25,72,70,52,43,21,27)
If samples is a scalar, it will be recycled for the entire vector, otherwise it should be the same length as the number of rows in the contingency table. If it is longer or shorter, stratifiedNorm()
will return an error. I recommend running this once with return.grid = TRUE
to double check that the observations were entered correctly.
head(stratifiedNorm(dat = dat, strata = strata,
observations = samples, return.grid = TRUE), n = 14)
## Gender AgeGrp Relationship Observations
## 1 Female 18-39 Coworker 36
## 2 Male 18-39 Coworker 34
## 3 Female 40-49 Coworker 72
## 4 Male 40-49 Coworker 58
## 5 Female 50+ Coworker 47
## 6 Male 50+ Coworker 38
## 7 Female 18-39 Direct 18
## 8 Male 18-39 Direct 18
## 9 Female 40-49 Direct 15
## 10 Male 40-49 Direct 22
## 11 Female 50+ Direct 17
## 12 Male 50+ Direct 10
## 13 Female 18-39 Friend 24
## 14 Male 18-39 Friend 28
When we actually sample the data, we can have either the subset returned or the full dataset. Some warnings will be printed if there are less or equal numbers of counts per combination than there are observations in a particular category.
subset.data <- stratifiedNorm(dat, strata, samples, full.data = FALSE)
## Combination for (Female|18-39|Manager) has LESS than count. Returning all observations.
## Combination for (Male|18-39|Manager) has LESS than count. Returning all observations.
full.data <- stratifiedNorm(dat, strata, samples, full.data = TRUE)
## Combination for (Female|18-39|Manager) has LESS than count. Returning all observations.
## Combination for (Male|18-39|Manager) has LESS than count. Returning all observations.
str(subset.data)
## 'data.frame': 775 obs. of 6 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
## $ AgeGrp : Factor w/ 3 levels "18-39","40-49",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Relationship: Factor w/ 4 levels "Coworker","Direct",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ X : num -0.00565 -0.20064 0.97883 0.57349 -0.70991 ...
## $ Y : num 0.457 0.104 -0.388 1.542 1.114 ...
## $ Z : num -0.0121 -0.5163 0.863 1.2574 0.1687 ...
str(full.data)
## 'data.frame': 1500 obs. of 7 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 1 1 1 2 1 1 1 2 ...
## $ AgeGrp : Factor w/ 3 levels "18-39","40-49",..: 2 2 1 2 1 2 1 3 2 1 ...
## $ Relationship: Factor w/ 4 levels "Coworker","Direct",..: 3 3 3 4 2 2 1 4 4 2 ...
## $ X : num -1.478 0.328 0.149 -0.241 -0.759 ...
## $ Y : num 0.291 -0.403 -0.557 1.615 2.105 ...
## $ Z : num -0.4025 1.1505 -0.0306 0.3641 -1.0688 ...
## $ sampled : logi FALSE FALSE TRUE TRUE TRUE TRUE ...
The return with full.data has an additional logical vector called sampled
, which indicates cases that were selected. We can check the cases using contingency tables:
ftable(xtabs(~Gender + AgeGrp + Relationship, data = subset.data))
## Relationship Coworker Direct Friend Manager
## Gender AgeGrp
## Female 18-39 36 18 24 54
## 40-49 72 15 11 52
## 50+ 47 17 15 21
## Male 18-39 34 18 28 63
## 40-49 58 22 27 43
## 50+ 38 10 25 27
Note, if you want the sample to be reproducible, you should include a set.seed() command first! Compare:
full.data1 <- stratifiedNorm(dat, strata, samples, full.data = TRUE)
full.data2 <- stratifiedNorm(dat, strata, samples, full.data = TRUE)
identical(full.data1, full.data2)
## [1] FALSE
set.seed(77)
full.data1 <- stratifiedNorm(dat, strata, samples, full.data = TRUE)
set.seed(77)
full.data2 <- stratifiedNorm(dat, strata, samples, full.data = TRUE)
identical(full.data1, full.data2)
## [1] TRUE