Stratified Norms

2014-12-10 Data Management R

This post describes how to use stratifiedNorm() to create a stratified random sample, given an arbitrary number of factors. The function is available via source_gist():

library('devtools')
source_gist("https://gist.github.com/mattsigal/c17650d8a9b0f5b018af")

I will create a small dataset to demonstrate how to use the function:

set.seed(77)
dat <- data.frame(Gender=sample(c("Male", "Female"), size = 1500, replace = TRUE),
                  AgeGrp=sample(c("18-39", "40-49", "50+"), size = 1500, replace = TRUE),
                  Relationship=sample(c("Direct", "Manager", "Coworker", "Friend"), 
                                      size = 1500, replace = TRUE),
                  X=rnorm(n=1500, mean=0, sd=1),
                  Y=rnorm(n=1500, mean=0, sd=1),
                  Z=rnorm(n=1500, mean=0, sd=1),
                  stringsAsFactors = TRUE)  # needed in R >= 4.0 to get factors as shown below
str(dat)
## 'data.frame':    1500 obs. of  6 variables:
##  $ Gender      : Factor w/ 2 levels "Female","Male": 2 1 1 1 1 2 1 1 1 2 ...
##  $ AgeGrp      : Factor w/ 3 levels "18-39","40-49",..: 2 2 1 2 1 2 1 3 2 1 ...
##  $ Relationship: Factor w/ 4 levels "Coworker","Direct",..: 3 3 3 4 2 2 1 4 4 2 ...
##  $ X           : num  -1.478 0.328 0.149 -0.241 -0.759 ...
##  $ Y           : num  0.291 -0.403 -0.557 1.615 2.105 ...
##  $ Z           : num  -0.4025 1.1505 -0.0306 0.3641 -1.0688 ...

stratifiedNorm() has 6 inputs:

stratifiedNorm(dat, strata, observations=0, return.grid=FALSE, full.data=FALSE, full.data.id="sampled")

dat: a data.frame object.
strata: a character vector indicating the strata variables. These need to match the variable names in the dataset.
observations: a numeric vector indicating how many cases to sample from each stratum. If the length of this vector is 1, it will be repeated for every combination of strata (e.g., enter 5 to sample 5 cases from each combination).
return.grid: logical, if TRUE will return the strata contingency table.
full.data: logical, if TRUE will return the full dataset, otherwise will only return the sampled data.
full.data.id: used if full.data = TRUE, indicates the name of the vector added to the data.frame to indicate the observation was sampled.

Using `stratifiedNorm()`

First, we create our strata variable. For this dataset, the relevant factors are: Gender, AgeGrp, and Relationship. Note: the input order will affect the ordering of the contingency table!

strata = c("Gender", "AgeGrp", "Relationship")

Next, let’s investigate the ordering of the variables:

head(stratifiedNorm(dat, strata, return.grid = TRUE), n = 14)
##    Gender AgeGrp Relationship Observations
## 1  Female  18-39     Coworker            0
## 2    Male  18-39     Coworker            0
## 3  Female  40-49     Coworker            0
## 4    Male  40-49     Coworker            0
## 5  Female    50+     Coworker            0
## 6    Male    50+     Coworker            0
## 7  Female  18-39       Direct            0
## 8    Male  18-39       Direct            0
## 9  Female  40-49       Direct            0
## 10   Male  40-49       Direct            0
## 11 Female    50+       Direct            0
## 12   Male    50+       Direct            0
## 13 Female  18-39       Friend            0
## 14   Male  18-39       Friend            0

Because Relationship is entered last, it varies slowest and so defines the outermost grouping: the first 6 rows of the contingency table all refer to Relationship - Coworker, while Gender, entered first, alternates on every row. Of course, the factors can be entered in a different order.
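The ordering in the table above matches what base R's expand.grid() produces, where the first variable varies fastest. The sketch below reproduces it without the gist; whether stratifiedNorm() uses expand.grid() internally is an assumption, and the names lv, grid and grid_rev are only for illustration.

```r
# Levels as they appear in the example data (factor levels sort alphabetically).
lv <- list(
  Gender       = c("Female", "Male"),
  AgeGrp       = c("18-39", "40-49", "50+"),
  Relationship = c("Coworker", "Direct", "Friend", "Manager")
)

# expand.grid() varies its first variable fastest and its last variable slowest.
grid <- expand.grid(lv, stringsAsFactors = FALSE)
head(grid, n = 7)

# Reversing the input order makes Gender the slowest-varying column instead.
grid_rev <- expand.grid(lv[c("Relationship", "AgeGrp", "Gender")],
                        stringsAsFactors = FALSE)
head(grid_rev, n = 5)
```

Printing both grids side by side is a quick way to see which position in the observations vector belongs to which combination before committing to an input order.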

Now that we know the order the variables are entered in, we can define our observations vector, or how many people we want from each combination.

samples <- c(36,34,72,58,47,38,18,18,15,22,17,10,24,28,11,27,15,25,72,70,52,43,21,27)

If samples is a scalar, it will be recycled for the entire vector, otherwise it should be the same length as the number of rows in the contingency table. If it is longer or shorter, stratifiedNorm() will return an error. I recommend running this once with return.grid = TRUE to double check that the observations were entered correctly.
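The length rule can also be checked by hand before calling the function. This is a minimal sketch of that rule as described here, not the gist's own code; expand_observations() and n_cells are hypothetical names.

```r
samples <- c(36,34,72,58,47,38,18,18,15,22,17,10,24,28,11,27,15,25,72,70,52,43,21,27)

# Number of strata combinations: 2 genders x 3 age groups x 4 relationships.
n_cells <- 2 * 3 * 4

# Hypothetical helper: repeat a scalar, otherwise insist on an exact length match.
expand_observations <- function(observations, n_cells) {
  if (length(observations) == 1L) return(rep(observations, n_cells))
  if (length(observations) != n_cells) {
    stop("observations has length ", length(observations),
         " but the grid has ", n_cells, " rows")
  }
  observations
}

length(expand_observations(samples, n_cells))  # 24, one entry per grid row
expand_observations(5, n_cells)                # 5 repeated 24 times
sum(samples)                                   # 800 cases requested in total
```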

head(stratifiedNorm(dat = dat, strata = strata,
                    observations = samples, return.grid = TRUE), n = 14)
##    Gender AgeGrp Relationship Observations
## 1  Female  18-39     Coworker           36
## 2    Male  18-39     Coworker           34
## 3  Female  40-49     Coworker           72
## 4    Male  40-49     Coworker           58
## 5  Female    50+     Coworker           47
## 6    Male    50+     Coworker           38
## 7  Female  18-39       Direct           18
## 8    Male  18-39       Direct           18
## 9  Female  40-49       Direct           15
## 10   Male  40-49       Direct           22
## 11 Female    50+       Direct           17
## 12   Male    50+       Direct           10
## 13 Female  18-39       Friend           24
## 14   Male  18-39       Friend           28

When we actually sample the data, we can have either the sampled subset or the full dataset returned. A message is printed for each combination that does not contain more cases than were requested for it; in that case all of its cases are returned.

subset.data <- stratifiedNorm(dat, strata, samples, full.data = FALSE)
## Combination for (Female|18-39|Manager) has LESS than count. Returning all observations.
## Combination for (Male|18-39|Manager) has LESS than count. Returning all observations.

full.data <- stratifiedNorm(dat, strata, samples, full.data = TRUE)
## Combination for (Female|18-39|Manager) has LESS than count. Returning all observations.
## Combination for (Male|18-39|Manager) has LESS than count. Returning all observations.
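The "returning all observations" behaviour can be illustrated with a small base R sketch. It is not the gist's implementation: sample_capped() and the toy data are made up for illustration, and the sketch only shows the idea of capping each stratum's sample at the number of cases available.

```r
# Hypothetical helper: draw up to n row indices from one stratum.
sample_capped <- function(idx, n) {
  if (length(idx) < n) {
    message("Stratum has fewer than ", n, " cases; returning all ", length(idx), ".")
  }
  k <- min(length(idx), n)
  # Sample positions rather than values, so a single-row stratum is handled correctly.
  idx[sample.int(length(idx), k)]
}

set.seed(1)
toy  <- data.frame(g = factor(rep(c("a", "b"), times = c(10, 3))), x = rnorm(13))
want <- c(a = 4, b = 5)  # stratum "b" only has 3 cases

# Row indices per stratum, then a capped sample within each one.
rows_by_stratum <- split(seq_len(nrow(toy)), toy$g)
picked <- unlist(Map(sample_capped, rows_by_stratum, want[names(rows_by_stratum)]),
                 use.names = FALSE)

# Flag the selected rows, similar in spirit to the full.data = TRUE output.
toy$sampled <- seq_len(nrow(toy)) %in% picked
table(toy$g, toy$sampled)
```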

str(subset.data)
## 'data.frame':    775 obs. of  6 variables:
##  $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ AgeGrp      : Factor w/ 3 levels "18-39","40-49",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Relationship: Factor w/ 4 levels "Coworker","Direct",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ X           : num  -0.00565 -0.20064 0.97883 0.57349 -0.70991 ...
##  $ Y           : num  0.457 0.104 -0.388 1.542 1.114 ...
##  $ Z           : num  -0.0121 -0.5163 0.863 1.2574 0.1687 ...

str(full.data)
## 'data.frame':    1500 obs. of  7 variables:
##  $ Gender      : Factor w/ 2 levels "Female","Male": 2 1 1 1 1 2 1 1 1 2 ...
##  $ AgeGrp      : Factor w/ 3 levels "18-39","40-49",..: 2 2 1 2 1 2 1 3 2 1 ...
##  $ Relationship: Factor w/ 4 levels "Coworker","Direct",..: 3 3 3 4 2 2 1 4 4 2 ...
##  $ X           : num  -1.478 0.328 0.149 -0.241 -0.759 ...
##  $ Y           : num  0.291 -0.403 -0.557 1.615 2.105 ...
##  $ Z           : num  -0.4025 1.1505 -0.0306 0.3641 -1.0688 ...
##  $ sampled     : logi  FALSE FALSE TRUE TRUE TRUE TRUE ...

With full.data = TRUE, the returned data frame has an additional logical vector called sampled, which indicates the cases that were selected. We can check the selected cases using contingency tables:

ftable(xtabs(~Gender + AgeGrp + Relationship, data = subset.data))
##               Relationship Coworker Direct Friend Manager
## Gender AgeGrp                                            
## Female 18-39                     36     18     24      54
##        40-49                     72     15     11      52
##        50+                       47     17     15      21
## Male   18-39                     34     18     28      63
##        40-49                     58     22     27      43
##        50+                       38     10     25      27

Note, if you want the sample to be reproducible, you should include a set.seed() command first! Compare:

full.data1 <- stratifiedNorm(dat, strata, samples, full.data = TRUE)
full.data2 <- stratifiedNorm(dat, strata, samples, full.data = TRUE)
identical(full.data1, full.data2)
## [1] FALSE

set.seed(77)
full.data1 <- stratifiedNorm(dat, strata, samples, full.data = TRUE)
set.seed(77)
full.data2 <- stratifiedNorm(dat, strata, samples, full.data = TRUE)
identical(full.data1, full.data2)
## [1] TRUE
