Simulating Likert scale data in R
In my last project I had to find theoretical limits for a psychometric index involving Likert scale data (a kind of categorical data). After finding them, I decided to test the results with a simple Monte Carlo simulation.
I was surprised not to find an obvious built-in categorical data generator in R. What I was looking for was something like runif(100), but one that would generate a vector of length 100 where every element is drawn from a multinomial distribution in general, or a categorical distribution in particular.
My first idea was to use the sample function with given probabilities: sample(c(0,1,2),1,prob=c(0.33, 0.33, 0.34)). But, as far as I could see, you couldn’t repeat this procedure for N participants without using loops, which is very inefficient, or you would end up with rep repeating the same random pick N times.
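To make the problem concrete, here is a minimal sketch of both dead ends. The values of N and the seed are arbitrary choices for illustration only.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
N <- 10                          # hypothetical number of participants
probs <- c(0.33, 0.33, 0.34)

# rep() copies ONE random pick N times, so every participant gets the same value
repeated <- rep(sample(c(0, 1, 2), 1, prob = probs), N)

# Loop-style alternative: one independent draw per participant (works, but slow)
looped <- vapply(
  seq_len(N),
  function(i) sample(c(0, 1, 2), 1, prob = probs),
  numeric(1)
)

print(repeated)
print(looped)
```

The first vector contains a single distinct value, which is useless for a simulation; the second is valid but calls sample once per participant.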
I didn’t want to use any third-party libraries just for this small application either, so I came up with this simple trick.
Algorithm
Suppose you want to generate 5-category data (x1, x2, x3, x4, x5) for N participants with probabilities (1/10, 2/10, 4/10, 2/10, 1/10). The following code will work:
distribution <- c(rep(x1,1),rep(x2,2),rep(x3,4),rep(x4,2),rep(x5,1))
potential <- rep(distribution, M)
likert_data <- sample(potential, N)
or, as a one-liner:
likert_data <- sample(rep(c(rep(x1,1),rep(x2,2),rep(x3,4),rep(x4,2),rep(x5,1)), M), N)
Notice that distribution sets the probabilities, potential repeats this M times (where M is any number greater than or equal to N; I personally used M = N), and likert_data (uniformly) randomly picks N elements and returns the required vector. Because sample draws without replacement by default, potential must contain at least N elements, which M ≥ N guarantees.
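The steps above can be sketched as a runnable script. This assumes the categories x1 to x5 are simply the integers 1 to 5 and that N = 1000; those values and the seed are illustrative choices, not part of the original analysis.

```r
set.seed(42)                     # arbitrary seed, only for reproducibility
N <- 1000                        # hypothetical number of participants
M <- N                           # pool multiplier, as in the text
categories <- 1:5                # assumed stand-ins for x1..x5
weights <- c(1, 2, 4, 2, 1)      # probabilities expressed in tenths

# Ten-element vector in which each category appears in proportion to its weight
distribution <- rep(categories, times = weights)

# Pool of 10 * M elements with the same proportions
potential <- rep(distribution, M)

# Draw N elements uniformly, without replacement, from the pool
likert_data <- sample(potential, N)

# Observed relative frequency of each category
observed <- as.vector(table(factor(likert_data, levels = categories))) / N
print(data.frame(category = categories, target = weights / 10, observed = observed))
```

Here rep(categories, times = weights) is just a shorter way of writing the nested rep calls; the result is the same ten-element vector.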
Notice how, in the screenshot above, we obtain almost exactly the probabilities we wanted: (1, 2, 4, 2, 1)/10. Since every run is a random draw, there are some deviations, but repeating the procedure and averaging gives the desired values.
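The averaging claim can be checked with a short sketch like the one below. The number of runs, N, the seed, and the helper name one_run are all hypothetical choices made for this illustration.

```r
set.seed(2024)                   # arbitrary seed, only for reproducibility
N <- 100                         # hypothetical number of participants per run
M <- N
runs <- 500                      # hypothetical number of repetitions
categories <- 1:5                # assumed stand-ins for x1..x5
target <- c(1, 2, 4, 2, 1) / 10

potential <- rep(rep(categories, times = c(1, 2, 4, 2, 1)), M)

# One simulated sample, summarised as the proportion of each category
one_run <- function() {
  draw <- sample(potential, N)
  as.vector(table(factor(draw, levels = categories))) / N
}

# 5 x runs matrix: one column of proportions per repetition
proportions <- replicate(runs, one_run())

# Averaging across repetitions smooths out the run-to-run deviations
average <- rowMeans(proportions)
print(round(rbind(target, average), 3))
```

Individual columns of proportions wander around the target, while the row means should sit much closer to it.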
UPDATE: A StackExchange user suggested a better hack: randomly sampling with replacement. This makes my solution obsolete, but it’s brilliant:
likert_data <- sample(c(x1,x2,x3,x4,x5), N, replace = TRUE, prob=c(1/10, 2/10, 4/10, 2/10, 1/10))
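A runnable version of this call might look as follows, again assuming the categories are the integers 1 to 5 and picking N and the seed arbitrarily for illustration.

```r
set.seed(123)                    # arbitrary seed, only for reproducibility
N <- 1000                        # hypothetical number of participants
categories <- 1:5                # assumed stand-ins for x1..x5
probs <- c(1, 2, 4, 2, 1) / 10

# Each of the N draws is made independently with the given probabilities
likert_data <- sample(categories, N, replace = TRUE, prob = probs)

# Compare observed relative frequencies with the target probabilities
observed <- as.vector(table(factor(likert_data, levels = categories))) / N
print(data.frame(category = categories, target = probs, observed = observed))
```

With replace = TRUE there is no pool to build and no M to choose, and each draw is independent of the others, which is why this version supersedes the trick above.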