Simulating Likert scale data in R

In my last project I had to find theoretical limits for a psychometric index involving Likert scale data (aka categorical data). After successfully finding it, I decided to test the results in a simple Monte-Carlo simulation.

I was surprised to find out that there is no built-in categorical data generator in R. What I was looking for, was something like runif(100) which would generate a vector of length 100 where every element is drawn from a multinomial distribution in general or a categorical distribution in particular.

The first idea was to use sample function with given probabilities: sample(c(0,1,2),1,prob=c(0.33, 0.33, 0.34)) but you couldn’t repeat this procedure for N participants without using loops, which is very inefficient, or you would end up with rep repeating the same random pick N times.

I didn’t want to use any third-party libraries just for this small application either, so I came up with this simple trick.

Algorithm

Suppose, you want to generate a 5-category data (x1, x2, x3, x4, x5) for N participants with probabilities (1/10, 2/10, 4/10, 2/10, 1/10). The following formula will work:

distribution <- c(rep(x1,1),rep(x2,2),rep(x3,4),rep(x4,2),rep(x5,1))
potential <- rep(distribution, M)
likert_data <- sample(potential, N)

or, as one-liner:

likert_data <- sample(rep(c(rep(x1,1),rep(x2,2),rep(x3,4),rep(x4,2),rep(x5,1)), M), N)

Notice that distribution sets the probabilities, potential repeats this M times (where M is any number greater than or equal to N — I personally used M = N), and likert_data (uniformly) randomly picks N elements and returns the required vector.

Notice how in the screenshot above, we obtain almost exact probabilities we wanted: (1, 2, 4, 2, 1)/10. Since every time is a random draw, there are some deviations, but repeating this formula and averaging, gives the desired values.

UPDATE: A StackExchange user suggested a better hack — to randomly sample with replacement. This would make my solution obsolete, but it’s brilliant:

likert_data <- sample(c(x1,x2,x3,x4,x5), N, replace = TRUE, prob=c(1/10, 2/10, 4/10, 2/10, 1/10))

Hi, thanks for reading this post! I disabled comments on this website. You can contact me via social media links or email.

Explore more like this

blog statistics

A no-nonsense guide to frontend for backend developers

Introduction Absolute basics Client-side vs. Server-side Components Frontend libraries Conclusion

23 Dec 2024

Truncated normal is not normal

In my research I have to deal with many Monte-Carlo simulations of normalized variables that fall in the interval $\left[ 0, 1 \right]$. By the Central Limit Theorem I can...

28 May 2024

Global maxima in fitness landscapes

This blog post states a problem, does not provide any solutions. This is a high form of whining.

24 Mar 2021

R. Hojimatov