Computers cannot generate truly random data (for truly random data,
see a service such as Random.org).
Instead, they use a complicated formula or algorithm to generate a
pseudo-random number. It is possible to set the initial value used in
this formula in order to make the results reproducible. This is called
setting the seed, and is very easy to do with the set.seed()
function. According to the help file on random number generation,
?RNG, if you do not set a seed, “a new
one is created from the current time and the process ID when one is
required. Hence different sessions will give different simulation
results, by default.”
To demonstrate, we can set a seed and then generate 5 numbers from the normal distribution, with a mean of 0 and a standard deviation of 1:
set.seed(12345)
rnorm(n = 5, mean = 0, sd = 1)
## [1] 0.5855288 0.7094660 -0.1093033 -0.4534972 0.6058875
How do we know that these results are reproducible, based on the seed? We can generate them again, with the same seed, and output the same numbers:
set.seed(12345)
rnorm(n = 5, mean = 0, sd = 1)
## [1] 0.5855288 0.7094660 -0.1093033 -0.4534972 0.6058875
A different seed will result in different values being generated:
set.seed(42)
rnorm(n = 5, mean = 0, sd = 1)
## [1] 1.3709584 -0.5646982 0.3631284 0.6328626 0.4042683
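One detail worth illustrating is that the seed fixes the whole sequence of draws, not just the next call. In the minimal sketch below (the names first, second and again are ours, chosen only for illustration), a second call made without resetting the seed continues the sequence, while resetting the seed restarts it.

```r
# Set the seed once, then draw twice: the second draw continues the sequence
set.seed(12345)
first <- rnorm(n = 5, mean = 0, sd = 1)
second <- rnorm(n = 5, mean = 0, sd = 1)

# Reset the seed and draw again: the sequence restarts from the beginning
set.seed(12345)
again <- rnorm(n = 5, mean = 0, sd = 1)

identical(first, again)   # TRUE: same seed, same position in the sequence
identical(first, second)  # FALSE: second comes from further along the sequence
```

In practice this means a script usually needs only one set.seed() call near the top, provided the random draws are always made in the same order.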
One of the simplest ways to generate synthetic data is to draw from
the continuous uniform distribution. When drawing from this
distribution, every value between the minimum and the maximum is equally
likely to be drawn. The runif() function takes the arguments n
(number of values to generate), min (lower limit of the distribution),
and max (upper limit of the distribution). By default, the
minimum and maximum values are 0 and 1.
# Generate 10 values from the continuous uniform distribution, using defaults
set.seed(123)
runif(10)
## [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673 0.0455565 0.5281055
## [8] 0.8924190 0.5514350 0.4566147
If we want the distribution to cover a different range of values, we
can change the min and max arguments.
set.seed(321)
runif(10, min = 10, max = 100)
## [1] 96.03044 94.35570 31.43984 32.95663 45.14608 40.70619 50.71425 36.09395
## [9] 50.56059 82.59361
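Changing min and max amounts to stretching and shifting the default 0 to 1 draws. The sketch below, using illustrative names direct and rescaled that are not part of the original example, compares the two approaches under the same seed. The results should agree up to floating-point rounding.

```r
# Draw directly from the uniform distribution on 10 to 100
set.seed(321)
direct <- runif(10, min = 10, max = 100)

# Draw from the default 0 to 1 range, then rescale by hand:
# multiply by the width of the range and add the minimum
set.seed(321)
rescaled <- 10 + (100 - 10) * runif(10)

all.equal(direct, rescaled)  # compares the two vectors, allowing for rounding
range(direct)                # every value lies between 10 and 100
```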
We can check that every value between the minimum and maximum is equally likely to be selected by generating a large number of observations and plotting the results in a histogram. There will be some small deviations because of randomness, but each bin should contain approximately the same number of observations.
set.seed(54321)
runif(n = 10000) %>%
hist(main = "Histogram of Data Drawn from the Uniform Distribution",
xlab = "Random Value")
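The same check can be made numerically rather than visually. The following sketch counts the observations falling into ten equal-width bins; the bin width of 0.1 and the names x and bin_counts are our own choices for illustration.

```r
# Generate the same 10,000 uniform values as in the histogram
set.seed(54321)
x <- runif(n = 10000)

# Cut the 0 to 1 range into ten equal-width bins and count the values in each
bin_counts <- table(cut(x, breaks = seq(0, 1, by = 0.1)))
bin_counts

# With 10,000 values and 10 bins, each count should be close to 1,000
range(bin_counts)
```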
Sometimes, instead of a numeric variable, we might wish to generate a
categorical variable. In this case, instead of using the runif
function, we can sample from a vector of
possible values. First we create the vector, and then we sample from the
vector. For example, suppose we are generating synthetic data relating to a car
dealership and need to specify the colours of ten cars at the
dealership. The cars could be black, white, grey, blue, green, or red.
Because two cars could both be red, we need to sample with replacement,
so we set the replace argument to TRUE.
# Specify the colours
colours <- c("black", "white", "grey", "blue", "green", "red")
# Generate the sample
set.seed(987)
car_colours <- sample(colours, size = 10, replace = TRUE)
car_colours
## [1] "black" "green" "black" "red" "green" "blue" "white" "black" "black"
## [10] "grey"
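To see why the replace argument matters here, the sketch below asks for more cars than there are colours without replacement. sample() cannot satisfy that request and raises an error, which we capture with tryCatch() so the script keeps running. The names without_replacement and with_replacement are illustrative only.

```r
colours <- c("black", "white", "grey", "blue", "green", "red")

# Asking for 10 draws from 6 values without replacement is impossible,
# so sample() raises an error; tryCatch() returns the error message instead
without_replacement <- tryCatch(
  sample(colours, size = 10, replace = FALSE),
  error = function(e) conditionMessage(e)
)
without_replacement

# With replacement, the same request succeeds
set.seed(987)
with_replacement <- sample(colours, size = 10, replace = TRUE)
anyDuplicated(with_replacement) > 0  # TRUE: 10 draws from 6 colours must repeat
```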
We can now create our data frame of cars at the dealership:
dealership_cars <- data.frame(car_num = 1:10,
car_colour = car_colours)
dealership_cars
## car_num car_colour
## 1 1 black
## 2 2 green
## 3 3 black
## 4 4 red
## 5 5 green
## 6 6 blue
## 7 7 white
## 8 8 black
## 9 9 black
## 10 10 grey
Alternatively, we could include the sample while creating the data frame:
set.seed(789)
dealership_cars <- data.frame(car_num = 1:10,
car_colour = sample(colours, size = 10, replace = TRUE))
dealership_cars
## car_num car_colour
## 1 1 green
## 2 2 blue
## 3 3 blue
## 4 4 red
## 5 5 white
## 6 6 white
## 7 7 grey
## 8 8 green
## 9 9 blue
## 10 10 grey
Or we can add the column with mutate:
set.seed(123)
dealership_cars <- data.frame(car_num = 1:10)
dealership_cars %<>% mutate(car_colour = sample(colours, size = 10, replace = TRUE))
dealership_cars
## car_num car_colour
## 1 1 grey
## 2 2 red
## 3 3 grey
## 4 4 white
## 5 5 white
## 6 6 red
## 7 7 grey
## 8 8 green
## 9 9 blue
## 10 10 red
We saw how we can sample with replacement, to reflect that sometimes duplicates exist. However, we can also sample without replacement, to generate a sample where duplicates cannot exist. For example, a regular deck of playing cards contains four suits (Spades, Clubs, Hearts, Diamonds), each with 13 cards (Ace, 2-10, Jack, Queen, King). If I deal three cards from a regular deck of cards, I cannot deal myself the Ace of Spades twice. This type of sampling is sampling without replacement.
First let’s generate a deck of cards:
deck_of_cards <- data.frame(value = rep(c("Two", "Three", "Four", "Five", "Six",
"Seven", "Eight", "Nine", "Ten", "Jack",
"Queen", "King", "Ace"), 4),
suit = rep(c("Spades", "Clubs", "Diamonds", "Hearts"), 13)) %>%
mutate(card = paste(value, "of", suit)) %>%
pull(card)
# First we repeat the 13 card values four times, then we repeat the four suits
# 13 times. Because 13 and 4 share no common factor, every value ends up paired
# with every suit exactly once. We then paste the two together and extract just
# the card column as a vector. Don't worry if you don't follow everything that
# this is doing, it's just to create our deck of cards.
Now we can deal ourselves five cards:
set.seed(112358)
hand_of_cards <- sample(deck_of_cards, size = 5, replace = FALSE)
hand_of_cards
## [1] "Ace of Clubs" "King of Hearts" "Seven of Clubs" "Jack of Clubs"
## [5] "Five of Spades"
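It can be reassuring to confirm that the deck really holds 52 distinct cards and that a dealt hand contains no repeats. The sketch below builds an equivalent deck in base R (an alternative construction, not the one used above; the names values, suits, deck and hand are ours). Because the cards are stored in a different order, the hand dealt will generally differ from the output shown above, even with the same seed.

```r
values <- c("Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
            "Nine", "Ten", "Jack", "Queen", "King", "Ace")
suits <- c("Spades", "Clubs", "Diamonds", "Hearts")

# Repeat the 13 values once per suit, and each suit once per value
deck <- paste(rep(values, times = 4), "of", rep(suits, each = 13))

length(deck)         # 52 cards
anyDuplicated(deck)  # 0 means no card appears twice

# Deal five cards without replacement and confirm there are no repeats
set.seed(112358)
hand <- sample(deck, size = 5, replace = FALSE)
anyDuplicated(hand)  # 0: the same card cannot be dealt twice
```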
We can also use this technique to put all objects in a vector in a random order, by specifying the size as the same as the number of objects in our vector. For example, if we need to determine the order in which eight students are to present to the class, we can draw a sample of eight without replacement:
students <- c("Alice", "Bob", "Cameron", "Deborah", "Elizabeth", "Fred", "Gareth", "Hubert")
set.seed(123)
sample(students, size = 8, replace = FALSE)
## [1] "Gareth" "Hubert" "Cameron" "Fred" "Bob" "Deborah"
## [7] "Elizabeth" "Alice"
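As a small aside, sample() shuffles the whole vector by default: when size is omitted it equals the length of the vector, and replace defaults to FALSE. The sketch below, with illustrative names explicit and shorthand, shows that the shorter call gives the same ordering under the same seed.

```r
students <- c("Alice", "Bob", "Cameron", "Deborah", "Elizabeth", "Fred",
              "Gareth", "Hubert")

# Spell out the size and replace arguments
set.seed(123)
explicit <- sample(students, size = 8, replace = FALSE)

# Rely on the defaults: size = length(students), replace = FALSE
set.seed(123)
shorthand <- sample(students)

identical(explicit, shorthand)  # TRUE: both are the same random ordering
setequal(shorthand, students)   # TRUE: every student appears exactly once
```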
Sometimes particular categories are more prevalent than other categories. We could adapt the sampling approach above by having numerous observations with the same value in the vector, but this seems a bit inefficient. Fortunately, we can specify weights through the prob argument, so that each value is sampled in proportion to its prevalence.
Revisiting our earlier car dealership example, perhaps some car colours are more prevalent than other car colours:
# Specify the colours
colours <- data.frame(colour = c("black", "white", "grey", "blue", "green", "red"),
prevalence = c(0.4, 0.25, 0.15, 0.12, 0.03, 0.05))
set.seed(456)
dealership_cars <- data.frame(car_num = 1:100,
car_colour = sample(colours$colour,
size = 100,
replace = TRUE,
prob = colours$prevalence))
dealership_cars %>% count(car_colour)
## car_colour n
## 1 black 34
## 2 blue 11
## 3 green 4
## 4 grey 15
## 5 red 8
## 6 white 28
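To judge how closely the sample follows the weights, the observed proportions can be placed next to the specified ones. This is a base R sketch with names of our own choosing (draws, observed, comparison); we also set stringsAsFactors = FALSE so that the colours are plain character strings regardless of R version.

```r
colours <- data.frame(
  colour = c("black", "white", "grey", "blue", "green", "red"),
  prevalence = c(0.4, 0.25, 0.15, 0.12, 0.03, 0.05),
  stringsAsFactors = FALSE
)

# Draw 100 colours, weighting each colour by its prevalence
set.seed(456)
draws <- sample(colours$colour, size = 100, replace = TRUE,
                prob = colours$prevalence)

# Count each colour (in the original order) and convert to proportions
observed <- as.numeric(table(factor(draws, levels = colours$colour))) / length(draws)

comparison <- data.frame(colour = colours$colour,
                         specified = colours$prevalence,
                         observed = observed)
comparison
```

With only 100 cars the observed proportions will wander around the specified ones; a larger size should bring them closer.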
R isn’t limited to generating data from the uniform distribution. There are many different distributions that can be used, including the normal distribution. Here we generate 10,000 values with a mean of 200 and a standard deviation of 30:
set.seed(789)
normal <- rnorm(10000, 200, 30)
hist(normal, breaks = 50)
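Alongside the histogram, a quick numerical check is to compare the sample mean and standard deviation with the values requested. In this sketch the arguments are named explicitly (n, mean, sd) to make clear what the positional values 10000, 200 and 30 refer to.

```r
# Same draw as above, with the arguments named
set.seed(789)
normal <- rnorm(n = 10000, mean = 200, sd = 30)

# The sample statistics should be close to, but not exactly, 200 and 30
c(mean = mean(normal), sd = sd(normal))
```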
When using the case_when function, any observation that
does not meet any of the criteria becomes missing. We can use this to
our advantage to create synthetic data with missing values.
Let’s use our previous hypothetical scenario and assume that some students didn’t report their hours spent studying, although their marks were known by the university. We begin by allocating a random value between 0 and 1 for each record:
set.seed(98765)
students %<>%
mutate(rand = runif(5000, min = 0, max = 1))
Now we choose to delete approximately 2% of the hours from the dataset, selected at
random. We can do this by keeping values for records with a value in
this new rand column that is greater than or equal to 0.02 (and
converting to missing all records where the rand value is
less than 0.02). Because the case_when function will
convert to missing any values that don’t meet the criteria, we only need
to specify the criteria for values to be kept:
students %<>%
mutate(hours = case_when(rand >= 0.02 ~ hours))
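For readers who want to try the idea without the earlier dataset, here is a self-contained base R sketch. The data frame students_demo and its made-up hours column are hypothetical stand-ins, and ifelse() is used in place of case_when so that no packages are needed.

```r
# Hypothetical stand-in data: 5,000 students with made-up study hours
set.seed(98765)
students_demo <- data.frame(
  hours = round(runif(5000, min = 0, max = 40), 1),
  rand  = runif(5000, min = 0, max = 1)
)

# Keep hours where rand is at least 0.02; otherwise set them to missing
students_demo$hours <- ifelse(students_demo$rand >= 0.02,
                              students_demo$hours, NA)

# The proportion missing should be close to, but not exactly, 0.02
mean(is.na(students_demo$hours))
```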
Although we set 2% of observations to have missing hours, other proportions could be chosen and applied in a similar manner.
Note - if deleting data from multiple columns, you will need to use
a different rand column for each, each generated with a different
seed (or all generated together within a single mutate) so that the same
observations don’t have all the data deleted.
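The sketch below illustrates the point of the note using a hypothetical data frame demo with made-up hours and marks columns. Reusing one rand column blanks out exactly the same records in both columns, whereas a separate rand column for each gives largely different records.

```r
# Hypothetical data with one random column per variable to be thinned
set.seed(2024)
n <- 5000
demo <- data.frame(hours = runif(n, 0, 40),
                   marks = runif(n, 0, 100),
                   rand_hours = runif(n),
                   rand_marks = runif(n))

# Reusing rand_hours for both columns: the same records lose both values
shared_hours <- ifelse(demo$rand_hours >= 0.02, demo$hours, NA)
shared_marks <- ifelse(demo$rand_hours >= 0.02, demo$marks, NA)
both_missing_shared <- sum(is.na(shared_hours) & is.na(shared_marks))

# Using a separate random column for marks: very few records lose both
indep_marks <- ifelse(demo$rand_marks >= 0.02, demo$marks, NA)
both_missing_indep <- sum(is.na(shared_hours) & is.na(indep_marks))

c(shared = both_missing_shared, independent = both_missing_indep)
```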
A large dataset generated from the normal distribution is likely to contain some outliers. However, heavier-tailed distributions, such as Gosset’s t-distribution with few degrees of freedom, tend to generate a larger number of outliers. Here, we shall generate 200 observations from the t-distribution with three degrees of freedom:
set.seed(1000)
t_example <- rt(n = 200, df = 3)
boxplot(t_example)
We will learn more about outliers later. All you need to know for now is that the circles on the box plot represent outliers.
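The box plot flags, by default, points lying more than 1.5 times the interquartile range beyond the box, and boxplot.stats() returns those points in its out component. The sketch below uses that to count flagged points in a t-distributed and a normal sample. We use 10,000 observations each, a deliberately larger sample than the 200 above, so that the comparison is not dominated by chance; the names t_large and normal_large are ours.

```r
set.seed(1000)
t_large <- rt(n = 10000, df = 3)   # heavy-tailed sample
normal_large <- rnorm(n = 10000)   # standard normal sample for comparison

# Count the points that the default box plot rule would draw as circles
n_out_t <- length(boxplot.stats(t_large)$out)
n_out_normal <- length(boxplot.stats(normal_large)$out)

c(t = n_out_t, normal = n_out_normal)
```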
Not all data is uniform or normal. Some is skewed, such as salaries or house prices. The beta distribution can be quite useful for generating skewed data. It can also be used to generate other shapes - triangular or mound-shaped, for example. The choice of shape parameters allows the shape of the distribution to change. Values drawn from the beta distribution always lie within the range of 0 to 1, so it may be necessary to multiply them by a constant (and add an offset) to get the desired range of values.
For example, here we generate some skewed data to represent salary.
set.seed(100)
salary <- rbeta(1000, shape1 = 2, shape2 = 15) * 200000 + 20000
hist(salary, breaks = 50)
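The scaling in this example can be unpacked with a little arithmetic. Since the beta draws lie between 0 and 1, multiplying by 200000 and adding 20000 places every salary between 20,000 and 220,000. The mean of a beta distribution is shape1 / (shape1 + shape2), so the expected salary is 2 / 17 * 200000 + 20000, roughly 43,500. The sketch below checks both facts; expected_mean is a name we introduce for illustration.

```r
set.seed(100)
salary <- rbeta(1000, shape1 = 2, shape2 = 15) * 200000 + 20000

# Every value must fall inside the rescaled range
range(salary)

# Theoretical mean of the rescaled distribution versus the sample mean
expected_mean <- 2 / (2 + 15) * 200000 + 20000
c(expected = expected_mean, observed = mean(salary))

# For right-skewed data the mean usually sits above the median
c(mean = mean(salary), median = median(salary))
```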
These shape parameters result in a more mound-shaped distribution:
set.seed(100)
beta_2 <- rbeta(1000, shape1 = 2, shape2 = 2)
hist(beta_2, breaks = 20)
Experiment with some other shape parameter values to see the effect:
set.seed(100)
beta_3 <- rbeta(1000, shape1 = 15, shape2 = 1)
hist(beta_3, breaks = 50)
set.seed(100)
beta_4 <- rbeta(1000, shape1 = 0.5, shape2 = 0.5)
hist(beta_4, breaks = 50)
The statistical roots of R are demonstrated in the plethora of
statistical distributions available. More information is available
through the help file, ?Distributions.
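When browsing that help file, it helps to know R's naming convention: each distribution has functions prefixed with d (density), p (cumulative probability), q (quantile) and r (random generation). We have only used the r versions so far. A brief sketch of the other three, with variable names of our own choosing:

```r
density_at_zero <- dnorm(0)      # height of the standard normal curve at 0
prob_below <- pnorm(1.96)        # P(X <= 1.96), roughly 0.975
cutoff <- qnorm(0.975)           # value with 97.5% of the curve below it, roughly 1.96

# The same prefixes work for the other distributions used in this section
unif_prob <- punif(0.25)         # uniform on 0 to 1: P(X <= 0.25) is 0.25
beta_median <- qbeta(0.5, shape1 = 2, shape2 = 2)  # symmetric, so the median is 0.5

c(density_at_zero, prob_below, cutoff, unif_prob, beta_median)
```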