In this problem, you are going to look at Stat 100’s Survey 1 data from Spring 2017. The csv data file can be downloaded here. Put the file, Stat100_2017spring_survey01.csv, in your R working directory and load it with the commands

library(tidyverse)
survey <- read_csv("Stat100_2017spring_survey01.csv")

The column variables are explained on this webpage.

a. The speed column is the maximum speed (in mph) students claimed they had ever driven. What are the mean and sample standard deviation of speed?

Use the summarize() function to get the answer:

summarize(survey, mean(speed), sd(speed))
# A tibble: 1 x 2
  `mean(speed)` `sd(speed)`
          <dbl>       <dbl>
1       80.9076    36.16879
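
As a quick check on what sd() reports, the sketch below computes the sample standard deviation by hand, using the n - 1 denominator, and compares it with the built-in functions. The vector x is made-up toy data, not values from the survey.

```r
# Toy data, invented for illustration (not taken from the survey)
x <- c(60, 75, 80, 95, 110)

n <- length(x)
x_bar <- sum(x) / n                      # sample mean
s <- sqrt(sum((x - x_bar)^2) / (n - 1))  # sample SD with n - 1 denominator

# Both comparisons should return TRUE
all.equal(x_bar, mean(x))
all.equal(s, sd(x))
```

If a column contained missing values, mean() and sd() would return NA unless called with na.rm=TRUE; the output above suggests that speed has none.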

b. Plot a histogram of the maximum speed.

ggplot(survey) + geom_histogram(aes(speed, after_stat(density)), bins=16, fill="white", color="black")

c. You should see from the histogram that some students answered ‘0’, meaning that they had never driven a car. Calculate the total number of students who had never driven a car. Of these students, how many were male and how many were female?

Use the filter() function to subset the tibble:

non_drivers <- filter(survey, speed==0)

The number of students who had never driven a car is…

nrow(non_drivers)
[1] 153

To break the number down by gender, we can use group_by() and then summarize():

non_drivers %>% group_by(gender) %>% summarize(n())
# A tibble: 2 x 2
  gender `n()`
   <chr> <int>
1 Female   114
2   Male    39

Alternatively, use the table() function:

table(non_drivers$gender)

Female   Male 
   114     39 

Note that n() is a dplyr function that counts the number of observations in the current group. It can only be used inside summarize(), mutate() and filter().
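
To illustrate how n() behaves within groups, here is a minimal sketch on a small made-up tibble (toy is hypothetical and merely stands in for non_drivers). It also shows count(), a dplyr shorthand that should give the same tally as group_by() followed by summarize(n = n()).

```r
library(dplyr)

# Hypothetical toy data standing in for non_drivers
toy <- tibble(gender = c("Female", "Male", "Female", "Female", "Male"))

# n() inside summarize() counts the rows in each group
by_n <- toy %>% group_by(gender) %>% summarize(n = n())

# count() wraps the same group-and-tally step in one call
by_count <- toy %>% count(gender)

by_n
by_count
```

Naming the result column (n = n()) avoids the backticked `n()` header seen in the output above.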

d. From the histogram, you also see that there are a number of students whose maximum driving speeds were quite low. Let’s assume that only those whose maximum driving speeds exceed 30 mph were regular drivers. Create a subset of the speed column for regular drivers and then calculate the mean and sample standard deviation.

Use filter() to subset the data and then summarize() to calculate the statistics.

regular <- filter(survey, speed > 30)
(stats <- summarize(regular, mean=mean(speed), sd=sd(speed)))
# A tibble: 1 x 2
   mean       sd
  <dbl>    <dbl>
1    93 20.19292

e. Plot a histogram of the maximum speed for the regular drivers. Then superpose a normal curve with the same mean and standard deviation (calculated above).

ggplot(regular) + 
  geom_histogram(aes(speed, after_stat(density)), bins=16, fill="white", color="black") + 
  stat_function(fun=dnorm, args=list(mean=stats$mean, sd=stats$sd), color="red")
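
The histogram is drawn on the density scale, not as counts, because dnorm() returns a density whose total area is 1; the bars and the curve are comparable only when the bar areas also sum to 1. The sketch below checks this on simulated data (sim is a hypothetical stand-in for regular, generated with rnorm(); it is not the survey data). It uses after_stat(density), which in ggplot2 3.3.0 and later is the current spelling of ..density...

```r
library(ggplot2)

set.seed(1)
# Simulated speeds, a hypothetical stand-in for regular$speed
sim <- data.frame(speed = rnorm(500, mean = 90, sd = 20))

p <- ggplot(sim, aes(speed)) +
  geom_histogram(aes(y = after_stat(density)), bins = 16,
                 fill = "white", color = "black") +
  stat_function(fun = dnorm,
                args = list(mean = mean(sim$speed), sd = sd(sim$speed)),
                color = "red")

# Extract the drawn bars and add up their areas (height times width)
bars <- ggplot_build(p)$data[[1]]
total_area <- sum(bars$y * (bars$xmax - bars$xmin))
total_area  # should be 1, up to floating-point error
```

Had the histogram been plotted as counts, the normal curve would need to be multiplied by the number of observations times the bin width to line up with the bars.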