R Markdown Exercise: GPA and Hours of Study

Write your answers to the following questions in an R Markdown file. You will create both the R Markdown file and the HTML file generated from it.

In this problem, you will analyze combined Stat 100 and Stat 200 survey data from Spring 2017 to determine whether a student's GPA is correlated with the average number of hours the student spends studying.

First, download the CSV file of the data here and save it as "Stat100_200_2017spring_survey01M.csv" in the same folder as your R Markdown file. Load the file into R using the command

survey <- read.csv("Stat100_200_2017spring_survey01M.csv")

The description of the variables in the data frame can be found on this webpage. Take some time to explore the data.
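
A minimal sketch of a first look at a data frame, using a small made-up stand-in rather than the real survey file. Only `studyHr` is named in this exercise; the column names `GPA` and `classYear`, and all of the values, are assumptions for illustration, so check the variable description for the actual names.

```r
# Hypothetical stand-in for the survey data frame. Column names other than
# studyHr are assumed here and may differ in the real file.
demo <- data.frame(
  GPA       = c(3.2, 3.8, 2.9, 3.5, 3.0, 3.6),
  studyHr   = c(2.0, 3.5, 1.0, 2.5, 1.5, 3.0),
  classYear = c("Freshman", "Sophomore", "Junior",
                "Senior", "Freshman", "Senior")
)

str(demo)              # column names and types
head(demo, 3)          # first three rows
summary(demo)          # per-column summaries
na_counts <- colSums(is.na(demo))  # missing values per column
na_counts
```

The same four calls work unchanged on `survey` once the CSV file has been loaded.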

  1. (2 points) How many students in the data are freshmen? How many are sophomores? How many are juniors? How many are seniors?
    Hint: You only need a single R command to get all the answers. If you forget the command, review Week 5's notes.
  2. (2 points) Use the xyplot() function in the lattice graphics package to make scatter plots of GPA versus average study hours (column 'studyHr') for students in each school year.
  3. (6 points) Fit a linear model predicting a student's GPA from the average study hours. What are the intercept and slope? Is the slope statistically significant (use the usual significance level α = 5%)? Make a scatter plot of GPA versus studyHr and then add the regression line to the plot.
  4. (2 points) Based on the result in part 3, what can you conclude about the relationship between GPA and average study hours?
  5. (2 points) Make a scatter plot of the residuals versus studyHr for the regression result in part 3.
  6. (9 points) Create 4 subsets of the survey data frame containing freshman, sophomore, junior and senior students. Then fit a linear model predicting GPA from studyHr for each group. For which group(s) is the slope significant at the 5% level?
    Hint: If you forget how to subset a data frame, review Week 3's Lon Capa problem on subsetting a data frame.
  7. (2 points) Use the predict() function on one of the linear models in part 6 to predict the GPA of a senior student who spends 1.5 hours/day studying.
  8. (4 points) Fit a linear model predicting studyHr from GPA for the senior students. Then use it with the predict() function to predict studyHr for a senior student whose GPA equals the predicted GPA calculated in part 7 above. Is the predicted value of studyHr greater than, equal to, or smaller than 1.5?
    (As you learned in Stat 100, this phenomenon is a consequence of regression to the mean.)
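
The workflow these questions walk through can be sketched on simulated data. This is an illustration under stated assumptions, not the survey analysis itself: the data frame `sim`, the column names `GPA` and `classYear`, and the simulated relationship are all made up, so none of the numbers it produces say anything about the real survey.

```r
library(lattice)  # provides xyplot(); ships with standard R installations

set.seed(1)
n <- 200
years <- c("Freshman", "Sophomore", "Junior", "Senior")

# Simulated stand-in for the survey (assumed column names, made-up values).
sim <- data.frame(
  classYear = factor(sample(years, n, replace = TRUE), levels = years),
  studyHr   = runif(n, 0, 6)
)
sim$GPA <- 3.0 + 0.05 * sim$studyHr + rnorm(n, sd = 0.4)

# Count students per school year in a single call.
year_counts <- table(sim$classYear)
year_counts

# One scatter-plot panel per school year.
print(xyplot(GPA ~ studyHr | classYear, data = sim))

# Linear model for all students: estimates, standard errors and p-values.
fit <- lm(GPA ~ studyHr, data = sim)
summary(fit)$coefficients

# Scatter plot with the fitted line, then residuals versus the predictor.
plot(GPA ~ studyHr, data = sim)
abline(fit, col = "red")
plot(sim$studyHr, resid(fit), xlab = "studyHr", ylab = "Residual")
abline(h = 0, lty = 2)

# Subset one group, fit it, and predict GPA at 1.5 hours/day.
seniors <- subset(sim, classYear == "Senior")
fit_sr  <- lm(GPA ~ studyHr, data = seniors)
gpa_hat <- unname(predict(fit_sr, newdata = data.frame(studyHr = 1.5)))

# Reverse regression: predict studyHr back from that predicted GPA.
fit_rev <- lm(studyHr ~ GPA, data = seniors)
hr_hat  <- unname(predict(fit_rev, newdata = data.frame(GPA = gpa_hat)))
c(gpa_hat = gpa_hat, hr_hat = hr_hat)
```

For simple linear regression, the product of the two slopes equals r², so the back-predicted value is mean(studyHr) + r² × (1.5 − mean(studyHr)) within the group. Whenever |r| < 1, it is pulled from 1.5 toward the group mean, which is the regression-to-the-mean effect.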

Guidelines


Solution