R Markdown Exercise: GPA and Hours of Study
This problem requires you to write your answers to the following questions in an R Markdown file. You will create both the R Markdown file and the html file generated from it.
In this problem, you will analyze the combined Stat 100 and Stat 200 survey data from Spring 2017 to determine whether a student's GPA is correlated with the average number of hours the student spends studying.
First, download the csv file of the data here and save it as "Stat100_200_2017spring_survey01M.csv" in the folder that contains your R Markdown file. Load the file into R using the command
survey <- read.csv("Stat100_200_2017spring_survey01M.csv")
The description of the variables in the data frame can be found on this webpage. Take some time to explore the data.
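A minimal sketch of one way to start exploring the data. It uses only base R functions and assumes the file name given above and that the working directory is the folder containing your R Markdown file. The file.exists() guard is an addition of this sketch, not something the assignment asks for.

```r
# File name taken from the instructions above.
csv_file <- "Stat100_200_2017spring_survey01M.csv"

if (file.exists(csv_file)) {
  survey <- read.csv(csv_file)
  str(survey)           # column names, types, and the first few values
  print(head(survey))   # first six rows
  print(summary(survey)) # per-column summaries
} else {
  # The file must sit in the current working directory for read.csv() to find it.
  message("File not found in ", getwd(), ": ", csv_file)
}
```

When knitting, the working directory is normally the folder of the R Markdown file, which is why saving the csv file next to it matters.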
- (a) (2 points) How many students in the data are freshmen? How many are sophomores? How many are juniors? How many are seniors?
  Hint: You only need a single R command to get all the answers. If you forget the command, review Week 5's notes.
- (b) (2 points) Use the xyplot() function from the lattice graphics package to make scatter plots of GPA versus the average study hours (in column 'studyHr') for students in each school year.
- (c) (6 points) Fit a linear model predicting a student's GPA from the average study hours. What are the intercept and slope? Is the slope statistically significant (use the usual significance level α = 5%)? Make a scatter plot of GPA versus studyHr and then add the regression line to the plot.
- (d) (2 points) Based on the result in part (c), what can you conclude about the relationship between GPA and average study hours?
- (e) (2 points) Make a scatter plot of the residuals versus studyHr for the regression result in part (c).
- (f) (9 points) Create 4 subsets of the survey data frame containing the freshman, sophomore, junior and senior students. Then fit a linear model predicting GPA from studyHr for each group. For which group(s) is the slope significant at the 5% level?
  Hint: If you forget how to subset a data frame, review Week 3's Lon Capa problem on subsetting a data frame.
- (g) (2 points) Use the predict() function on one of the linear models in (f) to predict the GPA of a senior student who spends 1.5 hours/day studying.
- (h) (4 points) Fit a linear model predicting studyHr from GPA for the senior students. Then use it and the predict() function to predict studyHr for a senior student with GPA = GPA_g, where GPA_g is the predicted GPA calculated in part (g) above. Is the predicted value of studyHr greater than, equal to, or smaller than 1.5?
  (As you've learned in Stat 100, this phenomenon is a consequence of regression to the mean.)
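The tools named in the questions above can be tried out on simulated data first. The sketch below is not the solution for the survey data: the data frame sim and its columns year, hours and gpa are hypothetical stand-ins, and the numbers used to generate them are arbitrary assumptions.

```r
library(lattice)  # provides xyplot()

set.seed(1)  # make the simulated data reproducible

# Simulated stand-in data; every name and number here is made up for illustration.
n <- 400
sim <- data.frame(
  year  = factor(sample(c("Freshman", "Sophomore", "Junior", "Senior"),
                        n, replace = TRUE)),
  hours = runif(n, min = 0, max = 6)
)
sim$gpa <- 3.0 + 0.05 * sim$hours + rnorm(n, sd = 0.4)

# Counts for every category in a single command.
print(table(sim$year))

# One scatter-plot panel per group, using lattice's conditioning syntax.
print(xyplot(gpa ~ hours | year, data = sim))

# Fit gpa on hours; the coefficient table holds estimates and p-values.
fit_yx <- lm(gpa ~ hours, data = sim)
coef_table <- summary(fit_yx)$coefficients
print(coef_table)
slope_p <- coef_table["hours", "Pr(>|t|)"]
print(slope_p < 0.05)  # TRUE if the slope is significant at the 5% level

# Scatter plot with the fitted line, then residuals against the predictor.
plot(gpa ~ hours, data = sim)
abline(fit_yx)
plot(sim$hours, resid(fit_yx), xlab = "hours", ylab = "residual")
abline(h = 0, lty = 2)

# Subset one group and refit within it.
seniors <- subset(sim, year == "Senior")
fit_sen <- lm(gpa ~ hours, data = seniors)

# Predict gpa at hours = 1.5, then predict hours back from that gpa
# using the reverse regression (hours on gpa).
gpa_hat <- predict(fit_sen, newdata = data.frame(hours = 1.5))
fit_rev <- lm(hours ~ gpa, data = seniors)
hours_back <- predict(fit_rev, newdata = data.frame(gpa = gpa_hat))
print(c(gpa_hat = unname(gpa_hat), hours_back = unname(hours_back)))
```

For simple linear regression the back-prediction equals mean(hours) + r² × (1.5 − mean(hours)), where r is the correlation within the group. Since r² is at most 1, the back-prediction lies between the group mean and 1.5, which is regression to the mean at work.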
Guidelines
- Write down your name in the R markdown file.
- Generate an html file from the markdown file using knitr.
- You should download the survey data to your computer and then load it into R, instead of loading it directly from the website. This is for the purpose of reproducibility: the file on the remote website may be changed or disappear later, and your R Markdown file could then no longer be run. By saving the data in the same location as the markdown file, you can be sure that your markdown result is reproducible.
- Show all code and output. If a code chunk contains more than a few lines, include a brief explanation of what you are doing unless you write your code in a self-explanatory style.
- Label and state the answers to each question clearly. Don't just show the code and say that the information is in the output.
Solution
- R Markdown file (Download the file and save it in the same folder as the data file you downloaded above. Open it with RStudio and then click "Knit".)
- Knitted html file