```{r setoptions, echo=FALSE}
# Disable the comments in R outputs
knitr::opts_chunk$set(comment = NA)
```
This problem requires you to write your answers to the following questions in an R Markdown file. You will create the R Markdown file and the HTML file generated from it.
In this problem, you will analyze combined Stat 100 and Stat 200 survey data from Spring 2017 to determine whether a student's GPA is correlated with the average number of hours the student spends studying.
First, download the csv file of the data here and save it as "Stat100_200_2017spring_survey01M.csv" in the folder that contains your R Markdown file. Load the file into R using the command
```{r, eval=FALSE}
survey <- read.csv("Stat100_200_2017spring_survey01M.csv")
```
The description of the variables in the data frame can be found on this webpage. Take some time to explore the data.
**a. (2 points) How many students in the data are freshmen? How many of them are sophomores? How many are juniors? How many are seniors?**
**Hint: You only need a single R command to get all the answers. If you forget the command, review Week 5's notes.**
```{r}
# load data
survey <- read.csv("Stat100_200_2017spring_survey01M.csv")
```
We can use the `table()` command to find out the number of students in each school year.
```{r}
(tbl <- table(survey$schoolYear))
```
We see that there are `r tbl["Freshman"]` freshmen, `r tbl["Sophomore"]` sophomores, `r tbl["Junior"]` juniors and `r tbl["Senior"]` seniors.
**b. (2 points) Use the `xyplot()` function from the lattice package to make scatter plots of GPA versus the average study hours (in column `studyHr`) for students in each school year.**
```{r}
# load the lattice package
library(lattice)
# create plots
xyplot(GPA ~ studyHr | schoolYear, data=survey, pch=16, xlab="average study hours/day")
```
**c. (6 points) Fit a linear model predicting a student's GPA from the average study hours. What are the intercept and slope? Is the slope statistically significant (assume the usual significance level $\alpha$ = 5%)? Make a scatter plot of GPA versus studyHr and then add the regression line to the plot.**
```{r}
# fit linear model
fit <- lm(GPA ~ studyHr, data=survey)
summary(fit)
```
From the summary output, we see that the intercept is `r signif(fit$coef[1],4)` and the slope is `r signif(fit$coef[2],4)`. The p-value for the slope is $7.78\times10^{-10} \ll 5\%$. This means that the slope is highly significant.
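The p-value quoted above is read off the printed summary by hand. As a minimal sketch of how such a value could be extracted programmatically, the chunk below fits the same kind of model to simulated data and indexes the coefficient table returned by `coef(summary(...))`. The data frame `toy` and every value in it are made up for illustration and are not the survey data.

```{r}
# Simulated data for illustration only (NOT the survey data)
set.seed(1)
toy <- data.frame(studyHr = runif(50, 0, 6))
toy$GPA <- 3 + 0.05 * toy$studyHr + rnorm(50, sd = 0.4)

# Same model form as in part (c), fitted to the simulated data
toy_fit <- lm(GPA ~ studyHr, data = toy)

# Coefficient table: one row per term, with columns
# Estimate, Std. Error, t value and Pr(>|t|)
toy_coefs <- coef(summary(toy_fit))

# p-value of the slope, looked up by row and column name
toy_slope_p <- toy_coefs["studyHr", "Pr(>|t|)"]

# Logical: is the slope significant at the 5% level?
toy_slope_p < 0.05
```

Applying the same indexing to `coef(summary(fit))` should return the slope's p-value for the model in part (c), which avoids retyping the number if the data change.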
The scatter plot and regression line (thick red line) are plotted below.
```{r}
plot(GPA ~ studyHr, data=survey, las=1, pch=16, xlab="average study hours/day")
abline(fit, col="red", lwd=2)
```
**d. (2 points) Based on the result in part (c), what can you conclude about the relationship of GPA and average study hours?**
Since the slope is positive and highly significant, there is evidence that students who spend more hours studying tend to have higher GPAs. In other words, average study hours and GPA are positively correlated.
However, the correlation coefficient between GPA and study hours is `r round(cor(survey$GPA,survey$studyHr),2)` (computed by the command `cor(survey$GPA,survey$studyHr)`). This is not a strong correlation.
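A significant slope and a weak correlation are not contradictory: significance reflects how precisely the slope is estimated, which improves with sample size, while the correlation measures how much of the variation the line explains. In a simple linear regression the two quantities are tied together, since the slope equals $r \cdot s_y / s_x$ and the R-squared equals $r^2$. The sketch below checks both identities on simulated data; the data frame `toy2` and its values are invented for illustration only.

```{r}
# Simulated data for illustration only (NOT the survey data)
set.seed(2)
toy2 <- data.frame(x = rnorm(200))
toy2$y <- 0.3 * toy2$x + rnorm(200)

toy2_fit <- lm(y ~ x, data = toy2)
toy2_r <- cor(toy2$x, toy2$y)

# Identity 1: slope = r * sd(y) / sd(x)
all.equal(unname(coef(toy2_fit)["x"]),
          toy2_r * sd(toy2$y) / sd(toy2$x))

# Identity 2: R-squared of the simple regression = r^2
all.equal(summary(toy2_fit)$r.squared, toy2_r^2)
```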
**e. (2 points) Make a scatter plot of the residuals versus studyHr for the regression result in part (c).**
```{r}
# Residual plot for part (c)
plot(fit$residuals ~ survey$studyHr, xlab="average study hours/day",
ylab="Residuals", las=1, pch=16)
abline(h=0)
```
**f. (9 points) Create 4 subsets of the survey data frame containing freshman, sophomore, junior and senior students. Then fit a linear model predicting GPA from studyHr for each group. For which group(s) is the slope significant at the 5% level?**
**Hint: If you forget how to subset a data frame, review Week 3's Lon Capa problem on subsetting a data frame.**
```{r}
# subset data
survey_fr <- survey[survey$schoolYear=="Freshman",]
survey_so <- survey[survey$schoolYear=="Sophomore",]
survey_jr <- survey[survey$schoolYear=="Junior",]
survey_sr <- survey[survey$schoolYear=="Senior",]
# fit linear models
fit_fr <- lm(GPA ~ studyHr, data=survey_fr)
fit_so <- lm(GPA ~ studyHr, data=survey_so)
fit_jr <- lm(GPA ~ studyHr, data=survey_jr)
fit_sr <- lm(GPA ~ studyHr, data=survey_sr)
# look at the models
summary(fit_fr)
summary(fit_so)
summary(fit_jr)
summary(fit_sr)
```
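The four subsets and four models above are written out one by one, as the question asks. As an alternative sketch, the same pattern can be expressed with `split()` and `lapply()`, collecting each group's slope and p-value in one small table. The chunk uses simulated data; the data frame `toy3`, its group labels and its values are hypothetical and unrelated to the survey.

```{r}
# Simulated data for illustration only (NOT the survey data)
set.seed(3)
toy3 <- data.frame(
  group = rep(c("A", "B", "C"), each = 40),
  x = runif(120, 0, 6)
)
toy3$y <- 3 + 0.05 * toy3$x + rnorm(120, sd = 0.4)

# Split the data frame by group and fit one linear model per piece
toy3_fits <- lapply(split(toy3, toy3$group),
                    function(d) lm(y ~ x, data = d))

# For each model, pull the slope estimate and its p-value;
# t() puts one group per row
toy3_slopes <- t(sapply(toy3_fits, function(m) {
  coef(summary(m))["x", c("Estimate", "Pr(>|t|)")]
}))
toy3_slopes
```

Comparing the `Pr(>|t|)` column with 0.05 then shows at a glance which groups have a significant slope.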
From the summaries, we see that the slopes for the freshmen, sophomores and seniors are significant at the 5% level.
**g. (2 points) Use the `predict()` function on one of the linear models in (f) to predict the GPA of a senior student spending 1.5 hours/day studying.**
```{r}
(GPA_pred <- predict(fit_sr, newdata=data.frame(studyHr=1.5)))
```
So the predicted GPA is `r round(GPA_pred,2)`.
**h. (4 points) Fit a linear model predicting studyHr from GPA for the senior students. Then use it and the `predict()` function to predict studyHr for a senior student with a GPA = GPA_g, where GPA_g is the predicted GPA calculated in part (g) above. Is the predicted value of studyHr greater than, equal to, or smaller than 1.5?**
**(As you've learned in Stat 100, this phenomenon is a consequence of regression to the mean.)**
```{r}
# linear model
fit_hours <- lm(studyHr ~ GPA, data=survey_sr)
# make prediction
(studyHr_pred <- predict(fit_hours, newdata=data.frame(GPA=GPA_pred)))
```
We see that the predicted studyHr is `r round(studyHr_pred,2)`, which is greater than 1.5.
The predicted studyHr, `r round(studyHr_pred,2)`, is closer to the mean of `survey_sr$studyHr` (= `r round(mean(survey_sr$studyHr),2)`) than 1.5 is, a consequence of *regression to the mean*.
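The size of this pull toward the mean can be made explicit. Predicting $y$ from $x_0$ and then predicting $x$ back from that $\hat{y}$ multiplies the deviation from the mean by the product of the two regression slopes, which equals $r^2$: $\hat{x} - \bar{x} = r^2 (x_0 - \bar{x})$. Because $r^2 \le 1$, the round-trip prediction can never be farther from the mean than the starting value. The sketch below checks this on simulated data; `toy4`, `x0` and all values are made up for illustration.

```{r}
# Simulated data for illustration only (NOT the survey data)
set.seed(4)
toy4 <- data.frame(x = rnorm(100, mean = 2, sd = 1))
toy4$y <- 3 + 0.2 * toy4$x + rnorm(100, sd = 0.5)

# Two regressions, one in each direction
fit_y_on_x <- lm(y ~ x, data = toy4)
fit_x_on_y <- lm(x ~ y, data = toy4)

# Round trip: start at x0, predict y, then predict x back from that y
x0 <- 4  # arbitrary starting value (hypothetical)
y_hat <- predict(fit_y_on_x, newdata = data.frame(x = x0))
x_back <- predict(fit_x_on_y, newdata = data.frame(y = y_hat))

# The deviation from the mean shrinks by a factor of r^2
toy4_r <- cor(toy4$x, toy4$y)
all.equal(unname(x_back) - mean(toy4$x),
          toy4_r^2 * (x0 - mean(toy4$x)))
```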