```{r setoptions, echo=FALSE}
# Disable the comments in R outputs
knitr::opts_chunk$set(comment = NA)
```

This problem requires you to write answers to the following questions in an R Markdown file. You will create the markdown file and an HTML file generated by the markdown file.

In this problem, you will analyze combined Stat 100 and Stat 200 survey data from Spring 2017 to determine if a student's GPA is correlated with the average number of hours the student spends studying. First, download the CSV file of the data here and save it as "Stat100_200_2017spring_survey01M.csv" to the folder where your R Markdown file is. Load the file into R using the command

```{r, eval=FALSE}
survey <- read.csv("Stat100_200_2017spring_survey01M.csv")
```

The description of the variables in the data frame can be found on this webpage. Take some time to explore the data.

**a. (2 points) How many students in the data are freshmen? How many of them are sophomores? How many are juniors? How many are seniors?**

**Hint: You only need a single R command to get all the answers. If you forget the command, review Week 5's notes.**

```{r}
# load data
survey <- read.csv("Stat100_200_2017spring_survey01M.csv")
```

We can use the `table()` command to find out the number of students in each school year.

```{r}
(tbl <- table(survey$schoolYear))
```

We see that there are `r tbl["Freshman"]` freshmen, `r tbl["Sophomore"]` sophomores, `r tbl["Junior"]` juniors and `r tbl["Senior"]` seniors.

**b. (2 points) Use the `xyplot()` function in the lattice graphics to make scatter plots of the GPA versus the average study hours (in column `studyHr`) for students in each school year.**

```{r}
# load the lattice package
library(lattice)
# create plots
xyplot(GPA ~ studyHr | schoolYear, data=survey, pch=16, xlab="average study hours/day")
```

**c. (6 points) Fit a linear model predicting a student's GPA from the average study hours. What are the intercept and slope?
Is the slope statistically significant (assume the usual significance level $\alpha$ = 5%)? Make a scatter plot of GPA versus studyHr and then add the regression line to the plot.**

```{r}
# fit linear model
fit <- lm(GPA ~ studyHr, data=survey)
summary(fit)
```

From the summary output, we see that the intercept is `r signif(fit$coef[1],4)` and the slope is `r signif(fit$coef[2],4)`. The p-value for the slope is $7.78 \times 10^{-10}$, far below 5%. This means that the slope is highly significant. The scatter plot and regression line (thick red line) are plotted below.

```{r}
plot(GPA ~ studyHr, data=survey, las=1, pch=16, xlab="average study hours/day")
abline(fit, col="red", lwd=2)
```

**d. (2 points) Based on the result in part (c), what can you conclude about the relationship of GPA and average study hours?**

Since the slope is positive and highly significant, there is evidence of an association between spending more hours studying and having a higher GPA: the average study hours and GPA are positively correlated. However, the correlation coefficient between GPA and study hours is `r round(cor(survey$GPA,survey$studyHr),2)` (computed by the command `cor(survey$GPA,survey$studyHr)`), which is not a strong correlation.

**e. (2 points) Make a scatter plot of the residuals versus studyHr for the regression result in part (c).**

```{r}
# Residual plot for part (c)
plot(fit$residuals ~ survey$studyHr, xlab="average study hours/day", ylab="Residuals", las=1, pch=16)
abline(h=0)
```

**f. (9 points) Create 4 subsets of the survey data frame containing freshman, sophomore, junior and senior students. Then fit a linear model predicting GPA from studyHr for each group.
The slope of which group(s) is significant at the 5% level?**

**Hint: If you forget how to subset a data frame, review Week 3's Lon Capa problem on subsetting a data frame.**

```{r}
# subset data
survey_fr <- survey[survey$schoolYear=="Freshman",]
survey_so <- survey[survey$schoolYear=="Sophomore",]
survey_jr <- survey[survey$schoolYear=="Junior",]
survey_sr <- survey[survey$schoolYear=="Senior",]

# fit linear models
fit_fr <- lm(GPA ~ studyHr, data=survey_fr)
fit_so <- lm(GPA ~ studyHr, data=survey_so)
fit_jr <- lm(GPA ~ studyHr, data=survey_jr)
fit_sr <- lm(GPA ~ studyHr, data=survey_sr)

# look at the models
summary(fit_fr)
summary(fit_so)
summary(fit_jr)
summary(fit_sr)
```

From the summaries, we see that the slopes for the freshmen, sophomores and seniors are significant at the 5% level.

**g. (2 points) Use the `predict()` function on one of the linear models in (f) to predict the GPA of a senior student spending 1.5 hours/day studying.**

```{r}
(GPA_pred <- predict(fit_sr, newdata=data.frame(studyHr=1.5)))
```

So the predicted GPA is `r round(GPA_pred,2)`.

**h. (4 points) Fit a linear model predicting studyHr from GPA for the senior students. Then use it and the `predict()` function to predict studyHr for a senior student with a GPA = GPA_g, where GPA_g is the predicted GPA calculated in part (g) above. Is the predicted value of studyHr greater than, equal to, or smaller than 1.5?**

**(As you've learned in Stat 100, this phenomenon is a consequence of the regression to the mean.)**

```{r}
# linear model
fit_hours <- lm(studyHr ~ GPA, data=survey_sr)
# make prediction
(studyHr_pred <- predict(fit_hours, newdata=data.frame(GPA=GPA_pred)))
```

We see that the predicted studyHr is `r round(studyHr_pred,2)`, which is greater than 1.5. The predicted studyHr, `r round(studyHr_pred,2)`, is closer to the mean of `survey_sr$studyHr` (= `r mean(survey_sr$studyHr)`) than 1.5 is, a consequence of *regression to the mean*.
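
The result in part (h) can also be checked algebraically. The following is a sketch using notation that is not part of the problem statement: write $x$ for studyHr and $y$ for GPA among the seniors, with sample means $\bar{x}$, $\bar{y}$, standard deviations $s_x$, $s_y$, and correlation coefficient $r$. It assumes both models are fit to the same rows, which should hold here because `lm()` drops incomplete cases and both models use only these two columns. The two least-squares lines are

$$
\hat{y} = \bar{y} + r\,\frac{s_y}{s_x}\,(x - \bar{x}), \qquad
\hat{x} = \bar{x} + r\,\frac{s_x}{s_y}\,(y - \bar{y}).
$$

Substituting the GPA predicted at $x = 1.5$ from the first line into the second gives

$$
\hat{x} - \bar{x} = r\,\frac{s_x}{s_y} \cdot r\,\frac{s_y}{s_x}\,(1.5 - \bar{x}) = r^2\,(1.5 - \bar{x}).
$$

Since $0 < r^2 < 1$ whenever the correlation is neither zero nor perfect, the round-trip prediction $\hat{x}$ lies strictly between $\bar{x}$ and 1.5. It is therefore greater than 1.5 exactly when the seniors' mean study time exceeds 1.5 hours/day, and smaller than 1.5 otherwise.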