This problem requires you to write answers to the following questions in an R markdown file. You will create the markdown file and an html file generated by the markdown file.

In this problem, you will analyze a combined Stat 100 and Stat 200 survey data in Spring 2017 to determine if a student’s GPA is correlated with the average number of hours the student spend studying.

First, download the csv file of the data here and save it as “Stat100_200_2017spring_survey01M.csv” to the folder where your R markdown file is. Load the file to R using the command

survey <- read.csv("Stat100_200_2017spring_survey01M.csv")

The description of the variables in the data frame can be found on this webpage. Take some time to explore the data.

a. (2 points) How many students in the data are freshmen? How many of them are sophomores? How many are juniors? How many are seniors?

Hint: You only need a single R command to get all the answers. If you forget the command, review Week 5’s notes.

# load data 
survey <- read.csv("Stat100_200_2017spring_survey01M.csv")

We can use the table() command to find out the number of students in each school year.

(tbl <- table(survey$schoolYear))

 Freshman    Junior    Senior Sophomore 
      906       159        87       351 

We see that there are 906 freshmen, 351 sophomores, 159 juniors and 87 seniors.

b. (2 points) Use the xyplot() function in the lattice graphics to make scatter plots of the GPA versus the average study hours (in column studyHr) for students in each school year.

# load the lattice package
library(lattice)

# create plots
xyplot(GPA ~ studyHr | schoolYear, data=survey, pch=16, xlab="average study hours/day")

c. (6 points) Fit a linear model predicting a student’s GPA from the average study hours. What are the intercept and slope? Is the slope statistically significant (assume the usual null cutoff \(\alpha\) = 5%)? Make a scatter plot of GPA versus studyHr and then add the regression line on the plot.

# fit linear model 
fit <- lm(GPA ~ studyHr, data=survey)
summary(fit)

Call:
lm(formula = GPA ~ studyHr, data = survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3360 -0.2892  0.1109  0.4108  0.8748 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.101772   0.027608 112.352  < 2e-16 ***
studyHr     0.046845   0.007569   6.189 7.78e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5602 on 1501 degrees of freedom
Multiple R-squared:  0.02489,   Adjusted R-squared:  0.02424 
F-statistic: 38.31 on 1 and 1501 DF,  p-value: 7.783e-10

From the summary output, we see that the intercept is 3.102 and the slope is 0.04685. The p-value for the slope is 7.78×10-10 << 5%. This means that the slope is highly significant.

The scatter plot and regression line (thick red line) is plotted below.

plot(GPA ~ studyHr, data=survey, las=1, pch=16, xlab="average study hours/day")
abline(fit, col="red", lwd=2)

d. (2 points) Based on the result in part (c), what can you conclude about the relationship of GPA and average study hours?

Since the slope is positive and is highly significant, there is evidence to suggest an assoication between a student who spends more hours studying and having a higher GPA. There is a positive correlation between the average study hours and GPA.

However, the correlation coefficient between the GPA and study hour is 0.16 (computed by the command cor(survey$GPA,survey$studyHr)). This is not a strong correlation.

e. (2 points) Make a scatter plot of the residuals versus studyHr for the regression result in part (c).

# Residual plot for part (c)
plot(fit$residuals ~ survey$studyHr, xlab="average study hours/day", 
     ylab="Residuals", las=1, pch=16)
abline(h=0)

f. (9 points) Create 4 subsets of the survey data frame containing freshman, sophomore, junior and senior students. Then fit a linear model predicting GPA from studyHr for each group. The slope of which group(s) is significant at the 5% level?

Hint: If you forget how to subset a data frame, review Week 3’s Lon Capa problem on subsetting a data frame.

# subset data
survey_fr <- survey[survey$schoolYear=="Freshman",]
survey_so <- survey[survey$schoolYear=="Sophomore",]
survey_jr <- survey[survey$schoolYear=="Junior",]
survey_sr <- survey[survey$schoolYear=="Senior",]

# fit linear models
fit_fr <- lm(GPA ~ studyHr, data=survey_fr)
fit_so <- lm(GPA ~ studyHr, data=survey_so)
fit_jr <- lm(GPA ~ studyHr, data=survey_jr)
fit_sr <- lm(GPA ~ studyHr, data=survey_sr)

# look at the models
summary(fit_fr)

Call:
lm(formula = GPA ~ studyHr, data = survey_fr)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2671 -0.2841  0.1436  0.4436  0.8159 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.10100    0.03984  77.832  < 2e-16 ***
studyHr      0.05537    0.01160   4.774 2.11e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.606 on 904 degrees of freedom
Multiple R-squared:  0.02459,   Adjusted R-squared:  0.02351 
F-statistic: 22.79 on 1 and 904 DF,  p-value: 2.108e-06
summary(fit_so)

Call:
lm(formula = GPA ~ studyHr, data = survey_so)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.75128 -0.28474  0.06188  0.37861  0.92203 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.02458    0.04850  62.367  < 2e-16 ***
studyHr      0.05338    0.01251   4.268 2.55e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.492 on 349 degrees of freedom
Multiple R-squared:  0.04959,   Adjusted R-squared:  0.04687 
F-statistic: 18.21 on 1 and 349 DF,  p-value: 2.55e-05
summary(fit_jr)

Call:
lm(formula = GPA ~ studyHr, data = survey_jr)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.23174 -0.29400  0.08602  0.38602  0.80822 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.18734    0.07359  43.312   <2e-16 ***
studyHr      0.00888    0.01837   0.483    0.629    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4875 on 157 degrees of freedom
Multiple R-squared:  0.001486,  Adjusted R-squared:  -0.004874 
F-statistic: 0.2337 on 1 and 157 DF,  p-value: 0.6295
summary(fit_sr)

Call:
lm(formula = GPA ~ studyHr, data = survey_sr)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.29268 -0.24557  0.02508  0.31311  0.80153 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.15137    0.07858  40.102   <2e-16 ***
studyHr      0.04710    0.01906   2.472   0.0154 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.402 on 85 degrees of freedom
Multiple R-squared:  0.06706,   Adjusted R-squared:  0.05609 
F-statistic:  6.11 on 1 and 85 DF,  p-value: 0.01544

From the summaries, we see that the slopes for the freshmen, sophomores and seniors are significant at the 5% level.

g. (2 points) Use the predict() function on one of the linear models in (f) to predict the GPA of a senior student spending 1.5 hours/day studying.

(GPA_pred <- predict(fit_sr, newdata=data.frame(studyHr=1.5)))
       1 
3.222022 

So the predicted GPA is 3.22.

h. (4 points) Fit a linear model predicting studyHr from GPA for the senior students. Then use it and the predict() function to predict studyHr for a senior student with a GPA = GPA_g, where GPA_g is the predicted GPA calculated in part (g) above. Is the predicted value of studyHr greater than, equal to, or smaller than 1.5?

(As you’ve learned in Stat 100, this phenomenon is a consequence of the regression to the mean.)

# linear model
fit_hours <- lm(studyHr ~ GPA, data=survey_sr)

# make prediction
(studyHr_pred <- predict(fit_hours, newdata=data.frame(GPA=GPA_pred)))
       1 
3.317619 

We see that the predicted studyHr is 3.32, which is greater than 1.5.

The predicted studyHr, 3.32, is closer to the mean of survey_sr$studyHr (= 3.4482759) than 1.5 is to the mean, a consequence of the regression to the mean.