In Week 3’s confounder problem, you analyzed the Stat 100 survey data from Spring 2013. Using stratification, you identified gender as a confounder behind the apparent negative correlation between shoe size and the number of shoes owned. As you have learned in Stat 200, another way to control for a possible confounding variable is by modeling. In this problem, you are going to compare the results of stratification and modeling. The data you are going to analyze come from Stat 100’s Survey 1 in Spring 2017. It can be downloaded here. This webpage contains a description of the column variables. Save the csv file to the same directory as your R markdown file, then load it into R using the command
survey <- read.csv("Stat100_2017spring_survey01M2.csv")
a. (2 points) What is the correlation coefficient between the shoe size and number of shoes owned?
The correlation can be computed using the cor() function:
cor(survey$shoeSize, survey$shoeNums)
[1] -0.1859561
You can also wrap the function inside the with(survey, ...) environment if you don’t want to type the survey$ prefix:
with(survey, cor(shoeSize, shoeNums))
[1] -0.1859561
We see that the correlation coefficient is -0.1859561. This is a negative correlation, as we have seen before.
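If you also want a formal test of whether this correlation differs from zero, base R’s cor.test() reports a p-value and a 95% confidence interval (an optional check, not required by the problem):
# optional: test whether the correlation is significantly different from 0
with(survey, cor.test(shoeSize, shoeNums))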
b. (6 points) Fit a linear model predicting the number of shoes from shoe size. What are the intercept and slope? Is the slope statistically significant? Make a scatter plot of shoe number versus shoe size and add the regression line to the plot.
# fit the linear model
fit_overall <- lm(shoeNums ~ shoeSize, data=survey)
summary(fit_overall)
Call:
lm(formula = shoeNums ~ shoeSize, data = survey)
Residuals:
    Min      1Q  Median      3Q     Max 
-18.446  -6.514  -2.895   3.763  89.447 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   22.656      1.408  16.093  < 2e-16 ***
shoeSize      -1.052      0.156  -6.745 2.32e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.77 on 1270 degrees of freedom
Multiple R-squared:  0.03458,   Adjusted R-squared:  0.03382 
F-statistic: 45.49 on 1 and 1270 DF,  p-value: 2.325e-11
We see that the intercept is 22.6562 and the slope is -1.0524, consistent with the negative correlation we found in (a). The p-value of the slope is 2.32e-11, which is much smaller than 5%. This means that the slope is statistically significant.
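As an optional supplement, confint() gives 95% confidence intervals for the intercept and slope of the fitted model:
# optional: 95% confidence intervals for the regression coefficients
confint(fit_overall)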
Shown below is a plot of shoe number vs shoe size. The regression line is shown in red.
plot(shoeNums ~ shoeSize, data=survey, pch=16)
abline(fit_overall, col="red")
We split the survey data into male and female groups via the logical vectors male and female:
male <- (survey$gender == "male")
female <- !male
Now we fit a linear model for each group using lm():
fit_male <- lm(shoeNums ~ shoeSize, data=survey, subset=male)
fit_female <- lm(shoeNums ~ shoeSize, data=survey, subset=female)
c. (2 points) What are the regression equations for the male and female groups?
Look at the results of the two regressions:
summary(fit_male)
Call:
lm(formula = shoeNums ~ shoeSize, data = survey, subset = male)
Residuals:
   Min     1Q Median     3Q    Max 
-8.545 -3.737 -1.907  1.077 92.093 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   2.9913     2.9543   1.013   0.3118  
shoeSize      0.4681     0.2757   1.698   0.0902 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.34 on 448 degrees of freedom
Multiple R-squared:  0.006393,  Adjusted R-squared:  0.004175 
F-statistic: 2.883 on 1 and 448 DF,  p-value: 0.09024
summary(fit_female)
Call:
lm(formula = shoeNums ~ shoeSize, data = survey, subset = female)
Residuals:
    Min      1Q  Median      3Q     Max 
-16.346  -7.069  -2.143   3.931  80.951 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6819     2.2431   4.316 1.78e-05 ***
shoeSize      0.8516     0.2823   3.016  0.00264 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.57 on 820 degrees of freedom
Multiple R-squared:  0.01097,   Adjusted R-squared:  0.009766 
F-statistic: 9.097 on 1 and 820 DF,  p-value: 0.00264
From the output, we conclude that the regression equations for the two groups are:
Male: Shoe numbers = 2.9913 + 0.4681×(shoe size)
Female: Shoe numbers = 9.6819 + 0.8516×(shoe size)
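If you prefer not to transcribe the numbers from the summary output, the intercepts and slopes can also be pulled out directly with coef() (an optional alternative):
# optional: extract the intercepts and slopes directly
coef(fit_male)
coef(fit_female)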
(2 points) In which group(s) is the slope statistically significant?
The slope is significant if the p-value is smaller than 5%. The p-value for the slope in the male group is 9.02% and the p-value for the slope in the female group is 0.264%. Therefore, the slope is statistically significant in the female group but not in the male group.
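The p-values can also be read off programmatically from the coefficient table returned by summary() (optional; the row and column names below follow the output shown above):
# optional: extract the slope p-values from the coefficient tables
summary(fit_male)$coefficients["shoeSize", "Pr(>|t|)"]
summary(fit_female)$coefficients["shoeSize", "Pr(>|t|)"]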
(2 points) Use xyplot() to plot the shoe number vs shoe size for these two groups on the same graph and show the regression lines.
Below, the parameter layout=c(1,2) arranges the two panels in a single column with 2 rows. The regression lines are shown in red.
library(lattice)
xyplot(shoeNums ~ shoeSize | gender, data=survey, pch=16, layout=c(1,2), 
       panel = function(x, y, ...) {
       panel.xyplot(x, y, ...)
       panel.lmline(x, y, col = "red")
       })
Change the reference level of gender so that "male" is the baseline.
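A note in case you are running R 4.0 or later: read.csv() no longer converts strings to factors by default, so gender may be a character vector and relevel() would then fail. A minimal precaution, assuming the column is named gender, is to convert it to a factor first:
# convert gender to a factor in case it was read in as character
# (the default in R >= 4.0); this is harmless if it is already a factor
survey$gender <- factor(survey$gender)
levels(survey$gender)   # check the current ordering of the levels
With gender stored as a factor, relevel() sets the reference level: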
survey$gender <- relevel(survey$gender, "male")
d. (4 points) Fit a linear model predicting the number of shoes from shoe size and gender without an interaction term.
# model without interaction term
fit_noint <- lm(shoeNums ~ shoeSize + gender, data=survey)
summary(fit_noint)
Call:
lm(formula = shoeNums ~ shoeSize + gender, data = survey)
Residuals:
    Min      1Q  Median      3Q     Max 
-16.145  -5.566  -1.869   2.855  92.113 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.7577     2.1835   0.347 0.728626    
shoeSize       0.6790     0.2011   3.377 0.000756 ***
genderfemale  10.2768     0.8136  12.631  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.15 on 1269 degrees of freedom
Multiple R-squared:  0.1424,    Adjusted R-squared:  0.141 
F-statistic: 105.4 on 2 and 1269 DF,  p-value: < 2.2e-16
From the output, we conclude that the regression equation is
Shoe numbers = 0.7577 + 0.679×(shoe size) + 10.2768×(genderfemale)
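As a quick sanity check (optional; a shoe size of 9 is just an arbitrary example value), predict() applies this equation for us:
# optional: predicted number of shoes at shoe size 9 for each gender
newdat <- data.frame(shoeSize = 9, gender = c("male", "female"))
predict(fit_noint, newdata = newdat)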
e. (5 points) Split the equation in (d) into male and female groups. Show your steps.
To obtain the equation for the males, we set genderfemale=0 in the equation in (d). This gives
Male: Shoe numbers = 0.7577 + 0.679×(shoe size)
For the females, we set genderfemale=1 in the equation in (d). This gives
Female: Shoe numbers = 0.7577 + 0.679×(shoe size) + 10.2768
which can be simplified to give
Female: Shoe numbers = 11.0345 + 0.679×(shoe size)
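The same intercepts and common slope can be computed directly from the fitted coefficients, which avoids rounding by hand (optional):
# optional: reconstruct the two intercepts and the common slope from coef()
b <- coef(fit_noint)
b["(Intercept)"]                      # male intercept
b["(Intercept)"] + b["genderfemale"]  # female intercept
b["shoeSize"]                         # slope, the same for both groups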
(1 point) Are the equations exactly the same as in part (c)?
Comparing the equations in (e) and (c), we see that they are different.
Without an interaction term, the two groups are forced to have the same slope, which is in general inconsistent with the equations obtained by fitting a separate linear model to each group.
f. (5 points) Fit a linear model predicting the number of shoes from shoe size and gender with an interaction term.
# Model with interaction term
fit_int <- lm(shoeNums ~ shoeSize*gender, data=survey)
summary(fit_int)
Call:
lm(formula = shoeNums ~ shoeSize * gender, data = survey)
Residuals:
    Min      1Q  Median      3Q     Max 
-16.346  -5.513  -1.920   2.795  92.093 
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)  
(Intercept)             2.9913     3.2112   0.932   0.3518  
shoeSize                0.4681     0.2997   1.562   0.1185  
genderfemale            6.6906     3.8670   1.730   0.0838 .
shoeSize:genderfemale   0.3834     0.4042   0.949   0.3430  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.15 on 1268 degrees of freedom
Multiple R-squared:  0.143, Adjusted R-squared:  0.141 
F-statistic: 70.53 on 3 and 1268 DF,  p-value: < 2.2e-16
From the output, we see that the regression equation is
Shoe numbers = 2.9913 + 0.4681×(shoe size) + 6.6906×(genderfemale) + 0.3834×(genderfemale)×(shoe size)
g. (5 points) Split the equation in (f) into male and female groups. Show your steps.
The male equation is obtained by setting genderfemale=0 in the equation in (f). The result is
Male: Shoe numbers = 2.9913 + 0.4681×(shoe size)
For the female equation, we set genderfemale=1 in the equation in (f). This gives
Female: Shoe numbers = 2.9913 + 0.4681×(shoe size) + 6.6906 + 0.3834×(shoe size)
which can be simplified to give
Female: Shoe numbers = 9.6819 + 0.8515×(shoe size)
(1 point) Are these equations exactly the same as part (c)?
We see that these equations match the equations calculated in part (c), as expected. The tiny difference in the slope of the female equation is caused by rounding the regression coefficients to 4 decimal places.
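To confirm this without any rounding, we can recombine the coefficients of the interaction model for each group and compare them with the separate fits from part (c) (an optional check; the term name shoeSize:genderfemale follows the summary output above):
# optional: verify that the interaction model reproduces the separate fits
b <- coef(fit_int)
c(b["(Intercept)"], b["shoeSize"])              # male intercept and slope
c(b["(Intercept)"] + b["genderfemale"],
  b["shoeSize"] + b["shoeSize:genderfemale"])   # female intercept and slope
coef(fit_male)
coef(fit_female)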