In Week 3’s confounder problem, you analyzed the Stat 100 survey data from Spring 2013. By stratification, you identified gender as a confounder behind the apparent negative correlation between shoe size and the number of shoes owned. As you have learned from Stat 200, another way to control for a possible confounding variable is by modeling. In this problem, you are going to compare the results of stratification and modeling. The data you are going to analyze come from Stat 100’s survey 1 data in Spring 2017. It can be downloaded here. This webpage contains a description of the column variables. Save the csv file to the same directory as your R markdown file, then load the file into R using the command
survey <- read.csv("Stat100_2017spring_survey01M2.csv")
a. (2 points) What is the correlation coefficient between the shoe size and number of shoes owned?
The correlation can be computed using the cor() function:
cor(survey$shoeSize, survey$shoeNums)
[1] -0.1859561
You can also wrap the function inside the with(survey, ) environment if you don’t want to type the survey$ prefix:
with(survey, cor(shoeSize, shoeNums))
[1] -0.1859561
We see that the correlation coefficient is -0.1859561. This is a negative correlation, as we have seen before.
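As a side note, the statistical significance of a correlation coefficient can be checked directly with cor.test(). Here is a minimal sketch on simulated stand-in data (the variables x and y are placeholders, not columns of the survey data):

```r
# Simulated stand-in data: y is negatively related to x, as in the survey
set.seed(42)
x <- rnorm(100, mean = 9, sd = 1.5)    # stand-in for shoe size
y <- 25 - x + rnorm(100, sd = 5)       # stand-in for number of shoes owned

# cor.test() returns the sample correlation together with a p-value for
# the null hypothesis that the true correlation is zero
res <- cor.test(x, y)
res$estimate   # sample correlation (negative here)
res$p.value    # p-value of the test
```

Replacing x and y with the survey columns would test whether the -0.186 correlation is statistically distinguishable from zero.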
b. (6 points) Fit a linear model predicting the number of shoes from shoe size. What are the intercept and slope? Is the slope statistically significant? Make a scatter plot of shoe number versus shoe size and add the regression line to the plot.
# fit the linear model
fit_overall <- lm(shoeNums ~ shoeSize, data=survey)
summary(fit_overall)
Call:
lm(formula = shoeNums ~ shoeSize, data = survey)
Residuals:
Min 1Q Median 3Q Max
-18.446 -6.514 -2.895 3.763 89.447
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.656 1.408 16.093 < 2e-16 ***
shoeSize -1.052 0.156 -6.745 2.32e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.77 on 1270 degrees of freedom
Multiple R-squared: 0.03458, Adjusted R-squared: 0.03382
F-statistic: 45.49 on 1 and 1270 DF, p-value: 2.325e-11
We see that the intercept is 22.656 and the slope is -1.052, consistent with the negative correlation we found in (a). The p-value of the slope is 2.32e-11, far smaller than 5%, so the slope is statistically significant.
Shown below is a plot of shoe number vs shoe size. The regression line is shown in red.
plot(shoeNums ~ shoeSize, data=survey, pch=16)
abline(fit_overall, col="red")
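If you prefer to pull the slope’s p-value out programmatically rather than reading it off the printout, the coefficient table is stored in summary(fit)$coefficients. A sketch on a toy fit (the data frame toy and its columns are hypothetical):

```r
# Toy data with a clearly negative slope
set.seed(3)
toy <- data.frame(x = rnorm(30))
toy$y <- 2 - toy$x + rnorm(30)
fit <- lm(y ~ x, data = toy)

# Coefficient matrix: columns Estimate, Std. Error, t value, Pr(>|t|)
ctab <- summary(fit)$coefficients
slope_p <- ctab["x", "Pr(>|t|)"]   # p-value of the slope
slope_p < 0.05                     # significant at the 5% level
```

The same pattern applied to fit_overall would extract the 2.32e-11 p-value above.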
We split the survey data into male and female groups via the logical vectors male and female:
male <- (survey$gender == "male")
female <- !male
Now we fit a linear model for each group using lm():
fit_male <- lm(shoeNums ~ shoeSize, data=survey, subset=male)
fit_female <- lm(shoeNums ~ shoeSize, data=survey, subset=female)
c. (2 points) What are the regression equations for the male and female groups?
Look at the results of the regressions:
summary(fit_male)
Call:
lm(formula = shoeNums ~ shoeSize, data = survey, subset = male)
Residuals:
Min 1Q Median 3Q Max
-8.545 -3.737 -1.907 1.077 92.093
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.9913 2.9543 1.013 0.3118
shoeSize 0.4681 0.2757 1.698 0.0902 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.34 on 448 degrees of freedom
Multiple R-squared: 0.006393, Adjusted R-squared: 0.004175
F-statistic: 2.883 on 1 and 448 DF, p-value: 0.09024
summary(fit_female)
Call:
lm(formula = shoeNums ~ shoeSize, data = survey, subset = female)
Residuals:
Min 1Q Median 3Q Max
-16.346 -7.069 -2.143 3.931 80.951
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.6819 2.2431 4.316 1.78e-05 ***
shoeSize 0.8516 0.2823 3.016 0.00264 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.57 on 820 degrees of freedom
Multiple R-squared: 0.01097, Adjusted R-squared: 0.009766
F-statistic: 9.097 on 1 and 820 DF, p-value: 0.00264
From the output, we conclude that the regression equations for the two groups are:
Male: Shoe numbers = 2.9913 + 0.4681×(shoe size)
Female: Shoe numbers = 9.6819 + 0.8516×(shoe size)
(2 points) In which group(s) is the slope statistically significant?
The slope is significant if the p-value is smaller than 5%. The p-value for the male slope is 9.02% and the p-value for the female slope is 0.264%, so the slope is significant in the female group but not in the male group.
(2 points) Use xyplot() to plot the shoe number vs shoe size for these two groups on the same graph and show the regression lines.
Below, the parameter layout=c(1,2) arranges the two panels in two rows. The regression lines are shown in red.
library(lattice)
xyplot(shoeNums ~ shoeSize | gender, data=survey, pch=16, layout=c(1,2),
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.lmline(x, y, col = "red")
})
Before fitting the combined models, change the reference level of gender to "male", so that the dummy variable in the model output is genderfemale:
survey$gender <- relevel(survey$gender, "male")
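For context, relevel() simply moves the chosen level to the front of the factor’s level ordering, and lm() uses the first level as the baseline (reference) category. A toy sketch on a hypothetical factor g:

```r
# By default factor levels are alphabetical, so "female" comes first
g <- factor(c("female", "male", "female"))
levels(g)                        # "female" "male"  -> female is the baseline

# relevel() makes "male" the first level, hence the baseline in lm()
g2 <- relevel(g, ref = "male")
levels(g2)                       # "male" "female" -> dummy variable is for female
```

This is why the model output below contains a genderfemale coefficient rather than gendermale.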
d. (4 points) Fit a linear model predicting the number of shoes from shoe size and gender without an interaction term.
# model without interaction term
fit_noint <- lm(shoeNums ~ shoeSize + gender, data=survey)
summary(fit_noint)
Call:
lm(formula = shoeNums ~ shoeSize + gender, data = survey)
Residuals:
Min 1Q Median 3Q Max
-16.145 -5.566 -1.869 2.855 92.113
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7577 2.1835 0.347 0.728626
shoeSize 0.6790 0.2011 3.377 0.000756 ***
genderfemale 10.2768 0.8136 12.631 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.15 on 1269 degrees of freedom
Multiple R-squared: 0.1424, Adjusted R-squared: 0.141
F-statistic: 105.4 on 2 and 1269 DF, p-value: < 2.2e-16
From the output, we conclude that the regression equation is
Shoe numbers = 0.7577 + 0.679×(shoe size) + 10.2768×(genderfemale)
e. (5 points) Split the equation in (d) into male and female groups. Show your steps.
To obtain the equation for the males, we set genderfemale=0 in the equation in (d). This gives
Male: Shoe numbers = 0.7577 + 0.679×(shoe size)
For the females, we set genderfemale=1 in the equation in (d). This gives
Female: Shoe numbers = 0.7577 + 0.679×(shoe size) + 10.2768
which can be simplified to give
Female: Shoe numbers = 11.0345 + 0.679×(shoe size)
(1 point) Are the equations exactly the same as in part (c)?
Comparing the equations in (e) and (c), we see that they are different.
Without an interaction term, the slopes for the two groups are forced to be the same, which in general will be inconsistent with the equations obtained by fitting a linear model to each group separately.
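To see concretely why a single additive model cannot reproduce two separate fits, here is a toy sketch with simulated data (the names grp, fit_A, fit_B, fit_add are hypothetical) in which the two groups have genuinely different slopes:

```r
# Simulated data: group A has slope 1, group B has slope 3
set.seed(1)
n <- 50
grp <- rep(c("A", "B"), each = n)
x <- rnorm(2 * n)
y <- ifelse(grp == "A", x, 5 + 3 * x) + rnorm(2 * n, sd = 0.2)
d <- data.frame(x, y, grp = factor(grp))

fit_A <- lm(y ~ x, data = d, subset = (grp == "A"))
fit_B <- lm(y ~ x, data = d, subset = (grp == "B"))
fit_add <- lm(y ~ x + grp, data = d)   # additive model: one common slope

coef(fit_A)["x"]     # close to 1
coef(fit_B)["x"]     # close to 3
coef(fit_add)["x"]   # a single compromise slope between the two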
f. (5 points) Fit a linear model predicting the number of shoes from shoe size and gender with an interaction term.
# Model with interaction term
fit_int <- lm(shoeNums ~ shoeSize*gender, data=survey)
summary(fit_int)
Call:
lm(formula = shoeNums ~ shoeSize * gender, data = survey)
Residuals:
Min 1Q Median 3Q Max
-16.346 -5.513 -1.920 2.795 92.093
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.9913 3.2112 0.932 0.3518
shoeSize 0.4681 0.2997 1.562 0.1185
genderfemale 6.6906 3.8670 1.730 0.0838 .
shoeSize:genderfemale 0.3834 0.4042 0.949 0.3430
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.15 on 1268 degrees of freedom
Multiple R-squared: 0.143, Adjusted R-squared: 0.141
F-statistic: 70.53 on 3 and 1268 DF, p-value: < 2.2e-16
From the output, we see that the regression equation is
Shoe numbers = 2.9913 + 0.4681×(shoe size) + 6.6906×(genderfemale) + 0.3834×(genderfemale)×(shoe size)
g. (5 points) Split the equation in (f) into male and female groups. Show your steps.
The male equation is obtained by setting genderfemale=0 in the equation in (f). The result is
Male: Shoe numbers = 2.9913 + 0.4681×(shoe size)
For the female equation, we set genderfemale=1 in the equation in (f). This gives
Female: Shoe numbers = 2.9913 + 0.4681×(shoe size) + 6.6906 + 0.3834×(shoe size)
which can be simplified to give
Female: Shoe numbers = 9.6819 + 0.8515×(shoe size)
(1 point) Are these equations exactly the same as part (c)?
We see that these equations match the equations calculated in part (c), as expected. The tiny difference in the slope of the female equation comes from rounding the regression coefficients to 4 decimal places.
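The exact agreement is not a coincidence: a full interaction model with one two-level factor is mathematically equivalent to fitting the two groups separately (only the residual standard errors differ, since the interaction model pools them). A toy check on simulated data (the names grp, fit_A, fit_B, fit_int are hypothetical):

```r
# Simulated data with different intercepts and slopes per group
set.seed(2)
n <- 40
grp <- factor(rep(c("A", "B"), each = n))
x <- rnorm(2 * n)
y <- ifelse(grp == "A", 2 + 0.5 * x, 7 + 1.5 * x) + rnorm(2 * n)
d <- data.frame(x, y, grp)

fit_A <- lm(y ~ x, data = d, subset = (grp == "A"))
fit_B <- lm(y ~ x, data = d, subset = (grp == "B"))
fit_int <- lm(y ~ x * grp, data = d)
b <- coef(fit_int)   # (Intercept), x, grpB, x:grpB

# Group A: the baseline intercept and slope of the interaction model
all.equal(unname(coef(fit_A)), unname(b[c("(Intercept)", "x")]))
# Group B: baseline plus the grpB and x:grpB adjustments
all.equal(unname(coef(fit_B)),
          unname(c(b["(Intercept)"] + b["grpB"], b["x"] + b["x:grpB"])))
```

Both comparisons agree up to floating-point precision, with no rounding of coefficients involved.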