In Week 3’s confounder problem, you analyzed the Stat 100 survey data from Spring 2013. By stratification, you identified gender as a confounder behind the apparent negative correlation between shoe size and the number of shoes owned. As you learned in Stat 200, another way to control for a possible confounding variable is by modeling. In this problem, you are going to compare the results of stratification and modeling.

The data you are going to analyze is from Stat 100’s Survey 1 in Spring 2017. It can be downloaded here. This webpage contains a description of the column variables. Save the csv file to the same directory as your R markdown file. Then load the file into R using the command

survey <- read.csv("Stat100_2017spring_survey01M2.csv")

a. (2 points) What is the correlation coefficient between the shoe size and number of shoes owned?

The correlation can be computed using the cor() function:

cor(survey$shoeSize, survey$shoeNums)
[1] -0.1859561

You can also wrap the call in with(survey, ...) if you don’t want to type the survey$ prefix:

with(survey, cor(shoeSize, shoeNums))
[1] -0.1859561

We see that the correlation coefficient is -0.1859561. This is a negative correlation, as we have seen before.

b. (6 points) Fit a linear model predicting the number of shoes from shoe size. What are the intercept and slope? Is the slope statistically significant? Make a scatter plot of shoe number versus shoe size and add the regression line to the plot.

# fit the linear model
fit_overall <- lm(shoeNums ~ shoeSize, data=survey)
summary(fit_overall)

Call:
lm(formula = shoeNums ~ shoeSize, data = survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.446  -6.514  -2.895   3.763  89.447 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   22.656      1.408  16.093  < 2e-16 ***
shoeSize      -1.052      0.156  -6.745 2.32e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.77 on 1270 degrees of freedom
Multiple R-squared:  0.03458,   Adjusted R-squared:  0.03382 
F-statistic: 45.49 on 1 and 1270 DF,  p-value: 2.325e-11

We see that the intercept is 22.6562 and the slope is -1.0524, consistent with the negative correlation we found in (a). The p-value of the slope is 2.32e-11, which is much smaller than 5%. This means that the slope is statistically significant.
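The link between the correlation in (a) and the regression in (b) can be checked numerically. The sketch below uses simulated stand-in data (the data frame toy and its coefficients are made up for illustration, not taken from the survey) to show two standard identities for a simple linear regression: the slope equals r × sd(y)/sd(x), and R-squared equals r squared.

```r
# Simulated stand-in data (NOT the Stat 100 survey); values are hypothetical
set.seed(1)
toy <- data.frame(shoeSize = rnorm(200, mean = 9, sd = 2))
toy$shoeNums <- 20 - 1 * toy$shoeSize + rnorm(200, sd = 10)

# Correlation coefficient and simple linear regression on the toy data
r <- with(toy, cor(shoeSize, shoeNums))
fit_toy <- lm(shoeNums ~ shoeSize, data = toy)

# Identity 1: slope = r * sd(y) / sd(x)
slope_from_r <- r * sd(toy$shoeNums) / sd(toy$shoeSize)
slope_from_lm <- unname(coef(fit_toy)["shoeSize"])

# Identity 2: R-squared = r^2
r_squared <- summary(fit_toy)$r.squared

print(c(slope_from_r = slope_from_r, slope_from_lm = slope_from_lm))
print(c(r_squared = r_squared, r_squared_from_cor = r^2))
```

The same check works on the survey output above: (-0.1859561)^2 is about 0.03458, which matches the Multiple R-squared reported by summary(fit_overall).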

Shown below is a plot of shoe number vs shoe size. The regression line is shown in red.

plot(shoeNums ~ shoeSize, data=survey, pch=16)
abline(fit_overall, col="red")

Stratification

We split the survey data into male and female groups via the logical vectors male and female:

male <- (survey$gender == "male")
female <- !male

Now we fit a linear model for each group using lm():

fit_male <- lm(shoeNums ~ shoeSize, data=survey, subset=male)
fit_female <- lm(shoeNums ~ shoeSize, data=survey, subset=female)

c. (2 points) What are the regression equations for the male and female groups?

Look at the results of the two regressions:

summary(fit_male)

Call:
lm(formula = shoeNums ~ shoeSize, data = survey, subset = male)

Residuals:
   Min     1Q Median     3Q    Max 
-8.545 -3.737 -1.907  1.077 92.093 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   2.9913     2.9543   1.013   0.3118  
shoeSize      0.4681     0.2757   1.698   0.0902 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.34 on 448 degrees of freedom
Multiple R-squared:  0.006393,  Adjusted R-squared:  0.004175 
F-statistic: 2.883 on 1 and 448 DF,  p-value: 0.09024

summary(fit_female)

Call:
lm(formula = shoeNums ~ shoeSize, data = survey, subset = female)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.346  -7.069  -2.143   3.931  80.951 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6819     2.2431   4.316 1.78e-05 ***
shoeSize      0.8516     0.2823   3.016  0.00264 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.57 on 820 degrees of freedom
Multiple R-squared:  0.01097,   Adjusted R-squared:  0.009766 
F-statistic: 9.097 on 1 and 820 DF,  p-value: 0.00264

From the output, we conclude that the regression equations for the two groups are:

Male: Shoe numbers = 2.9913 + 0.4681×(shoe size)

Female: Shoe numbers = 9.6819 + 0.8516×(shoe size)

(2 points) In which group(s) is the slope statistically significant?

The slope is significant if its p-value is smaller than 5%. The p-value for the slope is 9.02% in the male group and 0.264% in the female group. The slope is therefore statistically significant in the female group but not in the male group.
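The comparison against the 5% threshold can also be written out in R. This is a minimal sketch that copies the two p-values from the summary() output above; the names p_slope, alpha and significant are introduced here for illustration only.

```r
# p-values for the shoeSize slope, copied from the summary() output above
p_slope <- c(male = 0.0902, female = 0.00264)

# Significance level of 5%
alpha <- 0.05

# TRUE where the slope is statistically significant
significant <- p_slope < alpha
print(significant)

# With the fitted models available, the same p-values can be read from the
# coefficient table, e.g.:
#   coef(summary(fit_male))["shoeSize", "Pr(>|t|)"]
```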

(2 points) Use xyplot() to plot the shoe number vs shoe size for these two groups on the same graph and show the regression lines.

Below, the parameter layout=c(1,2) arranges the panels in 1 column and 2 rows. The regression lines are shown in red.

library(lattice)
xyplot(shoeNums ~ shoeSize | gender, data=survey, pch=16, layout=c(1,2), 
       panel = function(x, y, ...) {
       panel.xyplot(x, y, ...)
       panel.lmline(x, y, col = "red")
       })


Modeling

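Change the reference level of gender to "male", so that the gender coefficients reported by lm() describe the female group relative to the male group: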

# factor() is needed in R >= 4.0, where read.csv() no longer converts strings to factors
survey$gender <- relevel(factor(survey$gender), "male")
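To see what changing the reference level does, here is a small sketch on a made-up factor (the vector g is hypothetical, not the survey column). By default R orders factor levels alphabetically, so "female" would be the reference; relevel() moves "male" to the front.

```r
# A tiny hypothetical factor, standing in for the gender column
g <- factor(c("female", "male", "female", "male"))

# Default levels are alphabetical, so "female" is the reference level
print(levels(g))

# relevel() makes "male" the first (reference) level
g_releveled <- relevel(g, ref = "male")
print(levels(g_releveled))
```

With "male" as the reference, lm() creates the indicator genderfemale, which is 0 for males and 1 for females.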


d. (4 points) Fit a linear model predicting the number of shoes from shoe size and gender without an interaction term.

# model without interaction term
fit_noint <- lm(shoeNums ~ shoeSize + gender, data=survey)
summary(fit_noint)

Call:
lm(formula = shoeNums ~ shoeSize + gender, data = survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.145  -5.566  -1.869   2.855  92.113 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.7577     2.1835   0.347 0.728626    
shoeSize       0.6790     0.2011   3.377 0.000756 ***
genderfemale  10.2768     0.8136  12.631  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.15 on 1269 degrees of freedom
Multiple R-squared:  0.1424,    Adjusted R-squared:  0.141 
F-statistic: 105.4 on 2 and 1269 DF,  p-value: < 2.2e-16

From the output, we conclude that the regression equation is

Shoe numbers = 0.7577 + 0.679×(shoe size) + 10.2768×(genderfemale)

e. (5 points) Split the equation in (d) into male and female groups. Show your steps.

To obtain the equation for the males, we set genderfemale=0 in the equation in (d). This gives

Male: Shoe numbers = 0.7577 + 0.679×(shoe size)

For the females, we set genderfemale=1 in the equation in (d). This gives

Female: Shoe numbers = 0.7577 + 0.679×(shoe size) + 10.2768

which can be simplified to give

Female: Shoe numbers = 11.0345 + 0.679×(shoe size)
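The arithmetic of the split can be reproduced in R. The sketch below copies the coefficients from the summary(fit_noint) output above; the variable names are chosen here for illustration.

```r
# Coefficients copied from the summary(fit_noint) output above
b0 <- 0.7577         # (Intercept)
b_size <- 0.6790     # shoeSize
b_female <- 10.2768  # genderfemale

# genderfemale is a 0/1 indicator: 0 for males, 1 for females
intercept_male <- b0 + b_female * 0
intercept_female <- b0 + b_female * 1

# Both groups share the same slope in a model without interaction
print(c(male = intercept_male, female = intercept_female))
print(c(shared_slope = b_size))
```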

(1 point) Are the equations exactly the same as in part (c)?

Comparing the equations in (e) and (c), we see that they are different.

Without an interaction term, the model forces the two groups to share the same slope (0.679), so only the intercepts differ. In general, this is inconsistent with the equations obtained by fitting a linear model to each group separately.

f. (5 points) Fit a linear model predicting the number of shoes from shoe size and gender with an interaction term.

# Model with interaction term
fit_int <- lm(shoeNums ~ shoeSize*gender, data=survey)
summary(fit_int)

Call:
lm(formula = shoeNums ~ shoeSize * gender, data = survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.346  -5.513  -1.920   2.795  92.093 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)  
(Intercept)             2.9913     3.2112   0.932   0.3518  
shoeSize                0.4681     0.2997   1.562   0.1185  
genderfemale            6.6906     3.8670   1.730   0.0838 .
shoeSize:genderfemale   0.3834     0.4042   0.949   0.3430  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.15 on 1268 degrees of freedom
Multiple R-squared:  0.143, Adjusted R-squared:  0.141 
F-statistic: 70.53 on 3 and 1268 DF,  p-value: < 2.2e-16

From the output, we see that the regression equation is

Shoe numbers = 2.9913 + 0.4681×(shoe size) + 6.6906×(genderfemale) + 0.3834×(genderfemale)×(shoe size)

g. (5 points) Split the equation in (f) into male and female groups. Show your steps.

The male equation is obtained by setting genderfemale=0 in the equation in (f). The result is

Male: Shoe numbers = 2.9913 + 0.4681×(shoe size)

For the female equation, we set genderfemale=1 in the equation in (f). This gives

Female: Shoe numbers = 2.9913 + 0.4681×(shoe size) + 6.6906 + 0.3834×(shoe size)

which can be simplified to give

Female: Shoe numbers = 9.6819 + 0.8515×(shoe size)

(1 point) Are these equations exactly the same as part (c)?

We see that these equations match the equations calculated in part (c), as expected. The tiny difference in the slope of the female equation (0.8515 versus 0.8516) is caused by rounding the regression coefficients to 4 decimal places.
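The agreement between the interaction model and the separate fits is not special to this data set. The sketch below checks it on simulated stand-in data (the data frame toy and the numbers used to generate it are hypothetical, not the survey): the coefficients recovered from the interaction model equal those of the per-group fits up to floating-point error.

```r
# Simulated stand-in data (NOT the Stat 100 survey); values are hypothetical
set.seed(42)
n <- 300
toy <- data.frame(
  gender = factor(rep(c("male", "female"), each = n / 2),
                  levels = c("male", "female")),
  shoeSize = c(rnorm(n / 2, mean = 10.5, sd = 1.5),
               rnorm(n / 2, mean = 7.5, sd = 1.5))
)
toy$shoeNums <- ifelse(toy$gender == "male",
                       3 + 0.5 * toy$shoeSize,
                       10 + 0.9 * toy$shoeSize) + rnorm(n, sd = 5)

# Stratification: one model per group
fit_m <- lm(shoeNums ~ shoeSize, data = toy, subset = gender == "male")
fit_f <- lm(shoeNums ~ shoeSize, data = toy, subset = gender == "female")

# Modeling: one model with an interaction term
fit_both <- lm(shoeNums ~ shoeSize * gender, data = toy)
b <- coef(fit_both)

# Split the interaction model into (intercept, slope) for each group
male_from_int <- unname(c(b["(Intercept)"], b["shoeSize"]))
female_from_int <- unname(c(b["(Intercept)"] + b["genderfemale"],
                            b["shoeSize"] + b["shoeSize:genderfemale"]))

print(all.equal(unname(coef(fit_m)), male_from_int))
print(all.equal(unname(coef(fit_f)), female_from_int))
```

Only the coefficient estimates agree. The standard errors differ, as the outputs in (c) and (f) show (0.2757 versus 0.2997 for the male slope), because the interaction model estimates a single residual variance from both groups.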