R Markdown Exercise: Shoe Number vs Shoe Size Confounder Problem Revisited

In Week 3's confounder problem, you analyzed the Stat 100 survey data from Spring 2013. Using stratification, you identified gender as a confounder of the apparent negative correlation between shoe size and the number of shoes owned. As you learned in Stat 200, another way to control for a possible confounding variable is modelling. In this problem, you will compare the results of stratification and modelling. The data set you will analyze is from Stat 100's Survey 1 in Spring 2017. It can be downloaded here. This webpage contains a description of the column variables. Save the csv file in the same directory as your R markdown file. Then load the file into R using the following command (replace "filename" with the actual file name):

survey <- read.csv("filename")

(a) (2 points) What is the correlation coefficient between the shoe size and number of shoes owned?
(b) (6 points) Fit a linear model predicting the number of shoes from shoe size. What are the intercept and slope? Is the slope statistically significant? Make a scatter plot of shoe number versus shoe size and add the regression line to the plot.
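
The functions involved in these two questions can be tried out first on made-up data. The sketch below is only an illustration: the data frame toy and every number in it are synthetic assumptions, and only the column names shoeSize, shoeNums and gender are taken from the commands used in this exercise.

```r
# Synthetic stand-in for the survey; all numbers are made up for illustration.
set.seed(1)
n <- 200
gender <- sample(c("female", "male"), n, replace = TRUE)
shoeSize <- ifelse(gender == "male", rnorm(n, 10.5, 1.2), rnorm(n, 8, 1.0))
shoeNums <- pmax(0, round(ifelse(gender == "male", 8, 25) + rnorm(n, 0, 5)))
toy <- data.frame(gender, shoeSize, shoeNums)

# Pearson correlation between the two numeric columns
r <- cor(toy$shoeSize, toy$shoeNums)
r

# Simple linear regression of shoe number on shoe size
fit <- lm(shoeNums ~ shoeSize, data = toy)
coef(fit)                   # intercept and slope
summary(fit)$coefficients   # estimates, standard errors, t values, p-values

# Scatter plot with the fitted line overlaid
plot(shoeNums ~ shoeSize, data = toy,
     xlab = "Shoe size", ylab = "Number of shoes")
abline(fit, col = "red")
```

The p-value for the slope is in the shoeSize row of the coefficient table, under the column Pr(>|t|).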

Stratification

In Week 3's problem, you split the survey data frame into male and female groups, creating two data frames, and fitted a linear model to each group separately. Here we introduce another way to do the same calculation. First, create two logical vectors, male and female, using the following commands:

male <- (survey$gender == "male")
female <- !male

The object male is a logical vector. Its elements are TRUE for observations whose gender is "male" and FALSE for those whose gender is "female". The logical vector female is just the opposite of male. As you have learned before, these logical vectors can be used to subset the data frame. For example, survey[male,] creates a data frame with only the "male" observations. However, we don't need to create a new data frame to fit a regression. The lm() function has an optional parameter subset that specifies which observations are used in the fitting process. Therefore, we can fit separate regression lines for the male and female groups using the following commands.

fit_male <- lm(shoeNums ~ shoeSize, data=survey, subset=male)
fit_female <- lm(shoeNums ~ shoeSize, data=survey, subset=female)
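
To see that the subset argument is equivalent to subsetting the data frame first, the two routes can be compared on made-up data. This is a minimal sketch: the data frame toy and its values are synthetic assumptions, not the survey data.

```r
# Synthetic data for illustration only
set.seed(2)
n <- 100
toy <- data.frame(
  gender   = sample(c("female", "male"), n, replace = TRUE),
  shoeSize = rnorm(n, 9, 1.5)
)
toy$shoeNums <- 10 + 0.5 * toy$shoeSize + rnorm(n, 0, 3)

male <- (toy$gender == "male")

# Route 1: keep the full data frame and pass the logical vector to subset=
fit_subset <- lm(shoeNums ~ shoeSize, data = toy, subset = male)

# Route 2: build a smaller data frame first, as in Week 3
fit_frame <- lm(shoeNums ~ shoeSize, data = toy[male, ])

# Both routes fit the same rows, so the coefficients agree
all.equal(coef(fit_subset), coef(fit_frame))
```

The subset route avoids keeping extra copies of the data, which is convenient when several groups are fitted in turn.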

(c) (2 points) What are the regression equations for the male and female groups? Express the equations in the following form.
Male: Shoe numbers = (some number) + (some number)×(shoe size)
Female: Shoe numbers = (some number) + (some number)×(shoe size)

(2 points) Use xyplot() to plot the shoe number vs shoe size for these two groups on the same graph and show the regression lines.
(2 points) In which group(s) is the slope statistically significant?
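
One possible way to draw both groups with their regression lines in a single xyplot() call is sketched below on made-up data. It assumes the lattice package, which provides xyplot(); the data frame toy and its values are synthetic.

```r
library(lattice)  # provides xyplot()

# Synthetic data for illustration only
set.seed(3)
n <- 120
toy <- data.frame(
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)
toy$shoeSize <- rnorm(n, ifelse(toy$gender == "male", 10.5, 8), 1)
toy$shoeNums <- rnorm(n, ifelse(toy$gender == "male", 8, 25), 4)

# groups= distinguishes the points by gender; type = c("p", "r") draws the
# points plus a separate least-squares line for each group
p <- xyplot(shoeNums ~ shoeSize, data = toy, groups = gender,
            type = c("p", "r"), auto.key = TRUE,
            xlab = "Shoe size", ylab = "Number of shoes")
print(p)  # explicit print() is needed when the call is inside a function or loop
```

Using the conditioning form shoeNums ~ shoeSize | gender instead would place the two groups in separate panels rather than on the same axes.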

Modelling

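Now you will analyze the problem by fitting linear models predicting shoe numbers from shoe size and gender. Note that gender is a categorical variable with two levels: female and male. R uses the first level of a factor as the reference level, and by default the levels are ordered alphabetically, so female is the reference. To be consistent with the coding in the Stat 200 notes, we want to change the reference level to male. This can be done using the command below. The call to factor() is needed in R 4.0 and later, where read.csv() loads text columns as character vectors rather than factors; it is harmless in earlier versions.

survey$gender <- relevel(factor(survey$gender), ref = "male")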
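
The effect of relevel() can be checked on a small made-up vector. In this sketch the values in g and y are invented for illustration; the point is only how the level order determines the name of the dummy variable that lm() creates.

```r
# Made-up gender values, stored as character
g <- c("female", "male", "female", "male", "male")

f <- factor(g)
levels(f)    # alphabetical order, so "female" is the reference level

f2 <- relevel(f, ref = "male")
levels(f2)   # "male" now comes first and acts as the reference level

# With "male" as the reference, lm() names the dummy variable 'genderfemale'
toy <- data.frame(gender = f2, y = c(20, 5, 25, 8, 6))
names(coef(lm(y ~ gender, data = toy)))
```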

(d) (4 points) Fit a linear model predicting the number of shoes from shoe size and gender without an interaction term. Express the result in an equation as follows.
Shoe numbers = (some number) + (some number)×(shoe size) + (some number)×(genderfemale)
Note that 'genderfemale' is a binary variable: genderfemale=0 for males and genderfemale=1 for females.
(e) (5 points) Split the equation in (d) into male and female groups. Show your steps. Express your equations in the same form as part (c). Note: Do the splitting the same way as you do in Stat 200 (also in this week's notes). You don't need to use any R commands.
(1 point) Are the equations exactly the same as in part (c)?
(f) (5 points) Fit a linear model predicting the number of shoes from shoe size and gender with an interaction term. Express the result in an equation as follows.
Shoe numbers = (some number) + (some number)×(shoe size) + (some number)×(genderfemale) + (some number)×(genderfemale)×(shoe size)
(g) (5 points) Split the equation in (f) into male and female groups. Show your steps. Write the equations in the same form as in part (c). Note: Do the splitting the same way as you do in Stat 200 (also in this week's notes). You don't need to use any R commands.
(1 point) Are the equations exactly the same as in part (c)?
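
Although the splitting is meant to be done by hand, the arithmetic can be checked in R. The sketch below uses made-up data (the data frame toy and its values are synthetic assumptions) and shows how the coefficients of a model with an interaction term combine into one line per group: setting genderfemale to 0 gives the male line, and setting it to 1 gives the female line. For a model without an interaction term the same steps apply, with no extra term added to the slope.

```r
# Synthetic data for illustration only; "male" is the reference level
set.seed(4)
n <- 150
toy <- data.frame(
  gender = relevel(factor(sample(c("female", "male"), n, replace = TRUE)),
                   ref = "male")
)
toy$shoeSize <- rnorm(n, ifelse(toy$gender == "male", 10.5, 8), 1)
toy$shoeNums <- rnorm(n, ifelse(toy$gender == "male", 8, 25), 4)

# shoeSize * gender expands to shoeSize + gender + shoeSize:gender
fit_int <- lm(shoeNums ~ shoeSize * gender, data = toy)
b <- coef(fit_int)

# genderfemale = 0: only the intercept and the shoeSize term remain
male_line <- c(intercept = b[["(Intercept)"]],
               slope     = b[["shoeSize"]])

# genderfemale = 1: the gender term adds to the intercept and the
# interaction term adds to the slope
female_line <- c(intercept = b[["(Intercept)"]] + b[["genderfemale"]],
                 slope     = b[["shoeSize"]] + b[["shoeSize:genderfemale"]])

male_line
female_line
```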

Guidelines

Write down your name in the R markdown file.
Generate an HTML file from the markdown file using knitr.
You should download the survey data to your computer and then load it into R, instead of loading it directly from the website. This is for the purpose of reproducibility: the file on the remote website may be changed or removed later, in which case your R markdown file could no longer be run. By saving it on your computer in the same location as the markdown file, you can be sure that your markdown result is reproducible.
Show all code and output. If your code chunk contains more than a few lines, include a brief explanation of what you are doing unless you write your code in a self-explanatory style.
Label and state the answers to each question clearly. Don't just show the code and say that the information is in the output.

Solution

R Markdown file (Download the file and save it in the same folder as the data file you downloaded above. Open it with RStudio and then click "Knit".)

Knitted html file