R Markdown Exercise: Galton's Height Data

You now have experience creating an R Markdown document from scratch. For this exercise, we strongly recommend that you use this R Markdown template (right click the link and choose Save Link As...). Download the template and fill in the calculations as indicated in the file.

Sir Francis Galton (1822–1911) was an English statistician. He introduced many statistical concepts, such as correlation, quartiles, percentiles and regression, that are still in use today.

In this R Markdown exercise, you are going to analyze the famous Galton data on the heights of parents and their children. The data were collected in the late 19th century in England. Galton coined the term "regression towards mediocrity" to describe the result of his linear model. (Note that the paper was written in 1886. The "computer" mentioned in the paper was actually a person whose job was to do number crunching.) Surprisingly, Galton's analysis is still useful today (see, e.g., "Predicting height: the Victorian approach beats modern genomics" and "Predicting human height by Victorian and genomic methods").

Galton's height data can be downloaded here (right click and choose Save Link As...). The description of the data can be found on this webpage. Note that this is not a csv file. You need to use the read.table() function with appropriate parameters to load the data correctly into R.
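A minimal sketch of loading a delimited text file with read.table(). The file written below is an invented stand-in with three made-up rows, and the tab separator and header row are assumptions; check the data description for the actual layout before choosing the parameters.

```r
# Write a tiny, invented tab-delimited file so the example runs on its own.
# In the exercise, `path` would point to the downloaded data file instead.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Father\tMother\tGender\tHeight",
  "70.0\t65.0\tM\t71.0",
  "70.0\t65.0\tF\t66.0",
  "68.0\t63.0\tM\t69.0"
), path)

# header = TRUE: the first line holds the column names (assumed).
# sep = "\t":    fields are separated by tabs (assumed).
galton_demo <- read.table(path, header = TRUE, sep = "\t",
                          stringsAsFactors = FALSE)

# Inspect column names and types to confirm the file was parsed as intended.
str(galton_demo)
```

If str() shows a single column or character columns where numbers are expected, the separator or header setting probably does not match the file.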

  1. (1 point) Calculate the correlation matrix between 'Height', 'Father' and 'Mother'.
  2. (2 points) Use the pairs() function to create a matrix of scatterplots of the columns 'Height', 'Father' and 'Mother'. This is a graphical representation of the correlation matrix calculated above. (Hint: You need to subset the data frame to pull the 3 columns and then pass them to the pairs() function.)
  3. (5 points)
    (4 pts) Fit a multiple regression model predicting children's height ('Height') from father's height ('Father'), mother's height ('Mother'), and gender ('Gender'). In other words, the model should contain the following terms:
    $H_{\text{children}} = \beta_0 + \beta_1 f_{\text{gender}} + \beta_2 H_{\text{father}} + \beta_3 H_{\text{mother}}$,
    where $H_{\text{children}}$ is the predicted height (in inches) of the adult children, and $H_{\text{father}}$ and $H_{\text{mother}}$ are the heights (in inches) of the father and mother, respectively. $f_{\text{gender}}$ is a binary variable: $f_{\text{gender}} = 0$ for males and $f_{\text{gender}} = 1$ for females.
    (1 pt) Which slopes are significant (at the 5% level)?
  4. (2 points) Plot the residuals versus the fitted values for the multiple regression model above.
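The mechanics of items 1 to 4 can be sketched on simulated stand-in data as follows. The data frame `toy` is randomly generated (it is not Galton's data), and the gender codes "M" and "F" are an assumption about how the column is coded. By default R orders factor levels alphabetically, so "F" would become the reference level; relevel() is used here so that the dummy variable is 0 for males and 1 for females, matching the model stated above.

```r
# Simulated stand-in data (NOT Galton's measurements), heights in inches.
set.seed(1)
n <- 60
toy <- data.frame(
  Father = rnorm(n, mean = 69, sd = 2.5),
  Mother = rnorm(n, mean = 64, sd = 2.3),
  Gender = sample(c("M", "F"), n, replace = TRUE)
)
toy$Height <- 20 + 0.4 * toy$Father + 0.3 * toy$Mother -
  5 * (toy$Gender == "F") + rnorm(n, sd = 2)

# Correlation matrix and scatterplot matrix of the three height columns.
heights <- toy[, c("Height", "Father", "Mother")]
cor_mat <- cor(heights)
pairs(heights)

# Make "M" the reference level so the gender dummy is 0 for males, 1 for females.
toy$Gender <- relevel(factor(toy$Gender), ref = "M")
fit_multi <- lm(Height ~ Gender + Father + Mother, data = toy)
summary(fit_multi)  # coefficient table, including p-values for each slope

# Residuals versus fitted values, with a horizontal reference line at zero.
plot(fitted(fit_multi), resid(fit_multi),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

In the summary output the gender coefficient is labelled GenderF, which confirms that females are coded as 1.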

  5. Instead of fitting a multiple regression model, Galton constructed a simple model predicting children's height from parents' heights. However, he first had to deal with the difference between male and female heights.

  6. (4 points)
    (2 pts) Calculate the means of Father's and Mother's heights in the data set. Then show that Father's mean height is about 8% higher than Mother's mean height.
    (2 pts) Calculate the mean heights of the adult male and female children in the data set. Then show that male children's mean height is also about 8% higher than female children's mean height.
  7. (4 points)
    (2 pts) Calculate the medians of Father's and Mother's heights in the data set. Then show that Father's median height is about 8% higher than Mother's median height.
    (2 pts) Calculate the median heights of the adult male and female children in the data set. Then show that male children's median height is also about 8% higher than female children's median height.
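The percentage comparisons in items 6 and 7 reduce to a ratio of two summary statistics. The sketch below uses invented numbers, and the helper `pct_higher` and the gender codes "M" and "F" are hypothetical names chosen for illustration.

```r
# Invented heights in inches, for illustration only.
father <- c(70, 68, 72, 69)
mother <- c(64, 63, 66, 65)

# Percent by which a exceeds b: (a / b - 1) * 100.
pct_higher <- function(a, b) (a / b - 1) * 100

pct_higher(mean(father), mean(mother))
pct_higher(median(father), median(mother))

# For children, one height column has to be split by gender first.
child <- data.frame(
  Height = c(70, 64, 69, 65, 71, 63),
  Gender = c("M", "F", "M", "F", "M", "F")
)
child_means <- tapply(child$Height, child$Gender, mean)
pct_higher(child_means[["M"]], child_means[["F"]])
```

Replacing mean with median in the tapply() call gives the corresponding comparison for medians.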

  8. Galton defined the mid-parental height as the average of the father's height and the scaled mother's height:
    $H_{\text{midparental}} = (H_{\text{father}} + 1.08\, H_{\text{mother}})/2$,
    where the factor 1.08 was introduced to account for the gender difference. He also "transmuted" the heights of all female children to the male equivalents by multiplying the female heights by 1.08. He then fitted a model predicting children's adjusted height from the mid-parental height.
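As a worked example of the two adjustments just described, the following sketch wraps them in small functions. The names `midparent` and `adjust_height`, and the gender codes "M" and "F", are hypothetical choices for illustration; the input heights are invented.

```r
# Mid-parental height: the mother's height is scaled by 1.08 before averaging.
midparent <- function(father, mother) (father + 1.08 * mother) / 2

# Adjusted ("transmuted") child height: females scaled by 1.08, males unchanged.
adjust_height <- function(height, gender) {
  ifelse(gender == "F", 1.08 * height, height)
}

midparent(70, 65)                       # (70 + 70.2) / 2 = 70.1
adjust_height(c(70, 65), c("M", "F"))   # 70.0 70.2
```

Both functions are vectorized, so they can be applied directly to whole data frame columns.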

  9. (4 points)
    (1 pt) Add a column to the data frame that stores the mid-parental heights.
    (2 pts) Add another column to the data frame that stores the adjusted heights of the children: the adjusted heights of the male children are the same as their heights; the adjusted heights of the female children are equal to their heights times 1.08.
    (1 pt) Calculate the correlation coefficient between the children's adjusted height and the mid-parental height.
  10. (4 points)
    (2 pts) Fit a simple regression model predicting children's adjusted height from the mid-parental height.
    (2 pts) Make a scatter plot of children's adjusted height vs the mid-parental height and then add the regression line on the plot.
  11. (2 points) Plot the residuals versus the fitted values for the simple regression model above.
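Items 9 to 11 follow a common pattern: derive columns, fit a one-predictor model, then plot. A sketch on simulated stand-in data is shown below. The data frame `toy` is randomly generated (not Galton's data), and the column names `MidParent` and `AdjHeight` are hypothetical.

```r
# Simulated stand-in data (NOT Galton's measurements), heights in inches.
set.seed(2)
n <- 60
toy <- data.frame(
  Father = rnorm(n, mean = 69, sd = 2.5),
  Mother = rnorm(n, mean = 64, sd = 2.3),
  Gender = sample(c("M", "F"), n, replace = TRUE)
)
toy$Height <- 20 + 0.4 * toy$Father + 0.3 * toy$Mother -
  5 * (toy$Gender == "F") + rnorm(n, sd = 2)

# Derived columns: mid-parental height and gender-adjusted child height.
toy$MidParent <- (toy$Father + 1.08 * toy$Mother) / 2
toy$AdjHeight <- ifelse(toy$Gender == "F", 1.08 * toy$Height, toy$Height)

# Correlation coefficient and simple linear regression.
r <- cor(toy$AdjHeight, toy$MidParent)
fit_simple <- lm(AdjHeight ~ MidParent, data = toy)

# Scatter plot with the fitted regression line.
plot(toy$MidParent, toy$AdjHeight,
     xlab = "Mid-parental height (in)", ylab = "Adjusted height (in)")
abline(fit_simple)

# Residuals versus fitted values.
plot(fitted(fit_simple), resid(fit_simple),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

For a least-squares fit with a single predictor, the model's R-squared equals the squared correlation coefficient, which gives a quick consistency check between the two results.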

  12. How does the simple regression model in item 10 compare with the multiple regression model in item 3? One measure of the "goodness of fit" is $R^2$. However, comparing the $R^2$ returned by the model in item 3 with the $R^2$ of the model in item 10 is misleading because their predicted variables are different. In item 3, the predicted variable is children's height, whereas in item 10 the predicted variable is children's adjusted height. To have a fair comparison, we want to calculate the $R^2$ of the model in item 3 for the adjusted height and then compare it with the $R^2$ in item 10.

  13. (8 points)
    1. (2 pts) Calculate the predicted values of children's adjusted height from the multiple regression model by multiplying the predicted heights by 1.08 for female children and keeping the predicted heights of the male children unchanged. Store the result in a new variable.
    2. (5 pts) Calculate $R^2$ for the adjusted heights of the model in item 3 as $R^2_{AH} = 1 - \mathrm{SSE}_{AH}/\mathrm{SST}_{AH}$, where $\mathrm{SSE}_{AH} = \sum (AH - AH_{\text{predicted}})^2$ and $\mathrm{SST}_{AH} = \sum (AH - \overline{AH})^2 = (n-1)\, s^2_{AH}$. Here $AH$ denotes the actual adjusted heights of the Galton children calculated in item 9 above, $AH_{\text{predicted}}$ denotes the predicted adjusted heights calculated in part 1 above, $\overline{AH}$ is the mean of the adjusted heights, $s^2_{AH}$ is the sample variance of the adjusted heights, and $n$ is the total number of observations in the dataset.
    3. (1 pt) Based on the values of $R^2$ for the adjusted height, is the multiple regression model in item 3 much better than the simple regression model in item 10?
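The sum-of-squares calculation for the adjusted heights can be sketched as follows, again on simulated stand-in data rather than Galton's. The variable names (`pred_adj`, `sse_ah`, `sst_ah`, `r2_ah`) and the gender codes "M" and "F" are hypothetical.

```r
# Simulated stand-in data (NOT Galton's measurements), heights in inches.
set.seed(3)
n <- 60
toy <- data.frame(
  Father = rnorm(n, mean = 69, sd = 2.5),
  Mother = rnorm(n, mean = 64, sd = 2.3),
  Gender = sample(c("M", "F"), n, replace = TRUE)
)
toy$Height <- 20 + 0.4 * toy$Father + 0.3 * toy$Mother -
  5 * (toy$Gender == "F") + rnorm(n, sd = 2)
toy$AdjHeight <- ifelse(toy$Gender == "F", 1.08 * toy$Height, toy$Height)

# Multiple regression on the unadjusted heights.
fit_multi <- lm(Height ~ Gender + Father + Mother, data = toy)

# Convert the model's predictions to the adjusted-height scale.
pred_height <- fitted(fit_multi)
pred_adj <- ifelse(toy$Gender == "F", 1.08 * pred_height, pred_height)

# R-squared on the adjusted-height scale: 1 - SSE / SST.
sse_ah <- sum((toy$AdjHeight - pred_adj)^2)
sst_ah <- sum((toy$AdjHeight - mean(toy$AdjHeight))^2)
r2_ah <- 1 - sse_ah / sst_ah

# SST can equivalently be computed from the sample variance.
all.equal(sst_ah, (n - 1) * var(toy$AdjHeight))
```

Note that `r2_ah` is computed by hand because summary(fit_multi)$r.squared refers to the unadjusted heights, which is a different response variable.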

    Alan is a boy born in Guatemala. Carly is a girl born in India. They are both two years old. The heights of Alan's father and mother are 62 inches and 58 inches, respectively. The heights of Carly's father and mother are 68 inches and 65 inches, respectively.

  14. (4 points) Use the multiple regression model above to predict the adult heights of Alan and Carly.
  15. (4 points) Use the simple regression model above to predict the adult heights of Alan and Carly.
    Note: You'll need to convert the predicted adjusted height back to height for Carly.
  16. (2 points) Explain why the multiple and simple regression models above may not be suitable for predicting the adult heights for Alan and Carly.
    Hint: Watch this video for a similar question, or read Example 1e in these Stat 100 notes, or the bottom of p. 35 in the Fall 2017 Stat 200 notebook for two other similar questions.
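The mechanics of predicting for new individuals with predict() can be sketched as below. The models here are fitted to simulated stand-in data, so the numbers they produce are meaningless; only the parents' heights for Alan and Carly come from the text above. The column names `MidParent` and `AdjHeight` and the gender codes "M" and "F" are hypothetical.

```r
# Simulated stand-in data (NOT Galton's measurements), heights in inches.
set.seed(4)
n <- 60
toy <- data.frame(
  Father = rnorm(n, mean = 69, sd = 2.5),
  Mother = rnorm(n, mean = 64, sd = 2.3),
  Gender = sample(c("M", "F"), n, replace = TRUE)
)
toy$Height <- 20 + 0.4 * toy$Father + 0.3 * toy$Mother -
  5 * (toy$Gender == "F") + rnorm(n, sd = 2)
toy$Gender <- relevel(factor(toy$Gender), ref = "M")
toy$MidParent <- (toy$Father + 1.08 * toy$Mother) / 2
toy$AdjHeight <- ifelse(toy$Gender == "F", 1.08 * toy$Height, toy$Height)

fit_multi  <- lm(Height ~ Gender + Father + Mother, data = toy)
fit_simple <- lm(AdjHeight ~ MidParent, data = toy)

# New data must use the same column names (and factor levels) as the fit.
kids <- data.frame(
  Name   = c("Alan", "Carly"),
  Gender = factor(c("M", "F"), levels = levels(toy$Gender)),
  Father = c(62, 68),
  Mother = c(58, 65)
)
kids$MidParent <- (kids$Father + 1.08 * kids$Mother) / 2

# Multiple regression: predictions are already on the height scale.
pred_multi <- predict(fit_multi, newdata = kids)

# Simple regression: predictions are adjusted heights, so a female child's
# prediction is divided by 1.08 to convert it back to height.
pred_adj    <- predict(fit_simple, newdata = kids)
pred_simple <- ifelse(kids$Gender == "F", pred_adj / 1.08, pred_adj)
```

Comparing the new parents' heights with range(toy$Father) and range(toy$Mother) shows whether a prediction falls inside or outside the range of the data used for the fit.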

Guidelines


Solution