Sir Francis Galton (1822–1911) was an English statistician. He founded many concepts in statistics, such as correlation, quartile, percentile and regression, that are still being used today. In this R markdown exercise, you are going to analyze the famous Galton data on the heights of parents and their children. The data were collected in the late 19th century in England. He coined the term regression towards mediocrity to describe the result of his linear model. (Note that the paper was written in 1886. The "computer" mentioned in the paper was actually a person whose job was to do number crunching.) Surprisingly, Galton's analysis is still useful today (see e.g. Predicting height: the Victorian approach beats modern genomics, Predicting human height by Victorian and genomic methods). Galton's height data can be download here (right click and choose Save Link As...). The description of the data can be found on this webpage. Note that this is not a csv file. You need to use the `read.table()` function with appropriate parameters to load the data correctly to R. ### a. (1 point) **Calculate the correlation matrix between `Height`, `Father` and `Mother`.** Show how you load the data file. Then show your code for the correlation matrix and the result here... ### b. (2 points) **Use the `pairs()` function to create a matrix of scatterplots of the columns `Height`, `Father` and `Mother`. This is a graphical representation of the correlation matrix calculated above. (Hint: You need to subset the data frame to pull the 3 columns and then pass them to the `pairs()` function.)** Show your code and plot here... ### c. (5 points) **Fit a multiple regression model predicting children's height (`Height`) from father's height (`Father`), mother's height (`Mother`), and gender (`Gender`). In other words, the model should contain the following terms:** $$\hat{H}_{children} = \beta_0 + \beta_1 f_{gender} + \beta_2 H_{father} + \beta_3 H_{mother} ,$$ **where $\hat{H}_{children}$ is the predicted height (in inches) of the adult children, H_father and H_mother are the height (in inches) of the father and mother, respectively. f_gender is a binary variable: f_gender=0 for males and f_gender=1 for females. (4 pts)** Show your code and result here... **Which slopes are significant (at the 5% level)? (1 pt)** Write your answer here... ### d. (2 points) **Plot the residuals versus the fitted values for the multiple regression model above.** Show your code and plot here...
**Instead of fitting a multiple regression model, Galton constructed a simple model predicting children's height from parents' heights. However, he first had to deal with the gender difference between male and female heights.**
### e. (4 points) **Calculate the means of Father's and Mother's heights in the data set. Then show that Father's mean height is about 8% higher than Mother's mean height. (2 pts)** Show your calculation here... **Calculate the mean heights of the adult male and female children in the data set. Then show that male children's mean height is also about 8% higher than female children's mean height. (2 pts)** Show your calculation here... ### f. (4 points) **Calculate the medians of Father's and Mother's heights in the data set. Then show that Father's median height is about 8% higher than Mother's median height. (2 pts)** Show your calculation here... **Calculate the median heights of the adult male and female children in the data set. Then show that male children's median height is also about 8% higher than female children's median height. (2 pts)** Show your calculation here...
**Galton defined the mid-parental height as the average of the Father's and Mother's height:** $$H_{midparental} = \frac{1}{2} (H_{father} + 1.08 H_{mother}) ,$$ **where the factor 1.08 was introduced to account for the gender difference. He also "transmuted" the heights of all female children to the male equivalents by multiplying the female heights by 1.08. He then fitted a model predicting children's adjusted height from the mid-parental height.**
### g. (4 points) **Add a column to the data frame that stores the mid-parental heights. (1 pt)** Show your code here... **Add another column to the data frame that stores the adjusted heights of the children: the adjusted heights of the male children are the same as their heights; the adjusted heights of the female children are equal to their heights times 1.08. (2 pts)** Show your code here... **Calculate the correlation coefficient between the children's adjusted height and the mid-parental height (1 pt)** Show your calculation here... ### h. (4 points) **Fit a simple regression model predicting children's adjusted height from the mid-parental height. (2 pts)** Show your code and result here... **Make a scatter plot of children's adjusted height vs the mid-parental height and then add the regression line on the plot. (2 pts)** Show your code and plot here... ### i. (2 points) **Plot the residuals versus the fitted values for the simple regression model above.** Show your code and plot here...
**How does the simple regression model in (h) compare with the multiple regression model in (c)? One measure of the "goodness of fit" is R². However, comparing R² returned by the model in (c) and R² of the model in (h) is misleading because their predicted variables are different. In (c), the predicted variable is children's height, whereas in (h) the predicted variable is children's adjusted height. To have a fair comparison, we want to calculate the R² of the model in (c) for the adjusted height and then compare it with the R² in (h).**
### j. (8 points) **1. (2 points) Calculate the predicted values of children's adjusted height from the multiple regression model by multiplying the predicted heights by 1.08 for female children and keeping the predicted heights of the male children unchanged. Store the result in a new variable.** Show your code here... **2. (5 points) Calculate R² for the adjusted heights of the model in (c) by $R^2_{AH} = 1 - SSE_{AH}/SST_{AH}$, where $SSE_{AH}=\sum (AH-\hat{AH})^2$ and $SST_{AH}=\sum (AH-\overline{AH})^2 = (n-1) s^2_{AH}$. Here $AH$ is the actual adjusted heights of the Galton children calculated in (h) above, $\hat{AH}$ is the predicted adjusted heights calculated in (j1) above, $\overline{AH}$ is the mean of the adjusted height, $s^2_{AH}$ is the sample variance of the adjusted height, and $n$ is the total number of observations in the dataset.** Show your calculation here... **3. (1 point) Based on the values of the R² for the adjusted height, is the multiple regression model in (c) much better than the simple regression model in (h)?** Write your answer here...
**Alan is a boy born in Guatemala. Carly is a girl born in India. They are both two years old. The heights of Alan's father and mother are 62 inches and 58 inches, respectively. The heights of Carly's father and mother are 68 inches and 65 inches, respectively.**
### k. (4 points) **Use the multiple regression model above to predict the height of Alan and Carly when they become adults.** Show your code and result here... ### l. (4 points) **Use the simple regression model above to predict the height of Alan and Carly when they become adults.**
**Note: You'll need to convert the predicted adjusted height back to height for Carly.** Show your calculation here... ### m. (2 points) **Explain why the multiple and simple regression models above may not be suitable for predicting the adult heights for Alan and Carly.**
**Hint: Watch this video for a similar question, or read the bottom of P.35 in the Fall 2017 Stat 200 notebook for two other similar questions.** Write your answer here...