In this problem, you will analyze the Stat 100 Survey 3 data from Fall 2015. The csv file can be downloaded here. The column variables are described on this webpage. The purpose of this exercise is to find out whether there is any association between happiness (a person’s subjective well-being) and a person’s temperament (introvert/extrovert/ambivert).

a. (2 points)

Create box plots of happiness for the introverts, extroverts and ambiverts. Add the group means to the box plots.

# Load the data into R:
survey <- read.csv("Stat100_2015fall_survey03.csv")

# Create box plots
plot(happiness ~ temperament, data=survey, las=1)
# Calculate group means
group_means <- tapply(survey$happiness, survey$temperament, mean)
# Add group means with red points
points(group_means, col="red", pch=16)

b. (3 points)

Perform an F-test to determine if there are any significant differences on the reported scale of happiness among introverts, extroverts and ambiverts. (2 pts)

The R command for the F-test is

summary(aov(happiness ~ temperament, data=survey))
              Df Sum Sq Mean Sq F value   Pr(>F)    
temperament    2    128   64.05   13.64 1.43e-06 ***
Residuals   1042   4894    4.70                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the p-value and state your conclusion. (1 pt)

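From the output, the p-value is \(1.43\times10^{-6}\). This is far below 0.05, so we reject the null hypothesis that the three groups have the same mean happiness: at least one group mean differs significantly from the others.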
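
As a rough check, the F value and the fraction of variance explained by temperament can be recovered by hand from the table above. The symbols \(\mathrm{SS}\) for the sums of squares are notation introduced here, and small discrepancies come from the rounding in the printed output:

\[
F = \frac{\mathrm{SS}_{\text{temperament}}/2}{\mathrm{SS}_{\text{residuals}}/1042} \approx \frac{64.05}{4.70} \approx 13.6,
\qquad
R^2 = \frac{\mathrm{SS}_{\text{temperament}}}{\mathrm{SS}_{\text{temperament}} + \mathrm{SS}_{\text{residuals}}} \approx \frac{128}{128 + 4894} \approx 0.0255 .
\]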

c. (4 points)

Perform pairwise t-tests with Bonferroni correction to adjust the p-values. (2 pts)

Pairwise t-tests with Bonferroni correction:

pairwise.t.test(survey$happiness, survey$temperament, p.adjust.method="bonferroni")

    Pairwise comparisons using t tests with pooled SD 

data:  survey$happiness and survey$temperament 

          Ambivert Extrovert
Extrovert 0.0164   -        
Introvert 0.0016   7.4e-07  

P value adjustment method: bonferroni 

Determine from the adjusted p-values which pairs of groups show significant differences at the 5% level. (2 pts)

All three adjusted p-values (0.0164, 0.0016 and \(7.4\times10^{-7}\)) are below 0.05, so every pair of groups (ambivert vs. extrovert, ambivert vs. introvert, and extrovert vs. introvert) differs significantly at the 5% level.
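
The Bonferroni correction multiplies each unadjusted p-value by the number of comparisons (three here) and caps the result at 1. The following is a minimal sketch that checks this rule against R's p.adjust(). The vector p_raw holds made-up numbers for illustration, not values from the survey:

```r
# Hypothetical unadjusted p-values for three pairwise comparisons
# (illustrative numbers only, not taken from the survey output)
p_raw <- c(0.005, 0.02, 0.4)

# Bonferroni by hand: multiply by the number of comparisons, cap at 1
p_manual <- pmin(1, p_raw * length(p_raw))

# The same adjustment using the built-in function
p_builtin <- p.adjust(p_raw, method = "bonferroni")

print(p_manual)
print(p_builtin)
```

Both vectors should agree, which shows that the adjusted p-values reported by pairwise.t.test() are simply three times the unadjusted ones (up to the cap at 1).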

d. (5 points)

Perform a randomization test by scrambling the happiness variable and calculating the value of \(R^2\) for the scrambled data. Repeat the experiment at least 5000 times. (3 pts)

Note: The result of your randomization test must be reproducible. Therefore, you must set a seed number before calling any function involving random numbers. Use set.seed(your UIN number). You can simply follow the procedure in this week’s notes. You are not required to optimize the code.

Following the same method as in this week’s notes, the code is as follows.

# Define a function that computes R^2
computeR2 <- function(y,x) {
  summary(lm(y~x))$r.squared
}

# Compute the original R^2
(R20 <- computeR2(survey$happiness, survey$temperament))
[1] 0.02550464
# Perform randomization test. Do 5000 experiments 
set.seed(69678689)
R2 <- replicate(5000, computeR2(sample(survey$happiness),survey$temperament))

Make a histogram of the values of \(R = \sqrt{R^2}\) from the scrambled data and indicate the position of the original R (from the unscrambled data). (2 pts)

hist(sqrt(R2), freq=FALSE,breaks=50, xlim=c(0,0.2), xlab="R")
# Add a vertical line at the position of the original R
abline(v=sqrt(R20), col="red")

e. (3 points)

Use the result of (d) to estimate the p-value. How does this estimated p-value compare to the one computed in part (b)?

The estimated p-value is the fraction of the values in R2 greater than R20:

(p_estimate <- mean(R2>R20))
[1] 0

The value is 0. This means that none of the 5000 values in R2 is greater than R20. So the estimated p-value is less than 1/5000, or \(2\times10^{-4}\). This is consistent with the p-value \(1.43\times10^{-6}\) calculated from the F statistic in part (b).
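
An estimate of exactly 0 cannot be the true p-value, since the unscrambled data is itself one possible arrangement. One common convention, which is not necessarily the one used in this week's notes, counts the observed statistic as one of the arrangements and reports (b + 1)/(N + 1), where b is the number of scrambled values at least as large as the observed one and N is the number of experiments. A sketch under these assumptions follows. The helper perm_pvalue is hypothetical, and the vector of zeros only stands in for 5000 scrambled \(R^2\) values that all fall below the observed one:

```r
# Hypothetical helper: randomization p-value with the "+1" convention
perm_pvalue <- function(stat_perm, stat_obs) {
  (sum(stat_perm >= stat_obs) + 1) / (length(stat_perm) + 1)
}

# Stand-in for the 5000 scrambled R^2 values, none of which
# reaches the observed value 0.02550464 from part (d)
R2_standin <- rep(0, 5000)

p_plus1 <- perm_pvalue(R2_standin, 0.02550464)
print(p_plus1)
```

With no scrambled value reaching the observed one, this gives 1/5001, which is just under \(2\times10^{-4}\) and leads to the same conclusion as above.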