We will introduce several commands in R that perform statistical tests. It should be pointed out that there are many options in these commands that can be used to do more sophisticated analyses, but we will only cover topics that have been introduced in Stat 100 and Stat 200.
We will use Examples 2 and 3 in Chapter 22, part 1 of the Stat 100 notes to demonstrate the tests. (See this video for a review of the \(\chi^2\) goodness-of-fit test.)
Do people’s astrological signs affect their success? Fortune magazine published the zodiac signs of the 264 top executives of Fortune-400 companies. Here are the observed frequencies.
Observed <- c(23, 25, 30, 24, 20, 19, 23, 20, 20, 19, 21, 20)
names(Observed) <- c("Capricorn", "Aquarius", "Pisces", "Aries", "Taurus",
"Gemini", "Cancer", "Leo", "Virgo", "Libra",
"Scorpio", "Sagittarius")
Observed
Capricorn Aquarius Pisces Aries Taurus Gemini
23 25 30 24 20 19
Cancer Leo Virgo Libra Scorpio Sagittarius
23 20 20 19 21 20
To answer the question, we perform a \(\chi^2\) goodness-of-fit test. The null hypothesis is that people’s astrological signs do not affect their success. The differences in the observed frequencies are simply due to chance variation. The command chisq.test()
can be used to do the test. The syntax is
chisq.test(x, p = rep(1/length(x), length(x)))
Here x
is a numeric vector storing the observed frequencies. The parameter p
is a numeric vector of the same length as x
, giving the expected probabilities. The default value of p
are set to rep(1/length(x), length(x))
, which simply means that the expected probability is the same for each element in x
. In our case, x
is Observed
and we expect that each zodiac sign is equally likely under the null hypothesis, so we can use the default value of p
.
chisq.test(Observed)
Chi-squared test for given probabilities
data: Observed
X-squared = 5.1818, df = 11, p-value = 0.922
The test gives \(\chi^2\) = 5.1818 with df = 11, the same as the values calculated in the Stat 100 notes. The large p-value of 0.922 indicates that the null is not rejected. So the conclusion is that there is no evidence to indicate that people’s astrological signs affect their success.
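As a quick check, we can reproduce these numbers from the goodness-of-fit formula \(\chi^2 = \sum (Obs-Exp)^2/Exp\) together with the pchisq() function introduced in Week 3. This is only a sketch to show where the numbers come from; chisq.test() does all of this for us.
Expected <- rep(sum(Observed)/length(Observed), length(Observed))  # 264/12 = 22 for each sign
chisq <- sum((Observed - Expected)^2/Expected)                     # test statistic
c(chisq = chisq, p.value = pchisq(chisq, df = length(Observed)-1, lower.tail = FALSE))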
Random Sample? A town has 50% Christian, 30% Jewish, and 20% Muslim children. 100 children were chosen to participate in a community project. Of the 100, 50 were Christians, 40 were Jewish and 10 were Muslim. Could this have been a random sample?
First, we enter the data.
Observed <- c(50, 40, 10)
names(Observed) <- c("Christian", "Jewish", "Muslim")
Observed
Christian Jewish Muslim
50 40 10
Unlike the previous example, the probabilities are not expected to be equal: they should be 0.5 for Christian, 0.3 for Jewish, and 0.2 for Muslim children. The \(\chi^2\) goodness-of-fit test can be done using the following command.
chisq.test(Observed, p=c(0.5, 0.3, 0.2))
Chi-squared test for given probabilities
data: Observed
X-squared = 8.3333, df = 2, p-value = 0.0155
The test returns \(\chi^2\) = 8.3333 with df = 2, and the corresponding p-value is 0.0155. These numbers are again consistent with the calculations in the Stat 100 notes. Since the p-value is less than 0.05, the null is rejected at the significance level \(\alpha\)=0.05, and the conclusion is that the children were not chosen randomly from the town.
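As before, we can sketch the calculation by hand to see where these numbers come from, this time using the expected counts implied by the stated probabilities:
Expected <- sum(Observed)*c(0.5, 0.3, 0.2)   # expected counts: 50, 30, 20
chisq <- sum((Observed - Expected)^2/Expected)
c(chisq = chisq, p.value = pchisq(chisq, df = 2, lower.tail = FALSE))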
In Spring 2014, 858 Stat 100 students responded to a survey. The csv file of the survey data has been uploaded here. You can download it to your work space and then load it to R using the command
survey <- read.csv("Stat100_2014spring_survey02.csv")
The description of the data can be found on this webpage.
In the survey data, the column named ‘homeTown’ records whether the students are from a small town, a medium sized city (like Champaign Urbana), a big city suburb (like Chicago suburb) or from a big city (like Chicago but not the suburbs). Suppose we want to know if there is any difference in students’ home towns among different ethnic groups. We first create a contingency table to look at the data. The contingency table can be generated easily using the table()
function:
tbl <- with(survey, table(homeTown, ethnicity))
tbl
ethnicity
homeTown Asian Black Hispanic Mixed_Other White
big_city 64 41 51 8 26
Medium_City 21 10 11 9 66
Small_Town 10 4 7 7 93
suburbs 89 29 30 19 263
To visualize the data, we can make bar plots:
plot(homeTown ~ ethnicity, data=survey)
It appears from the plots that there are differences among the groups. For example, large proportions of the Black and Hispanic students are from big cities, whereas the majority of the White students are from big city suburbs.
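To quantify what the plots show, we can also look at the proportion of each home-town category within each ethnic group. One simple way (a sketch) is prop.table(), which here normalizes each column of the contingency table so that it sums to 1:
round(prop.table(tbl, margin = 2), 2)   # column proportions, rounded for readability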
We can perform a \(\chi^2\) independence test to test if the observed differences are statistically significant. If you forget what a \(\chi^2\) independence test is, read Chapter 22, part 2 of Stat 100 notes or watch this video. The null hypothesis is that a student’s home town is independent of his/her ethnicity. The \(\chi^2\) independence test can be done using the chisq.test()
function. The simplest syntax is
chisq.test(x)
Unlike the \(\chi^2\) goodness-of-fit test, x
here is a matrix of the observed frequencies. Since we already have a contingency table tbl
, which is a matrix, we can simply apply the function on tbl
:
chisq.test(tbl)
Pearson's Chi-squared test
data: tbl
X-squared = 187.99, df = 12, p-value < 2.2e-16
The test indicates that \(\chi^2\) = 187.99 with df = 12 and the p-value is < 2.2e-16. This means that the null is rejected: the home-town distribution differs significantly across ethnic groups, i.e., at least one ethnic group differs from the others.
Another syntax of chisq.test()
is
chisq.test(x,y)
where both x and y are vectors. For example, we can perform the same \(\chi^2\) independence test for the home town example using the command
with(survey, chisq.test(homeTown, ethnicity))
Pearson's Chi-squared test
data: homeTown and ethnicity
X-squared = 187.99, df = 12, p-value < 2.2e-16
We see that the result is exactly the same as before. A contingency table is not necessary in this syntax.
Like all other commands you will encounter in this note, you can use chisq.test()
as a black box without knowing what exactly it is doing behind the scene. However, we encourage you to go through at least one calculation to verify that the result outputted by chisq.test()
is exactly the same as what you get from the calculation you learned in Stat 100. Here we will demonstrate that. As a bonus, you get to learn a new R command as we go along.
Recall that in the \(\chi^2\) test, the value of \(\chi^2\) is calculated by the formula \[\chi^2 = \sum \frac{(Obs - Exp)^2}{Exp} \ \ \ \ \ (1)\] where \(Obs\) is the observed frequency and \(Exp\) is the expected frequency. The sum goes over each cell in the contingency table. The values of \(Obs\) are given by the tbl
variable we created above. The expected frequency at a particular cell is given by the formula \[Exp = \frac{({\rm row\ total})\times ({\rm column\ total})}{\rm overall\ total}\] The row total in a given row is the sum of the observed frequencies in that row. The column total in a given column is the sum of the observed frequencies in that column. These can be computed by the rowSums()
and colSums()
functions you learned in Week 7 (Section 18.8 of Peng’s textbook):
row_tot <- rowSums(tbl)
col_tot <- colSums(tbl)
row_tot
big_city Medium_City Small_Town suburbs
190 117 121 430
col_tot
Asian Black Hispanic Mixed_Other White
184 84 99 43 448
The overall total is the same as the total number of observations, which is 858.
overall_tot <- nrow(survey)
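As a quick sanity check, the overall total should also equal the sum of all cells in the contingency table (assuming there are no missing values in these two columns):
sum(tbl)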
The next step is to create a 4×5 matrix to store the expected frequencies in all cells. We want the (i,j) element of the matrix to store the product of the ith element of row_tot
and the jth element of col_tot
divided by overall_tot
. The operation involves what mathematicians call the "outer product" of two vectors. We could of course use two nested for-loops to implement the calculation, but since this is such a well-known operation, you might expect there to be a built-in command in R for it. A quick search over the internet confirms our guess. The function %o%
is exactly what we want to carry out the outer product (presumably o
stands for “outer product”). It works like this. Suppose x
is a vector of length n
and y
is a vector of length m
. Then x %o% y
returns an n×m matrix with the (i,j) element equal to the product of the ith element of x
and the jth element of y
. So the expected frequencies can be computed as follows.
Exp <- row_tot %o% col_tot / overall_tot
Exp
Asian Black Hispanic Mixed_Other White
big_city 40.74592 18.60140 21.92308 9.522145 99.20746
Medium_City 25.09091 11.45455 13.50000 5.863636 61.09091
Small_Town 25.94872 11.84615 13.96154 6.064103 63.17949
suburbs 92.21445 42.09790 49.61538 21.550117 224.52214
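As an aside, %o% is just a shorthand for the base function outer(), and it gives the same matrix as the nested-loop calculation mentioned above. A quick sketch to verify this (Exp2 is a name we introduce only for the check):
all.equal(row_tot %o% col_tot, outer(row_tot, col_tot))   # should be TRUE
Exp2 <- matrix(NA, nrow = length(row_tot), ncol = length(col_tot),
               dimnames = list(names(row_tot), names(col_tot)))
for (i in seq_along(row_tot)) {
  for (j in seq_along(col_tot)) {
    Exp2[i, j] <- row_tot[i]*col_tot[j]/overall_tot        # nested-loop version
  }
}
all.equal(Exp, Exp2)                                        # should be TRUE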
With the observed and expected frequencies, we can now compute \(\chi^2\) using equation (1):
chisq <- sum( (tbl-Exp)^2/Exp )
chisq
[1] 187.9871
The df is given by (number of rows -1)×(number of columns - 1), which is 3×4 or 12. So we have \(\chi^2\) = 187.99 with df = 12, the same as the values returned by chisq.test()
. The p-value can be calculated using the pchisq()
function introduced in Week 3.
pchisq(chisq,12,lower.tail=FALSE)
[1] 9.747495e-34
The result is again consistent with the output of chisq.test()
.
One nice thing about our calculation is that we can also get an idea of how the groups differ by computing the fractional difference of the observed and expected frequencies:
(tbl-Exp)/Exp
ethnicity
homeTown Asian Black Hispanic Mixed_Other White
big_city 0.57070938 1.20413534 1.32631579 -0.15985312 -0.73792293
Medium_City -0.16304348 -0.12698413 -0.18518519 0.53488372 0.08035714
Small_Town -0.61462451 -0.66233766 -0.49862259 0.15433404 0.47199675
suburbs -0.03485844 -0.31112957 -0.39534884 -0.11833423 0.17137666
We see that there are unexpectedly large proportions of Black and Hispanic students living in big cities, an unexpectedly small proportion of White students living in big cities, and unexpectedly small proportions of Asian and Black students living in small towns.
One of the survey questions asked students to rate their attitude towards gay marriage: “On a scale of 0–10 rate where you fall: ‘0’ means you strongly favor gay marriage… and ‘10’ means you strongly oppose gay marriage.”
We want to know if there is any difference between males and females in their attitudes towards gay marriage.
The response on the gay marriage question is in the column labelled ‘gayMarriage’ and the gender information is in the column ‘gender’. We plot histograms for males and females to see if there is any obvious difference:
library(lattice)
histogram(~gayMarriage | gender, data=survey, layout=c(1,2))
Here we use R’s lattice graphics function histogram()
to create conditional plots. The layout=c(1,2)
option specifies the histograms should be plotted on a single column with two rows.
Another useful type of plot is the box plot, which can be generated by the boxplot()
or simply the plot()
command
plot(gayMarriage ~ gender, data=survey,las=1)
When the x-axis variable is a factor, the plot()
command generates a box plot for each category.¹ Note that the thick lines are the medians. It is useful to indicate the means on the plots as well, since the t-test compares group means, not medians. We can use tapply()
to calculate the group means and points()
to add them to the box plots:
means <- tapply(survey$gayMarriage, survey$gender, mean)
plot(gayMarriage ~ gender, data=survey,las=1)
points(means, col="red", pch="+")
The red +’s denote the means in the box plots. It appears from the histograms and box plots that females have lower values in the ‘gayMarriage’ variable, suggesting that females tend to favor gay marriage more than males do.
We can perform a t-test to see if the group means for ‘gayMarriage’ show a significant difference between males and females. The command t.test()
can be used for this analysis. In particular, if x
is a binary factor variable and y
is a numeric/integer vector,
t.test(y ~ x, data=data_name, alternative=c("two.sided","less","greater"),
var.equal=TRUE)
compares the means of y
in the two groups specified by the factor variable x
. The optional parameter alternative
specifies the alternative hypothesis. The default is “two.sided”. In this case, the hypotheses are
null H0: the means are the same for the two groups
alternative Ha: the means are different for the two groups
The option var.equal=TRUE
assumes that the two groups have the same variance, and the pooled variance is calculated. If the option var.equal=TRUE
is omitted, t.test()
performs a more complicated analysis assuming the two variances are not equal. The Welch t-test, which some of you may have learned in a more advanced Stat course, is performed in that case. We will always use the option var.equal=TRUE
in this course. Let’s apply the function to our data:
t.test(gayMarriage ~ gender, data=survey, var.equal=TRUE)
Two Sample t-test
data: gayMarriage by gender
t = -5.5244, df = 856, p-value = 4.388e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.7074607 -0.8122462
sample estimates:
mean in group Female mean in group Male
2.140820 3.400673
Here gayMarriage is y
and gender is x
. Since ‘gender’ is a two-level factor variable with ‘Female’ as the reference level, the difference between the means is \(mean(gayMarriage_{Female})-mean(gayMarriage_{Male})\). The result shows that the difference between the means for females and males is highly significant. The p-value is \(4.4\times 10^{-8}\), and the 95% confidence interval of \(mean(gayMarriage_{Female})-mean(gayMarriage_{Male})\) is (-1.707, -0.812). This means that females tend to favor gay marriage more than males do. The p-value is two-sided since we don’t specify the alternative
parameter and it takes the default value.
If we want to test whether the mean for females is less than males, we set the alternative
parameter to “less”:
t.test(gayMarriage ~ gender, data=survey, alternative="less", var.equal=TRUE)
Two Sample t-test
data: gayMarriage by gender
t = -5.5244, df = 856, p-value = 2.194e-08
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -0.8843343
sample estimates:
mean in group Female mean in group Male
2.140820 3.400673
When we set alternative="less"
, the alternative hypothesis becomes Ha: the mean for females is less than that for males. We see that the p-value is half of the previous value.
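For reference, this is what the call would look like if var.equal=TRUE were omitted, in which case t.test() performs the Welch test mentioned above (output not shown; we will not use this in this course):
t.test(gayMarriage ~ gender, data=survey)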
Another syntax of using the t.test()
function for a two-sample t-test is
t.test(y1,y2, alternative=c("two.sided","less","greater"), var.equal=TRUE)
Here y1 and y2 are the integer/numeric vectors whose means are to be compared. For example, the command t.test(gayMarriage ~ gender, data=survey, var.equal=TRUE)
is the same as
with(survey, t.test(gayMarriage[gender=="Female"], gayMarriage[gender=="Male"],
var.equal=TRUE))
Two Sample t-test
data: gayMarriage[gender == "Female"] and gayMarriage[gender == "Male"]
t = -5.5244, df = 856, p-value = 4.388e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.7074607 -0.8122462
sample estimates:
mean of x mean of y
2.140820 3.400673
Note that we use the with(survey, ...)
syntax in order to access the variables ‘gayMarriage’ and ‘gender’ in the ‘survey’ data frame without the need to attach the prefix ‘survey$’. The second syntax of t.test()
will come in handy when a factor variable is not available to separate the two groups. For example, the two groups may be stored in two separate data frames.
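For instance, here is a minimal sketch in which the two groups are first split into two data frames (females and males are names we introduce here) and then compared with the second syntax; it gives the same result as before:
females <- subset(survey, gender == "Female")
males <- subset(survey, gender == "Male")
t.test(females$gayMarriage, males$gayMarriage, var.equal = TRUE)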
The t.test()
function described above performs the calculation as stated on P.122 of the Stat 200 notes (Fall 2017 Edition): \[t = \frac{{\rm ObsDiff}}{SE^+_{\rm difference}}\] \[SE^+_{\rm difference} = SD^+_{\rm errors}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\] \[SD^+_{\rm errors}=\sqrt{\frac{SSE}{n-g}}\] \[SSE = \sum_{i=1}^n (y_i-\hat{y}_i)^2= \sum_{i=1}^{n_1} (y_{1i}-\bar{y}_1)^2 + \sum_{i=1}^{n_2} (y_{2i}-\bar{y}_2)^2\] where \(n=n_1+n_2\) is the total number of observations and \(g\) is the number of groups. In our case, \(n_1\) is the number of females and \(n_2\) is the number of males and \(g=2\). It is instructive to write our own code for the t-test and confirm that we get the exact same answer as the t.test()
result:
n1 <- sum(survey$gender=="Female")
n2 <- sum(survey$gender=="Male")
n <- n1+n2
means <- tapply(survey$gayMarriage, survey$gender, mean)
ObsDiff <- means["Female"] - means["Male"]
SSE <- with(survey, sum( (gayMarriage[gender=="Female"]-means["Female"])^2)
+ sum( (gayMarriage[gender=="Male"]-means["Male"])^2) )
SDplus_err <- sqrt(SSE/(n-2))
SEplus_diff <- SDplus_err*sqrt(1/n1 + 1/n2)
(t <- unname(ObsDiff/SEplus_diff))
[1] -5.524405
As expected, we obtain the same t-value as returned by the t.test()
function. The one-sided and two-sided p-values can be computed by the pt()
function. Since t is negative here, pt(t, n-2) gives the lower-tail area, which is the one-sided p-value for the “less” alternative; doubling it gives the two-sided p-value:
pval1 <- pt(t,n-2)
pval2 <- 2*pt(t,n-2)
c(pval1,pval2)
[1] 2.193757e-08 4.387514e-08
which are again the same as those returned by t.test()
.
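We can also reproduce the 95% confidence interval reported by t.test() from the quantities computed above. A sketch:
tstar <- qt(0.975, n-2)   # critical t value for a 95% CI with n-2 degrees of freedom
unname(c(ObsDiff - tstar*SEplus_diff, ObsDiff + tstar*SEplus_diff))
This should reproduce the interval (-1.707, -0.812) shown in the t.test() output.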
The F-test is used to compare the means among more than two groups. Consider Example 2 on P.118 of the Stat 200 notes (Fall 2017 Edition), where we want to know if there are any differences between ethnic groups on their attitudes towards gay marriage. Let’s first make a few plots.
histogram(~gayMarriage | ethnicity, data=survey)
means <- with(survey, tapply(gayMarriage, ethnicity, mean) )
plot(gayMarriage ~ ethnicity, data=survey, las=1)
points(means, col="red", pch="+")
There appear to be differences among the groups. R’s aov()
function can be used to perform an F-test. The syntax is
aov(y ~ x, data=data_name)
Here x
is a factor variable (or an object that can be coerced into a factor) that can contain more than two levels. For our problem, ‘gayMarriage’ is y
and ‘ethnicity’ is x
:
result <- aov(gayMarriage ~ ethnicity, data=survey)
summary(result)
Df Sum Sq Mean Sq F value Pr(>F)
ethnicity 4 330 82.54 8.165 1.84e-06 ***
Residuals 853 8623 10.11
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The test gives an F value of 8.165 with df1 = 4 and df2 = 853. The corresponding p-value is \(1.84\times 10^{-6}\). Hence the difference is highly significant. We conclude that the mean of at least one group is different from the others.
In the summary output, the ‘Sum Sq’ column displays the sum-of-squares quantities. In the ‘ethnicity’ row, it is the sum of squares between groups, or SSB in the Stat 200 notation. In the ‘Residuals’ row, it is the sum of squares within groups, or SSW in the Stat 200 notation. The ‘Mean Sq’ column is calculated by dividing the value in the ‘Sum Sq’ column by the Df. In Stat 200 notation, the ‘Mean Sq’ value in the ‘ethnicity’ row is MSB, and the ‘Mean Sq’ value in the ‘Residuals’ row is MSW. The F value is calculated by F = MSB/MSW.
The F-test calculation is described on P.120 of the Stat 200 notes (Fall 2017 Edition): \[F = \frac{MSB}{MSW}=\frac{SSB/(g-1)}{SSW/(n-g)}\] \[SSB = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\] \[SSW = \sum_{i=1}^n (y_i-\hat{y}_i)^2\] Here \(n\) is the total number of observations, \(g\) is the number of groups, \(\bar{y}\) is the overall mean and \(\hat{y}_i\) is the predicted value for the ith observation. The value of \(\hat{y}_i\) is simply the mean of the group to which the ith observation belongs. We now go through the coding exercise to reproduce the aov()
result from the formulae:
y <- survey$gayMarriage
x <- survey$ethnicity
n <- nrow(survey)
g <- length(levels(x)) # number of groups
bar_y <- mean(y)
means <- tapply(y,x,mean)
# Set hat_y[i] to the average of the group to which i belongs
hat_y <- means[x]
SSB <- sum((hat_y - bar_y)^2)
SSW <- sum((y - hat_y)^2)
SST <- (n-1)*var(y)
c(SSB,SSW,SST)
[1] 330.1631 8623.2600 8953.4231
# Sanity check: SST = SSW + SSB ?
(SST - SSW - SSB)/SST
[1] -6.348792e-18
F <- (SSB/(g-1)) / (SSW/(n-g))
F
[1] 8.164809
As expected, we obtain the same F value. We have also verified that the equality SST = SSW + SSB holds. In addition, SSB and SSW are the same numbers as in the ‘Sum Sq’ column in the output of summary(result)
above. To compute the associated p-value, we note that the numerator and denominator degrees of freedom for the F statistic are \(g-1\) and \(n-g\), respectively. The p-value is the area under the F curve to the right of the F value, which can be calculated using the pf()
function:
pf(F,g-1,n-g, lower.tail=FALSE)
[1] 1.83638e-06
This is the same p-value returned by the aov()
function. Finally, the square of the correlation between \(\hat{y}\) and \(y\) is given by \(R^2=SSB/SST\), which is 0.0368756 in our example.
From the box plots of gayMarriage versus ethnicity, it doesn’t seem that there is a significant difference between ‘Asian’ and ‘Black’, nor among ‘Hispanic’, ‘Mixed/Other’ and ‘White’. We can test the significance using the aov()
function on subsets of the data.
We first create two subset conditions:
AB <- with(survey, ethnicity=="Asian" | ethnicity=="Black")
HMW <- !AB
Here AB
and HMW
are logical vectors. AB
is TRUE if ethnicity is ‘Asian’ or ‘Black’ and is FALSE otherwise. HMW
is the reverse of AB
: it is TRUE if ethnicity is ‘Hispanic’, ‘Mixed_Other’ or ‘White’ and FALSE otherwise. Like lm()
, the optional parameter subset
can be used to specify the subset of the data on which to perform the analysis. For the Asian-Black groups, we have
result_AB <- aov(gayMarriage ~ ethnicity, data=survey, subset=AB)
summary(result_AB)
Df Sum Sq Mean Sq F value Pr(>F)
ethnicity 1 3.6 3.649 0.327 0.568
Residuals 266 2969.1 11.162
The p-value is 0.568, suggesting that there is no significant difference between ‘Asian’ and ‘Black’ on the attitude towards gay marriage. For the Hispanic-Mixed-White groups, we have
result_HMW <- aov(gayMarriage ~ ethnicity, data=survey, subset=HMW)
summary(result_HMW)
Df Sum Sq Mean Sq F value Pr(>F)
ethnicity 2 16 7.781 0.808 0.446
Residuals 587 5654 9.632
This large p-value (0.446) also indicates that there is no significant difference among the Hispanic, Mixed/Other, and White groups in their attitude towards gay marriage.
From the previous calculation, we conclude that at least one ethnic group differs from the others in its attitude towards gay marriage. We next want to see which pairs of groups show significant differences. We need to do t-tests between pairs of groups. Before we do the tests, let’s guess what the answers may be. Looking at the box plots of gayMarriage versus ethnicity, it appears that Asians and Blacks show no significant difference; Whites, Hispanics and Mixed/Other are also similar. But Asians and Blacks appear to be significantly different from Whites, Hispanics and Mixed/Other.
We can use the pairwise.t.test()
function to perform pairwise t-tests. The syntax is
pairwise.t.test(y, x, p.adjust=method,
alternative=c("two.sided", "less", "greater"))
Here y
is a numeric/integer vector, x
is a factor variable, method
is the name of the method used to adjust the p-values. There are 8 methods, but we will only use “bonferroni” for the Bonferroni correction. You can look at the other 7 methods by typing ?p.adjust
. Just like t.test()
, the default value of the alternative
parameter is “two.sided”, but you can change it to “less” or “greater”. For our problem, y
is ‘survey$gayMarriage’ and x
is ‘survey$ethnicity’ and we will use the default “two.sided” for the alternative
parameter. The result is as follows.
pairwise.t.test(survey$gayMarriage, survey$ethnicity, p.adjust="bonferroni")
Pairwise comparisons using t tests with pooled SD
data: survey$gayMarriage and survey$ethnicity
Asian Black Hispanic Mixed_Other
Black 1.00000 - - -
Hispanic 0.00107 0.00152 - -
Mixed_Other 0.08779 0.05313 1.00000 -
White 0.00054 0.00273 1.00000 1.00000
P value adjustment method: bonferroni
The adjusted p-values given here are the same as those on P.123 of the Stat 200 notes (Fall 2017 Edition). The adjusted p-value between Asians and Blacks is 1 (not significant), between Asians and Hispanics is 0.00107 (significant), between Asians and Mixed/Other is 0.08779 (not significant), between Asians and Whites is 0.00054 (significant), between Blacks and Hispanics is 0.00152 (significant), between Blacks and Mixed/Other is 0.05313 (not significant), between Blacks and Whites is 0.00273 (significant), between Hispanics and Mixed/Other is 1 (not significant), between Hispanics and Whites is 1 (not significant), and between Mixed/Other and Whites is 1 (not significant). So we see that the guesses above are mostly correct. They are wrong only for the Asian-Mixed/Other and Black-Mixed/Other comparisons, where the adjusted p-values exceed the 5% cutoff.
It can be shown that performing an ANOVA analysis for comparing group means is mathematically equivalent to doing a linear regression on a factor variable. For example, let’s look at the result of the linear model predicting ‘gayMarriage’ from ‘gender’:
fit <- lm(gayMarriage ~ gender, data=survey)
summary(fit)
Call:
lm(formula = gayMarriage ~ gender, data = survey)
Residuals:
Min 1Q Median 3Q Max
-3.401 -2.141 -1.771 1.599 7.859
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1408 0.1342 15.956 < 2e-16 ***
genderMale 1.2599 0.2281 5.524 4.39e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.178 on 856 degrees of freedom
Multiple R-squared: 0.03443, Adjusted R-squared: 0.0333
F-statistic: 30.52 on 1 and 856 DF, p-value: 4.388e-08
Since the reference level of ‘gender’ is “Female”, the intercept is the mean of ‘gayMarriage’ for females. The slope is the difference between the means of males and females, i.e. \(mean(gayMarriage_{Male})-mean(gayMarriage_{Female})\). The difference is 1.2599 with a t-value of 5.524. The corresponding p-value is \(4.39\times 10^{-8}\). This is the same (two-sided) p-value returned by t.test()
shown above. The 95% confidence interval of \(mean(gayMarriage_{Male})-mean(gayMarriage_{Female})\) is the 95% CI for the slope:
confint(fit, "genderMale")
2.5 % 97.5 %
genderMale 0.8122462 1.707461
This is again consistent with the t.test()
result.
The lm()
function can also be used to compare group means between more than two groups. For example, gayMarriage ~ ethnicity:
fit <- lm(gayMarriage ~ ethnicity, data=survey)
summary(fit)
Call:
lm(formula = gayMarriage ~ ethnicity, data = survey)
Residuals:
Min 1Q Median 3Q Max
-3.643 -2.261 -1.849 1.609 8.152
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.3913 0.2344 14.468 < 2e-16 ***
ethnicityBlack 0.2516 0.4187 0.601 0.548115
ethnicityHispanic -1.5428 0.3963 -3.893 0.000107 ***
ethnicityMixed_Other -1.4146 0.5386 -2.627 0.008779 **
ethnicityWhite -1.1301 0.2784 -4.059 5.37e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.18 on 853 degrees of freedom
Multiple R-squared: 0.03688, Adjusted R-squared: 0.03236
F-statistic: 8.165 on 4 and 853 DF, p-value: 1.836e-06
We see that the F statistic and the associated p-value are the same as the aov()
result above. In fact, the help page in ?aov
clearly states that aov()
provides a wrapper to lm
for fitting linear models. This means that the aov()
function actually calls the lm()
function. The main difference between lm()
and aov()
is in the output: the aov()
output is expressed in the traditional language of the analysis of variance rather than that of linear models.
Since ‘Asian’ is the reference level in the ‘ethnicity’ variable, the slopes in the linear model represent the differences between the means of other groups and ‘Asian’. The p-value associated with each slope is the unadjusted p-value between ‘Asian’ and each group. To get the adjusted p-values for Bonferroni correction, the p-values of the slopes need to be multiplied by \(g(g-1)/2\) or 10 for \(g=5\). Indeed, the adjusted p-values returned by the pairwise.t.test()
function shown above in the ‘Asian’ column is exactly 10 times the p-values of the slopes in the linear model (except the Asian-Black comparison where multiplying the p-value of the slope by 10 exceeds 1 and so the adjusted p-value is capped at 1).
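One way to see this relation directly (a quick sketch) is to ask pairwise.t.test() for the unadjusted p-values with p.adjust="none"; the ‘Asian’ column of its output should then match the slope p-values in the summary(fit) output above:
pairwise.t.test(survey$gayMarriage, survey$ethnicity, p.adjust="none")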
Here we demonstrate how to use R to perform the randomization tests described in Chapter 27 of the Stat 200 notes. The idea is to scramble the y variable while keeping the factor variable x fixed, repeat the experiment many times, and calculate \(R^2\) of the scrambled data each time. In the end, we calculate the fraction of times the value of R = \(\sqrt{R^2}\) for the scrambled data exceeds the R of the original data. This fraction is an estimate of the p-value from the randomization test.
There are probably packages in R that can be used to perform randomization tests, but we will write our own code here for demonstration. We first write a function to compute \(R^2\) for any given x and y. The easiest way is to call the lm()
function:
computeR2 <- function(y,x) {
summary(lm(y~x))$r.squared
}
Recall that the summary(lm(...))
function returns a number of diagnostics for the linear model. Among them is \(R^2\), which can be extracted using the command summary(lm(...))$r.squared
.
Next we apply computeR2()
on the survey
data with y
being ‘gayMarriage’ and x
being ‘ethnicity’ and check that we reproduce the \(R^2\) calculated above:
R20 <- computeR2(survey$gayMarriage, survey$ethnicity)
R20
[1] 0.03687562
This is the value we calculated above. The randomization test involves scrambling the y variable and calculating \(R^2\), and we have to repeat it many times. We can use the sample()
function to scramble the y variable, and replicate()
to repeat the experiment many times. We do it only 5000 times here to save time, since lm()
is very slow for this task:
set.seed(63741891)
R2 <- replicate(5000, computeR2(sample(survey$gayMarriage), survey$ethnicity))
This gives us 5000 values of \(R^2\). The fraction of times R exceeds the original R is the same as the fraction of times \(R^2\) exceeds the original \(R^2\). It can be calculated by
mean(R2 > R20)
[1] 0
This means that for the 5000 experiments we tried, none of the R values from the scrambled data exceeds the original R. Thus we conclude that the p-value is less than 1/5000, or \(2\times 10^{-4}\). This is consistent with the p-value calculated from the F statistic, which is \(1.84\times 10^{-6}\). The following is a histogram of R for the scrambled data. The vertical red line is the value of the original R.
hist(sqrt(R2),freq=FALSE,breaks=50, xlim=c(0,0.2), xlab="R")
abline(v=sqrt(R20), col="red")
This indicates that getting the original R by random sampling is highly unlikely, and hence we conclude that at least one group is significantly different from others on the attitude towards gay marriage.
We now do the randomization test for the Hispanic-Mixed-White groups, for which the calculation above gives a p-value of 0.446. This can be easily done by subsetting the data:
x <- survey$ethnicity[HMW]
y <- survey$gayMarriage[HMW]
We first compute the original R2:
(R2_HMW <- computeR2(y,x))
[1] 0.002744753
Next we calculate \(R^2\) of the scrambled data, again repeating the experiment 5000 times:
set.seed(32859013)
R2 <- replicate(5000, computeR2(sample(y),x))
Here is the distribution of R:
hist(sqrt(R2), freq=FALSE, breaks=50, xlab="R")
abline(v=sqrt(R2_HMW), col="red")
The p-value is estimated by the fraction of times the values of R2 are greater than R2_HMW:
mean(R2 > R2_HMW)
[1] 0.4562
We see that this estimated p-value is close to that calculated from the F statistic. We have only done the calculation with 5000 experiments; the accuracy improves as the number of experiments increases. With \(10^6\) experiments, we find that the estimated p-value is 0.447, which is very close to the value from the F statistic.
The computeR2()
function shown above uses the lm()
function to compute \(R^2\). However, the lm()
function is very slow for randomization tests because it calculates many other unnecessary quantities. We can write much faster code by computing \(R^2\) directly from the formula \(R^2=SSB/SST\). We discuss the details of our code optimization in these additional notes, for students who are interested in this topic.
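For illustration, here is one possible sketch of such a faster function, computing \(R^2=SSB/SST\) directly with tapply(); the name computeR2_fast is ours, and the additional notes may optimize further:
computeR2_fast <- function(y, x) {
  ybar <- mean(y)
  hat_y <- tapply(y, x, mean)[x]   # group mean for each observation
  SSB <- sum((hat_y - ybar)^2)     # sum of squares between groups
  SST <- sum((y - ybar)^2)         # total sum of squares
  SSB/SST
}
computeR2_fast(survey$gayMarriage, survey$ethnicity)   # should reproduce R20 above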
Even though there seem to be many calculations involved here, we have only introduced a few R commands. Here is a quick summary of the commands and ideas.
The command chisq.test()
can be used to perform a \(\chi^2\) goodness-of-fit test. The syntax is
chisq.test(x, p = rep(1/length(x), length(x)))
Here x
is a numeric vector storing the observed frequencies, and p
is a numeric vector of the same length as x
storing the expected probability of each element in x
. The default value of p
is shown above. It is set so that each element in x
occurs with equal probability.
The command chisq.test()
can also be used to perform a \(\chi^2\) independence test. The syntax is
chisq.test(x)
Here x
is a matrix storing the observed frequencies in the contingency table. A contingency table can be generated by the table()
function.
Another syntax for chisq.test()
is
chisq.test(x, y)
Here x
and y
are numeric/factor vectors.
If x is a factor variable and y is a numeric/integer vector, the command
plot(y ~ x)
generates a box plot of y for each category in x.
The command t.test()
is used to perform a two-sample t test. The syntax is
t.test(y ~ x, data=data_name, alternative=c("two.sided","less","greater"),
var.equal=TRUE)
Here y
is a numeric/integer vector and x
is a factor variable with two levels. The default value of alternative
is “two.sided”. The option var.equal=TRUE
assumes the variance is equal in the two groups, and the pooled variance is calculated.
Another syntax for the t.test()
function is
t.test(y1,y2, alternative=c("two.sided","less","greater"), var.equal=TRUE)
Here y1
and y2
are the numeric/integer vectors for the two groups whose means are to be compared.
The command aov()
can be used to perform F-tests. The syntax is
aov(y ~ x, data=data_name)
Here y
is a numeric/integer vector and x
is a factor variable.
Pairwise t-tests can be done using the pairwise.t.test()
function, the syntax is
pairwise.t.test(y, x, p.adjust="bonferroni",
alternative=c("two.sided", "less", "greater"))
Here y
is a numeric/integer vector and x
is a factor variable. The option p.adjust="bonferroni"
means that the p-values will be adjusted for the Bonferroni correction.
The lm()
function can be used for ANOVA comparing group means in place of t.test()
and aov()
.
¹ To make a box plot of ‘gayMarriage’ without splitting it into categories, use boxplot(survey$gayMarriage).