# load tidyverse 
library(tidyverse)

In the second Stat 100 survey in Fall 2017, students reported the number of hours they worked per week and the percentage of their tuition paid by their parents. The data can be downloaded here. Hours worked per week are in the column workHr, and the percentage of tuition paid by parents is in the column tuition.

# load data
survey <- read_csv("stat100_2017fall_survey02.csv")
Parsed with column specification:
cols(
  .default = col_integer(),
  gender = col_character(),
  genderID = col_character(),
  greek = col_character(),
  homeTown = col_character(),
  ethnicity = col_character(),
  religion = col_character(),
  calculus = col_character(),
  GPA = col_double(),
  expectedIncome = col_double(),
  president = col_character(),
  politicalParty = col_character(),
  section = col_character()
)
See spec(...) for full column specifications.

Calculate the average of workHr and tuition for each ethnic group (given in the column ethnicity).

Use group_by() and summarize():

(Avg <- group_by(survey, ethnicity) %>% 
   summarize(workHr_g=mean(workHr), tuition_g=mean(tuition)))
# A tibble: 6 x 3
    ethnicity workHr_g tuition_g
        <chr>    <dbl>     <dbl>
1       Black 8.031496  41.57480
2  East Asian 3.321033  83.69004
3    Hispanic 7.260638  39.30851
4       Other 7.785714  63.42857
5 South Asian 3.958333  76.11111
6       White 5.247440  63.39590
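
The group means above contain no NA, so these two columns appear to be complete. If a column did contain missing values, mean() would return NA for every affected group. The following is a minimal sketch on made-up data (the tibble toy and its columns group and hours are hypothetical, not part of the survey) of how na.rm = TRUE and n() could be used inside summarize():

```r
library(dplyr)

# Hypothetical toy data, NOT the Stat 100 survey
toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  hours = c(2, 4, 10, NA, 20)
)

toy_avg <- toy %>%
  group_by(group) %>%
  summarize(
    n          = n(),                      # rows per group, including rows with NA
    hours_mean = mean(hours, na.rm = TRUE) # mean over the non-missing values only
  )

toy_avg
```

Reporting the group size next to each mean also makes it easier to judge how much weight a group average deserves.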

Which group has the highest average workHr? What is the highest average workHr? Which group has the lowest average workHr? What is the lowest average workHr?

Use arrange() to sort the observations by workHr_g:

(Avg <- arrange(Avg, workHr_g))
# A tibble: 6 x 3
    ethnicity workHr_g tuition_g
        <chr>    <dbl>     <dbl>
1  East Asian 3.321033  83.69004
2 South Asian 3.958333  76.11111
3       White 5.247440  63.39590
4    Hispanic 7.260638  39.30851
5       Other 7.785714  63.42857
6       Black 8.031496  41.57480

We see that Black students have the highest average workHr (8.03 hours/week) and East Asian students have the lowest (3.32 hours/week).

By default, arrange() sorts the data in ascending order. We can use the function desc() to sort the data in descending order:

arrange(Avg, desc(workHr_g))
# A tibble: 6 x 3
    ethnicity workHr_g tuition_g
        <chr>    <dbl>     <dbl>
1       Black 8.031496  41.57480
2       Other 7.785714  63.42857
3    Hispanic 7.260638  39.30851
4       White 5.247440  63.39590
5 South Asian 3.958333  76.11111
6  East Asian 3.321033  83.69004
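
Instead of reading the answer off a sorted table, the extreme groups can also be picked out directly. The sketch below rebuilds Avg from the rounded values printed above so that it runs on its own. Using filter() with max() and min() is one possible approach, not the only one.

```r
library(dplyr)

# Avg rebuilt from the (rounded) group means printed above
Avg <- tibble(
  ethnicity = c("Black", "East Asian", "Hispanic", "Other", "South Asian", "White"),
  workHr_g  = c(8.031496, 3.321033, 7.260638, 7.785714, 3.958333, 5.247440),
  tuition_g = c(41.57480, 83.69004, 39.30851, 63.42857, 76.11111, 63.39590)
)

# Keep only the row(s) whose group mean equals the maximum / minimum
highest <- filter(Avg, workHr_g == max(workHr_g))
lowest  <- filter(Avg, workHr_g == min(workHr_g))

highest
lowest
```

Unlike taking the first row of a sorted table, this keeps every tied group if two groups happen to share the same extreme value.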

Calculate the correlation between the group means of workHr and the group means of tuition. This is known as the ecological correlation. Compare the ecological correlation with the individual-level correlation between workHr and tuition.

# correlation between workHr and tuition
cor(survey$workHr, survey$tuition)
[1] -0.1851278
# ecological correlation
cor(Avg$workHr_g, Avg$tuition_g)
[1] -0.8536816

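We see that the ecological correlation (-0.85) is much more negative than the individual-level correlation (-0.19). As you’ve learned in Stat 100, the magnitude of the ecological correlation is usually larger than the magnitude of the individual-level correlation, because averaging removes the variation among individuals within each group.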
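
To see why averaging tends to inflate the correlation, here is a small deterministic sketch on made-up data (the data frame toy and the columns g, k, x and y are hypothetical, not taken from the survey). Each of 6 groups has 4 individuals whose deviations from the group mean cancel out, so the group means fall exactly on a line while the individual points scatter widely around it.

```r
library(dplyr)

# Hypothetical toy data: 6 groups (g) with 4 individuals (k) each
toy <- expand.grid(g = 1:6, k = 1:4)

# Within-group deviations: they average to zero and are uncorrelated with each other
e <- c(-3, 3, -3, 3)
f <- c(-3, 3, 3, -3)
toy$x <- toy$g + e[toy$k]    # group mean of x is g
toy$y <- -toy$g + f[toy$k]   # group mean of y is -g

# Individual-level correlation: weakened by the within-group scatter
r_ind <- cor(toy$x, toy$y)

# Ecological correlation: computed from the group means only
toy_avg <- toy %>%
  group_by(g) %>%
  summarize(x_g = mean(x), y_g = mean(y))
r_eco <- cor(toy_avg$x_g, toy_avg$y_g)

c(individual = r_ind, ecological = r_eco)
```

In this constructed example the individual-level correlation is -70/286 (about -0.24), while the ecological correlation is -1, because the group means carry none of the within-group scatter.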