Simple Data Analysis and Manipulations

Load Data File to R
Change Column Names
Looking at Data
Export Data to Files

Here we demonstrate how to load a data file to R and perform simple manipulations. We will use the second survey data of Stat 100 classes conducted in Fall 2015 to demonstrate the process.

You have been analyzing the Stat 100 survey data in the homework assignments. However, we have gone through the process of reformating the data before asking you to load them to R. Here we show you some of the techniques we use to reformat the data. The data in the real world are in general messier and need a lot of cleaning up before they can be analyzed.

To download the survey data, go to Statistics Department’s data program website, choose stat100 from the first drop-down menu on the left, then choose Survey 2, Fall 2015 (combined). After the data are loaded, you should see the following:

Data Program

This page lists all the questions in the survey and the coding used by the data program for the categorical data. Click the Display/Modify Data block on the upper left. You will enter a page that shows the data. Click the Download Raw Data box on the upper right of this page. Save the data ‘2015fall_survey02_combined.dat’ to your computer. The data in each column is separated by space.

Load Data File to R

To load the data file to R, first move the file to your R’s working directory, or alternatively, change your R’s working directory to the directory of the file. You can find out your R’s current working directory using the getwd() command. If you forget how to set your working directory, watch one of these videos again: Windows, Mac.

Once the file ‘2015fall_survey02_combined.dat’ is in the working directory, type

survey <- read.table('2015fall_survey02_combined.dat', header=TRUE)

to load the data to a data frame named survey. The header=TRUE option is to tell R that the first row in the file is a header and the data begin in the second row. Using the dim() function, we can find out the number of rows and columns of survey:

dim(survey)

[1] 1137   23

It indicates that there are 1137 observations (rows) and 23 columns. The column names are

names(survey)

 [1] "Gender.0.Male.1.Female.2.Other.."                                                                 
 [2] "Gender_ID.0.Male.1.Female.2.Other.."                                                              
 [3] "Greek.0.No.1.Yes.."                                                                               
 [4] "Home_Town.0.Small_Town.1.Medium_City.2.Big_City_Suburb.3.Big_City.."                              
 [5] "Ethnicity.0.White.1.Black.2.Hispanic.3.Asian.4.Mixed.5.Other.."                                   
 [6] "Religion.0.Christian.1.Jewish.2.Muslim.3.Hindu.4.Buddhist.5.Other_Religion.6.Agnostic.7.Atheist.."
 [7] "Religious"                                                                                        
 [8] "ACT"                                                                                              
 [9] "GPA"                                                                                              
[10] "Party_Hours_per_week"                                                                             
[11] "Drinks_per_week"                                                                                  
[12] "Num._Sex_Partners"                                                                                
[13] "Num._Relationships"                                                                               
[14] "First_Kiss_Age.26.still_waiting.."                                                                
[15] "Fav._Life_Period.0.Kindergarten.1.lower_grades.2.middle_school.3.High_School.4.College.."         
[16] "Hours_call_parents"                                                                               
[17] "Social_media.10.V10_or_more.."                                                                    
[18] "Texts"                                                                                            
[19] "Good_or_Well"                                                                                     
[20] "Parent_Relationship"                                                                              
[21] "Work_Hours_per_week"                                                                              
[22] "Percent_Parents_pay_tuition"                                                                      
[23] "Career.0.No_Idea.1.Some_Idea.2.Nervous.3.Concrete_plan.4.Grad_School.5.Other.."

When the file is loaded using the read.table() function with header=TRUE, R tries to assign the column names based on the characters in the first row. R checks the names and makes sure that they are syntactically valid variable names: variable names in R cannot contain space; they can consist of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. The column names in the file ‘2015fall_survey02_combined.dat’ contain many information that is needed for the data program, but these names are invalid variable names in R. As a result, R adjusts the names. If you want to preserve the column names as it is in the file, you can use the option check.names=FALSE:

survey <- read.table('2015fall_survey02_combined.dat', header=TRUE, check.names=FALSE)
names(survey)

 [1] "Gender(0=Male;1=Female;2=Other;)"                                                                 
 [2] "Gender_ID(0=Male;1=Female;2=Other;)"                                                              
 [3] "Greek(0=No;1=Yes;)"                                                                               
 [4] "Home_Town(0=Small_Town;1=Medium_City;2=Big_City_Suburb;3=Big_City;)"                              
 [5] "Ethnicity(0=White;1=Black;2=Hispanic;3=Asian;4=Mixed;5=Other;)"                                   
 [6] "Religion(0=Christian;1=Jewish;2=Muslim;3=Hindu;4=Buddhist;5=Other_Religion;6=Agnostic;7=Atheist;)"
 [7] "Religious"                                                                                        
 [8] "ACT"                                                                                              
 [9] "GPA"                                                                                              
[10] "Party_Hours_per_week"                                                                             
[11] "Drinks_per_week"                                                                                  
[12] "Num._Sex_Partners"                                                                                
[13] "Num._Relationships"                                                                               
[14] "First_Kiss_Age(26=still_waiting;)"                                                                
[15] "Fav._Life_Period(0=Kindergarten;1=lower_grades;2=middle_school;3=High_School;4=College;)"         
[16] "Hours_call_parents"                                                                               
[17] "Social_media(10=V10_or_more;)"                                                                    
[18] "Texts"                                                                                            
[19] "Good_or_Well"                                                                                     
[20] "Parent_Relationship"                                                                              
[21] "Work_Hours_per_week"                                                                              
[22] "Percent_Parents_pay_tuition"                                                                      
[23] "Career(0=No_Idea;1=Some_Idea;2=Nervous;3=Concrete_plan;4=Grad_School;5=Other;)"

Change Column Names

Even though the column names have now been preserved, they are still very inconvenient to use. We can change the names by assigning different names to names(survey):

names(survey) <- c("gender","genderID","greek","homeTown","ethnicity",
                   "religion","religious","ACT","GPA","partyHr",
                  "drinks","sexPartners","relationships","firstKissAge",
                  "favPeriod","hoursCallParents","socialMedia","texts",
                  "good_well","parentRelationship","workHr",
                  "percentTuition","career")
names(survey)

 [1] "gender"             "genderID"           "greek"             
 [4] "homeTown"           "ethnicity"          "religion"          
 [7] "religious"          "ACT"                "GPA"               
[10] "partyHr"            "drinks"             "sexPartners"       
[13] "relationships"      "firstKissAge"       "favPeriod"         
[16] "hoursCallParents"   "socialMedia"        "texts"             
[19] "good_well"          "parentRelationship" "workHr"            
[22] "percentTuition"     "career"

These column names have been shortened for convenience. The detail description of each column can be found on the survey data page at the data program.

Looking at Data

To take a look at the data in R, type View(survey). R opens a new window and displays the data in a spreadsheet.

If we want to perform data analysis, probably the first thing we want to check is if there are missing values in the data frame. This can easily be done using the command

sum(is.na(survey))

[1] 0

This command counts the number of NAs in the data frame, which is 0 in our case, meaning that there are no missing values.

We can look at particular rows or columns of the data in survey using the subsetting methods appropriate to data frames. For example, to look at the 178th row, type

survey[178,]

    gender genderID greek homeTown ethnicity religion religious ACT GPA
178      1        1     0        2         0        1         5  25 3.7
    partyHr drinks sexPartners relationships firstKissAge favPeriod
178       0      0           1             1           16         4
    hoursCallParents socialMedia texts good_well parentRelationship workHr
178               14           5     5         6                 10      0
    percentTuition career
178            100      4

This command shows all columns in row 178. If we only want to look at columns 1, 4 and 5, we can type

survey[178,c(1,4,5)]

    gender homeTown ethnicity
178      1        2         0

To look at the first 10 rows of column 4, type

survey[1:10,4]

 [1] 1 2 2 2 0 3 2 2 3 1

Usually we don’t remember what each column represents. That is why we go through the detail in setting and changing the column names. It is much easier to pull up a particular column using its column name. For example, to see the first 20 rows of the “GPA” column, type

survey$GPA[1:20]

 [1] 3.7 3.4 3.6 3.0 4.0 3.2 1.7 4.0 2.7 2.7 4.0 3.8 3.8 3.4 3.3 3.1 3.6
[18] 4.0 3.2 3.0

or survey[["GPA"]][1:20]. These two commands are equivalent to survey[1:20,9] and survey[1:20,"GPA"] since the “GPA” column is column 9. It is also possible to create an alias to each column name without using the prefix survey$ by the command attach(survey). However, I discourage using it for beginners as it can lead to confusion. See Good practice in the ?attach help page for detail.

Column names are much easier to use when they are short and descriptive. For example, to look at the first 10 rows in columns gender, ACT and GPA, we can use the following command:

survey[1:10, c("gender","ACT","GPA")]

   gender ACT GPA
1       1  27 3.7
2       1  25 3.4
3       1  27 3.6
4       0  29 3.0
5       1  25 4.0
6       1  27 3.2
7       0  35 1.7
8       1  30 4.0
9       1  23 2.7
10      0  27 2.7

Sometimes, we want to add new columns to a data frame. For example, to add a column called “one”, use the command

survey$one <- 1

Since “one” was not in the data frame, the above command creates a new column with the name “one”. We can also use

survey[["all 2's"]] <- 2

to create a new column named “all 2’s”. When we type names(survey), we see that these new columns are placed at the last two columns of the data frame:

names(survey)

 [1] "gender"             "genderID"           "greek"             
 [4] "homeTown"           "ethnicity"          "religion"          
 [7] "religious"          "ACT"                "GPA"               
[10] "partyHr"            "drinks"             "sexPartners"       
[13] "relationships"      "firstKissAge"       "favPeriod"         
[16] "hoursCallParents"   "socialMedia"        "texts"             
[19] "good_well"          "parentRelationship" "workHr"            
[22] "percentTuition"     "career"             "one"               
[25] "all 2's"

To remove a column from the data frame, we set it to NULL:

survey$one <- NULL
survey[["all 2's"]] <- NULL

names(survey)

 [1] "gender"             "genderID"           "greek"             
 [4] "homeTown"           "ethnicity"          "religion"          
 [7] "religious"          "ACT"                "GPA"               
[10] "partyHr"            "drinks"             "sexPartners"       
[13] "relationships"      "firstKissAge"       "favPeriod"         
[16] "hoursCallParents"   "socialMedia"        "texts"             
[19] "good_well"          "parentRelationship" "workHr"            
[22] "percentTuition"     "career"

Suppose we are interested in students having GPA greater than 3.8, we can use the command

highGPA <- survey[survey$GPA > 3.8,]

to take a subset of the data frame for students with GPA greater than 3.8. The command

highGPA$GPA

  [1] 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0
 [18] 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0
 [35] 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9
 [52] 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 3.9 3.9 4.0 4.0
 [69] 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0
 [86] 4.0 4.0 4.0 4.0 4.0 3.9 3.9 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
[103] 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0
[120] 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0
[137] 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
[154] 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 3.9 4.0 4.0 4.0 4.0 3.9 4.0
[171] 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0
[188] 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
[205] 3.9 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 3.9 3.9 4.0
[222] 4.0 4.0 4.0 3.9 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
[239] 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 3.9 3.9 4.0
[256] 3.9 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 3.9 4.0 3.9
[273] 3.9 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
[290] 4.0 4.0 4.0 4.0 4.0 3.9 3.9 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0
[307] 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 3.9 3.9 4.0 4.0

displays the students’ GPA in this new data frame. The total number of students in this group is the number of rows in highGPA:

nrow(highGPA)

[1] 321

`summary()`

The function summary(x) shows a table summarizing the object x. For example,

summary(survey)

     gender          genderID          greek           homeTown    
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.000  
 Median :1.0000   Median :1.0000   Median :0.0000   Median :2.000  
 Mean   :0.6631   Mean   :0.6693   Mean   :0.2383   Mean   :1.779  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:2.000  
 Max.   :1.0000   Max.   :2.0000   Max.   :1.0000   Max.   :3.000  
   ethnicity        religion       religious           ACT       
 Min.   :0.000   Min.   :0.000   Min.   : 0.000   Min.   :12.00  
 1st Qu.:0.000   1st Qu.:0.000   1st Qu.: 1.000   1st Qu.:25.00  
 Median :1.000   Median :0.000   Median : 5.000   Median :27.00  
 Mean   :1.329   Mean   :2.278   Mean   : 4.215   Mean   :27.25  
 3rd Qu.:3.000   3rd Qu.:6.000   3rd Qu.: 7.000   3rd Qu.:30.00  
 Max.   :5.000   Max.   :7.000   Max.   :10.000   Max.   :36.00  
      GPA           partyHr           drinks        sexPartners    
 Min.   :1.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:3.100   1st Qu.: 1.000   1st Qu.: 0.000   1st Qu.: 0.000  
 Median :3.500   Median : 4.000   Median : 3.000   Median : 1.000  
 Mean   :3.424   Mean   : 6.059   Mean   : 6.858   Mean   : 2.797  
 3rd Qu.:3.900   3rd Qu.: 9.000   3rd Qu.:10.000   3rd Qu.: 3.000  
 Max.   :4.000   Max.   :50.000   Max.   :50.000   Max.   :50.000  
 relationships     firstKissAge     favPeriod     hoursCallParents
 Min.   : 0.000   Min.   : 3.00   Min.   :0.000   Min.   : 0.000  
 1st Qu.: 0.000   1st Qu.:14.00   1st Qu.:3.000   1st Qu.: 2.000  
 Median : 1.000   Median :15.00   Median :3.000   Median : 3.000  
 Mean   : 1.147   Mean   :16.46   Mean   :3.179   Mean   : 4.836  
 3rd Qu.: 2.000   3rd Qu.:18.00   3rd Qu.:4.000   3rd Qu.: 7.000  
 Max.   :25.000   Max.   :26.00   Max.   :4.000   Max.   :50.000  
  socialMedia         texts          good_well      parentRelationship
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000    
 1st Qu.: 1.000   1st Qu.: 3.000   1st Qu.: 4.000   1st Qu.: 7.000    
 Median : 2.000   Median : 5.000   Median : 5.000   Median : 8.000    
 Mean   : 3.181   Mean   : 5.793   Mean   : 5.307   Mean   : 8.027    
 3rd Qu.: 4.000   3rd Qu.: 7.000   3rd Qu.: 7.000   3rd Qu.:10.000    
 Max.   :10.000   Max.   :50.000   Max.   :10.000   Max.   :10.000    
     workHr       percentTuition       career     
 Min.   : 0.000   Min.   :  0.00   Min.   :0.000  
 1st Qu.: 0.000   1st Qu.: 20.00   1st Qu.:2.000  
 Median : 0.000   Median : 80.00   Median :2.000  
 Mean   : 5.123   Mean   : 62.74   Mean   :2.383  
 3rd Qu.:10.000   3rd Qu.:100.00   3rd Qu.:4.000  
 Max.   :50.000   Max.   :100.00   Max.   :5.000

lists the minimum, maximum, mean and 3 quartiles in each column. (Recall that the 1st quartile is the 25th percentile, the median is the 50th percentile, and the 3th quartile is the 75th percentile.) Since some of the columns are categorical, some of the numbers are meaningless. You can compare this summary table with the one on the data program: after loading the data, click Summary Statistics block at the top of the page.

The summary() function can also be applied to one column. For example, the command

summary(survey$drink)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   3.000   6.858  10.000  50.000

summarizes the number of drinks per week for students taking the survey. The command

summary(survey$gender)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  1.0000  0.6631  1.0000  1.0000

summarizes the values in the gender column. This column is categorical: “male” is represented by “0” and “female” is represented by “1”. The mean of this column represents the proportion of female students in the sample: \[mean(gender) = \frac{1}{\rm total\ number}\sum gender = \frac{\rm number\ of\ females}{\rm total\ number}\] So we see that 66.31% of the students who took the survey are female.

Convert Numbers to Descriptive Characters

Instead of using 0 and 1 to represent male and female, it may be more convenient to use “male” and “female” directly. It is easy to do that in R. To be safe, we copy the survey$gender vector to a new vector:

gender <- survey$gender

Then we replace all 0’s by “male” and all 1’s by “female”:

gender[gender==0] <- "male"
gender[gender==1] <- "female"

The command gender[gender==0] takes the subset of the gender vector with the value 0. The gender[gender==0] <- "male" command thus replaces all 0’s in gender to “male”. Similarly, gender[gender==1] <- "female" replaces all 1’s in gender to “female”.

Let’s check the first 10 elements of gender and compare them with those in survey$gender:

gender[1:10]

 [1] "female" "female" "female" "male"   "female" "female" "male"  
 [8] "female" "female" "male"

survey$gender[1:10]

 [1] 1 1 1 0 1 1 0 1 1 0

This shows that we have done it correctly. We can calculate the proportion of “female” in gender and see if it matches the 66.31% mentioned above.

mean(gender=="female")

[1] 0.6631486

The two proportions match, as expected. In the above command, gender=="female" is a vectorized operation, returning a logical vector with value TRUE for gender equal to “female” and FALSE otherwise. The command mean(gender=="female") then calculates the proportion of TRUE’s in the logical vector gender=="female".

The same trick can be applied to other categorical variables. According to the survey data page in the data program, in survey$ethnicity, “White” is coded as 0, “Black or African American” is 1, “Hispanic/Latino” is 2, “Asian” is 3, “Mixed” is 4, and “Other” is 5. The following code chunk change the numbers to descriptive characters:

ethnicity <- survey$ethnicity
ethnicity[ethnicity==0] <- "White"
ethnicity[ethnicity==1] <- "Black"
ethnicity[ethnicity==2] <- "Hispanic"
ethnicity[ethnicity==3] <- "Asian"
ethnicity[ethnicity==4] <- "Mixed"
ethnicity[ethnicity==5] <- "Other"

If you think typing these all out one by one is tedious, you can consider doing it using a for loop:

ethnicity2 <- survey$ethnicity
nums = 0:5
chars = c("White","Black","Hispanic","Asian","Mixed","Other")
for (i in seq_along(nums)) {
  ethnicity2[ethnicity2==nums[i]] <- chars[i]
}

The seq_along(nums) function is the same as 1:(length(nums)), generating an integer sequence 1, 2, 3, …, length(nums), or 1:6. We can check that ethnicity and ethnicity2 are identical:

identical(ethnicity,ethnicity2)

[1] TRUE

If you are happy with the result of the conversions, you can copy these character vectors back to the respective columns of the data frame survey:

survey$gender <- gender
survey$ethnicity <- ethnicity

`table()`

Another useful command is the table(x) function. It counts the numbers of each item in vector x. For example, gender is a vector containing only two categories: “male” and “female”. Therefore, the command

table(gender)

gender
female   male 
   754    383

counts the number of males and females in gender. We can convert the counts to percentages by dividing the counts by n (total number of students calculated above) and multiply by 100:

n <- length(gender)
table(gender)/n*100

gender
  female     male 
66.31486 33.68514

We can do the same with ethnicity:

table(ethnicity)/n*100

ethnicity
    Asian     Black  Hispanic     Mixed     Other     White 
26.385224  7.475814 11.697449  3.957784  1.407212 49.076517

This shows that the majority (49.1%) of students are White. The rest are: 26.4% Asian, 11.7% Hispanic/Latino, 7.5% Black or African American, 4% Mixed and 1.4% Other.

Other categorical variables can be analyzed in the same way. For example, the command

table(survey$religion)/n*100


        0         1         2         3         4         5         6 
53.649956  5.716799  2.902375  2.902375  3.430079  6.332454 13.280563 
        7 
11.785400

shows the percentages of students in each religion group. Since we don’t convert the numeric codes to descriptive characters in this column, we have to look up the description of this survey data in the data program to see what these numbers represent: 0=Christian; 1=Jewish; 2=Muslim; 3=Hindu; 4=Buddhist; 5=Religious but not one of the above; 6=Agnostic; 7=Atheist. To make things easier, we convert them to descriptive characters. This time let’s do it directly without copying to a new vector:

nums <- 0:7
chars <- c('Christian','Jewish','Muslim','Hindu','Buddhist','Other Religion',
           'Agnostic','Atheist')
for (i in seq_along(nums)) {
  survey$religion[ survey$religion==nums[i] ] <- chars[i]  
}

When we run the table() function again, we see

table(survey$religion)/n*100


      Agnostic        Atheist       Buddhist      Christian          Hindu 
     13.280563      11.785400       3.430079      53.649956       2.902375 
        Jewish         Muslim Other Religion 
      5.716799       2.902375       6.332454

Suppose we decide that we don’t like the order the summary table displays. We want to preserve the original order: ‘Christian’, ‘Jewish’, ‘Muslim’, ‘Hindu’, ‘Buddhist’, ‘Other Religion’, ‘Agnostic’, and ‘Atheist’. One way is to convert the character vector survey$religion to a factor vector with the levels set to the desired order (if you forget what a factor variable is, read section 5.11 of the textbook):

survey$religion <- factor(survey$religion, levels=c('Christian','Jewish','Muslim',
                          'Hindu','Buddhist','Other Religion','Agnostic','Atheist'))

table(survey$religion)/n*100


     Christian         Jewish         Muslim          Hindu       Buddhist 
     53.649956       5.716799       2.902375       2.902375       3.430079 
Other Religion       Agnostic        Atheist 
      6.332454      13.280563      11.785400

We see that the table() function now displays the summary in the order specified by the order of the levels.

When the summary() function is applied to a factor variable, it behaves the same way as table():

summary(survey$religion)/n*100

     Christian         Jewish         Muslim          Hindu       Buddhist 
     53.649956       5.716799       2.902375       2.902375       3.430079 
Other Religion       Agnostic        Atheist 
      6.332454      13.280563      11.785400

The table() function can also be used to generate a contingency table. For example, the command

table(survey$religion, survey$ethnicity)

                
                 Asian Black Hispanic Mixed Other White
  Christian         90    68       75    21     8   348
  Jewish             0     0        1     3     1    60
  Muslim            18     3        0     1     5     6
  Hindu             32     0        0     0     1     0
  Buddhist          34     0        1     1     0     3
  Other Religion    15     3       32     4     0    18
  Agnostic          53    10       10     7     0    71
  Atheist           58     1       14     8     1    52

shows the number of students in each religion in each ethnic group.

Export Data to Files

We can export a data frame to a file using the write.table() command. For example, to export survey to a file named ‘Stat100_Survey2_Fall2015.dat’ in the working directory, type

write.table(survey,'Stat100_Survey2_Fall2015.dat', row.names=FALSE)

The option row.names=FALSE tells R not to include the row names, which are just row numbers in our example. By default, each column is separated by a space.

A more commonly used data format is csv (comma separated values), where columns are separated by commas. CSV files can be opened by many software, including Excel. To export to a csv file, we specify the option sep=',' in the write.table() function. Alternatively, we can use the write.csv() function:

write.csv(survey,'Stat100_Survey2_Fall2015.csv', row.names=FALSE)

The file ‘Stat100_Survey2_Fall2015.csv’ can be opened using Excel and you will see that it has the same column names as the data frame. The gender, ethnicity and religion columns are characters instead of integers.

To load data from a csv file, use the read.csv() function:

survey_reload <- read.csv('Stat100_Survey2_Fall2015.csv')

The read.csv() function is just the read.table() function but with the default setting sep=',' and header=TRUE. If you examine this data frame, you will find something interesting:

class(survey_reload$gender)

[1] "factor"

class(survey_reload$ethnicity)

[1] "factor"

class(survey_reload$religion)

[1] "factor"

It is not surprising that the religion column is a factor vector since we converted it, but we see that the gender and ethnicity columns are also factors. In addition, the levels in the religion column are not in the same order as we specified above:

levels(survey_reload$religion)

[1] "Agnostic"       "Atheist"        "Buddhist"       "Christian"     
[5] "Hindu"          "Jewish"         "Muslim"         "Other Religion"

What is going on? It turns out that write.table() and write.csv() do not store column classes. When the file is being loaded, R converts columns containing strings to factors by default. If we want to preserve the column classes, we need to use the save() function:

save(survey, file='Stat100_Survey2_Fall2015.RData')

The extension .RData is a commonly used extension for R data files.

You can now quit R without worrying about losing the changes you have made. The next time you open R, type

load('Stat100_Survey2_Fall2015.RData')

to load the R data file. After typing this command, the data frame survey will appear in your working space. Try it!

You can type ls() to see a list of variables in your working space:

ls()

 [1] "chars"         "ethnicity"     "ethnicity2"    "gender"       
 [5] "highGPA"       "i"             "n"             "nums"         
 [9] "survey"        "survey_reload"

The command rm(...) deletes the variables from the working space. For example, after typing

rm(gender,chars,nums,survey_reload)

the 4 variables gender, chars, nums and survey_reload no longer exist:

ls()

[1] "ethnicity"  "ethnicity2" "highGPA"    "i"          "n"         
[6] "survey"

Since ls() returns a vector listing all the variables in the working space, the command

rm(list=ls())

clears all variables in the working space:

ls()

character(0)

If we load the R data file ‘Stat100_Survey2_Fall2015.RData’,

load('Stat100_Survey2_Fall2015.RData')

the data frame survey reappears in the working space:

ls()

[1] "survey"

class(survey)

[1] "data.frame"

The column classes, as well as the factor levels, are also the same as before:

class(survey$gender)

[1] "character"

class(survey$ethnicity)

[1] "character"

class(survey$religion)

[1] "factor"

levels(survey$religion)

[1] "Christian"      "Jewish"         "Muslim"         "Hindu"         
[5] "Buddhist"       "Other Religion" "Agnostic"       "Atheist"

We see that the save() function is useful to save R objects. The drawback, however, is that it cannot be read by other software.

Another method to save an R data to a file is to use the dput() function:

dput(survey, file='Stat100_Survey2_Fall2015.R')

Unlike save(), a file outputted by dput() is not in binary format. You can view the content using a text editor, although it is not meant to be read by a text editor. Since dput() outputs extra information such as the column classes of a data frame, the file size is larger than that outputted by write.table() or write.csv(). To load the file to R, we use the dget() function. Unlike load(), we can assign a different name for the object(s) in the file when we load the data:

survey2 <- dget('Stat100_Survey2_Fall2015.R')

Like save(), dput() preserves the data structure of the R object. So when we load the data we get exactly the same structure as the original data:

class(survey2$gender)

[1] "character"

class(survey2$ethnicity)

[1] "character"

class(survey2$religion)

[1] "factor"

levels(survey2$religion)

[1] "Christian"      "Jewish"         "Muslim"         "Hindu"         
[5] "Buddhist"       "Other Religion" "Agnostic"       "Atheist"

To confirm that survey2 is an exact copy of survey, type

identical(survey,survey2)

[1] TRUE