Here we demonstrate how to load a data file into R and perform simple manipulations. We will use the second survey of the Stat 100 classes, conducted in Fall 2015, to demonstrate the process.
You have been analyzing the Stat 100 survey data in the homework assignments. However, we reformatted the data before asking you to load them into R. Here we show you some of the techniques we use to reformat the data. Real-world data are generally messier and need a lot of cleaning up before they can be analyzed.
To download the survey data, go to Statistics Department’s data program website, choose stat100 from the first drop-down menu on the left, then choose Survey 2, Fall 2015 (combined). After the data are loaded, you should see the following:
This page lists all the questions in the survey and the coding used by the data program for the categorical data. Click the Display/Modify Data block on the upper left. You will enter a page that shows the data. Click the Download Raw Data box on the upper right of this page. Save the data file ‘2015fall_survey02_combined.dat’ to your computer. The columns in this file are separated by spaces.
To load the data file into R, first move the file to R’s working directory or, alternatively, change R’s working directory to the directory containing the file. You can find R’s current working directory using the getwd() command. If you forget how to set your working directory, watch one of these videos again: Windows, Mac.
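As a quick reminder, the working directory can also be inspected and changed from the R console. The sketch below uses tempdir() as a stand-in for the folder holding the data file; that choice is only an assumption for illustration, so substitute the folder where you actually saved the file.

```r
# Remember the current working directory so it can be restored later
old_dir <- getwd()

# Change to another directory; tempdir() is only a placeholder here
setwd(tempdir())

# List the .dat files R can see in the new working directory
dat_files <- list.files(pattern = "\\.dat$")
print(dat_files)

# Go back to where we started
setwd(old_dir)
```

If the survey file shows up in the list printed by list.files(), read.table() will be able to find it by its file name alone.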
Once the file ‘2015fall_survey02_combined.dat’ is in the working directory, type
survey <- read.table('2015fall_survey02_combined.dat', header=TRUE)
to load the data into a data frame named survey. The header=TRUE option tells R that the first row in the file is a header and the data begin in the second row. Using the dim() function, we can find out the number of rows and columns of survey:
dim(survey)
[1] 1137 23
It indicates that there are 1137 observations (rows) and 23 columns. The column names are
names(survey)
[1] "Gender.0.Male.1.Female.2.Other.."
[2] "Gender_ID.0.Male.1.Female.2.Other.."
[3] "Greek.0.No.1.Yes.."
[4] "Home_Town.0.Small_Town.1.Medium_City.2.Big_City_Suburb.3.Big_City.."
[5] "Ethnicity.0.White.1.Black.2.Hispanic.3.Asian.4.Mixed.5.Other.."
[6] "Religion.0.Christian.1.Jewish.2.Muslim.3.Hindu.4.Buddhist.5.Other_Religion.6.Agnostic.7.Atheist.."
[7] "Religious"
[8] "ACT"
[9] "GPA"
[10] "Party_Hours_per_week"
[11] "Drinks_per_week"
[12] "Num._Sex_Partners"
[13] "Num._Relationships"
[14] "First_Kiss_Age.26.still_waiting.."
[15] "Fav._Life_Period.0.Kindergarten.1.lower_grades.2.middle_school.3.High_School.4.College.."
[16] "Hours_call_parents"
[17] "Social_media.10.V10_or_more.."
[18] "Texts"
[19] "Good_or_Well"
[20] "Parent_Relationship"
[21] "Work_Hours_per_week"
[22] "Percent_Parents_pay_tuition"
[23] "Career.0.No_Idea.1.Some_Idea.2.Nervous.3.Concrete_plan.4.Grad_School.5.Other.."
When the file is loaded using the read.table() function with header=TRUE, R tries to assign the column names based on the text in the first row. R checks the names and makes sure that they are syntactically valid variable names: a valid name cannot contain spaces; it consists of letters, numbers, and the dot or underscore characters, and it starts with a letter or with a dot not followed by a number. The column names in the file ‘2015fall_survey02_combined.dat’ contain a lot of information that is needed by the data program, but many of them are invalid variable names in R. As a result, R adjusts the names, replacing each invalid character with a dot. If you want to preserve the column names as they are in the file, you can use the option check.names=FALSE:
survey <- read.table('2015fall_survey02_combined.dat', header=TRUE, check.names=FALSE)
names(survey)
[1] "Gender(0=Male;1=Female;2=Other;)"
[2] "Gender_ID(0=Male;1=Female;2=Other;)"
[3] "Greek(0=No;1=Yes;)"
[4] "Home_Town(0=Small_Town;1=Medium_City;2=Big_City_Suburb;3=Big_City;)"
[5] "Ethnicity(0=White;1=Black;2=Hispanic;3=Asian;4=Mixed;5=Other;)"
[6] "Religion(0=Christian;1=Jewish;2=Muslim;3=Hindu;4=Buddhist;5=Other_Religion;6=Agnostic;7=Atheist;)"
[7] "Religious"
[8] "ACT"
[9] "GPA"
[10] "Party_Hours_per_week"
[11] "Drinks_per_week"
[12] "Num._Sex_Partners"
[13] "Num._Relationships"
[14] "First_Kiss_Age(26=still_waiting;)"
[15] "Fav._Life_Period(0=Kindergarten;1=lower_grades;2=middle_school;3=High_School;4=College;)"
[16] "Hours_call_parents"
[17] "Social_media(10=V10_or_more;)"
[18] "Texts"
[19] "Good_or_Well"
[20] "Parent_Relationship"
[21] "Work_Hours_per_week"
[22] "Percent_Parents_pay_tuition"
[23] "Career(0=No_Idea;1=Some_Idea;2=Nervous;3=Concrete_plan;4=Grad_School;5=Other;)"
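The adjustment that read.table() applies when check.names=TRUE is carried out by the base function make.names(). The short sketch below applies it by hand to two of the raw headers shown above plus one name that is already valid, so you can see where the dots come from.

```r
# Two of the raw headers from the data file, plus one that is already valid
raw_names <- c("Gender(0=Male;1=Female;2=Other;)", "Greek(0=No;1=Yes;)", "GPA")

# make.names() replaces every invalid character with a dot;
# unique = TRUE also guards against duplicated names
fixed_names <- make.names(raw_names, unique = TRUE)
print(fixed_names)
```

The first two results match the adjusted names R produced earlier, while "GPA" is left unchanged.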
Even though the column names have now been preserved, they are still very inconvenient to use. We can change the names by assigning different names to names(survey):
names(survey) <- c("gender","genderID","greek","homeTown","ethnicity",
"religion","religious","ACT","GPA","partyHr",
"drinks","sexPartners","relationships","firstKissAge",
"favPeriod","hoursCallParents","socialMedia","texts",
"good_well","parentRelationship","workHr",
"percentTuition","career")
names(survey)
[1] "gender" "genderID" "greek"
[4] "homeTown" "ethnicity" "religion"
[7] "religious" "ACT" "GPA"
[10] "partyHr" "drinks" "sexPartners"
[13] "relationships" "firstKissAge" "favPeriod"
[16] "hoursCallParents" "socialMedia" "texts"
[19] "good_well" "parentRelationship" "workHr"
[22] "percentTuition" "career"
These column names have been shortened for convenience. The detailed description of each column can be found on the survey data page at the data program.
To take a look at the data in R, type View(survey). R opens a new window and displays the data in a spreadsheet.
If we want to perform data analysis, probably the first thing we want to check is if there are missing values in the data frame. This can easily be done using the command
sum(is.na(survey))
[1] 0
This command counts the number of NAs in the data frame, which is 0 in our case, meaning that there are no missing values.
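When a data set does contain missing values, it helps to know where they are. The following sketch uses a small made-up data frame (the name toy and its values are invented for illustration) to count NAs per column and to keep only the complete rows.

```r
# A small made-up data frame with two missing values
toy <- data.frame(ACT = c(27, NA, 30),
                  GPA = c(3.7, 3.4, NA),
                  texts = c(5, 2, 8))

# Total number of NAs, as in the text
total_na <- sum(is.na(toy))
print(total_na)

# Number of NAs in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only the rows that have no missing values
complete_rows <- toy[complete.cases(toy), ]
print(complete_rows)
```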
We can look at particular rows or columns of the data in survey using the subsetting methods appropriate to data frames. For example, to look at the 178th row, type
survey[178,]
gender genderID greek homeTown ethnicity religion religious ACT GPA
178 1 1 0 2 0 1 5 25 3.7
partyHr drinks sexPartners relationships firstKissAge favPeriod
178 0 0 1 1 16 4
hoursCallParents socialMedia texts good_well parentRelationship workHr
178 14 5 5 6 10 0
percentTuition career
178 100 4
This command shows all columns in row 178. If we only want to look at columns 1, 4 and 5, we can type
survey[178,c(1,4,5)]
gender homeTown ethnicity
178 1 2 0
To look at the first 10 rows of column 4, type
survey[1:10,4]
[1] 1 2 2 2 0 3 2 2 3 1
Usually we don’t remember what each column represents. That is why we went to the trouble of setting and changing the column names. It is much easier to pull up a particular column using its column name. For example, to see the first 20 rows of the “GPA” column, type
survey$GPA[1:20]
[1] 3.7 3.4 3.6 3.0 4.0 3.2 1.7 4.0 2.7 2.7 4.0 3.8 3.8 3.4 3.3 3.1 3.6
[18] 4.0 3.2 3.0
or survey[["GPA"]][1:20]. These two commands are equivalent to survey[1:20,9] and survey[1:20,"GPA"], since the “GPA” column is column 9. It is also possible to refer to each column by name without the prefix survey$ by first running the command attach(survey). However, I discourage beginners from using it, as it can lead to confusion. See “Good practice” in the ?attach help page for details.
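One way to avoid typing the prefix repeatedly without using attach() is the base function with(), which evaluates a single expression inside a data frame. The sketch below uses a small made-up data frame (toy is an invented name) rather than the survey data.

```r
# A small made-up data frame
toy <- data.frame(gender = c(1, 1, 0, 1),
                  GPA = c(3.7, 3.4, 3.0, 4.0))

# Column names can be used directly inside with()
mean_gpa <- with(toy, mean(GPA))
print(mean_gpa)

# Mean GPA among rows where gender is coded as 1
mean_gpa_female <- with(toy, mean(GPA[gender == 1]))
print(mean_gpa_female)
```

Unlike attach(), the column names are only visible inside the with() call, so they cannot be confused with other variables of the same name later on.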
Column names are much easier to use when they are short and descriptive. For example, to look at the first 10 rows in columns gender, ACT and GPA, we can use the following command:
survey[1:10, c("gender","ACT","GPA")]
gender ACT GPA
1 1 27 3.7
2 1 25 3.4
3 1 27 3.6
4 0 29 3.0
5 1 25 4.0
6 1 27 3.2
7 0 35 1.7
8 1 30 4.0
9 1 23 2.7
10 0 27 2.7
Sometimes, we want to add new columns to a data frame. For example, to add a column called “one”, use the command
survey$one <- 1
Since “one” was not in the data frame, the above command creates a new column with the name “one”. We can also use
survey[["all 2's"]] <- 2
to create a new column named “all 2’s”. When we type names(survey), we see that these new columns are appended as the last two columns of the data frame:
names(survey)
[1] "gender" "genderID" "greek"
[4] "homeTown" "ethnicity" "religion"
[7] "religious" "ACT" "GPA"
[10] "partyHr" "drinks" "sexPartners"
[13] "relationships" "firstKissAge" "favPeriod"
[16] "hoursCallParents" "socialMedia" "texts"
[19] "good_well" "parentRelationship" "workHr"
[22] "percentTuition" "career" "one"
[25] "all 2's"
To remove a column from the data frame, we set it to NULL:
survey$one <- NULL
survey[["all 2's"]] <- NULL
names(survey)
[1] "gender" "genderID" "greek"
[4] "homeTown" "ethnicity" "religion"
[7] "religious" "ACT" "GPA"
[10] "partyHr" "drinks" "sexPartners"
[13] "relationships" "firstKissAge" "favPeriod"
[16] "hoursCallParents" "socialMedia" "texts"
[19] "good_well" "parentRelationship" "workHr"
[22] "percentTuition" "career"
Suppose we are interested in students with a GPA greater than 3.8. We can use the command
highGPA <- survey[survey$GPA > 3.8,]
to take a subset of the data frame for students with GPA greater than 3.8. The command
highGPA$GPA
[1] 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0
[18] 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0
[35] 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9
[52] 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 3.9 3.9 4.0 4.0
[69] 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0
[86] 4.0 4.0 4.0 4.0 4.0 3.9 3.9 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
[103] 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0
[120] 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0
[137] 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
[154] 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 3.9 4.0 4.0 4.0 4.0 3.9 4.0
[171] 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0
[188] 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
[205] 3.9 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 3.9 3.9 4.0
[222] 4.0 4.0 4.0 3.9 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
[239] 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 3.9 3.9 4.0
[256] 3.9 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 3.9 4.0 3.9
[273] 3.9 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
[290] 4.0 4.0 4.0 4.0 4.0 3.9 3.9 4.0 4.0 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0
[307] 4.0 4.0 4.0 4.0 3.9 4.0 4.0 4.0 4.0 4.0 4.0 3.9 3.9 4.0 4.0
displays the students’ GPA in this new data frame. The total number of students in this group is the number of rows in highGPA:
nrow(highGPA)
[1] 321
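The same kind of selection can be written with the base function subset(), and conditions can be combined with & (and) or | (or). The sketch below uses a made-up data frame (toy is an invented name) that contains one missing GPA, to show a difference that does not arise in the survey data because it has no NAs.

```r
# A small made-up data frame with one missing GPA
toy <- data.frame(gender = c(1, 0, 1, 0, 1),
                  GPA = c(3.9, 4.0, 3.2, NA, 3.85))

# Logical indexing returns an all-NA row where the condition is NA
by_index <- toy[toy$GPA > 3.8, ]
print(nrow(by_index))

# subset() drops rows where the condition is NA
by_subset <- subset(toy, GPA > 3.8)
print(nrow(by_subset))

# Two conditions combined with &
high_gpa_female <- subset(toy, GPA > 3.8 & gender == 1)
print(high_gpa_female)

# Count matching rows without creating a new data frame
n_high <- sum(toy$GPA > 3.8, na.rm = TRUE)
print(n_high)
```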
summary()
The function summary(x) shows a table summarizing the object x. For example,
summary(survey)
gender genderID greek homeTown
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.000
Median :1.0000 Median :1.0000 Median :0.0000 Median :2.000
Mean :0.6631 Mean :0.6693 Mean :0.2383 Mean :1.779
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:2.000
Max. :1.0000 Max. :2.0000 Max. :1.0000 Max. :3.000
ethnicity religion religious ACT
Min. :0.000 Min. :0.000 Min. : 0.000 Min. :12.00
1st Qu.:0.000 1st Qu.:0.000 1st Qu.: 1.000 1st Qu.:25.00
Median :1.000 Median :0.000 Median : 5.000 Median :27.00
Mean :1.329 Mean :2.278 Mean : 4.215 Mean :27.25
3rd Qu.:3.000 3rd Qu.:6.000 3rd Qu.: 7.000 3rd Qu.:30.00
Max. :5.000 Max. :7.000 Max. :10.000 Max. :36.00
GPA partyHr drinks sexPartners
Min. :1.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.:3.100 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.: 0.000
Median :3.500 Median : 4.000 Median : 3.000 Median : 1.000
Mean :3.424 Mean : 6.059 Mean : 6.858 Mean : 2.797
3rd Qu.:3.900 3rd Qu.: 9.000 3rd Qu.:10.000 3rd Qu.: 3.000
Max. :4.000 Max. :50.000 Max. :50.000 Max. :50.000
relationships firstKissAge favPeriod hoursCallParents
Min. : 0.000 Min. : 3.00 Min. :0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.:14.00 1st Qu.:3.000 1st Qu.: 2.000
Median : 1.000 Median :15.00 Median :3.000 Median : 3.000
Mean : 1.147 Mean :16.46 Mean :3.179 Mean : 4.836
3rd Qu.: 2.000 3rd Qu.:18.00 3rd Qu.:4.000 3rd Qu.: 7.000
Max. :25.000 Max. :26.00 Max. :4.000 Max. :50.000
socialMedia texts good_well parentRelationship
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 1.000 1st Qu.: 3.000 1st Qu.: 4.000 1st Qu.: 7.000
Median : 2.000 Median : 5.000 Median : 5.000 Median : 8.000
Mean : 3.181 Mean : 5.793 Mean : 5.307 Mean : 8.027
3rd Qu.: 4.000 3rd Qu.: 7.000 3rd Qu.: 7.000 3rd Qu.:10.000
Max. :10.000 Max. :50.000 Max. :10.000 Max. :10.000
workHr percentTuition career
Min. : 0.000 Min. : 0.00 Min. :0.000
1st Qu.: 0.000 1st Qu.: 20.00 1st Qu.:2.000
Median : 0.000 Median : 80.00 Median :2.000
Mean : 5.123 Mean : 62.74 Mean :2.383
3rd Qu.:10.000 3rd Qu.:100.00 3rd Qu.:4.000
Max. :50.000 Max. :100.00 Max. :5.000
lists the minimum, maximum, mean and three quartiles of each column. (Recall that the 1st quartile is the 25th percentile, the median is the 50th percentile, and the 3rd quartile is the 75th percentile.) Since some of the columns are categorical, some of the numbers are meaningless. You can compare this summary table with the one on the data program: after loading the data, click the Summary Statistics block at the top of the page.
The summary() function can also be applied to a single column. For example, the command
summary(survey$drinks)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 3.000 6.858 10.000 50.000
summarizes the number of drinks per week for students taking the survey. The command
summary(survey$gender)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 1.0000 0.6631 1.0000 1.0000
summarizes the values in the gender column. This column is categorical: “male” is represented by “0” and “female” is represented by “1”. The mean of this column represents the proportion of female students in the sample: \[mean(gender) = \frac{1}{\rm total\ number}\sum gender = \frac{\rm number\ of\ females}{\rm total\ number}\] So we see that 66.31% of the students who took the survey are female.
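The identity above holds for any vector coded with 0’s and 1’s. A minimal check on a made-up vector (the values are invented for illustration):

```r
# A made-up 0/1 vector: 5 ones out of 8 values
g <- c(1, 1, 0, 1, 0, 1, 1, 0)

# The mean of a 0/1 vector ...
mean_g <- mean(g)

# ... equals the number of 1's divided by the total number
prop_ones <- sum(g == 1) / length(g)

print(c(mean_g, prop_ones))
```

Note that this interpretation only works when the column contains no codes other than 0 and 1.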
Instead of using 0 and 1 to represent male and female, it may be more convenient to use “male” and “female” directly. It is easy to do that in R. To be safe, we copy the survey$gender vector to a new vector:
gender <- survey$gender
Then we replace all 0’s by “male” and all 1’s by “female”:
gender[gender==0] <- "male"
gender[gender==1] <- "female"
The expression gender[gender==0] selects the elements of the gender vector that have the value 0. The gender[gender==0] <- "male" command thus replaces all 0’s in gender with “male”. Similarly, gender[gender==1] <- "female" replaces all 1’s in gender with “female”.
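One detail worth knowing is that a vector in R holds only one type of value. As soon as a string is assigned into a numeric vector, the whole vector becomes a character vector, and the remaining numbers turn into strings such as "1". The second replacement still works because R converts the number 1 to "1" before comparing. A small sketch on a made-up vector:

```r
# A made-up numeric vector of codes
x <- c(0, 1, 1, 0)
print(class(x))

# Assigning a string converts the whole vector to character
x[x == 0] <- "male"
print(class(x))
print(x)

# The comparison x == 1 still matches the strings "1"
x[x == 1] <- "female"
print(x)
```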
Let’s check the first 10 elements of gender and compare them with those in survey$gender:
gender[1:10]
[1] "female" "female" "female" "male" "female" "female" "male"
[8] "female" "female" "male"
survey$gender[1:10]
[1] 1 1 1 0 1 1 0 1 1 0
This shows that we have done it correctly. We can calculate the proportion of “female” in gender and see if it matches the 66.31% mentioned above.
mean(gender=="female")
[1] 0.6631486
The two proportions match, as expected. In the above command, gender=="female" is a vectorized operation, returning a logical vector with value TRUE for gender equal to “female” and FALSE otherwise. The command mean(gender=="female") then calculates the proportion of TRUE’s in the logical vector gender=="female".
The same trick can be applied to other categorical variables. According to the survey data page in the data program, in survey$ethnicity, “White” is coded as 0, “Black or African American” is 1, “Hispanic/Latino” is 2, “Asian” is 3, “Mixed” is 4, and “Other” is 5. The following code chunk changes the numbers to descriptive character strings:
ethnicity <- survey$ethnicity
ethnicity[ethnicity==0] <- "White"
ethnicity[ethnicity==1] <- "Black"
ethnicity[ethnicity==2] <- "Hispanic"
ethnicity[ethnicity==3] <- "Asian"
ethnicity[ethnicity==4] <- "Mixed"
ethnicity[ethnicity==5] <- "Other"
If you think typing these all out one by one is tedious, you can consider doing it using a for loop:
ethnicity2 <- survey$ethnicity
nums <- 0:5
chars <- c("White","Black","Hispanic","Asian","Mixed","Other")
for (i in seq_along(nums)) {
ethnicity2[ethnicity2==nums[i]] <- chars[i]
}
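The seq_along(nums) function here gives the same result as 1:length(nums), generating the integer sequence 1, 2, 3, …, length(nums), that is, 1:6. (The two differ only when the vector is empty, in which case seq_along() returns an empty sequence.) We can check that ethnicity and ethnicity2 are identical: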
identical(ethnicity,ethnicity2)
[1] TRUE
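The loop can also be avoided altogether. The sketch below shows two alternatives on a made-up vector of codes (codes is an invented name): indexing the label vector directly, and using factor() with the labels argument. Both assume that every code lies between 0 and 5.

```r
# A made-up vector of ethnicity codes
codes <- c(0, 3, 2, 0, 5, 1, 4)
chars <- c("White", "Black", "Hispanic", "Asian", "Mixed", "Other")

# Codes start at 0 but R indexes from 1, so shift by one
by_lookup <- chars[codes + 1]
print(by_lookup)

# factor() maps each level to its label; convert back to character
by_factor <- as.character(factor(codes, levels = 0:5, labels = chars))

# Both approaches give the same result
print(identical(by_lookup, by_factor))
```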
If you are happy with the result of the conversions, you can copy these character vectors back to the respective columns of the data frame survey:
survey$gender <- gender
survey$ethnicity <- ethnicity
table()
Another useful command is the table(x) function. It counts the number of occurrences of each distinct value in the vector x. For example, gender is a vector containing only two categories: “male” and “female”. Therefore, the command
table(gender)
gender
female male
754 383
counts the number of males and females in gender. We can convert the counts to percentages by dividing the counts by n (the total number of students) and multiplying by 100:
n <- length(gender)
table(gender)/n*100
gender
female male
66.31486 33.68514
We can do the same with ethnicity
:
table(ethnicity)/n*100
ethnicity
Asian Black Hispanic Mixed Other White
26.385224 7.475814 11.697449 3.957784 1.407212 49.076517
This shows that the largest group of students (49.1%) is White, although this is just short of a majority. The rest are: 26.4% Asian, 11.7% Hispanic/Latino, 7.5% Black or African American, 4.0% Mixed and 1.4% Other.
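Dividing by n by hand works well; base R also provides prop.table(), which converts a table of counts to proportions directly. A minimal sketch on a made-up vector (g is an invented name), with round() used to tidy the output:

```r
# A made-up vector with two categories
g <- c("female", "male", "female", "female")

# Counts, then proportions, then percentages rounded to one decimal place
counts <- table(g)
pct <- round(prop.table(counts) * 100, 1)
print(pct)
```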
Other categorical variables can be analyzed in the same way. For example, the command
table(survey$religion)/n*100
0 1 2 3 4 5 6
53.649956 5.716799 2.902375 2.902375 3.430079 6.332454 13.280563
7
11.785400
shows the percentages of students in each religion group. Since we have not converted the numeric codes in this column to descriptive character strings, we have to look up the description of this survey data in the data program to see what these numbers represent: 0=Christian; 1=Jewish; 2=Muslim; 3=Hindu; 4=Buddhist; 5=Religious but not one of the above; 6=Agnostic; 7=Atheist. To make things easier, we convert them to descriptive character strings. This time let’s do it directly without copying to a new vector:
nums <- 0:7
chars <- c('Christian','Jewish','Muslim','Hindu','Buddhist','Other Religion',
'Agnostic','Atheist')
for (i in seq_along(nums)) {
survey$religion[ survey$religion==nums[i] ] <- chars[i]
}
When we run the table()
function again, we see
table(survey$religion)/n*100
Agnostic Atheist Buddhist Christian Hindu
13.280563 11.785400 3.430079 53.649956 2.902375
Jewish Muslim Other Religion
5.716799 2.902375 6.332454
Suppose we decide that we don’t like the order in which the summary table is displayed. We want to preserve the original order: ‘Christian’, ‘Jewish’, ‘Muslim’, ‘Hindu’, ‘Buddhist’, ‘Other Religion’, ‘Agnostic’, and ‘Atheist’. One way is to convert the character vector survey$religion to a factor vector with the levels set to the desired order (if you forget what a factor variable is, read section 5.11 of the textbook):
survey$religion <- factor(survey$religion, levels=c('Christian','Jewish','Muslim',
'Hindu','Buddhist','Other Religion','Agnostic','Atheist'))
table(survey$religion)/n*100
Christian Jewish Muslim Hindu Buddhist
53.649956 5.716799 2.902375 2.902375 3.430079
Other Religion Agnostic Atheist
6.332454 13.280563 11.785400
We see that the table() function now displays the summary in the order specified by the order of the levels.
When the summary() function is applied to a factor variable, it behaves the same way as table():
summary(survey$religion)/n*100
Christian Jewish Muslim Hindu Buddhist
53.649956 5.716799 2.902375 2.902375 3.430079
Other Religion Agnostic Atheist
6.332454 13.280563 11.785400
The table() function can also be used to generate a contingency table. For example, the command
table(survey$religion, survey$ethnicity)
Asian Black Hispanic Mixed Other White
Christian 90 68 75 21 8 348
Jewish 0 0 1 3 1 60
Muslim 18 3 0 1 5 6
Hindu 32 0 0 0 1 0
Buddhist 34 0 1 1 0 3
Other Religion 15 3 32 4 0 18
Agnostic 53 10 10 7 0 71
Atheist 58 1 14 8 1 52
shows the number of students in each religion in each ethnic group.
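Counts in a contingency table are often easier to compare as percentages. The sketch below uses two short made-up vectors (not the survey data) to add row and column totals with addmargins() and to compute row percentages with prop.table().

```r
# Two made-up categorical vectors of equal length
religion <- c("Christian", "Christian", "Hindu", "Christian", "Hindu", "Christian")
ethnicity <- c("White", "Asian", "Asian", "White", "Asian", "Black")

# Two-way table of counts
tab <- table(religion, ethnicity)
print(tab)

# The same table with row and column totals added
print(addmargins(tab))

# Percentages within each row (margin = 1 means rows)
row_pct <- prop.table(tab, margin = 1) * 100
print(row_pct)
```

Using margin = 2 instead gives percentages within each column.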
We can export a data frame to a file using the write.table() command. For example, to export survey to a file named ‘Stat100_Survey2_Fall2015.dat’ in the working directory, type
write.table(survey,'Stat100_Survey2_Fall2015.dat', row.names=FALSE)
The option row.names=FALSE tells R not to include the row names, which are just row numbers in our example. By default, each column is separated by a space.
A more commonly used data format is csv (comma-separated values), where columns are separated by commas. CSV files can be opened by many programs, including Excel. To export to a csv file, we specify the option sep=',' in the write.table() function. Alternatively, we can use the write.csv() function:
write.csv(survey,'Stat100_Survey2_Fall2015.csv', row.names=FALSE)
The file ‘Stat100_Survey2_Fall2015.csv’ can be opened using Excel and you will see that it has the same column names as the data frame. The gender, ethnicity and religion columns are characters instead of integers.
To load data from a csv file, use the read.csv()
function:
survey_reload <- read.csv('Stat100_Survey2_Fall2015.csv')
The read.csv() function is essentially the read.table() function with different default settings, notably sep=',' and header=TRUE. If you examine this data frame, you will find something interesting:
class(survey_reload$gender)
[1] "factor"
class(survey_reload$ethnicity)
[1] "factor"
class(survey_reload$religion)
[1] "factor"
It is not surprising that the religion column is a factor vector since we converted it, but we see that the gender and ethnicity columns are also factors. In addition, the levels in the religion column are not in the same order as we specified above:
levels(survey_reload$religion)
[1] "Agnostic" "Atheist" "Buddhist" "Christian"
[5] "Hindu" "Jewish" "Muslim" "Other Religion"
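What is going on? It turns out that write.table() and write.csv() do not store column classes. When the file is loaded, versions of R before 4.0.0 convert columns containing strings to factors by default, with the levels sorted alphabetically; this is the behavior shown above. (Since R 4.0.0 the default is stringsAsFactors=FALSE, so these columns are loaded as character vectors instead.) Either way, the level order we specified is lost. If we want to preserve the column classes, we need to use the save() function: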
save(survey, file='Stat100_Survey2_Fall2015.RData')
The extension .RData is a commonly used extension for R data files.
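If the data must stay in csv form, for example to share it with Excel users, the classes can be rebuilt after loading. The sketch below round-trips a small made-up data frame through a temporary csv file; toy, csv_file and reloaded are invented names. Setting stringsAsFactors explicitly makes the result independent of the R version.

```r
# A made-up data frame with a factor whose levels are in a chosen order
wanted_levels <- c("Christian", "Jewish", "Atheist")
toy <- data.frame(
  religion = factor(c("Jewish", "Christian", "Atheist"), levels = wanted_levels),
  GPA = c(3.4, 3.7, 3.9)
)

# Write to a temporary csv file and read it back
csv_file <- tempfile(fileext = ".csv")
write.csv(toy, csv_file, row.names = FALSE)
reloaded <- read.csv(csv_file, stringsAsFactors = FALSE)

# The column comes back as plain character strings
print(class(reloaded$religion))

# Rebuild the factor with the desired level order
reloaded$religion <- factor(reloaded$religion, levels = wanted_levels)
print(levels(reloaded$religion))

# Remove the temporary file
unlink(csv_file)
```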
You can now quit R without worrying about losing the changes you have made. The next time you open R, type
load('Stat100_Survey2_Fall2015.RData')
to load the R data file. After typing this command, the data frame survey will appear in your working space. Try it!
You can type ls() to see a list of variables in your working space:
ls()
[1] "chars" "ethnicity" "ethnicity2" "gender"
[5] "highGPA" "i" "n" "nums"
[9] "survey" "survey_reload"
The command rm(...) deletes the variables from the working space. For example, after typing
rm(gender,chars,nums,survey_reload)
the 4 variables gender, chars, nums and survey_reload no longer exist:
ls()
[1] "ethnicity" "ethnicity2" "highGPA" "i" "n"
[6] "survey"
Since ls()
returns a vector listing all the variables in the working space, the command
rm(list=ls())
clears all variables in the working space:
ls()
character(0)
If we load the R data file ‘Stat100_Survey2_Fall2015.RData’,
load('Stat100_Survey2_Fall2015.RData')
the data frame survey
reappears in the working space:
ls()
[1] "survey"
class(survey)
[1] "data.frame"
The column classes, as well as the factor levels, are also the same as before:
class(survey$gender)
[1] "character"
class(survey$ethnicity)
[1] "character"
class(survey$religion)
[1] "factor"
levels(survey$religion)
[1] "Christian" "Jewish" "Muslim" "Hindu"
[5] "Buddhist" "Other Religion" "Agnostic" "Atheist"
We see that the save() function is useful for saving R objects. The drawback, however, is that the resulting file cannot be read by other software.
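A closely related pair of base functions, saveRDS() and readRDS(), is not used in this tutorial but may be worth knowing. It stores a single object, also in an R-specific format, and lets you choose the variable name when reading it back. A minimal sketch with a made-up data frame and a temporary file (toy, rds_file and toy_copy are invented names):

```r
# A made-up data frame with a character column and a factor column
toy <- data.frame(
  gender = c("female", "male"),
  religion = factor(c("Jewish", "Christian"), levels = c("Christian", "Jewish")),
  stringsAsFactors = FALSE
)

# Save the single object to a temporary file
rds_file <- tempfile(fileext = ".rds")
saveRDS(toy, rds_file)

# Read it back under any name we like
toy_copy <- readRDS(rds_file)

# Classes and factor levels are preserved
print(identical(toy, toy_copy))

# Remove the temporary file
unlink(rds_file)
```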
Another way to save an R object to a file is to use the dput() function:
dput(survey, file='Stat100_Survey2_Fall2015.R')
Unlike save(), dput() does not write a binary file. It writes a text representation of the object as R code, so you can view the content using a text editor, although it is not meant to be read by people. Since dput() writes extra information such as the column classes of a data frame, the file is larger than the one written by write.table() or write.csv(). To load the file into R, we use the dget() function. Unlike load(), dget() returns the object, so we can assign it to a different name when we load the data:
survey2 <- dget('Stat100_Survey2_Fall2015.R')
Like save(), dput() preserves the data structure of the R object. So when we load the data we get exactly the same structure as the original data:
class(survey2$gender)
[1] "character"
class(survey2$ethnicity)
[1] "character"
class(survey2$religion)
[1] "factor"
levels(survey2$religion)
[1] "Christian" "Jewish" "Muslim" "Hindu"
[5] "Buddhist" "Other Religion" "Agnostic" "Atheist"
To confirm that survey2 is an exact copy of survey, type
identical(survey,survey2)
[1] TRUE