Here we introduce several useful functions in R: mean(), median(), max(), min(), sort(), var(), sd(), cor, which.min and which.max.

Mean, Median, Maximum, Minimum and Sorting

Let’s first create a numeric vector:

x <- c(1.4, 5.66, 7.13, 9.21)

The max() and min() functions return the maximum and minimum of a vector:

max(x)
[1] 9.21
min(x)
[1] 1.4

The mean() and median() functions return the mean and median:

mean(x)
[1] 5.85
median(x)
[1] 6.395

The sort() function sorts the vector in increasing order:

sort(x)
[1] 1.40 5.66 7.13 9.21

There are other optional parameters we can set in these functions. Use, e.g., ?sort in the R console to pull up a help page. For example, we can sort the vector in decreasing order using

sort(x, decreasing=TRUE)
[1] 9.21 7.13 5.66 1.40

Variance and Standard Deviation

The var() and sd() functions calculate the variance and standard deviation of a vector. Note that they are defined as \[Var(X)= \frac{1}{N-1}\sum_{i=1}^N (X_i -\mu_X)^2 \ \ \ , \ \ \ sd(X)=\sqrt{Var(X)} ,\] where \(N\) is the number of elements in \(X\) and \(\mu_X\) is the mean of \(X\). These are called the sample variance and sample standard deviation. They are used to estimated the population variance and population standard deviation based on the random sample. Recall that as you have learned in Stat 100, the population variance \(\sigma^2\) is defined as \[ \sigma^2 = \frac{1}{N}\sum_{i=1}^N (X_i -\mu_X)^2\] So you have to be careful in using the var() and sd() functions in R. For example, if we want to calculate the population standard deviation, defined as the square root of \(\sigma^2\) above, for the x vector, we need to use the expression

stdx <- sd(x) * sqrt((length(x)-1)/length(x))

Here length(x) returns the number of elements in the vector x:

length(x)
[1] 4

The factor sqrt((length(x)-1)/length(x)), or \(\sqrt{(N-1)/N}\), takes into account the \(1/N\) and \(1/(N-1)\) differences between the sd() function and the expression for \(\sigma\). Another way of calculating \(\sigma\) is to use the fact that \(\sigma\) is the square root of the mean of \((X-\mu_X)^2\):

stdx2 <- sqrt( mean( (x-mean(x))^2 ) )

We can confirm that these two expressions give the same result:

c(stdx, stdx2, stdx-stdx2)
[1] 2.862106 2.862106 0.000000

Correlation

Now that we have mean and standard deviation, we can calculate the Z-score, defined as \[ Z_i = \frac{X_i-\mu_X}{\sigma}\] This can be done easily with R’s vectorized operation:

Zx <- (x-mean(x))/stdx
Zx
[1] -1.55479923 -0.06638469  0.44722315  1.17396077

Consider another numeric vector y of length 4:

y <- c(1.53, -3.45, 6.7, 4.63)

We can easily convert it to the Z score as well:

stdy <- sd(y) * sqrt((length(y)-1)/length(y))
Zy <- (y-mean(y))/stdy
Zy
[1] -0.2151968 -1.5181512  1.1374687  0.5958793

Recall that the correlation coefficient \(r\) of two sets of variables X and Y is defined as the mean of the product their Z scores: \[ r = \frac{1}{N}\sum_{i=1}^N Z_{xi} Z_{yi} \] The correlation between the vectors x and y can be calculated using

mean(Zx*Zy)
[1] 0.4109028

Since the correlation coefficient is an important concept in statistics, R already has a built-in function, cor(), to compute it directly:

cor(x,y)
[1] 0.4109028

To demonstrate another use of the cor() function, we create a third vector:

z <- c(2.39,3.19,8.31,-4.67)

With 3 sets of data x, y, z, we can calculate the correlation coefficients between all pairs of the 3 variables:

cor(x,y)
[1] 0.4109028
cor(x,z)
[1] -0.3078786
cor(y,z)
[1] 0.07096528

There is a simpler way of calculating all 3 correlations by combining the cbind() function and the cor() function.

The cbind() function is mentioned in Section 5.9 of the textbook. It combines vectors into a matrix whose columns are the input vectors. For example,

cbind(x,y,z)
        x     y     z
[1,] 1.40  1.53  2.39
[2,] 5.66 -3.45  3.19
[3,] 7.13  6.70  8.31
[4,] 9.21  4.63 -4.67

creates a 4×3 matrix. The first column is x, second column is y and third column is z.

The cor() function accepts a matrix argument. In cor(m), if m is a \(p\times q\) matrix, cor(m) returns a \(q \times q\) matrix whose \((i,j)\) element is the correlation coefficient between the \(i\)th column vector and \(j\)th column vector of m. For example, in

cor(cbind(x,y,z))
           x          y           z
x  1.0000000 0.41090275 -0.30787858
y  0.4109028 1.00000000  0.07096528
z -0.3078786 0.07096528  1.00000000

the (1,1) element is cor(x,x), which is 1; the (1,2) element is cor(x,y); the (1,3) element is cor(x,z); the (2,1) element is cor(y,x), which is equal to cor(x,y); the (2,2) element is cor(y,y), which is 1; the (2,3) element is cor(y,z), and so on. The returned \(3\times 3\) matrix is called the correlation matrix for the variables \(x\), \(y\), and \(z\).

Position of Maximum and Minimum

The max() and min() functions return the maximum and minimum value in a vector, but they don’t tell us where the maximum and minimum occurs. If we want that information, we can use the which.min() and which.max() functions, which return the index where the first maximum and minimum occurs. As an example, consider 5 students in Stat 100 with the following heights and weights:

height <- c(63,71,78,72,73)
weight <- c(120,167,190,165,290)

In this data set, height is measured in inches and weight is measued in pounds. The height of the tallest student is

max(height)
[1] 78

which is 78 inches. Suppose we want to know the weight of this tallest student. We can first find out the index of the maximum in the height vector and then look at the value of the weight vector at that index:

i <- which.max(height)
i
[1] 3
weight[i]
[1] 190

The command which.max(height) returns 3, indicating that the maximum value in height is the 3rd element. The weight of this tallest student is weight[3] or 190 pounds. Similarly, if we want to find the weight of the shortest student, we will use i <- which.min(height) to find the index of the minimum and then weight[i] to find the corresponding value of weight.

Note that which.min()/which.max() returns the index of the first minimum/maximum in case there are ties. For example, if we have the following vector x.

x <- c(1,5,10,10,5,1)

Then which.max(x) returns 3 since the third element is the first index where the maximum occurs. Similarly, which.min(x) returns 1. In many applications, maximum/minimum only occurs once and so which.min() and which.max() are sufficient. In the case of ties and we want to find all the indices, we will use the function which() combined with min() or max(). For example, to find all the indices of the minimum in x, we can use the command

which(x==min(x))
[1] 1 6

You will learn the function which() in lesson 8 of swirl.