R provides a number of functions associated with commonly used probability distributions. Each distribution has four associated functions, carrying the prefixes d, p, q and r. The prefix d stands for “density”, returning the probability density function; p means “probability”, returning the cumulative distribution function; q means “quantile”, returning the inverse of the cumulative distribution function; and r means “random”, returning random numbers following the specified probability distribution.
In this section we introduce the R functions associated with the most commonly used probability distribution in Statistics: the normal distribution. The associated R functions are dnorm(), pnorm(), qnorm() and rnorm(). Type ?dnorm in the R console for the details of their usage.
dnorm()
The dnorm(x) function returns the density of the standard normal distribution (mean=0, standard deviation=1) at x. The mathematical function for the distribution is \[ f(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}\] For example,
dnorm(-3:3)
[1] 0.004431848 0.053990967 0.241970725 0.398942280 0.241970725 0.053990967
[7] 0.004431848
returns the density \(f(x)\) at \(x\) = -3, -2, -1, 0, 1, 2 and 3. We see that the distribution is symmetric about \(x = 0\). You get the same numbers by plugging -3:3 into the formula of the normal curve:
1/sqrt(2*pi)*exp(-(-3:3)^2/2)
[1] 0.004431848 0.053990967 0.241970725 0.398942280 0.241970725 0.053990967
[7] 0.004431848
The standard normal curve can be plotted using the command
curve(dnorm(x), xlim=c(-3,3))
The xlim=c(-3,3) argument tells R to plot the function in the range \(-3 \leq x \leq 3\). Type ?curve to see other available options for the curve() function.
The dnorm() function has options that allow you to choose a normal distribution with a different mean and standard deviation (again, type ?dnorm to see the usage). The mathematical equation for the distribution is \[ f(x;\mu,\sigma)=\frac{1}{\sqrt{2\pi \sigma^2}}e^{-(x-\mu)^2/(2\sigma^2)}\] Here \(\mu\) is the mean and \(\sigma\) is the standard deviation. For example, the following commands plot the standard normal curve, \(f(x)\) (black), and a normal curve with mean = 1 and standard deviation = 2, \(f(x;1,2)\) (red):
curve(dnorm(x), xlim=c(-6,6))
curve(dnorm(x, mean=1, sd=2), col="red", add=TRUE)
The first line in the code chunk plots the standard normal curve in the range \(-6 \leq x \leq 6\). In the second line, col="red" plots the function with a red line, and add=TRUE adds the curve to the existing plot instead of generating a new one. We see that the second curve is symmetric about \(x=1\) and is more spread out than the standard normal curve.
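The symmetry about \(x=1\) can also be checked numerically: the density at \(1+a\) should equal the density at \(1-a\) for any offset \(a\). A quick check (the offsets below are arbitrary choices):

```r
# dnorm(x, mean=1, sd=2) is symmetric about x = 1:
# the density at 1 + a equals the density at 1 - a
a <- c(0.5, 1, 2, 3)
all.equal(dnorm(1 + a, mean=1, sd=2), dnorm(1 - a, mean=1, sd=2))  # TRUE
```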
pnorm() and qnorm()
The pnorm(z) function returns the cumulative probability of the standard normal distribution at the Z score \(z\). That is, it is the area under the standard normal curve to the left of \(z\) (the area of the shaded blue region in the plot below).
For example,
pnorm(1.65)
[1] 0.9505285
This means that the probability of getting a Z score smaller than 1.65 is about 0.95, or 95%. Sometimes we want the area of the right tail instead (the area of the shaded region below).
This is especially the case when we want to find the (one-sided) p-value corresponding to a Z score. There are many ways to do this. The first method is to use the fact that the normal curve is symmetric, so \(P(Z>z)=P(Z<-z)\). For example, \(P(Z>1.65)=P(Z<-1.65)\) and \(P(Z>1.65)\) can be computed by
pnorm(-1.65)
[1] 0.04947147
The second method is to use \(P(Z>1.65)=1-P(Z<1.65)\) and \(P(Z>1.65)\) can be computed by
1-pnorm(1.65)
[1] 0.04947147
The third method is to use the option lower.tail=FALSE in pnorm():
pnorm(1.65,lower.tail=FALSE)
[1] 0.04947147
We see that the three methods produce the same result, as expected. I prefer the first method as it involves the least typing.
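The agreement of the three methods can also be verified programmatically with base R's all.equal():

```r
# The three ways of computing P(Z > 1.65) agree (up to floating point)
m1 <- pnorm(-1.65)
m2 <- 1 - pnorm(1.65)
m3 <- pnorm(1.65, lower.tail=FALSE)
all.equal(m1, m2)  # TRUE
all.equal(m1, m3)  # TRUE
```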
The qnorm() function is the inverse of the pnorm() function: qnorm(y) returns the value \(x\) such that pnorm(x)=y. For example, the Z score corresponding to the 95th percentile is
qnorm(0.95)
[1] 1.644854
We can check that the result is correct:
pnorm( qnorm(0.95) )
[1] 0.95
As with the pnorm() function, the option lower.tail=FALSE refers to the area of the right tail. For example,
qnorm(0.05,lower.tail=FALSE)
[1] 1.644854
This means that the Z score for which the right-tail area equals 0.05 is 1.645. By the symmetry of the normal curve, it can also be obtained as the negative of the Z score for which the left-tail area equals 0.05:
-qnorm(0.05)
[1] 1.644854
rnorm()
The rnorm() function is a random number generator (type ?rnorm to see its usage). For example,
rnorm(10)
[1] 0.4053546 -0.6914374 -0.5623105 0.2652098 -0.9166066 0.6804066
[7] -0.7471598 -1.1550152 -0.6181327 0.7914630
generates 10 random numbers following the standard normal distribution. It returns different numbers the next time you run it:
rnorm(10)
[1] 2.4412312 1.2042262 0.5926276 0.3633825 -1.9873193 0.5377273
[7] -0.9754285 -0.9924121 -1.4767626 -0.9644406
The command
rnorm(20,mean=1,sd=3)
[1] 7.4726921 -4.1817994 1.4767907 -0.9934669 3.2575528 0.4522079
[7] 1.7966542 -3.3782191 1.9933302 0.9305993 -1.2038704 0.8175358
[13] -1.5344569 3.9262773 0.2872347 3.0199772 -1.6123880 3.0520579
[19] 0.2618046 -2.6131915
generates 20 random numbers following the normal distribution with mean = 1 and standard deviation = 3. Random numbers generated by computers are not truly random, in the sense that they are produced by deterministic rules. They appear random, however, and have many of the statistical properties of true random numbers. For this reason they are called pseudo-random numbers. Pseudo-random numbers are adequate for most statistical applications.
In many applications, we want to be able to reproduce the results of our data analysis. We can use the set.seed(n) function, where \(n\) is an integer, to generate the same set of random numbers every time. For example, the commands
set.seed(107281)
rnorm(10)
[1] 0.12977398 -0.93544505 0.44900614 -0.85894095 0.03539138
[6] -0.31035867 -1.01223948 -0.70036977 -0.92594897 -0.53929791
sets a seed (107281) and then generates 10 standard normal random numbers. When we run rnorm(10) again, we get different numbers:
rnorm(10)
[1] 0.81412055 -0.65041261 0.76975912 1.46504444 -0.07391547
[6] -0.28979138 -1.43520277 -0.20861926 -0.44830295 -0.58084481
However, when we reset the seed to 107281 and generate 20 standard normal random numbers,
set.seed(107281)
rnorm(20)
[1] 0.12977398 -0.93544505 0.44900614 -0.85894095 0.03539138
[6] -0.31035867 -1.01223948 -0.70036977 -0.92594897 -0.53929791
[11] 0.81412055 -0.65041261 0.76975912 1.46504444 -0.07391547
[16] -0.28979138 -1.43520277 -0.20861926 -0.44830295 -0.58084481
we get the exact same 20 numbers as before. You will see that I always use set.seed() before generating a new set of random numbers, mainly for the purpose of reproducibility. In general, it is a good idea to use set.seed() in any project that requires random numbers, at least at the beginning. This is especially useful if the project involves complicated coding, as reproducibility makes debugging easier.
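The reproducibility can be verified directly with identical() (a small check; the seed 12345 is an arbitrary choice):

```r
# Resetting to the same seed reproduces the exact same numbers
set.seed(12345)
a <- rnorm(5)
set.seed(12345)
b <- rnorm(5)
identical(a, b)  # TRUE
```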
Random numbers are essential in statistical simulations and we will talk more about them later in the course.
Now let’s study some properties of the random numbers generated by the rnorm() function. The following code chunk generates 5 sets of standard normal random numbers.
set.seed(17289143)
# Generate 5 sets of standard normal random nums with length=1000
x1 <- rnorm(1000)
x2 <- rnorm(1000)
x3 <- rnorm(1000)
x4 <- rnorm(1000)
x5 <- rnorm(1000)
The # symbol is a comment character: R ignores everything following # on the rest of a line. One important property of these random numbers is that they are independent, and hence uncorrelated. Let’s check that the 5 sets of random numbers are uncorrelated by computing the correlation matrix:
cor(cbind(x1,x2,x3,x4,x5))
x1 x2 x3 x4 x5
x1 1.000000000 -0.002318008 0.017276589 0.081279622 0.042118493
x2 -0.002318008 1.000000000 0.002570233 -0.010809180 0.042593306
x3 0.017276589 0.002570233 1.000000000 -0.020561573 0.018972720
x4 0.081279622 -0.010809180 -0.020561573 1.000000000 0.001248967
x5 0.042118493 0.042593306 0.018972720 0.001248967 1.000000000
We see that the off-diagonal elements of the matrix are all small, consistent with the sets being uncorrelated.
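To make “small” precise, we can extract the largest off-diagonal correlation in absolute value (a quick follow-up check; the generation step is repeated here so the chunk stands on its own):

```r
# Reproduce the five sets with the same seed as above, then find the
# largest off-diagonal correlation in absolute value
set.seed(17289143)
x1 <- rnorm(1000); x2 <- rnorm(1000); x3 <- rnorm(1000)
x4 <- rnorm(1000); x5 <- rnorm(1000)
r <- cor(cbind(x1, x2, x3, x4, x5))
max(abs(r[upper.tri(r)]))
```

The result should match the 0.0813 between x1 and x4 seen in the matrix above.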
Now let’s generate \(10^5\) standard normal random numbers to analyze their statistical properties.
set.seed(2819237)
x <- rnorm(1e5)
We can look at the mean and (sample) standard deviation:
mean(x)
[1] -0.00424089
sd(x)
[1] 0.999141
We see that the mean and sd are very close to 0 and 1, respectively, as expected. To verify that the data is normally distributed, we generate a histogram using the hist() function:
hist(x)
The histogram looks like a normal curve, but we can do a better job by breaking the histogram into smaller intervals. Like many other R functions, hist() has parameters that allow us to customize the histogram. As you dive into the R world, you will encounter many R functions with many optional parameters. You don’t need to memorize them all, as you can easily look them up by pulling up a help page using ?. If you don’t understand the help page, do an internet search and you will be amazed to find that many webpages have the information you are looking for.
Here is a better histogram for our data:
hist(x, freq=FALSE, breaks=100)
The freq=FALSE option shows the density (fraction of points in each interval) on the y-axis instead of the frequency (number of points in each interval). With this option, the total area of the histogram is normalized to 1. The breaks=100 option draws the histogram with 100 equally spaced intervals. This histogram looks more like the standard normal curve. In fact, we can plot the standard normal curve on top of the histogram for a better comparison:
hist(x, freq=FALSE, breaks=100)
curve(dnorm(x),col="red",add=TRUE)
We thus verify that the rnorm() function generates random numbers following the normal distribution. Let’s try something new.
Exercise 1: Make a histogram of \(x^2\). Does it resemble anything you are familiar with?
The four functions associated with the \(\chi^2\) distribution are dchisq(), pchisq(), qchisq() and rchisq(). Their usage is essentially the same as that of the four functions associated with the normal distribution. Again, use ?dchisq to find out more information.
Before we demonstrate the use of these four functions, let’s first talk about what the \(\chi^2\) distribution is. A \(\chi^2\) distribution with \(k\) degrees of freedom is constructed from the sum of the squares of \(k\) independent standard normal random variables. Specifically, if \(X_1\), \(X_2\), \(\cdots\), \(X_k\) are independent random variables that all follow the standard normal distribution, then the random variable \[ Q = X_1^2+ X_2^2 + \cdots + X_k^2 \] follows the \(\chi^2\) distribution with \(k\) degrees of freedom. If you find the concept hard to follow, read this page.
dchisq() and rchisq()
The mathematical functions for the \(\chi^2\) density can be found in many standard textbooks and on the internet (e.g., Wikipedia). To see what the \(\chi^2\) distributions look like, we use the dchisq() function to plot the density functions:
curve(dchisq(x,1),xlim=c(0,15),ylim=c(0,0.6),ylab="Chi Square Density")
curve(dchisq(x,2),col="red",lty=2,add=TRUE)
curve(dchisq(x,3),col="blue",lty=3,add=TRUE)
curve(dchisq(x,5),col="dark green",lty=4,add=TRUE)
curve(dchisq(x,10),col="brown",lty=5,add=TRUE)
legend(12,0.55,c("k=1","k=2","k=3","k=5","k=10"),
col=c("black","red","blue","dark green","brown"),lty=1:5)
Unlike dnorm(), dchisq() requires at least two arguments. The first is the value of \(x\) at which we want to compute the density and the second is the degrees of freedom. In the plot above, \(\chi^2\) density functions with degrees of freedom \(k=1\), 2, 3, 5 and 10 are plotted together. There are a few new things in the code chunk: ylim sets the y-range of the plot; ylab changes the y-axis label; lty changes the line type (1=solid, 2=dashed, 3=dotted, 4=dash-dotted, 5=long-dashed); legend() displays the legend. You can type ?legend to see the usage, or alternatively google “r add legend to plot” to find webpages with the information (this was actually what I did since I forgot how to add a legend to a plot).
Now you know enough to answer the question at the end of the last section. From the definition of the \(\chi^2\) distribution, the histogram of \(x^2\) in Exercise 1 should follow the \(\chi^2\) distribution with one degree of freedom. We can check that by plotting the histogram and the \(\chi^2\) density curve together:
# Regenerate the random numbers x from the previous section
# You can skip this step if you haven't closed the R console
set.seed(2819237)
x <- rnorm(1e5)
# Plot the x^2 histogram and the 1-df chi-square curve together
hist(x^2, freq=FALSE, breaks=200,xlim=c(0,5))
curve(dchisq(x,1),col="red",add=TRUE)
Let’s now generate random numbers following the \(\chi^2\) distribution with df=3 from 3 independent standard normal random variables:
set.seed(2712819)
X1 <- rnorm(1e5)
X2 <- rnorm(1e5)
X3 <- rnorm(1e5)
Q <- X1^2 + X2^2 + X3^2
Let’s check that Q follows the \(\chi^2\) distribution with df=3.
hist(Q, freq=FALSE, breaks=100)
curve(dchisq(x,3),col="red",add=TRUE)
In practical applications, we don’t generate \(\chi^2\) random numbers from the rnorm() function; the rchisq() function is more convenient. For example,
set.seed(90546723)
Q <- rchisq(1e5,10)
generates \(10^5\) random numbers following the \(\chi^2\) distribution with 10 degrees of freedom:
hist(Q, freq=FALSE, breaks=100)
curve(dchisq(x,10),col="red",add=TRUE)
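As a further sanity check, a \(\chi^2\) distribution with \(k\) degrees of freedom has mean \(k\) and variance \(2k\) (a standard fact, not derived in this note), so the sample mean and variance of Q should be close to 10 and 20:

```r
# Chi-square with k df has mean k and variance 2k (standard fact),
# so with df=10 the sample statistics should be near 10 and 20
set.seed(90546723)
Q <- rchisq(1e5, 10)
mean(Q)  # close to 10
var(Q)   # close to 20
```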
pchisq() and qchisq()
The pchisq(x,df) function returns the cumulative probability \(P(\chi^2<x)\) for the \(\chi^2\) distribution with df degrees of freedom, i.e. the area of the shaded region in the plot of the \(\chi^2\) density curve below.
For the p-value calculation, we usually want the area of the tail, i.e. \(P(\chi^2>x)\), the shaded region below.
In this case, we use pchisq(x,df,lower.tail=FALSE) or 1-pchisq(x,df). For example, the p-value corresponding to \(\chi^2=21.78\) and df=10 is
pchisq(21.78,10,lower.tail=FALSE)
[1] 0.01626544
The qchisq() function is the inverse of pchisq(): if pchisq(x)=y, then qchisq(y)=x. For example, the \(\chi^2\) value corresponding to a 5% p-value and df=10 is
qchisq(0.05,10,lower.tail=FALSE)
[1] 18.30704
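As a quick consistency check, feeding the result back into pchisq() should recover 0.05:

```r
# Round trip: qchisq() and pchisq() are inverses of each other
x <- qchisq(0.05, 10, lower.tail=FALSE)
x                                # 18.30704, as above
pchisq(x, 10, lower.tail=FALSE)  # recovers 0.05
```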
Another distribution mentioned in Stat 100 is Student’s \(t\) distribution. The four functions associated with the \(t\) distribution are dt(), pt(), qt() and rt().
The \(t\) distribution is constructed from the normal and \(\chi^2\) distributions. Let \(Z\) and \(W\) be two independent random variables. If \(Z\) follows the standard normal distribution and \(W\) follows the \(\chi^2\) distribution with \(k\) degrees of freedom, then the random variable \[T = \frac{Z}{\sqrt{W/k}}\] is said to follow the \(t\) distribution with \(k\) degrees of freedom. The mathematical functions for the distribution can again be found in many textbooks and from the internet (e.g., Wikipedia).
dt() and rt()
Let’s generate \(10^5\) random numbers following the \(t\) distribution with 5 df from the definition of the \(t\) distribution:
Z <- rnorm(1e5)
W <- rchisq(1e5,5)
T <- Z/sqrt(W/5)
As before, we plot a histogram for the data:
hist(T,breaks=100,freq=FALSE)
This doesn’t look very nice. Most of the data are concentrated between -5 and 5, so instead of using equally spaced intervals, we want more intervals between -5 and 5 and fewer outside that range. As stated in the help page ?hist, this can be done by passing the breaks parameter a numeric vector containing the break points. In the following, I set 100 break points between -5 and 5 using the seq() function, plus one break point at the minimum of T and one at the maximum of T:
hist(T, breaks=c(min(T), seq(-5,5,length.out=100), max(T)), freq=FALSE)
Since the region outside (-5,5) is pretty boring, we can exclude it from the plot by setting the xlim parameter. We also want to plot the dt() function on the same graph. So we type the following commands:
hist(T, breaks=c(min(T), seq(-5,5,length.out=100), max(T)), freq=FALSE,
xlim=c(-5,5))
curve(dt(x,5),col="red",add=TRUE)
As you can see, the \(t\) distribution is symmetric about \(x=0\) and resembles the normal curve. You may recall from Stat 100 that the \(t\) distribution approaches the normal distribution as the degrees of freedom \(k\rightarrow \infty\). In general, the \(t\) curve falls off more slowly than the normal curve at large \(|x|\): it has heavier tails.
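One way to see that the \(t\) curve falls more slowly than the normal curve is to compare right-tail probabilities. At \(x=4\), for example, the tail of the \(t\) distribution with 5 df is more than 100 times larger than the normal tail (the point \(x=4\) is just an arbitrary illustration):

```r
# Right-tail probability P(X > 4): t with 5 df vs standard normal
pt(4, 5, lower.tail=FALSE)   # about 0.005
pnorm(4, lower.tail=FALSE)   # about 3e-5
```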
The following code chunk plots the \(t\) curves with various degrees of freedom.
curve(dt(x,1),xlim=c(-5,5),ylim=c(0,0.4),ylab="Student's t Density")
curve(dt(x,2),col="red",lty=2,add=TRUE)
curve(dt(x,5),col="blue",lty=3,add=TRUE)
curve(dt(x,20),col="dark green",lty=4,add=TRUE)
curve(dnorm(x),col="brown",lty=5,add=TRUE)
legend(2,0.38,c("k=1","k=2","k=5","k=20","normal curve"),
col=c("black","red","blue","dark green","brown"),lty=1:5)
As you can see, the \(t\) curve is already quite close to the standard normal curve for \(k=20\).
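We can make “quite close” quantitative by computing the largest gap between the two densities over a grid (a rough numerical check; the grid is an arbitrary choice):

```r
# Maximum gap between the t density with 20 df and the standard
# normal density over a grid on [-5, 5]
xg <- seq(-5, 5, by=0.01)
max(abs(dt(xg, 20) - dnorm(xg)))  # less than 0.01
```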
The rt(n,df) function generates \(n\) random numbers following the \(t\) distribution with df degrees of freedom.
Exercise 2: Generate random numbers following the t distribution and then make a histogram. Superpose the density curve of the t distribution on the histogram and check that the histogram matches the density curve.
pt() and qt()
The usage of the pt() and qt() functions is exactly the same as that of the pchisq() and qchisq() functions. For example, the p-value corresponding to \(t=1.5\) and df=4 is
pt(1.5,4,lower.tail=FALSE)
[1] 0.104
This is also equal to pt(-1.5,4) since the \(t\) curve is symmetric.
The value of \(t\) corresponding to a p-value of 5% and df=4 is
qt(0.05,4,lower.tail=FALSE)
[1] 2.131847
This is also equal to -qt(0.05,4).
In addition to the three probability distributions mentioned here, R has built-in functions for many other probability distributions. An incomplete list of the available distributions can be found on this page. This is mostly for your reference.
In Stat 200, you will encounter a few more probability distributions we haven’t mentioned in this note. The most important of them is the F distribution, which is constructed from two independent \(\chi^2\) random variables. You will explore this distribution through a Lon Capa exercise later in the course.
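As a preview, the construction is analogous to what we did for the \(\chi^2\) and \(t\) distributions: if \(W_1\) and \(W_2\) are independent \(\chi^2\) random variables with \(d_1\) and \(d_2\) degrees of freedom, then \((W_1/d_1)/(W_2/d_2)\) follows the F distribution with \((d_1,d_2)\) degrees of freedom. Here is a minimal sketch (the seed and the degrees of freedom 5 and 20 are arbitrary choices):

```r
# Build F(5,20) random numbers from two independent chi-squares and
# compare the histogram with R's built-in F density df()
set.seed(12345)
W1 <- rchisq(1e5, 5)
W2 <- rchisq(1e5, 20)
F5_20 <- (W1/5) / (W2/20)
hist(F5_20, freq=FALSE, breaks=200, xlim=c(0,6))
curve(df(x, 5, 20), col="red", add=TRUE)
```

The histogram should match the red curve, just as in the \(\chi^2\) and \(t\) examples above.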