Suppose we have a set of data points (yi, x1i, x2i, …, xpi) with i=1,2,…, n. A linear model attempts to find a set of p+1 parameters β0, β1, β2, …, βp that best fits Y for the given set of p independent variables X1, X2, …, Xp via the equation \[y_i = \hat{y}_i + \epsilon_i = \beta_0 + \beta_1 x_{1i}+\beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \epsilon_i\] In this standard notation, \(\hat{y}_i=\beta_0 + \beta_1 x_{1i}+\beta_2 x_{2i} + \cdots + \beta_p x_{pi}\) denotes the fitted value and \(\epsilon_i\) denotes the residual for the i’th observation.
A linear model is widely used in statistics because of its simplicity and its well-known properties. It should be pointed out that the term “linear” in linear model refers to a model being linear in the fitting parameters β0, β1, β2, …, βp but not in the independent variables X1, X2, …, Xp. For example, we can fit \(y_i\) with the model \[\hat{y}_i = \beta_0 + \beta_1 x_i + \beta_2 \sqrt{x_i} + \beta_3 x_i^2 + \beta_4 \log x_i + \beta_5 \frac{x_i^3}{1+2^{x_i}}\] using the technique of multiple linear regression since the model is linear in \(\beta_0,\beta_1,\cdots,\beta_5\). We simply label \[ x_{1i} = x_i \ \ , \ \ x_{2i}=\sqrt{x_i} \ \ , \ \ x_{3i}=x_i^2 \ \ , \ \ x_{4i}=\log x_i \ \ , \ \ x_{5i}= \frac{x_i^3}{1+2^{x_i}} \] and the fitting equation becomes \[ \hat{y}_i =\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \beta_5 x_{5i} \] which is the equation for a multiple linear model with p = 5.¹
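As a minimal sketch (with simulated data; the variable names and coefficient values below are made up for illustration), such a model could be fit in R with lm(), wrapping terms like x^2 and x^3/(1+2^x) in I() so that ^ and / are treated as arithmetic rather than formula operators:
set.seed(1)
x <- seq(0.5, 5, 0.1)                                            # simulated predictor
y <- 1 + 2*x - 3*sqrt(x) + 4*log(x) + rnorm(length(x), sd=0.1)   # simulated response (true betas chosen arbitrarily)
fit_poly <- lm(y ~ x + sqrt(x) + I(x^2) + log(x) + I(x^3/(1 + 2^x)))
summary(fit_poly)$coefficients                                   # estimates of beta0, beta1, ..., beta5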
In other cases, a nonlinear model can be turned into a linear model by transformation of variables. For example, suppose a model has the form \[\hat{y}_i = \sqrt{\beta_0 + \beta_1 x_i}\] This is clearly a nonlinear model. However, if we introduce a new variable \(\hat{z}_i=\hat{y}_i^2\), the equation becomes \[\hat{z}_i = \beta_0 + \beta_1 x_i\] which is a linear model. The coefficients \(\beta_0\) and \(\beta_1\) can be computed using the technique of linear regression.
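As a minimal sketch (again with simulated data; the values β0 = 2 and β1 = 3 are made up for illustration), the transformed model can be fit by regressing \(z = y^2\) on \(x\):
set.seed(2)
x <- seq(1, 10, 0.5)
y <- sqrt(2 + 3*x) + rnorm(length(x), sd=0.05)   # data roughly following y = sqrt(beta0 + beta1*x)
z <- y^2                                          # transformed variable z = y^2
fit_sq <- lm(z ~ x)                               # simple linear regression of z on x
coef(fit_sq)                                      # estimates of beta0 and beta1 (roughly 2 and 3)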
Here we will use three examples to demonstrate the technique of transformation of variables.
This csv file has three columns containing the monthly average temperatures at Champaign from 1990 to 2015. The first two columns are the year and month. The third column is the monthly average temperature (in °F) for the specified year and month. The data file was created by taking monthly averages of the temperature data downloaded from the Illinois Climate Network. The csv file can be loaded into R using the command
cmi <- read.csv("cmi_aveTemperature.csv")
To see how the data look, we make a plot of the monthly average temperature versus month:
plot(T_ave ~ month, data=cmi, las=1, pch=20,
ylab=expression(paste("Monthly average temperature ( ",degree,"F)")))
From the plot, the monthly average temperature \(T\) varies between about 20°F and 80°F. The hottest month is around July.
The simplest model for fitting the data is to use a sinusoidal function. Since we know the temperature variation is periodic with a period of 12 months, we can construct the following simple model: \[\hat{T} = T_0 + A \cos \left[ \frac{\pi}{6} (M - M_h)\right]\] Here \(\hat{T}\) is the predicted monthly average temperature and \(M\) is the month. The fitting parameters are \(T_0\), \(A\) and \(M_h\). The average of the cosine function is 0, so \(T_0\) represents the average temperature over a year. The cosine function \(\cos(x)\) has a maximum value of 1 at \(x=0\). This means that the maximum average temperature occurs at \(M=M_h\). Thus \(M_h\) represents the hottest month of the year, although it is a parameter that doesn’t have to be an integer. The factor π/6 in the cosine function is chosen so that the period of the temperature variation is 12 months, since the cosine function is periodic with a period of 2π. Finally, the maximum and minimum of the cosine function are ±1, so the maximum and minimum average temperatures in the model are \(T_0 \pm A\). Therefore, \(A\) represents the amplitude of seasonal temperature variation from \(T_0\).
Before we tackle the full model, it is useful to consider a slightly simplified model. Instead of treating \(M_h\) as a fitting parameter, we fix it to be 7 since we see from the plot that the hottest month of the year is July. In this case, the model simplifies to \[\hat{T} = T_0 + A \cos \left[ \frac{\pi}{6} (M - 7)\right]\] This simplified model is a simple linear model in disguise since it is linear in the fitting parameters \(T_0\) and \(A\). We can express the model in the standard form by introducing a new variable
\[X = \cos \left[ \frac{\pi}{6} (M - 7)\right]\] In terms of \(X\), the model becomes \[\hat{T} = T_0 + AX\] which is a linear model written in the standard form. To see how good the model may be, we make a plot of monthly average temperature versus \(X\):
plot(T_ave ~ cos(pi/6*(month-7)), data=cmi, ylab="T",las=1,pch=20,
xlab="X = cos[pi(month-7)/6]")
From this plot, it looks promising that the data can be fit by a straight line fairly well. We now determine the parameters \(T_0\) and \(A\) by simple linear regression:
fit1 <- lm(T_ave ~ cos(pi/6*(month-7)), data=cmi)
summary(fit1)
Call:
lm(formula = T_ave ~ cos(pi/6 * (month - 7)), data = cmi)
Residuals:
Min 1Q Median 3Q Max
-15.6428 -2.1797 0.0442 2.2298 13.8788
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.5383 0.2204 238.36 <2e-16 ***
cos(pi/6 * (month - 7)) 23.9566 0.3117 76.86 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.893 on 310 degrees of freedom
Multiple R-squared: 0.9501, Adjusted R-squared: 0.95
F-statistic: 5907 on 1 and 310 DF, p-value: < 2.2e-16
With R² = 0.95, the fit appears to be quite good. The parameter T0 = 52.54°F is the intercept, and A = 23.96°F is the slope. This means that the average temperature over the year is about 52.54°F. This value is exactly the same as the mean of T_ave, because the intercept of a simple linear regression is \(\bar{y} - \hat{\beta}_1 \bar{x}\) and the regressor \(X\) averages to zero over complete years:
mean(cmi$T_ave)
[1] 52.53825
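A quick check (using the cmi data frame loaded above) confirms that the regressor X = cos[pi(month-7)/6] averages to essentially zero:
mean(cos(pi/6*(cmi$month-7)))   # essentially 0, up to floating-point error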
The seasonal temperature variation from this yearly average temperature is A ≈ 24°F.
We can visualize the fit with the following plot:
plot(T_ave ~ cos(pi/6*(month-7)), data=cmi, ylab="T",las=1,pch=20,
xlab="X = cos[pi(month-7)/6]")
# Recall: abline(fit1) can be used to add the regression line to the plot
abline(fit1, col="red")
In terms of the variable month, the fit is a cosine curve. We can use the curve() function to plot the fitted curve together with the data points:
# Recall: the fitted coefficients are stored in the vector fit1$coefficients
T0 <- fit1$coefficients[1]
A <- fit1$coefficients[2]
plot(T_ave ~ month, data=cmi, ylab="T",las=1,pch=20, xlab="month")
curve(T0+A*cos(pi/6*(x-7)), col="red", add=TRUE)
Another way to make the same plot is to use the predict() function to calculate \(\hat{T}\) for a number of points with different values of month and then use the lines() function to add the curve for \(\hat{T}\). Here is the code:
# Create 100 points equally spaced between 1 and 12
mon_pts <- seq(1,12,length.out=100)
# Calculate predicted T for the 100 points
fitted.T <- predict(fit1, newdata=data.frame(month=mon_pts))
plot(T_ave ~ month, data=cmi, ylab="T",las=1,pch=20, xlab="month")
# add the predicted curve
lines(mon_pts,fitted.T,col="red")
We see that the model fits the data fairly well. We can also plot the monthly average temperature as a function of time. To do that, we create a new column named ‘Month1990’, which is the month counted from the year 1990:
cmi$Month1990 <- cmi$month + (cmi$year-1990)*12
The value of ‘Month1990’ is the same as ‘month’ in the year 1990, but it is cumulative rather than periodic: Month1990 = 13 for January 1991; Month1990 = 14 for February 1991 and so on. We can use this new column to plot T as a function of time:
plot(T_ave ~ Month1990, data=cmi, type='l',las=1,
ylab=expression(paste("Monthly average temperature ( ",degree,"F)")))
# Recall: the fitted values are stored in the vector fit1$fitted.values
lines(fit1$fitted.values ~ cmi$Month1990, col="red")
Here the black line is the data (since we set type='l'). We use the lines() function to add the fitted line (red) to the plot. The fit is quite good considering the simplicity of the model.
We now return to the full model, where \(M_h\) is a free parameter: \[\hat{T} = T_0 + A \cos \left[ \frac{\pi}{6} (M - M_h)\right]\] Using the trigonometric identity \(\cos(a-b)=\cos a \cos b + \sin a \sin b\), we can write \[\cos \left[ \frac{\pi}{6} (M - M_h)\right] = \cos \frac{\pi}{6}M \cos \frac{\pi}{6}M_h + \sin \frac{\pi}{6}M \sin \frac{\pi}{6}M_h\] and the full model can be written as \[\hat{T} = T_0 + A \cos \frac{\pi}{6}M_h \cos \frac{\pi}{6}M + A \sin \frac{\pi}{6}M_h \sin \frac{\pi}{6}M\] Define two parameters \[\beta_1 = A \cos \frac{\pi}{6}M_h \ \ \ , \ \ \ \beta_2 = A \sin \frac{\pi}{6}M_h\] and the model becomes \[\hat{T} = T_0 + \beta_1 \cos \frac{\pi}{6}M + \beta_2 \sin \frac{\pi}{6}M\] This is again a linear model because \(\hat{T}\) is linear in the fitting parameters \(T_0\), \(\beta_1\) and \(\beta_2\). If we define \(X_1=\cos \frac{\pi}{6}M\) and \(X_2 = \sin \frac{\pi}{6}M\), the model becomes \[\hat{T}=T_0 + \beta_1 X_1 + \beta_2 X_2\] This is a linear model written in the standard form. What we have just done is to trade A and Mh for β1 and β2. There are still 3 fitting parameters (T0, β1 and β2) in this model.
We use lm() to find \(T_0\), \(\beta_1\) and \(\beta_2\):
fit2 <- lm(T_ave ~ cos(pi/6*month) + sin(pi/6*month), data=cmi)
summary(fit2)
Call:
lm(formula = T_ave ~ cos(pi/6 * month) + sin(pi/6 * month), data = cmi)
Residuals:
Min 1Q Median 3Q Max
-16.1809 -2.1600 0.0087 2.5647 14.8108
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.5383 0.2165 242.69 <2e-16 ***
cos(pi/6 * month) -20.2089 0.3062 -66.01 <2e-16 ***
sin(pi/6 * month) -12.9103 0.3062 -42.17 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.824 on 309 degrees of freedom
Multiple R-squared: 0.9521, Adjusted R-squared: 0.9517
F-statistic: 3068 on 2 and 309 DF, p-value: < 2.2e-16
We see that T0 = 52.54°F remains unchanged, β1 = -20.21°F and β2 = -12.91°F. These values of β1 and β2 correspond to A = 23.98°F and Mh = 7.086.²
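The footnote works out the conversion back to A and Mh in detail; as a quick one-line check of \(A=\sqrt{\beta_1^2+\beta_2^2}\):
sqrt(sum(fit2$coefficients[2:3]^2))   # A, about 23.98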
We can still use the predict() function to make predictions for new data. For example, to calculate the predicted monthly average temperature in April, we use the command
predict(fit2, newdata=data.frame(month = 4))
1
51.46205
and we get 51.5°F. We can also substitute month = 4 into the equation of the model:
fit2$coef[1] + fit2$coef[2]*cos(pi/6*4) + fit2$coef[3]*sin(pi/6*4)
(Intercept)
51.46205
We see that we get the same answer. Note that we use the abbreviation fit2$coef instead of fit2$coefficients to take advantage of R’s partial matching of names. As pointed out in Peng’s textbook, this is very useful during interactive work, but you shouldn’t use partial matching if you are writing longer scripts, functions, or programs, to avoid confusion.
How does the full model compare to the simplified model? We see from the summary of the model fit that the full model has almost the same residual standard error as the simplified model. The following is a plot comparing the two models.
plot(T_ave ~ month, data=cmi, ylab="T",las=1,pch=20, xlab="month")
curve(T0+A*cos(pi/6*(x-7)), col="red", add=TRUE)
curve(fit2$coef[1] + fit2$coef[2]*cos(pi/6*x) + fit2$coef[3]*sin(pi/6*x),
col="blue", lty=2, add=TRUE)
legend(6,50,c("data","simplified","full"), pch=c(20,-1,-1),
col=c("black","red","blue"), lty=0:2)
The plot tells the same story. The full model is similar to the simplified model and does not fit the data much better. We conclude that including the extra parameter \(M_h\) in the model does not significantly improve the accuracy. It is not very surprising that the simplified model is almost as good as the full model, because our guess Mh = 7 is close to the optimal value Mh = 7.086.
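Because the simplified model is nested in the full one (its single regressor cos[pi(M-7)/6] is a fixed linear combination of cos(pi M/6) and sin(pi M/6)), one way to make this comparison formal is an F-test with anova(); a minimal sketch:
# F-test comparing the simplified model (fit1) to the full model (fit2)
anova(fit1, fit2)
A large p-value here would confirm that freeing \(M_h\) does not significantly improve the fit.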
When a ball bearing is dropped from rest, it falls freely due to Earth’s gravity. You may remember from an introductory physics course that the distance traveled \(s\) as a function of time \(t\) is given by the formula \[s=\frac{1}{2} gt^2\] Here \(g\) is a constant known as the acceleration due to gravity. It measures the strength of gravity near Earth’s surface. By carefully measuring the time taken for a ball bearing to fall a certain distance, we can use the formula to determine \(g\). In fact, this is a lab experiment many students have done in an introductory physics course.
Consider the following data from a physics lab experiment.
t <- seq(0,0.5,0.1)
s <- c(0,0.053,0.196,0.443,0.773, 1.257)
Here \(s\) is distance traveled in meters and \(t\) is time in seconds. According to the formula, the plot of \(s\) versus \(t^2/2\) should be a straight line with a slope equal to \(g\) passing through the origin. We take a look at the plot:
plot(s ~ I(t^2/2), pch=19, xlab=expression(t^2/2), ylab="s",las=1)
Note that we need to put the expression t^2/2 inside the I() function to tell R that the quantity t^2/2 should be treated as one variable (see, e.g., this tutorial for detail).
By making the transformation \(x=t^2/2\), the model \(s=gt^2/2\) becomes \(s = g x\), which is a simple linear model written in the standard form. We use the lm() function to find the slope:
fit <- lm(s ~ I(t^2/2))
summary(fit)
Call:
lm(formula = s ~ I(t^2/2))
Residuals:
1 2 3 4 5 6
0.0031931 0.0063538 -0.0001639 -0.0023601 -0.0212348 0.0142120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.003193 0.007805 -0.409 0.703
I(t^2/2) 9.967848 0.122200 81.570 1.35e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01331 on 4 degrees of freedom
Multiple R-squared: 0.9994, Adjusted R-squared: 0.9992
F-statistic: 6654 on 1 and 4 DF, p-value: 1.354e-07
With R² = 0.9994, the fit is very good. The intercept is small, and its p-value of 0.7 indicates that it is not significantly different from zero. The slope is 9.97. The 95% confidence interval of the slope is
confint(fit, "I(t^2/2)")
2.5 % 97.5 %
I(t^2/2) 9.628567 10.30713
This means that the 95% confidence interval for the measured value of \(g\) is (9.6, 10.3) m/s². The accepted value of \(g\) is 9.8 m/s², which lies within this interval.
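As with the temperature example, we can overlay the fitted line on the data to visualize the quality of the fit:
plot(s ~ I(t^2/2), pch=19, xlab=expression(t^2/2), ylab="s", las=1)
abline(fit, col="red")   # fitted straight line; its slope is the estimate of g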
Here we only consider the log phase and stationary phase. The logistic model gives a function that starts out looking like the log phase and flattens out like the stationary phase, so we’ll use that function. In this model, the bacterial population density \(P\) as a function of time is given by the equation \[\frac{P_{i+1}-P_i}{P_i} = r \left( 1 - \frac{P_i}{P_{\rm max}} \right)\] Here \(P_i\) is the bacterial population density at time \(t=t_i\), and it is assumed that \(t_{i+1}-t_i=\Delta t\) is a constant. That is, the population density \(P\) is measured at a constant time interval in the experiment. The parameter \(r\) is the growth rate, measuring the fractional increase in \(P\) per time step in the log phase, when the population density \(P \ll P_{\rm max}\). As \(P\) increases, the term \(P/P_{\rm max}\) increases, and the growth rate is suppressed by a factor \((1-P/P_{\rm max})\). The parameter \(P_{\rm max}\) is called the carrying capacity. When \(P_i\) reaches \(P_{\rm max}\), the equation becomes \((P_{i+1}-P_i)/P_i=0\) or \(P_{i+1}=P_i=P_{\rm max}\) and so the population stops growing thereafter. Thus, \(P_{\rm max}\) measures the maximum population density the bacteria can reach and is the population density in the stationary phase.
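Before looking at real data, it may help to see the curve this recursion generates. The following sketch iterates the difference equation with made-up parameter values (r_demo and Pmax_demo are purely illustrative):
r_demo <- 0.5; Pmax_demo <- 50            # illustrative values, not fitted to any data
P_demo <- numeric(31)
P_demo[1] <- 1
for (i in 1:30) {
  P_demo[i+1] <- P_demo[i]*(1 + r_demo*(1 - P_demo[i]/Pmax_demo))
}
plot(0:30, P_demo, type="b", pch=20, xlab="step i", ylab="P")   # log phase, then stationary phase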
Consider the following data from an experiment.
t <- seq(0,200,10)
P <- c(1, 1.67, 2.64, 4.45, 7.27, 10.56, 15.18, 23.45, 30.14, 36.83, 42.93,
44.09, 44.98, 45.37, 45.08, 45.25, 45.35, 45.58, 44.84, 45.39, 45.01)
The time \(t\) is in minutes and the bacterial population density \(P\) is normalized to 1 at the beginning of the experiment (\(t=0\)). In this experiment, the time interval between successive measurements is \(\Delta t = 10\) minutes. We first look at the plot of \(P(t)\):
plot(P ~ t, pch=19, las=1, xlab="t (minutes)", ylab="Population density")
The log phase and stationary phase can be seen clearly. The logistic model is a nonlinear model, but we can turn it into a linear model by writing \(x_i=P_i\) and defining a new variable \(y_i\) as \[y_i = \frac{P_{i+1}-P_i}{P_i}\] With these new variables, the logistic model becomes \[ y_i = r - \frac{r}{P_{\rm max}} x_i\] which is a linear model in the standard form. The intercept is the growth rate \(r\) and the slope is \(-r/P_{\rm max}\).
To fit the model, we need to create the x and y variables. Since \(y_i\) involves both \(P_i\) and \(P_{i+1}\), there are \(n-1\) values of \(y_i\), one fewer than the number of \(P_i\) values. We want the \(\{ x_i \}\) data to have the same length as the \(\{ y_i \}\) data to fit the model. Here is code that generates x and y:
n <- length(P) # number of data points in P
x <- P[-n] # remove the last element from P; x has one less element than P
y <- rep(NA, n-1) # initial y to a vector of NAs of length n-1
# Calculate y
for (i in 1:(n-1)) {
y[i] <- (P[i+1]-P[i])/P[i]
}
In this code, the calculation of y is done via a for-loop. If you abhor for-loops, you will be pleased to learn that the calculation of y can also be accomplished by the following one-line code:
y <- (P[-1]-P[-n])/P[-n]
The logic goes like this: P[-1] returns a vector of length n-1 with the first element removed. So P[-1] stores P[2], P[3], P[4], ..., P[n]. Similarly, P[-n] returns a vector of length n-1 storing P[1], P[2], P[3], ..., P[n-1]. Therefore, (P[-1] - P[-n])/P[-n] returns a vector of length n-1 storing (P[2]-P[1])/P[1], (P[3]-P[2])/P[2], ..., (P[n]-P[n-1])/P[n-1], which is y. This index shifting trick was introduced in one of Week 4’s Lon Capa Homework problems.
We now take a look at the plot of y versus x:
plot(y ~ x, pch=19, las=1)
We use the lm() function to fit a linear model:
fit <- lm(y ~ x)
summary(fit)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-0.100133 -0.010122 0.000298 0.009774 0.106741
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6580277 0.0197281 33.35 < 2e-16 ***
x -0.0144910 0.0005768 -25.12 1.82e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.04634 on 18 degrees of freedom
Multiple R-squared: 0.9723, Adjusted R-squared: 0.9707
F-statistic: 631.1 on 1 and 18 DF, p-value: 1.822e-15
From the intercept, we conclude that the growth rate is r = 0.66. Since the slope is -r/Pmax, we have Pmax = -r/slope, which is equal to 45.41. The 95% confidence intervals for the intercept and slope are
confint(fit)
2.5 % 97.5 %
(Intercept) 0.61658059 0.69947485
x -0.01570289 -0.01327911
This means that the 95% CI for \(r\) is (0.62, 0.70). Computing the 95% CI for Pmax is more complicated because \(P_{\rm max}\) involves the ratio of the intercept and slope, and the two parameters are correlated. We will not do the calculation here.³ From the plot of \(P\) versus \(t\), we see that \(P\) saturates around 45-46, so Pmax ≈ 45 or 46 seems to be a good estimate.
We plot the data and fitted line together to visualize the quality of the fit:
plot(y ~ x, pch=19, las=1)
abline(fit, col="red")
We are, however, more interested in comparing the fitted values of \(P\) with the data. They can be calculated from the definition of \(y_i\): \[y_i = \frac{P_{i+1}-P_i}{P_i} \ \ \ \Rightarrow \ \ \ P_{i+1} = P_i (1 + y_i)\] Therefore, the fitted value of the (i+1)th population density \(\hat{P}_{i+1}\) is related to the fitted value \(\hat{y}_i\) by \[\hat{P}_{i+1} = \hat{P}_i (1+ \hat{y}_i)\] To proceed, we need to know the first value of \(\hat{P}\). Since \(P\) is normalized to 1 at \(t=0\), we set \(\hat{P}_1=P_1=1\). The rest of the \(\hat{P}_i\) can be computed using the above equation. Here is code that computes the fitted values of \(P_i\):
P.fitted <- rep(NA,n) # Initialize P.fitted
P.fitted[1] <- 1
for (i in 1:(n-1)) {
P.fitted[i+1] = P.fitted[i]*(1+fit$fitted.values[i])
}
For those of you who abhor for-loops, here is a one-line code to compute \(\hat{P}\):
P.fitted <- cumprod( c(1, 1+fit$fitted.values) )
The trick is to use the cumprod() function you learned in Week 10’s Lon Capa Homework problem on the mortgage calculation.
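To see why this works, note that repeatedly applying \(\hat{P}_{i+1}=\hat{P}_i(1+\hat{y}_i)\) with \(\hat{P}_1=1\) gives \(\hat{P}_{i+1}=\prod_{j=1}^{i}(1+\hat{y}_j)\), and cumprod() returns exactly such running products. A tiny illustration:
cumprod(c(1, 2, 3, 4))   # returns 1, 2, 6, 24: the running products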
Now we can plot the fitted line together with the data:
plot(P ~ t, pch=20, las=1, xlab="t (minutes)",ylab="Population density")
lines(P.fitted ~ t, col="blue")
legend(120,30,c("data","fitted line"),col=c("black","blue"), pch=c(20,-1), lty=0:1)
We see that the logistic model provides a good fitting curve to the data points.
¹ The reason why this is the case is that in the linear model we find parameters β0, β1, β2, …, βp that minimize the sum of squared errors (SSE). So SSE is treated as a function of the β parameters in the minimization procedure, while the independent variables X1, X2, …, Xp, as well as Y, are treated as constants.↩
² The conversion can be carried out from the definitions \(\beta_1 = A \cos \frac{\pi}{6}M_h\) and \(\beta_2 = A \sin \frac{\pi}{6}M_h\). The equations can be combined to give \(A=\sqrt{\beta_1^2+\beta_2^2}\) = 23.98°F, very close to the value from the simplified model. It follows from the definitions of β1 and β2 that \[\cos \frac{\pi}{6}M_h = \frac{\beta_1}{A} \ \ \ , \ \ \ \sin \frac{\pi}{6}M_h = \frac{\beta_2}{A}\] The simplest way to calculate Mh is to use the atan2() function:
# Mh = 6/pi*atan2(beta2, beta1); beta2 is stored in fit2$coefficients[3] and
# beta1 is stored in fit2$coefficients[2]
Mh <- 6/pi*atan2(fit2$coefficients[3],fit2$coefficients[2])
Mh
sin(pi/6 * month)
-4.914261
Since the sine and cosine functions are periodic with a period of 2π, adding any multiple of 12 to Mh results in the same β1 and β2. So there are infinitely many solutions for Mh. If we add 12 to the number above, we get 7.086, which is the solution closest to 7. ↩
³ We can, however, perform a simulation to find the 95% CI approximately (feel free to skip the following if you don’t understand it). The idea is to add random noise to the variable y. The noise is normally distributed with mean 0 and standard deviation equal to the residual standard error. We then use the new data to fit the two parameters \(r\) and \(P_{\rm max}\). We repeat this experiment many times and then find the 2.5th and 97.5th percentiles of these parameters. These are the estimated 95% CIs for \(r\) and \(P_{\rm max}\). Here is code that performs the simulation:
## Calculate parameters that need only be computed once
ny <- length(y)
# residual standard error of the linear model
res <- summary(lm(y ~ x))$sigma
# mean, standard deviation and Z-score of x
mean_x <- mean(x)
stdx <- sd(x)*sqrt((ny-1)/ny)
Zx <- (x-mean_x)/stdx
# Define a function that adds random normal noise to y, fit a linear model
# and extract the parameters r and Pmax
find_paras <- function(y, ny, Zx, res, mean_x, stdx) {
# Add a normal random noise of mean 0 and sd=res to y
yn <- y + rnorm(ny, sd=res)
# Fit a linear model. Instead of using lm(), we do it using the
# formula slope = cor(x,y) * stdy/stdx since many parameters
# in the formula are fixed and only need to be computed once
mean_yn <- sum(yn)/ny
stdy <- sqrt(sum((yn - mean_yn)^2)/ny)
Zy <- (yn - mean_yn)/stdy # Z score of y
cor_xy <- sum(Zx*Zy)/ny # cor(x,y)
slope <- cor_xy * stdy/stdx
# Extract r and Pmax: r = intercept, Pmax = -r/slope
r <- mean_yn - slope*mean_x
Pmax <- -r/slope
output <- c(r, Pmax)
names(output) <- c("r","Pmax")
output
}
# Perform simulation with 100,000 experiments
set.seed(3472651)
paras <- replicate(1e5, find_paras(y, ny, Zx, res, mean_x, stdx))
Now that we have computed 100,000 values of \(r\) and \(P_{\rm max}\), we use the quantile() function to find the 2.5th and 97.5th percentiles for \(r\) and \(P_{\rm max}\) (see ?quantile for the syntax). These are the estimated 95% CIs for the two parameters.
CI_r <- quantile(paras["r",], probs=c(0.025, 0.975))
CI_Pmax <- quantile(paras["Pmax",], probs=c(0.025, 0.975))
CI_r
2.5% 97.5%
0.6191557 0.6967139
CI_Pmax
2.5% 97.5%
43.60278 47.39859
We see that the estimated 95% CI for r is close to that calculated by the confint() function. This gives us confidence in the CI result for \(P_{\rm max}\). According to the calculation, the 95% CI for \(P_{\rm max}\) is about (43.6, 47.4). ↩