# load tidyverse packages
library(tidyverse)
# load the data
stock <- read_csv("AAPLStock.csv")
Parsed with column specification:
cols(
Date = col_date(format = ""),
Open = col_double(),
High = col_double(),
Low = col_double(),
Close = col_double(),
Volume = col_integer(),
`Adj Close` = col_double()
)
stock
We want to create a new column `Change`, the percentage change of the stock's closing price (in column `Close`) relative to the closing price on the previous trading day (note that time decreases down the rows). Here I introduce the `lead()` command in `dplyr`. For a given vector `x`, `lead(x, n)` (where `n` is a positive integer) shifts the vector `x` ahead by `n` elements and then replaces the last `n` elements by `NA`s. The default value of `n` is 1. The following examples should make this clear.
(x <- 1:10)
[1] 1 2 3 4 5 6 7 8 9 10
lead(x)
[1] 2 3 4 5 6 7 8 9 10 NA
lead(x,2)
[1] 3 4 5 6 7 8 9 10 NA NA
lead(x,5)
[1] 6 7 8 9 10 NA NA NA NA NA
lead(x,9)
[1] 10 NA NA NA NA NA NA NA NA NA
There is a similar function called `lag()`. See `?lag`, or simply try it out by replacing `lead` with `lag` in the examples above to see how it works.
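For instance, a quick sketch using the same `x <- 1:10` as above: `lag()` shifts in the opposite direction, padding the front with `NA`s instead of the back:

```r
library(dplyr)
x <- 1:10
lag(x)     # [1] NA  1  2  3  4  5  6  7  8  9
lag(x, 3)  # [1] NA NA NA  1  2  3  4  5  6  7
```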
Now we can easily create the `Change` column using `mutate()` and `lead()`:
stock <- mutate(stock, Change = (Close - lead(Close))/lead(Close)*100)
head(stock)
# A tibble: 6 x 8
Date Open High Low Close Volume `Adj Close` Change
<date> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
1 2016-12-30 116.65 117.20 115.43 115.82 30253100 115.82 -0.77957935
2 2016-12-29 116.45 117.11 116.40 116.73 14963300 116.73 -0.02569287
3 2016-12-28 117.52 118.02 116.20 116.76 20582000 116.76 -0.42640286
4 2016-12-27 116.52 117.80 116.49 117.26 18071900 117.26 0.63508841
5 2016-12-23 115.59 116.52 115.59 116.52 14181200 116.52 0.19777797
6 2016-12-22 116.35 116.51 115.64 116.29 25789800 116.29 -0.65777978
tail(stock)
# A tibble: 6 x 8
Date Open High Low Close Volume `Adj Close` Change
<date> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
1 1980-12-19 28.250 28.375 28.250 28.250 12157600 0.419419 6.103283
2 1980-12-18 26.625 26.750 26.625 26.625 18362400 0.395293 2.898551
3 1980-12-17 25.875 26.000 25.875 25.875 21610400 0.384158 2.475248
4 1980-12-16 25.375 25.375 25.250 25.250 26432000 0.374879 -7.339450
5 1980-12-15 27.375 27.375 27.250 27.250 43971200 0.404572 -5.217391
6 1980-12-12 28.750 28.875 28.750 28.750 117258400 0.426842 NA
We can check that it gives the same result as the following base R commands given in the Lon Capa solution.
# remove the newly created `Change` column and copy the rest to a new tibble
stock2 <- select(stock, -Change)
# base R code to create the `Change` column
n <- nrow(stock2)
stock2$Change <- NA
stock2$Change[-n] <- (stock2$Close[-n] - stock2$Close[-1])/stock2$Close[-1] * 100
# check to see if the result is the same as the tidyverse approach:
identical(stock,stock2)
[1] TRUE
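The base R version works because dropping the first element of a vector aligns each entry with its successor; here is a minimal sketch with a made-up vector `x`:

```r
x <- c(10, 20, 30, 40)
x[-length(x)]  # drop the last element:  10 20 30
x[-1]          # drop the first element: 20 30 40 (each element's successor)
# so (x[-length(x)] - x[-1]) pairs each value with the next row's value,
# exactly what lead() does for all but the last row:
(x[-length(x)] - x[-1]) / x[-1] * 100  # -50.00000 -33.33333 -25.00000
```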
This base R command, which computes the fraction of days on which the closing price went up, is still my favorite method:
n <- nrow(stock)
mean(stock$Change[-n] > 0)
[1] 0.4889451
If we must use `dplyr`, we can first use `filter()` to remove the `NA`, then pipe (`%>%`) to `summarize()` to compute the mean:
filter(stock, !is.na(Change)) %>% summarize(mean(Change > 0))
# A tibble: 1 x 1
`mean(Change > 0)`
<dbl>
1 0.4889451
but this seems to be a complicated way to carry out the command `mean(stock$Change > 0, na.rm=TRUE)`.
Here `dplyr`'s `arrange()` command comes in handy:
arrange(stock, Change)
# A tibble: 9,092 x 8
Date Open High Low Close Volume `Adj Close`
<date> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 2014-06-09 92.7000 93.8800 91.7500 93.700 75415000 89.291786
2 2000-09-29 28.1875 29.0000 25.3750 25.750 1855410200 1.675311
3 2005-02-28 44.6800 45.1400 43.9600 44.860 162902600 5.837240
4 1987-06-16 41.5000 41.7500 38.0000 41.500 85680000 1.234147
5 2000-06-21 50.5000 56.9375 50.3125 55.625 122500000 3.618998
6 1987-10-19 48.2500 48.2500 35.5000 36.500 119000000 1.086855
7 1983-09-26 25.8750 25.8750 24.3750 24.875 192192000 0.369311
8 1993-07-16 28.5000 29.6250 26.5000 27.500 530149200 0.865228
9 1987-10-26 34.5000 35.0000 27.6250 28.000 78400000 0.833752
10 2008-09-29 119.6200 119.6800 100.5900 105.260 655514300 13.696564
# ... with 9,082 more rows, and 1 more variables: Change <dbl>
Uh-oh, we can't see the `Change` column. No worries: we can use `select()` to rearrange the columns:
arrange(stock, Change) %>% select(Change, Date:`Adj Close`)
# A tibble: 9,092 x 8
Change Date Open High Low Close Volume
<dbl> <date> <dbl> <dbl> <dbl> <dbl> <int>
1 -85.48570 2014-06-09 92.7000 93.8800 91.7500 93.700 75415000
2 -51.86916 2000-09-29 28.1875 29.0000 25.3750 25.750 1855410200
3 -49.58984 2005-02-28 44.6800 45.1400 43.9600 44.860 162902600
4 -47.13376 1987-06-16 41.5000 41.7500 38.0000 41.500 85680000
5 -45.06173 2000-06-21 50.5000 56.9375 50.3125 55.625 122500000
6 -24.35233 1987-10-19 48.2500 48.2500 35.5000 36.500 119000000
7 -23.46154 1983-09-26 25.8750 25.8750 24.3750 24.875 192192000
8 -23.07692 1993-07-16 28.5000 29.6250 26.5000 27.500 530149200
9 -21.12676 1987-10-26 34.5000 35.0000 27.6250 28.000 78400000
10 -17.91953 2008-09-29 119.6200 119.6800 100.5900 105.260 655514300
# ... with 9,082 more rows, and 1 more variables: `Adj Close` <dbl>
Or, equivalently:
arrange(stock, Change) %>% select(Change, everything())
# A tibble: 9,092 x 8
Change Date Open High Low Close Volume
<dbl> <date> <dbl> <dbl> <dbl> <dbl> <int>
1 -85.48570 2014-06-09 92.7000 93.8800 91.7500 93.700 75415000
2 -51.86916 2000-09-29 28.1875 29.0000 25.3750 25.750 1855410200
3 -49.58984 2005-02-28 44.6800 45.1400 43.9600 44.860 162902600
4 -47.13376 1987-06-16 41.5000 41.7500 38.0000 41.500 85680000
5 -45.06173 2000-06-21 50.5000 56.9375 50.3125 55.625 122500000
6 -24.35233 1987-10-19 48.2500 48.2500 35.5000 36.500 119000000
7 -23.46154 1983-09-26 25.8750 25.8750 24.3750 24.875 192192000
8 -23.07692 1993-07-16 28.5000 29.6250 26.5000 27.500 530149200
9 -21.12676 1987-10-26 34.5000 35.0000 27.6250 28.000 78400000
10 -17.91953 2008-09-29 119.6200 119.6800 100.5900 105.260 655514300
# ... with 9,082 more rows, and 1 more variables: `Adj Close` <dbl>
Here `everything()` is a helper function in `dplyr`. The command above keeps all the other columns, placed to the right of `Change`.
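As a minimal sketch on a toy tibble (the column names are made up for illustration): `everything()` selects all columns not yet named, so the column listed first moves to the front:

```r
library(dplyr)
df <- tibble(a = 1, b = 2, c = 3)
select(df, c, everything())  # columns are now in the order c, a, b
```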
We see that the largest one-day drop was -85.49%, on 2014-06-09.
Alternatively, we can also use `filter()` to single out the minimum of `Change`:
filter(stock, Change==min(Change, na.rm=TRUE))
# A tibble: 1 x 8
Date Open High Low Close Volume `Adj Close` Change
<date> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
1 2014-06-09 92.7 93.88 91.75 93.7 75415000 89.29179 -85.4857
Again we use the `lead()` function to do the calculation: `UpModel` is `TRUE` when the sum of the previous two days' changes is positive. We also move the `Close`, `Change` and `UpModel` columns to the left so that they can be displayed on the screen.
stock <- mutate(stock, UpModel = lead(Change)+lead(Change,2) > 0) %>%
select(Date, Close, Change, UpModel, everything())
head(stock)
# A tibble: 6 x 9
Date Close Change UpModel Open High Low Volume
<date> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <int>
1 2016-12-30 115.82 -0.77957935 FALSE 116.65 117.20 115.43 30253100
2 2016-12-29 116.73 -0.02569287 TRUE 116.45 117.11 116.40 14963300
3 2016-12-28 116.76 -0.42640286 TRUE 117.52 118.02 116.20 20582000
4 2016-12-27 117.26 0.63508841 FALSE 116.52 117.80 116.49 18071900
5 2016-12-23 116.52 0.19777797 FALSE 115.59 116.52 115.59 14181200
6 2016-12-22 116.29 -0.65777978 TRUE 116.35 116.51 115.64 25789800
# ... with 1 more variables: `Adj Close` <dbl>
tail(stock)
# A tibble: 6 x 9
Date Close Change UpModel Open High Low Volume
<date> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <int>
1 1980-12-19 28.250 6.103283 TRUE 28.250 28.375 28.250 12157600
2 1980-12-18 26.625 2.898551 FALSE 26.625 26.750 26.625 18362400
3 1980-12-17 25.875 2.475248 FALSE 25.875 26.000 25.875 21610400
4 1980-12-16 25.250 -7.339450 NA 25.375 25.375 25.250 26432000
5 1980-12-15 27.250 -5.217391 NA 27.375 27.375 27.250 43971200
6 1980-12-12 28.750 NA NA 28.750 28.875 28.750 117258400
# ... with 1 more variables: `Adj Close` <dbl>
You can check that the result is the same as the base R commands given in the Lon Capa solution.
I still prefer this base R command:
mean(stock$UpModel[1:(n-3)])
[1] 0.5251403
The model's prediction is correct when `stock$UpModel` is `TRUE` and `stock$Change > 0`, or when `stock$UpModel` is `FALSE` and `stock$Change <= 0`. What is the proportion of days on which the model gives a correct prediction? Again, this base R command is still my favorite:
mean(stock$UpModel[1:(n-3)] == (stock$Change[1:(n-3)] > 0))
[1] 0.4974145
With `dplyr`, we can do
filter(stock, !is.na(UpModel)) %>% summarize( mean(UpModel == (Change>0)) )
# A tibble: 1 x 1
`mean(UpModel == (Change > 0))`
<dbl>
1 0.4974145
This seems to be a complicated way of doing `with(stock, mean(UpModel == (Change > 0), na.rm=TRUE))`.