# load tidyverse packages
library(tidyverse)

The daily stock market prices are available on many internet sites. Here is a data file containing the daily stock price for the Apple Inc.. It was downloaded from Yahoo! Finance.

# load the data
stock <- read_csv("AAPLStock.csv")
Parsed with column specification:
cols(
  Date = col_date(format = ""),
  Open = col_double(),
  High = col_double(),
  Low = col_double(),
  Close = col_double(),
  Volume = col_integer(),
  `Adj Close` = col_double()
)

Add a column named “Change” to the tibble stock, which is the percentage of the change of the stock’s closing price (in column Close) compared to the closing price in the previous trade day (note that time is decreasing down the rows).

Here I introduce the lead() command in dplyr. For a given vector x, lead(x,n) (n is a positive integer) shifts the vector x ahead by n elements and then replace the last n elements by NAs. The default value of n is 1. The following examples should make this clear.

(x <- 1:10)
 [1]  1  2  3  4  5  6  7  8  9 10
lead(x)
 [1]  2  3  4  5  6  7  8  9 10 NA
lead(x,2)
 [1]  3  4  5  6  7  8  9 10 NA NA
lead(x,5)
 [1]  6  7  8  9 10 NA NA NA NA NA
lead(x,9)
 [1] 10 NA NA NA NA NA NA NA NA NA

There is another similar function called lag(). See ?lag or simply try it out by replacing lead by lag in the examples above to see how it works.

Now we can easily create the Change column using mutate() and lead():

stock <- mutate(stock, Change = (Close - lead(Close))/lead(Close)*100)
head(stock)
# A tibble: 6 x 8
        Date   Open   High    Low  Close   Volume `Adj Close`      Change
      <date>  <dbl>  <dbl>  <dbl>  <dbl>    <int>       <dbl>       <dbl>
1 2016-12-30 116.65 117.20 115.43 115.82 30253100      115.82 -0.77957935
2 2016-12-29 116.45 117.11 116.40 116.73 14963300      116.73 -0.02569287
3 2016-12-28 117.52 118.02 116.20 116.76 20582000      116.76 -0.42640286
4 2016-12-27 116.52 117.80 116.49 117.26 18071900      117.26  0.63508841
5 2016-12-23 115.59 116.52 115.59 116.52 14181200      116.52  0.19777797
6 2016-12-22 116.35 116.51 115.64 116.29 25789800      116.29 -0.65777978
tail(stock)
# A tibble: 6 x 8
        Date   Open   High    Low  Close    Volume `Adj Close`    Change
      <date>  <dbl>  <dbl>  <dbl>  <dbl>     <int>       <dbl>     <dbl>
1 1980-12-19 28.250 28.375 28.250 28.250  12157600    0.419419  6.103283
2 1980-12-18 26.625 26.750 26.625 26.625  18362400    0.395293  2.898551
3 1980-12-17 25.875 26.000 25.875 25.875  21610400    0.384158  2.475248
4 1980-12-16 25.375 25.375 25.250 25.250  26432000    0.374879 -7.339450
5 1980-12-15 27.375 27.375 27.250 27.250  43971200    0.404572 -5.217391
6 1980-12-12 28.750 28.875 28.750 28.750 117258400    0.426842        NA

We can check that it gives the same result as the following base R commands given in the Lon Capa solution.

# remove the newly created `Change` column and copied to a new tibble
stock2 <- select(stock, -Change) 
# base R code to create the `Change` column
n <- nrow(stock2)
stock2$Change <- NA
stock2$Change[-n] <- (stock2$Close[-n] - stock2$Close[-1])/stock2$Close[-1] * 100
# check to see if the result is the same as the tidyverse approach:
identical(stock,stock2)
[1] TRUE

A positive value of “Change” means that the price went up. What is the proportion of days where the price went up in the data (not counting the NA in the last row, of course)?

This base R command is still my favorite method:

n <- nrow(stock)
mean(stock$Change[-n] > 0)
[1] 0.4889451

If we must use dplyr, we can first use filter() to filter out the NA and then (%>%) summarize() to compute the mean:

filter(stock, !is.na(Change)) %>% summarize(mean(Change > 0))
# A tibble: 1 x 1
  `mean(Change > 0)`
               <dbl>
1          0.4889451

but this seems to be a complicated way to carry out the command mean(stock$Change > 0, na.rm=TRUE).

On which date did the percentage of the price drop the most? What is the percentage of the drop?

Here dplyr’s arrange() command comes in handy:

arrange(stock, Change)
# A tibble: 9,092 x 8
         Date     Open     High      Low   Close     Volume `Adj Close`
       <date>    <dbl>    <dbl>    <dbl>   <dbl>      <int>       <dbl>
 1 2014-06-09  92.7000  93.8800  91.7500  93.700   75415000   89.291786
 2 2000-09-29  28.1875  29.0000  25.3750  25.750 1855410200    1.675311
 3 2005-02-28  44.6800  45.1400  43.9600  44.860  162902600    5.837240
 4 1987-06-16  41.5000  41.7500  38.0000  41.500   85680000    1.234147
 5 2000-06-21  50.5000  56.9375  50.3125  55.625  122500000    3.618998
 6 1987-10-19  48.2500  48.2500  35.5000  36.500  119000000    1.086855
 7 1983-09-26  25.8750  25.8750  24.3750  24.875  192192000    0.369311
 8 1993-07-16  28.5000  29.6250  26.5000  27.500  530149200    0.865228
 9 1987-10-26  34.5000  35.0000  27.6250  28.000   78400000    0.833752
10 2008-09-29 119.6200 119.6800 100.5900 105.260  655514300   13.696564
# ... with 9,082 more rows, and 1 more variables: Change <dbl>

Oh-oh, we can’t see the Change column. No worry. We can use select() to rearrange columns:

arrange(stock, Change) %>% select(Change, Date:`Adj Close`)
# A tibble: 9,092 x 8
      Change       Date     Open     High      Low   Close     Volume
       <dbl>     <date>    <dbl>    <dbl>    <dbl>   <dbl>      <int>
 1 -85.48570 2014-06-09  92.7000  93.8800  91.7500  93.700   75415000
 2 -51.86916 2000-09-29  28.1875  29.0000  25.3750  25.750 1855410200
 3 -49.58984 2005-02-28  44.6800  45.1400  43.9600  44.860  162902600
 4 -47.13376 1987-06-16  41.5000  41.7500  38.0000  41.500   85680000
 5 -45.06173 2000-06-21  50.5000  56.9375  50.3125  55.625  122500000
 6 -24.35233 1987-10-19  48.2500  48.2500  35.5000  36.500  119000000
 7 -23.46154 1983-09-26  25.8750  25.8750  24.3750  24.875  192192000
 8 -23.07692 1993-07-16  28.5000  29.6250  26.5000  27.500  530149200
 9 -21.12676 1987-10-26  34.5000  35.0000  27.6250  28.000   78400000
10 -17.91953 2008-09-29 119.6200 119.6800 100.5900 105.260  655514300
# ... with 9,082 more rows, and 1 more variables: `Adj Close` <dbl>

OR

arrange(stock, Change) %>% select(Change, everything())
# A tibble: 9,092 x 8
      Change       Date     Open     High      Low   Close     Volume
       <dbl>     <date>    <dbl>    <dbl>    <dbl>   <dbl>      <int>
 1 -85.48570 2014-06-09  92.7000  93.8800  91.7500  93.700   75415000
 2 -51.86916 2000-09-29  28.1875  29.0000  25.3750  25.750 1855410200
 3 -49.58984 2005-02-28  44.6800  45.1400  43.9600  44.860  162902600
 4 -47.13376 1987-06-16  41.5000  41.7500  38.0000  41.500   85680000
 5 -45.06173 2000-06-21  50.5000  56.9375  50.3125  55.625  122500000
 6 -24.35233 1987-10-19  48.2500  48.2500  35.5000  36.500  119000000
 7 -23.46154 1983-09-26  25.8750  25.8750  24.3750  24.875  192192000
 8 -23.07692 1993-07-16  28.5000  29.6250  26.5000  27.500  530149200
 9 -21.12676 1987-10-26  34.5000  35.0000  27.6250  28.000   78400000
10 -17.91953 2008-09-29 119.6200 119.6800 100.5900 105.260  655514300
# ... with 9,082 more rows, and 1 more variables: `Adj Close` <dbl>

Here everything() is a helper function in dplyr. The command above means keeping all other columns to the right of Change.

We see that the percentage of the maximum drop was -85.49% on 2014-06-09.

Alternatively, we can also use filter() to single out the minimum of Change:

filter(stock, Change==min(Change, na.rm=TRUE))
# A tibble: 1 x 8
        Date  Open  High   Low Close   Volume `Adj Close`   Change
      <date> <dbl> <dbl> <dbl> <dbl>    <int>       <dbl>    <dbl>
1 2014-06-09  92.7 93.88 91.75  93.7 75415000    89.29179 -85.4857

Consider a naive model of predicting whether the price will go up or not based on the values of “Change” in the previous two trade days: predict the price going up if the sum of “Change” in the previous two days is positive. Add a column named “UpModel” to the data frame. Set the value to “TRUE” if the model predicts the price goes up; otherwise set it to “FALSE”.

Again use the lead() function to do the calculation. Also move the Close, Change and UpModel columns to the left so that they can be displayed on the screen.

stock <- mutate(stock, UpModel = lead(Change)+lead(Change,2) > 0) %>% 
  select(Date, Close, Change, UpModel, everything())
head(stock)
# A tibble: 6 x 9
        Date  Close      Change UpModel   Open   High    Low   Volume
      <date>  <dbl>       <dbl>   <lgl>  <dbl>  <dbl>  <dbl>    <int>
1 2016-12-30 115.82 -0.77957935   FALSE 116.65 117.20 115.43 30253100
2 2016-12-29 116.73 -0.02569287    TRUE 116.45 117.11 116.40 14963300
3 2016-12-28 116.76 -0.42640286    TRUE 117.52 118.02 116.20 20582000
4 2016-12-27 117.26  0.63508841   FALSE 116.52 117.80 116.49 18071900
5 2016-12-23 116.52  0.19777797   FALSE 115.59 116.52 115.59 14181200
6 2016-12-22 116.29 -0.65777978    TRUE 116.35 116.51 115.64 25789800
# ... with 1 more variables: `Adj Close` <dbl>
tail(stock)
# A tibble: 6 x 9
        Date  Close    Change UpModel   Open   High    Low    Volume
      <date>  <dbl>     <dbl>   <lgl>  <dbl>  <dbl>  <dbl>     <int>
1 1980-12-19 28.250  6.103283    TRUE 28.250 28.375 28.250  12157600
2 1980-12-18 26.625  2.898551   FALSE 26.625 26.750 26.625  18362400
3 1980-12-17 25.875  2.475248   FALSE 25.875 26.000 25.875  21610400
4 1980-12-16 25.250 -7.339450      NA 25.375 25.375 25.250  26432000
5 1980-12-15 27.250 -5.217391      NA 27.375 27.375 27.250  43971200
6 1980-12-12 28.750        NA      NA 28.750 28.875 28.750 117258400
# ... with 1 more variables: `Adj Close` <dbl>

You can check that the result is the same as the base R commands given in the Lon Capa solution.

What is the proportion of days where the model predicts the price going up?

I still prefer this base R command:

mean(stock$UpModel[1:(n-3)])
[1] 0.5251403

The model gives a correct prediction if stock$UpModel is TRUE when stock$Change>0 or stock$UpModel is FALSE when stock$Change<=0. What is the proportion of days where the model gives correct predictions?

Again, this base R command is still my favorite:

mean(stock$UpModel[1:(n-3)] == (stock$Change[1:(n-3)] > 0))
[1] 0.4974145

With dplyr, we can do

filter(stock, !is.na(UpModel)) %>% summarize( mean(UpModel == (Change>0)) )
# A tibble: 1 x 1
  `mean(UpModel == (Change > 0))`
                            <dbl>
1                       0.4974145

This seems to be a complicated way of doing with(stock, mean(UpModel == (Change>0), na.rm=TRUE)).