Predict Tomorrow’s Apple Stocks with LSTMs
An Introductory Article on Time Series Forecasting with LSTMs
In this story, we will unveil the potential of LSTMs (and sequence models in general) in predicting Apple’s stock prices (or those of any other company). The code we will use mostly improves upon and generalizes related work by Greg Hogg, which can be found here. By the end of this story, you will:
- Have a better grasp of stock markets.
- Have an understanding of how to employ LSTMs and other sequence models for predictive purposes such as forecasting time series data.
- Have the ability to predict tomorrow’s stock price (regardless of what day today is) for pretty much any company on Yahoo Finance.
Table of Contents
· Stock Market Basics
· Implementation
∘ Importing Necessary Packages
∘ Download Stocks
∘ Read and Plot the Data
∘ Format the Data
∘ Split and Normalize the Dataset
∘ Build the Model
∘ Train the Model
∘ Analyze the Model
∘ Predict Future Outputs
∘ Full Code
Stock Market Basics
A stock simply represents a share in the ownership of a company. The stock price of a company is the price of a single share. To understand what percentage of the company you will own after buying a single share, you need to look at its total number of shares outstanding. As of early November 2023, Apple has close to 16B shares outstanding and its stock price is around $176. This means that by paying $176 you get to own 1/16B of the company, which is less than 0.00000001% of Apple.
People buy stocks to earn money by: (i) selling them later at a higher price; and (ii) potentially earning a portion of the company’s profits. For instance, if Apple increases its worth by making a new product that everyone buys, demand for its stock will increase, which will in turn increase the stock price. Apple will also pay a small dividend (e.g., $0.50 per share) every quarter out of the profits it has made. Most real profit is gained by purchasing a stock when its price is low (e.g., a start-up), with the expectation that it will increase in the future, and then selling it at a higher price. Likewise, one can choose to sell a stock at an acceptable price when anticipating a future drop in its value.
The stock price fluctuates throughout each trading day with supply and demand. Most markets have specific trading hours (e.g., 9:30 AM to 4:00 PM Eastern Time for the main U.S. stock markets).
For purposes of evaluating the stock price on a particular day, people look at the closing price, which is the stock’s price shortly before the closing bell rings. Because this closing price does not take into account corporate actions that can happen after trading has finished for the day, such as splits (e.g., doubling the number of outstanding shares to decrease the stock price), it makes more sense to look at an adjusted version of it, called the “Adjusted Close,” when considering past data.
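As a toy illustration of why the adjustment matters (the numbers are hypothetical):
# hypothetical: a stock closes at $100, then undergoes a 2-for-1 split
pre_split_close = 100.0
split_ratio = 2
# the adjusted close divides pre-split prices by the split ratio so they
# remain comparable with post-split prices (which start around $50)
pre_split_adj_close = pre_split_close / split_ratio  # 50.0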
Other stock-related data specific to each trading day includes the following (a toy example is shown right after the list):
- “Open,” which represents the initial price of the stock for the day.
- “High,” which indicates the highest price at which the stock was traded during the day.
- “Low,” which signifies the lowest price at which the stock was traded during the day.
- “Volume,” which represents the total number of shares of the stock that were traded on that day.
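To make these concrete, here is a toy record for a single trading day with the columns we will see in the downloaded data (all numbers below are made up for illustration):
# purely hypothetical numbers, not real AAPL data
example_day = {
    "Date": "2023-11-03",
    "Open": 174.24,        # price at the opening bell
    "High": 176.82,        # highest traded price of the day
    "Low": 173.35,         # lowest traded price of the day
    "Close": 176.65,       # price at the closing bell
    "Adj Close": 176.65,   # close adjusted for splits and dividends
    "Volume": 79_763_700,  # number of shares traded that day
}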
Implementation
Importing Necessary Packages
Let’s start with the imports:
# for tabular and matrix operations and representation for the dataset
import pandas as pd
import numpy as np
# for normalizing the data
from sklearn.preprocessing import MinMaxScaler
# for plotting data and analysis
import matplotlib.pyplot as plt
# for defining the deep learning model
import pytorch_lightning as pl
import torch
import torch.nn as nn
from torch import optim, from_numpy
from torch.utils.data import TensorDataset, DataLoader
# for downloading updated stocks
from datetime import datetime
import requests
Download Stocks
To download stocks, we will be using Yahoo Finance. You can find historical data for the stocks of most companies there and it’s easy to automate downloading such data:
def download_stock_data(stock_symbol, start_date, end_date):
    # Convert the start and end dates to Unix timestamps
    start_date = int(datetime.strptime(start_date, "%d/%m/%y").timestamp())
    end_date = int(datetime.strptime(end_date, "%d/%m/%y").timestamp())
    # Create the download URL
    url = "https://query1.finance.yahoo.com/v7/finance/download"
    url += f"/{stock_symbol}"
    url += f"?period1={start_date}&period2={end_date}"
    url += "&interval=1d&events=history"
    url += "&includeAdjustedClose=true"
    # Send a GET request to download the file
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    # Save if things are okay
    if response.status_code == 200:
        # Save the content to a file (you can customize the file name)
        with open(f"{stock_symbol}.csv", "wb") as file:
            file.write(response.content)
        print(f"Data for {stock_symbol} downloaded successfully.")
    else:
        print(f"Failed to download. Status code: {response.status_code}")
This function takes the stock symbol (e.g., “AAPL” for Apple) and the time range for which historical stock data should be acquired. It saves the data in the same directory as <stock_symbol>.csv. Thus, we can download Apple’s stock data as follows:
stock = "AAPL" # can also try "AMZN" or "TSLA"
start_date = "06/11/99"
end_date = datetime.today().strftime('%d/%m/%y')
download_stock_data(stock, start_date, end_date)
We can also do the same for any company, knowing its ticker symbol, but beware that some of them (e.g., TSLA) didn’t go public until 2010, so be careful when setting the start_date.
Read and Plot the Data
Now that it’s downloaded, we can see a sample of the data (the first three and last three rows in particular):
dataset = pd.read_csv(f'{stock}.csv')
pd.concat([dataset.head(3), dataset.tail(3)])
The dataset has all the variables that we are familiar with, documented for each trading day since 1999-11-06, excluding weekends and market holidays (when there is no trading).
We are, of course, interested in the adjusted close:
dataset = dataset[['Date', 'Adj Close']]
dataset['Date'] = pd.to_datetime(dataset['Date'])
We can observe its evolution through a quick plot:
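A minimal sketch of such a plot (using the matplotlib import from earlier; the figure size and labels are my own choices):
# quick look at the adjusted close over time
plt.figure(figsize=(12, 5))
plt.plot(dataset['Date'], dataset['Adj Close'], linewidth=1)
plt.xlabel('Date')
plt.ylabel('Adjusted Close ($)')
plt.title(f'{stock} Adjusted Close over Time')
plt.show()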
Imagine being able to convince your parents to buy Apple stock for under $1 in 1999, or at least to buy some in 2020.
Format the Data
Sequence models such as RNNs, LSTMs, GRUs, or transformers are by design great at learning from sequential data, which is why they are used in tasks such as language modeling; that task boils down to learning to predict the next word when given an incomplete sentence as a sequence of words. Our task is similar in the sense that we will train the LSTM to predict the closing price for a day given the closing prices of the previous seven days. For this, we can format each row in the dataset to include the closing price of each day along with the closing prices of the seven days before it. Something like this:
To accomplish this, we need to create seven new columns, the first is the close column shifted down by one (looking one day before), the second is the close column shifted down by two (looking two days before), and so on.
lstm_dataset = dataset.copy()
for i in range(1, 8):
    lstm_dataset['Adj Close-' + str(i)] = dataset['Adj Close'].shift(i)
lstm_dataset = lstm_dataset[7:]
We also got rid of the first seven rows because they will contain nulls (e.g., when the close column is shifted down by two days, the first two days in the column will become null).
We intend to have the LSTM predict tomorrow’s stock value based on that of the last seven days. For this, it doesn’t have to be trained on specific dates; it suffices for it to learn how to estimate the next value by looking at the past seven values of the stock, regardless of what day it is.
Thus, we will drop the dates column and reorder the columns so that the target variable (Adj Close) appears on the right, as you may be used to.
# delete the date column
lstm_dataset.drop(columns=['Date'], inplace=True)
# reverse the order of the columns so the RNN takes the oldest price first
lstm_dataset = lstm_dataset.iloc[:, ::-1]
lstm_dataset.head(10)
Notice the diagonal pattern from the top right.
Split and Normalize the Dataset
In this step, we want to:
- Extract the target column into a separate variable.
- Split the data into training, validation, and testing sets. We will aim for the training set to contain 95% of the data and the validation set to contain the remaining 5% (about one year), except for the last fourteen days, which we will use as a small test set. We use both a validation and a test set so we can more freely make training decisions based on the validation data.
- Normalize the close price values to the -1 to 1 range, which can help the model learn.
- Convert the NumPy matrices to PyTorch tensors, because that’s what the deep learning model will expect.
Splitting the Data
# separate the features from the target column (last one)
x_data = lstm_dataset.iloc[:, :-1].values
y_data = lstm_dataset.iloc[:, -1].values
# train test split
train_percent = 0.95
test_size = 14
train_size = int(len(x_data) * train_percent)
x_train = x_data[:train_size]
y_train = y_data[:train_size].reshape(-1, 1)
x_val = x_data[train_size:-test_size]
y_val = y_data[train_size:-test_size].reshape(-1, 1)
x_test = x_data[-test_size:]
y_test = y_data[-test_size:].reshape(-1, 1)
Normalizing the Data
# scale them using a min max scaler
x_scaler = MinMaxScaler(feature_range=(-1, 1))
y_scaler = MinMaxScaler(feature_range=(-1, 1))
# transform the data
x_train = x_scaler.fit_transform(x_train)
y_train = y_scaler.fit_transform(y_train)
x_val = x_scaler.transform(x_val)
y_val = y_scaler.transform(y_val)
x_test = x_scaler.transform(x_test)
y_test = y_scaler.transform(y_test)
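For intuition, the (-1, 1) min-max scaling above boils down to the following mapping (a sketch of the formula, not sklearn’s internal code; min_max_scale is a name used here only for illustration):
def min_max_scale(x, x_min, x_max, low=-1.0, high=1.0):
    # maps x linearly from [x_min, x_max] onto [low, high], which is what
    # MinMaxScaler(feature_range=(-1, 1)) does per column, using the minimum
    # and maximum seen during fit_transform on the training data
    return low + (x - x_min) * (high - low) / (x_max - x_min)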
Tensorifying the Data
x_train_t = from_numpy(x_train).float().unsqueeze(2)
y_train_t = from_numpy(y_train).float()
x_val_t = from_numpy(x_val).float().unsqueeze(2)
y_val_t = from_numpy(y_val).float()
x_test_t = from_numpy(x_test).float().unsqueeze(2)
y_test_t = from_numpy(y_test).float()
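As a quick sanity check before building the model (a sketch; the exact number of rows depends on the download date), the tensors should have the (batch, sequence length, features) layout the LSTM expects:
# the LSTM expects inputs of shape (batch_size, seq_len, input_size)
print(x_train_t.shape)  # torch.Size([num_train_rows, 7, 1])
print(y_train_t.shape)  # torch.Size([num_train_rows, 1])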
Build the Model
We will implement the LSTM as a PyTorch Lightning module and train it with Lightning. PyTorch Lightning is a high-level deep learning library that uses PyTorch under the hood. We will make an LSTM class where we will:
- Define the model’s hyperparameters and use them to construct the layers of the LSTM.
- Define the forward function for the model (how data flows through the defined layers) and define the optimizer that will be used for training.
- Prepare the data for the LSTM by wrapping the tensors we built earlier (tensors are like NumPy arrays, but they can live on the GPU) into datasets.
- Define the training and validation data loaders that group the data into batches used for training and validation.
- Define the logic performed for each training/validation batch.
# set GPU if exists
device = 'cuda' if torch.cuda.is_available() else 'cpu'

class LSTM(pl.LightningModule):
    ### 1. model and optimizer definition
    def __init__(self, hidden_size, num_layers):
        super().__init__()
        # hyperparameters (saved so load_from_checkpoint can rebuild the model)
        self.save_hyperparameters()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # layers (notice one node in and one node out)
        self.lstm = nn.LSTM(1, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        # out.shape => (batch_size, seq_len, hidden_size)
        # feed in the last seq output
        out = self.fc(out[:, -1, :])
        return out

    def configure_optimizers(self):
        learning_rate = 0.001
        optimizer = optim.Adam(self.parameters(), lr=learning_rate)
        return optimizer

    ### 2. prepare and load data in batches
    def prepare_data(self):
        self.train_data = TensorDataset(x_train_t, y_train_t)
        self.val_data = TensorDataset(x_val_t, y_val_t)

    def train_dataloader(self):
        train_loader = DataLoader(self.train_data, shuffle=True,
                                  batch_size=16)
        return train_loader

    def val_dataloader(self):
        val_loader = DataLoader(self.val_data, batch_size=16)
        return val_loader

    ### 3. define training and validation steps
    def training_step(self, batch, batch_idx):
        # get x, y from batch
        x_batch, y_batch = batch
        # move to GPU if exists
        x_batch, y_batch = x_batch.to(device), y_batch.to(device)
        # feed to the model
        output = self(x_batch)
        # compare model output with y for loss
        loss = nn.MSELoss()(output, y_batch)
        # log the loss
        self.log('train_loss', loss, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x_batch, y_batch = batch
        x_batch, y_batch = x_batch.to(device), y_batch.to(device)
        output = self(x_batch)
        loss = nn.MSELoss()(output, y_batch)
        self.log('val_loss', loss, prog_bar=True)
Observe that we can replace nn.LSTM() with another sequence model such as nn.GRU() or nn.RNN() and expect it to work normally.
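For illustration, here is a minimal sketch of what that swap could look like (GRUModel is a hypothetical name; the data-preparation, loader, and step methods from the LSTM class above would carry over unchanged):
# a hypothetical GRU variant: only the recurrent layer changes
class GRUModel(pl.LightningModule):
    def __init__(self, hidden_size, num_layers):
        super().__init__()
        self.gru = nn.GRU(1, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)
    def forward(self, x):
        out, _ = self.gru(x)           # out: (batch_size, seq_len, hidden_size)
        return self.fc(out[:, -1, :])  # predict from the last time step only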
Train the Model
We can instantiate the model and train it as follows
# Create a Trainer instance
trainer = pl.Trainer(max_epochs=50)
# Train your model using the Trainer
machine = LSTM(hidden_size=6, num_layers=1)
trainer.fit(machine)
We start with a moderately small number of layers and hidden neurons, and increase them if the model underfits. The number of epochs is trickier, however, because it’s hard to tell in advance whether a given number of epochs trains the model for too long or too short a time.
Early stopping is a common strategy to avoid choosing a number of epochs up front. It operates by evaluating the model after each training epoch on new data not used in training (validation data) and stops whenever the model stops improving (which signals overfitting). Because validation loss can sometimes increase coincidentally, it’s more common to choose an integer patience and stop only if the model fails to improve for that number of consecutive epochs.
For this, we train the model as follows (instead of the simpler version above):
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
# Define early stop and saving call backs
early_stop_callback = EarlyStopping(monitor="val_loss",
                                    patience=5, mode="min")
checkpoint_callback = ModelCheckpoint(save_top_k=1,
                                      monitor="val_loss", mode="min")
# Create a Trainer instance with the callbacks
trainer = pl.Trainer(max_epochs=100,
                     callbacks=[early_stop_callback, checkpoint_callback])
# Train your model using the Trainer
machine = LSTM(hidden_size=6, num_layers=1)
trainer.fit(machine)
# Load the best model (kept on the CPU so it can be applied
# directly to the CPU tensors in the analysis below)
machine = LSTM.load_from_checkpoint(checkpoint_callback.best_model_path).to("cpu")
The early-stopping callback is responsible for stopping training once five epochs have passed with no improvement in validation loss. Meanwhile, the checkpoint callback is responsible for saving the best model (e.g., the model as it was before the last five epochs, if that was the best overall), which we then load with load_from_checkpoint.
For the most recent trial I did, training took 73 epochs.
Analyze the Model
Let’s define a simple function to compute the average percentage error in the predicted close.
def avg_error_percentage(y_actual, y_predicted):
    # Calculate the absolute error between actual and predicted values
    absolute_errors = abs(y_actual - y_predicted)
    # Calculate the error percentage for each data point
    error_percentages = (absolute_errors / y_actual) * 100
    # Return the average error percentage
    return round(np.mean(error_percentages), 2)
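For instance, with purely hypothetical numbers:
# hypothetical values: errors of 10% and 5% average out to 7.5%
y_actual = np.array([100.0, 200.0])
y_predicted = np.array([110.0, 190.0])
print(avg_error_percentage(y_actual, y_predicted))  # 7.5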
As we will need them for plotting, let’s find the dates corresponding to the training, validation, and test sets:
dates = dataset['Date'][7:]
train_size = int(len(x_data) * 0.95)
dates_train = dates[:train_size]
dates_val = dates[train_size:-14]
dates_test = dates[-14:]
Now let’s define a function that takes the training, validation, and testing data along with their dates, and produces a plot for each that shows the model’s predictions versus the real values of the close.
def time_series_analysis(x_datas, y_datas, dates):
    # Create a figure with three subplots side by side
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    # Labels for the subplots
    subplot_labels = ['Training Close Prices',
                      'Validation Close Prices',
                      'Test Close Prices']
    # Plot Actual Close and Predicted Close in each subplot
    for i in range(3):
        with torch.no_grad():
            x_data, y_data = x_datas[i], y_datas[i]
            y_pred = machine(x_data).detach().cpu().numpy().flatten()
        # make a Pandas table
        y_table = pd.DataFrame()
        y_table['Date'] = dates[i]
        y_table['Actual Close'] = \
            y_scaler.inverse_transform(y_data.reshape(-1, 1))
        y_table['Predicted Close'] = \
            y_scaler.inverse_transform(y_pred.reshape(-1, 1))
        # plot a line chart for actual and predicted closes vs. time
        axes[i].plot(y_table['Date'], y_table['Actual Close'],
                     color='teal', linewidth=1)
        axes[i].plot(y_table['Date'], y_table['Predicted Close'],
                     color='pink', linewidth=1)
        # scatter plot as well for test data
        if i == 2:
            axes[i].scatter(y_table['Date'], y_table['Actual Close'],
                            color='teal', s=10, marker='o',
                            label='Actual Close (Scatter)')
            axes[i].scatter(y_table['Date'], y_table['Predicted Close'],
                            color='pink', s=10, marker='o',
                            label='Predicted Close (Scatter)')
        # set the plot's title
        error_percentage = avg_error_percentage(y_table['Actual Close'],
                                                y_table['Predicted Close'])
        axes[i].set_title(subplot_labels[i] +
                          f" with average error {error_percentage}%")
        # rotate x-axis ticks and show legend
        axes[i].tick_params(axis='x', rotation=90)
        axes[i].legend(['Actual Close', 'Predicted Close'])
    # Adjust the layout
    plt.tight_layout()
    # Show the plots
    plt.show()
When given the following:
x_datas = [x_train_t, x_val_t, x_test_t]
y_datas = [y_train_t, y_val_t, y_test_t]
dates = [dates_train, dates_val, dates_test]
time_series_analysis(x_datas, y_datas, dates)
This results in the following plots
We can zoom in on the first to see:
Visually, the model seems to fit the training data really well. The high average error could be explained by specific instances where the model’s output diverged widely from the real close.
The model’s performance on the validation data (which it has not been trained on) seems quite impressive, with an average error of about 1.4%. Keep in mind that for each prediction the model had access to the real closes from the previous seven days.
The model’s performance on validation data closely generalizes to test data:
Predict Future Outputs
Similar to how repeated next-word prediction allows language models to construct entire valid sentences, we can write a function that lets the LSTM predict future closes autoregressively.
def predict_future(y_data, dates, num_future_days):
    # Get the closes of the most recent seven days
    last_seven_closes = y_data[-7:].unsqueeze(0)
    # prepare a list to store future closes
    future_preds = []
    for i in range(num_future_days):
        # predict tomorrow's close using the last seven closes
        tomorrow_pred = machine(last_seven_closes)
        # store it as a future prediction
        tomorrow_pred_np = tomorrow_pred.detach().cpu().numpy().flatten()
        tomorrow_pred_scalar = y_scaler.inverse_transform(
            tomorrow_pred_np.reshape(-1, 1))[0][0]
        future_preds.append(tomorrow_pred_scalar)
        # add the prediction to the last seven closes and remove the first
        last_seven_closes = torch.cat((last_seven_closes,
                                       tomorrow_pred.detach().unsqueeze(2)), dim=1)
        last_seven_closes = last_seven_closes[:, 1:, :]
    # make a table for the future predictions
    future_preds_table = pd.DataFrame()
    future_preds_table['Date'] = pd.date_range(start=(max(dates) +
                                                      pd.Timedelta(days=1)),
                                               periods=num_future_days,
                                               freq='B')
    future_preds_table['Predicted Close'] = future_preds
    return future_preds_table.T
Here we use the close prices of the last seven days in the test data (which are the last seven close prices in reality) to predict the close price for tomorrow. Then we use tomorrow’s prediction along with the last six real close prices to predict the close price for the day after tomorrow, and so on, until we have predicted num_future_days close prices. Of course, the problem with this is error accumulation: the more days into the future we predict, the more likely the predictions are to be imprecise.
Notice that by passing freq='B' to pd.date_range we exclude weekends, which are days when Apple stock trading does not occur.
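As a usage sketch (assuming the test tensors, dates, and scalers from the previous steps; the three-day horizon is just an example), the function could be called like this:
# predict the next three business days after the test period
future_table = predict_future(y_test_t, dates_test, num_future_days=3)
print(future_table)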
This is the LSTM’s prediction for the first three days of next week:
Full Code
You can interact with the entire code in a Colab notebook via this link.
I hope this story has helped you understand more about stocks and how stock prediction could be done via sequence models and other deep learning concepts. Till next time, au revoir.