Classifying data with Python
Thomas Dickson
19 minute read
I’m working through some problems in the book An Introduction to Statistical Learning and I thought that one of my solutions might make an interesting post. I wanted to use a range of different Python features to build a simple machine learning pipeline from first principles.
The first part of this post specifies the problem, a binary classification problem, and introduces the fundamental differences between the classifiers considered. Subsequent sections introduce the dataset and perform exploratory data analysis to identify significant variables. I finish by introducing a simple machine learning pipeline built using standard Python libraries that applies combinations of preprocessing scripts and classifiers to the dataset.
The problem
I’m tackling Q10 from Chapter 4 of ISL, which can be summarised as follows:
Compare the performance of logistic regression, linear discriminant analysis, quadratic discriminant analysis and k nearest neighbours on predicting whether a stock price will go up or down in the Weekly dataset.
The Weekly dataset records $1089$ weekly stock market returns over $21$ years and can be downloaded from this location.
Maths
The Weekly dataset is an example of a binary classification problem, where the response $Y$ is binary given observations $X$. We want to model the probability that $Y$ belongs to a particular category. The equation below gives the probability that $Y$ belongs to category $1$ given $X$.

\[p(X) = Pr(Y = 1 | X)\]

Let’s describe the key differences between each classification method. Logistic regression models $p(X)$ using the logistic function:

\[p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\]

The coefficients $\beta_0$ and $\beta_1$ are fitted to the data using the maximum likelihood method (helpful explanation video here). Maximum likelihood estimates values for $\beta_0$ and $\beta_1$ such that the predicted probability for each observation in the dataset is as close as possible to its actual observed class.
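To make the formula concrete, here’s a tiny numerical sketch of the logistic function (the $\beta$ values are made up purely for illustration):

```python
import numpy as np

def logistic(x, beta_0=-0.1, beta_1=0.5):
    """Pr(Y = 1 | X = x) under a simple logistic model; the beta values are made up."""
    return np.exp(beta_0 + beta_1 * x) / (1.0 + np.exp(beta_0 + beta_1 * x))

# whatever the input, the output is squeezed into the interval (0, 1)
print(logistic(np.array([-10.0, 0.0, 10.0])))
```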
Linear discriminant analysis (LDA) models the distribution of the predictors $X$ given $Y$ and uses Bayes’ theorem to create estimates for $Pr(Y=k|X=x)$. LDA is more stable than logistic regression when the classes are well separated, or when $n$ is small and the distribution of the predictors $X$ is approximately normal. LDA is also popular when there are more than two response classes.
The key assumption of LDA is that observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all classes. Quadratic discriminant analysis (QDA) relaxes this assumption: each class is allowed its own covariance matrix.
The k nearest neighbours (k-NN) algorithm estimates the conditional distribution of $Y$ given $X$ and then assigns an observation to the class with the highest estimated probability. Given a positive integer $K$ and a test observation $x_0$, k-NN first identifies the $K$ points in the training data that are closest to $x_0$, represented by $N_0$. The classifier then estimates the conditional probability for class $j$ as the fraction of points in $N_0$ whose response values equal $j$:
\[Pr(Y = j | X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)\]

The dataset
Let’s look at the Weekly dataset in more detail. The Direction column is the response variable and records if the stock market went up or down. The question directs us to use the Volume column and the five lag variables as the predictors. I used this snippet to load the data:
```python
import pyreadr
import pandas as pd
from pathlib import Path


def get_dataset():
    # load the data from disk if it already exists locally, otherwise download it first
    url = "https://github.com/cran/ISLR/blob/master/data/Weekly.rda?raw=true"
    dst_path = "Weekly.rda"
    if Path(dst_path).exists():
        Weekly = pyreadr.read_r(dst_path)
    else:
        Weekly = pyreadr.read_r(pyreadr.download_file(url, dst_path))
    df_weekly = Weekly['Weekly']
    df_weekly['Direction_text'] = df_weekly['Direction']
    # turn the qualitative Direction column into a 0/1 indicator
    # (keep the 'Down' dummy so the encoding matches the summary table below)
    df_weekly['Direction'] = pd.get_dummies(df_weekly['Direction'])['Down']
    return df_weekly


df_weekly = get_dataset()
print(df_weekly.describe().to_markdown())
```
Running this returns the table below. The Lag variables have identical minimums and maximums because those values fall within the segment of data lagged from the Today column. We can also see that the columns are not normalised, i.e. they do not lie between $0$ and $1$. This could influence behaviour depending on which classifier is used: for example, classifiers such as k-NN, which use distances to compare data points, are sensitive to the scale of the original dataset.
|       |    Year |     Lag1 |     Lag2 |     Lag3 |     Lag4 |     Lag5 |   Volume |    Today | Direction |
|:------|--------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|----------:|
| count | 1089    | 1089     | 1089     | 1089     | 1089     | 1089     | 1089     | 1089     | 1089      |
| mean  | 2000.05 | 0.150585 | 0.151079 | 0.147205 | 0.145818 | 0.139893 | 1.57462  | 0.149899 | 0.444444  |
| std   | 6.03318 | 2.35701  | 2.35725  | 2.3605   | 2.36028  | 2.36128  | 1.68664  | 2.35693  | 0.497132  |
| min   | 1990    | -18.195  | -18.195  | -18.195  | -18.195  | -18.195  | 0.087465 | -18.195  | 0         |
| 25%   | 1995    | -1.154   | -1.154   | -1.158   | -1.158   | -1.166   | 0.332022 | -1.154   | 0         |
| 50%   | 2000    | 0.241    | 0.241    | 0.241    | 0.238    | 0.234    | 1.00268  | 0.241    | 0         |
| 75%   | 2005    | 1.405    | 1.409    | 1.409    | 1.409    | 1.405    | 2.05373  | 1.405    | 1         |
| max   | 2010    | 12.026   | 12.026   | 12.026   | 12.026   | 12.026   | 9.32821  | 12.026   | 1         |
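The question doesn’t ask for it, but if I wanted to check the effect of feature scale on the distance-based classifiers, a minimal sketch using scikit-learn’s StandardScaler (applied only to the predictor columns, reusing the df_weekly loaded above) might look like this:

```python
from sklearn.preprocessing import StandardScaler

# standardise the predictors to zero mean and unit variance;
# distance-based classifiers such as k-NN are sensitive to feature scale
predictor_cols = ['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']
df_scaled = df_weekly.copy()
df_scaled[predictor_cols] = StandardScaler().fit_transform(df_weekly[predictor_cols])
print(df_scaled[predictor_cols].describe().round(3))
```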
Exploratory analysis
I need to see which variables have a statistically significant relationship with the response variable, Direction. For this, the question leads me to use logistic regression. The relevant statsmodels example I cannibalised can be found here.
Here’s the snippet I used to run a logistic regression model on the Weekly dataset. As you can see, I had to chain a few methods to get the output of the model into markdown format without much copying and pasting. The null hypothesis, $H_0$, is that there is no relationship between Lag1, Lag2, Lag3, Lag4, Lag5, Volume and the Direction of the stock price.
```python
import statsmodels.api as sm

X = df_weekly[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']]
y = df_weekly['Direction']

# building the model and fitting the data
log_reg = sm.Logit(y, X).fit()
log_reg_summary = log_reg.summary()
results_as_html = log_reg_summary.tables[1].as_html()
print(pd.read_html(results_as_html, header=0, index_col=0)[0].to_markdown())
```
This table is the output of the logistic regression performed using six variables (the five lag variables and Volume) to predict the Direction of the stock price. Only two variables show promise: Lag2 and Volume. As their p-values are the only ones below $0.05$, they are realistically the only ones worth considering as input variables to the classifiers [1].
|        |    coef | std err |      z | P>\|z\| | [0.025 | 0.975] |
|:-------|--------:|--------:|-------:|--------:|-------:|-------:|
| Lag1   |  0.0327 |   0.026 |  1.250 |   0.211 | -0.019 |  0.084 |
| Lag2   | -0.0682 |   0.027 | -2.556 |   0.011 | -0.12  | -0.016 |
| Lag3   |  0.0081 |   0.026 |  0.306 |   0.759 | -0.044 |  0.060 |
| Lag4   |  0.0194 |   0.026 |  0.740 |   0.459 | -0.032 |  0.071 |
| Lag5   |  0.0069 |   0.026 |  0.261 |   0.794 | -0.045 |  0.058 |
| Volume | -0.0569 |   0.027 | -2.125 |   0.034 | -0.109 | -0.004 |
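As a small aside (not part of the original question), the same filtering can be done programmatically, since the fitted statsmodels results object exposes the p-values directly:

```python
# keep only the predictors whose p-values fall below the 5% cutoff
significant = log_reg.pvalues[log_reg.pvalues < 0.05]
print(significant.to_markdown())
```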
Applying classifiers
Here is the fun part. The question asks how four different classifiers perform on the dataset. These classifiers are:
- Logistic regression
- Linear discriminant analysis
- Quadratic discriminant analysis
- k Nearest neighbours
I’m aware of the scikit-learn Pipeline functionality, but I thought this stage might be a fun opportunity to make my own budget pipeline and explore different core Python libraries.
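For comparison, the scikit-learn equivalent for a single preprocessor/classifier combination would be something like the sketch below (the scaling step is illustrative and not part of the question; `X_train` and friends are placeholders):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# a scikit-learn Pipeline chains preprocessing and a classifier into one estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression(random_state=0)),
])
# pipe.fit(X_train, y_train) and pipe.score(X_test, y_test) would then
# train and evaluate the whole chain in one go
```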
I started by defining classes to describe the preprocessing stage of the pipeline and the classifier using the dataclasses module. The `Preprocess` class holds a name and a function which performs a set of transforms on the dataset to create the test and training datasets. The `Classifier` class holds a name, a scikit-learn classifier and, optionally, the classifier’s score and its associated `Preprocess` instance. I’ve added type hints as these are a massive help when understanding code.
The `run_classifiers` function takes a list of $n$ `Preprocess` and $m$ `Classifier` instances and uses `itertools.product` to create an $n \times m$ list of pipelines to be run on the Weekly dataset. `itertools.product` is similar to `np.meshgrid` but for Python objects rather than `np.array`s.
I could have just filtered for the maximum score on the dataframe returned from `run_classifiers()`, but it was also interesting to use `attrgetter` from the `operator` module to compare each `Classifier` instance based on an attribute - in this case the score of the classifier on the dataset.
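A toy example of those two helpers, with made-up objects, just to show how they behave:

```python
import itertools
from dataclasses import dataclass
from operator import attrgetter

@dataclass
class Item:
    name: str
    score: float

# product() yields every pairing of the two lists, like a meshgrid for objects
pairs = list(itertools.product(["a", "b"], [1, 2, 3]))
print(pairs)  # [('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)]

# attrgetter lets max() compare objects by one of their attributes
items = [Item("first", 0.5), Item("second", 0.9)]
print(max(items, key=attrgetter("score")).name)  # second
```

With those two helpers in mind, here is the full budget pipeline: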
```python
# create budget pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.base import BaseEstimator
from typing import List, Callable, Tuple, Optional
from dataclasses import dataclass
from operator import attrgetter
import itertools

pd.options.display.float_format = '${:,.5f}'.format


@dataclass
class Preprocess:
    """
    Class to hold information on the preprocessing used on a dataset.
    """
    name: str
    transform: Callable


@dataclass
class Classifier:
    """
    Class to hold information on a Classifier.
    """
    name: str
    model: BaseEstimator
    preprocess: Optional[Preprocess] = None
    score: float = 0.0


def lag2_preprocess(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """This is the default test/training split required by the problem."""
    idx_train = df['Year'] < 2009.0
    idx_test = df['Year'] > 2008.0
    X_train = df['Lag2'][idx_train].values[:, None]  # need to pass a 2d array to .fit(X, y)
    y_train = df['Direction'][idx_train]
    X_test = df['Lag2'][idx_test].values[:, None]
    y_test = df['Direction'][idx_test]
    return X_train, y_train, X_test, y_test


def lag2_volume_preprocess(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """This is the default test/training split required by the problem."""
    idx_train = df['Year'] < 2009.0
    idx_test = df['Year'] > 2008.0
    X_train = df[['Lag2', 'Volume']][idx_train]
    y_train = df['Direction'][idx_train]
    X_test = df[['Lag2', 'Volume']][idx_test]
    y_test = df['Direction'][idx_test]
    return X_train, y_train, X_test, y_test


def apply_classifier(model: Classifier, dataset: pd.DataFrame) -> Classifier:
    """
    Apply classifier to dataset given a function to create training and test datasets.
    """
    X_train, y_train, X_test, y_test = model.preprocess.transform(dataset)
    model.model = model.model.fit(X_train, y_train)
    y_predict = model.model.predict(X_test)
    model.score = accuracy_score(y_predict, y_test)
    return model


def run_classifiers():
    # define preprocessing functions
    preprocessors = [
        Preprocess("Lag 2", lag2_preprocess),
        Preprocess("Lag 2 and Volume", lag2_volume_preprocess),
    ]
    # define classifier models
    classifiers = [
        Classifier("Logistic Regression", LogisticRegression(random_state=0)),
        Classifier("Linear Discriminant Analysis", LinearDiscriminantAnalysis()),
        Classifier("Quadratic Discriminant Analysis", QuadraticDiscriminantAnalysis()),
        Classifier("kNN, k=1", KNeighborsClassifier(n_neighbors=1)),
        Classifier("kNN, k=2", KNeighborsClassifier(n_neighbors=2)),
    ]
    # associate each preprocessor function with each classifier for analysis
    prepped_classifiers = []
    for preprocessor, classifier in itertools.product(preprocessors, classifiers):
        prepped_classifier = Classifier(classifier.name, classifier.model, preprocessor)
        prepped_classifiers.append(prepped_classifier)
    # load data
    df_weekly = get_dataset()
    # train classifiers on datasets
    trained_classifiers = [apply_classifier(classifier, df_weekly) for classifier in prepped_classifiers]
    # create results dataframe
    df = pd.DataFrame({"Classifier name": [classifier.name for classifier in trained_classifiers],
                       "Preprocess name": [classifier.preprocess.name for classifier in trained_classifiers],
                       "Score": [classifier.score for classifier in trained_classifiers]})
    with pd.option_context('display.float_format', '{:,.5f}'.format):
        print(df.to_html())
    # get the classifier with the maximum score
    # (returns the first classifier with the highest score in the event of a tie)
    best_classifier = max(trained_classifiers, key=attrgetter('score'))
    print("The best classifier is {0} using the {1} preprocessor with a score of {2:.4f}".format(
        best_classifier.name, best_classifier.preprocess.name, best_classifier.score))


run_classifiers()
```
This returns:
```
The best classifier is Logistic Regression using the Lag 2 preprocessor with a score of 0.6250
```
Logistic regression and linear discriminant analysis appear to perform best across both preprocessing methods, and they produce the same results. This is because the only difference between the two lies in the procedure used to fit their coefficients: logistic regression uses the maximum likelihood method, while the LDA coefficients are calculated from an estimated normal distribution. If the data are approximately Gaussian, both methods behave similarly; if the data are truly Gaussian, LDA tends to outperform logistic regression, and if they are not, the reverse is true.
|    | Classifier name                 | Preprocess name  |   Score |
|---:|:--------------------------------|:-----------------|--------:|
|  0 | Logistic Regression             | Lag 2            | 0.62500 |
|  1 | Linear Discriminant Analysis    | Lag 2            | 0.62500 |
|  2 | Quadratic Discriminant Analysis | Lag 2            | 0.58654 |
|  3 | kNN, k=1                        | Lag 2            | 0.50000 |
|  4 | kNN, k=2                        | Lag 2            | 0.58654 |
|  5 | Logistic Regression             | Lag 2 and Volume | 0.53846 |
|  6 | Linear Discriminant Analysis    | Lag 2 and Volume | 0.53846 |
|  7 | Quadratic Discriminant Analysis | Lag 2 and Volume | 0.47115 |
|  8 | kNN, k=1                        | Lag 2 and Volume | 0.55769 |
|  9 | kNN, k=2                        | Lag 2 and Volume | 0.61538 |
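As a quick sanity check on the claim that logistic regression and LDA give the same results here, a small sketch (reusing `lag2_preprocess` and `df_weekly` from above) can compare the two sets of test-set predictions directly:

```python
import numpy as np

# refit both classifiers on the Lag2-only split and compare their predictions
X_train, y_train, X_test, y_test = lag2_preprocess(df_weekly)
lr_pred = LogisticRegression(random_state=0).fit(X_train, y_train).predict(X_test)
lda_pred = LinearDiscriminantAnalysis().fit(X_train, y_train).predict(X_test)
print(np.mean(lr_pred == lda_pred))  # fraction of test weeks where the two agree
```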
Review
The purpose of this post was to learn about different features of Python by creating a machine learning pipeline which solves a binary classification problem. I applied four classifiers to a dataset of weekly stock price and volume observations to predict whether the stock price went up or down. Along the way I reinvented the scikit-learn pipeline functionality, but I also had fun learning about different core Python libraries. If I were to continue working on this problem I’d probably look at applying different scaling methods to the data.
[1] Typical p-value cutoffs are $1\%$ or $5\%$. Choosing which cutoff to use, and even whether p-values should be used in this way, is an area of academic debate. I enjoyed this tweet on the subject.