# Classifying data with Python

#### Thomas Dickson

I’m working through some problems in the book An Introduction to Statistical Learning and I thought that one of my solutions might make an interesting post. I wanted to use a range of different Python features to build a simple machine learning pipeline from first principles.

The first part of this post specifies the problem, a binary classification problem, and introduces the fundamental differences between the classifiers considered. Subsequent sections introduce the dataset and perform exploratory data analysis to identify significant variables. I finish by introducing a simple machine learning pipeline built using standard Python libraries that applies combinations of preprocessing scripts and classifiers to the dataset.

## The problem

I’m tackling Q10 from Chapter 4 of ISL, which can be summarised as follows:

Compare the performance of logistic regression, linear discriminant analysis, quadratic discriminant analysis and k nearest neighbours on predicting whether a stock price will go up or down in the Weekly dataset.

The Weekly dataset records $1089$ weekly stock market returns over $21$ years, from 1990 to 2010, and can be downloaded from this location.

## Maths

The Weekly dataset is an example of a binary classification problem where the response $Y$ is binary given observations $X$. We want to model the probability that $Y$ belongs to a particular category. The equation below defines the probability that $Y$ belongs to category $1$ given $X$.

$p(X) = Pr(Y = 1 | X)$

Let’s describe the key differences between each classification method. Logistic regression models $p(X)$ using the logistic function:

$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$

The coefficients $\beta_0$ and $\beta_1$ are fitted to data using the Maximum Likelihood method (helpful explanation video here). The Maximum Likelihood method estimates values for $\beta_0$ and $\beta_1$ such that the predicted probability $p(X)$ for each observation in the dataset is as close as possible to the observed response of that observation, i.e. close to $1$ when $Y = 1$ and close to $0$ otherwise.
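To make the maximum likelihood idea concrete, here is a minimal sketch that fits $\beta_0$ and $\beta_1$ by plain gradient ascent on the log-likelihood. The synthetic data, the learning rate and the iteration count are all assumptions chosen for illustration; libraries such as statsmodels use more robust optimisers.

```python
import numpy as np


def logistic(z):
    """Logistic function, equal to e^z / (1 + e^z)."""
    return 1.0 / (1.0 + np.exp(-z))


def log_likelihood(beta0, beta1, x, y):
    """Bernoulli log-likelihood of the 0/1 labels y under p(X)."""
    p = logistic(beta0 + beta1 * x)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


def fit_logistic(x, y, lr=0.1, n_iter=5000):
    """Gradient ascent on the mean log-likelihood (illustrative only)."""
    beta0, beta1 = 0.0, 0.0
    for _ in range(n_iter):
        p = logistic(beta0 + beta1 * x)
        beta0 += lr * np.mean(y - p)        # derivative w.r.t. beta0
        beta1 += lr * np.mean((y - p) * x)  # derivative w.r.t. beta1
    return beta0, beta1


# synthetic one-predictor data with assumed "true" coefficients
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = (rng.random(500) < logistic(-0.5 + 1.5 * x)).astype(float)

beta0_hat, beta1_hat = fit_logistic(x, y)
print(beta0_hat, beta1_hat)
```

The fitted coefficients should land near the values used to generate the data, and the log-likelihood at the fitted values is higher than at the starting point of $(0, 0)$.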

Linear discriminant analysis (LDA) models the distribution of the predictors $X$ given $Y$ and uses Bayes’ theorem to create estimates for $Pr(Y=k|X=x)$. LDA is more stable than logistic regression when the classes are well separated, or when $n$ is small and the distribution of the predictors $X$ is approximately normal in each class. LDA is also popular when there are more than two response classes.

The key assumption of LDA is that observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix which is common to all classes. Quadratic Discriminant Analysis (QDA) instead assumes that each class has its own covariance matrix.
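The difference between the two covariance assumptions can be sketched with NumPy. The two synthetic classes below, and the helper names `class_covariances` and `pooled_covariance`, are assumptions made for illustration and are not part of any library.

```python
import numpy as np

rng = np.random.default_rng(1)
# two synthetic classes with different means and different spreads
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=200)
X1 = rng.multivariate_normal([2.0, 1.0], [[2.0, -0.5], [-0.5, 0.5]], size=300)


def class_covariances(groups):
    """One covariance matrix per class, as assumed by QDA."""
    return [np.cov(g, rowvar=False) for g in groups]


def pooled_covariance(groups):
    """A single covariance matrix shared by all classes, as assumed by LDA."""
    n_total = sum(len(g) for g in groups)
    n_classes = len(groups)
    weighted = sum((len(g) - 1) * np.cov(g, rowvar=False) for g in groups)
    return weighted / (n_total - n_classes)


class_means = [g.mean(axis=0) for g in (X0, X1)]
qda_covs = class_covariances([X0, X1])
lda_cov = pooled_covariance([X0, X1])
print(lda_cov)
```

Both methods estimate a mean vector per class; LDA then averages the class covariances into one matrix, while QDA keeps them separate, which gives it a more flexible (quadratic) decision boundary at the cost of more parameters.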

The k Nearest Neighbour (k-NN) algorithm estimates the conditional distribution of $Y$ given $X$ and then classifies an observation to the class with the highest estimated probability. Given a positive integer $K$ and a test observation $x_0$, k-NN first identifies the $K$ points in the training data which are closest to $x_0$, represented by $N_0$. The classifier then estimates the conditional probability for class $j$ as the fraction of points in $N_0$ whose response values equal $j$.

$Pr (Y = j|X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)$
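The formula above translates almost directly into NumPy. This is a minimal sketch that assumes Euclidean distance; the function name `knn_class_probability` and the toy data are made up for illustration.

```python
import numpy as np


def knn_class_probability(X_train, y_train, x0, k, j):
    """Estimate Pr(Y = j | X = x0) as the fraction of the k nearest
    training points (the set N_0) whose label equals j."""
    distances = np.linalg.norm(X_train - x0, axis=1)  # Euclidean distance
    neighbours = np.argsort(distances)[:k]            # indices forming N_0
    return np.mean(y_train[neighbours] == j)


X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1, 1])
x0 = np.array([0.2, 0.1])

p0 = knn_class_probability(X_train, y_train, x0, k=3, j=0)
p1 = knn_class_probability(X_train, y_train, x0, k=3, j=1)
print(p0, p1)
```

Here the three nearest neighbours of `x0` carry the labels 0, 0 and 1, so the estimates are 2/3 for class 0 and 1/3 for class 1, and the observation would be assigned to class 0.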

## The dataset

Let’s look at the Weekly dataset in more detail. The Direction column is the response variable and records if the stock market went up or down. The question directs us to use the Volume column and the five lag variables as the predictors. I used this snippet to load the data:

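```python
from pathlib import Path

import pandas as pd
import pyreadr  # assumption: the .rda file is read with pyreadr


def get_dataset():
    """Load the Weekly dataset, downloading it first if there is no local copy."""
    url = "https://github.com/cran/ISLR/blob/master/data/Weekly.rda?raw=true"
    dst_path = "Weekly.rda"
    # the bodies of the original if/else were lost; this is one plausible way to fill them
    if not Path(dst_path).exists():
        pyreadr.download_file(url, dst_path)
    Weekly = pyreadr.read_r(dst_path)  # dict-like, keyed by R object name
    df_weekly = Weekly['Weekly']
    df_weekly['Direction_text'] = df_weekly['Direction']
    # turn qualitative into quantitative data: get_dummies yields one 0/1 column
    # per category, so keep a single column (the first, assumed to be 'Down',
    # which is consistent with the Direction mean of 0.444 in the table below)
    df_weekly['Direction'] = pd.get_dummies(df_weekly['Direction']).iloc[:, 0].astype(int)
    return df_weekly


df_weekly = get_dataset()
print(df_weekly.describe().to_markdown())
```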


This returns the table below. The Lag variables turn out to have identical minimums and maximums because each one is a lagged copy of the Today column. We can also see that the columns are not normalised, i.e. their values do not lie between $0$ and $1$. This could influence behaviour depending on which classifier is used. For example, classifiers such as k-NN, which use distances to compare data points, are sensitive to the scale of each predictor and therefore to any rescaling of the original data set.

| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
|---|---|---|---|---|---|---|---|---|---|
| count | 1089 | 1089 | 1089 | 1089 | 1089 | 1089 | 1089 | 1089 | 1089 |
| mean | 2000.05 | 0.150585 | 0.151079 | 0.147205 | 0.145818 | 0.139893 | 1.57462 | 0.149899 | 0.444444 |
| std | 6.03318 | 2.35701 | 2.35725 | 2.3605 | 2.36028 | 2.36128 | 1.68664 | 2.35693 | 0.497132 |
| min | 1990 | -18.195 | -18.195 | -18.195 | -18.195 | -18.195 | 0.087465 | -18.195 | 0 |
| 25% | 1995 | -1.154 | -1.154 | -1.158 | -1.158 | -1.166 | 0.332022 | -1.154 | 0 |
| 50% | 2000 | 0.241 | 0.241 | 0.241 | 0.238 | 0.234 | 1.00268 | 0.241 | 0 |
| 75% | 2005 | 1.405 | 1.409 | 1.409 | 1.409 | 1.405 | 2.05373 | 1.405 | 1 |
| max | 2010 | 12.026 | 12.026 | 12.026 | 12.026 | 12.026 | 9.32821 | 12.026 | 1 |
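To illustrate why scale matters for a distance-based classifier, here is a small sketch in which the nearest neighbour of a point changes once the columns are min-max scaled to $[0, 1]$. The four data points and the helper names are invented for the example, and the scaling helper assumes no column is constant.

```python
import numpy as np


def min_max_scale(X):
    """Rescale each column to [0, 1] (assumes no column is constant)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    return (X - col_min) / col_range


def nearest_neighbour(X, query_index):
    """Index of the row closest (Euclidean) to row `query_index`."""
    distances = np.linalg.norm(X - X[query_index], axis=1)
    distances[query_index] = np.inf  # a point is not its own neighbour
    return int(np.argmin(distances))


# first column spans 0..100, second column spans 0..1
X = np.array([[0.0, 0.0], [100.0, 1.0], [10.0, 0.0], [1.0, 0.9]])

raw_nn = nearest_neighbour(X, 0)
scaled_nn = nearest_neighbour(min_max_scale(X), 0)
print(raw_nn, scaled_nn)
```

On the raw data the wide first column dominates the distance, so row 3 is closest to row 0; after scaling both columns contribute equally and row 2 becomes the nearest neighbour.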

## Exploratory analysis

I need to see which variables have a statistically significant relationship with the response variable, Direction. For this the question leads me to use a logistic regression. The relevant statsmodels example I cannibalised can be found here.

Here’s the snippet I used to run a logistic regression model on the Weekly data set. As you can see, I had to use a few chained methods to get the output of the model in a format that didn’t involve me having to do much copying and pasting. The null hypothesis, $H_0$, is that there is no relationship between Lag1, Lag2, Lag3, Lag4, Lag5, Volume and the Direction of the stock price.

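```python
import statsmodels.api as sm

X = df_weekly[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']]
y = df_weekly['Direction']

# building the model and fitting the data
# (no intercept is fitted here: sm.Logit only includes one if a constant column is added to X)
log_reg = sm.Logit(y, X).fit()
log_reg_summary = log_reg.summary()
# tables[1] is the coefficient table of the summary
results_as_html = log_reg_summary.tables[1].as_html()
```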


This table is the output of the logistic regression performed using six predictors (the five lag variables and Volume) to predict the Direction of a stock price. There are only two variables which show promise, Lag2 and Volume. As the p-values for Lag2 and Volume are the only ones below $0.05$, they are realistically the only ones worth considering as input variables to the classifiers.¹

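| | coef | std err | z | $P > \lvert z \rvert$ | 0.025 | 0.975 |
|---|---|---|---|---|---|---|
| Lag1 | 0.0327 | 0.026 | 1.250 | 0.211 | -0.019 | 0.084 |
| Lag2 | -0.0682 | 0.027 | -2.556 | 0.011 | -0.12 | -0.016 |
| Lag3 | 0.0081 | 0.026 | 0.306 | 0.759 | -0.044 | 0.060 |
| Lag4 | 0.0194 | 0.026 | 0.740 | 0.459 | -0.032 | 0.071 |
| Lag5 | 0.0069 | 0.026 | 0.261 | 0.794 | -0.045 | 0.058 |
| Volume | -0.0569 | 0.027 | -2.125 | 0.034 | -0.109 | -0.004 |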
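The selection rule applied to this table can be written as a short filter. The p-values below are copied by hand from the table, and the $0.05$ threshold is the one used in the text; with a fitted statsmodels result the same values should be available from its `pvalues` attribute rather than typed in.

```python
# p-values copied from the regression table above
p_values = {
    "Lag1": 0.211,
    "Lag2": 0.011,
    "Lag3": 0.759,
    "Lag4": 0.459,
    "Lag5": 0.794,
    "Volume": 0.034,
}
alpha = 0.05  # significance threshold used in the text

# keep only the predictors whose p-value falls below the threshold
significant = [name for name, p in p_values.items() if p < alpha]
print(significant)  # ['Lag2', 'Volume']
```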

## Applying classifiers

Here is the fun part. The question wants to see how four different classifiers perform on the dataset. These classifiers are:

1. Logistic regression
2. Linear discriminant analysis
3. Quadratic discriminant analysis
4. k Nearest neighbours

I’m aware of the scikit-learn Pipeline functionality but I thought this stage might be a fun opportunity to make my own budget pipeline to explore the functionality of different core Python libraries.

I started by defining classes to describe the Preprocessing stage of the pipeline and the Classifier using the dataclasses module. The Preprocessing class holds the name and a function which performs a set of transforms on the dataset to create the test and training datasets. The Classifier holds the name, a scikit-learn classifier and optionally the classifier score and Preprocessing class. I’ve added optional type hints as these are a massive help when understanding code.

The run_classifier function takes a list of $n$ Preprocessing and $m$ Classifier classes and uses itertools.product to create an $n \times m$ list of pipelines to be run on the Weekly dataset. itertools.product is similar to np.meshgrid but for Python objects rather than np.arrays.
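As a small illustration of how `itertools.product` builds every combination, the sketch below pairs two preprocessing steps with four classifiers. The string names are placeholders invented for this example; in the real pipeline the items would be Preprocessing and Classifier instances.

```python
from itertools import product

# placeholder names standing in for Preprocessing and Classifier objects
preprocessors = ["all_predictors", "lag2_only"]
classifiers = ["logistic", "lda", "qda", "knn"]

# every (preprocessor, classifier) pair: n * m combinations in total
pipelines = list(product(preprocessors, classifiers))

for preprocessor, classifier in pipelines:
    print(preprocessor, classifier)
```

With $n = 2$ and $m = 4$ this yields $8$ pairs, ordered so that the last input varies fastest.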

I could have just filtered for the maximum score on the dataframe returned from run_classifiers() but it was also interesting to use attrgetter from the operator module to compare each Classifier class based on an attribute, in this case the score of the Classifier on the dataset.
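The `attrgetter` comparison can be sketched in isolation. `ScoredModel` is a hypothetical stand-in for the Classifier dataclass, and the scores are placeholder numbers, not results from the Weekly dataset.

```python
from dataclasses import dataclass
from operator import attrgetter
from typing import Optional


@dataclass
class ScoredModel:
    """Hypothetical stand-in for a classifier record with an optional score."""
    name: str
    score: Optional[float] = None


# placeholder scores for illustration only
results = [
    ScoredModel("model_a", 0.55),
    ScoredModel("model_b", 0.62),
    ScoredModel("model_c", 0.58),
]

# attrgetter("score") builds a function equivalent to lambda m: m.score
best = max(results, key=attrgetter("score"))
ranked = sorted(results, key=attrgetter("score"), reverse=True)
print(best.name)
```

The same key function works for both `max` and `sorted`, which is handy when the full ranking is wanted as well as the winner. Note that it assumes every score has been filled in; comparing `None` with a float would raise a `TypeError`.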

# create budget pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier