Classifying data with Python

Thomas Dickson

19 minute read



I’m working through some problems in the book An Introduction to Statistical Learning and I thought that one of my solutions might make an interesting post. I wanted to use a range of different Python features to build a simple machine learning pipeline from first principles.

The first part of this post specifies the problem, a binary classification problem, and introduces the fundamental differences between the classifiers considered. Subsequent sections introduce the dataset and perform exploratory data analysis to identify significant variables. I finish by introducing a simple machine learning pipeline, built using standard Python libraries, that applies combinations of preprocessing functions and classifiers to the dataset.

The problem

I’m tackling Q10 from Chapter 4 of ISL that can be summarised as follows:

Compare the performance of logistic regression, linear discriminant analysis, quadratic discriminant analysis and k nearest neighbours on predicting whether a stock price will go up or down in the Weekly dataset.

The Weekly dataset records $1089$ weekly stock market returns over $21$ years and can be downloaded from this location.

Maths

The Weekly dataset is an example of a binary classification problem: the response $Y$ is binary given observations $X$. We want to model the probability that $Y$ belongs to a particular category. The equation below gives the probability that an observation $X$ belongs to category $1$.

\[p(X) = Pr(Y = 1 | X)\]

Let's describe the key differences between each classification method. Logistic regression models $p(X)$ using the logistic function:

\[p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\]

The coefficients $\beta_0$ and $\beta_1$ are fitted to the data using the maximum likelihood method (helpful explanation video here). Maximum likelihood estimates values of $\beta_0$ and $\beta_1$ such that the predicted probability of class $1$ for each observation in the dataset is as close as possible to that observation's actual class.
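
As a quick illustration, here is a minimal sketch of the logistic function with made-up coefficient values ($\beta_0 = -0.5$ and $\beta_1 = 0.8$ below are purely illustrative, not fitted to the Weekly data):

import numpy as np

def logistic(x, beta_0=-0.5, beta_1=0.8):
    """Evaluate p(X) = e^(b0 + b1 x) / (1 + e^(b0 + b1 x))."""
    z = beta_0 + beta_1 * x
    return np.exp(z) / (1 + np.exp(z))

# the output always lies between 0 and 1, so it can be read as a probability
print(logistic(np.array([-10.0, 0.0, 10.0])))  # roughly [0.000, 0.378, 0.999]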

Linear discriminant analysis (LDA) models the distribution of the predictors $X$ given $Y$ and uses Bayes' theorem to estimate $Pr(Y=k|X=x)$. LDA is more stable than logistic regression when the classes are well separated, or when $n$ is small and the distribution of the predictors $X$ is approximately normal. LDA is also popular when there are more than two response classes.

The key assumption of LDA is that observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all classes. Quadratic discriminant analysis (QDA) instead assumes that each class has its own covariance matrix.
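
To make the distinction concrete, here is a minimal NumPy sketch, on made-up two-class data, of the two covariance estimates: LDA pools a single covariance matrix across classes, while QDA estimates one per class.

import numpy as np

rng = np.random.default_rng(0)
# made-up two-class data: 50 observations per class, 2 predictors
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=2.0, size=(50, 2))

# QDA-style: a separate covariance matrix for each class
cov_0 = np.cov(X0, rowvar=False)
cov_1 = np.cov(X1, rowvar=False)

# LDA-style: a single covariance matrix pooled across both classes
cov_pooled = ((len(X0) - 1) * cov_0 + (len(X1) - 1) * cov_1) / (len(X0) + len(X1) - 2)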

The k nearest neighbours (k-NN) algorithm estimates the conditional distribution of $Y$ given $X$ and then classifies an observation to the class with the highest estimated probability. Given a positive integer $K$ and a test observation $x_0$, k-NN first identifies the $K$ points in the training data closest to $x_0$, represented by $N_0$. The classifier then estimates the conditional probability for class $j$ as the fraction of points in $N_0$ whose response values equal $j$:

\[Pr (Y = j|X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)\]
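
This estimate is simple enough to write out directly. Here is a minimal sketch on made-up one-dimensional data (not the Weekly dataset):

import numpy as np

def knn_class_probabilities(x_0, X_train, y_train, k=3):
    """Estimate Pr(Y = j | X = x_0) as the fraction of the k nearest neighbours in class j."""
    distances = np.abs(X_train - x_0)          # one predictor, so distance is just |x - x_0|
    neighbour_idx = np.argsort(distances)[:k]  # N_0: indices of the k closest training points
    neighbour_classes = y_train[neighbour_idx]
    return {j: np.mean(neighbour_classes == j) for j in np.unique(y_train)}

X_train = np.array([0.1, 0.4, 0.5, 0.9, 1.2])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_class_probabilities(0.45, X_train, y_train, k=3))  # approximately {0: 0.667, 1: 0.333}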

The dataset

Let’s look at the Weekly dataset in more detail. The Direction column is the response variable and records whether the stock market went up or down. The question directs us to use the Volume column and the five lag variables as the predictors. I used this snippet to load the data:

import pyreadr
import pandas as pd
from pathlib import Path

def get_dataset():

    # download the .rda file if it doesn't already exist locally
    url = "https://github.com/cran/ISLR/blob/master/data/Weekly.rda?raw=true"
    dst_path = "Weekly.rda"
    if Path(dst_path).exists():
        Weekly = pyreadr.read_r(dst_path)
    else:
        Weekly = pyreadr.read_r(pyreadr.download_file(url, dst_path))
    df_weekly = Weekly['Weekly']
    df_weekly['Direction_text'] = df_weekly['Direction']
    # encode the qualitative Direction labels numerically: 1 for 'Down', 0 for 'Up'
    df_weekly['Direction'] = pd.get_dummies(df_weekly['Direction'])['Down']
    return df_weekly

df_weekly = get_dataset()
print(df_weekly.describe().to_markdown())

Running this returns the table below. The Lag variables have identical minimums and maximums because those values all come from the lagged segment of the Today column. We can also see that the columns are not normalised, i.e. they do not lie between $0$ and $1$. This could influence behaviour depending on the classifier used: classifiers such as k-NN, which use distances to compare data points, are sensitive to transformations of the original dataset (see the scaling sketch after the table).

|       | Year    | Lag1     | Lag2     | Lag3     | Lag4     | Lag5     | Volume   | Today    | Direction |
|-------|---------|----------|----------|----------|----------|----------|----------|----------|-----------|
| count | 1089    | 1089     | 1089     | 1089     | 1089     | 1089     | 1089     | 1089     | 1089      |
| mean  | 2000.05 | 0.150585 | 0.151079 | 0.147205 | 0.145818 | 0.139893 | 1.57462  | 0.149899 | 0.444444  |
| std   | 6.03318 | 2.35701  | 2.35725  | 2.3605   | 2.36028  | 2.36128  | 1.68664  | 2.35693  | 0.497132  |
| min   | 1990    | -18.195  | -18.195  | -18.195  | -18.195  | -18.195  | 0.087465 | -18.195  | 0         |
| 25%   | 1995    | -1.154   | -1.154   | -1.158   | -1.158   | -1.166   | 0.332022 | -1.154   | 0         |
| 50%   | 2000    | 0.241    | 0.241    | 0.241    | 0.238    | 0.234    | 1.00268  | 0.241    | 0         |
| 75%   | 2005    | 1.405    | 1.409    | 1.409    | 1.409    | 1.405    | 2.05373  | 1.405    | 1         |
| max   | 2010    | 12.026   | 12.026   | 12.026   | 12.026   | 12.026   | 9.32821  | 12.026   | 1         |
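
As an aside, here is a minimal sketch of one such transformation, min-max scaling with scikit-learn's MinMaxScaler. It isn't applied anywhere in the pipeline below; it just shows the kind of preprocessing a distance-based classifier like k-NN is sensitive to.

from sklearn.preprocessing import MinMaxScaler

predictor_cols = ['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']
scaler = MinMaxScaler()

# rescale each predictor column to lie between 0 and 1
df_scaled = df_weekly.copy()
df_scaled[predictor_cols] = scaler.fit_transform(df_weekly[predictor_cols])
print(df_scaled[predictor_cols].describe().loc[['min', 'max']].to_markdown())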

Exploratory analysis

I need to see which variables have a statistically significant relationship with the response variable, Direction. For this the question leads me to use logistic regression. The relevant statsmodels example I cannibalised can be found here.

Here’s the snippet I used to run a logistic regression model on the Weekly dataset. As you can see, I had to chain a few methods to get the output of the model into markdown format without much copying and pasting. The null hypothesis, $H_0$, is that there is no relationship between Lag1, Lag2, Lag3, Lag4, Lag5, Volume and the Direction of the stock price.

import statsmodels.api as sm

X = df_weekly[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']]
y = df_weekly['Direction']

# building the model and fitting the data
log_reg = sm.Logit(y, X).fit()
log_reg_summary = log_reg.summary()
results_as_html = log_reg_summary.tables[1].as_html()
print(pd.read_html(results_as_html, header=0, index_col=0)[0].to_markdown())
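
One thing to note: sm.Logit does not add an intercept automatically, so the model above is fit without a constant term and the p-values reported below reflect that choice. If you wanted an intercept you would wrap the predictors with sm.add_constant, along the lines of this sketch (not run here):

# optional variant: include an intercept by adding a constant column to the predictors
X_with_const = sm.add_constant(X)
log_reg_with_const = sm.Logit(y, X_with_const).fit()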

This table is the output of the logistic regression performed using the six predictors above to predict the Direction of a stock price. Only two variables show promise, Lag2 and Volume. As the p-values for Lag2 and Volume are the only ones below $0.05$, they are realistically the only ones worth considering as input variables to the classifiers 1.

|        | coef    | std err | z      | $P > \lvert z \rvert$ | [0.025 | 0.975] |
|--------|---------|---------|--------|-----------------------|--------|--------|
| Lag1   | 0.0327  | 0.026   | 1.250  | 0.211                 | -0.019 | 0.084  |
| Lag2   | -0.0682 | 0.027   | -2.556 | 0.011                 | -0.12  | -0.016 |
| Lag3   | 0.0081  | 0.026   | 0.306  | 0.759                 | -0.044 | 0.060  |
| Lag4   | 0.0194  | 0.026   | 0.740  | 0.459                 | -0.032 | 0.071  |
| Lag5   | 0.0069  | 0.026   | 0.261  | 0.794                 | -0.045 | 0.058  |
| Volume | -0.0569 | 0.027   | -2.125 | 0.034                 | -0.109 | -0.004 |

Applying classifiers

Here is the fun part. The question asks how four different classifiers perform on the dataset. These classifiers are:

  1. Logistic regression
  2. Linear discriminant analysis
  3. Quadratic discriminant analysis
  4. k Nearest neighbours

I’m aware of the scikit-learn Pipeline functionality but I thought this stage might be a fun opportunity to make my own budget pipeline to explore the functionality of different core Python libraries.

I started by defining classes, using the dataclasses module, to describe the Preprocess stage of the pipeline and the Classifier. The Preprocess class holds a name and a function which performs a set of transforms on the dataset to create the training and test datasets. The Classifier holds a name, a scikit-learn classifier and, optionally, the classifier score and a Preprocess instance. I’ve added optional type hints as these are a massive help when understanding code.

The run_classifiers function combines a list of $n$ Preprocess and $m$ Classifier instances using itertools.product to create an $n \times m$ list of pipelines to run on the Weekly dataset. itertools.product is similar to np.meshgrid but for Python objects rather than np.arrays.
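
For example, two preprocessors and three classifiers would yield $2 \times 3 = 6$ pipelines. A small sketch with plain strings standing in for the objects:

import itertools

preprocessors = ["Lag 2", "Lag 2 and Volume"]
classifiers = ["LR", "LDA", "kNN"]

# every (preprocessor, classifier) combination: 2 x 3 = 6 pairs
for preprocessor, classifier in itertools.product(preprocessors, classifiers):
    print(preprocessor, classifier)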

I could have just filtered for the maximum score on the dataframe returned from run_classifiers(), but it was also interesting to use attrgetter from the operator module to compare each Classifier instance based on an attribute - in this case its score on the dataset.
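
Before the full pipeline, here is a minimal standalone sketch of that pattern, using attrgetter as the key for max over some hypothetical scored objects:

from dataclasses import dataclass
from operator import attrgetter

@dataclass
class Scored:
    name: str
    score: float

results = [Scored("a", 0.55), Scored("b", 0.62), Scored("c", 0.60)]
print(max(results, key=attrgetter('score')).name)  # prints "b", the highest score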

# create budget pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.base import BaseEstimator
from typing import List, Callable, Tuple, Optional
from dataclasses import dataclass
from operator import attrgetter
import itertools

pd.options.display.float_format = '{:,.5f}'.format

@dataclass
class Preprocess:
    """
    Class to hold information on the preprocessing used on a dataset.

    """
    name: str
    transform: Callable


@dataclass
class Classifier:
    """
    Class to hold information on a Classifier.

    """
    name: str
    model: BaseEstimator
    preprocess: Optional[Preprocess] = None
    score: float = 0.0


def lag2_preprocess(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Default train/test split required by the problem, using Lag2 as the only predictor."""
    idx_train = df['Year'] < 2009.0
    idx_test = df['Year'] > 2008.0
    X_train = df['Lag2'][idx_train].values[:, None]  # .fit(X, y) needs a 2d array
    y_train = df['Direction'][idx_train]
    X_test = df['Lag2'][idx_test].values[:, None]
    y_test = df['Direction'][idx_test]
    return X_train, y_train, X_test, y_test


def lag2_volume_preprocess(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Default train/test split required by the problem, using Lag2 and Volume as predictors."""
    idx_train = df['Year'] < 2009.0
    idx_test = df['Year'] > 2008.0
    X_train = df[['Lag2', 'Volume']][idx_train]
    y_train = df['Direction'][idx_train]
    X_test = df[['Lag2', 'Volume']][idx_test]
    y_test = df['Direction'][idx_test]
    return X_train, y_train, X_test, y_test


def apply_classifier(model: Classifier, dataset: pd.DataFrame) -> Classifier:
    """
    Apply classifier to dataset given function to create training and test datasets.

    """
    X_train, y_train, X_test, y_test = model.preprocess.transform(dataset)
    model.model = model.model.fit(X_train, y_train)
    y_predict = model.model.predict(X_test)
    model.score = accuracy_score(y_predict, y_test)
    return model


def run_classifiers():
    # define preprocessing functions
    preprocessors = [
        Preprocess("Lag 2", lag2_preprocess),
        Preprocess("Lag 2 and Volume", lag2_volume_preprocess),
    ]

    # define classifier models
    classifiers = [
        Classifier("Logistic Regression", LogisticRegression(random_state=0)),
        Classifier("Linear Discriminant Analysis", LinearDiscriminantAnalysis()),
        Classifier("Quadratic Discriminant Analysis", QuadraticDiscriminantAnalysis()),
        Classifier("kNN, k=1", KNeighborsClassifier(n_neighbors=1)),
        Classifier("kNN, k=2", KNeighborsClassifier(n_neighbors=2))
    ]

    # associate preprocessor function with classifiers for analysis
    prepped_classifiers = []
    for preprocessor, classifier in itertools.product(preprocessors, classifiers):
        prepped_classifier = Classifier(classifier.name, classifier.model, preprocessor)
        prepped_classifiers.append(prepped_classifier)

    # load data
    df_weekly = get_dataset()

    # train classifiers on datasets
    trained_classifiers = [apply_classifier(classifier, df_weekly) for classifier in prepped_classifiers]

    # create results dataframe
    df = pd.DataFrame({"Classifier name": [classifier.name for classifier in trained_classifiers],
                       "Preprocess name": [classifier.preprocess.name for classifier in trained_classifiers],
                       "Score": [classifier.score for classifier in trained_classifiers]})

    with pd.option_context('display.float_format', '{:,.5f}'.format):
        print(df.to_html())

    # get classifiers with maximum scores
    # returns the first classifier with the highest score in the event of a tie
    best_classifier = max(trained_classifiers, key=attrgetter('score'))
    print("The best classifier is {0} using the {1} preprocessor with a score of {2:.4f}".format(best_classifier.name, best_classifier.preprocess.name, best_classifier.score))


run_classifiers()

Returns:

The best classifier is Logistic Regression using the Lag 2 preprocessor with a score of 0.6250

Logistic regression and linear discriminant analysis appear to perform best across both preprocessing methods, and they produce identical results. This is because the main difference between logistic regression and LDA lies in how their coefficients are fitted: logistic regression uses maximum likelihood, while the LDA coefficients are calculated from an estimated normal distribution. If the data is approximately Gaussian the two methods behave similarly; if the data is truly Gaussian then LDA outperforms logistic regression, and if it isn't then the reverse is true.

|   | Classifier name                 | Preprocess name  | Score   |
|---|---------------------------------|------------------|---------|
| 0 | Logistic Regression             | Lag 2            | 0.62500 |
| 1 | Linear Discriminant Analysis    | Lag 2            | 0.62500 |
| 2 | Quadratic Discriminant Analysis | Lag 2            | 0.58654 |
| 3 | kNN, k=1                        | Lag 2            | 0.50000 |
| 4 | kNN, k=2                        | Lag 2            | 0.58654 |
| 5 | Logistic Regression             | Lag 2 and Volume | 0.53846 |
| 6 | Linear Discriminant Analysis    | Lag 2 and Volume | 0.53846 |
| 7 | Quadratic Discriminant Analysis | Lag 2 and Volume | 0.47115 |
| 8 | kNN, k=1                        | Lag 2 and Volume | 0.55769 |
| 9 | kNN, k=2                        | Lag 2 and Volume | 0.61538 |

Review

The purpose of this post was to learn about different features of Python by creating a machine learning pipeline which solves a binary classification problem. I applied four classifiers to a dataset of weekly stock market return and volume observations to predict whether the stock price went up or down. Along the way I reinvented scikit-learn's pipeline functionality, but I also had fun learning about different core Python libraries. If I were to continue working on this problem I'd probably look at applying different scaling methods to the predictors.

  1. Typical p-value cutoffs are $1\%$ or $5\%$. Choosing which cutoff to use, and even whether p-values should be used in this way, is an area of academic debate. I enjoyed this tweet on the subject.