Calculating covariance matrices with Python
Thomas Dickson
2 minute read
Covariance matrices, also known as variance-covariance matrices, are used to quantify the relationship between two or more variables. This is another of my posts where I want to solve a problem using more fundamental numpy linear algebra functions rather than `np.cov`. The key source for the notation below is from here.
To create a variance-covariance matrix from $\boldsymbol{X}$, which is an $n \times k$ matrix holding ordered sets of raw data, you start with transforming the raw data into deviation scores in matrix $\boldsymbol{x}$.
\[\boldsymbol{x} = \boldsymbol{X} - \boldsymbol{1} \boldsymbol{1'} \boldsymbol{X} (1/n)\]

Where:
- $\boldsymbol{1}$ is a $n \times 1$ vector of ones.
- $\boldsymbol{1'}$ is the transpose of $\boldsymbol{1}$.
- $\boldsymbol{x}$ is an $n \times k$ matrix of deviation scores: $x_{11}, x_{12}, \ldots, x_{nk}$.
- $\boldsymbol{X}$ is an $n \times k$ matrix of raw data.
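Before moving on, here is a minimal sketch of my own (not from the linked source) showing what the formula does: $\boldsymbol{1} \boldsymbol{1'} \boldsymbol{X} (1/n)$ is an $n \times k$ matrix whose every row holds the column means of $\boldsymbol{X}$, so the subtraction centres each column on its mean.

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 6.0]])  # n = 2 rows, k = 2 columns
n = X.shape[0]
ones = np.ones((n, 1))                  # the n x 1 vector of ones

# 1 1' X (1/n) places the column means of X in every row
mean_rows = ones @ ones.T @ X / n
x = X - mean_rows                       # deviation scores

# Equivalent to centring each column on its mean
assert np.allclose(x, X - X.mean(axis=0))
```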
We then need to calculate $\boldsymbol{x'x}$, which is the $k \times k$ deviation sums of squares and cross products matrix for $\boldsymbol{x}$. Then we divide each term in the deviation sums of squares and cross products matrix by $n$ to create the variance-covariance matrix $\boldsymbol{V}$.
\[\boldsymbol{V}=\boldsymbol{x'x}(1/n)\]

Where:
- $\boldsymbol{V}$ is a $k \times k$ variance-covariance matrix.
- $\boldsymbol{x'}$ is the transpose of $\boldsymbol{x}$.
- $\boldsymbol{x'x}$ is the deviation sums of squares and cross products matrix.
- $n$ is the number of measurements/rows/scores in each column of the data matrix $\boldsymbol{X}$.
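To make this concrete, take the small data matrix used in the code below. Its column means are $2$ and $4$, so the deviation scores and the resulting variance-covariance matrix work out to:

\[\boldsymbol{X} = \begin{bmatrix} 1 & 2 \\ 3 & 6 \end{bmatrix}, \qquad \boldsymbol{x} = \begin{bmatrix} -1 & -2 \\ 1 & 2 \end{bmatrix}, \qquad \boldsymbol{V} = \boldsymbol{x'x}(1/n) = \frac{1}{2} \begin{bmatrix} 2 & 4 \\ 4 & 8 \end{bmatrix} = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}\]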
Now that we understand how to use linear algebra to calculate the variance-covariance matrix, let's look at how we can use Python. Numpy has the function `np.cov` that calculates the covariance matrix. Note that the `bias` parameter in `np.cov` normalises by $n$ when set to `True`, rather than by the default $n - 1$.
```python
import numpy as np

X = np.array([[1, 2], [3, 6]])
np.cov(X.T, bias=True)  # np.cov treats rows as variables, hence the transpose
```
```
array([[1., 2.],
       [2., 4.]])
```
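For comparison (a quick check of my own), dropping `bias=True` uses the default $n - 1$ normalisation, which with $n = 2$ divides by one instead of two:

```python
np.cov(X.T)  # default normalisation divides by n - 1
```

```
array([[2., 4.],
       [4., 8.]])
```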
The convenience of this function is not enough for me; I want to check that I can get as close to the linear algebra as possible.
```python
n = X.shape[0]                           # n: number of rows/observations
x = X - np.dot(np.ones((n, n)), X) / n   # deviation scores: x = X - 1 1' X (1/n)
var_co_var_x = np.dot(x.T, x) / n        # V = x'x (1/n)
var_co_var_x
```
```
array([[1., 2.],
       [2., 4.]])
```
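As a final check of my own, the hand-rolled result matches `np.cov` exactly:

```python
np.allclose(var_co_var_x, np.cov(X.T, bias=True))  # True
```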
Sweet.