Correlation between columns in DataFrame

Question

np.correlate calculates the (unnormalized) cross-correlation between two 1-dimensional sequences:

z[k] = sum_n a[n] * conj(v[n+k])

while df.corr (by default) calculates the Pearson correlation coefficient.

The correlation coefficient (if it exists) is always between -1 and 1 inclusive.
The cross-correlation is not bounded.

The formulas are related, but the cross-correlation formula above neither subtracts the means nor divides by the standard deviations, both of which are part of the formula for the Pearson correlation coefficient.
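
To make the relationship concrete, here is a minimal sketch on two small made-up arrays (the values are illustrative, not taken from the question). It computes the zero-lag cross-correlation with np.correlate, then subtracts the means and divides by the length and the standard deviations, which should reproduce the Pearson coefficient reported by np.corrcoef.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration.
a = np.array([1.0, 2.0, 4.0, 7.0])
b = np.array([2.0, 3.0, 9.0, 12.0])

# With equal-length inputs and the default mode='valid', np.correlate
# returns a single value: the plain sum of products sum_n a[n] * b[n].
raw = np.correlate(a, b)[0]

# Subtract the means, then divide by n and the (population) standard
# deviations to turn the same sum of products into Pearson's coefficient.
a0 = a - a.mean()
b0 = b - b.mean()
pearson = np.correlate(a0, b0)[0] / (len(a) * a.std() * b.std())

print(raw)                       # unbounded sum of products
print(pearson)                   # normalized, lies in [-1, 1]
print(np.corrcoef(a, b)[0, 1])   # should agree with pearson
```

The raw value scales with the magnitude and length of the inputs, while the normalized value does not.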

Because the standard deviations of df['a'] and df['b'] are both zero, the Pearson formula divides by zero, which is what causes df.corr to return NaN everywhere.
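
A small sketch of that failure mode, assuming columns that are constant (the values below are made up and only stand in for the data in the question):

```python
import pandas as pd

# Hypothetical constant columns: each has zero standard deviation.
df = pd.DataFrame({'a': [1.0, 1.0, 1.0, 1.0],
                   'b': [2.0, 2.0, 2.0, 2.0]})

stds = df.std()
print(stds)      # both standard deviations are zero

result = df.corr()
print(result)    # every entry is NaN, since the denominator is zero
```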

From the comment below, it sounds like you are looking for Beta. It is related to Pearson’s correlation coefficient, but instead of dividing by the product of standard deviations:

corr(a, b) = cov(a, b) / (std(a) * std(b))

you divide by a variance:

beta = cov(a, b) / var(a)

You can compute Beta using np.cov:

cov = np.cov(a, b)
beta = cov[1, 0] / cov[0, 0]
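
As a quick check of which entries of the np.cov result are being used, here is a sketch on two small made-up arrays (illustrative values only). np.cov(a, b) returns a 2x2 matrix with the variances on the diagonal and the covariance off the diagonal, using the sample (ddof=1) normalization by default.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration.
a = np.array([1.0, 2.0, 4.0, 7.0])
b = np.array([2.0, 3.0, 9.0, 12.0])

cov = np.cov(a, b)

# cov[0, 0] is var(a), cov[1, 1] is var(b), and
# cov[0, 1] == cov[1, 0] is cov(a, b), all with ddof=1.
var_a = a.var(ddof=1)
cov_ab = ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)

beta = cov[1, 0] / cov[0, 0]
print(beta)
print(cov_ab / var_a)   # same quantity computed by hand
```

Because the same normalization appears in the numerator and the denominator, the choice of ddof cancels out in beta.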

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(100)


def geometric_brownian_motion(T=1, N=100, mu=0.1, sigma=0.01, S0=20):
    """
    http://stackoverflow.com/a/13203189/190597 (unutbu)
    """
    dt = float(T) / N
    t = np.linspace(0, T, N)
    W = np.random.standard_normal(size=N)
    W = np.cumsum(W) * np.sqrt(dt)  # standard brownian motion ###
    X = (mu - 0.5 * sigma ** 2) * t + sigma * W
    S = S0 * np.exp(X)  # geometric brownian motion ###
    return S

N = 10 ** 6
a = geometric_brownian_motion(T=1, mu=0.1, sigma=0.01, N=N)
b = geometric_brownian_motion(T=1, mu=0.2, sigma=0.01, N=N)

cov = np.cov(a, b)
print(cov)
# [[ 0.38234755  0.80525967]
#  [ 0.80525967  1.73517501]]
beta = cov[1, 0] / cov[0, 0]
print(beta)
# 2.10609347015

plt.plot(a)
plt.plot(b)
plt.show()

[Figure: plot of the two simulated series a and b]

The ratio of mus is 2, and beta is ~2.1.

You could also compute it with df.corr, though this is a much more roundabout way of doing it (but it is nice to see that the results are consistent):

import pandas as pd
df = pd.DataFrame({'a': a, 'b': b})
# .ix was removed in pandas 1.0; use positional indexing with .iloc instead
beta2 = (df.corr() * df['b'].std() * df['a'].std() / df['a'].var()).iloc[0, 1]
print(beta2)
# 2.10609347015
assert np.allclose(beta, beta2)
