correlation
Correlation among multiple categorical variables
You can use pd.factorize to encode each column as integer codes and then call corr():

    df.apply(lambda x: pd.factorize(x)[0]).corr(method='pearson', min_periods=1)
    Out[32]:
         a    c    d
    a  1.0  1.0  1.0
    c  1.0  1.0  1.0
    d  1.0  1.0  1.0

Data input:

    df = pd.DataFrame({'a': ['a', 'b', 'c'], 'c': ['a', 'b', 'c'], 'd': ['a', 'b', 'c']})

Update:

    from scipy.stats import chisquare
    df = df.apply(lambda x: pd.factorize(x)[0]) + 1
    pd.DataFrame([chisquare(df[x].values, f_exp=df.values.T, axis=1)[0] for x in df])
    Out[123]:
         0    1    2    3
    0  0.0  0.0  0.0  0.0
    1  0.0  0.0 …
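The factorize approach above can be run end to end as a minimal sketch. It assumes the same three-column toy frame from the excerpt; the variable names `codes` and `corr` are introduced here for illustration only.

```python
import pandas as pd

# Toy data from the excerpt: three identical categorical columns.
df = pd.DataFrame({'a': ['a', 'b', 'c'],
                   'c': ['a', 'b', 'c'],
                   'd': ['a', 'b', 'c']})

# pd.factorize returns (codes, uniques); keep only the integer codes.
codes = df.apply(lambda col: pd.factorize(col)[0])

# Pearson correlation on the integer codes.
corr = codes.corr(method='pearson', min_periods=1)
print(corr)
```

Note that the codes are assigned in order of first appearance, so the resulting Pearson values depend on that ordering.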
numpy corrcoef – compute correlation matrix while ignoring missing data
One of the main features of pandas is being NaN friendly. To calculate the correlation matrix, simply call df_counties.corr(). Below is an example demonstrating that df.corr() is NaN tolerant whereas np.corrcoef is not.

    import pandas as pd
    import numpy as np

    # data
    # ==============================
    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(100, 5), columns=list('ABCDE'))
    df[df < 0] = np.nan
    df …
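The difference described above can be seen in a small sketch. The frame, the seed, and the names `pandas_corr` and `numpy_corr` are assumptions made for this illustration and are not the data from the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows, 3 columns, with NaN in every 10th row of column A.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=list('ABC'))
df.iloc[::10, 0] = np.nan

# pandas drops missing values pair by pair, so every entry is defined.
pandas_corr = df.corr()

# np.corrcoef propagates NaN: every entry involving column A becomes NaN.
numpy_corr = np.corrcoef(df.to_numpy(), rowvar=False)

print(pandas_corr)
print(numpy_corr)
```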
Computing the correlation coefficient between two multi-dimensional arrays
Correlation (default 'valid' case) between two 2D arrays:

You can simply use matrix multiplication with np.dot, like so:

    out = np.dot(arr_one, arr_two.T)

Correlation with the default "valid" case between each pairwise row combination (row1, row2) of the two input arrays corresponds to the multiplication result at each (row1, row2) position.

Row-wise correlation coefficient calculation for two 2D arrays:

    def …
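Both ideas can be sketched as follows. The function in the excerpt is cut off, so `rowwise_corrcoef` below is a hypothetical stand-in written for this illustration, not the original author's code, and the random input arrays are assumptions.

```python
import numpy as np

def rowwise_corrcoef(a, b):
    """Pearson r between every row of `a` and every row of `b` (hypothetical helper)."""
    # Center each row on its own mean.
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    # Numerator: dot products of centered rows; denominator: product of row norms.
    num = a_c @ b_c.T
    den = np.sqrt(np.outer((a_c ** 2).sum(axis=1), (b_c ** 2).sum(axis=1)))
    return num / den

rng = np.random.default_rng(0)
arr_one = rng.standard_normal((3, 8))
arr_two = rng.standard_normal((4, 8))

# "Valid" correlation of equal-length rows is a plain dot product.
out = np.dot(arr_one, arr_two.T)

# Correlation coefficient for every (row1, row2) pair, shape (3, 4).
coef = rowwise_corrcoef(arr_one, arr_two)
```

Entry (i, j) of `out` should match `np.correlate(arr_one[i], arr_two[j])[0]`, and `coef` should match the off-diagonal block of `np.corrcoef(arr_one, arr_two)`.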
pandas columns correlation with statistical significance
To calculate all the p-values at once, you can use the calculate_pvalues function (code below):

    df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 5, 3], 'C': [5, 2, 1], 'D': ['text', 2, 3]})
    calculate_pvalues(df)

The output is similar to that of corr(), but with p-values:

            A       B       C
    A       0  0.7877  0.1789
    B  0.7877       0  0.6088
    C  0.1789  0.6088       0

Details: Column D is automatically ignored as it …
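The body of calculate_pvalues is not included in this excerpt. As a rough sketch of how such a table could be built, the hypothetical helper `pairwise_pvalues` below runs scipy.stats.pearsonr on every pair of numeric columns; the name, the use of select_dtypes to skip non-numeric columns, and the rounding to four decimals are all assumptions.

```python
import pandas as pd
from scipy.stats import pearsonr

def pairwise_pvalues(df):
    """Hypothetical helper: Pearson p-value for every pair of numeric columns."""
    numeric = df.select_dtypes(include='number')  # object columns are skipped
    cols = numeric.columns
    pvalues = pd.DataFrame(index=cols, columns=cols, dtype=float)
    for r in cols:
        for c in cols:
            # Use only the rows where both columns are present.
            mask = numeric[r].notna() & numeric[c].notna()
            _, p = pearsonr(numeric.loc[mask, r], numeric.loc[mask, c])
            pvalues.loc[r, c] = p
    return pvalues.round(4)

df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 5, 3],
                   'C': [5, 2, 1], 'D': ['text', 2, 3]})
pvals = pairwise_pvalues(df)
print(pvals)
```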
How to interpret the values returned by numpy.correlate and numpy.corrcoef?
numpy.correlate simply returns the cross-correlation of two vectors. If you need to understand cross-correlation, start with http://en.wikipedia.org/wiki/Cross-correlation. A good example is the autocorrelation function (a vector cross-correlated with itself):

    import numpy as np

    # create a vector
    vector = np.random.normal(0, 1, size=1000)
    # insert a signal into vector
    vector[::50] += 10
    # perform …
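The excerpt stops before the autocorrelation is computed. One way to finish the idea, as a sketch only: the seeded generator, the choice of mode="full", and the names `autocorr`, `lags`, and `zero_lag` are assumptions and may differ from the original answer.

```python
import numpy as np

# Noise vector with a spike inserted every 50 samples (seed is an assumption).
rng = np.random.default_rng(0)
vector = rng.normal(0, 1, size=1000)
vector[::50] += 10

# Autocorrelation: the vector cross-correlated with itself at every lag.
autocorr = np.correlate(vector, vector, mode="full")

# mode="full" returns 2*N - 1 values covering lags -(N-1) .. (N-1).
lags = np.arange(-len(vector) + 1, len(vector))
zero_lag = len(vector) - 1

print(lags[np.argmax(autocorr)])                      # lag of the largest peak
print(autocorr[zero_lag + 50], autocorr[zero_lag + 25])
```

The largest value sits at lag 0, where the vector overlaps itself exactly, and secondary peaks appear at multiples of 50, the spacing of the inserted signal.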
Correlated features and classification accuracy
Correlated features do not affect classification accuracy per se. The problem in realistic situations is that we have a finite number of training examples with which to train a classifier. For a fixed number of training examples, increasing the number of features typically increases classification accuracy up to a point, but as the number of features …