I have an Excel worksheet with 250 rows by 10 columns of data. My dependent variable is n_nnld_trp, and I am trying to find which independent variables are highly correlated with it so I can use them in a linear regression model.
I want to make a table like this to summarize the correlation data and to identify any cases of multicollinearity using the equation in the picture:
So far I have managed to use a pivot table to get the mean of each row for my dependent variable, n_hhld_trp:
trip_mean = pd.pivot_table(read_excel, index=['n_hhld_trip'],
                           aggfunc=np.mean)
print(trip_mean.head())
I am finding it hard to produce a correlation table like the one shown above. Any help is welcome and appreciated.
NumPy has all the functions needed to compute common quantities like this, so the simplest way to compute r for a DataFrame is:
import numpy as np
r = np.corrcoef(df.values, rowvar=False)  # rowvar=False: each column is a variable
Alternatively, to compute it between individual pairs of variables, you can pass a smaller array to the corrcoef function, or compute it directly:
r = np.cov(df.n_nnld_trp.values, df.other_col.values)[0, 1] / (np.std(df.n_nnld_trp.values, ddof=1) * np.std(df.other_col.values, ddof=1))
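As a quick sanity check on the direct formula, the sketch below compares it with np.corrcoef on synthetic data. The arrays x and y and the helper pearson_r are made up for illustration and are not from the question. Two details matter: np.cov returns a 2x2 matrix, so the off-diagonal element is the covariance, and np.cov defaults to ddof=1 while np.std defaults to ddof=0, so both have to use the same ddof.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r from the sample covariance and sample standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.cov(x, y)[0, 1]  # off-diagonal entry of the 2x2 covariance matrix
    # ddof=1 matches the default normalisation used by np.cov
    return cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Synthetic data standing in for two worksheet columns (hypothetical)
rng = np.random.default_rng(0)
x = rng.normal(size=250)
y = 2.0 * x + rng.normal(size=250)

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

The two printed values should agree to within floating-point rounding.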
After a few hours of digging, I got the output presented the way I wanted. For anyone who wants to do something similar, here is the code:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pearson_correlation = read_excel.corr(method='pearson')
print(pearson_correlation)
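Starting from a correlation matrix like this, one possible way to get the summary the question asks for is to rank the independent variables by their correlation with the dependent variable, and then list pairs of independent variables that are strongly correlated with each other. The sketch below runs on synthetic data. The column names, the target name and the 0.8 cutoff are all assumptions for illustration, not taken from the question's worksheet or its picture.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the worksheet (hypothetical column names)
rng = np.random.default_rng(42)
n = 250
income = rng.normal(size=n)
cars = 0.95 * income + 0.1 * rng.normal(size=n)  # nearly collinear with income
hh_size = rng.normal(size=n)
trips = 1.5 * income + 0.5 * hh_size + rng.normal(size=n)
df = pd.DataFrame({"trips": trips, "income": income,
                   "cars": cars, "hh_size": hh_size})

target = "trips"  # assumed name of the dependent variable
corr = df.corr(method="pearson")

# Correlation of each independent variable with the dependent variable,
# ordered from strongest to weakest by absolute value
with_target = corr[target].drop(target)
order = with_target.abs().sort_values(ascending=False).index
with_target = with_target.loc[order]
print(with_target)

# Pairs of independent variables whose |r| exceeds an assumed cutoff
threshold = 0.8  # assumption, adjust to your own criterion
predictors = corr.drop(index=target, columns=target)
cols = list(predictors.columns)
collinear_pairs = [
    (a, b, predictors.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(predictors.loc[a, b]) > threshold
]
print(collinear_pairs)
```

With real data, df would be the DataFrame loaded from Excel, and any pair reported in collinear_pairs is a candidate for dropping one of the two variables before fitting the regression.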