Python: Correlation Coefficient Matrix and Significance Matrix

昵稱QAb6ICvc, published 2022-09-13 from Zhejiang


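This post focuses on using Python to build a correlation matrix together with a matrix of significance p-values. A correlation coefficient matrix on its own is not very convincing, so a significance test is necessary.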

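Correlation analysis is probably one of the most frequently used statistical methods. The appropriate method depends on how the data are distributed: if the data are normally distributed, use the Pearson correlation coefficient; otherwise use Spearman's. Kendall's correlation coefficient is a further option. So before settling on a method, you may need to test the distribution of the data, for example with the Kolmogorov-Smirnov test; see the official SciPy documentation for details. Since that is not today's focus, it is skipped here.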
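As a rough illustration of that pre-check, here is a minimal sketch that runs a Kolmogorov-Smirnov test with `scipy.stats.kstest` and then picks a correlation method. The helper names `looks_normal` and `pick_method` and the 0.05 threshold are my own assumptions for illustration. Standardizing with the sample mean and standard deviation also makes the KS p-value only approximate, so treat the result as a hint rather than a verdict.

```python
import numpy as np
from scipy import stats


def looks_normal(values, alpha=0.05):
    """Return True if a KS test does not reject normality (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    # Standardize so the sample can be compared with the standard normal N(0, 1).
    standardized = (values - values.mean()) / values.std(ddof=1)
    _, p_value = stats.kstest(standardized, "norm")
    return bool(p_value > alpha)


def pick_method(x, y, alpha=0.05):
    """Pick 'pearson' only when both variables look normal, else 'spearman'."""
    if looks_normal(x, alpha) and looks_normal(y, alpha):
        return "pearson"
    return "spearman"


# Deterministic demo data: normal quantiles versus a strongly skewed series.
normal_like = stats.norm.ppf(np.linspace(0.01, 0.99, 200))
skewed = np.exp(np.linspace(0.0, 10.0, 200))

print(pick_method(normal_like, normal_like))
print(pick_method(normal_like, skewed))
```

The returned string can be passed straight to `DataFrame.corr(method=...)`.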


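Here is a commonly seen correlation matrix figure, which could also be called a cross-corr plot. It has no accompanying p-values, so it is possible that two variables have a correlation coefficient of around 0.5 and yet a p-value above 0.05 or 0.01. In general, though, variables with a high correlation coefficient also tend to have a small p-value and so pass the significance test. Probably because of this assumption, many correlation matrix figures come without p-values.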
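To see why a coefficient of 0.5 does not guarantee significance, here is a small sketch based on the standard t-statistic for a correlation coefficient, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The function name `p_value_for_r` is my own. The formula is exact for Pearson's r under normality and only an approximation for Spearman's rho, so read the numbers as indicative.

```python
import math

from scipy import stats


def p_value_for_r(r, n):
    """Two-sided p-value for a correlation r observed on n samples (hypothetical helper)."""
    dof = n - 2
    t_stat = r * math.sqrt(dof / (1.0 - r * r))
    # sf is the survival function 1 - cdf; doubling it gives the two-sided p-value.
    return 2.0 * stats.t.sf(abs(t_stat), dof)


# The same coefficient becomes more significant as the sample size grows.
for n in (10, 30, 100):
    print(n, p_value_for_r(0.5, n))
```

With only 10 samples a coefficient of 0.5 does not reach the 0.05 level, while with 30 samples it passes even the 0.01 level, which is why the p-value depends on the sample size as well as the coefficient.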

To be more rigorous, I first plot the correlation coefficient matrix and then plot the distribution of the p-values.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

plt.rcParams['font.family'] = 'Times New Roman'

# method='spearman'
def correlation_heatmap(train):
    correlations = train.corr(method='spearman')
    cmap = sns.diverging_palette(220, 20, center='light', as_cmap=True)

    fig, ax = plt.subplots(figsize=(20, 20), dpi=300)
    # np.bool was removed from recent NumPy releases; the built-in bool works everywhere
    mask = np.zeros_like(correlations, dtype=bool)
    # Set the upper-right triangle of the mask (column index >= row index) to True so those cells are hidden
    mask[np.triu_indices_from(mask)] = True
    sns.heatmap(correlations, mask=mask, cmap=cmap, fmt='.2f', square=True,
                linewidths=.2, annot=True, vmax=0.99,
                cbar_kws={'shrink': .50, 'extend': 'both', 'pad': 0.02})
    plt.savefig('./spearmancorr.png')
    plt.show()

# correlation_heatmap(X_train[X_train.columns[sorted_idx]])
correlation_heatmap(df)

This produces the correlation coefficient matrix; the figure is rather large.
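If the role of the mask is unclear, the following NumPy-only sketch shows what an upper-triangle mask looks like on a made-up 3x3 matrix. seaborn's `heatmap` does not draw cells where the mask is True, so only the strictly lower triangle stays visible. The matrix values and the `visible` array are invented purely for illustration.

```python
import numpy as np

# A made-up symmetric correlation matrix, for illustration only.
corr = np.array([[1.0, 0.8, 0.1],
                 [0.8, 1.0, -0.3],
                 [0.1, -0.3, 1.0]])

mask = np.zeros_like(corr, dtype=bool)
# True on the diagonal and above: these cells would be hidden in the heatmap.
mask[np.triu_indices_from(mask)] = True

# Replace hidden cells with NaN to show which values would still be drawn.
visible = np.where(mask, np.nan, corr)

print(mask)
print(visible)
```

Since the matrix is symmetric, hiding the upper triangle loses no information and makes a large figure easier to read.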

Next, compute the p-values and plot them.

import numpy as np
from scipy.stats import spearmanr

def corr_sig(df=None):
    # p_matrix[i, j] holds the Spearman p-value for columns i and j; the diagonal is left at 0
    columns = df.columns.to_list()
    p_matrix = np.zeros(shape=(df.shape[1], df.shape[1]))
    for i, col in enumerate(columns):
        for j, col2 in enumerate(columns):
            if i == j:
                continue
            _, p = spearmanr(df[col], df[col2])
            p_matrix[i, j] = p
    return p_matrix

p_values = corr_sig(df)
# Mask (hide) the upper triangle and every cell whose p-value is not below 0.05
mask = np.invert(np.tril(p_values < 0.05))
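As a possible shortcut to the double loop, `scipy.stats.spearmanr` also accepts a whole 2-D array and, when it has three or more columns, returns the coefficient matrix and the p-value matrix in one call. The sketch below compares the two approaches on random data that I generated only for this check; it assumes numeric columns with no missing values. With exactly two columns the function returns scalars instead of matrices.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic data for illustration only: 50 rows, 4 columns, column 1 tied to column 0.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4))
data[:, 1] += data[:, 0]

# One call on the 2-D array gives both 4x4 matrices (columns are the variables).
rho, p_all = spearmanr(data)

# Pairwise loop, as in corr_sig, for comparison.
n_cols = data.shape[1]
p_loop = np.zeros((n_cols, n_cols))
for i in range(n_cols):
    for j in range(n_cols):
        if i != j:
            _, p_loop[i, j] = spearmanr(data[:, i], data[:, j])

# Compare only off-diagonal cells; the loop leaves the diagonal at 0.
off_diag = ~np.eye(n_cols, dtype=bool)
print(np.allclose(p_all[off_diag], p_loop[off_diag]))
```

For a DataFrame `df`, the same call would be `spearmanr(df.to_numpy())`, and the resulting p-value matrix can feed the same `np.tril` mask as above.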