手把手教你用Python構(gòu)建logit、負(fù)二項(xiàng)回歸、決策樹與隨機(jī)森林機(jī)器學(xué)習(xí)模型

計(jì)量經(jīng)濟(jì)圈 2021-06-05

展開全文

本次更新的主要內(nèi)容為利用Python中的statsmodels庫構(gòu)建logit與負(fù)二項(xiàng)回歸模型，以及利用sklearn庫構(gòu)建決策樹以及隨機(jī)森林模型。內(nèi)容源自同濟(jì)大學(xué)研究生課程《高級(jí)數(shù)理統(tǒng)計(jì)》的第一次大作業(yè)，一共有五道題，這里我把題目和我的答案都貼上來，僅供參考，不一定對(duì)，大家可以通過公眾號(hào)直接后臺(tái)私信我，多多互相交流。

1. 請(qǐng)采用計(jì)數(shù)數(shù)據(jù)分析模型（Count Data Model），對(duì)Crash Frequency.xls文件的數(shù)據(jù)進(jìn)行建模分析，并回答以下問題：

（1）所采用計(jì)數(shù)數(shù)據(jù)分析模型的類型及原因；

（2）事故發(fā)生頻率顯著相關(guān)的因素；

（3）模型整體擬合優(yōu)度的評(píng)價(jià)。

其中，Grade_G為分類變量，代表路段坡度；其余變量為連續(xù)變量，分別代表交通周轉(zhuǎn)量（Logvmt）、平均能見度（Visibility）、平均氣溫（Temperature）、小時(shí)降雨量（Precipitation）和平均速度（Speed）。

Answer:

import pandas as pdimport numpy as npimport seaborn as snsimport matplotlib.pyplot as pltimport statsmodels.api as smfrom statsmodels.stats.outliers_influence import variance_inflation_factorimport warningswarnings.simplefilter('ignore') sns.set_style('whitegrid')

首先導(dǎo)入相關(guān)數(shù)據(jù)

df_1 = pd.read_excel('Crash frequency.xls')

描述性統(tǒng)計(jì)

可以看到，0值在Crash_Freq中的分布大于50%，且均值（1.096）遠(yuǎn)小于方差（3.802），可以考慮建立零膨脹模型。

def summary_statistics(df): df_summary = df.describe().T df_summary['std'] = df.var() df_summary.columns = ['count', 'mean', 'var', 'min', '25%', '50%', '75%', 'max'] return df_summary

summary_statistics(df_1)

計(jì)算變量的方差膨脹因子，檢查多重共線性問題

可以看到，所有變量的方差膨脹因子（VIF）均小于5，說明沒有多重共線性問題存在。

def cal_vif(df,col): df = sm.add_constant(df) col.append('const') df_vif = df.loc[:,col]

    return pd.DataFrame({'variable':col,'VIF':[variance_inflation_factor(np.matrix(df_vif),i) for i in range(len(col))]}).iloc[:-1]

cal_vif(df_1,['Grade_G','Logvmt', 'Visibility', 'Temperature',       'Precipitation', 'Speed'])

plt.figure(figsize=(10,6))sns.histplot(x='Crash_Freq',data=df_1,binwidth=0.5,color='blue')plt.xticks(range(0,df_1['Crash_Freq'].max()+1),fontsize=15)plt.yticks(fontsize=15)plt.xlabel('Crash_Freq',fontsize=16)plt.ylabel('Count',fontsize=16)plt.show()

在stata中導(dǎo)入數(shù)據(jù)后輸入下面代碼建立零膨脹負(fù)二項(xiàng)模型

zinb Crash_Freq Grade_G-Speed, inflate(Grade_G Logvmt) vuong

可以看到，Vuong檢驗(yàn)的結(jié)果只在10%水平下顯著（輕微顯著），為了保證模型的簡潔性，我們采用負(fù)二項(xiàng)回歸重新建模。

model_1_nb = sm.NegativeBinomial.from_formula('Crash_Freq ~ Grade_G + Logvmt + Visibility + Temperature + \                                      Precipitation + Speed ',data=df_1).fit(method='ncg',maxiter=1000)

print(model_1_nb.summary2())

模型結(jié)果表明

事故發(fā)生頻率與交通周轉(zhuǎn)量及路段坡度顯著正相關(guān)（即交通周轉(zhuǎn)量越大、坡度越大的路段，事故發(fā)生的頻率越高）。
事故發(fā)生頻率與平均能見度、平均氣溫以及平均速度顯著負(fù)相關(guān) （即平均能見度越低、氣溫越低以及平均速度越低的路段，事故發(fā)生率越高）。

模型擬合優(yōu)度方面

離散系數(shù)alpha在0.1%水平下顯著，表明有必要采用負(fù)二項(xiàng)回歸。
模型的AIC與BIC分別為579.29和607.13，偽R方為0.180。

為了更好地評(píng)價(jià)負(fù)二項(xiàng)回歸模型的擬合優(yōu)度，我們利用相同的變量擬合泊松回歸模型

model_1_poisson = sm.Poisson.from_formula('Crash_Freq ~ Grade_G + Logvmt + Visibility + Temperature + \                                      Precipitation + Speed ',data=df_1).fit(method='ncg',maxiter=1000)

print(model_1_poisson.summary2())

可以發(fā)現(xiàn)泊松回歸的AIC與BIC都明顯大于負(fù)二項(xiàng)回歸的AIC與BIC，再次表明負(fù)二項(xiàng)回歸的擬合優(yōu)度更高。

2. Red light running.xls文件是研究人員對(duì)四個(gè)交叉口開展闖紅燈調(diào)查的記錄數(shù)據(jù)。

其中，Running為二進(jìn)制變量，表明觀測(cè)對(duì)象是否闖紅燈（1-闖紅燈，0-未闖紅燈）；Intersection為分類變量，表明交叉口編號(hào)；Local為二進(jìn)制變量，表明觀測(cè)對(duì)象是否是滬牌車（1-滬牌車，0-外地牌照車）；Passenger為二進(jìn)制變量，表明觀測(cè)對(duì)象是否載有乘客（1-載有乘客，0-未載有乘客）；Male為二進(jìn)制變量，表明駕駛?cè)耸欠駷槟行裕?-男性，0-女性）；Age為分類變量，表明駕駛?cè)说哪挲g分組。請(qǐng)基于此數(shù)據(jù)，采用恰當(dāng)?shù)姆治瞿Ｐ?，回答以下問題：

（1）闖紅燈行為顯著相關(guān)的變量；

（2）4個(gè)交叉口間闖紅燈行為是否有顯著差異；

（3）計(jì)算在交叉口1，一位年齡為45歲的男性滬牌車駕駛員（未載客）其闖紅燈的概率（給出計(jì)算公式）。

Answer:

首先導(dǎo)入相關(guān)數(shù)據(jù)

df_2 = pd.read_excel('Red light running.xls')

生成交叉口和年齡的啞變量

def get_dummy_var(df,col): dummy = pd.get_dummies(df[col]) new_col = [] for i in dummy.columns: new_col.append(col + '_' + str(i)) dummy.columns = new_col return dummy

df_2 = pd.concat([df_2,get_dummy_var(df_2,'Intersection'),get_dummy_var(df_2,'Age')],axis=1)

描述性統(tǒng)計(jì)

可以看到，三號(hào)交叉口和40歲年齡的比例最高，因此在建模時(shí)講其設(shè)置為參照項(xiàng)。

summary_statistics(df_2)

檢查共線性

所有變量VIF均小于5，表明沒有多重共線性存在。

cal_vif(df_2,['Local', 'Male', 'Passenger',       'Intersection_1', 'Intersection_2', 'Intersection_4',       'Age_20', 'Age_25', 'Age_30', 'Age_35', 'Age_38','Age_45',       'Age_50', 'Age_55'])

建立二項(xiàng)logistics模型

model_2_logit = sm.Logit.from_formula('Running ~ Local + Male + Passenger + Intersection_1 + Intersection_2 + Intersection_4 + \

                                                 Age_20 + Age_25 + Age_30 + Age_35 + Age_38 + Age_45 + Age_50 + Age_55',data=df_2).fit(method='ncg',maxiter=1000)

print(model_2_logit.summary2())

通過模型結(jié)果可以發(fā)現(xiàn)，與闖紅燈行為顯著相關(guān)的變量有：

闖紅燈行為與滬牌車、男性顯著正相關(guān)，表明滬牌車與男性駕駛員更容易闖紅燈。
闖紅燈行為與有載客顯著負(fù)相關(guān)，表明有載客的車輛更不容易闖紅燈。
闖紅燈行為與駕駛員年齡顯著負(fù)相關(guān)，具體來講，相比40歲的駕駛員，25和30歲的駕駛員更容易闖紅燈，且25歲的駕駛員闖紅燈的概率更高，而其余年齡的駕駛員在闖紅燈行為方面與40歲的駕駛員沒有顯著區(qū)別。

4個(gè)交叉口間闖紅燈行為是否有顯著差異？回答：是

通過模型結(jié)果我們可以發(fā)現(xiàn)，Intersection_1, Intersection_2, Intersection_4三個(gè)變量均顯著，這表明1，2，4號(hào)交叉口闖紅燈行為與3號(hào)交叉口相比均有顯著差異，進(jìn)一步繪制這三個(gè)變量的Odds Ratio：

from math import e,log

plt.scatter([-0.8803,-1.9788,0,0.4991],[4,3,2,1],color='k',marker='s')plt.hlines(4,-1.2981,-0.4625,color='k',linewidth=1.5)plt.hlines(3,-2.6077,-1.3500,color='k',linewidth=1.5)plt.hlines(1,-0.0146,1.0127 ,color='k',linewidth=1.5)for i in [log(0.1),log(0.5),log(1.5),log(2)]:    plt.vlines(i,0,5,linestyle='--',color='grey',linewidth=0.5)plt.vlines(0,0,5,color='k',linewidth=0.5)plt.ylim(0.5,4.5)plt.grid(False)plt.yticks(range(1,5),['Intersection_4', 'Intersection_3', 'Intersection_2', 'Intersection_1'],fontsize=14)plt.xticks([log(0.1),log(0.5),log(1),log(1.5),log(2)],[0.1,0.5,1,1.5,2],fontsize=13)plt.xlabel('Odds Ratio',fontsize=14)plt.show()

可以發(fā)現(xiàn)，四個(gè)交叉口發(fā)生闖紅燈行為的概率大小為交叉口4>交叉口3>交叉口1>交叉口2。

（3）計(jì)算在交叉口1，一位年齡為45歲的男性滬牌車駕駛員（未載客）其闖紅燈的概率（給出計(jì)算公式）。

3. 我覺得第三題的數(shù)據(jù)描述很不清楚，因此這道題跳過。

4. 請(qǐng)基于Severity.csv文件，針對(duì)事故嚴(yán)重程度（severity變量）構(gòu)建決策樹模型：

（1）給出Split Criterion計(jì)算依據(jù)

（2）給出決策樹結(jié)構(gòu)圖；

（3）對(duì)模型分類規(guī)則進(jìn)行解釋（解讀三條規(guī)則即可）。

變量解釋：#### 事故嚴(yán)重程度（severity），PDO為僅財(cái)產(chǎn)損失事故、INJ為受傷事故；中央隔離帶寬度（Med_Width）；路肩寬度（Inside_Shld）；限速（Speed_Limit）;貨車比例（Truck_Per）；溫度（temperature）；能見度（visibility）；1小時(shí)降雨量（1hourprecip）；運(yùn)行速度方差（speed std）；日均流量對(duì)數(shù)（logaadt）；車道數(shù)（lane）；季節(jié)（snow_season）；路面結(jié)冰（ice）；路面濕滑（slush）；是否陡坡（steep grade）。

Answer:

首先導(dǎo)入相關(guān)數(shù)據(jù)

df_4 = pd.read_csv('severity binary.csv')

### 變量編碼df_4['severity'] = df_4['severity'].map({'PDO':0,'INJ':1})df_4['lane'] = df_4['lane'].map({'two':2,'thr':3})df_4['snow_season'] = df_4['snow_season'].map({'dry':0,'snow':1})df_4['ice'] = df_4['ice'].map({'no':0,'yes':1})df_4['slush'] = df_4['slush'].map({'no':0,'yes':1})df_4['steep grade'] = df_4['steep grade'].map({'no':0,'yes':1})

描述性統(tǒng)計(jì)

可以看到，僅有7.3%的樣本為受傷事故，在該場景下，我們希望盡可能預(yù)測(cè)出受傷事故，因此我采用召回率（Recall）作為模型性能的評(píng)價(jià)指標(biāo)，以便更多的找到受傷事故樣本。此外，對(duì)于所有樣本，變量ice均為no，因此在建模過程中，我舍棄了該變量。

summary_statistics(df_4)

開始構(gòu)建決策樹

from sklearn.tree import DecisionTreeClassifier,export_graphvizfrom sklearn.model_selection import GridSearchCV,train_test_splitimport graphviz

x = df_4.drop(columns=['severity','ice'])y = df_4['severity']name = x.columns

利用網(wǎng)格搜索，標(biāo)定決策樹最優(yōu)超參數(shù)

cv_params = {'min_samples_split':range(1,20), 'max_depth':range(1,20),'criterion':['gini','entropy']}ind_params = {'random_state': 10}optimized_GBM = GridSearchCV(DecisionTreeClassifier(**ind_params),cv_params,scoring='recall', cv=5, n_jobs=-1, verbose=10)optimized_GBM.fit(x, y)print('最優(yōu)分?jǐn)?shù):', optimized_GBM.best_score_)

獲得最優(yōu)超參數(shù)：split criterion為gini，最大深度為9，最小劃分樣本數(shù)量為2

print(optimized_GBM.best_params_)

{'criterion': 'gini', 'max_depth': 9, 'min_samples_split': 2}

構(gòu)建決策樹

clf = DecisionTreeClassifier(criterion='gini', max_depth=9, min_samples_split=2,random_state=10)clf.fit(x,y)

畫出決策樹結(jié)構(gòu)圖

dot_data = export_graphviz(clf, out_file=None, feature_names=name,                     filled=True, rounded=True,  class_names=['PDO','INJ'],                     special_characters=True)  graph = graphviz.Source(dot_data)  graph.render('決策樹結(jié)構(gòu)')graph

解讀決策樹分類規(guī)則

對(duì)上述紅、黃、藍(lán)三個(gè)分類路徑進(jìn)行解讀：

首先決策樹根據(jù)運(yùn)行速度方差進(jìn)行樣本劃分，如果運(yùn)行速度方差大于1.515，樣本進(jìn)入右側(cè)分類路徑，這同樣符合常理，當(dāng)?shù)缆愤\(yùn)行速度方差大時(shí)，說明低速高速車輛同時(shí)存在，發(fā)生車禍時(shí)危險(xiǎn)程度更大。
對(duì)于右側(cè)樣本，繼續(xù)根據(jù)路肩寬度進(jìn)行劃分，如果路肩寬度小于等于2，樣本進(jìn)入左側(cè)劃分區(qū)域，如果路肩寬度大于2，樣本則進(jìn)入右側(cè)劃分區(qū)域。
對(duì)于速度方差大于1.515、路肩寬度小于等于2的樣本，如果路面濕滑，則被預(yù)測(cè)為PDO樣本，如果路面不濕滑且運(yùn)行速度方差小于等于1.726，則被預(yù)測(cè)為PDO樣本。
對(duì)于速度方差大于1.515、路肩寬度大于2的樣本，如果溫度小于-9，則被預(yù)測(cè)為INJ樣本。

5. 請(qǐng)基于Severity.csv文件，構(gòu)建隨機(jī)森林模型對(duì)事故嚴(yán)重程度的影響因素進(jìn)行重要度排序：

（1）給出隨機(jī)森林設(shè)定參數(shù)；

（2）給出變量重要度排序結(jié)果。

根據(jù)網(wǎng)格搜索選擇最優(yōu)參數(shù)

通常，針對(duì)隨機(jī)森林，有三個(gè)重要參數(shù)，分別為：弱學(xué)習(xí)器（決策樹）數(shù)量，弱學(xué)習(xí)器（決策樹）的深度（代表了弱學(xué)習(xí)器的復(fù)雜程度）以及每次進(jìn)行節(jié)點(diǎn)劃分時(shí)使用的特征數(shù)量。下面，我將利用袋外樣本recall指標(biāo)，對(duì)這三個(gè)參數(shù)的最優(yōu)組合進(jìn)行標(biāo)定。其中，弱學(xué)習(xí)器數(shù)量設(shè)置為[100,1600,100]，最大特征數(shù)設(shè)置為[3,6,9]，弱學(xué)習(xí)器深度設(shè)置為[1,50,5]

from sklearn.metrics import recall_score,accuracy_scorefrom sklearn.ensemble import RandomForestClassifierfrom sklearn.inspection import permutation_importance

gs = pd.MultiIndex.from_product([np.arange(100,1600,100),[3,6,9],np.arange(1,50,5)],names=['n_estimators','max_features','max_depth']).to_frame().reset_index(drop=True)
gs['oob_score'] = 0
### 開始網(wǎng)格搜索
from tqdm import tqdm
for i in tqdm(gs.index):    rf = RandomForestClassifier(n_estimators=gs.loc[i,'n_estimators'],max_features=gs.loc[i,'max_features'],max_depth=gs.loc[i,'max_depth'],oob_score=True,random_state=20,n_jobs=-1).fit(x,y.ravel())    gs.loc[i,'oob_score'] = recall_score(y,[0 if i[0] >= 0.5 else 1 for i in rf.oob_decision_function_ ])

plt.figure(figsize=(10,6))plt.scatter(100,gs[gs['max_depth']==16]['oob_score'].max(),color='red',label='Optimal point',zorder=1)plt.legend()

sns.lineplot(x='n_estimators',y='oob_score',data=gs[gs['max_depth']==16],hue='max_features',palette=['C0','C1','C2'],zorder=0)

plt.xticks(range(100,1600,100),fontsize=12,rotation=45)plt.yticks(fontsize=12)plt.xlabel('n_estimators',fontsize=15)plt.ylabel('oob_score',fontsize=15)

得到最優(yōu)模型參數(shù)

100個(gè)弱學(xué)習(xí)器，最大特征數(shù)為3，決策樹深度為16

rf = RandomForestClassifier(n_estimators=100,max_depth=16,max_features=3,random_state=20)rf.fit(x,y)

計(jì)算impurity-based特征重要度和permutation特征重要度

其中，impurity-based特征重要度根據(jù)每個(gè)節(jié)點(diǎn)根據(jù)特征劃分后獲得的基尼增益計(jì)算；permutation特征重要度為將特征隨機(jī)打亂，計(jì)算打亂后模型性能的變化。

feature = pd.DataFrame(list(zip(name,rf.feature_importances_)),columns=['Variable','Relative Importance (impurity-based)']).sort_values(by='Relative Importance (impurity-based)')

perm_imp = permutation_importance(rf,x,y,random_state=10,n_repeats=99)

feature_perm = pd.DataFrame(list(zip(name,perm_imp['importances_mean'])),columns=['Variable','Relative Importance']).sort_values(by='Relative Importance')

feature_perm['Relative Importance'] = feature_perm['Relative Importance']/feature_perm['Relative Importance'].sum()feature_perm.columns = ['Variable', 'Relative Importance (permutation)']feature = pd.merge(feature,feature_perm,on='Variable',how='left')

plt.figure(figsize=(6,8))plt.barh(np.arange(13)+0.15,feature['Relative Importance (impurity-based)'],height=0.3,color='k')plt.barh(np.arange(13)-0.15,feature['Relative Importance (permutation)'],height=0.3,color='C3')plt.ylim(-0.5,12.5)plt.yticks(range(13),feature['Variable'],fontsize=13)plt.xticks(fontsize=14)plt.xlabel('Relative Importance',fontsize=16)plt.yticks(fontsize=16)plt.legend(['impurity-based','permutation'],fontsize=13)plt.show()

特征變量重要度排序結(jié)果

可以看出，總體上無論是impurity-based還是permutation方法計(jì)算的特征重要度，速度的標(biāo)準(zhǔn)差和溫度都是最重要的兩個(gè)變量，其次是能見度、降雨量、中央隔離帶寬度和是否陡坡；在兩種方式計(jì)算的特征重要度中，限速、車道數(shù)、貨車比例和日均流量對(duì)數(shù)均為四個(gè)特征重要度最小的變量。

本站是提供個(gè)人知識(shí)管理的網(wǎng)絡(luò)存儲(chǔ)空間，所有內(nèi)容均由用戶發(fā)布，不代表本站觀點(diǎn)。請(qǐng)注意甄別內(nèi)容中的聯(lián)系方式、誘導(dǎo)購買等信息，謹(jǐn)防詐騙。如發(fā)現(xiàn)有害或侵權(quán)內(nèi)容，請(qǐng)點(diǎn)擊一鍵舉報(bào)。

轉(zhuǎn)藏 分享

QQ空間 QQ好友新浪微博微信

獻(xiàn)花（0） +1

來自：計(jì)量經(jīng)濟(jì)圈 > 《待分類》

舉報(bào)/認(rèn)領(lǐng)