轉(zhuǎn)載幾個R語言中實現(xiàn)Logistic回歸模型的案例

生物_醫(yī)藥_科研 2019-01-24

展開全文

案例一：本文用例來自于John Maindonald所著的《Data Analysis and Graphics Using R》一書，其中所用的數(shù)據(jù)集是anesthetic，數(shù)據(jù)集來自于一組醫(yī)學(xué)數(shù)據(jù)，其中變量conc表示麻醉劑的用量，move則表示手術(shù)病人是否有所移動，而我們用nomove做為因變量，因為研究的重點在于conc的增加是否會使nomove的概率增加。

首先載入數(shù)據(jù)集并讀取部分文件，為了觀察兩個變量之間關(guān)系，我們可以利cdplot函數(shù)來繪制條件密度圖

install.packages('DAAG')

library(lattice)

library(DAAG)

head(anesthetic)

move conc logconc nomove

1 0 1.0 0.0000000 1

2 1 1.2 0.1823216 0

3 0 1.4 0.3364722 1

4 1 1.4 0.3364722 0

5 1 1.2 0.1823216 0

6 0 2.5 0.9162907 1

cdplot(factor(nomove)~conc,data=anesthetic,main='條件密度圖',ylab='病人移動',xlab='麻醉劑量')

從圖中可見，隨著麻醉劑量加大，手術(shù)病人傾向于靜止。下面利用logistic回歸進行建模，得到intercept和conc的系數(shù)為-6.47和5.57，由此可見麻醉劑量超過1.16(6.47/5.57)時，病人靜止概率超過50%。

anes1=glm(nomove~conc,family=binomial(link='logit'),data=anesthetic)

summary(anes1)

結(jié)果顯示：

Call:

glm(formula = nomove ~ conc, family = binomial(link = 'logit'),

data = anesthetic)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.76666 -0.74407 0.03413 0.68666 2.06900

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -6.469 2.418 -2.675 0.00748 **

conc 5.567 2.044 2.724 0.00645 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 41.455 on 29 degrees of freedom

Residual deviance: 27.754 on 28 degrees of freedom

AIC: 31.754

Number of Fisher Scoring iterations: 5

下面做出模型的ROC曲線

anes1=glm(nomove~conc,family=binomial(link='logit'),data=anesthetic)

對模型做出預(yù)測結(jié)果

pre=predict(anes1,type='response')

將預(yù)測概率pre和實際結(jié)果放在一個數(shù)據(jù)框中

data=data.frame(prob=pre,obs=anesthetic$nomove)

將預(yù)測概率按照從低到高排序

data=data[order(data$prob),]

n=nrow(data)

tpr=fpr=rep(0,n)

根據(jù)不同的臨界值threshold來計算TPR和FPR，之后繪制成圖

for (i in 1:n){

threshold=data$prob[i]

tp=sum(data$prob>threshold&data$obs==1)

fp=sum(data$prob>threshold&data$obs==0)

tn=sum(data$prob

fn=sum(data$prob

tpr[i]=tp/(tp+fn) #真正率

fpr[i]=fp/(tn+fp) #假正率

}

plot(fpr,tpr,type='l')

abline(a=0,b=1)

R中也有專門繪制ROC曲線的包，如常見的ROCR包，它不僅可以用來畫圖，還能計算ROC曲線下面面積AUC,以評價分類器的綜合性能，該數(shù)值取0-1之間，越大越好。

library(ROCR)

pred=prediction(pre,anesthetic$nomove)

performance(pred,'auc')@y.values

perf=performance(pred,'tpr','fpr')

plot(perf)

還可以使用更加強大的pROC包，它可以方便的比較兩個分類器，并且能自動標(biāo)出最優(yōu)臨界點，圖形看起來比較漂亮：

install.packages('pROC')

library(pROC)

modelroc=roc(anesthetic$nomove,pre)

plot(modelroc,print.auc=TRUE,auc.polygon=TRUE,grid=c(0.1,0.2),grid.col=c('green','red'),max.auc.polygon=TRUE,auc.polygon.col='blue',print.thres=TRUE)

上面的方法是使用原始的0-1數(shù)據(jù)進行建模,即每一行數(shù)據(jù)均表示一個個體，另一種是使用匯總數(shù)據(jù)進行建模，先將原始數(shù)據(jù)按下面步驟進行匯總

anestot=aggregate(anesthetic[,c('move','nomove')],by=list(conc=anesthetic$conc),FUN=sum)

結(jié)果如下：

conc move nomove

1 0.8 6 1

2 1.0 4 1

3 1.2 2 4

4 1.4 2 4

5 1.6 0 4

6 2.5 0 2

anestot$conc=as.numeric(as.character(anestot$conc))

anestot$total=apply(anestot[,c('move','nomove')],1,sum)

anestot$total

[1] 7 5 6 6 4 2

anestot$prop=anestot$nomove/anestot$total

anestot$prop

[1] 0.1428571 0.2000000 0.6666667 0.6666667 1.0000000 1.0000000

對于匯總數(shù)據(jù)，有兩種方法可以得到同樣的結(jié)果，一種是將兩種結(jié)果的向量合并做為因變量，如anes2模型。另一種是將比率做為因變量，總量做為權(quán)重進行建模，如anes3模型。這兩種建模結(jié)果是一樣的。

anes2=glm(cbind(nomove,move)~conc,family=binomial(link='logit'),data=anestot)

summary(anes2)

結(jié)果顯示如下：

Call:

glm(formula = cbind(nomove, move) ~ conc, family = binomial(link = 'logit'),

data = anestot)

Deviance Residuals:

1 2 3 4 5 6

0.20147 -0.45367 0.56890 -0.70000 0.81838 0.04826

Coefficients:

Estimate Std. Error z value Pr(>z)

(Intercept) -6.469 2.419 -2.675 0.00748 **

conc 5.567 2.044 2.724 0.00645 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 15.4334 on 5 degrees of freedom

Residual deviance: 1.7321 on 4 degrees of freedom

AIC: 13.811

Number of Fisher Scoring iterations: 5

anes3=glm(prop~conc,family=binomial(link='logit'),weights=total,data=anestot)

結(jié)果和上面的一樣。

根據(jù)logistic模型，我們可以使用predict函數(shù)來預(yù)測結(jié)果，下面根據(jù)上述模型來繪圖

x=seq(from=0,to=3,length.out=30)

y=predict(anes1,data.frame(conc=x),type='response')

plot(prop~conc,pch=16,col='red',data=anestot,xlim=c(0.5,3),main='Logistic回歸曲線圖',ylab='病人靜止概率',xlab='麻醉劑量')

lines(y~x,lty=2,col='blue')

案例二：利用iris數(shù)據(jù)集，進行邏輯回歸二分類測試，該數(shù)據(jù)集是R語言自帶得數(shù)據(jù)集，包括四個屬性，和三個分類。邏輯回歸我們用glm函數(shù)實現(xiàn)，該函數(shù)提供了各種類型的回歸，如：提供正態(tài)、指數(shù)、gamma、逆高斯、Poisson、二項。我們用的logistic回歸使用的是二項分布族binomial。

index <- which(iris$species="=">

將種類為setosa的數(shù)據(jù)排除出我們需要的數(shù)據(jù)集

ir <- iris[-="">

levels(ir$Species)[1] <->

生成訓(xùn)練集

split <->

ir_train <->

生成測試集
ir_test <->

通過訓(xùn)練集建立模型
model <- glm(species="" ~.,family="">

summary(model)

模型運行結(jié)果：

Call:

glm(formula = Species ~ ., family = binomial(link = 'logit'),data = ir_train)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.339e-04 -2.100e-08 2.100e-08 2.100e-08 1.059e-04

Coefficients:

Estimate Std. Error z value Pr(>z)

(Intercept) -1502.72 363247.01 -0.004 0.997

Sepal.Length 12.45 66482.13 0.000 1.000

Sepal.Width -285.61 95437.92 -0.003 0.998

Petal.Length 154.76 115968.97 0.001 0.999

Petal.Width 869.60 204513.80 0.004 0.997

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 9.0949e+01 on 65 degrees of freedom

Residual deviance: 4.0575e-08 on 61 degrees of freedom

AIC: 10

Number of Fisher Scoring iterations: 25

通過anova()函數(shù) 對模型進行方差分析

anova(model, test='Chisq')

方差分析如下：

Analysis of Deviance Table

Model: binomial, link: logit

Response: Species

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev Pr(>Chi)

NULL 65 90.949

Sepal.Length 1 18.934 64 72.015 1.353e-05 ***

Sepal.Width 1 0.131 63 71.884 0.7176

Petal.Length 1 51.960 62 19.924 5.665e-13 ***

Petal.Width 1 19.924 61 0.000 8.058e-06 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

下面通過McFadden R2指標(biāo)進一步對模型進行分析

install.packages('pscl')

library(pscl)

pR2(model)

llh llhNull G2 McFadden r2ML r2CU

-2.028752e-08 -4.547461e+01 9.094922e+01 1.000000e+00 7.479224e-01 1.000000e+00

為了得到分類結(jié)果，需要設(shè)定一個閾值p0——當(dāng)p大于p0時，認(rèn)為該樣本的響應(yīng)變量為1，否則為0。閾值大小對模型的預(yù)測效果有較大影響，需要進一步考慮。首先必須明確模型預(yù)測效果的評價指標(biāo)。

求解訓(xùn)練模型的最佳閥值

對模型做出預(yù)測結(jié)果

model <- glm(species="" ~.,family="">

pre1=predict(model,type='response')

將預(yù)測概率pre1和實際結(jié)果放在一個數(shù)據(jù)框中

data1=data.frame(prob=pre1,obs=ifelse(ir_train$Species=='virginica',1,0))

將預(yù)測概率按照從低到高排序

data1=data1[order(data1$prob),]

n=nrow(data1)

tpr=fpr=rep(0,n)

根據(jù)不同的臨界值threshold來計算TPR和FPR，之后繪制成圖

for (i in 1:n){

threshold=data1$prob[i]

tp=sum(data1$prob>threshold&data1$obs==1)

fp=sum(data1$prob>threshold&data1$obs==0)

tn=sum(data$prob

fn=sum(data$prob

tpr[i]=tp/(tp+fn) #真正率

fpr[i]=fp/(tn+fp) #假正率

}

plot(fpr,tpr,type='l')

abline(a=0,b=1)

下面通過pROC包自動標(biāo)出最優(yōu)臨界點(0.506)

install.packages('pROC')

library(pROC)

modelroc1=roc(ifelse(ir_train$Species=='virginica',1,0),pre1)

plot(modelroc1,print.auc=TRUE,auc.polygon=TRUE,grid=c(0.1,0.2),grid.col=c('green','red'),max.auc.polygon=TRUE,auc.polygon.col='skyblue',print.thres=TRUE)

評估模型的預(yù)測效果

predict <- predict(model,type='response' ,newdata="">

predict.results <- ifelse(="" predict=""> 0.506,'virginica','versicolor')

misClasificError <- mean(predict.results="" !="">

print(paste('Accuracy',1-misClasificError))

[1] 'Accuracy 1'

最后一步，我們將通過畫ROC曲線，并計算其AUC面積，作為評估二類分類效果的一個典型測量

install.packages('ROCR')

library(gplots)

library(ROCR)

p <- predict(model,type='response' ,newdata="">

p.results <- ifelse(="" p=""> 0.5,1,0)

pr <- prediction(p.results,="" ifelse(ir_test$species="">

prf <- performance(pr,="" measure='tpr' ,="" x.measure='fpr'>

plot(prf)

auc <- performance(pr,="" measure='auc'>

auc <>

0.9285714

auc

real <->

data.frame(real,predict)

res <- data.frame(real,predict="ifelse(predict">0.5,'virginca','versicorlor'))

查看模型效果

plot(res)

基于R的案例

以下這個例子主要參考《Introduction to statistical learning with R》，強烈推薦這本書。尤其是回歸部分，講解的很透徹。

數(shù)據(jù)以Smarket為例，該數(shù)據(jù)包含2001-2005年標(biāo)準(zhǔn)普爾500指數(shù)1250個觀測值，9個變量。9個變量分別為：年份，前5個交易日的收益率，前一個交易日和交易額（單位：billions），今天的回報率和走勢（上升或下降）。

df=read.csv('house.csv',header=TRUE)

str(df)

data.frame': 1250 obs. of 9 variables:

$ Year : num 2001 2001 2001 2001 2001 ...

$ Lag1 : num 0.381 0.959 1.032 -0.623 0.614 ...

$ Lag2 : num -0.192 0.381 0.959 1.032 -0.623 ...

$ Lag3 : num -2.624 -0.192 0.381 0.959 1.032 ...

$ Lag4 : num -1.055 -2.624 -0.192 0.381 0.959 ...

$ Lag5 : num 5.01 -1.055 -2.624 -0.192 0.381 ...

$ Volume : num 1.19 1.3 1.41 1.28 1.21 ...

$ Today : num 0.959 1.032 -0.623 0.614 0.213 ...

$ Direction: Factor w/ 2 levels 'Down','Up': 2 2 1 2 2 2 1 2 2 2 ...

#以前5個交易日的收益率和前一個工作日的交易額為自變量，今天的走勢做因變量做Logistic回歸；

glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,

data=Smarket,family='binomial')

summary(glm.fit)

Call:

glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +

Volume, family = 'binomial', data = Smarket)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.446 -1.203 1.065 1.145 1.326

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.126000 0.240736 -0.523 0.601

Lag1 -0.073074 0.050167 -1.457 0.145

Lag2 -0.042301 0.050086 -0.845 0.398

Lag3 0.011085 0.049939 0.222 0.824

Lag4 0.009359 0.049974 0.187 0.851

Lag5 0.010313 0.049511 0.208 0.835

Volume 0.135441 0.158360 0.855 0.392

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1731.2 on 1249 degrees of freedom

Residual deviance: 1727.6 on 1243 degrees of freedom

AIC: 1741.6

Number of Fisher Scoring iterations: 3

有時候，預(yù)測股市就是這么不靠譜。很多自變量沒通過驗證，接下來我們基于AIC準(zhǔn)則用逐步回歸做一下變量篩選；

注：AIC一般公式為 AIC=2k-ln(L),其中k為參數(shù)的個數(shù)，L為估計模型最大化的似然函數(shù)值；

step(glm.fit)

Start: AIC=1741.58

Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume

Df Deviance AIC

- Lag4 1 1727.6 1739.6

- Lag5 1 1727.6 1739.6

- Lag3 1 1727.6 1739.6

- Lag2 1 1728.3 1740.3

- Volume 1 1728.3 1740.3

1727.6 1741.6

- Lag1 1 1729.7 1741.7

#此處略去一百字；

Direction ~ 1

Call: glm(formula = Direction ~ 1, family = 'binomial', data = Smarket)

Coefficients:

(Intercept)

0.07363

Degrees of Freedom: 1249 Total (i.e. Null); 1249 Residual

Null Deviance: 1731

Residual Deviance: 1731 AIC: 1733

這個結(jié)果足以讓你崩潰，扔硬幣都比這靠譜。原來不用任何變量預(yù)測是最靠譜的。

#模型的評估

glm.probs =predict(glm.fit,type ='response')

glm.probs[1:10]

#生成啞變量

contrasts(Smarket$Direction)

glm.pred=rep ('Down ' ,1250)

glm.pred[glm .probs >.5]=' Up'

table(glm .pred ,Direction )

mean(glm.pred== Direction )

[1] 0.5216

通過混淆矩陣計算分類的準(zhǔn)確性僅有52%，很不樂觀；

挖掘本質(zhì)上是挖掘有趣的的模式，以上數(shù)據(jù)說明我們白忙活了一場，但是我們通過實例學(xué)習(xí)了Logistic模型。

當(dāng)然我們還可以通過數(shù)據(jù)變換或構(gòu)造新的變量來提升擬合降低AIC，最終，模型講究的還是泛化能力；

有時候：擬合很豐滿，泛化很骨感。

本站是提供個人知識管理的網(wǎng)絡(luò)存儲空間，所有內(nèi)容均由用戶發(fā)布，不代表本站觀點。請注意甄別內(nèi)容中的聯(lián)系方式、誘導(dǎo)購買等信息，謹(jǐn)防詐騙。如發(fā)現(xiàn)有害或侵權(quán)內(nèi)容，請點擊一鍵舉報。

轉(zhuǎn)藏 分享

QQ空間 QQ好友新浪微博微信

獻花（0） +1

來自：生物_醫(yī)藥_科研 > 《待分類》

舉報/認(rèn)領(lǐng)

0條評論

發(fā)表

請遵守用戶評論公約

類似文章 更多

生物_醫(yī)藥_科研

關(guān)注對話

TA的最新館藏

國產(chǎn)第2款MET抑制劑獲批，盤點30億級賽道的入局者和新機會
黑色素瘤一線治療又添新方案！FDA批準(zhǔn)首個免疫+雙靶向療法用于BRAFV600突變晚期黑色素瘤
國際視野丨膽道惡性腫瘤的靶點及耐藥機制
恒瑞創(chuàng)新藥卡瑞利珠單抗聯(lián)合阿帕替尼非小細(xì)胞肺癌研究發(fā)表于JTO
17 款四代靶向藥，全面攻堅 EGFR 耐藥
最全整理匯總：HER2陽性乳腺癌靶向治療研究進展

喜歡該文的人也喜歡更多

熱門閱讀換一換

小男孩‘自慰网亚洲一区二区,亚洲一级在线播放毛片,亚洲中文字幕av每天更新,黄aⅴ永久免费无码,91成人午夜在线精品,色网站免费在线观看,亚洲欧洲wwwww在线观看

轉(zhuǎn)載幾個R語言中實現(xiàn)Logistic回歸模型的案例