Sentiment Analysis and Machine Learning in R

icecity1306 2017-08-18


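Machine learning makes sentiment analysis straightforward. This post shows how to do sentiment analysis in R with machine learning methods. R packages for sentiment analysis, and for text mining more generally, are well developed, notably those by Timothy P. Jurka: see the sentiment package and the excellent RTextTools package. Timothy also wrote maxent, an R package for low-memory multinomial logistic regression (also known as maximum entropy).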

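However, RTextTools does not include naive Bayes. The e1071 package implements naive Bayes well. It comes from the Department of Statistics at TU Wien (Vienna University of Technology), where E1071 was the former designation of the probability theory group. Its principal developer is David Meyer.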

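Some background in text analysis is still necessary. R is well established for text analysis (see "Natural Language Processing in R"). The tm package is one of the successes here: it is a framework for text-mining applications in R. It handles text cleaning (stemming, stop-word removal, and so on) and the conversion of text into a document-term matrix (dtm) well. http://www./v25/i05/paper is an introduction to it. The most important step in text analysis is obtaining a feature vector for each document, and word features are the most important of these. You can also extend unigrams to bigrams, trigrams, and general n-grams. This post uses single-word (unigram) features for the demonstration.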
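To make the idea of a document-term matrix concrete, here is a minimal base-R sketch that counts unigrams for three toy sentences. It is an illustration only: tm and RTextTools apply additional cleaning (for example, minimum word length and punctuation removal), so their output will differ from this hand-rolled version.

```r
# Toy corpus (three of the example sentences used later in this post)
docs <- c("I love this car",
          "This view is amazing",
          "I do not like this car")

# Lowercase and split on whitespace to get unigram tokens
tokens <- strsplit(tolower(docs), "\\s+")

# Vocabulary: the sorted set of distinct words
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per word; cells hold word counts
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)

print(dtm)
```

Each row of `dtm` is the feature vector of one document, which is the form of input the classifiers below work on.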

Note that the ngram package handles n-grams in R. In the past, the RWeka package provided functions for this; for an example see http:///questions/8161167/what-algorithm-i-need-to-find-n-grams?. Now you can also set the ngramLength argument of create_matrix in RTextTools.
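For readers unfamiliar with n-grams, the following base-R sketch shows what extending unigrams to bigrams means. The helper name `make_ngrams` is hypothetical and written for this illustration; it is not part of ngram, RWeka, or RTextTools.

```r
# Hypothetical helper: split a sentence into word n-grams
make_ngrams <- function(text, n = 2) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  # Fewer than n words: no n-gram can be formed
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- make_ngrams("I do not like this car", n = 2)
print(bigrams)
```

Bigrams such as "not like" keep the negation attached to the word it modifies, which single-word features cannot do.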

Training a naive Bayes model in R

Load the data:

library(RTextTools)
library(e1071)

pos_tweets =  rbind(
  c('I love this car', 'positive'),
  c('This view is amazing', 'positive'),
  c('I feel great this morning', 'positive'),
  c('I am so excited about the concert', 'positive'),
  c('He is my best friend', 'positive')
)

neg_tweets = rbind(
  c('I do not like this car', 'negative'),
  c('This view is horrible', 'negative'),
  c('I feel tired this morning', 'negative'),
  c('I am not looking forward to the concert', 'negative'),
  c('He is my enemy', 'negative')
)

test_tweets = rbind(
  c('feel happy this morning', 'positive'),
  c('larry friend', 'positive'),
  c('not like that man', 'negative'),
  c('house not great', 'negative'),
  c('your song annoying', 'negative')
)

tweets = rbind(pos_tweets, neg_tweets, test_tweets)

Build the document-term matrix:

# build dtm
matrix= create_matrix(tweets[,1], language="english", 
                      removeStopwords=FALSE, removeNumbers=TRUE, 
                      stemWords=FALSE)

Now we can train a naive Bayes model on this data set. Note that e1071 requires the response variable to be numeric or a factor. We convert the character labels to a factor as follows:

# train the model
mat = as.matrix(matrix)
classifier = naiveBayes(mat[1:10,], as.factor(tweets[1:10,2]) )

Check the accuracy on the test set:

# test the validity
predicted = predict(classifier, mat[11:15,]); predicted
table(tweets[11:15, 2], predicted)
recall_accuracy(tweets[11:15, 2], predicted)
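The two evaluation steps above, a confusion table and an overall accuracy, can be reproduced by hand. The sketch below uses the five test labels from this post together with a made-up prediction vector; the predictions are invented for illustration and are not the output of the model above. It assumes that recall_accuracy reports the share of predictions that match the true labels.

```r
# True labels of the five test tweets
actual <- factor(c("positive", "positive", "negative", "negative", "negative"),
                 levels = c("negative", "positive"))

# Hypothetical predictions, invented for illustration only
predicted <- factor(c("positive", "negative", "negative", "negative", "positive"),
                    levels = c("negative", "positive"))

# Rows: true class, columns: predicted class
confusion <- table(actual, predicted)
print(confusion)

# Accuracy: correctly classified documents (the diagonal) over all documents
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```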

The result matches the one obtained with Python (see http://chengjun./en/2012/03/sentiment-analysi-with-python/ for the Python version).

What about other machine learning methods?

Next we use the RTextTools package.

First, specify the data:

# build the data to specify response variable, training set, testing set.
container = create_container(matrix, as.numeric(as.factor(tweets[,2])),
                             trainSize=1:10, testSize=11:15,virgin=FALSE)
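The container above receives the labels as numeric codes via as.numeric(as.factor(...)). A short base-R sketch shows which code each class gets, which is needed to read the accuracy tables later on. Factor levels are sorted alphabetically by default, so "negative" comes before "positive".

```r
labels <- c("positive", "positive", "negative")

# Levels are sorted alphabetically: "negative" first, then "positive"
f <- as.factor(labels)
print(levels(f))

# Numeric codes are positions in the level vector
codes <- as.numeric(f)
print(codes)

# Map codes back to the original strings
decoded <- levels(f)[codes]
print(decoded)
```

In the tables below, code 1 therefore stands for negative and code 2 for positive.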

Second, train models with several machine learning algorithms:

models = train_models(container, algorithms=c("MAXENT" , "SVM", "RF", "BAGGING", "TREE"))

Now we can classify the test set with the trained models:

results = classify_models(container, models)

How accurate are they?

# accuracy table
table(as.numeric(as.factor(tweets[11:15, 2])), results[,"FORESTS_LABEL"])
table(as.numeric(as.factor(tweets[11:15, 2])), results[,"MAXENTROPY_LABEL"])

# recall accuracy
recall_accuracy(as.numeric(as.factor(tweets[11:15, 2])), results[,"FORESTS_LABEL"])
recall_accuracy(as.numeric(as.factor(tweets[11:15, 2])), results[,"MAXENTROPY_LABEL"])
recall_accuracy(as.numeric(as.factor(tweets[11:15, 2])), results[,"TREE_LABEL"])
recall_accuracy(as.numeric(as.factor(tweets[11:15, 2])), results[,"BAGGING_LABEL"])
recall_accuracy(as.numeric(as.factor(tweets[11:15, 2])), results[,"SVM_LABEL"])

Get a summary of the model results (in particular, their validity):

# model summary
analytics = create_analytics(container, results)
summary(analytics)
head(analytics@document_summary)
analytics@ensemble_summary

Cross-validate the results:

N=4
set.seed(2014)
cross_validate(container,N,"MAXENT")
cross_validate(container,N,"TREE")
cross_validate(container,N,"SVM")
cross_validate(container,N,"RF")
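As a rough picture of what N-fold cross-validation does, the base-R sketch below splits ten documents into four folds and builds the train/test index sets for each round. This only illustrates the general technique; the way cross_validate in RTextTools assigns documents to folds may differ.

```r
set.seed(2014)
n_docs  <- 10
n_folds <- 4

# Assign each document to one fold, then shuffle the assignment
fold_id <- sample(rep(seq_len(n_folds), length.out = n_docs))

# For fold k, documents in fold k are held out and the rest are used for training
splits <- lapply(seq_len(n_folds), function(k) {
  list(train = which(fold_id != k), test = which(fold_id == k))
})

# Number of held-out documents per fold
fold_sizes <- vapply(splits, function(s) length(s$test), integer(1))
print(fold_sizes)
```

A model would be trained on `train` and scored on `test` in each round, and the per-fold accuracies averaged.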

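The results are available on my RPubs page. maxent reaches the same accuracy as naive Bayes, while the other methods do worse. This is understandable, because the data set is very small. With a larger training set, the more sophisticated methods give better results for sentiment analysis of tweets, as the following example shows: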

Sentiment analysis of tweets

The data come from victorneo, who demonstrated sentiment analysis of tweets in Python. Here we do it in R.

Load the data:

###################
# load data
###################
setwd("D:/Twitter-Sentimental-Analysis-master/")
happy = readLines("./happy.txt")
sad = readLines("./sad.txt")
happy_test = readLines("./happy_test.txt")
sad_test = readLines("./sad_test.txt")

tweet = c(happy, sad)
tweet_test= c(happy_test, sad_test)
tweet_all = c(tweet, tweet_test)
sentiment = c(rep("happy", length(happy) ), 
              rep("sad", length(sad)))
sentiment_test = c(rep("happy", length(happy_test) ), 
                   rep("sad", length(sad_test)))
sentiment_all = as.factor(c(sentiment, sentiment_test))

library(RTextTools)
library(e1071)  # provides naiveBayes()

First, try naive Bayes:

# naive bayes
# weighting is passed by name; unnamed, it would bind to minDocFreq instead
mat= create_matrix(tweet_all, language="english", 
                   removeStopwords=FALSE, removeNumbers=TRUE, 
                   stemWords=FALSE, weighting=tm::weightTfIdf)

mat = as.matrix(mat)

classifier = naiveBayes(mat[1:160,], as.factor(sentiment_all[1:160]))
predicted = predict(classifier, mat[161:180,]); predicted

table(sentiment_test, predicted)
recall_accuracy(sentiment_test, predicted)
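The matrix in this example is weighted with tm::weightTfIdf rather than raw counts. The base-R sketch below shows one common tf-idf formulation (relative term frequency times log2 of the inverse document frequency) on a tiny made-up count matrix. It is an assumption that this matches the default behavior of tm::weightTfIdf in every detail; treat it as an illustration of the idea.

```r
# Made-up document-term counts for three documents and three terms
counts <- matrix(c(1, 1, 0,
                   1, 0, 1,
                   1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3), c("this", "car", "view")))

# Term frequency: counts divided by the document length (row-wise)
tf <- counts / rowSums(counts)

# Inverse document frequency: log2(number of documents / documents containing the term)
df  <- colSums(counts > 0)
idf <- log2(nrow(counts) / df)

# Multiply every column of tf by the idf of its term
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))
```

A word such as "this" that occurs in every document gets weight zero, so it no longer contributes to the classification.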

Then try the other methods:

# the other methods
mat= create_matrix(tweet_all, language="english", 
                   removeStopwords=FALSE, removeNumbers=TRUE, 
                   stemWords=FALSE, weighting=tm::weightTfIdf)

container = create_container(mat, as.numeric(sentiment_all),
                             trainSize=1:160, testSize=161:180,virgin=FALSE) # removeSparseTerms can also be set (a create_matrix argument)

models = train_models(container, algorithms=c("MAXENT",
                                              "SVM",
                                              #"GLMNET", "BOOSTING", 
                                              "SLDA","BAGGING", 
                                              "RF", # "NNET", 
                                              "TREE" 
))

# test the model
results = classify_models(container, models)
table(as.numeric(sentiment_all[161:180]), results[,"FORESTS_LABEL"])
recall_accuracy(as.numeric(sentiment_all[161:180]), results[,"FORESTS_LABEL"])

We also want the formal test results here. They include:

analytics@algorithm_summary: summary of precision, recall, F-scores, and accuracy
analytics@label_summary: summary of the class labels
analytics@document_summary: raw summary of all data and scores
analytics@ensemble_summary: summary of ensemble precision/coverage
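Precision, recall, and the F-score listed above can be computed by hand for one class. The sketch below does this for the "happy" class using made-up labels and predictions (invented for illustration, not taken from the tweet data); how create_analytics aggregates these numbers across classes and algorithms is not shown here.

```r
# Made-up true labels and predictions, for illustration only
actual    <- c("happy", "happy", "happy", "sad", "sad", "sad", "sad", "happy")
predicted <- c("happy", "happy", "sad",   "sad", "sad", "happy", "sad", "happy")

# Counts for the "happy" class
tp <- sum(predicted == "happy" & actual == "happy")  # true positives
fp <- sum(predicted == "happy" & actual == "sad")    # false positives
fn <- sum(predicted == "sad"   & actual == "happy")  # false negatives

precision <- tp / (tp + fp)  # share of predicted "happy" that is correct
recall    <- tp / (tp + fn)  # share of true "happy" that was found
f_score   <- 2 * precision * recall / (precision + recall)

print(c(precision = precision, recall = recall, f_score = f_score))
```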

Now let's look at the results:

# formal tests
analytics = create_analytics(container, results)
summary(analytics)

head(analytics@algorithm_summary)
head(analytics@label_summary)
head(analytics@document_summary)
analytics@ensemble_summary # Ensemble Agreement

# Cross Validation
N=3
cross_SVM = cross_validate(container,N,"SVM")
cross_GLMNET = cross_validate(container,N,"GLMNET")
cross_MAXENT = cross_validate(container,N,"MAXENT")

Compared with naive Bayes, the other algorithms perform better, with recall accuracy above 0.95. The results are available on RPubs.

Note: for the meaning of the four summaries above, see the article "Text Classification in R" (R之文本分类).

Reposted from: Xueqing Data Network (雪晴数据网) http://www./cms/article/107