Writing a Web Crawler in Python to Download Articles

toppoo 2014-05-11

Posted 2014-05-01 16:10:02. Original post on CSDN Blog: http://blog.csdn.net/acdreamers/article/details/24719937

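Today I will show how to download articles with a Python crawler, using Han Han's blog as the example and working through it step by step.

Han Han's blog is at: http://na.com.cn/s/articlelist_1191258123_0_1.html

The article list is on the left of the page, and it spans more than one page. We start with the simplest case and download a single article, then download every article on one page, and finally download all the articles.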

Step 1: Download a single article

Open Han Han's blog and view the page source. Each entry in the article list looks like this:

<span class="atc_title"><a title="東望洋" target="_blank"

href=" http://na.com.cn/s/blog_4701280b0102eck1.html ">東望洋</a></span>

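What we want is the URL inside the href attribute (shown in green in the original post). With that URL we can fetch the article page and save it. Suppose we already have this string: how do we extract the URL from it? Every entry of this kind contains the substrings '<a title=', 'href=' and '.html', so we can use the crudest possible approach and locate those substrings to mark where the URL begins and ends.

Python strings have a method called find(), which searches for a substring and returns the index where it first occurs. The following code uses it to extract the URL, then reads the page and saves it to a file.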

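#encoding:utf-8
import urllib2

def getURL(entry):
  # Locate the URL between 'href="' and '.html' in one list entry.
  start = entry.find(r'href=')
  start += 6                          # skip over 'href="'
  end   = entry.find(r'.html', start) # search only after the href marker
  end   += 5                          # keep the '.html' suffix
  url = entry[start : end].strip()    # drop padding spaces inside the quotes
  return url

def getContext(url):
  # Fetch the raw HTML of the article page.
  text = urllib2.urlopen(url).read()
  return text

def StoreContext(url):
  content  = getContext(url)
  filename = url[-20:]                # last 20 characters of the URL
  f = open(filename, 'wb')
  f.write(content)
  f.close()

if __name__ == '__main__':
  entry = '<span class="atc_title"><a title="東望洋" target="_blank" href="http://na.com.cn/s/blog_4701280b0102eck1.html">東望洋</a></span>'
  url = getURL(entry)
  StoreContext(url)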
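
One weakness of delimiting with find() is that it returns -1 when the substring is missing, so adding an offset to the result silently produces a wrong slice. The following is a minimal offline sketch of the same idea with that case handled. The function name extract_url and the placeholder title "t" are illustrative and do not come from the original post.

```python
# -*- coding: utf-8 -*-

def extract_url(entry):
    """Return the URL inside href="..." of one list entry, or None if absent."""
    marker = 'href="'
    start = entry.find(marker)
    if start == -1:                      # find() returns -1 when nothing matches
        return None
    start += len(marker)
    end = entry.find('.html', start)     # search only after the href marker
    if end == -1:
        return None
    end += len('.html')
    return entry[start:end].strip()      # tolerate spaces inside the quotes


# Sample entry modelled on the list markup shown earlier (title is a placeholder).
sample = ('<span class="atc_title"><a title="t" target="_blank" '
          'href=" http://na.com.cn/s/blog_4701280b0102eck1.html ">t</a></span>')

print(extract_url(sample))
print(extract_url('<span>no link here</span>'))
```

Returning None lets the caller skip malformed entries instead of trying to download a garbage URL.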

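Steps 2 and 3: Download all the articles

In this step we extract the URLs and titles of all the articles on the first page. We no longer use the find() approach from Step 1, since it is not flexible enough. Regular expressions are the better tool here.

First we gather some data. Looking at the source, every article URL matches the form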

<a title="..." target="_blank" href=" http://na.com.cn....html ">

so we can use the regular expression

r'<a title=".+" target="_blank" href="(http://na\.com\.cn.+\.html)">'
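
To see what the capture group returns, the pattern can be tried on a small in-memory page before pointing it at the real site. This is only a sketch: the three sample lines and their titles are invented for illustration, and each list entry is assumed to sit on its own line.

```python
import re

# Pattern from the text: the parentheses capture only the article URL.
PATTERN = re.compile(
    r'<a title=".+" target="_blank" href="(http://na\.com\.cn.+\.html)">')

# Invented sample page: two list entries and one unrelated link.
sample_page = '\n'.join([
    '<span class="atc_title"><a title="First" target="_blank" '
    'href="http://na.com.cn/s/blog_aaa.html">First</a></span>',
    '<span class="atc_title"><a title="Second" target="_blank" '
    'href="http://na.com.cn/s/blog_bbb.html">Second</a></span>',
    '<a href="http://na.com.cn/about.html">About</a>',
])

# findall() returns the captured group for every match, in page order.
urls = PATTERN.findall(sample_page)
print(urls)
```

Because . does not match a newline, each match stays within one line. If several entries shared a line, the greedy .+ would span them, and a non-greedy .+? is a common adjustment.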

With that, the rest is easy. The code below crawls all of Han Han's articles and saves them locally as .html files.

Code:

#coding:utf-8
import os
import re
import urllib2

def getPageURLs(url):
  # Return the article URLs found on one list page.
  text = urllib2.urlopen(url).read()
  pattern = r'<a title=".+" target="_blank" href="(http://na\.com\.cn.+\.html)">'
  regex = re.compile(pattern)
  urlList = re.findall(regex,text)
  return urlList

def getStore(cnt,url):
  # Download one article and save it as HanhanArticle/<cnt>.html.
  text = urllib2.urlopen(url)
  context = text.read()
  text.close()
  filename = 'HanhanArticle/'+str(cnt) + '.html'
  f = open(filename,'wb')
  f.write(context)
  f.close()

def getAllURLs():
  if not os.path.isdir('HanhanArticle'):
    os.makedirs('HanhanArticle')      # the output directory must exist
  urls = []
  cnt = 0
  for i in xrange(1,8):               # list pages 1 through 7
    urls.append('http://na.com.cn/s/articlelist_1191258123_0_'+str(i)+'.html')
  for url in urls:
    tmp = getPageURLs(url)
    for i in tmp:
      cnt += 1                        # running number used as the file name
      getStore(cnt,i)

if __name__ == '__main__':
  getAllURLs()
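
The bookkeeping in the listing above (which list pages get visited and how the output files get numbered) can be checked without touching the network. A small sketch, where the helper names list_page_urls and numbered_path and the article_urls placeholders are made up for illustration:

```python
# URL template taken from the list-page address used in the article.
LIST_URL = 'http://na.com.cn/s/articlelist_1191258123_0_%d.html'

def list_page_urls(first, last):
    """Build the list-page URLs for pages first..last (both inclusive)."""
    return [LIST_URL % n for n in range(first, last + 1)]

def numbered_path(directory, index):
    """File name used for the index-th downloaded article."""
    return '%s/%d.html' % (directory, index)

pages = list_page_urls(1, 7)           # same pages as xrange(1, 8) above

# Placeholder article URLs standing in for what a crawl would return.
article_urls = ['url-a', 'url-b', 'url-c']
plan = [(numbered_path('HanhanArticle', i), u)
        for i, u in enumerate(article_urls, 1)]   # numbering starts at 1

print(len(pages))
print(plan[0])
```

Using enumerate(..., 1) gives the same running count as the manual cnt += 1, with one less variable to maintain.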

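Naming each file with a number is not ideal, and two problems are still unsolved. How do we extract the article title, which involves handling Chinese text? And how do we extract the article body and save it as a .txt file?

Only when both problems are solved can we really say we have downloaded Han Han's blog with a web crawler.

(1) Extracting the article title

To make this easier we parse the page with BeautifulSoup. The title element of the HTML contains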

<title> 東望洋_韓寒_新浪博客 </title>

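Converting this element to a string and slicing it, roughly string[7 : -28], gives the article title: the opening <title> tag is 7 characters long, and the trailing _韓寒_新浪博客</title> takes up 28 bytes in UTF-8.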

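from bs4 import BeautifulSoup

# Assumes files 1.html through 316.html were saved by the previous script.
for i in xrange(1,317):
  filename = 'HanhanArticle/' + str(i) + '.html'
  html = open(filename,'r')
  soup = BeautifulSoup(html, 'html.parser')
  html.close()
  title = soup.find('title')
  string = str(title)                 # UTF-8 byte string in Python 2
  article = string[7 : -28].decode('utf-8')
  if article and article[0] != '.':   # skip empty titles and titles starting with '.'
    print article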
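
The slice bounds 7 and -28 are byte counts, which is easy to confirm on the sample title. The sketch below checks them and also shows an alternative that splits on the underscore instead of counting bytes. The helper title_from_text is hypothetical, and the "title_author_site" layout is assumed from the single example above.

```python
# -*- coding: utf-8 -*-

# The title element from the text, encoded as the UTF-8 bytes the page would contain.
raw = u'<title>東望洋_韓寒_新浪博客</title>'.encode('utf-8')

prefix = len(u'<title>'.encode('utf-8'))            # 7
suffix = len(u'_韓寒_新浪博客</title>'.encode('utf-8'))  # 28: each CJK character is 3 bytes

# Slicing the bytes first and decoding afterwards recovers the title.
sliced = raw[7:-28].decode('utf-8')

def title_from_text(text):
    """Drop the trailing '_author_site' parts without counting bytes."""
    return text.rsplit(u'_', 2)[0]

split_based = title_from_text(u'東望洋_韓寒_新浪博客')

print(prefix)
print(suffix)
print(sliced == split_based)
```

The split-based version keeps working if the text is already decoded, and rsplit with a limit of 2 leaves any underscores inside the article title itself untouched.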

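However, some titles need further processing. For example, <<ONE IS ALL>> should really be read as 《ONE IS ALL》.

Another example is 中央電視臺很*很**, where the * characters are not allowed in a file name. The full script below simply skips titles that contain *.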

#coding:utf-8
import os
import re
import urllib2
from bs4 import BeautifulSoup

def getPageURLs(url):
  # Return the article URLs found on one list page.
  text = urllib2.urlopen(url).read()
  pattern = r'<a title=".+" target="_blank" href="(http://na\.com\.cn.+\.html)">'
  regex = re.compile(pattern)
  urlList = re.findall(regex,text)
  return urlList

def getStore(title,url):
  # Download one article and save it as HanhanArticle/<title>.html.
  text = urllib2.urlopen(url)
  context = text.read()
  text.close()
  filename = 'HanhanArticle/'+ title + '.html'
  f = open(filename,'wb')
  f.write(context)
  f.close()

def getTitle(url):
  # Fetch the article page and slice the title out of its <title> element.
  html = urllib2.urlopen(url).read()
  soup = BeautifulSoup(html, 'html.parser')
  title = soup.find('title')
  string = str(title)                 # UTF-8 byte string in Python 2
  return string[7 : -28]

def Judge(title):
  # A title is usable as a file name only if it contains no '*'.
  return '*' not in title

def getAllURLs():
  if not os.path.isdir('HanhanArticle'):
    os.makedirs('HanhanArticle')      # the output directory must exist
  urls = []
  for i in xrange(1,8):               # list pages 1 through 7
    urls.append('http://na.com.cn/s/articlelist_1191258123_0_'+str(i)+'.html')
  for url in urls:
    tmp = getPageURLs(url)
    for i in tmp:
      title = getTitle(i).decode('utf-8')
      print title
      if title and title[0] != '.' and Judge(title):
        getStore(title,i)

if __name__ == '__main__':
  getAllURLs()
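
The script above drops any article whose title contains *. An alternative is to rewrite the title into a usable file name instead of skipping it. The sketch below is one possible approach, not the original author's: clean_title is a hypothetical helper, and the set of rejected characters is the one Windows reserves in file names.

```python
# -*- coding: utf-8 -*-
import re

# Characters Windows does not allow in file names: \ / : * ? " < > |
ILLEGAL = re.compile(u'[\\\\/:*?"<>|]')

def clean_title(title, replacement=u'_'):
    """Turn an article title into a string that is safe to use as a file name."""
    # Restore the book-title marks first, before '<' and '>' get replaced.
    title = title.replace(u'<<', u'《').replace(u'>>', u'》')
    return ILLEGAL.sub(replacement, title).strip()

print(clean_title(u'<<ONE IS ALL>>') == u'《ONE IS ALL》')
print(clean_title(u'中央電視臺很*很**') == u'中央電視臺很_很__')
```

With a helper like this, the Judge check could become a call that rewrites the title before getStore, so that no article is skipped because of its name.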