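One of my clients has the following requirement: users upload files in doc, docx, xls, pdf, or txt format, and PHP has to read the contents of these files and count the words in them.
1. Reading DOC files with PHP
PHP has no built-in class or library for reading Word files, so here we use the antiword package (http://www.winfield./) to read doc files.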
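First, here is how to use it on Windows:
1. Open http://www.winfield./ (the antiword download page), find the Windows section (http://www.winfield./#Windows), and download the Windows version of antiword (antiword-0_37-windows.zip);
2. Extract the downloaded file to the root of drive C;
One more thing to note: the file at http://www.informatik./~markus/antiword/00README.WIN contains the installation instructions for Windows.
You need to set environment variables: My Computer (right-click) -> Advanced -> Environment Variables -> create a new entry under the user variables at the top
Variable name: HOME
Variable value: c:\home (this directory should exist; if it does not, create a home folder on drive C).
Then, under the system variables, edit Path and add %HOME%\antiword to the front of its value.
3. Start -> Run -> CMD, then change to the antiword directory;
Type antiword -h to see the result.
4. Next, use the antiword -t command to read the content of a doc file. First copy a doc file into the c:\antiword directory, then run
>antiword -t filename.doc
The content of the Word file is printed to the screen.
You may be asking what this has to do with reading Word files from PHP. Don't worry, let's look at how to use this command from PHP.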
<?php
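// Single quotes keep the backslashes literal (in double quotes, "\xa" would become a newline)
$file = 'D:\xampp\htdocs\word_count\uploads\doc-english.doc';
// escapeshellarg() quotes the path so that spaces do not split the command
$content = shell_exec('c:\antiword\antiword -f ' . escapeshellarg($file));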
?>
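This reads the content of the Word file into $content.
To read doc files on Linux, download the Linux version of the archive; it contains a readme.txt file, and you can install it by following the instructions there.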
$content = shell_exec ( "/usr/local/bin/antiword -f $file" );
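Since the file path ends up inside a shell command, a path with spaces or shell metacharacters can break the call. Below is a minimal sketch of a safer wrapper, assuming antiword is installed at the Linux path used above; the function names `buildAntiwordCommand` and `readDocWithAntiword` are hypothetical and not part of the original code.

```php
<?php
/**
 * Build the antiword command line (hypothetical helper).
 * Only the file argument is escaped; the binary path is assumed to be
 * a trusted value configured by the developer.
 */
function buildAntiwordCommand($antiword, $file)
{
    return $antiword . ' -f ' . escapeshellarg($file);
}

/**
 * Read a .doc file through antiword and always return a string.
 */
function readDocWithAntiword($file, $antiword = '/usr/local/bin/antiword')
{
    if (!is_file($file)) {
        return ''; // nothing to read
    }
    // shell_exec() returns null when the command fails or prints nothing
    $output = shell_exec(buildAntiwordCommand($antiword, $file));
    return is_string($output) ? $output : '';
}
```

Returning an empty string instead of null keeps later string functions from having to deal with a missing result.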
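2. Reading PDF files with PHP
PHP has no dedicated library for reading PDF content either, so we use a third-party package (xpdf). Again starting with Windows: download it and extract it to the root of drive C.
Start -> Run -> cmd -> cd /d c:\xpdf
<?php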
// Single quotes keep the backslashes in the Windows path literal
$file = 'D:\xampp\htdocs\word_count\uploads\pdf-english.pdf';
$content = shell_exec ( "c:\\xpdf\\pdftotext $file -" );
?>
This reads the content of the PDF file into a PHP variable.
Installation on Linux is also simple, so it is not described in detail here.
<?php
$content = shell_exec ( "/usr/bin/pdftotext $file -" );
?>
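shell_exec() cannot distinguish an empty PDF from a failed conversion. The sketch below uses exec() instead, which also reports the exit status. It assumes pdftotext is at the Linux path shown above; `readPdfText` is a hypothetical name, and the `-enc UTF-8` option is added here on the assumption that UTF-8 output is wanted.

```php
<?php
/**
 * Extract the text of a PDF with pdftotext.
 * Returns the text, or null when the file is missing or the tool fails.
 */
function readPdfText($file, $pdftotext = '/usr/bin/pdftotext')
{
    if (!is_file($file)) {
        return null;
    }
    // The trailing "-" tells pdftotext to write to standard output
    $cmd = $pdftotext . ' -enc UTF-8 ' . escapeshellarg($file) . ' -';
    $lines = array();
    $status = 1;
    exec($cmd, $lines, $status);
    // A zero exit status means the conversion succeeded
    return $status === 0 ? implode("\n", $lines) : null;
}
```

A caller can then treat null as "could not read" and report that to the user instead of silently counting zero words.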
3. Reading ZIP files with PHP
First extract the zip file with PHP's zip support, then read the extracted files: Word files are read with antiword and PDF files with xpdf.
<?php
/**
 * Read the supported files inside a ZIP archive.
 *
 * @param string $file path to the zip file
 * @return string concatenated content of all supported files
 */
function ReadZIPFile($file = '') {
    $content = "";
    $inValidFileName = array ();
    $zip = new ZipArchive ( );
    if ($zip->open ( $file ) === true) {
        // Extract into a directory named after the archive, next to it
        $dir = pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME );
        for($i = 0; $i < $zip->numFiles; $i ++) {
            $entry = $zip->getNameIndex ( $i );
            // Grouping the alternatives anchors every extension to the end of the name
            if (preg_match ( '#\.(txt|docx?|pdf)$#i', $entry )) {
                $zip->extractTo ( $dir, array ( $entry ) );
                $content .= CheckSystemOS ( $dir . "/" . $entry );
            } else {
                $inValidFileName [$i] = $entry;
            }
        }
        $zip->close ();
        rrmdir ( $dir );
        /*if (file_exists ( $file )) { unlink ( $file ); }*/
        return $content;
    } else {
        return "";
    }
}
?>
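The check for which archive entries to read can also be written without a regular expression. The following is a sketch only: `isSupportedEntry` is a hypothetical helper, and rejecting every name that contains ".." is a deliberately conservative assumption, made so that an entry cannot point outside the extraction directory.

```php
<?php
/**
 * Decide whether a zip entry should be extracted and read.
 */
function isSupportedEntry($entry, array $allowed = array('txt', 'doc', 'docx', 'pdf'))
{
    // Skip directory entries and names that try to climb out of the target directory
    if (substr($entry, -1) === '/' || strpos($entry, '..') !== false) {
        return false;
    }
    // Only the final extension counts, compared case-insensitively
    $ext = strtolower(pathinfo($entry, PATHINFO_EXTENSION));
    return in_array($ext, $allowed, true);
}
```

With this helper, a name such as notes.doc.exe is rejected because only its last extension is considered.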
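4. Reading DOCX files with PHP
A docx file is really a collection of XML files; the document text is stored in word/document.xml.
Take any docx file and open it as a zip archive (or change the docx extension to zip and extract it).
Inside the word directory there is a file named document.xml.
The content of the docx file lives in document.xml, so we only need to read that file.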
<?php
/**
 * Read a docx file.
 *
 * @param string $file file path
 * @return string file content
 */
function parseWord($file) {
    $content = "";
    $zip = new ZipArchive ( );
    if ($zip->open ( $file ) === true) {
        // Extract into a directory named after the docx file, next to it
        $dir = pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME );
        for($i = 0; $i < $zip->numFiles; $i ++) {
            $entry = $zip->getNameIndex ( $i );
            if (pathinfo ( $entry, PATHINFO_BASENAME ) == "document.xml") {
                $zip->extractTo ( $dir, array ( $entry ) );
                $filepath = $dir . "/" . $entry;
                // Drop the XML markup and keep only the text
                $content = strip_tags ( file_get_contents ( $filepath ) );
                break;
            }
        }
        $zip->close ();
        rrmdir ( $dir );
        return $content;
    } else {
        return "";
    }
}
?>
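Two details of this approach affect the word count: strip_tags() alone joins the last word of one paragraph to the first word of the next, and extracting to disk is not strictly needed. The sketch below reads word/document.xml straight from the archive with ZipArchive::getFromName() and inserts a line break at each paragraph end. The names `readDocxText` and `docxXmlToText` are hypothetical.

```php
<?php
/**
 * Turn the XML of word/document.xml into plain text.
 */
function docxXmlToText($xml)
{
    // Add a newline after each paragraph so adjacent paragraphs do not merge
    $xml = str_replace('</w:p>', "</w:p>\n", $xml);
    // Remove the markup, then decode XML entities such as &amp;
    return html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
}

/**
 * Read the text of a docx file without extracting it to disk.
 */
function readDocxText($file)
{
    $zip = new ZipArchive();
    if ($zip->open($file) !== true) {
        return '';
    }
    // getFromName() returns the entry content, or false if the entry is missing
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? '' : docxXmlToText($xml);
}
```

Because nothing is written to disk, no temporary directory has to be cleaned up afterwards.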
If you want to create docx files with PHP, or convert docx files to XHTML or PDF, you can use phpdocx (http://www./).
5. Reading TXT files with PHP
Simply use PHP's file_get_contents function.
<?php
$file = 'D:\xampp\htdocs\word_count\uploads\eng.txt';
$content = file_get_contents($file);
?>
6. Reading Excel files with PHP
http://phpexcel./
So far we have only read the file contents. How do we count the words?
PHP has a built-in function, str_word_count, that counts words, but when it is applied to doc content extracted by antiword the result can be far off.
Here we use the following function, written specifically for counting words:
<?php
/**
 * Count the words in a text.
 *
 * @param string $text text content of the file
 * @return int number of words in the text
 */
function StatisticWordsCount($text = '') {
    $text = ( string ) $text; // shell_exec may return null
    // $text = trim ( preg_replace ( '/\d+/', ' ', $text ) ); // remove numbers
    $text = str_replace ( str_split ( '|' ), '', $text ); // remove these chars (you can specify more)
    // $text = str_replace ( str_split ( '-' ), '', $text ); // remove these chars (you can specify more)
    $text = preg_replace ( '/-{2,}/', '', $text ); // remove 2 or more dashes in a row
    // collapse whitespace last, so removed characters cannot leave double spaces behind
    $text = trim ( preg_replace ( '/\s+/', ' ', $text ) );
    $len = strlen ( $text );
    if (0 === $len) {
        return 0;
    }
    // every remaining space separates two words
    $words = 1;
    while ( $len -- ) {
        if (' ' === $text [$len]) {
            ++ $words;
        }
    }
    return $words;
}
?>
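The same whitespace-based idea can be expressed more compactly with preg_split(). This is an alternative sketch rather than the function used in the article; `countWordsBySplit` is a hypothetical name, and unlike the function above it does not strip "|" characters or dash runs first.

```php
<?php
/**
 * Count whitespace-separated tokens in a text.
 */
function countWordsBySplit($text)
{
    // PREG_SPLIT_NO_EMPTY drops the empty pieces produced by leading or trailing whitespace
    $tokens = preg_split('/\s+/', (string) $text, -1, PREG_SPLIT_NO_EMPTY);
    return count($tokens);
}
```

Unlike str_word_count, which only treats runs of letters (plus apostrophes and hyphens) as words, this counts every whitespace-separated token, including numbers.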
The complete code is as follows:
<?php
/**
 * Detect the server OS (Windows or not) and read the file with the matching tool.
 *
 * @param string $file file path including the file name
 * @return string file content
 */
function CheckSystemOS($file = '') {
    $content = "";
    // $type = substr ( $file, strrpos ( $file, '.' ) + 1 );
    $type = pathinfo ( $file, PATHINFO_EXTENSION );
    // global $UNIX_ANTIWORD_PATH, $UNIX_XPDF_PATH;
    if (strtoupper ( substr ( PHP_OS, 0, 3 ) ) === 'WIN') {
        // this is a server using windows
        $antiword = 'c:\antiword\antiword';
        $pdftotext = 'c:\xpdf\pdftotext';
    } else {
        // this is a server not using windows
        $antiword = '/usr/local/bin/antiword';
        $pdftotext = '/usr/bin/pdftotext';
    }
    $arg = escapeshellarg ( $file ); // quote the path so spaces do not split the command
    switch (strtolower ( $type )) {
        case 'doc' :
            $content = shell_exec ( "$antiword -f $arg" );
            break;
        case 'docx' :
            $content = parseWord ( $file );
            break;
        case 'pdf' :
            $content = shell_exec ( "$pdftotext $arg -" );
            break;
        case 'zip' :
            $content = ReadZIPFile ( $file );
            break;
        case 'txt' :
            $content = file_get_contents ( $file );
            break;
    }
    /*if (file_exists ( $file )) { @unlink ( $file ); }*/
    return ( string ) $content; // failed reads become an empty string
}
/**
 * Count the words in a text.
 *
 * @param string $text text content of the file
 * @return int number of words in the text
 */
function StatisticWordsCount($text = '') {
    $text = ( string ) $text; // shell_exec may return null
    // $text = trim ( preg_replace ( '/\d+/', ' ', $text ) ); // remove numbers
    $text = str_replace ( str_split ( '|' ), '', $text ); // remove these chars (you can specify more)
    // $text = str_replace ( str_split ( '-' ), '', $text ); // remove these chars (you can specify more)
    $text = preg_replace ( '/-{2,}/', '', $text ); // remove 2 or more dashes in a row
    // collapse whitespace last, so removed characters cannot leave double spaces behind
    $text = trim ( preg_replace ( '/\s+/', ' ', $text ) );
    $len = strlen ( $text );
    if (0 === $len) {
        return 0;
    }
    // every remaining space separates two words
    $words = 1;
    while ( $len -- ) {
        if (' ' === $text [$len]) {
            ++ $words;
        }
    }
    return $words;
}
/**
 * Read a docx file.
 *
 * @param string $file file path
 * @return string file content
 */
function parseWord($file) {
    $content = "";
    $zip = new ZipArchive ( );
    if ($zip->open ( $file ) === true) {
        // Extract into a directory named after the docx file, next to it
        $dir = pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME );
        for($i = 0; $i < $zip->numFiles; $i ++) {
            $entry = $zip->getNameIndex ( $i );
            if (pathinfo ( $entry, PATHINFO_BASENAME ) == "document.xml") {
                $zip->extractTo ( $dir, array ( $entry ) );
                $filepath = $dir . "/" . $entry;
                // Drop the XML markup and keep only the text
                $content = strip_tags ( file_get_contents ( $filepath ) );
                break;
            }
        }
        $zip->close ();
        rrmdir ( $dir );
        return $content;
    } else {
        return "";
    }
}
/**
 * Read the supported files inside a ZIP archive.
 *
 * @param string $file path to the zip file
 * @return string concatenated content of all supported files
 */
function ReadZIPFile($file = '') {
    $content = "";
    $inValidFileName = array ();
    $zip = new ZipArchive ( );
    if ($zip->open ( $file ) === true) {
        // Extract into a directory named after the archive, next to it
        $dir = pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME );
        for($i = 0; $i < $zip->numFiles; $i ++) {
            $entry = $zip->getNameIndex ( $i );
            // Grouping the alternatives anchors every extension to the end of the name
            if (preg_match ( '#\.(txt|docx?|pdf)$#i', $entry )) {
                $zip->extractTo ( $dir, array ( $entry ) );
                $content .= CheckSystemOS ( $dir . "/" . $entry );
            } else {
                $inValidFileName [$i] = $entry;
            }
        }
        $zip->close ();
        rrmdir ( $dir );
        /*if (file_exists ( $file )) { unlink ( $file ); }*/
        return $content;
    } else {
        return "";
    }
}
/**
 * Remove a directory recursively.
 *
 * @param string $dir directory path
 */
function rrmdir($dir) {
    if (is_dir ( $dir )) {
        $objects = scandir ( $dir );
        foreach ( $objects as $object ) {
            if ($object != "." && $object != "..") {
                if (filetype ( $dir . "/" . $object ) == "dir") {
                    rrmdir ( $dir . "/" . $object );
                } else {
                    unlink ( $dir . "/" . $object );
                }
            }
        }
        rmdir ( $dir );
    }
}
// Usage
$file = 'D:\xampp\htdocs\word_count\uploads\pdf-german.zip';
$word_number = StatisticWordsCount ( CheckSystemOS ( $file) );
?>