Learning the Tidy Extension Library in PHP

硬核项目经理 2021-10-28



Many of you have probably never heard of this extension. No, it has nothing to do with teddy bears: Tidy is an extension for working with HTML, used mainly to format and display content in HTML, XHTML, and XML.

About the Tidy library

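The Tidy extension ships with PHP. That means we can add --with-tidy when compiling PHP to install it along with everything else, or build it afterwards from the tidy directory under ext/ in the PHP source package. The extension also depends on the tidy library, which has to be installed on the operating system. On CentOS, yum install libtidy-devel is all it takes.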
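Before running any of the examples, it can help to confirm that the extension is actually loaded. The following is a minimal sketch; the helper name tidyIsAvailable() is made up for illustration, while extension_loaded() and class_exists() are core PHP functions.

```php
<?php
// Hypothetical helper (not part of the tidy extension): report whether the
// tidy extension is available in the running PHP build.
function tidyIsAvailable(): bool
{
    // extension_loaded() checks the loaded extension list;
    // class_exists() double-checks that the tidy class is registered.
    return extension_loaded('tidy') && class_exists('tidy');
}

echo tidyIsAvailable()
    ? 'tidy is loaded' . PHP_EOL
    : 'tidy is missing: build PHP with --with-tidy or install the extension' . PHP_EOL;
```

Running php -m on the command line and looking for tidy in the list gives the same information.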

Formatting with Tidy

First, let's see how to format a piece of HTML with the Tidy extension.

$content = <<<EOF
<html><head><title>test</title></head> <body><p>error<br>another line</i></body>
</html>
EOF;

$tidy = new Tidy();
$config = [
    'indent' => true,
    'output-xhtml' => true,
];
$tidy->parseString($content, $config);
$tidy->cleanRepair();

echo $tidy, PHP_EOL;
// <html xmlns="http://www./1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

By echoing the tidy object we created, we get the formatted HTML. Doesn't it look neat? Both the xmlns attribute and the indentation follow the standard form.

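Here parseString() is called with two arguments (an optional third sets the encoding). The first is the string to format. The second is the formatting configuration: it takes an array whose keys must be options defined by the Tidy library. These options can be looked up at the second link at the end of this article. We only set two of them here: indent controls whether block-level tags are indented, and output-xhtml controls whether the output is written as XHTML.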
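To illustrate how other options plug into the same array, here is a small sketch that cleans an HTML fragment. The helper name cleanFragment() is hypothetical, and the two extra options, show-body-only and wrap, are my additions rather than something the original test code uses; check them against the option reference for your libtidy version.

```php
<?php
// Hypothetical helper: repair an HTML fragment and return it as a string.
// Returns null when the tidy extension is not installed.
function cleanFragment(string $html): ?string
{
    if (!extension_loaded('tidy')) {
        return null;
    }

    $config = [
        'indent'         => true, // indent block-level tags (as in the article)
        'output-xhtml'   => true, // write XHTML (as in the article)
        'show-body-only' => true, // assumption: output only the body content
        'wrap'           => 0,    // assumption: 0 disables line wrapping
    ];

    $tidy = new tidy();
    $tidy->parseString($html, $config, 'utf8'); // third argument: encoding
    $tidy->cleanRepair();

    // tidy_get_output() returns the cleaned markup as a plain string.
    return tidy_get_output($tidy);
}

var_dump(cleanFragment('<p>error<br>another line</i>'));
```

The exact output depends on the installed libtidy version, so none is shown here.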

cleanRepair() runs the clean-and-repair pass over the parsed content. In other words, it does the tidying-up part of the formatting.

Note that the test code echoes the Tidy object directly, which means the object implements __toString(). What it really looks like is this.

var_dump($tidy);
// object(tidy)#1 (2) {
//     ["errorBuffer"]=>
//     string(112) "line 1 column 1 - Warning: missing <!DOCTYPE> declaration
//   line 1 column 70 - Warning: discarding unexpected </i>"
//     ["value"]=>
//     string(195) "<html xmlns="http://www./1999/xhtml">
//     <head>
//       <title>
//         test
//       </title>
//     </head>
//     <body>
//       <p>
//         error<br />
//         another line
//       </p>
//     </body>
//   </html>"
//   }

Getting document information

var_dump($tidy->isXml()); // bool(false)

var_dump($tidy->isXhtml()); // bool(false)

var_dump($tidy->getStatus()); // int(1)

var_dump($tidy->getRelease());  // string(10) "2017/11/25"

var_dump($tidy->getHtmlVer()); // int(500)

Through the methods of the Tidy object we can get information about the document being processed, such as whether it is XML or XHTML.

getStatus() returns the status of the Tidy object. The 1 here means there are warnings or accessibility errors. The dump of the Tidy object above shows the same thing: its errorBuffer property contains warning messages.
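The status codes can be turned into readable labels with a small lookup. This is a sketch: describeTidyStatus() is a made-up helper name, and the meanings of 0, 1, and 2 follow the PHP manual entry for tidy::getStatus().

```php
<?php
// Hypothetical helper: map the integer from tidy::getStatus() to a label.
// 0, 1 and 2 are the values documented in the PHP manual.
function describeTidyStatus(int $status): string
{
    switch ($status) {
        case 0:
            return 'no errors or warnings';
        case 1:
            return 'warnings or accessibility errors';
        case 2:
            return 'errors';
        default:
            return 'unknown status'; // not a documented value
    }
}

echo describeTidyStatus(1), PHP_EOL; // warnings or accessibility errors
```

With a parsed document, the call would be describeTidyStatus($tidy->getStatus()).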

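getRelease() returns the version of the Tidy library, that is, the tidy library installed on your operating system. getHtmlVer() returns the detected HTML version. I could not find any further explanation or documentation for the 500 here, so I don't know what it means.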

Besides the above, we can also read back the settings from $config and their descriptions.

var_dump($tidy->getOpt('indent')); // int(1)

var_dump($tidy->getOptDoc('output-xhtml'));
// string(489) "This option specifies if Tidy should generate pretty printed output, writing it as extensible HTML. <br/>This option causes Tidy to set the DOCTYPE and default namespace as appropriate to XHTML, and will use the corrected value in output regardless of other sources. <br/>For XHTML, entities can be written as named or numeric entities according to the setting of <code>numeric-entities</code>. <br/>The original case of tags and attributes will be preserved, regardless of other options. "

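getOpt() takes one argument, the name of the option to look up. If the option was not set in $config, the value returned is the default. getOptDoc() is a thoughtful touch: it returns the documentation for a given option.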

Finally, some more practical methods that work with nodes directly.

echo $tidy->head(), PHP_EOL;
// <head>
//   <title>
//   test
// </title>
// </head>

$body = $tidy->body();

var_dump($body);
// object(tidyNode)#2 (9) {
//     ["value"]=>
//     string(60) "<body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>"
//     ["name"]=>
//     string(4) "body"
//     ["type"]=>
//     int(5)
//     ["line"]=>
//     int(1)
//     ["column"]=>
//     int(40)
//     ["proprietary"]=>
//     bool(false)
//     ["id"]=>
//     int(16)
//     ["attribute"]=>
//     NULL
//     ["child"]=>
//     array(1) {
//       [0]=>
//       object(tidyNode)#3 (9) {
//         ["value"]=>
//         string(37) "<p>
// ………………
// ………………

echo $tidy->html(), PHP_EOL;
// <html xmlns="http://www./1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

echo $tidy->root(), PHP_EOL;
// <html xmlns="http://www./1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

This hardly needs explaining: head() returns the <head> element, body() and html() return their corresponding elements, and root() returns the whole root node, which can be treated as the entire document.

Each of these methods actually returns a TidyNode object, which we will look at in detail later.

Converting directly to a string

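All the code above is built on parseString(). It has no useful return value, or rather it only returns a boolean indicating success or failure. To get the formatted content, we have to either treat the object as a string or call root(). There is, however, another method that returns the formatted string directly.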

$tidy = new Tidy();
$repair = $tidy->repairString($content, $config);

echo $repair, PHP_EOL;
// <html xmlns="http://www./1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

repairString() takes exactly the same parameters as parseString(). The only difference is that it returns a string instead of working on the Tidy object's internal state.

Error messages

In the very first test, when we printed the Tidy object with var_dump(), we saw that the errorBuffer property contained error messages. This time let's try an HTML snippet with more problems.

$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www./TR/xhtml1/DTD/xhtml1-strict.dtd">

<p>paragraph</p>
HTML;
$tidy = new Tidy();
$tidy->parseString($html);
$tidy->cleanRepair();

echo $tidy->errorBuffer, PHP_EOL;
// line 4 column 1 - Warning: <p> isn't allowed in <head> elements
// line 4 column 1 - Info: <head> previously mentioned
// line 4 column 1 - Warning: inserting implicit <body>
// line 4 column 1 - Warning: inserting missing 'title' element

$tidy->diagnose();
echo $tidy->errorBuffer, PHP_EOL;
// line 4 column 1 - Warning: <p> isn't allowed in <head> elements
// line 4 column 1 - Info: <head> previously mentioned
// line 4 column 1 - Warning: inserting implicit <body>
// line 4 column 1 - Warning: inserting missing 'title' element
// Info: Doctype given is "-//W3C//DTD XHTML 1.0 Strict//EN"
// Info: Document content looks like XHTML 1.0 Strict
// Tidy found 3 warnings and 0 errors!

This test uses a new method, diagnose(). It runs diagnostic tests on the document and appends more information about the document to the errorBuffer property.
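Since errorBuffer is just a multi-line string, it can be split into structured records when the messages need to be processed further. The sketch below assumes the "line N column M - Level: text" layout seen in the output above; parseTidyMessages() is a made-up helper name, and other libtidy versions may format messages differently. Lines without a position, such as the summary lines added by diagnose(), are skipped.

```php
<?php
// Hypothetical helper: split a tidy errorBuffer string into records.
// Assumes the "line N column M - Level: text" format shown in the article.
function parseTidyMessages(string $errorBuffer): array
{
    $messages = [];
    // \R matches any newline sequence (\n, \r\n, \r).
    foreach (preg_split('/\R/', $errorBuffer) as $line) {
        if (preg_match('/^line (\d+) column (\d+) - (\w+): (.*)$/', trim($line), $m)) {
            $messages[] = [
                'line'   => (int) $m[1],
                'column' => (int) $m[2],
                'level'  => $m[3], // e.g. "Warning" or "Info"
                'text'   => $m[4],
            ];
        }
    }
    return $messages;
}

$sample = "line 4 column 1 - Warning: inserting implicit <body>\n"
        . "Info: Document content looks like XHTML 1.0 Strict";
print_r(parseTidyMessages($sample)); // one record; the summary line is skipped
```

With a real document, the call would be parseTidyMessages($tidy->errorBuffer).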

Working with TidyNode

As mentioned earlier, head(), html(), body(), and root() all return a TidyNode object. So what is special about this object?

$html = <<<EOF
<html><head>
<?php echo '<title>title</title>'; ?>
<#
  /* JSTE code */
  alert('Hello World');
#>
</head>
<body>

<?php
  // PHP code
  echo 'hello world!';
?>

<%
  /* ASP code */
  response.write("Hello World!")
%>

<!-- Comments -->
Hello World
</body></html>
Outside HTML
EOF;

$tidy = new Tidy();
$tidy->parseString($html);

$tidyNode = $tidy->html();

showNodes($tidyNode);

function showNodes($node){

    if($node->isComment()){
        echo '========', PHP_EOL,'This is Comment Node :"', $node->value, '"', PHP_EOL;
    }
    if($node->isText()){
        echo '--------', PHP_EOL,'This is Text Node :"', $node->value, '"', PHP_EOL;
    }
    if($node->isAsp()){
        echo '++++++++', PHP_EOL,'This is Asp Script :"', $node->value, '"', PHP_EOL;
    }
    if($node->isHtml()){
        echo '********', PHP_EOL,'This is HTML Node :"', $node->value, '"', PHP_EOL;
    }
    if($node->isPhp()){
        echo '########', PHP_EOL,'This is PHP Script :"', $node->value, '"', PHP_EOL;
    }
    if($node->isJste()){
        echo '@@@@@@@@', PHP_EOL,'This is JSTE Script :"', $node->value, '"', PHP_EOL;
    }

    if($node->name){
        // getParent()
        if($node->getParent()){
            echo '&&&&&&&& ', $node->name ,' getParent is : ', $node->getParent()->name, PHP_EOL;
        }

        // hasSiblings
        echo '^^^^^^^^ ', $node->name, ' has siblings is : ';
        var_dump($node->hasSiblings());
        echo PHP_EOL;
    }

    if($node->hasChildren()){
        foreach($node->child as $child){
            showNodes($child);
        }
    }
}

// ………………
// ………………
// ********
// This is HTML Node :"<head>
// <?php echo '<title>title</title>'; ><#
//   /* JSTE code */
//   alert('Hello World');
// #>
// <title></title>
// </head>
// "
// &&&&&&&& head getParent is : html
// ^^^^^^^^ head has siblings is : bool(true)
// ………………
// ………………
// ++++++++
// This is Asp Script :"<%
//   /* ASP code */
//   response.write("Hello World!")
// %>" 
// ………………
// ………………

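I won't walk through the test steps or explain every function one by one. As the code shows, a TidyNode object can tell us about each node: whether it has children, whether it has siblings, and what kind of node it is, for example a comment, text, JSTE code, PHP code, or ASP code. I don't know how you feel after reading this far, but I find this really interesting, especially the methods that detect PHP code.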
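The same recursive walk can collect data instead of printing it. Here is a minimal sketch that gathers the names of all named nodes in document order. The helper collectTagNames() is made up for illustration; it relies only on the name and child properties that the var_dump() of a TidyNode shows above, so it works on any object with that shape.

```php
<?php
// Hypothetical helper: walk a node tree depth-first and collect the names
// of all nodes that have a non-empty name. Works on tidyNode objects or any
// object exposing "name" and "child" (an array of nodes, or null).
function collectTagNames(object $node, array &$names = []): array
{
    if (!empty($node->name)) {
        $names[] = $node->name;
    }
    // child is NULL for leaf nodes, so fall back to an empty list.
    foreach ($node->child ?? [] as $child) {
        collectTagNames($child, $names);
    }
    return $names;
}

// Usage with tidy, guarded so the script still runs without the extension.
if (extension_loaded('tidy')) {
    $tidy = new tidy();
    $tidy->parseString('<html><head><title>t</title></head><body><p>hi</p></body></html>');
    $root = $tidy->html();
    if ($root !== null) {
        print_r(collectTagNames($root));
    }
}
```

Passing the accumulator by reference avoids merging arrays at every level of the recursion.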

Statistics functions

Lastly, let's look at some of the counting functions in the Tidy extension.

$html = <<<EOF
<p>test</i>
<bogustag>bogus</bogustag>
EOF;
$config = array('accessibility-check' => 3,'doctype'=>'bogus');
$tidy = new Tidy();
$tidy->parseString($html, $config);

echo 'tidy access count: ', tidy_access_count($tidy), PHP_EOL;
echo 'tidy config count: ', tidy_config_count($tidy), PHP_EOL;
echo 'tidy error count: ', tidy_error_count($tidy), PHP_EOL;
echo 'tidy warning count: ', tidy_warning_count($tidy), PHP_EOL;

// tidy access count: 4
// tidy config count: 2
// tidy error count: 1
// tidy warning count: 6

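All of these return counts of diagnostic messages. tidy_access_count() is the number of accessibility warnings encountered, tidy_config_count() is the number of configuration errors, and the other two are self-explanatory from their names.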

Summary

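In short, Tidy is another extension that is not very common but quite interesting. In some scenarios, such as template development, it can still be put to good use. Treat it as a learning exercise and dig a little deeper; it might turn out to solve exactly the problem that is troubling you most right now.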

Test code:

https://github.com/zhangyue0503/dev-blog/blob/master/php/2021/01/source/8.一起学习PHP中的Tidy扩展库.php

References:

https://www./manual/zh/book.tidy.php

http://tidy./docs/quickref.html
