Tech - View Topic

Continuing the Fight Against RSS Junk Code


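I had only just written "The Decisive Battle Against RSS Junk Code" when I subscribed to a new feed, once again sourced from MSN Space, and it brought a fresh pile of junk markup. So it was back to work on my miniature Tidy.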

After half an hour of bloody battle, the new junk code has been wiped out. The new version of the miniature Tidy follows:

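function clean_summary($strSummary) {
    // Drop an unterminated trailing tag left behind when the summary was cut off
    $arrMatchings = array(
        array("<", ">")
        // The "&" entity check further down makes a truncated-entity check unnecessary
        //array("&", ";")
    );
    foreach ($arrMatchings as $arrMatching) {
        $intL = strrpos($strSummary, $arrMatching[0]);
        if ($intL !== false) {
            $intR = strrpos($strSummary, $arrMatching[1]);
            // Also truncate when there is no closing delimiter at all
            if ($intR === false || $intL > $intR) $strSummary = substr($strSummary, 0, $intL);
        }
    }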
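    // Normalise line breaks and drop closing font/span tags, case-insensitively
    // so that they stay in step with the case-insensitive patterns below
    $strSummary = str_ireplace(array(
        "<br>", "<br/>", "</font>", "</span>"
    ), array(
        "<br />", "<br />", "", ""
    ), $strSummary);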
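    // Patterns are applied in order; each entry pairs with the same index in $arrReplacements
    $arrPatterns = array(
        '#<script\b[^>]*/>#is',               // self-closing script tags
        '#<script\b[^>]*>.*?</script>#is',    // script elements, non-greedy
        '#<iframe\b[^>]*/>#is',               // self-closing iframe tags
        '#<iframe\b[^>]*>.*?</iframe>#is',    // iframe elements, non-greedy
        '#<img[^>]*dynsrc[^>]*>#is', // Using an img tag to play music is a joke; remove it
        '#<img([^>]*)([^/])>#is',             // img tags that are not self-closed
        '#(?<=\s)(height|width)="?(\d+)(?:px)?"?(?=[\s/>])#is', // quote plain numeric sizes
        '#\s(id|border)=[^\s/>]*#is',         // id and border attributes
        '#style="[^"]*"#is',
        '#<font[^>]*>#is',
        '#<span[^>]*>#is',
        '#<div[^>]*></div>#is',
        '#<p[^>]*></p>#is',
        '#<img[^>]*height="0"[^>]*width="0"[^>]*>#is' // invisible zero-size images
    );
    // Single-quoted so that $1 and $2 reach preg_replace as backreferences
    // (in a double-quoted string "\1" is an octal escape, not a backreference)
    $arrReplacements = array(
        '', '', '', '', '',
        '<img$1$2 />',
        '$1="$2"',
        '', '', '', '', '', '', ''
    );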
    $strSummary = preg_replace($arrPatterns, $arrReplacements, $strSummary);
    // Escape every "&" that does not begin a known entity or a numeric character reference
    $arrEntities = array("&nbsp;", "&amp;", "&lt;", "&gt;", "&quot;");
    $intAnd = -1;
    while (($intAnd = strpos($strSummary, "&", $intAnd + 1)) !== false) {
        $intSemicolon = strpos($strSummary, ";", $intAnd);
        // The longest accepted reference (e.g. "&#1114111;") spans 10 characters
        if ($intSemicolon !== false && $intSemicolon - $intAnd <= 9) {
            $strEntity = substr($strSummary, $intAnd, $intSemicolon - $intAnd + 1);
            if (in_array($strEntity, $arrEntities, true)) continue;
            if (preg_match('/^&#(?:\d+|x[0-9a-fA-F]+);$/', $strEntity)) continue;
        }
        $strSummary = substr_replace($strSummary, "&amp;", $intAnd, 1);
    }
    // Add an empty alt attribute to img tags that lack one
    $intStart = -1;
    while (($intStart = stripos($strSummary, "<img", $intStart + 1)) !== false) {
        $intEnd = strpos($strSummary, ">", $intStart);
        if ($intEnd === false) break; // unterminated tag, nothing more to fix
        $strTag = substr($strSummary, $intStart, $intEnd - $intStart + 1);
        if (false !== stripos($strTag, "alt=")) {
            $intStart = $intEnd;
            continue;
        }
        $strSummary = substr_replace($strSummary, '<img alt=""', $intStart, 4);
    }
    return $strSummary;
}
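
The ampersand-escaping loop inside clean_summary can also be written as a single regular expression with a negative lookahead. What follows is only a minimal sketch under stated assumptions, not the code used on this site: the helper name escape_bare_ampersands is hypothetical, it accepts the same five named entities plus numeric references, and treating hexadecimal references such as &#x2019; as valid is an assumption of the sketch.

```php
<?php
// Hypothetical helper (not part of the original post): escape every "&"
// that does not start a whitelisted entity or a numeric character reference.
function escape_bare_ampersands($strText) {
    return preg_replace(
        // "&" NOT followed by: a whitelisted name, decimal digits,
        // or "x" plus hex digits (an assumption), and then ";"
        '/&(?!(?:nbsp|amp|lt|gt|quot|#\d+|#x[0-9a-fA-F]+);)/',
        '&amp;',
        $strText
    );
}

echo escape_bare_ampersands('AT&T &copy; &amp; &#8217; a=1&b=2;'), "\n";
// prints: AT&amp;T &amp;copy; &amp; &#8217; a=1&amp;b=2;
```

Because the lookahead consumes nothing, only the "&" itself is replaced, so text that is already escaped passes through unchanged and the helper can be run more than once on the same string.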
Last modified by Wen on 2005-08-20 10:08:09

Posted on 2005-08-18 16:01:22