Google Blogoscoped

Friday, June 25, 2004

PHP5 Spider

I wrote a PHP5 spider. Give it any URL and this web application will recursively follow the first few links on every page it finds (the first three per page in the source below). The script uses PHP5's DOMDocument class and its loadHTMLFile method, along with XPath queries through DOMXPath, to pull the links out of each page.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
        xml:lang="en" lang="en">
<head>
    <title>PHP5 Spidering by Google Blogoscoped</title>
    <link rel="stylesheet" href="default.css" type="text/css"
            media="screen, projection" />
</head>
<body>

<?php

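$g_spiderCount = 0;      // calls to spiderUrl() so far
$g_spiderMax = 1000;     // stop fetching after this many calls
$g_spiderFirstMax = 3;   // follow only the first N new links per page
$g_urlsDone = '';        // URLs already seen, stored as '[[url]]' markers
$g_maxLevel = 3;         // maximum recursion depth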

spiderUrl('http://blog.outer-court.com');

function spiderUrl($url, $level = 0)
{
    global $g_spiderCount;
    global $g_spiderMax;
    global $g_maxLevel;

    $g_spiderCount++;

    if (! ($g_spiderCount > $g_spiderMax || $level > $g_maxLevel) )
    {
        $dom = new DOMDocument();
        @$dom->loadHTMLFile($url);
    
        $xpath = new DOMXPath($dom);
    
        echo '<ul>';
        $xNodes = @$xpath->query('//body//a[@href]');
        $linkCount = 0;
        foreach ($xNodes as $xNode)
        {
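            // The references are declared in handleLink()'s signature;
            // call-time pass-by-reference (&$var) is an error in PHP 5.4+.
            handleLink($xNode, $level, $url, $linkCount);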
        }
        echo '</ul>';
    }
}

?>

</body>
</html><?php

function handleLink(&$xNode, $level, $url, &$linkCount)
{
    global $g_spiderFirstMax;

    $subtype = @$xNode->firstChild->nodeName;
    $sLinkcontent = '';
    if ($subtype == 'img')
    {
        $sLinkcontent = @$xNode->firstChild->getAttribute('alt');
    }
    else if ($subtype == '#text')
    {
        $sLinkcontent = @$xNode->firstChild->data;
    }

    $sLinkurl = $xNode->getAttribute('href');
    if ( $sLinkcontent != '' && $sLinkurl != ''
            && stripos($sLinkurl, 'javascript:') === false
            && stripos($sLinkurl, 'ftp://') === false
            && stripos($sLinkurl, 'mailto:') === false )
    {
        $fullhref = getAbsoluteUrl($url, $sLinkurl);

        if ( !getUrlWasSpidered($fullhref) )
        {
            if (++$linkCount <= $g_spiderFirstMax)
            {
                $linkdisplay = str_replace('http://', '', $fullhref);
                $linkdisplay = reduce($linkdisplay);

                echo '<li>';
                // Escape the URL before writing it into the markup.
                echo ' <a href="' . htmlspecialchars($fullhref) . '">' .
                        htmlentities($sLinkcontent) . '</a><br />';
                echo '<span class="url">' . htmlspecialchars($linkdisplay) .
                        '</span><br />';
                spiderUrl($fullhref, $level + 1);
                echo '</li>';
            }
        }
    }
}

function getUrlWasSpidered($fullhref)
{
    global $g_urlsDone;

    $sCheck = '[[' . $fullhref . ']]';
    $urlWasSpidered = ! (stripos($g_urlsDone, $sCheck) === false);
    if (!$urlWasSpidered)
    {
        $g_urlsDone .= $sCheck;
    }

    return $urlWasSpidered;
}

function getAbsoluteUrl($absolute, $relative)
{
    if (preg_match(',^(https?://|ftp://|mailto:|news:),i',
                $relative))
        return $relative;

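    $url = parse_url($absolute);
    // A bare host such as "http://example.com" has no path component.
    $path = isset($url['path']) ? $url['path'] : '';

    if (substr($path, -1) == '/')
        $dir = substr($path, 0, -1);
    else
        // dirname() yields "/" (or "\" on Windows) for top-level files;
        // trim it so the result does not contain a double slash.
        $dir = rtrim(dirname($path), '/\\');

    if (substr($relative, 0, 1) == '/') {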
        $relative = substr($relative, 1);
        $dir = '';
    }
    elseif (substr($relative, 0, 2) == './')
        $relative = substr($relative, 2);
    else while (substr($relative, 0, 3) == '../') {
        $relative = substr($relative, 3);
        $dir = substr($dir, 0, strrpos($dir, '/'));
    }

    return sprintf('%s://%s%s/%s', $url['scheme'],
            $url['host'], $dir, $relative);
}

function reduce($url, $length = 50)
{
    // Shorten long URLs for display and mark every cut with an ellipsis.
    if ( strlen($url) > $length )
    {
        return substr($url, 0, $length) . ' ...';
    }
    return $url;
}

?>
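
The script above remembers visited URLs by appending '[[url]]' markers to one long string and searching it with stripos. As a minimal sketch of an alternative, not part of the original script, the same check can be done with an associative array keyed by URL, so each lookup is a hash lookup instead of a scan over everything seen so far. The function name urlWasSpidered and the $urlsDone parameter are made up for this illustration; the lower-casing is an assumption that mirrors the case-insensitive stripos comparison.

```php
<?php
// Hypothetical alternative to getUrlWasSpidered(): visited URLs are kept
// as array keys instead of markers inside one long string.
function urlWasSpidered($fullhref, array &$urlsDone)
{
    // Lower-case the key to mirror the case-insensitive stripos() check.
    $key = strtolower($fullhref);
    $seen = isset($urlsDone[$key]);
    if (!$seen) {
        $urlsDone[$key] = true;   // remember the URL for later calls
    }
    return $seen;
}

$urlsDone = array();
var_dump(urlWasSpidered('http://example.com/', $urlsDone)); // bool(false)
var_dump(urlWasSpidered('http://example.com/', $urlsDone)); // bool(true)
var_dump(urlWasSpidered('http://EXAMPLE.com/', $urlsDone)); // bool(true)
```

Passing the array by reference keeps the sketch free of globals; inside the spider it could just as well stay a global like $g_urlsDone.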

Technorati Tricked

How easy is it to get into a top 100 blog list? Apparently, over at Technorati, all it takes is clever link farming. One webmaster put up blogs with almost no real content and heavily cross-linked them. As a result, two sites in the top 10 seem like they don’t belong there: “Paris Hilton Video” at number six, and “Daily News Update” (nothing but uncommented links) at number eight.

Update: The sites in question have now been removed from the index, as Technorati developer Kevin Marks reports in the forum.

The Wisdom of Crowds

“Each link to a page counts as a vote. Google is a republic, rather than a pure democracy; sites that have more links into them are effectively given more voting power. But the principle is fundamentally democratic–let the masses decide. (...)

How does this work? What Google is relying on is something I call the wisdom of crowds: Under the right circumstances, groups are smarter, make better decisions and are better at solving problems than even the smartest people within them. (...)

You can see this phenomenon at work in examples ranging from the trivial to the genuinely weighty. Finance professor Jack L. Treynor, for instance, devised an experiment in which he asked his students to guess how many jellybeans were in a jar and found that the group’s average guess was off by just 2% even though very, very few of the students were that close. Or consider the show Who Wants to Be a Millionaire. When a contestant on the show is stumped by a question, he has a couple of choices in asking for help: the audience or someone he’s designated as an expert. The experts do a reasonable job: They get the answer right 65% of the time. But the audience is close to perfect: It gets the answer right 91% of the time, even though it’s made up of people who have nothing better to do than sit in a TV studio and watch Regis Philbin.”
– James Surowiecki, author of The Wisdom of Crowds, Mass Intelligence (Forbes), 05.24.04

[Illustration via Gimp Public Domain Photo Archive.]
