Google Blogoscoped

Friday, October 29, 2004

Handwriting

Yahoo Facelift

If you’ve been to Yahoo lately, you’ll have seen that its front page was redesigned. It’s still somewhat messy, but certainly a fresh look.

Screen-Scraping Meta-Search

Here is an example of a screen-scraping meta-search engine written in PHP5. It uses new PHP5 features such as easy HTML-to-XML conversion and XPath. The array declared at the top of the "performSearch" function determines how the search is applied: each entry is a semicolon-separated string containing the title of the search engine to be screen-scraped; its URL, with the placeholder [keyword]; and the XPath expression intended to grab the first relevant link.
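
Before the full script, a minimal sketch of how one such entry can be taken apart and its placeholder filled in. The helper name parseEntry and the sample keyword are made up for illustration; the script below does the same thing inline with explode() and str_replace().

```php
<?php
// Hypothetical helper, not part of the original script: splits one
// "title;url;xpath" entry and fills in the [keyword] placeholder.
function parseEntry($entry, $keyword)
{
    // Limit to 3 parts so a ';' inside the XPath would not split it further.
    list($title, $url, $xpath) = explode(';', $entry, 3);

    // urlencode() makes the keyword safe to use in a query string.
    $url = str_replace('[keyword]', urlencode($keyword), $url);

    return array('title' => $title, 'url' => $url, 'xpath' => $xpath);
}

$entry = 'Google;http://www.google.com/search?q=[keyword];//p[@class="g"]//a[@href]';
$parsed = parseEntry($entry, 'php 5');

echo $parsed['title'], "\n"; // Google
echo $parsed['url'], "\n";   // http://www.google.com/search?q=php+5
echo $parsed['xpath'], "\n"; // //p[@class="g"]//a[@href]
```

Because the three fields are separated by semicolons, neither the URL nor the XPath of an entry should contain a semicolon of its own unless the split is limited to three parts, as assumed here.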

<?php

header("Content-type: text/html; charset=utf-8");
$q = ( isset($_GET['q']) ) ? $_GET['q'] : '';

?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <title>Streaming Screen-scraped Search Results</title>
    <link rel="stylesheet" href="default.css" type="text/css" media="screen, projection, print" />
</head>
<body>

<h1>Streaming Screen-scraped Search Results</h1>

<form action="./" method="get"><div>
<input type="text" size="20" name="q" value="<?= toAttribute($q) ?>" /> <input type="submit" value="Search" />
</div></form>

<?php

if ($q != '')
{
    performSearch($q);
}

?>
</body>
</html><?php

function performSearch($keyword)
{
    $arrTitleUrlXpath = array(
            'Google;http://www.google.com/search?q=[keyword];//p[@class="g"]//a[@href]',
            'Yahoo;http://search.yahoo.com/search?p=[keyword];//ol[@start="1"]//a[@href]',
            'MSN;http://search.msn.com/results.aspx?q=[keyword];//li/a[@class="t"]'
            );
    
    $dom = new DOMDocument();
    
    foreach ($arrTitleUrlXpath as $titleUrlXpath)
    {
        list($title, $url, $sXpath) = explode(';', $titleUrlXpath);
        $url = str_replace( '[keyword]', urlencode($keyword), $url );

        echo '<div class="loading">Loading <a href="' . $url . '">' . $title . '</a> result...</div>';
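        // Flush PHP's own output buffer first (if one is active), then push the data to the client.
        if (ob_get_level() > 0) { ob_flush(); }
        flush();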
    
        @$dom->loadHTMLFile($url);
        
        $xpath = new DOMXPath($dom);
        $xNodes = $xpath->query($sXpath);
    
        foreach ($xNodes as $xNode)
        {
            $sLinktext = @$xNode->firstChild->data;
            $sLinkurl = $xNode->getAttribute('href');
            if ($sLinktext != '' && $sLinkurl != '')
            {
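                // The DOM returns decoded text, so re-escape it before writing it into the page.
                echo '<div class="result"><a href="' . toAttribute($sLinkurl) . '">' . toXml($sLinktext) . '</a> ';
                echo '(Top result from ' . $title . ')</div>';
                echo "\r\n";
                if (ob_get_level() > 0) { ob_flush(); }
                flush();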
            }
            break;
        }
    }
}

function toAttribute($s)
{
    $s = toXml($s);
    $s = str_replace('"', '&quot;', $s);

    return $s;
}

function toXml($s)
{
    $s = str_replace('&', '&amp;', $s);
    $s = str_replace('<', '&lt;', $s);
    $s = str_replace('>', '&gt;', $s);

    return $s;
}

?>
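
To try the XPath step without hitting a live search engine, the sketch below runs the same kind of query against a small inline HTML string. The markup and the example.com URLs are made up, and loadHTML() stands in for loadHTMLFile(); real result pages will look different and change over time, so each XPath has to be adjusted to the site being scraped.

```php
<?php
// Made-up markup standing in for a fetched result page.
$html = '<html><body>'
      . '<p class="g"><a href="http://example.com/first">First result</a></p>'
      . '<p class="g"><a href="http://example.com/second">Second result</a></p>'
      . '</body></html>';

$dom = new DOMDocument();
// Real pages are rarely valid markup; @ hides the parser warnings, as in the script.
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//p[@class="g"]//a[@href]');

// Only the first match is of interest; item(0) is null if nothing matched.
$first = $nodes->item(0);
$linkUrl = $first ? $first->getAttribute('href') : '';
$linkText = $first ? $first->textContent : '';

// Escape both values before echoing them into a page.
echo htmlspecialchars($linkUrl), ' ', htmlspecialchars($linkText), "\n";
```

Here textContent is used instead of firstChild->data, so link text wrapped in further tags (bold keywords, for instance) would still be picked up.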

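This is the stylesheet. To display a "loading" message for the second or so it takes to grab the result from another site, a div block of class "loading" is used. It is then overlaid with a div block of class "result" using relative positioning: the result block is shifted up by 60px, which is the rendered height of the loading block (50px height plus 5px padding at top and bottom).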

body
{
    background-color: white;
    color: black;
    margin: 20px;
    padding: 0;
    font-family: arial, helvetica, sans-serif;
}

a
{
    color: blue;
}

.loading
{
    font-style: italic;
    background-color: #eee;
    height: 50px;
    width: 700px;
    padding: 5px;
    overflow: hidden;
    margin-bottom: 0;
}

.result
{
    background-color: #fff;
    height: 50px;
    width: 700px;
    padding: 5px;
    overflow: hidden;
    position: relative;
    top: -60px;
}

You can also see the above running on my site.

This site unofficially covers Google™ and more with some rights reserved. Join our forum!