Google Blogoscoped

Tuesday, November 28, 2006

More Custom Search Engine Tips & Tricks

John Biundo and Eric Enge of Stone Temple Consulting have both worked in the tech industry for over two decades. Eric is also the creator of the Custom Search Guide.

In Creating Advanced Custom Search Engines, we showed you how to use features like context and annotation files to unlock advanced capabilities of Google Custom Search Engines (CSEs).

In this post, we’ll explore some additional advanced CSE tricks you can do now that you’re familiar with the basic techniques of editing context and annotation files. The code associated with each of the examples is presented at the bottom of the post.

Labeling the World Wide Web

A surprisingly little-known fact is that Google, and some trusted partners, have quietly annotated a large number of web sites with standard labels. The fruits of this labor are quite visible in Google if you search on medical terms like “prostate cancer” or “bird flu”.

The results you see, with the “refinement interface” prominently displayed above the list of search results, are another aspect of Google Topics, the platform that underlies Google Custom Search Engines. The same refinement UI you see here is available to authors of CSEs.

See Custom Search Engine Topic Refinement for a detailed tutorial on how to take advantage of the powerful topic refinement user interface provided by the Google Topics platform in your own CSEs.

The labels used in Google’s main search have an additional special property. Unlike the labels you create in your own CSE, the Google labels are public, which means any Custom Search Engine can use them. This turns out to be a very powerful concept. Examples 1 and 2 in the Sample Code section below show two ways to exploit this power.

In addition to the medical domain, Google (with an impressive list of partners) has annotated many other “Topics”, including destinations, autos, computers and video games, and other areas. It’s reasonable to expect that this annotation will continue; in fact, Google has a program that encourages users to help with the massive task of annotating the web.

More Fun with Standard Labels

In addition to these domain-specific labels, Google has labeled thousands of sites with labels like “blogs”, “faq”, “forums”, “news”, “reviews”, and “stores”. With these labels, and a little imagination, you can do all kinds of clever things. Examples 3 through 5 in the Sample Code section below show a few of them.

Sample Code

Below are the relevant context files for each of the examples described above, along with some discussion.

Example 1: How to boost, in your results, all the sites that have been labeled “medical_authorities” or “for_health_professionals” by the Google Health Topic community.

Sample context file for example 1:

<?xml version="1.0" encoding="UTF-8" ?>
<CustomSearchEngine version="1.0" volunteers="false"
keywords="medical" Title="med" Description="A Medical Search Engine" language="en">
   <Context>
      <BackgroundLabels>
         <Label name="medical_authorities" mode="BOOST" weight="0.69999999" />
         <Label name="for_health_professionals" mode="BOOST" weight="0.69999999" />
      </BackgroundLabels>
   </Context>
</CustomSearchEngine>

Adding the two labels to the BackgroundLabels section of your context file will do the trick. Be careful about using any “FILTER” type labels in your context file, as these may eliminate the Google-labeled sites.

Let’s explore this a bit further. Say you have annotated a large number of medical sites yourself, and you want those sites to define the “universe” of results for your CSE. Within that universe, you want to promote the Google-labeled “medical_authorities” and “for_health_professionals” sites. In this case, you would use a “FILTER” type label to limit the CSE results to only those sites you’ve labeled. Within that set, the Google-labeled sites will be boosted. In this scenario, sites that you have not specifically annotated will not be included, even if Google has labeled them with one of the standard labels.
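As a rough sketch of this scenario, the context file below filters on a label of your own and boosts the two Google labels. The label name “my_medical_sites” is hypothetical; it stands for whatever label your annotation file assigns to your own sites. Everything else mirrors Example 1, and the file has not been tested against a live CSE.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<CustomSearchEngine version="1.0" volunteers="false" keywords="medical" Title="med" Description="A Medical Search Engine" language="en">
  <Context>
    <BackgroundLabels>
      <!-- Hypothetical label from your own annotation file: defines the "universe" of results -->
      <Label name="my_medical_sites" mode="FILTER" />
      <!-- Google's public labels: boosted within that universe -->
      <Label name="medical_authorities" mode="BOOST" weight="0.69999999" />
      <Label name="for_health_professionals" mode="BOOST" weight="0.69999999" />
    </BackgroundLabels>
  </Context>
</CustomSearchEngine>
```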

One final note: the boost weight selected is 0.69999999. The choice is somewhat arbitrary. This is the weight Google uses when you generate a default CSE with the CSE “wizard” and choose to search the whole web but boost your selected set of sites. There is very little documentation on how to choose the right weight value, and since this seems to be the default number that Google has settled upon, it’s what I chose.

Example 2: How to use standard labels to create the same kind of refinement interface that Google uses in its main search for medical terms.

Sample context file for example 2:

<?xml version="1.0" encoding="UTF-8" ?>
<CustomSearchEngine version="1.0" volunteers="false" keywords="medical" Title="med" Description="A Medical Search Engine" language="en">
  <Context>
    <Title>Health Topic</Title>
    <Facet>
      <Title>Condition info</Title>
      <FacetItem>
        <Label name="condition_overview" mode="FILTER" />
        <Title>Overview</Title>
      </FacetItem>
      <FacetItem>
        <Label name="condition_symptoms" mode="FILTER" />
        <Title>Symptoms</Title>
      </FacetItem>
      <FacetItem>
        <Label name="tests_diagnosis" mode="FILTER" />
        <Title>Tests/diagnosis</Title>
      </FacetItem>
      <FacetItem>
        <Label name="condition_treatment" mode="FILTER" />
        <Title>Treatment</Title>
      </FacetItem>
      <FacetItem>
        <Label name="causes_risk_factors" mode="FILTER" />
        <Title>Causes/risk factors</Title>
      </FacetItem>
    </Facet>
    <Facet>
      <Title>Drug info</Title>
      <FacetItem>
        <Label name="drug_uses" mode="FILTER" />
        <Title>Drug uses</Title>
      </FacetItem>
      <FacetItem>
        <Label name="drug_side_effects" mode="FILTER" />
        <Title>Side effects</Title>
      </FacetItem>
      <FacetItem>
        <Label name="drug_warning_recalls" mode="FILTER" />
        <Title>Warnings/recalls</Title>
      </FacetItem>
    </Facet>
    <Facet>
      <Title>For doctors</Title>
      <FacetItem>
        <Label name="research_overview" mode="FILTER" />
        <Title>Research overview</Title>
      </FacetItem>
      <FacetItem>
        <Label name="practice_guidelines" mode="FILTER" />
        <Title>Practice guidelines</Title>
      </FacetItem>
      <FacetItem>
        <Label name="patient_handouts" mode="FILTER" />
        <Title>Patient handouts</Title>
      </FacetItem>
      <FacetItem>
        <Label name="health_continuing_education" mode="FILTER" />
        <Title>Continuing education</Title>
      </FacetItem>
      <FacetItem>
        <Label name="clinical_trials" mode="FILTER" />
        <Title>Clinical trials</Title>
      </FacetItem>
    </Facet>
    <Facet>
      <Title>Info type</Title>
      <FacetItem>
        <Label name="medical_authorities" mode="FILTER" />
        <Title>From medical authorities</Title>
      </FacetItem>
      <FacetItem>
        <Label name="alternative_medicine" mode="FILTER" />
        <Title>Alternative medicine</Title>
      </FacetItem>
      <FacetItem>
        <Label name="for_health_professionals" mode="FILTER" />
        <Title>For health professionals</Title>
      </FacetItem>
      <FacetItem>
        <Label name="for_patients" mode="FILTER" />
        <Title>For patients</Title>
      </FacetItem>
      <FacetItem>
        <Label name="health_support_groups" mode="FILTER" />
        <Title>Support groups</Title>
      </FacetItem>
    </Facet>
  </Context>
</CustomSearchEngine>

This context file defines a sophisticated topic refinement hierarchy, very much like the one used by Google for health-related searches. We haven’t mentioned any BackgroundLabels in this file, so by default, this CSE will search the whole web. Adding a combination of “FILTER”, “BOOST”, and “ELIMINATE” labels can allow you to very specifically define the scope and rankings used in your CSE, in addition to providing the advanced topic refinement user interface.
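For instance, a trimmed-down variant of the file above might pair a single facet with background labels. This is only a sketch: it reuses label names that appear elsewhere in this post, the choice of which labels to boost or eliminate is an arbitrary illustration, and it has not been tested against a live CSE.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<CustomSearchEngine version="1.0" volunteers="false" keywords="medical" Title="med" Description="A Medical Search Engine" language="en">
  <Context>
    <Title>Health Topic</Title>
    <!-- Refinement interface: one facet with two clickable refinements -->
    <Facet>
      <Title>Info type</Title>
      <FacetItem>
        <Label name="for_patients" mode="FILTER" />
        <Title>For patients</Title>
      </FacetItem>
      <FacetItem>
        <Label name="for_health_professionals" mode="FILTER" />
        <Title>For health professionals</Title>
      </FacetItem>
    </Facet>
    <!-- Scope and ranking: applied to queries by default -->
    <BackgroundLabels>
      <Label name="medical_authorities" mode="BOOST" weight="0.69999999" />
      <Label name="stores" mode="ELIMINATE" />
    </BackgroundLabels>
  </Context>
</CustomSearchEngine>
```

Here the background labels shape the default results (boosting medical authorities and dropping sites labeled “stores”), while the facet lets the searcher narrow those results further.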

Example 3: How to use standard labels to create a blog search engine.

Sample context file for example 3:

<?xml version="1.0" encoding="UTF-8" ?>
<CustomSearchEngine version="1.0" volunteers="false" keywords="blog" Title="bl2" Description="only blogs" language="en">
  <Context>
    <BackgroundLabels>
      <Label name="blogs" mode="FILTER" />
    </BackgroundLabels>
  </Context>
</CustomSearchEngine>

This one is pretty straightforward. It simply performs a filter on the standard label “blogs” – causing only those sites to be included in search results.

Example 4: How to use standard labels to create a domain-specific blog search engine.

Sample context file for example 4:

<?xml version="1.0" encoding="UTF-8" ?>
<CustomSearchEngine version="1.0" volunteers="false" keywords="blog" Title="bl2" Description="only blogs" language="en">
  <Context>
    <BackgroundLabels>
      <Label name="blogs" mode="FILTER" />
      <Label name="seo_site" mode="FILTER" />
    </BackgroundLabels>
  </Context>
</CustomSearchEngine>

The only difference between example 3 and example 4 is the addition of a second “FILTER” type label. The interaction between multiple “FILTER” labels can be thought of as “intersection”, or “AND”. Only URLs that meet all of the “FILTER” conditions (labeled “blogs” by Google, and labeled “seo_site” by the CSE owner) will be included in the results.

Note that this context file implies the existence of an annotation file where URLs/URL patterns have been associated with the label “seo_site”.
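A minimal annotation file along those lines might look like the sketch below. It assumes the Annotations/Annotation/Label layout of the annotation files covered in Creating Advanced Custom Search Engines; the URL patterns are placeholders, not real recommendations.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<Annotations>
  <!-- Placeholder URL patterns; replace with the SEO sites you want to include -->
  <Annotation about="www.example.com/*">
    <Label name="seo_site" />
  </Annotation>
  <Annotation about="blog.example.org/seo/*">
    <Label name="seo_site" />
  </Annotation>
</Annotations>
```

With the context file from example 4, a URL would need to match one of these patterns and also carry Google’s “blogs” label to appear in the results.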

Example 5: How to use standard labels to screen shopping/sales sites from your results.

Sample context file for example 5:

<?xml version="1.0" encoding="UTF-8" ?>
<CustomSearchEngine version="1.0" volunteers="false" keywords="health" Title="vitamins and health food" Description="A non-commercial search engine for vitamins and health food" language="en">
  <Context>
    <BackgroundLabels>
      <Label name="stores" mode="ELIMINATE" />
    </BackgroundLabels>
  </Context>
</CustomSearchEngine>

This example proved to be an interesting exercise. If you create a CSE using this context file and search on “vitamins”, you’ll see that it does indeed screen out some of the commercial sites that sell vitamins (compared to the same search on www.google.com). I was a bit skeptical, so I compared a search on “health food”, and found that a good number of commercial sites did, in fact, pass the screen and make it into my search results. This led me to wonder what kind of sites Google has labeled as “stores”, and how extensive this labeling is.

To investigate this question, I created another CSE, very similar but quite unusual. Take a look at the following context file.

<?xml version="1.0" encoding="UTF-8" ?>
<CustomSearchEngine version="1.0" volunteers="false" keywords="health" Title="vitamins and health food" Description="A non-commercial search engine for vitamins and health food" language="en">
  <Context>
    <Facet>
      <FacetItem>
        <Label name="stores" mode="FILTER" ignorebackgroundlabels="true" />
        <Title>stores</Title>
      </FacetItem>
    </Facet>
    <BackgroundLabels>
      <Label name="stores" mode="ELIMINATE" />
    </BackgroundLabels>
  </Context>
</CustomSearchEngine>

This context file does two somewhat contradictory things. It creates a refinement interface that lets me see all search results, for whatever term I search on (try “health food”), that have been labeled “stores”. It does this by specifying “FILTER” on the refinement label (the Label tag within the FacetItem element). Recall, however, that we also have a background label in place that eliminates all URLs labeled “stores”. Since a background label is, by default, in effect for all queries, our refinement would produce no results! Not terribly interesting. However, we can temporarily disable background labels with the very convenient “ignorebackgroundlabels” attribute (set to “true”). Doing this turns off all background labels while the refinement is active (i.e., when we click on the “stores” label in the refinement user interface).

All the above sounds more complicated than it really is. The easiest way to make sense of it is to simply create a test CSE, cut and paste the above code into a text file, and upload it (as the context file) on the advanced page of the control panel. Then try a few queries on the preview page.

After playing with this test CSE for a bit, it seems clear that Google has made a good start at labeling commercial sites, and that this could be very valuable for CSE builders. But it definitely needs more work. The good news is that you can create your own annotation file, using the “stores” label, and it is merged seamlessly with the standard set of Google annotations. So if you want to use this technique, you can simply set up a context file like the one shown in the example, then keep track of commercial sites that show up in your results, and annotate them with the “stores” label, and voila, they’ll disappear like magic.
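As a sketch of that last step, an annotation file that adds your own “stores” annotations might look like this. It assumes the Annotations/Annotation/Label layout of the annotation files covered in Creating Advanced Custom Search Engines, and the URL patterns are placeholders standing in for the commercial sites you spot in your results.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<Annotations>
  <!-- Placeholder patterns; each one marks a commercial site to be eliminated -->
  <Annotation about="shop.example.com/*">
    <Label name="stores" />
  </Annotation>
  <Annotation about="www.example.net/store/*">
    <Label name="stores" />
  </Annotation>
</Annotations>
```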
