Google Blogoscoped

Forum

Google Disrespecting Robots.txt?

Jan Piotrowski [PersonRank 1]

Saturday, October 29, 2005
18 years ago

Hey Philipp,

I see this differently and have written about it on my blog before, though only in German:
http://betamode.de/2005/09/25/google-missverstaendnisse-im-index-vs-gespidert/

So I don't consider Google's behavior wrong. If your test page does turn up in the normal search results, though, that would of course be a different matter. But I don't expect it to.

We'll see...

Andrew Hitchcock [PersonRank 10]

18 years ago #

Hmm, I wonder. If the information was just on the page itself, and you linked to it, then it would have to follow the link to find that out. However, you list it in your robots.txt as well, so it shouldn't load it at all.

Actually, I think all search engines have a similar behavior. For a long time I blocked MSN's and Yahoo's spiders from my website, but it still appeared (without snippets or a title). They know about the URL, but they aren't allowed to crawl it. However, the sites use the link text and URL text to allow you to search for it.

Hmm, now that I think about this, maybe this is what is happening: You link to the page, so Google knows about it. Google runs it past the robots.txt and finds that it can't spider it. Since it doesn't crawl it, it doesn't know that the page has a noindex attribute. Since it doesn't know about the noindex, it lists it in the database. Isn't robots.txt just for crawling/spidering pages? I guess we'll see what happens if you find it when you search for that unique string.

Andrew

Randy Charles Morin [PersonRank 1]

18 years ago #

I can tell you that Google does respect the robots.txt file, and let me tell you why. You see, my server is $15/month shared hosting. It gets swamped pretty easily when Google crawls my site. Occasionally, Google finds a new entry point into my site and begins a deep crawl, which amounts to a denial-of-service attack on my end until I fix the robots.txt file, which ends the attack almost immediately.

Philipp Lenssen [PersonRank 10]

18 years ago #

Yes, I know Google doesn't crawl the site. What I meant was there seems to be no way to tell Google "don't list this URL in your search results." But that's what I want to be sure of with this test.

Enrico Altavilla [PersonRank 0]

18 years ago #

Philipp, Google *will* show your page in the search results, if you disallow it in the robots.txt file.

The only way to prevent Google from showing your page is to *remove* the "Disallow: /please-ignore-me" rule in the robots.txt file and to use the "noindex" meta tag.
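
To make the interplay described above easier to follow, here is a toy model in Python. It is only a sketch of the behavior posters report in this thread, not Google's actual logic; the function name, its parameters, and the three status strings are made up for illustration, and it assumes the URL is already known to the search engine through a link.

```python
def listing_status(disallowed_in_robots_txt: bool, has_noindex_meta: bool) -> str:
    """Toy model of how a linked-to URL ends up in search results.

    Hypothetical helper: it mirrors the reasoning in this thread,
    not any documented search engine implementation.
    """
    if disallowed_in_robots_txt:
        # The crawler never fetches the page, so it never sees a
        # noindex meta tag. Only the URL and link text are known.
        return "URL-only listing"
    if has_noindex_meta:
        # The page was fetched, the noindex tag was read and obeyed.
        return "not listed"
    # The page was fetched and nothing asks for it to be left out.
    return "full listing"


for disallowed in (True, False):
    for noindex in (True, False):
        print(disallowed, noindex, "->", listing_status(disallowed, noindex))
```

In this model the noindex tag only has an effect when the robots.txt rule is absent, which is the point being made here.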

Philipp Lenssen [PersonRank 10]

18 years ago #

Interesting irony; if I want my page to be removed from the Google index, I must make sure Google crawls it, because it assumes by default I want a page to be in the index (even when I excluded it via the robots.txt).

By the way, what if it's not an HTML page, but e.g. a résumé as a PDF file? Is there an HTTP header for robots information?

Phil Ringnalda [PersonRank 1]

18 years ago #

Anything that doesn't accept a header with the same content as a meta element is broken: the whole point of "meta http-equiv" is that it be equivalent to the same HTTP header. (Well, originally it was supposed to tell your server to send that header, but servers don't want to parse HTML before they send it, and clients have to anyway, so they pretend you sent a header when they see an http-equiv.)
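
As a rough illustration of the header idea for non-HTML files: Google later documented an "X-Robots-Tag" HTTP response header that carries the same directives as the robots meta tag. The Python sketch below shows one way a server might attach it to PDF responses. The helper name, the suffix list, and the placeholder response body are assumptions for illustration only, and, as with the meta tag, the file must not be disallowed in robots.txt or the crawler will never see the header.

```python
from http.server import BaseHTTPRequestHandler
from urllib.parse import urlsplit

# Hypothetical policy: file types that should stay out of the index.
NOINDEX_SUFFIXES = (".pdf",)


def robots_headers(request_path: str) -> dict:
    """Return extra response headers for a request path (illustrative helper)."""
    path = urlsplit(request_path).path.lower()  # ignore any query string
    if path.endswith(NOINDEX_SUFFIXES):
        return {"X-Robots-Tag": "noindex"}
    return {}


class NoindexHandler(BaseHTTPRequestHandler):
    """Minimal handler that sends a placeholder body plus the robots header."""

    def do_GET(self):
        body = b"placeholder body\n"  # a real server would send the file here
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(body)))
        for name, value in robots_headers(self.path).items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, the handler could be passed to http.server.HTTPServer(("127.0.0.1", 8000), NoindexHandler) and started with serve_forever(); the address and port are arbitrary.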

alek [PersonRank 10]

18 years ago #

Philipp,

Perhaps I misunderstand what you said above, but I would suggest that robots.txt does not "tell" Google to NOT list you in their index – it tells the spider NOT to grab your content. I.e. if you have external links to your page, then the URL could certainly still be listed IMHO ... just no content since it wasn't spidered.

I think Enrico was saying this a different way ... and roger on the "noindex" approach. Good follow-up points/questions by you on the irony, plus yea, how the heck would you "de-index" non-HTML files?

alek

Seg [PersonRank 1]

18 years ago #

Your page will show in Google in "listing mode" because Googlebot records every URL it finds in an href. For example, try "site:www" in Google and you will see URLs linked by other sites even if this extension doesn't exist...
What you could do is put a rel="nofollow" on the source link (this should work, but I've never tried it).

Andrea Silver, andilinks [PersonRank 0]

18 years ago #

The item listed "disallow" in robots.txt will eventually disappear from the index but will remain for a time--probably long enough to do the damage you wanted to prevent.

The meta tags intended for googlebot will work but my experience has been that Yahoo slurp will obey this tag too, which may or may not be what you want...

Seg [PersonRank 1]

18 years ago #

Done : http://www.google.com/search?q=site%3Ablog.outer-court.com%2Fplease-ignore-me%2F

aaron wall [PersonRank 1]

18 years ago #

Google has over 20,000! pages indexed from my site.

the problem is that I only have around 1,000 and they evil-like throw my comments redirects in the search results.

A situation where Google is perhaps encouraging cloaking.

Seg [PersonRank 1]

18 years ago #

Now it's out! http://www.google.com/search?q=site%3Ablog.outer-court.com%2Fplease-ignore-me%2F

Philipp Lenssen [PersonRank 10]

18 years ago #

Hmmm....

Se james [PersonRank 1]

18 years ago #

The issue with Google and robots.txt sounds scary. It does a good job but could at the same time create fresh problems when indexing a site (using the robots file).

www.seferm.com

Se james [PersonRank 1]

17 years ago #

Is there anyone out there who has experience in creating a robots file and knows how to exclude specific folders or directories?

Many thanks

www.Seferm.com

Philipp Lenssen [PersonRank 10]

17 years ago #

Se James, try googling for [robots.txt tutorial] and check the top 10 results.

Ionut Alex. Chitu [PersonRank 10]

17 years ago #

http://www.robotstxt.org/wc/robots.html
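
For the question above about excluding specific folders, a minimal Python sketch might look like the following. It builds a robots.txt with one Disallow line per directory and then checks the result with the standard library's urllib.robotparser. The function name build_robots_txt and the directory names are made up for illustration, and the standard library parser is only a stand-in for how a well-behaved crawler would read the file.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_dirs, user_agent="*"):
    """Build robots.txt text that blocks the given directories (illustrative helper)."""
    lines = [f"User-agent: {user_agent}"]
    for directory in disallowed_dirs:
        name = directory.strip("/")
        if not name:
            continue  # skip empty entries rather than blocking the whole site
        # A trailing slash limits the rule to the directory's contents.
        lines.append(f"Disallow: /{name}/")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(["private", "/tmp/"])
print(robots_txt)

# Check the rules the way a compliant crawler would interpret them.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("AnyBot", "/private/notes.html"))
print(parser.can_fetch("AnyBot", "/public/index.html"))
```

The file itself has to be served from the root of the site, i.e. at /robots.txt, for crawlers to find it.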

kc [PersonRank 0]

17 years ago #

google does NOT obey robots.txt as it was intended. I just ran a test. Disallowed a file in robots.txt. If they grab the file, they get automatically banned. Lo and behold... google just got automatically banned because they grabbed the disallowed file.

NateDawg [PersonRank 10]

17 years ago #

The Google Webmaster Central Blog has responded to robots.txt complaints, and more specifically to Googlebot's behavior.
The one thing I picked out that seems pretty important is the way Googlebot handles rules addressed specifically to it.
<quote>
For instance, for this robots.txt file:

User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /cgi-bin/

Googlebot will crawl everything in the site other than pages in the cgi-bin directory.

For this robots.txt file:

User-agent: *
Disallow: /

Googlebot won't crawl any pages of the site.
</quote>

Link: http://googlewebmastercentral.blogspot.com/2006/08/all-about-googlebot.html
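
The two quoted files can be tried out with Python's standard urllib.robotparser, as a rough check of the group-selection rule described in the quote. This is only a sketch: the standard library parser is not Googlebot, the helper name parse_robots and the test paths are made up for illustration, and "SomeOtherBot" is a hypothetical crawler name.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text):
    """Parse robots.txt text into a RobotFileParser (illustrative helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# The first file from the quote: a catch-all group plus a Googlebot group.
first = parse_robots(
    "User-agent: *\n"
    "Disallow: /\n"
    "User-agent: Googlebot\n"
    "Disallow: /cgi-bin/\n"
)

# The second file from the quote: only the catch-all group.
second = parse_robots(
    "User-agent: *\n"
    "Disallow: /\n"
)

# With a dedicated group present, only that group's rules apply to Googlebot.
print(first.can_fetch("Googlebot", "/index.html"))
print(first.can_fetch("Googlebot", "/cgi-bin/search"))
print(first.can_fetch("SomeOtherBot", "/index.html"))

# Without a dedicated group, Googlebot falls back to the catch-all rules.
print(second.can_fetch("Googlebot", "/index.html"))
```

With this parser, the first file should let "Googlebot" fetch everything outside /cgi-bin/ while other crawlers are blocked entirely, and the second file should block "Googlebot" as well, matching the quote.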
