Google Blogoscoped

Sunday, February 6, 2005

10 URL Traps

There are many different things which can go wrong with URLs. I will list some of the biggest issues here.

1. Fictitious URLs

Sometimes there’s a need to make up a URL, for example in a novel. Even if you are careful enough to check whether it exists, and it doesn’t at the time you check, there is a chance that once the book goes to print someone grabs this URL hoping for traffic. The site may then be nothing like what a reader would expect, and may show unsettling content.

A good rule of thumb: even when you invent fictitious URLs, make sure they are either yours, or use the official “www.example.com” scheme of example URLs. (Example.com, Example.net and Example.org are reserved specifically so you can create sample links in tutorials and the like.)
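As a rough illustration of that rule of thumb, here is a minimal Python sketch that checks whether a made-up URL sits on a name reserved for examples. The function name and the sets are my own invention for this post; the list of reserved names follows RFC 2606 as I understand it, so treat it as an assumption and check the RFC before relying on it.

```python
from urllib.parse import urlsplit

# Names reserved for documentation and testing (assumed from RFC 2606).
RESERVED_DOMAINS = {"example.com", "example.net", "example.org"}
RESERVED_TLDS = {"example", "test", "invalid", "localhost"}


def is_reserved_example_url(url: str) -> bool:
    """Return True if the URL's host is a reserved example name."""
    # urlsplit only recognizes the host when a scheme is present, so add a
    # default one to bare strings like "www.example.com/page".
    if "://" not in url:
        url = "http://" + url
    host = (urlsplit(url).hostname or "").rstrip(".")
    labels = host.split(".")
    if labels[-1] in RESERVED_TLDS:
        return True
    # Match the reserved domain itself or any subdomain of it (e.g. "www.").
    return ".".join(labels[-2:]) in RESERVED_DOMAINS
```

A URL such as “www.example.com/really/long/url” passes this check, while an invented name on an ordinary domain does not, which is the signal to either register it or replace it.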

2. Permalinks using the post title

Permalinks are permanent links used in blogs and other news sites. Their purpose is to allow other sites to create links which will work now and in the future. Some blogging tools, however, use the title of a post as the permalink. But what if you want to change the post title, e.g. because you made a spelling error, or because further fact-checking revealed a critical error? Then, by changing the title, you’d break the permalink. Even if the blogging software were set up to maintain the original “titled” URL, and only change the title itself, the result would be sub-optimal, because the URL would then still contain the error.

A good solution here is to not use “titled” URLs (and contrary to common belief, it won’t hurt your search engine ranking in any crucial way either). As Tim Berners-Lee, inventor of the web, said: Cool URLs Don’t Change. (Actually, he refers to “URIs”, meaning Universal Resource Identifiers instead of Uniform Resource Locators, but that’s another story, and you will find more details in Tim Berners-Lee’s web-biography “Weaving the Web.”)
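To make the idea concrete, here is a small Python sketch of a permalink that depends only on a post’s ID. The Post class, its field names and the “/post/100.html” URL shape are assumptions for illustration, not how any particular blogging tool works.

```python
from dataclasses import dataclass


@dataclass
class Post:
    post_id: int  # assigned once when the post is created, never changed
    title: str    # free to edit later


def permalink(post: Post, base: str = "http://www.example.com") -> str:
    """Build the permalink from the immutable ID only, never from the title."""
    return f"{base}/post/{post.post_id}.html"


post = Post(post_id=100, title="10 URL Trapps")  # title contains a typo
before = permalink(post)
post.title = "10 URL Traps"                      # fix the typo
after = permalink(post)
assert before == after  # links created by other sites keep working
```

Because the title never enters the URL, correcting it later neither breaks the link nor leaves the error behind in the address.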

3. Tiny URLs

Some services allow users to set up shorter versions of long URLs. This may be so you can send out an email with the URL without it being too long to memorize, and without it being broken by a line-break. The URL may turn from “www.example.com/really/long/url/with/many/words.html” into “www.example.com/23782f/”.

“Tiny URLs” like these (for example those created using TinyURL.com) however have an inherent problem: you are using an additional service layer in-between your site and another. A direct link is less likely to break than a “tiny URL” because there are fewer links in the chain. Just imagine the “tiny URL” service you are using goes bankrupt and shuts down its servers 5 years from now; if you included the URL in a book or similar offline publication, you won’t even have the chance to automatically revise all instances. The reader just won’t be able to reach the other site anymore. Even worse, the “tiny URL” service could change its policies and start to display advertisement pop-ups or similar annoyances.

The solution here is to either live with very long URLs, or if the URL is under your control, to make sure you are using shorter URLs in the first place. Good style in emails is to put angle brackets around the URL so it wraps correctly, like this: <http://www.example.com>.

4. Not owning a URL

When you set up any site of importance to you, make sure you own the domain. Mostly, that means saying good-bye to a free solution a la Geocities, Tripod, or Blogspot. But why would you want to own the URL? Because otherwise, you don’t have full control over the future of your site. The site service may start changing its policy and delete your files; the site may start running pop-ups; the site may start charging a fee; the site may include ads right in your content, and often break your carefully crafted HTML along the way. (Some of these examples happened to my own sites in the past, at a time when I didn’t have my own domain.)

A very drastic problem of this sort happened to a public question & answer service some time ago: many researchers used a particular file upload service to post sample files (Excel data or similar helping bits) for display to the customers they answered. Then, this file server changed into an adult site. For a brief time, the answers site was linking to adult content, and this could only be fixed by adjusting the complete archive. Posts by respected expert researchers would now look like nasty pranks to new visitors. From this day on researchers were asked to not use any third-party file upload servers anymore. Again: always own the URL you put your content on.

When you buy your own domain and hosting, I would also suggest that you not share it with colleagues, not make any deal with your boss, and not host too much of your friends’ or family’s stuff on it. Colleagues may leave, and the deal with your boss may end once you switch jobs. Your friends, on the other hand, may simply not understand what kind of files they are allowed to put on the server, which poses a security risk.

5. Inline anchors

When you have a blog, you can either put every blog post on its own page, or use inline links to individual blog posts. The first kind would be “www.example.com/post/100.html” and the second “www.example.com/post/100.html#54”. There is a particular problem with having multiple posts on one page: visitors who arrive from search engines are forced to scroll to find the article they had been searching for. A bonus, on the other hand, is that your blogging system doesn’t create a separate, very short page for each one-paragraph post; separate pages would also split up your PageRank, and thus decrease the value of each single post. More posts on a single page, by contrast, would focus the PageRank (because more links are pointing to it).

In my own blog, I lately started to use a mixed style for posts. Some get their own page, but not all of them. I find this to be a good compromise and the best of both link worlds.

6. Framed URLs

Microsoft offers thousands of articles to developers in their MSDN network. There’s a huge problem, however: each article assumes a long navigation to the left side to put the article in context. The navigation is using an extra frame. Now when you arrive from a search engine, you will briefly see the content only, to be then redirected to the framed URL with the navigation frame. This of course is very annoying, both for the delay caused, and the strange assumption that you would care to understand the place of this content within the vast hierarchy of the MSDN site.

There’s another problem with framed URLs, which is that when you bookmark them the browser may only bookmark the actual top frame, and not the sub-frames.

The simple (and old) lesson here is: don’t use frames, and let each URL be just itself.

7. Commas in URLs

You should avoid certain “special” characters in URLs. In particular I’d advise against using a comma, because many software packages which create auto links (e.g. for newsgroup posts) will think the comma is the end of the URL, stop the link at that point, and break it.
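The following Python sketch shows the failure with a deliberately naive auto-linker. The regular expression is my own stand-in, not the logic of any specific newsgroup software, and the “/maps/10,20.html” path is made up. Percent-encoding the comma, shown in the second function, is a workaround I am adding for cases where the comma cannot be avoided; avoiding it remains the simpler fix.

```python
import re
from urllib.parse import quote

# Hypothetical naive auto-linker: a URL runs until whitespace or a comma.
NAIVE_URL = re.compile(r"https?://[^\s,]+")


def autolink_target(text: str) -> str:
    """Return the URL a naive auto-linker would pick out of the text."""
    match = NAIVE_URL.search(text)
    return match.group(0) if match else ""


def safe_path(path: str) -> str:
    """Percent-encode a path; quote() keeps "/" and turns "," into "%2C"."""
    return quote(path)


broken = autolink_target("See http://www.example.com/maps/10,20.html for details")
# broken == "http://www.example.com/maps/10"  (cut off at the comma)

encoded = "http://www.example.com" + safe_path("/maps/10,20.html")
# encoded == "http://www.example.com/maps/10%2C20.html"
```

With the encoded form, the same naive pattern picks up the whole URL, since it no longer contains a literal comma.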

8. Session IDs in URLs

Some services expose a session ID right in the URL. The URL will then look something like this: “www.example.com/shop/?id=239038915”. The problem here is that this is a very temporary URL. Still, it may be indexed by Google (every URL you visit using the Google Toolbar, or a PageRank extension in Firefox, can be found by Google). Also, users will be confused as to whether they can pass on this link or bookmark it. Amazon.com is a good example of confusing users with overly long, cryptic, and “sessionified” URLs.

The solution to this problem: avoid session IDs in URLs and use cookies, which travel in the HTTP headers, instead.
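Here is a minimal sketch of that move in Python, using only the standard library. It assumes the session parameter is called “id” as in the example URL above, and the cookie name “session_id” is invented for illustration; a real shop would do this inside its web framework.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_PARAM = "id"  # assumed parameter name, taken from the example URL


def move_session_to_cookie(url: str) -> tuple[str, str]:
    """Return (clean_url, set_cookie_header) for a URL carrying a session ID."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    session_ids = [value for key, value in params if key == SESSION_PARAM]
    remaining = [(key, value) for key, value in params if key != SESSION_PARAM]
    # Rebuild the URL without the session parameter.
    clean_url = urlunsplit(parts._replace(query=urlencode(remaining)))
    header = ""
    if session_ids:
        cookie = SimpleCookie()
        cookie["session_id"] = session_ids[0]  # hypothetical cookie name
        cookie["session_id"]["path"] = "/"
        cookie["session_id"]["httponly"] = True
        header = cookie.output()  # a "Set-Cookie: ..." header line
    return clean_url, header


clean, header = move_session_to_cookie("http://www.example.com/shop/?id=239038915")
# clean == "http://www.example.com/shop/"
```

The clean URL is safe to bookmark, pass on, or index, while the session itself stays in the header where visitors never see it.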

9. Exposing the server-side script extension

This is a minor problem, but avoid it if you can: exposing the server-side script extension in your URLs, as in “www.example.com/shop/index.php”. You may change the web application language one day, and that would mean breaking the URL (or writing extensive redirects for old URLs). It’s better style, and somewhat nicer to read, to use URLs like “www.example.com/shop/index.html” and then use an “.htaccess” file on Apache servers to rewrite them so your actual script is accessed.
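The rewriting itself would live in the Apache configuration, but the principle can be sketched in a few lines of Python. The rule table and the rewrite function below are hypothetical and only mimic the idea of a rewrite rule; they are not Apache syntax.

```python
import re

# Hypothetical rule table: public ".html" URLs map to whatever script
# currently serves them. Only this table changes if the language changes.
REWRITE_RULES = [
    (re.compile(r"^/shop/([\w-]+)\.html$"), r"/shop/\1.php"),
]


def rewrite(public_path: str) -> str:
    """Map a public path to the internal script path, if a rule matches."""
    for pattern, target in REWRITE_RULES:
        if pattern.match(public_path):
            return pattern.sub(target, public_path)
    return public_path  # no rule matched: serve the path as-is


internal = rewrite("/shop/index.html")
# internal == "/shop/index.php"
```

If the shop were later rewritten in another language, only the right-hand side of the rule would change; every published “.html” URL would stay exactly the same.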

10. Too many parameters

When it comes to search engine optimization for Google (and possibly others), too many parameters can be harmful. Because of that, a URL like “www.example.com/?id=10&date=2004-12-24&title=Tutorial&sort=ascending” is better rewritten into “www.example.com/id-10-date-2004-12-24/tutorial/ascending/”. Not only is this slightly nicer to read and write, it can also help your pages land a better ranking in Google.

I once had a site with thousands of URLs with multiple parameters. They achieved a relatively low PageRank for a year or so. I then used Apache’s “.htaccess” file to change from multiple GET parameters to a rewritten style with an HTML extension. After a short time, all URLs achieved a higher PageRank. While this incident certainly is no final proof, there was also GoogleGuy (a Google employee who posts in the WebmasterWorld forum) who once said: don’t use too many parameters.
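To show what such a mapping involves, here is a Python sketch that converts a URL of the form “www.example.com/?id=10&date=2004-12-24&title=Tutorial&sort=ascending” into the path style “www.example.com/id-10-date-2004-12-24/tutorial/ascending/” and reads the values back out. The function names and the fixed set of four parameters are assumptions for this example; it is not the rewrite I used on my own site.

```python
import re
from urllib.parse import parse_qs, urlsplit


def to_path_style(url: str) -> str:
    """Turn the four GET parameters into path segments."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    q = {key: values[0] for key, values in parse_qs(parts.query).items()}
    # The title is lowercased for the URL, so its original case is not kept.
    path = "/id-{id}-date-{date}/{title}/{sort}/".format(
        id=q["id"], date=q["date"], title=q["title"].lower(), sort=q["sort"]
    )
    return parts.netloc + path


PATH_RE = re.compile(
    r"^/id-(?P<id>\d+)-date-(?P<date>\d{4}-\d{2}-\d{2})"
    r"/(?P<title>[^/]+)/(?P<sort>[^/]+)/$"
)


def from_path_style(path: str) -> dict:
    """Recover the parameters from a rewritten path (empty dict if no match)."""
    match = PATH_RE.match(path)
    return match.groupdict() if match else {}


pretty = to_path_style(
    "www.example.com/?id=10&date=2004-12-24&title=Tutorial&sort=ascending"
)
# pretty == "www.example.com/id-10-date-2004-12-24/tutorial/ascending/"
```

The second function is the part a server-side rewrite has to perform on every request: the script behind the scenes still receives the same four values, only the public address changes.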
