Google Answers is a human-to-human search engine for those who aren’t interested in actually searching. Could this answer system ever be implemented algorithmically, so that not people, but programs, would be ready to help? One would think it would make for the ultimate search engine. What would need to happen for this to come true? If nothing else, today’s Web is a bit of a mess. First of all, standards would be nice.
It’s inspiring not to analyze what data is online and what can be done with it, but to think about what could be done with it if it were formatted in better ways. (A Googler’s Web Dream, so to speak.)
The World Wide Web Consortium, brought to life by the man who invented the World Wide Web (and many of the technologies that come with it), is putting forth online publishing recommendations. Those happen to be search engine friendly; not truly for the sake of creating websites for searchbots, but rather to optimize for any device, on any medium, for any kind of visitor.
The device can be a mobile phone, the medium Text-to-Speech, accessed in a car by a driver. The concepts “green link”, “Flash animation to the right”, “title displayed in the browser title bar” completely fail to cover this situation.
However, HTML, the common language for webpages, was never intended to talk about these things in the first place. Rather, it is a structural mark-up language. And yet, presentation is exactly what a large part of web developers use it for. The simple and elegant solution for talking about colors, positioning and so on is stylesheets (CSS, Cascading Style Sheets). Though they’re gaining popularity, the majority of the web (including the big G itself) does not put them to good use.
So what does HTML talk about? It actually should only mention: this is important; this is a quote; this thing there is an address; here’s the title; and this table heading is for this table data. But if only a minority of webmasters mark up their data in such ways, there’s simply no commercial need to analyze these meanings. It just wouldn’t be pragmatic to focus on it, because in large numbers there’s no significant bonus gained from this analysis.
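To make that concrete, here’s a minimal sketch (the sample markup and the little scraper are my own illustration, not anything any search engine actually runs): once a page uses structural elements like blockquote with a cite attribute, or address, a few lines of code can pull the meanings out directly.

```python
from html.parser import HTMLParser

# Sample page using structural (not presentational) mark-up -- invented for illustration.
PAGE = """
<blockquote cite="Albert Einstein">God does not play dice</blockquote>
<address>1600 Amphitheatre Parkway, Mountain View</address>
"""

class StructureReader(HTMLParser):
    """Collects (tag, source, text) facts from structural elements."""
    def __init__(self):
        super().__init__()
        self.facts = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in ("blockquote", "address"):
            self._current = (tag, dict(attrs))

    def handle_data(self, data):
        if self._current and data.strip():
            tag, attrs = self._current
            self.facts.append((tag, attrs.get("cite"), data.strip()))
            self._current = None

reader = StructureReader()
reader.feed(PAGE)
for tag, source, text in reader.facts:
    print(tag, source, text)
```

With presentational mark-up (font tags, layout tables), none of these meanings would be recoverable without guesswork.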
However, accessing the web from portable devices will become more mainstream in the future. Web creators will have to decide whether they have the time (or get paid for the time) to mark up the same data in a variety of ways for a variety of devices (print, screen, handheld, TTS (Text-to-Speech), to name a few; there might even be some we don’t know of yet, because the device hasn’t been invented yet)... or to simply switch to logical structuring with media-specific stylesheets on top.
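For illustration, such media-specific stylesheets might look like this (a sketch; the rules and values are invented):

```css
/* One structural HTML document, different presentations per medium. */
@media screen {
  h1 { color: green; font-size: 2em; }
}
@media handheld {
  h1 { font-size: 1.2em; }  /* small displays, no decoration */
}
@media print {
  h1 { color: black; page-break-after: avoid; }
}
```

The HTML itself stays exactly the same; only the stylesheet decides how a heading appears on a monitor, a cell phone, or paper.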
How could a Search Engine actually profit if future developers were following standards*, thus turning the majority of online content into perfectly structured mark-up?
*I’m not talking about the Semantic Web, by the way, which is an effort going one step further than simply logically structuring a page; it’s intended to add specific meaning. (The effort is interesting, though slightly unrealistic until the Web has taken the first step of structuring content.)
Well, it could turn into a Find Engine, or Conclusion Engine.
To explain: a typical SERP (Search Engine Result Page) is a listing of title-links with short (more or less relevant) description snippets. It’s an assistance tool in searching, but it’s not doing your job of finding. Currently, you can ask a question over at Google Answers, i.e. ask a real live human being. This person is a Google research expert and will do the job of searching for you, giving you only the found answer to your question. If this could be even remotely automated (I’m not talking about a replacement of human research, or of Google Answers) it would be the next generation of Search Engines. The post-Google era.
Imagine all the logical conclusions a search engine could yield if it had actual connection* data like the following:
A is an abbreviation for B.
C said D.
E is the title of F.
G is an address.**
*The only real connection a search engine can trust today is that of a pointer from here to there, the good old hyperlink. And already, very relevant conclusions (like Google’s PageRank algorithm) stem from it.
**I’m almost thinking that any mark-up which is not a connection between two sets of data (like “G is an address”) is useless in terms of information processing and automated conclusions.
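As an aside, that point about the humble hyperlink can be made concrete: the PageRank idea draws conclusions purely from link connections. A toy version (the three-page link graph below is made up, and this is a simplification of the real algorithm) might look like:

```python
# Toy PageRank: rank flows along hyperlinks in a made-up three-page graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
damping = 0.85
rank = {page: 1.0 / len(links) for page in links}

for _ in range(50):  # power iteration until roughly converged
    new_rank = {page: (1 - damping) / len(links) for page in links}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

# C is linked to by both A and B, so it ends up with the highest rank.
print(sorted(rank, key=rank.get, reverse=True))
```

Nothing but “here points to there” goes in, yet a relevance ordering comes out; that’s the kind of conclusion richer connection data could multiply.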
We would quickly see SERPs like this:
The following are quotes by Albert Einstein: “God does not play dice”, ...
This is the table-of-contents for the page ...
The company is located here ...
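A minimal sketch of how such answers could be assembled from connection data alone (all facts and relation names below are invented for illustration):

```python
# A tiny "Find Engine": facts stored as (subject, relation, object) triples.
facts = [
    ("Albert Einstein", "said", "God does not play dice"),
    ("WWW", "is an abbreviation for", "World Wide Web"),
    ("1600 Amphitheatre Parkway", "is", "an address"),
]

def find(subject=None, relation=None):
    """Return the objects of all facts matching the given pattern."""
    return [o for s, r, o in facts
            if (subject is None or s == subject)
            and (relation is None or r == relation)]

# "The following are quotes by Albert Einstein: ..."
print(find(subject="Albert Einstein", relation="said"))
```

Instead of ranking pages that might contain the answer, the engine returns the answer itself, because the connection was marked up in the first place.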
The Web is not standardized; maybe it’s impossible to ever get to that point, in which case it will take the much harder route of intelligent algorithms for natural-language text processing. (Though collectively, we could help speed that up a bit with structural mark-up.) But even then, this Find Engine might be created, once algorithms are perfected, and once the data pool is large enough for us to say “the Web does not play dice”.
Looking further ahead: after the search engine that can reach conclusions on its own, we would be a short distance away from the Creation Engine. Instead of only serving a result to a query on the state of what is, this page would be brave enough to create new theories. Maybe soon enough we’d even see Creation Engines coming to approximations of what data is missing to reach a needed conclusion; and in busy preparation for a future user request of the same sort, Creation Engine Bots would be sent out to post questions in relevant newsgroups, help chats, and online discussion forums.
Or Google Answers, for that matter.
GoogleGuy works for Google Inc. and gives answers to SEO questions at the WebMasterWorld forum: the future of the Open Directory Project, reporting spam, Google and feedback (“the recent ’extreme geolocation’ thread was a good prompt to make sure that we went back and made sure everything worked better”), infinite penalties, Google’s view of SEOs, and more.
I can’t tell how good the portable device discussed in “Robin, to the Googlephone! Need a Web browser to settle bar bets? Try the Sidekick.”* (by Paul Boutin, June 12, 2003) really is. Trying Google with my handphone around two years ago was a failure: slow, small, and even though Google’s instant HTML-to-WAP conversion is a great idea, they cut off the page after a certain number of characters. But is a color screen really the most important thing in quickly retrieving online information? Luckily for purist, cross-media developers, WAP2 adopts the XHTML 1 Strict standard and gets rid of WML; some cell phones already support it, along with CSS tailored to handhelds. Now all we need is webmasters sticking to standards, or on-the-fly webpage conversion from cluttered to clean (which is harder than it may sound if structural mark-up is missing on a page).
And I guess once we get all this working... bar bets will lose their appeal. If you can instantly know the right answer, why get into a big argument?
*Ironically, when going through Paul Boutin’s article I had to fight two separate Flash animations to the right (one was blinking for my attention, the other was busy throwing shadows all over the page) when I came across his line: “That’s not to say [the Sidekick device is] perfect. The browser doesn’t do JavaScript, animation, or pop-up windows, which makes many sites unusable. Most embarrassingly, it fails to load some sites at all, including [this site] Slate.”
Embarrassing to Slate, I suppose?
And when did a pop-up window ever make a site usable?
The previously discussed Google indexing limit rumour is now covered at Google-Watch.org as the Y2K+3 theory:
“Let’s speculate. Most of Google’s core software was written in 1998-2000. It was written in C and C++ to run under Linux. As of July 2000, Google was claiming one billion web pages indexed. By November 2002, they were claiming 3 billion. At this rate of increase, they would now be at 3.5 billion, even though the count hasn’t changed on their home page since November. If you search for the word “the” you get a count of 3.76 billion. It’s unclear what role other languages would have, if any, in producing this count. Perhaps each language has its own lexicon and its own web page IDs. But any way you cut it, we’re approaching 4 billion very soon, at least for English. With some numbers presumably set aside for the freshbot, it would appear that they are running out of available web page IDs.

If you use an ID number to identify each new page on the web, there is a problem once you get to 4.2 billion. Numbers higher than that require more processing power and different coding.”
– Is Google broken?, June 9, 2003
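For what it’s worth, the 4.2 billion figure in the quote is simply the capacity of an unsigned 32-bit integer; a quick check in Python:

```python
# An unsigned 32-bit integer can distinguish 2**32 different page IDs.
max_ids = 2 ** 32
print(max_ids)  # 4294967296, i.e. roughly 4.2 billion

# 3.76 billion indexed pages would already use about 88% of that ID space.
used = 3_760_000_000 / max_ids
print(round(used * 100), "%")
```

So the speculation boils down to: if page IDs are stored in 32 bits, the index is close to the ceiling, and going beyond it means wider IDs and code changes.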
Arnaud Fischer, AltaVista information retrieval and search engine technology product marketing manager from 1999-2001, discusses What’s it Going to Take to Beat Google (June 12, 2003):
Users’ intent and personalized relevancy: Fischer argues search engines should better understand the context (time and location) of individual users. Thus a user entering a single term like “pizza” would automatically see only places relevant to their current location.
Training the engines: Somehow, search engines should learn from the user’s past searching behavior. Privacy concerns, among others, come with this approach.
Query disambiguation: This is regarding concept-based clustering methods, as in use by Vivisimo and others. An otherwise ambiguous query, e.g. “virus” (computer virus, or human virus?), could be further focused by having the user decide what’s wanted.
Making results productive: Fischer writes that users should have the option to be notified when new, relevant results for certain queries appear online, since they are often looking for new content on the same thing. (He does not mention Googlealert though, which does just that; I use it for several queries and it’s working very well, notifying me of new sites on different subjects.)
Vertical engines: Serving niche markets and further crawling those highly specialized parts of the “Deep Web” otherwise not found could be one major advantage over current Google technology.
Meta search engines: Somehow, in theory, a search engine going through several engines at the same time should be better than a single one. However, meta-search engines have a bad reputation: obtrusive advertisements, paid links. They should rather focus on getting the best results from a variety of search services.
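To illustrate the query disambiguation point from the list above, here’s a toy concept clusterer (the result titles and topic keywords are invented; real systems like Vivisimo are far more sophisticated):

```python
# Toy clustering: group result titles for "virus" by which topic words they contain.
results = [
    "Computer virus removal guide",
    "How the flu virus spreads",
    "Antivirus software reviews",
    "Human virus infection symptoms",
]
topics = {"computer": ["computer", "antivirus", "software"],
          "medical": ["flu", "human", "infection"]}

clusters = {name: [] for name in topics}
for title in results:
    words = title.lower().split()
    for name, keywords in topics.items():
        if any(keyword in words for keyword in keywords):
            clusters[name].append(title)

for name, titles in clusters.items():
    print(name, titles)
```

The user would then click “computer” or “medical” to narrow the ambiguous query, without the engine having to guess.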
An interesting article, but somehow, nothing of the above looks as if it could be the next Next Big Thing in search engine technology.
Yes, engines might be trained, but please let it be collectively, to judge which sites are more relevant. And do we really want personalized search? It might be nice when the next Google suddenly assumes this and that about you, but then again, we want different things at different times. Maybe today I want to research a human virus, and tomorrow a computer virus.
Vertical search engines do make sense, but splitting up the interface into even more niche results is not what’s wanted, I believe. One interface covering all the web (flat and deep) seems more reasonable.
Query disambiguation is also nice, maybe, in a way that doesn’t interfere with a straightforward result list. Keep it simple. Isn’t that exactly what AltaVista forgot around the years 1999-2001, when it was completely put out of the race by Google?
It seems to me, relevancy still rules (and can still be optimized, even by Google). Otherwise, trust users to enter a second query to redefine a search. Trust them to be able to enter their hometown + pizza to find a near-by restaurant. But please do serve up the best pizza.
The Centuryshare calculator has been debugged; there was a bug preventing it from showing statistics from 1950-1990.
And here’s the Centuryshare for “Hippies”:
Google Answers customer Qpet-ga is writing a book on “Quality of Life”, about 300 pages long, covering philosophy, anthropology, and psychology. Those subjects are also the focus of Qpet’s well-priced questions at Google Answers, which are really some of the most interesting posted. Here’s a chronological selection, and you can see this should make for a truly great book:
And the following one, which surely would receive many votes for the Answer of the Year Award (if there were such a thing):