Google Blogoscoped

Thursday, April 29, 2004

Orkut Community Data Leaking

“We strongly respect your privacy. orkut is a trusted community. Our web site and services are designed to facilitate trust.”
– Orkut’s Help

Orkut, Google’s social networking website, stores some of the most intimate data of its users, equaling that of Gmail in importance. Not all of the data can be seen by everyone. In fact, you can determine who’s going to see which data. Only a friend may see where you live, but a friend of a friend may get to know your birthday.

However, there has been a leak of data, which one might consider a privacy issue. Rolan Yang from New Jersey created the “ORKUT Personal Network GeoMapper”,* which was made possible by one of his associates acquiring a mirror of the entire Orkut website**. That’s right, folks: the way things are looking, personal data has left the Orkutplex.

The GeoMapper MySQL backend does not contain seriously critical data, as name and location are usually already public within the community. However, this time the data is public outside the community (and in fact has already made its way into the Google index).

GeoMapper on its own is a technically interesting application (see previous post), and whoever agrees to show his friends list or location to anyone outside of Orkut can surely make use of it. However, the issue here is that I do not believe everyone at Orkut agreed to this data being shown. The issue is also that someone inside Google might have taken the Orkut data and passed it on – or, if it was someone outside Google, then an Orkut member must have done a lot of screen-scraping. (In theory, this would be possible, and any Orkut member could start to talk about details of other Orkut members... just as someone you consider trusted might publish email you send to him.) I asked Orkut for a comment.

Update: someone told me that “Orkut Police” noticed the bots screen-scraping the site and quickly stopped them. Thus the data grabbed is quite old, and further trouble is not ahead.

*The site is hosted at STARGATE.COM INC registered datawhorehouse.com

** “A while back, an associate of mine aquired a mirror of the entire Orkut website (...) I fed the info into my database to do some visualizations .. just for personal amusement and out of scientific curiosity.” (Rolan via email, permission to quote here has been given.)

Growing Two Words

I took the recently mentioned algorithm, which grows two words into sentences, and applied it to the Google Web API. You can see the result on FindForward’s Grow Words*.

E.g. enter “i am” (no quotes) to grow this to “I am your child understands the. historical perspective kathleen prody links chemists materials research and livestock presented old s almanac by elizabeth shippen green artwork prices always go back to the bible in.”. (If you check the Google results for “i am” you will see what is happening: “Your child” are the two next words, which are then googled again, and so on.)

*Note: the resulting sentences are not terribly meaningful, so this is not yet a public FindForward option. However, following the link from here will add it to the FindForward options.
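
For the curious, the growing loop described above can be sketched roughly as follows. This is only an illustration, not the FindForward code: the `search` callable and the canned snippets are hypothetical stand-ins for a real search API, and the "stop on repeat" rule is my own assumption to keep the loop from running forever.

```python
import re
from typing import Callable, List, Optional


def next_two_words(phrase: str, snippets: List[str]) -> Optional[str]:
    """Return the two words following `phrase` in the first snippet that contains it."""
    pattern = re.compile(
        r"\b" + re.escape(phrase) + r"\W+(\w+)\W+(\w+)", re.IGNORECASE
    )
    for snippet in snippets:
        match = pattern.search(snippet)
        if match:
            return f"{match.group(1)} {match.group(2)}"
    return None


def grow_words(seed: str, search: Callable[[str], List[str]], max_steps: int = 10) -> str:
    """Search for two words, append the two words that follow, and search again."""
    parts = [seed]
    seen = {seed.lower()}
    current = seed
    for _ in range(max_steps):
        following = next_two_words(current, search(current))
        # Stop when nothing follows, or when we would loop (an assumption of this sketch).
        if following is None or following.lower() in seen:
            break
        seen.add(following.lower())
        parts.append(following)
        current = following
    return " ".join(parts)


# Hypothetical canned "search results" so the sketch runs without any network access.
CANNED_RESULTS = {
    "i am": ["I am your child and I know it"],
    "your child": ["Help your child understands the world"],
    "understands the": ["Nobody understands the historical perspective"],
}


def fake_search(query: str) -> List[str]:
    return CANNED_RESULTS.get(query.lower(), [])


print(grow_words("i am", fake_search))
```

With a real search backend, `fake_search` would be swapped for a function returning result snippets for the query.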

Orkut Map of Friends

I don’t quite get it... how can Datawhorehouse display an Orkut map of friends? I thought Orkut’s social network wasn’t public data (not to say it’s incredibly well hidden, since there are already hundreds of thousands of people in the program). In any case, this map app lets you enter the name of any Orkut member (no double-spaces, US-only) to dynamically create a map graphic. Red lines are friends, blue lines are friends of a friend.

Automated Sites

Automated news portals via RSS as a way to attract AdSense-clicking visitors. Got to put that in my book of spam techniques in the age of Google.
For example, http://news.jdwebpages.com/, hundreds of pages doing nothing but reprinting XML news-feeds in Google-friendly hypertext using PHP. Then again, nothing illegal going on here. RSS wants to be free...
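
To give an idea of how little work such a reprinting page takes, here is a minimal sketch. The site mentioned uses PHP; I use Python here purely for illustration, and the sample feed, its URLs and the function name `rss_to_html` are made up.

```python
import xml.etree.ElementTree as ET
from html import escape


def rss_to_html(rss_xml: str) -> str:
    """Turn an RSS 2.0 document into a plain HTML list of linked headlines."""
    channel = ET.fromstring(rss_xml).find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: no <channel> element")
    items = []
    for item in channel.findall("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        # Escape both values so feed content cannot break the generated markup.
        items.append(f'<li><a href="{escape(link, quote=True)}">{escape(title)}</a></li>')
    heading = escape(channel.findtext("title", default=""))
    return f"<h1>{heading}</h1>\n<ul>\n" + "\n".join(items) + "\n</ul>"


# A made-up feed, inlined so the sketch does not need to fetch anything.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example News</title>
<item><title>Fish &amp; Chips</title><link>http://example.com/a?x=1&amp;y=2</link></item>
<item><title>Second story</title><link>http://example.com/b</link></item>
</channel></rss>"""

print(rss_to_html(SAMPLE_FEED))
```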

There are other ways to create automated texts. One is complete randomization of words in a dictionary, as I tried out in Memecodes. Another is to display search results of Google et al, as used by many spamfarms.

Also, you can check which words typically follow a word (I tried this out in Retale, which is a tiny freeware app you can find at the bottom of my VB programs page).
Recently I came across an optimized variation of this: check which word typically follows the last two words. It seems to work much better (see the history of Mark V. Shaney, a bot participating in discussions, and a bot being taken seriously).
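
The two-word variation can be sketched in a few lines. This is a generic illustration of the idea, not the code of Retale or of Mark V. Shaney; the tiny corpus and the function names are made up.

```python
import random
from collections import defaultdict


def build_model(text):
    """Map each pair of consecutive words to the list of words seen right after it."""
    words = text.split()
    model = defaultdict(list)
    for first, second, follower in zip(words, words[1:], words[2:]):
        model[(first, second)].append(follower)
    return model


def generate(model, start, length=20, rng=None):
    """Grow a text from a starting word pair by repeatedly picking a known follower."""
    rng = rng or random.Random()
    output = list(start)
    while len(output) < length:
        followers = model.get(tuple(output[-2:]))
        if not followers:
            break  # this pair never occurred in the corpus, so there is nothing to add
        output.append(rng.choice(followers))
    return " ".join(output)


corpus = "the cat sat on the mat and the cat ran off"
model = build_model(corpus)
print(generate(model, ("the", "cat"), length=12))
```

Because followers are stored with repetition, words that follow a pair more often are picked more often. Using single words as keys instead of pairs gives the simpler one-word variant.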

Yet another way is to screen-scrape on demand. Let’s take my FindForward Directory view. The first page is “hand made”, whereas all following ones are taken straight from DMOZ.org, the Open Directory Project. (DMOZ.org content is open, and I add thumbshots to give extra value.) So if you look at FindForward’s Open Software Directory, nothing of that resides on my server.

Of course, you can also use a database backend (or simply files on your server hard disk) to generate millions of pages from just a little data, simply by combining this data in meaningful ways. The ultimate automated text generation system, which could feed endless data into Google (and possibly plaster its output with visitor-friendly text ads), would be an intelligent bot. This bot could babble on and on about Web events and everything else (pretty much like a blogger does today). In the future, we might see some of those intelligent websites pop up. Websites where there is no webmaster or article database providing content – websites being able to pass the Turing test.
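
As a rough sketch of the "little data, many pages" idea: combining two short lists already yields one page per pair, so a few lists with a thousand entries each get into the millions quickly. The topics, places and URL scheme below are invented for illustration.

```python
from itertools import product


def slugify(text):
    """Lowercase the text and replace spaces so it can be used in a URL path."""
    return text.lower().replace(" ", "-")


def generate_pages(topics, places):
    """Build one page (path -> title) for every topic/place combination."""
    pages = {}
    for topic, place in product(topics, places):
        path = f"/{slugify(topic)}/{slugify(place)}.html"
        pages[path] = f"{topic} in {place}"
    return pages


pages = generate_pages(["Hotels", "Used Cars", "Jobs"], ["New Jersey", "Berlin"])
print(len(pages))  # 3 topics x 2 places = 6 pages
```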

From Hotmail to Gmail and Other Gmail tips

If you have an existing Hotmail account and you’d like to transfer emails to Gmail, there’s apparently an easy way to do it. Another tip coming from Gmail Gems (a Gmail blog) is to use yourname+keyword@gmail.com to separate emails, e.g. to use philipp.lenssen+job@gmail.com as an email to give out to my colleagues.
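
If you want to sort such mail yourself, the keyword is easy to pull back out of the address. A small sketch follows; the function name is my own, and it only assumes the yourname+keyword@domain pattern described above.

```python
def split_plus_address(address):
    """Split 'name+keyword@domain' into (name, keyword, domain); keyword may be empty."""
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    # Everything after the first '+' in the local part is treated as the keyword.
    name, _, keyword = local.partition("+")
    return name, keyword, domain


print(split_plus_address("philipp.lenssen+job@gmail.com"))
```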


This site unofficially covers Google™ and more with some rights reserved. Join our forum!