Excellent work, Philipp!
I am stunned =)
I hope this gets Google's attention to change the 4500 sources number!
|
Only English sources ;-)
And the number hasn't changed since... March 21st, 2003! http://web.archive.org/web/20030321025019/http://news.google.com/ |
Philipp, there must certainly be more sources. After all, you were polling using "semi-random words", which means only a small fraction of items is returned, based on the "seeded words"!
I like the yield here. It has a very broad spectrum of sources.
What does the "type" column mean in the sample data? Does it imply paid/free subscriptions?
Hey, can you convert the return list into an OPML share? |
My employer, CoStar Group, which also publishes commercial real estate news, is listed. So in a way I see that as validating that the semi-random approach does pull in sources with a narrower news focus. |
This blog is listed as "Outer-Court" rather than "Google Blogoscoped", so I guess other news sources might also be listed under less-familiar names. |
Thanks for the comments everyone!
> Philipp, there must certainly be more sources. After all, you were polling using "semi-random words", which means only a small fraction of items is returned, based on the "seeded words"!
I don't think the fraction is that small. But let me explain the "semi-random" words: I used a couple of hundred popular dictionary words like "a" or "the", a couple of hundred two-letter combinations like "vx", a couple of hundred rarer words, a list of words I created manually a while ago (like "Google", "Bush", "Iraq"), and a few numbers (like 1, 2, 3). I sorted ~70% of the queries by date published and ~30% by relevancy, with result lists of 100 each. Around 10% of the time I simply queried for either "a" or "the". Take a look at a search for "the" sorted by date... you can refresh every other minute to get new sources: http://news.google.com/news?hl=en&ned=us&ie=UTF-8&scoring=d&num=100&q=the&btnG=Search+News
Almost immediately, you'll get a couple of thousand sources using this method. But after a while, fewer and fewer new sources will be found on average... this might be because you're nearing something like an 80% to 90% completeness range. But that's guesswork of course... there might be 20,000 sources and I just didn't discover them all. But if I need to take a guess, I would guess it's more like 10,000 sources than 20,000.
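For anyone who wants to try something similar, the query mix described above could be sketched roughly like this in Python. This is only an illustration, not the actual crawler: the word pools are tiny made-up stand-ins for the real lists, `pick_query` and `build_url` are hypothetical names, and leaving out `scoring` to get a relevance sort is an assumption. It only builds URLs and does not fetch anything.

```python
import random
from urllib.parse import urlencode

BASE_URL = "http://news.google.com/news"

# Tiny stand-in pools (hypothetical); the real lists held a few hundred words each.
COMMON_WORDS = ["a", "the", "of", "and"]
TWO_LETTER = ["vx", "qa", "zk"]
RARE_WORDS = ["obsidian", "quorum"]
MANUAL_WORDS = ["Google", "Bush", "Iraq"]
NUMBERS = ["1", "2", "3"]
POOL = COMMON_WORDS + TWO_LETTER + RARE_WORDS + MANUAL_WORDS + NUMBERS


def pick_query(rng):
    """Roughly 10% of the time force 'a' or 'the'; otherwise draw from the pool."""
    if rng.random() < 0.10:
        return rng.choice(["a", "the"])
    return rng.choice(POOL)


def build_url(rng):
    """Build one search URL with 100 results, date-sorted about 70% of the time."""
    params = {"hl": "en", "ned": "us", "ie": "UTF-8", "num": 100, "q": pick_query(rng)}
    if rng.random() < 0.70:
        params["scoring"] = "d"  # date sort, as in the example URL above
    # Assumption: omitting "scoring" falls back to the default relevance sort.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(5):
        print(build_url(rng))
```

Each result page would then be parsed for source names and added to a set; the size of that set over time is what flattens out.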
Whether or not Google News is correct in saying "around 4,500" depends on how you count. If you count e.g. all CNN sources as 1, the number might be much smaller than 8,700. But since they keep adding sources and they don't adjust their number, we can assume they're counting as I did, but they're just not telling people the real number anymore. And "above 4,500" of course is always correct. It's like the joke where the doctor tells the patient, "You're gonna die." and the patient says "Really?" and the doctor replies, "Well, not anytime soon, but someday we're all gonna die..." |
Jorn Barger of RobotWisdom comments:
<<if you specify a source and search for "+the" you can get an rss/atom feed for that source that's often better than the sources' own, if they even offer one (eg the New Yorker doesn't)>> http://robotwisdom2.blogspot.com/2005/09/long-tail-of-google-news.html
Very interesting tip! |
I agree the actual number is probably close to 10,000 – your methodology appears to provide reasonable coverage and my guess is the rate of additions is diminishing.
Since Google hand-selects their sources, they know EXACTLY how many sources they are pulling – you should start a poll to see when they change the "4,500" number on their page! ;-) |
> guess it's more like 10,000 sources than 20,000.
Gotta edit my sentence for clarity... I meant "closer to 10,000 sources than 20,000". E.g. could be 9,000, but probably not 18,000... |
I was agreeing with you Philipp – your coverage should be close to complete – my guess is if Google discloses the number, you are within 10-20% of it. |
I sent a message to the News product team requesting that they update the number of sources to the true value. Let's see what they do. |
So are these 8,000+ sources permanent? |
Just FYI – the sources are NOT permanent. Anyone can request to be added as a source. Google then makes a determination as to whether or not you really ARE a source. But you can also be de-listed if someone complains. I was once involved in getting a plagiarism site delisted (the site was scraping content from several other sources, changing two or three words and then republishing it with a new headline and author – we are talking two or three words out of 3000 here) and Google checked it out and delisted it within a few days. |
Update: The crawler has been running this week and I've updated the tool. It's now showing 9,336 Google News sources. |
You might consider showing a graph of the total number of sources uncovered as a function of time – that is probably pretty asymptotic. Also, if you keep this running, it could be used to uncover NEW sources and also figure out DROPPED sources (assumes that sources post content with some frequency). |
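A minimal Python sketch of both ideas, assuming each crawl run yields a set of source names. The function names and sample data are made up for illustration, and "dropped" here only means "not seen in the later snapshot", which could also be a source that simply hasn't posted lately.

```python
def discovery_curve(batches):
    """Return the cumulative number of distinct sources after each crawl batch."""
    seen = set()
    curve = []
    for batch in batches:
        seen.update(batch)
        curve.append(len(seen))
    return curve


def diff_sources(previous, current):
    """Return (new, missing) source names between two crawl snapshots."""
    return sorted(current - previous), sorted(previous - current)


if __name__ == "__main__":
    # Hypothetical sample data, not real crawl results.
    batches = [{"CNN", "BBC"}, {"BBC", "Reuters"}, {"CNN"}]
    print(discovery_curve(batches))                # [2, 3, 3]
    print(diff_sources(batches[0], batches[1]))    # (['Reuters'], ['CNN'])
```

Plotting the curve against crawl time would show whether new discoveries are really levelling off.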
Philipp, great piece of work you did. But were you aware of a similar job that has been maintained at privateradio.org for some years already? Have a look at: http://www.privateradio.org/blog/i/google-news/reports/us/ |
Yes I saw that site, in fact it kind of prompted me to do this 'cause it lacked so many sources... |