Google Blogoscoped

Forum

Archive.org Founder Criticizes Google Book Scanning (Video)

Brian M. [PersonRank 10]

Monday, November 27, 2006
17 years ago, 11,094 views

When I saw Brewster Kahle speak at Wikimania he was not so adamantly against Google's book scanning. I would like to hear more about these "secret" contracts that were obtained via subpoena. It has always been my understanding that the libraries are being given exactly what Google has. It would be terrible, for instance, if Google merely gave them the scanned pages, and left the task of OCR'ing the books up to the libraries. But I highly doubt that they would agree to this.

Does anyone have more information on just what information Google is turning over? If these "secret" contracts were obtained via subpoena, I should be able to read them. Where are they?

Note Brewster's mission: "Universal Access to all Knowledge"
Google's mission: "Organize the world's information and make it universally accessible and useful."

Google's book search now features a prominent radio button allowing you to search all "full view books." You can download PDFs of all books that are out of copyright. As far as I can tell, Google is making universally accessible as much information as the law allows. My hunch is that Brewster should work with the libraries that Google has contracted with, rather than fighting with Google. Or he can just download every single PDF available on Google and make it available.

In fact, I propose we start a project to do this right now: download every single PDF of out-of-copyright books (including scraping their metadata) from Google Book Search and put them on Ourmedia.org. Brewster Kahle is on their board of trustees, the bandwidth and disk space come from the Internet Archive, and their mission is thus:

"Ourmedia is a media archive, supported by the Internet Archive, which freely hosts any images, text, and video or audio clips which do not violate copyright laws and do not include pornography."

This will satisfy Brewster's wants: every book that can be _legally_ exported from one of these famous public libraries will no longer be available solely from a single corporation such as Google. This will save Brewster a lot of work, I think.

Gary Price [PersonRank 10]

17 years ago #

On a related note:

Don't forget the Online Books Page, with direct access to tens of thousands of full text books online.
http://digital.library.upenn.edu/books/

Just look at how much is added daily:
http://onlinebooks.library.upenn.edu/new.html

A feed is also available of new entries:
http://onlinebooks.library.upenn.edu/new.html

The World eBook Library offers access (mostly PDFs) to over 400,000 titles.
http://www.netlibrary.net/
Searchable. Access is only $8.95 (US) per year. More than 75,000 books are available at no charge.
http://www.worldlibrary.net/Public.htm
Another cool source (for NEW full text books) is ebrary. This service is free (registration required), but unlike other services that offer full text books, it lets you view the complete book online for free. You only pay $.25/U.S. to copy or print a page (if you want).
http://shop.ebrary.com/
ebrary is run by Christopher Warnock, the son of the founder of Adobe.

AND if you're in need of children's books, make sure to check out the International Children's Digital Library. Amazing and all free.
http://www.icdlbooks.org/

Special note on the search interface built for children:
http://www.childrenslibrary.org/icdl/SimpleSearchCategory

and the quality and presentation of a scanned book. Example:
http://www.childrenslibrary.org/icdl/BookReader?bookid=dkthntr_00260002&twoPage=true&route=simple_0_0_0_English_0&size=0&pnum1=1&djvu=false&lang=English

Finally, many LIBRARIES offer FREE full text access to NetLibrary. All you need is a library card. Access is 24x7 from any web computer. Primarily new books that you can annotate, share, etc.
http://www.netlibrary.com/Librarian/Home/Home.aspx

Gary Price [PersonRank 10]

17 years ago #

Forgot to mention (sorry) that Archive.org has several digitization projects underway.

See:
http://www.archive.org/details/texts

See Also: A video of Book Scanning Robot at the University of Toronto in action.
http://www.archive.org/details/scanning_robot

See Also: An article about U of Toronto Scanning (part of the Internet Archive): Building an Online Library, One Volume at a Time (via WSJ, free)
http://online.wsj.com/public/article/SB113111987803688478-VNpw62xi_JA4avE8cxOZf0pf_nM_20061109.html?mod=blogs

Of course, the Internet Archive is also home to a massive archive of music and live performances. It also offers a growing collection of video.
http://www.archive.org/index.php

See Also:
+ DigitalBookIndex
http://www.digitalbookindex.org/about.htm
128,000 titles, about 88,000 of them free.

+ The OpenBook Library
http://www.openlibrary.org/
Cool technology! Reminds me of the “Turning the Pages” technology used by the NLM and the British Library (12 full text books).
http://www.bl.uk/onlinegallery/ttp/ttpbooks.html

Bill Alldredge [PersonRank 0]

17 years ago #

It's interesting to hear Kahle speak so negatively of Google's library digitization, considering he is one of the founders of an equally monumental project (the Internet Archive). Legal and copyright issues aside, Google has invested a tremendous amount of time, funds and resources into this project, for what on the surface appears to be a genuine effort to make information available to the public. Is it just me or does it seem like there's some animosity towards Google in Kahle's comments in this interview? While I agree there's a need for an open-source standard for digitization of this type of content, this is one of those cases where Google simply beat the rest of the competition to establishing partnerships with libraries to make the project possible. It's disappointing to hear someone responsible for a project like the Internet Archive oppose the digitization and archival of such a vast amount of information, which is being made available to the general population. Seems like there's more to the story...

Philipp Lenssen [PersonRank 10]

17 years ago #

I like your proposal Brian. I suppose US users will find the most PDFs, as often full view books aren't really full view in some other countries.

Now, Google's usage guideline for PDFs of works in the public domain includes the line "we request that you use these files for personal ... purposes"* (and they also ask you to not remove their "digitized by Google" watermark) – though I wonder if they can even claim this right. (See the previous discussion at http://blogoscoped.com/archive/2006-08-30-n13.html .) Google also asks you to refrain from automated querying...

* For reference, here's a PDF I just downloaded.
http://blogoscoped.com/files/Shakspeare_s_Seven_Ages.pdf

Kevin [PersonRank 0]

17 years ago #

The head of France's National Library offered a similar sentiment in his book "Google and the Myth of Universal Knowledge".

I think it is a valid point that one firm owning all the digitization is dangerous, but if anyone is going to be the one firm, it should be Google.

Also, 1 set of digitized books is better than 0.

Brian M. [PersonRank 10]

17 years ago #

A mere reproduction of a public domain work does not remove it from the public domain. Google does not own the copyright to these works and thus cannot own the copyright to an uninspired derivative. We are legally free to remove Google's watermark and notices.

Google may own the right to the collection of public-domain books. This is an as-yet untested corner of copyright law.

It's easy to circumvent, though, by not making any individual responsible for the copy. Provide a user interface whereby users must click a link to download each PDF. They can then modify the metadata, verify that it is not a duplicate, and click again to upload with their own account. This also avoids violating Google's TOS, as it is not automated.

Anonymous Googler [PersonRank 0]

17 years ago #

Brian: Google's not trying to lock down public domain content: that's why it says "request", not "require". We'd just like people to know who spent the effort (and money) to digitize and index it, just like a facsimile of an old book will usually credit the museum or library that arranged to have it done. We're not trying to impose a copyright on something that's already in the public domain--that would be evil.

The request not to automate downloading is just advice about how not to trigger our spam detectors--our servers are designed to detect automated scrapers and refuse them so that they don't hog resources away from human users.

There's nothing sinister here except to people who are used to extorting money from researchers because old and out of print books are hard to find. We think that information scarcity is a bug. Helping people find what they're looking for is what we're all about: in any language, from any time and place--not just web pages.

Brian M. [PersonRank 10]

17 years ago #

I totally understand that – I was not implying that Google is trying to exert copyright over these books, only countering the direction the conversation was going.

That said, you could make life easier by allowing us to download all of Google's public domain books. Better yet, ask Brewster to send Google a Petabox, and when you guys are done scanning in all the books, send it back to him chock full of all the public domain that Google acquired from these libraries.

Your significant contribution to history here is not in scanning the books, but in giving them back to the people – all of them (in case, you know, the bubble bursts ;)

Beret [PersonRank 0]

17 years ago #

It's a good thing archive.org maintainers press for data to be released to the world. However, researcher access to archive.org seems to have been unavailable since 2002:
<<Researcher access is currently not available pending redesign. This material has been retained for reference and was current information as of late 2002.>>
http://www.archive.org/web/researcher/proposal.php

Philipp Lenssen [PersonRank 10]

17 years ago #

Thanks for clarifying, anonymous googler. Now, you (Google) also request people only make "personal, non-commercial" use of the works, but this is not a requirement the public domain imposes, and I also don't see how that is in line with wanting to be credited for the hard work you do in digitizing, which is of course a fair point. Sure, there's a difference between "request" and "require" as you say, but I wonder if it's an obvious difference to those (non-lawyer) people reading through your PDF's "usage guidelines". A more user-friendly legalese – something more akin to a helpful Google-search for [define:public-domain] – may be the following...

<<This book is in the public domain, a copyright-free zone – feel free to remix, republish (and even sell!) the book to your heart's liking. We hope you credit Google Book Search for all the hard work we did digitizing the book. Visit us at ...>>

John Dowdell [PersonRank 1]

17 years ago #


(A small request: A functional synopsis of the points made in a video could be a useful orientation for any subsequent discussion of those points.)

Roger Browne [PersonRank 10]

17 years ago #

Well phrased, Philipp! That would be a great way for a corporation to generate fabulous goodwill.

Philipp Lenssen [PersonRank 10]

17 years ago #

John, good idea, I added a try at a transcript to the post.

Brian M. [PersonRank 10]

17 years ago #

I also want to note that we cannot trust the libraries to give us these works. Take a case study: The New York Public Library Digital Gallery [1]. When they first opened, there was much fanfare concerning the large number of public domain images that they had scanned in and were making downloadable. I wanted to put these on Wikipedia, but I couldn't. Take a look at what their legalese said:

"Most materials are in the public domain, freely available for personal use. For any other use, including but not limited to any type of publication or commercial use, contact The Library’s Photographic Services & Permissions (http://www.nypl.org/permissions/) office regarding required permissions."

I asked (over e-mail) for more information:

"As the physical rights holder, the Library charges a usage fee to use any images from its collections for any type of publication, broadcast, exhibition, web site, etc. The library does not claim copyright on these images.

Please remember as it is stated on the Library's web site it is strictly prohibited to download images for any use other than personal or research purposes without obtaining written permission from the Library and payment of use fees."

They seem to be taking the position that you are violating their Terms of Service by downloading an image if your intent is to use it for anything not considered "private use". This doesn't sound legal to me. Suppose I browse the images on their website, and I keep a cache of every website I ever visit on my computer (which is true). I now have in my possession, on my computer, a public domain image which I acquired in a legal manner. As this image is in the public domain I am free to distribute it, sell it, and publish it as I will because it is not encumbered by copyright.

The legality of this situation does not really matter. All I am saying is that we cannot trust the libraries to be on our side, so Google giving the scanned books back to the libraries is not really giving them back to the people, because there is a good chance they are going to try and charge us for them.

[1] http://digitalgallery.nypl.org/nypldigital/index.cfm

Philipp Lenssen [PersonRank 10]

17 years ago #

You are right Brian. I briefly covered that case here too:
http://blogoscoped.com/archive/2005-03-05.html

I think they won't be able to sustain this stance should it come to a court case, but I'm no lawyer, so I dunno... basically, if it's public domain I don't care what they think their ToS is, though I likely also will just go elsewhere, so there won't be a big conflict.

Brian M. [PersonRank 10]

17 years ago #

Oh, didn't realize you put that up. I am of course Alterego =)

Daniel Brandt [PersonRank 3]

17 years ago #

Google's contracts with the University of Michigan and the University of California are available at http://www.google-watch.org/modify.html

All contracts that Google writes are nominally confidential. But in these two cases, state freedom-of-information laws in Michigan and California obligated UM and UC to provide me with copies.

The New York Public Library is a private entity, and is not covered by New York sunshine laws. As a private entity, they apparently presume the right to restrict access to public domain material. You can see from the Google contracts with UM and UC that these two universities are completely restricted when it comes to the manner in which they can distribute their copy of Google's scans and text files. UC got smart and negotiated for coordinate data (i.e., the pixel coordinates where each word appears on a scanned page), which means that UC can at least develop its own search engine for their copies. But UM didn't bother to do this, and presumably won't get the coordinate data.
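
To illustrate why coordinate data matters for building a search engine, here is a minimal sketch in Python. It assumes each OCR'd word arrives as a record with a page number and a pixel bounding box. The `WordBox` record, its field names, and the sample values are hypothetical and are not taken from the Google contracts or any real file format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """One OCR'd word and where it sits on a scanned page (hypothetical layout)."""
    word: str
    page: int
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int


def build_index(boxes):
    """Map each lowercased word to every place it appears in the scans."""
    index = defaultdict(list)
    for box in boxes:
        index[box.word.lower()].append(box)
    return index


def find(index, term):
    """Return the boxes for a search term, or an empty list if it never occurs."""
    return index.get(term.lower(), [])


# Made-up sample data standing in for two scanned pages.
sample = [
    WordBox("All", 1, 40, 100, 30, 12),
    WordBox("the", 1, 75, 100, 28, 12),
    WordBox("world's", 1, 108, 100, 60, 12),
    WordBox("a", 1, 173, 100, 10, 12),
    WordBox("stage", 1, 188, 100, 45, 12),
    WordBox("stage", 2, 60, 220, 45, 12),
]

index = build_index(sample)
hits = find(index, "Stage")
for hit in hits:
    print(f"page {hit.page}: highlight box at ({hit.x}, {hit.y}), "
          f"{hit.width}x{hit.height} px")
```

With only the page images and plain text, a search could report that a word occurs somewhere in a book. With coordinates, it can also point to the exact spot on the scanned page, which is what a highlighted full-view result needs.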

Philipp Lenssen [PersonRank 10]

17 years ago #

Thanks Daniel for the link.

Anonymous Googler [PersonRank 0]

17 years ago #

Daniel: I'm not going to get into an extended debate with someone who obviously has an axe to grind, so this will be my last comment on this entry, but I don't think the contracts you linked to are as "lurid" as you seem to think.

They're pretty reasonable for a library or museum. Brian is completely correct that libraries are not, in general, willing to place digitizations of their collections into the public domain, even if the texts themselves are already there. Library politics is its own universe of egos and money, and not all of the restrictions are Google's idea. Remember that many libraries and museums sustain themselves by being the gatekeepers (and often, toll collectors) for their collections. Even places like the Victoria & Albert Museum, which has done its own stunning digitization project <http://images.vam.ac.uk/>, require a license for anything beyond personal study.

"Public domain" isn't as simple as it seems, unfortunately. I wish it were--I'd love to have all of the world's literature instantly accessible on my laptop--but the world isn't there yet.

Roger Browne [PersonRank 10]

17 years ago #

Brewster Kahle is on the ball here. The contracts linked to by Daniel Brandt are quite eye-opening. See for example section 4.10.b of the UC contract:

University shall have the right to distribute ... all or any portion of public domain works contained in the University Digital Copy [to recipient institutions]... for research, scholarly and academic purposes ... Prior to any distribution by University to a recipient institution, Google and the Recipient Institution must have entered into a written agreement on terms acceptable to Google...

So yeah, the University is not even allowed to give Internet Archive a copy of their public domain images unless Google agrees. Despite what Anonymous Googler suggests, that contract clause is certainly not there to avoid triggering Google's spam detector!

Kevin [PersonRank 0]

17 years ago #

In this film http://www.guba.com/watch/2000796349
at the 30 minute mark, the director of the Stanford Library says that Google agreed to provide Stanford with copies of the scanned books.

This directly contradicts Brewster Kahle, doesn't it?

Brian M. [PersonRank 10]

17 years ago #

Not really – Brewster was stating that the terms under which the data will be given to the libraries pretty much make Google the sole distributor.

Jeff Ubois [PersonRank 0]

17 years ago #

Anonymous Googler: The people I've met at Google have been bright, passionate, and committed personally to improving access to the world's information, but "Trust Us" doesn't cut it as an answer when it is coming from a giant media company.

The non-disclosure agreements Google insists on are a big problem for the library community because they inhibit discussion among librarians about the most complex decisions they've ever had to make. And it's just not right that we have to use freedom of information act requests to get librarians to tell us what is going on.

The restrictions placed on the universities, and their ability to share data, are understandable (Google needs to profit), but problematic. It would be bad if the future of libraries rests on one commercial entity owning all the scans.

The argument that someone else can come in and re-scan the books if Google's terms prove bad is unsupportable. We've been through a scanning experience with microfilm, and in practice, we know that scans don't happen twice. Though the money for scanning may apparently be coming from Google only, the libraries involved in these projects are committing substantial resources; public money is going to benefit a private company. Perhaps that's appropriate given what the libraries get back, but the public discussions that should be happening are not.

Finally, Google's entire business is based on having free, automated access to other people's content on other people's servers. So it is odd to see Google insist that others should not have similar access to content on Google's servers.
