the lowest common
denominator". It can be read, written, copied and printed by any simple text
editor or word processor on every computer in the world. It is the only format
compatible with 99% of hardware and software. It can be used as it is or to
create versions in many other formats. It will still be in use when other
formats have become obsolete (or already are, like the formats of a few
short-lived reading devices launched between 1999 and 2003). It is the assurance
that collections will never become obsolete and will survive future
technological changes. The goal
is to preserve the texts not only over decades but over centuries. There is no
other standard as widely used as ASCII right now, not even Unicode, a "universal"
encoding system created in 1991.
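As a rough illustration of what a plain ASCII text is in practice, here is a
minimal Python sketch that checks whether a text uses only the 128 characters of
7-bit ASCII and reports any character outside that range. The function names and
the sample string are hypothetical, made up for this example; this is not a tool
used by Project Gutenberg.

```python
def non_ascii_positions(text):
    """Return (line, column, character) for each character outside 7-bit ASCII."""
    found = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            # ASCII code points run from 0 to 127; anything above is not ASCII.
            if ord(ch) > 127:
                found.append((line_no, col_no, ch))
    return found


def is_plain_ascii(text):
    """True when every character of the text fits in 7-bit ASCII."""
    return not non_ascii_positions(text)


# Hypothetical sample: the second line contains one accented character.
sample = "plain text\nna\u00efve"
print(is_plain_ascii(sample))
print(non_ascii_positions(sample))
```

The characters such a check reports are the ones that would need a substitute
(for example an unaccented letter) before the text could be saved as an ASCII
file.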
Project Gutenberg also publishes eBooks in well-known formats like HTML, XML or
RTF. There are Unicode files too. Any other format provided by volunteers (PDF,
LIT, TeX and many others) is usually accepted, as long as they also supply an
ASCII version where possible.
Large-scale conversion into other formats, however, is handed over to other
organizations. Blackmask Online, for example, uses Project Gutenberg's
collections to offer thousands of free eBooks in eight different formats based
on the Open eBook (OeB) format. Manybooks.net converts Project Gutenberg's
eBooks into formats readable on PDAs. Bookshare.org, the main digital library
for the visually impaired community in the US, converts books from Project
Gutenberg into Braille and DAISY (Digital Accessible Information System)
formats.
What exactly is entailed once copyright clearance is received? Digitization is
done by scanning the book page by page to get "image" files. Then volunteers
run OCR (Optical Character Recognition) software to convert the "image" files
into text files. Then each text file is proofread (i.e. re-read and corrected)
by comparing it to the "image" file or the original page of the print version.
A good OCR package makes an average of 10 mistakes per page, and many more if
the quality of the scanner or the OCR package is poor.
The book is proofread twice on the computer screen by two different people, who
make any corrections necessary. When the original is in poor condition, as with
very old books, it is keyed in manually, word by word. Some volunteers
themselves prefer to type short texts, or works they particularly like. But most
books are sc