read
on a cell phone. Or Wattpad, a free service for reading and sharing stories on a
mobile phone. Once downloaded to your phone, the service gives instant access to
works from Project Gutenberg.
As a volunteer, the wisest thing to do is to choose a book published before
1923. Copyright clearance must also be confirmed before work begins on any book,
by sending a copy of the title page and verso page (even if the latter is blank)
to Michael Hart. The pages should be sent as scans, to be uploaded to the
website. People who cannot create scans can send photocopies by postal mail. The
pages are then filed, either on paper or electronically, so that the proof will
be available in the future to demonstrate, if necessary, that the book is in the
public domain under US law.
Project Gutenberg doesn't release any book until the book's copyright status has
been confirmed.
What exactly is entailed once copyright clearance is received? Digitization is
done by scanning the book page after page to get "image" files. Then volunteers
run OCR (Optical Character Recognition) software to convert the "image" files
into text files. Then each text file is proofread (i.e. re-read and corrected)
by comparing it to the "image" file or the original page of the print version.
A good OCR package makes an average of 10 mistakes per page, and many more
occur if the quality of the scanner or the OCR package is poor. The book is
proofread twice on the computer screen by two different people, who make any
corrections necessary. When the original is in poor condition, as with very old
books, it is keyed in manually, word by word. Some volunteers also prefer to
type short texts, or works they particularly like, themselves. But most books
are scanned, "OCRized" and proofread.
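The proofreading step described above can be illustrated with a minimal sketch
in Python. It assumes the raw OCR output and the proofread version of a page are
both available as plain strings; the function name and the sample sentence are
hypothetical, and this is not the tooling Project Gutenberg volunteers actually
use. It simply lists the words that a proofreader changed.

```python
import difflib


def list_corrections(ocr_text, proofread_text):
    """Return (ocr_words, corrected_words) pairs for every word-level change."""
    ocr_words = ocr_text.split()
    good_words = proofread_text.split()
    # autojunk=False keeps frequent words from being ignored on long pages.
    matcher = difflib.SequenceMatcher(a=ocr_words, b=good_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # "replace", "delete" or "insert"
            changes.append(
                (" ".join(ocr_words[i1:i2]), " ".join(good_words[j1:j2]))
            )
    return changes


# Hypothetical sample: two typical OCR confusions ("b" for "h", "0" for "o").
sample_ocr = "Tbe quick brown f0x jumps"
sample_fixed = "The quick brown fox jumps"
for before, after in list_corrections(sample_ocr, sample_fixed):
    print(f"{before!r} -> {after!r}")
```

Counting the pairs returned for one page would give a rough figure to set
against the average of 10 mistakes per page mentioned above.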
Unlike digitization in "image format", which consists only of scanning the
pages, digitization in "text format" adds the OCR step, with two benefits: a)
the book can be copied, indexed, searched, analyzed and compared with other
books; b) it is possible to search the content of the book with the "Find"
button available in any browser or other software, without a specific search
engine.
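To make the point about searching concrete, here is a minimal sketch in Python
of what a "Find" function does with a book in "text format". The function name
and the sample text are hypothetical; the only assumption is that the book is
available as a plain string.

```python
def find_in_text(book_text, query):
    """Return (line_number, line) pairs for lines containing the query."""
    needle = query.casefold()  # case-insensitive comparison
    return [
        (number, line.strip())
        for number, line in enumerate(book_text.splitlines(), start=1)
        if needle in line.casefold()
    ]


# Hypothetical three-line sample standing in for a full book.
sample_book = "Chapter One\nThe library was quiet.\nNo one spoke in the Library."
for number, line in find_in_text(sample_book, "library"):
    print(number, line)
```

Nothing comparable is possible with a scanned page alone, since an "image" file
contains pixels rather than characters.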
The advantages of digitization in "text format" are numerous. It produces a
smaller file that is easier to send, unlike digitization in "image format",
which produces a bulky "photo" file. Unlike other formats, the files are
accessible for low-bandw