Digitization for Public Librarians

Why public libraries should digitize local history
Protecting the community's collective memory is part of the library's service mission. The skills to do so are easily acquired, the equipment is either already available or inexpensive to purchase, and libraries own or can readily obtain the rare items. No one else will bother, the number of available items is finite, and the work will benefit the community.

Copyright
This is a concern, but it should not paralyze you. Susan Kornfield has produced a guide for libraries that are digitizing. Some important points: copyright is not forever; it does NOT exist to protect the author; copying or digitizing does not confer copyright; and libraries acting in good faith have some safeguards.

Standards and Best Practices
The goal of standards and best practices is to handle the item once. The problems with standards are: they will change, they offer only temporary protection and they are beyond the means of most public libraries. The answer may be responsible minimalism. Don't let the perfect be the enemy of the possible. 

Access and Preservation
Digitizing a document is different from preserving it. The goal of preservation is to provide access to an original item. Digitization complements preservation by protecting the original and providing far superior access.

Some Concepts and Definitions

File Formats and Display Types
Masters and derivatives - A master file is saved in the highest quality possible. Derivatives are files created from masters for display purposes.
TIFF or TIF (Tagged Image File Format) - A widely used bitmapped graphics file format that handles monochrome, grayscale, and 8- and 24-bit color. Master scan files are often saved in this format.
GIF (Graphics Interchange Format) - A popular bitmapped graphics file format that is widely used on the Web because the files compress well without any loss of image data.
JPEG or JPG (Joint Photographic Experts Group) - A format that is becoming very popular due to its high compression capability. It provides lossy compression. The rate of compression can be controlled, resulting in high or low quality images.
Thumbnails - Small versions of an image that are linked to a larger version.
TXT (plain text) - Text that is in a raw (ASCII) form. It includes no formatting, making it very portable.
HTML (HyperText Markup Language) - The document format used on the World Wide Web.  HTML files are small, making them easy to store, transmit and download. Browsers allow the end user to select the typeface and font size for displaying HTML, so that it is user friendly and accessible. HTML which includes graphics is sometimes called Illustrated HTML.
PDF (Portable Document Format) - A format that allows documents created on one platform to be displayed and printed exactly the same on another. Adobe Acrobat Reader is free software used to read and display PDF files. 
Page Views - Pages can be presented as a graphic. GIF is often used. 

Scanner jargon
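DPI (Dots Per Inch) - The measurement of the resolution of printing and display systems. It is a square measurement: the resolution applies both across and down the page, so doubling the DPI quadruples the number of dots. The number of dots helps determine file size.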
Bi-Tonal - Scanning done in black and white is called bi-tonal. Also referred to as line drawing. 
Grayscale - A series of shades from black to white, usually 256 of them. 
Color - Color usually ranges from 256 shades (8 bit) to 16.7 million shades (24 bit).
OCR (Optical Character Recognition) - After a page of text has been scanned, OCR software examines the resulting dots, converts the dots to letters and so recreates the text.
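
The way resolution and bit depth drive file size can be sketched with a little arithmetic. The Python below is only a rough estimate of uncompressed size; the function name and the 8.5 x 11 inch page are assumptions chosen for illustration, and real files will differ once compression and file headers are taken into account.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the raw (uncompressed) size of a scan in bytes."""
    pixels_wide = round(width_in * dpi)
    pixels_high = round(height_in * dpi)
    total_bits = pixels_wide * pixels_high * bits_per_pixel
    return total_bits // 8

# Assumed example page: 8.5 x 11 inches.
for label, dpi, bits in [
    ("bi-tonal, 300 dpi", 300, 1),
    ("bi-tonal, 600 dpi", 600, 1),
    ("grayscale, 300 dpi", 300, 8),
    ("24-bit color, 300 dpi", 300, 24),
]:
    size = uncompressed_bytes(8.5, 11, dpi, bits)
    print(f"{label}: {size / 1_000_000:.1f} MB")
```

Going from 300 to 600 dpi quadruples the estimate, and moving from bi-tonal to grayscale at the same resolution multiplies it by eight.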

Other definitions can be found at TechEncyclopedia.

Types of Digital Projects

An Existing Digital Document. Local historians, genealogists and others often have interesting documents already in electronic format. These documents can be easily converted  to HTML for web publication.  The Index to Elsie's Scrapbook and Lincoln High School Graduates 1904-2000 are examples of this kind of document.

Retyping an Older Document. An older document can simply be entered into a word processing program, checked for accuracy, converted to HTML and then web published. A Short History of Wisconsin Rapids is an example of this method. This mimeographed document from the 1950s was retyped by library staff.  This method is suitable for text-only documents, where original format is less important than the content.
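
Most word processors can save directly as HTML, but the conversion itself is simple enough to sketch. The Python below is one possible approach, not the method used for the documents named above. It assumes the retyped text is plain text with blank lines between paragraphs; the function name and the sample title and text are made up for illustration.

```python
import html

def text_to_html(title, text):
    """Wrap plain text in a minimal HTML page, one <p> per paragraph."""
    # Assumption: paragraphs in the source text are separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    body = "\n".join(
        # Collapse line breaks inside a paragraph and escape &, < and >.
        "<p>" + html.escape(" ".join(p.split())) + "</p>"
        for p in paragraphs
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"<h1>{html.escape(title)}</h1>\n"
        f"{body}\n"
        "</body>\n</html>\n"
    )

# Made-up sample text, for illustration only.
sample = "The first settlers arrived by river.\n\nMills & dams followed."
page = text_to_html("A Sample Local History", sample)
print(page)
```

Escaping matters because characters such as & and < have special meaning in HTML and would otherwise display incorrectly.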

Scan and OCR. Text can be digitized using OCR. The resulting text will need to be proofed thoroughly, especially if the original is not laser quality. This proofing can be more time consuming than simply re-entering the document. Graphics can be scanned separately and combined with the text. If the text is used to create HTML, the resulting files are small and can be viewed in any browser. Since this method changes the format of the document, it is best used when the format is relatively unimportant. Centennial Story 1890-1990 is an example of OCR text with a separate graphic section, which reflects the original format. Each chapter of the book has been placed in a separate file for ease of access. The Appendix was updated to include additional information.

Page views. When the format or feel of the original document is important, page views allow that to be replicated. Since the master TIFF files are difficult to display, smaller GIF files, or files in some other format, are used for display. The Making of America site uses page views, with a large custom database to manage them.

Adobe Acrobat. If the original format of a document is important, Acrobat (not Acrobat Reader) can be used to replicate it. A page is scanned, usually creating a TIF file. Acrobat converts the TIF file into a PDF file. Acrobat can also take a series of scanned pages and combine them in one document (file), with important advantages in terms of display. Acrobat sells for about $250 and is a sophisticated program, but not beyond the ability of a dedicated amateur.

Wood County place names is a 130-page book scanned bi-tonally at 600 dpi. It is available as a 21 MB file or broken into sections. Acrobat gathers the scanned pages together and retains the flavor of the original, but the resulting file is much larger than an unformatted HTML file of the text would be.

Rules and Regulations of the T. B. Scott Free Public Library, a four-page pamphlet, was scanned in grayscale because it was originally printed on colored paper. The format of the original is an important part of the charm of this document, warranting its publication in PDF format. The text of this document would require only 17 KB in plain HTML, but takes 1.7 MB in grayscale TIFF or PDF.

Official Historical Program - Wood County Centennial is a 28-page document with dozens of photographs scattered throughout the text. Its small size made it appropriate for in-house digitization. The text was scanned bi-tonally, with grayscale photographs pasted in. Acrobat was used to gather the scanned pages.

Scan and Post. If an item is mainly graphical (such as photographs), the graphics can be scanned and web published. Scanners are inexpensive and usually include graphic software that will help clean up the files and save them in the best file format. The Young Postcards are an example of this method. Note that thumbnails (small graphics linked to the larger version) have been used to make it easy to browse. 
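
A thumbnail page is just a list of small images, each wrapped in a link to its larger version. The following Python is a minimal sketch of how such a page fragment might be generated. The function name, file names and captions are hypothetical, and it assumes the thumbnails have already been created with graphics software.

```python
import html

def gallery_html(items):
    """Build an HTML fragment in which each thumbnail links to its larger image.

    items: list of (full_image, thumbnail, caption) tuples.
    """
    links = []
    for full_image, thumbnail, caption in items:
        links.append(
            f'<a href="{html.escape(full_image)}">'
            f'<img src="{html.escape(thumbnail)}" alt="{html.escape(caption)}">'
            "</a>"
        )
    return "\n".join(links)

# Hypothetical file names and captions, for illustration only.
postcards = [
    ("images/postcard01.jpg", "thumbs/postcard01.jpg", "Sample postcard 1"),
    ("images/postcard02.jpg", "thumbs/postcard02.jpg", "Sample postcard 2"),
]
fragment = gallery_html(postcards)
print(fragment)
```

The caption goes in the alt attribute so that the page remains usable with screen readers or when images fail to load.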

Databases. Databases can be used to track and display photographs. Kansas City Public Library uses a database to provide access to its collection of 14,000 photographs. This is too large a collection to be browsed, so databases organize it for display. Collections of data such as obituary indexes and cemetery records are also candidates for digitization. Identifying copyright can be a problem, since these kinds of records are usually compiled by groups. McMillan has loaded databases as static pages, but it is only slightly more difficult to load a database as a dynamic document.

Index to the 1928 Standard Atlas of Wood County, Wisconsin / compiled by Marlys Manley Steckler. This was originally a database created with Excel. Simply using SAVE AS HTML created the static pages.
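
For spreadsheets that do not offer a save-as-HTML option, a similar static page can be produced from a CSV export. The Python below is a sketch of that alternative, not the method used for the atlas index. The function name and the sample rows are made up, and it assumes the first row of the export holds the column headings.

```python
import csv
import html
import io

def csv_to_html_table(csv_text):
    """Convert CSV text (first row = column headings) to a static HTML table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return "<table></table>"
    header, data = rows[0], rows[1:]
    lines = ["<table>"]
    lines.append(
        "<tr>" + "".join(f"<th>{html.escape(h)}</th>" for h in header) + "</tr>"
    )
    for row in data:
        lines.append(
            "<tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in row) + "</tr>"
        )
    lines.append("</table>")
    return "\n".join(lines)

# Made-up rows standing in for an exported index; not real atlas data.
sample_csv = 'Name,Township,Page\n"Doe, Jane",Sample Township,12\n'
table = csv_to_html_table(sample_csv)
print(table)
```

Using the csv module rather than splitting on commas keeps quoted entries such as "Doe, Jane" in a single cell.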

Stevens Point Area Obituary Index. This on-line database can be searched and updated. 
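
What "searched and updated" means in practice can be sketched with SQLite, which ships with Python. This is not how the Stevens Point index is implemented. The table layout, function name and sample entries are all assumptions made for illustration.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; a real index
# would be stored in a file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obituaries (name TEXT, newspaper TEXT, pub_date TEXT)")
conn.executemany(
    "INSERT INTO obituaries VALUES (?, ?, ?)",
    [
        # Made-up entries, stored as "Surname, Given name".
        ("Doe, Jane", "Sample Gazette", "1950-01-15"),
        ("Doe, John", "Sample Gazette", "1962-07-03"),
        ("Roe, Richard", "Sample Gazette", "1971-11-20"),
    ],
)

def search_by_surname(conn, surname):
    """Return all entries whose name starts with the given surname."""
    cur = conn.execute(
        "SELECT name, newspaper, pub_date FROM obituaries "
        "WHERE name LIKE ? ORDER BY name",
        (surname + "%",),
    )
    return cur.fetchall()

results = search_by_surname(conn, "Doe")
print(results)

# Updating the index is a single INSERT; the new entry is searchable at once.
conn.execute(
    "INSERT INTO obituaries VALUES (?, ?, ?)",
    ("Doe, Mary", "Sample Gazette", "1980-03-09"),
)
print(search_by_surname(conn, "doe"))
```

The search term is passed as a parameter (the ? placeholder) rather than pasted into the query, which is the safer habit once a search form is open to the public.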

Professional services. Some jobs are too complicated or too time consuming for library staff. Artwork of the Wisconsin River Valley, part 1 is part of a photographic series originally published in nine parts. Due to its combination of text and graphics, it was sent to a commercial vendor (Northern Micrographics) for conversion. The project was funded by a grant from Consolidated Papers Foundation, Inc.

Out-Source or In-House?
When a library uses an outside professional: 

When a library does a project in-house:

What Next?

Once you have a digital text, there are two main options to provide access:

Digitizing on a Budget (or without a Budget)

Many public libraries that might want to digitize documents have little or no budget for the process. They do not have to sit on the sidelines and wait for some indefinite future when funding is plentiful or digitization is free. It is possible to move forward without committing significant staff time or funding.

Projects do not need to be massive. In most cases, anything you do will be the largest such project in your community. The more you do, the more support and assistance you will find. Demonstrate the library's interest and competence. Additional opportunities will present themselves.

Recommended Internet Sites

Digitization in Public Libraries Web Site - Handouts and advice from the 2001 PLA Spring Symposium on digitization. Be sure to check the Resource list.
The On-Line Books Page - Probably the most complete of several attempts at listing on-line books, with over 13,000 listings. 

This page was prepared by Andy Barnett, Assistant Director of McMillan Memorial Library. All examples used are from the Library's collection of digital historical titles. He welcomes comments and suggestions.

This page is located at http://www.mcmillanlibrary.org/programs/digitize.html

Last updated August 30, 2006