Fugitive documents elude preservationists

GPO, Library of Congress turn to Web harvesting

By Aliya Sternstein
Published on May 9, 2005

Government Printing Office officials, who play a significant role in preserving government information, want to capture fugitive publications: documents that federal agencies have published on the Web but for which GPO's database holds no copy or record.

To recover such documents for preservation, GPO officials are interested in new software technologies such as Web harvesting, and they are reviewing proposals from companies that make such software.

Web harvesting is one of three activities that will contribute to what GPO officials say will be the Future Digital System. In addition, GPO officials plan to convert paper-based government information to digital formats and deposit electronic documents in libraries that are part of GPO's Federal Depository Library Program.

Other federal agencies are also interested in harvesting. As part of the Library of Congress' National Digital Information Infrastructure and Preservation Program, LOC officials awarded grants last fall to several academic and other institutions for creating technologies that preserve Web content and its context.

Information about electronic content, such as the server on which information is stored, will be maintained as part of the library's preservation program. The server's location and the date the content was published are also important to preserve, library officials say. With that additional information, people could compare the White House Web site on the last day of the Clinton administration, for example, with the site's appearance on the last day of the first Bush administration.
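To make the role of that contextual information concrete, here is a minimal Python sketch of a capture record. It is an illustration only, not a description of the library's system: the `CaptureRecord` fields and the `latest_on_or_before` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime
import hashlib


@dataclass(frozen=True)
class CaptureRecord:
    """Hypothetical record: one stored copy of a page plus its context."""
    url: str
    server_host: str       # server the content was retrieved from
    captured_at: datetime  # when the copy was taken
    sha256: str            # fingerprint of the stored bytes


def make_record(url, server_host, body, captured_at):
    """Build a record for the raw bytes of one captured page."""
    digest = hashlib.sha256(body).hexdigest()
    return CaptureRecord(url, server_host, captured_at, digest)


def latest_on_or_before(records, url, when):
    """Return the most recent capture of `url` taken at or before `when`."""
    candidates = [r for r in records if r.url == url and r.captured_at <= when]
    return max(candidates, key=lambda r: r.captured_at, default=None)
```

With records like these, comparing a site on two dates amounts to calling `latest_on_or_before` once per date and comparing the two stored copies. Without the capture time and server, the copies could not be placed in context.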

Web harvesting, sometimes called crawling or spidering, is more than searching for and discovering information. Harvesting techniques are used to download the code, images, documents and any other files needed to reproduce a Web site after it has been taken down.

Search engines perform only the first step in preserving Web sites for future generations. A search engine typically finds a Web site and indexes it without storing it.
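The difference between indexing and harvesting can be sketched in a few lines of Python. The example below is a simplified illustration using only the standard library, not the software GPO or the library is evaluating: `ResourceCollector` and `plan_harvest` are hypothetical names, and a real harvester would also fetch and store each file.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collects the URLs a harvester would need from one HTML page."""

    # Tag name -> attribute that carries the URL.
    URL_ATTRS = {"a": "href", "img": "src", "script": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pages = []   # hyperlinks to follow later
        self.assets = []  # images, scripts and style sheets the page needs

    def handle_starttag(self, tag, attrs):
        attr_name = self.URL_ATTRS.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if not value:
            return
        absolute = urljoin(self.base_url, value)
        if tag == "a":
            self.pages.append(absolute)
        else:
            self.assets.append(absolute)


def plan_harvest(base_url, html):
    """List what to store for this page and which links to visit next."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return {"store": [base_url] + collector.assets, "follow": collector.pages}
```

An indexer could stop after reading the page text. A harvester keeps going: it downloads everything in `store` so the page can be rebuilt later, then repeats the process for each URL in `follow`.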

Officials at the National Archives and Records Administration also have an interest in Web harvesting as they develop NARA's Electronic Records Archives. Harvested e-documents, however, are only one of the kinds of government decision-making records for which NARA is creating the archives.

Hard to capture

Librarians agree that many challenges face anyone who attempts to pick through government content on the Web looking for materials to preserve. "Dynamic, interactive aspects are hard to capture," said Martha Anderson, project manager for LOC's Office of Strategic Initiatives.


