New virtual library holds 10 billion Web pages

CIOL Bureau

Andrea Orr

PALO ALTO: Dubbed the Wayback Machine (http://web.archive.org), the archive is the work of San Francisco entrepreneur Brewster Kahle, who for the past five years has been working on a library that would store not just documents like old newspapers that are normally preserved, but a sampling of everything that has ever been posted on the World Wide Web.

In an interview on Thursday, Kahle said the Wayback Machine includes some of the most amateur-looking Web pages dating all the way back to 1996, helping students to chronicle the evolution of Web design.

That is not all that is chronicled.

As the Web quickly became a worldwide communications tool over the past five years, corporations and governments often hastily slapped up information with much less care than they might have given to the contents of a bound book.

One archive from the official White House Web site now features remarks from then-President Clinton, who, almost five years prior to the Sept. 11 attacks, warned of the need to tighten airport security.

"I will direct that all airport and airline employees with access to secure areas be given criminal background checks and FBI fingerprint checks," Clinton said in a Sept. 9, 1996, address now preserved on the Wayback Machine. "I will direct the FAA to begin full passenger bag match for domestic flights at selected airports. And I'm proud to say that several of the commission's recommendations will be put into place immediately."

Kahle said he expects the new archive, which is free, will be used by everyone from journalists looking to dig up old statements by corporations to students and others just looking for kicks.

Kahle used it himself late one night to locate an out-of-print computer manual when one of his computers was giving him trouble. Much more salacious material, like some fly-by-night porn sites thought to be extinct, is also preserved.

The Wayback Machine, Kahle said, will contain copies of many defunct magazines, which may or may not have maintained their own archives. It also provides a way to track the ever-changing messages from different companies, such as the Internet advertising company DoubleClick Inc., which routinely amended its privacy policy over the past several years in response to customer complaints.

"If you don't actively keep a record of digital materials, they're gone," Kahle said. "This is a huge collection. It's a celebration of the Web."

While impressive in size, the Wayback Machine is more a testament to Kahle's commitment to save as much Web content as possible than it is to any advanced technology.

He archived the Web pages with basic Web crawlers that repeatedly sweep the entire Web, excluding some password-protected sites.

Because this Web sweep takes about two months, researchers will not necessarily be able to find a page from a particular day, but they should be able to get a sampling from a given two-month time frame.

(C) Reuters Limited.