The Wayback Machine
What is the Wayback Machine?
The Wayback Machine is a service that allows people to visit archived versions of Web sites. Visitors to the Wayback Machine can type in a URL, select a date range, and then begin surfing on an archived version of the Web. Imagine surfing circa 1999 and looking at all the Y2K hype, or revisiting an older version of your favorite Web site. The Wayback Machine can make all of this possible.
How large is the Wayback Machine?
The Wayback Machine contains almost 1.5 petabytes of data.
How was the Wayback Machine made?
Alexa Internet, in cooperation with the Internet Archive in San Francisco, designed a three-dimensional index that allows browsing of web documents over multiple time periods, and turned this unique feature into the Wayback Machine.
Who was involved in the creation of the Wayback Machine?
The original idea for the Wayback Machine began in 1996, when the Internet Archive in San Francisco first began archiving the web. Five years later, with over 100 terabytes and a dozen web crawls completed, the Internet Archive made the Wayback Machine available to the public. The Internet Archive in San Francisco, and later the one at BA (Bibliotheca Alexandrina), relied on donations of web crawls, technology, and expertise from Alexa Internet and others. The Wayback Machine is owned and operated by the Internet Archive in San Francisco, which later donated a version of the Wayback Machine to BA so that it could present the mirror archive to its users.
How can I get my site included in the Wayback Machine?
Alexa Internet has been crawling the web since 1996, which has resulted in a massive archive. If you have a web site, you would like to ensure that it is saved for posterity in the BA Internet Archive, and you have searched the Wayback Machine and found no results, there are three ways to make Alexa aware of it.
Method 1: Visit Alexa's "Webmasters" page at http://pages.alexa.com/help/webmasters/index.html#crawl_site.
Method 2: If you have the Alexa toolbar installed, just visit the site.
Method 3: While visiting the site, use the 'show related links' feature in Internet Explorer, which uses the Alexa service.
Sites are usually crawled within 24 hours, and in no more than 48. Right now there is a 6-12 month lag between the date a site is crawled and the date it appears in the Wayback Machine.
How can I remove my site's pages from the Wayback Machine?
The BA Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine.
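As a rough illustration, a robots.txt file that asks one crawler to stay away from an entire site needs only two lines. The Python sketch below builds that text and checks it with the standard library's urllib.robotparser module. The user-agent token ia_archiver is an assumption (it is commonly given as the name of Alexa's crawler, but this FAQ does not state it), and example.com is a placeholder.

```python
from urllib.robotparser import RobotFileParser


def site_exclusion_rules(user_agent: str) -> str:
    """Return robots.txt text asking one crawler to skip the whole site."""
    return f"User-agent: {user_agent}\nDisallow: /\n"


# "ia_archiver" is an assumed crawler name, not taken from this FAQ.
ROBOTS_TXT = site_exclusion_rules("ia_archiver")

# Check the rules the way a well-behaved crawler would read them.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(ROBOTS_TXT)
print(parser.can_fetch("ia_archiver", "http://example.com/index.html"))
print(parser.can_fetch("SomeOtherBot", "http://example.com/index.html"))
```

The printed text is what would be saved as robots.txt at the top level of the web server. The two checks show that the rules apply only to the named crawler.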
Can I link to old pages on the Wayback Machine?
Yes! The Wayback Machine is built so that it can be used and referenced. If you find an archived page that you would like to reference on your Web page or in an article, you can copy the URL. You can even use fuzzy URL matching and date specification... but that's a bit more advanced.
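A minimal sketch of date specification, assuming the link layout seen in the example archived URL elsewhere in this FAQ (a prefix, a 14-digit yyyymmddhhmmss date code, then the original address). The helper name wayback_link is made up for illustration, and a mirror may use a different host than the one shown.

```python
from datetime import datetime

# Host and path layout follow the example archived URL in this FAQ;
# a mirror may use a different host, so treat the prefix as an assumption.
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_link(original_url: str, when: datetime) -> str:
    """Build a link asking for the capture of original_url dated `when`."""
    stamp = when.strftime("%Y%m%d%H%M%S")  # yyyymmddhhmmss
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


print(wayback_link("http://www.yahoo.com/", datetime(2000, 2, 29, 12, 33, 40)))
# prints http://web.archive.org/web/20000229123340/http://www.yahoo.com/
```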
Can people download sites from the Wayback Machine?
Why am I getting broken or gray images on a site?
Broken images (when there is a small red "x" where the image should be) occur when the images are not available on our servers. Usually this means that they were not archived. Gray images are the result of robots.txt exclusions. The site in question may have blocked robot access to their images directory.
What do 'failed connection' and other error messages mean?
Below is a list of the main error messages you will see while searching the Wayback Machine. If you see an error message that does not have the Wayback Machine logo in the upper left corner, you are most likely looking at an archived page or the live web.
Failed Connection: The server that the particular piece of information lives on is down. Generally these clear up within two weeks.
Robots.txt Query Exclusion: A robots.txt file is something that a site owner puts on their site to keep crawlers from crawling it. The Internet Archive crawlers retroactively respect all robots.txt files.
Path Index Error: A path index error message refers to a problem in our database wherein the information requested is not available (generally because of a machine or software issue; however each case can be different). We cannot always completely fix these errors in a timely manner.
Not in Archive: Generally this means that the site archived has a redirect on it and the site you are redirected to is not in the archive or cannot be found on the live web.
Why are there no recent archives in the Wayback Machine?
The Internet Archive does not add pages until at least 6 months after they are collected, because of the time-delayed donation from Alexa. Updates can take more than 12 months in some cases.
Archived files must be added to the Wayback Machine to be accessed by users. There is no other way to access files before they appear in the Wayback Machine.
How did I end up on the live version of a site? Or: I clicked on date X, but now I am on date Y. How is that possible?
Not every date for every archived site is 100% complete. When you are surfing an incomplete archived site, the Wayback Machine will grab the closest available date to the one you are in for the links that are missing. In the event that we do not have the link archived at all, the Wayback Machine will look for the link on the live web and grab it if available. Pay attention to the date code embedded in the archived URL. This is the string of digits in the middle; it reads as yyyymmddhhmmss. For example, in the URL http://web.archive.org/web/20000229123340/http://www.yahoo.com/ the site was crawled on Feb 29, 2000 at 12:33:40.
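To check which date you have landed on, the date code can be read mechanically. The sketch below is one way to do it in Python. The function name and the regular expression are illustrative assumptions, and the pattern only covers the plain URL layout shown in the example above.

```python
import re
from datetime import datetime
from typing import Tuple

# Layout from the example above: .../web/<14 digits>/<original URL>
_ARCHIVED_URL = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_archived_url(url: str) -> Tuple[datetime, str]:
    """Split an archived URL into its capture time and the original URL."""
    match = _ARCHIVED_URL.match(url)
    if match is None:
        raise ValueError(f"not an archived URL in the expected layout: {url}")
    stamp, original = match.groups()
    # yyyymmddhhmmss -> datetime; raises ValueError for impossible dates
    return datetime.strptime(stamp, "%Y%m%d%H%M%S"), original


captured, original = parse_archived_url(
    "http://web.archive.org/web/20000229123340/http://www.yahoo.com/"
)
print(captured.isoformat(), original)
# prints 2000-02-29T12:33:40 http://www.yahoo.com/
```

Comparing the parsed time of the page you are on with the date you clicked shows whether the Wayback Machine substituted a nearby capture.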
How do I cite Wayback Machine URLs in MLA format?
The Internet Archive in San Francisco asked the Modern Language Association (MLA) to help with the subject of how to cite an archived URL in correct format. MLA said that there is no established format for resources like the Wayback Machine, but it's best to err on the side of more information. You should cite the webpage as you would normally, and then give the Wayback Machine information. They provided the following example: McDonald, R. C. "Basic Canary Care." _Robirda Online_. 12 Sept. 2004. 18 Dec. 2006
The BA Internet Archive
What type of machinery is used in the BA Internet Archive?
Much of the BA Internet Archive is stored on hundreds of slightly modified x86 servers. The computers run the Linux operating system. Each computer has 512 MB of memory and can hold just over 1.5 terabytes of data on ATA disks.
How are dynamic pages being archived?
Why are some sites harder to archive than others?
If you look at the collection of archived sites, you will find some broken pages, missing graphics, and some sites that aren't archived at all. As a general rule of thumb, simple HTML is the easiest to archive. Here are some things that make it difficult to archive a web site:
- Robots.txt: The archive respects robot exclusion headers.
- Server side image maps: Like any functionality on the web, if it needs to contact the originating server in order to work, it will fail when archived.
- Unknown sites: The archive contains crawls of the Web completed by Alexa Internet. If Alexa doesn't know about your site, it won't be archived. Use the Alexa Toolbar (available at www.alexa.com), and it will know about your page. Or you can visit Alexa's Archive Your Site page at http://pages.alexa.com/help/webmasters/index.html#crawl_site.
- Orphan pages: If there are no links to your pages, the robots won't find them (the robots don't enter queries in search boxes).
Some sites are not available because of robots.txt or other exclusions. What does that mean?
The Standard for Robot Exclusion (SRE) is a means by which web site owners can instruct automated systems not to crawl their sites. Web site owners can specify files or directories that are disallowed from a crawl, and they can even create specific rules for different automated crawlers. All of this information is contained in a file called robots.txt. While robots.txt has been adopted as the universal standard for robot exclusion, compliance with robots.txt is strictly voluntary. In fact, most web sites do not have a robots.txt file, and many web crawlers are not programmed to obey the instructions anyway.
However, Alexa Internet, the company that crawls the web for the Internet Archive, does respect robots.txt instructions, and even does so retroactively. If a web site owner decides they would prefer not to have a web crawler visiting their files and sets up robots.txt on the site, the Alexa crawlers will stop visiting those files and will make unavailable all files previously gathered from that site. This means that sometimes, while using the Wayback Machine, you may find a site that is unavailable due to robots.txt (you will see a "robots.txt query exclusion error" message).
Sometimes a web site owner will contact us directly and ask us to stop archiving a site, and we endeavor to comply with these requests. When you come across a "blocked site error" message, that means that a site owner has made such a request and it has been honored.
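To make the per-directory and per-crawler rules concrete, here is a small Python sketch that evaluates a made-up robots.txt with the standard library's urllib.robotparser module. The crawler name ia_archiver, the directory names, and example.com are all assumptions chosen for illustration. The sketch shows only how a compliant crawler reads the rules, and it does not model the retroactive removal described above.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt. "ia_archiver" is an assumed crawler name.
# Every crawler is kept out of /private/; one crawler is also kept out of /images/.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /images/
Disallow: /private/

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, path: str) -> bool:
    """Report whether the rules above let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, "http://example.com" + path)


for agent in ("ia_archiver", "SomeOtherBot"):
    for path in ("/index.html", "/images/logo.gif", "/private/notes.html"):
        verdict = "allowed" if allowed(agent, path) else "disallowed"
        print(f"{agent:12} {path:22} {verdict}")
```

With these rules, the named crawler may fetch the pages but not the images directory, while every other crawler is turned away only from /private/.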
Can I search the Archive?
Using the Wayback Machine, it is possible to search for the names of sites (URLs) contained in the Archive and to specify date ranges for your search. We hope to implement a full text search engine at some point in the future.
Why isn't the site I'm looking for in the archive?
Some sites may not be included because the automated crawlers were unaware of their existence at the time of the crawl. It's also possible that some sites were not archived because they were password protected, blocked by robots.txt, or otherwise inaccessible to the automated crawling systems. Site owners might have also requested that their sites be excluded from the Wayback Machine. When this has occurred, you will see a "blocked site error" message. When a site is excluded because of robots.txt you will see a "robots.txt query exclusion error" message.
How do you protect my privacy if you archive my site?
Like a public library, the Archive provides free and open access to its collections to researchers, historians, and scholars. Our cultural norms have long promoted access to documents that were, but no longer are, publicly accessible.
Given the rate at which the Internet is changing, the average life of a Web page is only 77 days. If no effort is made to preserve it, it will be entirely and irretrievably lost. Rather than let this moment slip by, we are proceeding with documenting the growth and content of the Internet, using libraries as our model.
How can I get a copy of the pages on my Web site? If my site got hacked or damaged, could I get a backup from the BA Archive?
What does it mean when a site's archive data has been "updated"?
Alexa's automated systems crawl the web every few months or so, and each time only about 50% of all pages on the web are found to have changed since the previous visit. This means that much of the content in the archive is duplicate material. If you don't see an asterisk (*) next to an archived document, then the content on the archived page is identical to the previously archived copy.
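The asterisk rule can be sketched as a comparison of each capture with the one before it. The Python below is only an illustration: comparing SHA-1 digests is a stand-in technique, not a description of how the archive actually detects changes, and treating the first capture as "updated" is an assumption.

```python
import hashlib
from typing import List, Tuple


def mark_updates(captures: List[Tuple[str, bytes]]) -> List[str]:
    """Append ' *' to each capture label whose content differs from the previous one.

    `captures` holds (label, content) pairs in chronological order.
    """
    marked = []
    previous_digest = None  # nothing to compare the first capture against
    for label, content in captures:
        digest = hashlib.sha1(content).hexdigest()
        changed = digest != previous_digest
        marked.append(f"{label} *" if changed else label)
        previous_digest = digest
    return marked


history = [
    ("Jan 10", b"<html>version one</html>"),
    ("Mar 02", b"<html>version one</html>"),  # identical to the previous capture
    ("Jun 15", b"<html>version two</html>"),
]
print(mark_updates(history))
# prints ['Jan 10 *', 'Mar 02', 'Jun 15 *']
```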
Why is the BA Internet Archive maintaining a copy of websites? What makes the information useful?
Most societies place importance on preserving artifacts of their culture and heritage. Without such artifacts, civilization has no memory and no mechanism to learn from its successes and failures. Our culture now produces more and more artifacts in digital form. The Archive's mission is to help preserve those artifacts and create an Internet library for researchers, historians, and scholars.
Do you archive email? Chat?
No, we do not archive chat systems or personal email messages that have not been posted to Usenet bulletin boards or publicly accessible online message boards.
Do you archive all the sites on the Web?
No, we archive only publicly accessible Web pages. We do not archive pages that require a password to access, pages tagged for "robot exclusion" by their owners, pages that are only accessible when a person types into and sends a form, or pages on secure servers. If a site owner properly requests removal of a Web site, we will exclude that site from the Wayback Machine.
Is there any personal information in these collections?
We archive Web pages that are publicly accessible. These may include pages with personal information.
Who has access to the collections? What about the public?
Anyone can access our collections through our website http://archive.bibalex.org. The web archive can be searched using the Wayback Machine.
How do I contact the BA Internet Archive?
All questions about the Wayback Machine, or other BA Internet Archive projects, should be addressed to archive AT bibalex DOT org
What software can play the downloaded movies?
VLC Media Player is the most versatile player we've found for playing the wide variety of movies found in the Archive. And, it's free! We also recommend MPlayer.
MPEG1 (VCD) most players;
MPEG2 (DVD) freeware VLC, shareware player from http://www.elecard.com, or for-pay QuickTime6 plugin:
For Mac OSX and 9:
MPEG1 (VCD) most players;
MPEG2 (DVD) freeware VLC (http://www.videolan.org/) or the for-pay QuickTime6 add-on (see http://www.apple.com/quicktime/products/mpeg2playback/).
Some Mac users have written to us suggesting MPlayer (OS X), BBDEMUX, and MPEG2DECX -- free on http://www.versiontracker.com.
Why do I get errors when I try to play a movie?
The best all-around, free player is VLC Media Player; it handles most of the movie files you will find on this site. If you're seeing errors when you try to play movies, please try downloading VLC and using that instead. This clears up many people's problems.
Here are some other possible problems:
1. There is heavy traffic to our site. If you experience a delay, please try again later or at a different time of day.
2. You're behind a firewall and the firewall software is attempting to modify incoming bits. Contact your network or firewall administrator.
3. Your Internet connection went down or timed out. Check with your ISP or network administrator to see if there's a special policy about keeping a connection live.
4. If your browser seems to hang after a "100% downloaded" message, check to see that you have sufficient hard-disk and TMP disk space. Rebooting the system sometimes helps.
5. You are trying to play an MPEG-2 file on a platform other than Windows or Linux. At present, you need VLC (http://www.videolan.org) or the for-pay QuickTime6 add-on to play MPEG-2 files on the Macintosh. Please contact us at archive AT bibalex DOT org if you have information about other players that work on platforms other than Windows.
6. Your player tried to stream the movie, and it isn't streamable. Download the movie first, and then play it. (Right-click > Save As)
7. Some conflict exists between your computer's configuration and the player you're using. Unfortunately, because PCs can be set up in so many different ways and because different standards exist for playing video, finding a player that will work is a hit-and-miss process. Try Rod Hewitt's evaluations of a number of players.
Can I use these movies in FinalCut Pro -- in the QuickTime format?
You can re-encode MPEG-2 movies to QuickTime for Final Cut Pro using Cleaner 5.0.2 with the following settings. There is no de-interlacing, so you don't lose anything. The files increase in size tenfold, so make sure you have enough hard-disk space. This procedure gives you QuickTime movies suitable for use with Final Cut Pro.
Cleaner 5: if you don't have version 5.0.2, you can download the 5.0.2 update from the terran.com site.
- output > quicktime, .mov
- tracks > process everything
- image > image size constrain to 720*480, display size normal, do not deinterlace, field dominance-SHIFT DOWN
- encode > apple DV-ntsc codec, millions of colors, spatial quality 100%, frame rate, same as source
- Audio > we're still not sure about which is best. Start with mono, 48kb, and experiment.
Some have had good results with their decoder cards. Compare a few films done both ways on a good monitor with scopes and see which method is best.
One of the simplest ways to transcode movies from MPEG-2 to DV format for editing is to use the freeware utility MPEG Streamclip (Mac OS X and Windows) available at squared5.com. It offers many settings and maintains video/audio sync.
Who owns the rights to these movies?
This will vary for practically every movie in the archive.
Are there other similar archives on the Web?
There are many sites that allow users to upload videos, but most of them only display very low quality video and/or do not let you download the videos.
As far as we know, this is the only site that presents high-quality downloadable movie data files with such liberal use restrictions. See the Links page at Prelinger Archives for a number of sites that may be useful to researchers or those seeking specific films or footage.
What are the encoding parameters used in digitizing the MPEG2 movies?
MPEG-2, DVD: 720 x 480 or 702 x 480 interlaced, with a system header on each pack for DVD compatibility. (Prelinger movies are 1/2 D1, 352 x 480 at 29.97 fps, which causes some players to make them look skinny.)
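The "skinny" effect is a matter of arithmetic: a player that shows each stored pixel as a square draws a 352 x 480 frame taller than it is wide. The Python sketch below works out how much horizontal stretch restores the picture, assuming the films are meant to be shown at 4:3. That display shape is an assumption, since this FAQ does not state it.

```python
from fractions import Fraction


def square_pixel_width(stored_height: int, display_aspect: Fraction) -> int:
    """Width, in square pixels, at which a frame of stored_height fills display_aspect."""
    return round(stored_height * display_aspect)


stored_width, stored_height = 352, 480
display_aspect = Fraction(4, 3)  # assumed intended shape; not stated in this FAQ

stored_aspect = Fraction(stored_width, stored_height)
stretch = display_aspect / stored_aspect  # horizontal scaling a player must apply

print(f"stored aspect : {float(stored_aspect):.3f}")
print(f"stretch needed: {float(stretch):.3f}x")
print(f"display size  : {square_pixel_width(stored_height, display_aspect)} x {stored_height}")
```

Under that assumption a player should widen the frame by a factor of 20/11, to 640 x 480 in square pixels. A player that skips this step shows the narrow stored frame as is.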
How can I make a DVD from a downloaded MPEG-2 movie?
To do this under Linux from the command line, you need a few common programs: mplayer, tcextract, mplex, dvdauthor, mkisofs, and growisofs. With the package manager of any modern Linux distribution, installing these should be quite simple.
1. The first command copies just the video out of input.mpeg and produces output.video:
mplayer input.mpeg -dumpstream -dumpfile /dev/stdout | tcextract -t vob -a 0 -x mpeg2 > output.video
2. The second command copies just the audio out of input.mpeg and produces output.audio:
mplayer input.mpeg -aid 128 -dumpaudio -dumpfile output.audio
3. The third command combines the video and audio back together again in a format ready for dvdauthor:
mplex -f 8 -V -o complete.vob output.video output.audio
4. This step creates the DVD structure. With any text editor, create a new file named dvdauthor.xml that contains the following vob entry. In a complete dvdauthor file, this entry sits inside a pgc element, and the enclosing dvdauthor element names the output folder (DVD_folder here) in its dest attribute:
<vob file="complete.vob" chapters="0,15:00,30:00,45:00,60:00"/>
The chapters attribute lists the points at which to place chapter marks on the DVD for jump navigation.
5. Now let dvdauthor create the DVD:
dvdauthor -x dvdauthor.xml
Done! You should now have a folder called "DVD_folder" with your movie. You can create an ISO or BIN image with mkisofs:
mkisofs -dvd-video -V "Movie Title" -o movie.iso DVD_folder/
You can play movie.iso in almost any video player or burn it to a DVD:
growisofs -speed=16 -dvd-compat -Z /dev/dvd=movie.iso
If you just want to burn the film to a DVD you do not have to create the movie.iso image file:
growisofs -speed=16 -dvd-video -dvd-compat -V "Movie Title" -Z /dev/dvd DVD_folder/
How did the Internet Archive digitize the films?
The Prelinger Archives films are held in original film form (35mm, 16mm, 8mm, Super 8mm, and various obsolete formats like 28mm and 9.5mm). Films were first transferred to Betacam SP videotape, a widely used analog broadcast video standard, on telecine machines manufactured by Rank Cintel or Bosch. The film-to-tape transfer process is not a real-time process: It requires inspection of the film, repair of any physical damage, and supervision by a skilled operator who manipulates color, contrast, speed, and video controls.
The videotape masters created in the film-to-tape transfer suite were then digitized at Prelinger Archives in New York City using an encoding workstation built by Rod Hewitt. The workstation is a 550 MHz PC with a FutureTel NS320 MPEG encoder card. Custom software, also written by Rod Hewitt, drove the Betacam SP playback deck and managed the encoding process. The files were uploaded to hard disk through the courtesy of Flycode, Inc.
The files were encoded at constant bitrates ranging from 2.75 Mbps to 3.5 Mbps. Most were encoded at 480 x 480 pixels (2/3 D1) or 368 x 480 (roughly 1/2 D1). The encoder drops horizontal pixels during the digitizing process, which during decoding are interpolated by the decoder to produce a 720 x 480 picture. (Rod Hewitt's site Coolstf shows examples of an image before and after this process.) Picture quality is equal to or better than most direct broadcast satellite television. Audio was encoded at MPEG-1 Layer 2, generally at 112 kbps. Both types of MPEG-2 movies have mono audio tracks.
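Because the bitrate is constant, the size of a download can be estimated from running time alone. The Python sketch below does the arithmetic for a hypothetical ten-minute film, assuming that 1 Mbps means 1,000,000 bits per second and that the quoted rate covers the whole stream. Neither assumption is confirmed in this FAQ, so treat the figures as rough.

```python
def stream_size_bytes(bitrate_mbps: float, duration_seconds: float) -> float:
    """Size of a constant-bitrate stream: bits per second times seconds, divided by 8."""
    return bitrate_mbps * 1_000_000 * duration_seconds / 8


duration = 10 * 60  # a hypothetical ten-minute film, in seconds
for rate in (2.75, 3.5):
    megabytes = stream_size_bytes(rate, duration) / 1_000_000
    print(f"{rate} Mbps for 10 minutes is about {megabytes:.2f} MB")
```

At these rates, ten minutes of film comes to roughly 206 to 263 MB, which is worth knowing before starting a long download.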
How can I re-code Prelinger Archive films to SVCD so I can watch them on a DVD player?
See archived version of www.moviebone.com/.
How can I get access to these movies on videotape or film?
Access to the movies stored on this site in videotape or film form is available to commercial users through Archive Films, representing Prelinger Archives for stock footage sales. Please contact Archive Films directly:
Archive Films/Archive Photos
75 Varick Street
New York, NY 10013
+1 (646) 613-4100 (voice)
+1 (646) 613-4140 (fax)
+1 (800) 876-5115 (toll free in the US)
sales AT archivefilms DOT com
Please visit www.prelinger.com/prelarch.html for more information on access to these and similar films. Prelinger Archives regrets that it cannot generally provide access to movies stored on this Web site in other ways than through the site itself. They recognize that circumstances may arise when such access should be granted, and they welcome email requests. Please address them to Rick Prelinger.
The Internet Archive does not provide access to these films other than through this site.
Are there restrictions on the use of the Prelinger Films?
The Prelinger movies are open and available to everyone without charges or fees. You are warmly encouraged to access, download, use, and reproduce these films in whole or part, in any medium or market throughout the world, for any purpose whatsoever. We would appreciate attribution or credit whenever possible, but do not require it.
Can you point me to resources on the history of ephemeral films?
See the bibliography and links to other resources at www.prelinger.com/ephemeral.html.
Why are there no post-1964 movies in the Prelinger collection?
Because of copyright law. While a high percentage of ephemeral films were never originally copyrighted or (if initially copyrighted) never had their copyrights properly renewed, copyright laws still protect most moving image works produced in the United States from 1964 to the present. Since this site exists to supply material to users without most rights restrictions, every title has been checked for copyright status. Those titles that either are copyrighted or whose status is in question have not been made available. For information on recent changes in copyright law, see the circular Duration of Copyright (in PDF format) published by the Library of Congress.