Overview of Sources of Archived Web Pages¶
The EDGI Website Monitoring project currently ingests data from three sources:
- Internet Archive Wayback Machine 
- Versionista (legacy) 
Internet Archive¶
The Internet Archive Wayback Machine stores snapshots of HTML pages. The Internet Archive at large stores a broader range of archived content, but we are focused on the Wayback Machine.
The original response from the Internet Archive is a list of ‘mementos’ (or ‘versions’, as we call them). Each line contains information about a version. At times, a line may be a ‘TimeMap’, which is essentially a link to the next list of mementos.
source_metadata : Byte sequences containing information separated by semi-colons of each version stored in Internet Archive. We extract the useful information i.e. the date and uri from each memento.
Example:
b'<http://web.archive.org/web/19970711094601/http://www.nasa.gov:80/>; rel="memento"; datetime="Fri, 11 Jul 1997 09:46:01 GMT",'
Versionista¶
Versionista returns a JSON blob which contains the following fields:
- account : The Versionista account we’re logging into to get the versions. We have two accounts - versionista1 & versionista2 
- siteName : The website of the file. 
- agency : The name of the Government agency which owns the website. 
- versionistaSiteUrl : A link to a website as it is stored on Versionista. 
- versionistaPageUrl : A link to a webpage as it is stored on Versionista. 
- pageUrl : The page’s true URL. 
- pageTitle : The title of the page as defined in the title tag. 
- siteId : Id of the website in Versionista. 
- pageId : Id of a webpage in Versionista. 
- versionId : Id of a version in Versionista. 
- url : The full URL to view this version in Versionista. You’ll need to be logged into the appropriate Versionista account to make use of it. 
- date : The date and time when the version was captured. 
- hasContent : Indicates if Versionista has stored any content of the page or not. There is a limit on the size of the versions Versionista can store. True or False 
- diffWithPreviousUrl : URL to diff view in Versionista (comparing with previous version). 
- diffWithPreviousDate : The capture date of the first ever captured version of the page. 
- diffWithFirstUrl : URL to diff view in Versionista (comparing with first version). 
- diffWithFirstDate : The capture date of the current version of the page. 
- textDiff : A dictionary with the URL to the text diff view in Versionista and its SHA 256 hash and length. 
- diff: A dictionary with the URL to the entire diff view in Verisionista and its SHA 256 hash and length. 
- filePath : Path to the diff file which is stored in our archive. 
- hash : The diff file’s SHA 256 hash. 
Recent Versionista output file
| Aspect | Internet Archive | Versionista | 
|---|---|---|
| Type | Byte Sequence | JSON | 
| Version/file can be directly accessed | No | No | 
| Elapsed time details | Not present | Not Present | 
| Page meta tag data/ header | Not present | Not Present |