Overview of Sources of Archived Web Pages
The EDGI Website Monitoring project currently ingests data from two sources:
Internet Archive Wayback Machine
Versionista (legacy)
Internet Archive
The Internet Archive Wayback Machine stores snapshots of HTML pages. The Internet Archive at large stores a broader range of archived content, but we are focused on the Wayback Machine.
The original response from the Internet Archive is a list of ‘mementos’ (or ‘versions’, as we call them). Each line contains information about a version. At times, a line may be a ‘TimeMap’, which is essentially a link to the next list of mementos.
source_metadata : A byte sequence for each version stored in the Internet Archive, with fields separated by semicolons. We extract the useful information, i.e. the date and URI, from each memento.
Example:
b'<http://web.archive.org/web/19970711094601/http://www.nasa.gov:80/>; rel="memento"; datetime="Fri, 11 Jul 1997 09:46:01 GMT",'
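The extraction of the date and URI described above could be sketched as follows. This is a minimal illustration, not the project's actual parser: the function name `parse_memento_line` and the regular expression are assumptions made for this example, and the sketch only handles lines shaped like the one shown above.

```python
import re
from email.utils import parsedate_to_datetime

# Matches one entry: <uri>; rel="..."; optionally followed by datetime="..."
LINK_PATTERN = re.compile(
    r'<(?P<uri>[^>]+)>;\s*rel="(?P<rel>[^"]+)"'
    r'(?:;\s*datetime="(?P<datetime>[^"]+)")?'
)


def parse_memento_line(raw_line):
    """Return (uri, capture datetime) for a memento line, else None.

    Hypothetical helper: lines that are not mementos (for example a
    TimeMap link) or that carry no datetime are skipped by returning None.
    """
    text = raw_line.decode("utf-8").strip()
    match = LINK_PATTERN.search(text)
    if match is None:
        return None
    # rel may hold several space-separated values, so test for membership.
    if "memento" not in match.group("rel").split():
        return None
    if match.group("datetime") is None:
        return None
    # The datetime field uses the RFC 1123 style seen in HTTP headers.
    captured_at = parsedate_to_datetime(match.group("datetime"))
    return match.group("uri"), captured_at


example = (
    b'<http://web.archive.org/web/19970711094601/http://www.nasa.gov:80/>; '
    b'rel="memento"; datetime="Fri, 11 Jul 1997 09:46:01 GMT",'
)
uri, captured_at = parse_memento_line(example)
```

Returning `None` for non-memento lines lets a caller loop over every line of the response and keep only the versions, handling TimeMap links separately.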
Versionista
Versionista returns a JSON blob which contains the following fields:
account : The Versionista account we log into to get the versions. We have two accounts: versionista1 and versionista2.
siteName : The name of the website the page belongs to.
agency : The name of the Government agency which owns the website.
versionistaSiteUrl : A link to a website as it is stored on Versionista.
versionistaPageUrl : A link to a webpage as it is stored on Versionista.
pageUrl : The page’s true URL.
pageTitle : The title of the page as defined in the title tag.
siteId : Id of the website in Versionista.
pageId : Id of a webpage in Versionista.
versionId : Id of a version in Versionista.
url : The full URL to view this version in Versionista. You’ll need to be logged into the appropriate Versionista account to make use of it.
date : The date and time when the version was captured.
hasContent : Indicates whether Versionista has stored any content for the page (True or False). There is a limit on the size of the versions Versionista can store.
diffWithPreviousUrl : URL to diff view in Versionista (comparing with previous version).
diffWithPreviousDate : The capture date of the previous version of the page.
diffWithFirstUrl : URL to diff view in Versionista (comparing with first version).
diffWithFirstDate : The capture date of the first ever captured version of the page.
textDiff : A dictionary with the URL to the text diff view in Versionista and its SHA-256 hash and length.
diff : A dictionary with the URL to the entire diff view in Versionista and its SHA-256 hash and length.
filePath : Path to the diff file which is stored in our archive.
hash : The diff file’s SHA-256 hash.
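As an illustration of how the `filePath` and `hash` fields fit together, the sketch below checks the bytes of an archived diff file against the hash recorded for it. It assumes that the fields are top-level keys of the JSON object and that `hash` is a hex-encoded digest; neither is confirmed here, and `diff_matches_hash` is a hypothetical helper name.

```python
import hashlib
import json


def diff_matches_hash(record, diff_bytes):
    """Compare the SHA-256 of the diff bytes with the record's `hash` field.

    Assumption: `hash` is stored as a hex string. Lowercasing both sides
    makes the comparison tolerant of upper-case hex digits.
    """
    actual = hashlib.sha256(diff_bytes).hexdigest()
    return actual == record["hash"].lower()


def load_record(json_text):
    """Parse one Versionista record from its JSON text into a dict."""
    return json.loads(json_text)
```

A caller would read the file at `record["filePath"]` in binary mode and pass its contents as `diff_bytes`; a `False` result suggests the archived file is truncated or altered.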
Recent Versionista output file
Aspect | Internet Archive | Versionista |
---|---|---|
Type | Byte Sequence | JSON |
Version/file can be directly accessed | No | No |
Elapsed time details | Not present | Not present |
Page meta tag data/header | Not present | Not present |