*****************************************
Overview of Sources of Archived Web Pages
*****************************************
The EDGI Website Monitoring project currently ingests data from two sources:

* Internet Archive Wayback Machine
* Versionista (legacy)

Internet Archive
================
The Internet Archive Wayback Machine stores snapshots of HTML pages. The
Internet Archive at large stores a broader range of archived content, but we
are focused on the Wayback Machine.
The original response from the Internet Archive is a list of 'mementos' (or
'versions', as we call them). Each line contains information about a version. At
times, a line may be a 'TimeMap', which is essentially a link to the next list
of mementos.
`source_metadata` : A byte sequence for each version stored in the Internet Archive, with its fields separated by semicolons.
We extract the useful information, i.e. the date and URI, from each memento.
Example:

.. code-block:: python

    b'; rel="memento"; datetime="Fri, 11 Jul 1997 09:46:01 GMT",'

`Reference on Mementos & TimeMaps `_
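
The extraction of the date and URI from a memento line can be sketched as
follows. This is a minimal illustration, not the project's actual parser: the
name `parse_memento`, the regular expression, and the sample line (including
its `example.com` URI) are assumptions made for this example. It assumes each
memento line carries a URI in angle brackets followed by `rel` and `datetime`
attributes.

.. code-block:: python

    import re
    from email.utils import parsedate_to_datetime

    # Hypothetical pattern: <URI>; rel="...memento..."; datetime="..."
    MEMENTO_RE = re.compile(
        rb'<(?P<uri>[^>]+)>;\s*'
        rb'rel="(?P<rel>[^"]*memento[^"]*)";\s*'
        rb'datetime="(?P<datetime>[^"]+)"'
    )

    def parse_memento(line):
        """Return (datetime, uri) for a memento line, or None otherwise."""
        match = MEMENTO_RE.search(line)
        if match is None:
            # Not a memento, e.g. a TimeMap link to the next list.
            return None
        captured_at = parsedate_to_datetime(
            match.group('datetime').decode('ascii'))
        return captured_at, match.group('uri').decode('utf-8')

    # Made-up sample line for illustration only.
    sample = (b'<http://web.archive.org/web/19970711094601/http://example.com/>; '
              b'rel="memento"; datetime="Fri, 11 Jul 1997 09:46:01 GMT",')
    captured_at, uri = parse_memento(sample)

Lines that do not match, such as a TimeMap entry, return `None` so the caller
can decide whether to follow the link to the next list.
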
Versionista
===========
Versionista returns a JSON blob which contains the following fields:

* `account` : The Versionista account we log into to get the versions. We have two accounts: `versionista1` and `versionista2`.
* `siteName` : The website of the file.
* `agency` : The name of the Government agency which owns the website.
* `versionistaSiteUrl` : A link to a website as it is stored on Versionista.
* `versionistaPageUrl` : A link to a webpage as it is stored on Versionista.
* `pageUrl` : The page's true URL.
* `pageTitle` : The title of the page as defined in the `title` tag.
* `siteId` : Id of the website in Versionista.
* `pageId` : Id of a webpage in Versionista.
* `versionId` : Id of a version in Versionista.
* `url` : The full URL to view this version in Versionista. You’ll need to be logged into the appropriate Versionista account to make use of it.
* `date` : The date and time when the version was captured.
* `hasContent` : `True` or `False`. Indicates whether Versionista has stored any content for the page. Versionista limits the size of the versions it can store.
* `diffWithPreviousUrl` : URL to diff view in Versionista (comparing with previous version).
* `diffWithPreviousDate` : The capture date of the previous version of the page.
* `diffWithFirstUrl` : URL to diff view in Versionista (comparing with first version).
* `diffWithFirstDate` : The capture date of the first ever captured version of the page.
* `textDiff` : A dictionary with the URL to the text diff view in Versionista and its SHA 256 hash and length.
* `diff` : A dictionary with the URL to the entire diff view in Versionista and its SHA 256 hash and length.
* `filePath` : Path to the diff file which is stored in our archive.
* `hash` : The diff file's SHA 256 hash.
`Recent Versionista output file `_
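
As an illustration of how the `filePath` and `hash` fields relate, the sketch
below checks an archived diff file against its recorded hash. The function
name `verify_diff_file` and the `archive_root` parameter are hypothetical, and
the sketch assumes that `hash` is a hex-encoded SHA 256 digest and that
`filePath` is relative to the archive root. Neither assumption is confirmed by
the field descriptions above.

.. code-block:: python

    import hashlib
    import tempfile
    from pathlib import Path

    def verify_diff_file(record, archive_root):
        """Return True if the file at record['filePath'] matches record['hash'].

        Assumes a hex-encoded SHA 256 digest and a path relative to
        archive_root (both are assumptions made for this sketch).
        """
        data = (Path(archive_root) / record['filePath']).read_bytes()
        return hashlib.sha256(data).hexdigest() == record['hash'].lower()

    # Usage with a made-up record and a temporary directory as the archive.
    with tempfile.TemporaryDirectory() as archive_root:
        content = b'<html>example diff</html>'
        (Path(archive_root) / 'example-diff.html').write_bytes(content)
        record = {
            'filePath': 'example-diff.html',
            'hash': hashlib.sha256(content).hexdigest(),
        }
        matches = verify_diff_file(record, archive_root)
        tampered = dict(record, hash='0' * 64)
        mismatch = verify_diff_file(tampered, archive_root)
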

======================================= ================ ===========
Aspect                                  Internet Archive Versionista
======================================= ================ ===========
Type                                    Byte sequence    JSON
Version/file can be directly accessed   No               No
Elapsed time details                    Not present      Not present
Page meta tag data/header               Not present      Not present
======================================= ================ ===========