This guide explains what Heritrix can do, why it needs our help, and how to identify documents and datasets that Heritrix can’t reach.
How a Web Crawler Works
A web crawler like Heritrix visits web pages and searches for links. When it identifies a link, it follows it to a new page, where it once again identifies links, then follows those, and so on. As you can imagine, this rapidly leads to a very large number of links!
The Internet Archive therefore imposes limits: after 3 “hops,” Heritrix will stop collecting links and move on to the next “seed” on the list. This allows a lot of “territory” to be covered, moving relatively rapidly through the huge .gov domain, but it means that there are “hidden depths” which it doesn’t see.
When we nominate seeds, we draw Heritrix’s attention to some of this “deep web,” identifying it as especially important and worth crawling.
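The hop-limit behavior described above can be sketched as a breadth-first traversal over a toy link graph. This is an illustration only, not Heritrix’s actual algorithm, and the example.gov URLs are made up:

```python
from collections import deque

def crawl(seed, get_links, max_hops=3):
    """Breadth-first traversal that stops following links after
    max_hops "hops" from the seed. `get_links` stands in for
    fetching a page and extracting the links on it."""
    seen = {seed}
    queue = deque([(seed, 0)])  # (url, hops from the seed)
    captured = []
    while queue:
        url, hops = queue.popleft()
        captured.append(url)
        if hops == max_hops:  # hop limit reached: archive, but follow no further
            continue
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, hops + 1))
    return captured

# Toy link graph: /deep sits 4 hops from the seed, so it is never reached.
graph = {
    "https://example.gov/":     ["https://example.gov/a"],
    "https://example.gov/a":    ["https://example.gov/b"],
    "https://example.gov/b":    ["https://example.gov/c"],
    "https://example.gov/c":    ["https://example.gov/deep"],
    "https://example.gov/deep": [],
}
pages = crawl("https://example.gov/", lambda url: graph.get(url, []))
```

Here `pages` contains the seed plus pages 1–3 hops away, while the page 4 hops down is left behind. Nominating that deep page as a seed of its own would reset its hop count to zero and bring it within reach.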
What Heritrix Can and Can’t Get
Heritrix is a clever program, but it is fully automated and runs in a command-line environment. That means there are certain things it can’t do. On the other hand, what it can do, it does very well.
- It can browse to any https:// link that it finds, including links to datasets.
- It can browse to any ftp:// (FTP) link.
- It can’t click web form buttons like “Go!” or “Submit”. Search forms often stop it dead in its tracks, unless the page also has a “browse” link to the full list of searchable resources.
- It doesn’t have a browser environment that can do computing work for it. Many sophisticated web pages ask the browser to do a lot of computational work (for example, running JavaScript), and Heritrix can’t do that very well.
- It can’t follow links that don’t use the Hypertext Transfer Protocol (HTTP). This mostly affects resources that aren’t designed for direct web browsing, including:
- Databases: Much of the data on the web is stored in relational databases that are almost never exposed directly on the web. Instead, a database serves particular bits of information in response to a request from a web server, the user, or the browser. Those requests travel over paths that Heritrix can’t follow.
- Document Collections: Government sites will often present documents in a user-friendly viewer or frame of some kind. When this is the case, the PDF itself is usually not accessible to the web crawler.
Identifying Uncrawlable Data
Let’s look at one example of a difficult-to-crawl site.
There are two red flags that suggest that important data won’t be preserved by an IA webcrawl.
The picture in the center of the page is not an actual PDF, but just an image with your search terms highlighted in yellow. If you right-click and View Page Source, the HTML source looks like this:
```html
<img src="/Exe/tiff2png.cgi/P100OW3T.PNG?-i+-r+85+-g+15+-h+5,0,7,7,31,7,10,0,7,13,3,7,14,28,7,20,20,7,21,45,7,25,36,7,28,0,7,32,28,7,34,43,7,37,78,7,42,45,7,42,62,7,43,64,7,45,68,7,51,81,7,54,55,7,56,74,7,59,70,7+D%3A%5CZYFILES%5CINDEX%20DATA%5C11THRU15%5CTIFF%5C00001231%5CP100OW3T.TIF" style="max-width:none;margin-bottom:5px">
```
Capturing the page content will only archive a single page of the document rather than the document itself!
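To see what a crawler actually recovers here, a short sketch using Python’s standard html.parser (the snippet is abbreviated, with the long query string shortened for readability):

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, the way a
    simple link extractor would."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.srcs.extend(value for name, value in attrs if name == "src")

# Abbreviated version of the snippet above (full query string omitted).
html = '<img src="/Exe/tiff2png.cgi/P100OW3T.PNG?-i+-r+85" style="max-width:none">'
parser = ImgSrcCollector()
parser.feed(html)
print(parser.srcs[0])  # a CGI-rendered PNG of one page, not the PDF itself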
First, if you right-click and Inspect Element on the PDF link, you’ll see that it looks like this:
```html
<a href="#" title="Download this document as a PDF" alt="Download this document as a PDF" onclick="ZyShowPDF('PDF',event)">PDF</a>
```
Heritrix can’t capture links of this kind.
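You can verify this with a small sketch: a plain href extractor (using Python’s standard html.parser) finds only “#”, which points back to the same page. The actual PDF URL is constructed by the onclick JavaScript, which a crawler never runs:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values the way a simple link extractor would."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")

html = ('<a href="#" title="Download this document as a PDF" '
        'onclick="ZyShowPDF(\'PDF\',event)">PDF</a>')
parser = LinkExtractor()
parser.feed(html)
print(parser.hrefs)  # prints ['#']: no URL for a crawler to follow
```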
Second, when you click the link, instead of the document loading right away, the page first gives you a fancy loading screen. This means that some kind of communication is happening between your browser and the server while you wait. Heritrix can’t talk to the server the way your browser can!
Based on the above, any crawl will fail to capture the actual document: the resource itself!
How You Can Help – Nominating Seeds
An important task is to identify these uncrawlable datasets and interfaces with the EDGI Chrome Nomination Extension and alert data archivers to their location. Many of these datasets may be available for you to download elsewhere, such as on data.gov, a centralized repository of government datasets, or in another part of an agency’s website. Once we know about them, we can try to reverse engineer the interface: look for clues that identify the underlying data and develop a strategy to preserve it, either through scraping or some other preservation avenue (such as a FOIA request or a manual download at a government library).
For more information on these avenues, see James Jacobs’ 2016 End of Term (EOT) crawl and how you can help.