The Need for Web Archives
The lifetime of the average page on the web is notoriously short. As a result, links to web pages often become dysfunctional over time. Simultaneously, the web is highly dynamic in nature, and the content on any web page is likely to change over time.
To combat these dual problems of link rot and content drift, a number of web archives exist, which periodically crawl and store web pages. Users can leverage these archives to refer to the content hosted on a specific URL, at any particular point in time from the past.
Let’s look at each of these issues in more detail.
To eliminate differences in the resource URLs requested across different loads of the same page, Jawa removes the underlying sources of variation. It tracks the value of each such source while crawling a page, and enforces the same values when a user later loads the archived page snapshot.
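The record-and-enforce idea can be illustrated with a minimal sketch. This is not Jawa's implementation: the text above does not say which sources of variation Jawa tracks, so the current time and a random number generator are assumed here purely for illustration. The names `VariationLog` and `resource_url`, and the `example.com` URL, are hypothetical.

```python
import random
import time
from collections import defaultdict, deque


class VariationLog:
    """Records values from nondeterministic sources, then replays them in order."""

    def __init__(self):
        # Source name -> list of values observed at crawl time.
        self.values = defaultdict(list)

    def recording(self, name, real_source):
        """Wrap a source so every value it returns is logged (crawl time)."""
        def wrapped():
            value = real_source()
            self.values[name].append(value)
            return value
        return wrapped

    def replaying(self, name):
        """Return a source that yields the logged values in order (load time)."""
        queue = deque(self.values[name])

        def wrapped():
            return queue.popleft()
        return wrapped


def resource_url(now, rand):
    """Hypothetical page script whose fetch URL depends on time and randomness."""
    return f"https://example.com/ad?ts={int(now())}&r={int(rand() * 1_000_000)}"


log = VariationLog()

# Crawl time: use the real sources, but log every value they produce.
crawl_url = resource_url(log.recording("time", time.time),
                         log.recording("random", random.random))

# Load time: enforce the logged values, so the page requests the same URL.
replay_url = resource_url(log.replaying("time"), log.replaying("random"))
```

In this sketch, `replay_url` equals `crawl_url`, so the request issued when the snapshot is loaded matches a resource that was stored during the crawl. Without the enforced values, a later load would typically compute a different URL and the fetch would miss in the archive.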
The key findings from our research paper on Jawa are as follows:
- To store a corpus of 1 million page snapshots that we downloaded from the Internet Archive, Jawa reduces the total amount of storage needed by 41%.
- On over 95% of pages in a corpus of 3000 pages, Jawa eliminates almost all failed network fetches when loading archived pages in a different browser than the one used to crawl these pages.
- Jawa increases the number of pages that can be crawled per hour by 39%.