Overview of Web Measurements – Web Archive Research at the University of Michigan

You may have reached this page because your server received requests that included a link to it in the User-Agent header. These requests are benign and are sent at a low rate. They are part of an academic study that aims to understand and fix broken links on the web.

More detailed information about the project is given below. If you would like us to stop making these requests to your site, please email us at web-research AT umich DOT edu and we will exclude your site from our crawls.

What is the goal of this project?

It is well known that the web is decaying: links to pages created in the past stop working as time goes on. To understand and address this phenomenon, we perform representative crawls of the web to quantify the fraction of broken pages and, more importantly, to characterize why they are broken.

In fact, we have found that many links to web pages break not because those pages no longer exist, but because the websites hosting them were reorganized. As a result, a page that is no longer accessible at its original URL is often available at a different URL on the same site.

Currently, we are developing a system that can help web providers automatically fix broken external links on their pages by finding the new URL that any particular broken link could be rewritten to.

Measures taken to minimize impact of web crawls

In this study, we crawl a large number of URLs in order to collect representative data and examples, as well as to test our system. We try our best to spread out our requests to any particular site and to respect each site’s policy by following the rules specified in its robots.txt. We do not store any page content that we fetch; we use our crawls only to log metadata.

If you still feel uncomfortable with our crawls, please contact us at web-research AT umich DOT edu and we will exclude your site from our measurements.
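
For readers curious what the measures above can look like in practice, here is a minimal Python sketch of a crawler that checks robots.txt and spaces out requests per host. It is an illustration under stated assumptions, not the code used in this study: the user-agent token, the 10-second default delay, and the PoliteScheduler class are all hypothetical. Only the robots.txt parsing comes from Python's standard library (urllib.robotparser).

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical values chosen for illustration only.
USER_AGENT = "example-research-crawler"
DEFAULT_DELAY_SECONDS = 10.0


class PoliteScheduler:
    """Tracks robots.txt rules and the last request time for each host."""

    def __init__(self, default_delay=DEFAULT_DELAY_SECONDS):
        self.default_delay = default_delay
        self.rules = {}         # host -> RobotFileParser
        self.last_request = {}  # host -> time of the most recent request

    def load_rules(self, host, robots_txt):
        """Parse robots.txt text that has already been fetched for this host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """Return True only if rules for the host are loaded and permit the URL.

        Skipping hosts whose rules are not loaded is a conservative assumption.
        """
        parser = self.rules.get(urlsplit(url).netloc)
        return parser is not None and parser.can_fetch(USER_AGENT, url)

    def seconds_to_wait(self, url, now=None):
        """Return how long to wait before the next request to this URL's host."""
        host = urlsplit(url).netloc
        now = time.monotonic() if now is None else now
        last = self.last_request.get(host)
        if last is None:
            return 0.0  # first request to this host
        parser = self.rules.get(host)
        crawl_delay = parser.crawl_delay(USER_AGENT) if parser else None
        # Use whichever is longer: our default gap or the site's Crawl-delay.
        delay = max(self.default_delay, float(crawl_delay or 0))
        return max(0.0, last + delay - now)

    def record_request(self, url, now=None):
        """Remember when a request to this URL's host was sent."""
        host = urlsplit(url).netloc
        self.last_request[host] = time.monotonic() if now is None else now


# Example robots.txt content for a hypothetical site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 30
"""

scheduler = PoliteScheduler()
scheduler.load_rules("example.org", EXAMPLE_ROBOTS_TXT)
```

With these example rules, scheduler.allowed("https://example.org/private/page") is False, and a second request to example.org would be held back until the site's 30-second Crawl-delay has passed, since that is longer than the assumed default gap.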

Project members

Huanchen Sun: MS student at University of Michigan
Jingyuan Zhu: PhD student at University of Michigan
Harsha V. Madhyastha: Associate Professor at University of Southern California