Site operators may mark parts of a site so that Web crawlers ignore them, typically to keep some section of the Web site invisible to casual searchers. The most common means of doing this is to place a file named "robots.txt" in a standard location, the root of the site, listing the locations or directories that crawlers are asked not to crawl. Compliance with this protocol is completely voluntary for crawler operators, but all of the major search engines honor it.
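As an illustration, the sketch below shows how a well-behaved crawler might consult a robots.txt file before fetching a page, using Python's standard `urllib.robotparser` module. The rules, the `www.example.gov` host, the paths, and the `ExampleBot` user agent are all hypothetical and are not taken from any real agency's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler ("*") is asked to stay
# out of two directories. These paths are assumptions for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant crawler asks before fetching; nothing technically stops a
# crawler that chooses not to ask.
allowed_public = parser.can_fetch(
    "ExampleBot", "https://www.example.gov/reports/index.html"
)
allowed_private = parser.can_fetch(
    "ExampleBot", "https://www.example.gov/private/memo.html"
)

print("reports page allowed:", allowed_public)
print("private page allowed:", allowed_private)
```

The check is purely advisory: the file only states the site operator's request, and it is up to each crawler to honor it.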
There are legitimate reasons to use a robots.txt file: to keep information that, while available on the Web, may not be appropriate for wide distribution out of search results, or to prevent copyrighted material from being cached by search engines. A robots.txt file can also be used to prevent duplicate content from being crawled, or to protect non-robust applications used on the Web site. However, robots.txt can also be misused by over-blocking content and preventing search engines from crawling the site. For instance, much has been said about the whitehouse.gov robots.txt file, and other agencies, such as ATF, have added wide swaths of their Web sites to the list of hard-to-find information with just a few lines of code.
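To make the over-blocking concern concrete, the following sketch compares a narrowly targeted rule with a site-wide one. Both rule sets, the URLs, and the `ExampleBot` user agent are hypothetical and do not reproduce the whitehouse.gov or ATF files.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_text, urls, user_agent="ExampleBot"):
    """Return the URLs that a compliant crawler would skip."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


# Narrow rule: only the site's internal search results are excluded.
NARROW = "User-agent: *\nDisallow: /search/\n"

# Broad rule: two lines are enough to hide the entire site.
BROAD = "User-agent: *\nDisallow: /\n"

# Hypothetical pages on an imaginary agency site.
URLS = [
    "https://www.example.gov/",
    "https://www.example.gov/publications/report.html",
    "https://www.example.gov/search/results.html",
]

narrow_blocked = blocked_urls(NARROW, URLS)
broad_blocked = blocked_urls(BROAD, URLS)

print("narrow rule blocks:", len(narrow_blocked), "of", len(URLS))
print("broad rule blocks:", len(broad_blocked), "of", len(URLS))
```

Under these assumptions the narrow rule hides only the search results page, while the broad rule hides every page listed, including the home page and the publication.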
Federal government Web sites contain public information and resources that should be readily available. The widespread use of robots.txt on federal government Web sites is a questionable practice that serves to limit the availability of information, as shown in our previous examples.