A sharp-eyed Declan McCullagh of CNet recently posted about several federal government websites that use robots.txt files to keep their entire sites from being indexed by search engines.
Declan also points out other government sites that use quirky robots.txt restrictions based on the bots they presumably prefer (example: favoring MSN’s bot over Google’s).
So the question arises: is this the work of an inexperienced webmaster or part of a broader government conspiracy to hide web content?
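To make the idea of bot-specific restrictions concrete, here is a minimal sketch using Python’s standard urllib.robotparser module. The robots.txt text and the example.gov URL are made up for illustration and are not copied from any of the sites Declan mentions; the helper name allowed is also my own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (an assumption, not taken from any real site):
# msnbot may crawl everything, every other bot is asked to stay out.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(agent: str, url: str) -> bool:
    """Ask the parser whether the rules above permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


# example.gov is a placeholder host.
for agent in ("msnbot", "Googlebot"):
    print(agent, allowed(agent, "https://www.example.gov/reports/"))
```

With rules like these, a bot that identifies itself as msnbot is told it may crawl the whole site, while every other well-behaved bot is asked to stay away entirely.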
Declan theorizes “I can think of two reasons: (a) avoiding the situation of posting a report that turned out to be embarrassing and was discovered by Google and (b) letting the Feds modify a file such as a transcript without anyone noticing. (There have been allegations of the Bush administration altering, or at least creatively interpreting, transcripts before. And I’ve documented how a transcript of a public meeting was surreptitiously deleted — and then restored.)”
While a conspiracy theory in which the feds hide web content in order to manipulate information is an attractive assumption, I can’t believe that is the actual intent. Let’s not forget that robots.txt is an entirely voluntary convention: it tells crawlers what the site owner would like them to skip, but nothing enforces it. Any person or bot that chooses to ignore robots.txt can freely access and save everything in the publicly accessible areas of these sites.
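The voluntary nature of robots.txt is easy to see in code: the check happens inside the crawler, not on the server. The sketch below is only an illustration under my own assumptions. The function will_fetch and its honor_robots flag are hypothetical names, and the rules are a generic "block everything" file rather than any agency’s actual one.

```python
from urllib.robotparser import RobotFileParser

# A generic "keep every bot out" robots.txt (an assumption for illustration).
BLOCK_ALL = ["User-agent: *", "Disallow: /"]


def will_fetch(url: str, agent: str, honor_robots: bool) -> bool:
    """Return True if this hypothetical crawler would go ahead and request `url`.

    The only thing standing between the crawler and the page is the
    crawler's own decision to consult robots.txt first.
    """
    if not honor_robots:
        # A crawler that ignores robots.txt never performs the check at all.
        return True
    parser = RobotFileParser()
    parser.parse(BLOCK_ALL)
    return parser.can_fetch(agent, url)


# example.gov is a placeholder host.
page = "https://www.example.gov/transcripts/meeting.html"
print(will_fetch(page, "PoliteBot", honor_robots=True))
print(will_fetch(page, "RudeBot", honor_robots=False))
```

A polite crawler declines the request; one that skips the check proceeds, and the server will hand it the page like it would to any browser.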
Say, for example, CareerBuilder wanted to crawl and cache content from the site of the Office of the Director of National Intelligence (ODNI). Perhaps it would look something like this? Perhaps this content would also be accessible from a search engine and would look something like this.
I’m fairly certain that the ODNI is smart enough to know that robots.txt isn’t a security mechanism. I personally feel that these robots.txt files are either the work of inexperienced webmasters or part of a misguided desire to reduce search engine visibility. If this is an effort to reduce search engine visibility, then that effort has failed. Many of the phrases from the ODNI website that I searched for were either quoted somewhere else or downright copied altogether.
This means that ODNI’s original content is available on search engines and, even worse, is available (through search engines, anyway) only on web pages that the ODNI doesn’t control. By publishing content publicly while restricting search engine bots, the ODNI has effectively handed control over its content to third parties, who may attach politically motivated criticisms to it.