Posted August 24, 2007 2:28 pm, with 6 comments


Like a page out of a John Grisham novel, the Federal Government is using robots to help stay invisible on the web. Of course I’m not talking about futuristic robots with laser beams for eyes, but rather robots.txt files on various government websites.

A sharp-eyed Declan McCullagh of CNET recently posted about several federal government websites that use robots.txt files to keep their entire sites from being indexed by search engines.

The offenders?

Declan also points out other government sites that use quirky robots.txt restrictions based on the bots they presumably prefer (for example, favoring MSN's bot over Google's).
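For readers unfamiliar with the format, a robots.txt that plays favorites might look roughly like this. The bot names are real, but the specific rules here are a made-up illustration, not the actual contents of the files Declan examined:

```
# Hypothetical robots.txt favoring MSN's crawler over Google's.
# An empty Disallow means "nothing is off-limits."

User-agent: msnbot
Disallow:

User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /reports/
```

Crawlers that honor the convention read the block matching their user-agent string and skip the paths listed under Disallow.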

So the question arises: is this the work of an inexperienced webmaster, or part of a broader government conspiracy to hide web content?

Declan theorizes “I can think of two reasons: (a) avoiding the situation of posting a report that turned out to be embarrassing and was discovered by Google and (b) letting the Feds modify a file such as a transcript without anyone noticing. (There have been allegations of the Bush administration altering, or at least creatively interpreting, transcripts before. And I’ve documented how a transcript of a public meeting was surreptitiously deleted — and then restored.)”

While a conspiracy theory about the feds hiding web content for information-manipulation purposes is an attractive assumption, in reality I can't believe this is the actual intent. Let's not forget that robots.txt is an entirely voluntary convention. Any person or bot that chooses to ignore robots.txt can freely access and save all content in any publicly accessible area of these sites.
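To see how voluntary the convention is, here's a minimal sketch using Python's standard-library robots.txt parser. The disallow-all rules and the example.gov URL are hypothetical stand-ins for the files Declan found:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical disallow-all robots.txt, like the ones reportedly
# found on several federal sites (the actual files may differ).
rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A polite crawler checks the rules before fetching a URL...
allowed = parser.can_fetch("Googlebot", "http://example.gov/report.html")
print(allowed)  # False

# ...but the check happens entirely on the client side. Nothing stops
# a crawler from skipping it and requesting the URL anyway, e.g.:
#   urllib.request.urlopen("http://example.gov/report.html")
```

The enforcement lives in the crawler's goodwill, not on the server, which is exactly why robots.txt can't hide anything.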

Say, for example, CareerBuilder wanted to crawl and cache content from the Office of the Director of National Intelligence's (ODNI) site. Perhaps it would look something like this. Perhaps that content would also be accessible from a search engine and would look something like this.

I’m fairly certain that the ODNI is smart enough to know that robots.txt isn’t a security mechanism. I personally feel these robots.txt files are either the work of inexperienced webmasters or part of a misguided desire to reduce search engine visibility. If that is the goal, the effort has failed: many of the phrases I searched for that appear on the ODNI website were either quoted somewhere else or copied outright.

This means that ODNI's original content is available on search engines and, worse, is available (through search engines, anyway) only on web pages the ODNI doesn't control. By publishing content publicly while blocking search engine bots, the ODNI has effectively ceded control of its content to third parties, who may frame it with politically motivated criticism.

  • “I’m fairly certain that the ODNI is smart enough to know that robots.txt isn’t a security mechanism.”

    I’m not as certain as you are. You would like to think they’re smart enough and I’m sure they are, but it’s quite possible the person who put up the robots.txt file isn’t and did think they were hiding the content.

  • I guess it's right to use robots.txt for government sites. They must have a lot of data that should be kept confidential, to maintain proper law and order.

  • As I pointed out to Matt Cutts last week, there are Federal sites (and educational and NGO sites) that disavow any endorsement of sites they link out to.

    Google nonetheless treats all their outbound links as endorsements.

    I would say the government has good reason not to trust search engines since they don’t follow direction very well.

  • Conspiracy… wait.
    I don't know; as suspicious as it looks, why would they not want to be so easily found? Besides, they should know by now that the more suspicious it looks, the more people want to look.

  • robots.txt does not provide any security for private or classified documents. It only instructs web crawlers not to crawl all or part of a website, and not all search engines follow its instructions. The real rule of thumb: if you want to hide a file, keep it off any networked computer or server.


  • Yes, that's a good move by our U.S. government. It is difficult to get a government grant easily. Thanks for the mention and for helping out people who need to apply for grants.