Google is Cracking the “Invisible Web”
Google announced on Friday via the Google Webmaster Central Blog that it has been experimenting with crawling HTML forms in order to index sites more fully.
The search engine is filling out forms on a "small number of particularly useful sites" to surface the content those forms generate and index the results.
“In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn’t find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.”
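To make those mechanics concrete, here is a minimal sketch in Python of the idea the blog describes: choose candidate values for each input of a GET form, then generate the query URLs a real user might have produced. This is not Google's actual code; the form action, input names, and word lists below are all hypothetical.

    # Sketch of form-based URL discovery: enumerate a few combinations of
    # input values and build the GET URLs a user's query would produce.
    from itertools import islice, product
    from urllib.parse import urlencode, urljoin

    # A hypothetical form, as it might be extracted from a page's HTML:
    # <form action="/search" method="get"> with one text box and one select menu.
    form_action = "/search"
    form_inputs = {
        "q": ["router", "firmware", "manual"],  # text box: words sampled from the site's own pages
        "category": ["support", "downloads"],   # select menu: values taken straight from the HTML
    }

    def candidate_urls(base_url, action, inputs, limit=10):
        """Yield up to `limit` GET URLs, one per combination of input values."""
        names = list(inputs)
        combos = product(*(inputs[n] for n in names))
        for combo in islice(combos, limit):
            query = urlencode(dict(zip(names, combo)))
            yield urljoin(base_url, action) + "?" + query

    for url in candidate_urls("http://www.example.com/", form_action, form_inputs):
        print(url)  # each URL could then be fetched and checked for new content

Each resulting page would then be evaluated, as the blog puts it, for whether it is "valid, interesting, and includes content not in our index" before being added.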
As Danny Sullivan pointed out on SearchEngineLand.com, it was not so long ago that Google warned of a crackdown on a web site's search results ending up in the SERPs. Judging by the portion of the Webmaster Central Blog post quoted above, these are exactly the type of results it is now spidering and looking to index.
This is exciting for many webmasters, as it means the information you may be creating dynamically, through a database and an HTML form, may not be left in the dark much longer. The "invisible web" may finally be coming into the light.
For site owners who might be wary of the information hidden behind their HTML forms becoming items in the SERPs, the blog insists that the restrictions set up in the robots.txt file will be followed.
“…the ever-friendly Googlebot, always adheres to robots.txt, nofollow, and noindex directives. That means that if a search form is forbidden in robots.txt, we won’t crawl any of the URLs that a form would generate.”
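For example, a site owner could keep a site-search form's results out of the index by disallowing the form's handler in robots.txt (the /search path here is only an illustration; substitute the form's actual action URL):

    User-agent: *
    Disallow: /search

With a rule like that in place, Googlebot will not crawl any of the query URLs the form would generate.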
It is hard to say how long it will be before this type of indexing becomes standard for Google. As is its MO, the company tends to adapt its technology, run with it, and then improve upon the results.