Google is Cracking the “Invisible Web”

It was announced on Friday via the Google Webmaster Central Blog that the search engine has been experimenting with crawling through HTML forms in order to index sites more fully.

The search engine is filling out forms on a “small number of particularly useful sites” in order to surface the content these forms generate and index the results.

“In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn’t find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.”
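
To make the mechanics concrete, here is a minimal Python sketch of the enumeration the quote describes. Everything in it is an assumption for illustration: the action URL, the field names, and the candidate values are all hypothetical, and Google has not published its actual selection logic.

    # A sketch (hypothetical URL, field names, and values) of the crawl
    # technique described above: pick candidate values for each input,
    # then build the GET URLs a real user's query would have produced.
    from itertools import islice, product
    from urllib.parse import urlencode

    form_action = "http://example.com/search"          # the form's action URL
    fields = {
        "q":        ["widgets", "pricing", "returns"], # words found on the site
        "category": ["books", "music"],                # <option> values
        "in_stock": ["yes"],                           # checkbox/radio values
    }

    def candidate_urls(action, fields, limit=10):
        """Yield up to `limit` GET URLs, one per combination of field values."""
        names = list(fields)
        for combo in islice(product(*(fields[n] for n in names)), limit):
            yield action + "?" + urlencode(dict(zip(names, combo)))

    for url in candidate_urls(form_action, fields):
        print(url)  # each URL would then be fetched and judged for new content

Note the combinatorial explosion this invites, which is presumably why the blog post limits the experiment to “a small number of queries” per form.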

As Danny Sullivan pointed out on SearchEngineLand.com, it was not so long ago that Google announced a crackdown on web sites’ search results ending up in the SERPs. Judging by the portion quoted above from the Webmaster Central Blog, those are exactly the type of results the search engine is now spidering and looking to index.

This is exciting for many webmasters, as it means the information you may be creating dynamically, through the use of a database and an HTML form, may not be left in the dark much longer. The “invisible web” may finally be coming into the light.

For site owners who might be wary of the information hidden behind their HTML forms becoming items on the SERPs, the blog insists that the restrictions set up in the robots.txt file will be followed.

“…the ever-friendly Googlebot, always adheres to robots.txt, nofollow, and noindex directives. That means that if a search form is forbidden in robots.txt, we won’t crawl any of the URLs that a form would generate.”
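
Webmasters who want to confirm that a disallowed form stays out of bounds can test their rules with the robots.txt parser in Python’s standard library. The rules and URLs below are made up for the example; any compliant crawler should reach the same skip/crawl decisions.

    # Checking the robots.txt safeguard: if the search form's result pages
    # are disallowed, a compliant crawler never fetches the generated URLs.
    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /search",
    ]

    rp = RobotFileParser()
    rp.parse(rules)

    for url in ("http://example.com/search?q=widgets",
                "http://example.com/about"):
        verdict = "crawl" if rp.can_fetch("Googlebot", url) else "skip"
        print(url, "->", verdict)
    # /search?q=widgets is skipped; /about is crawled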

It is hard to say how long it will be before this type of indexing becomes standard practice for Google. As is their MO, they tend to adopt a technology, run with it, and then improve upon the results.

  • http://www.gadgets4nowt.co.uk PS3

    That’s quite a radical turnaround; Google seem to be having a bit of a shake-up. I don’t currently use forms but may have to start!

  • http://www.askowlbert.com Barbara Ling (aka Owlbert)

    Wow, I’ve been teaching about the Invisible web now since, hmmm….1998? Almost a decade now…it’s great that Google is starting to index content of that nature. Resource Shelf is a useful tool for that aspect of the Internet as well.

    Enjoy,

    Barbara

  • Pingback: Google to reveal the Deep Invisible web

  • http://www.werebu.com/ Sasha T.

    I’m not really that excited about this news, maybe because I don’t have forms on my blog. I guess it will help some websites but more likely people will block G spider.

    Sasha T.

  • http://www.blahblahtech.com/2008/02/improve-seo-with-google-webmaster-tools.html Wayne Smallman

    This is an astonishingly bad idea by Google.

    My clients use their forms to harvest data from their customers before offering them information that is, in some cases, commercially sensitive.

    Yes, anyone could just fill out a form and share this data, but that really doesn’t happen. The value is in the data.

    If that data is there for all to see, what edge does any company have, and how do they engage their prospective consumers?

    The problem is, too few people are aware of the robots.txt file, so the negatives massively outweigh the benefits.

    The dialogue is effectively killed…

  • http://www.jaankanellis.com Jaan Kanellis

    So the obvious question is whether we can stop Googlebot from doing this. Sort of a nofollow for forms.

  • http://xn--7dbcc0f.net/ guruassassin

    Do not forget, guys:
    Google rules it all!!!

  • Pingback: Google starts discovering links in forms | Lucrando na Rede

  • Steve Teal

    As long as we can stop the bots from spidering with robots.txt, I am sure we have nothing to worry about.

    Steve

  • http://www.whatimnot.com Piper

    I wonder if this will lead to duplicate content issues for people, depending on the nature of the forms and the content that becomes available by going through the forms. I’d definitely like to hear more about how this works.

  • Pingback: » Search Engines Filling In Forms - Scotland SEO Blog

  • http://chasinggoogle.blogspot.com Elections guy

    Looks like we got new black-hat SEO tools.

  • http://www.mmmeeja.com/blog andymurd

    @Wayne Smallman – Google’s blog post stated that they only fill in forms that use the HTTP method “GET”. That’s a pretty small number and usually used for search or drop down menus.

    If a form results in entries going into a database, it should use “POST” instead of “GET” because the browser can legitimately cache the results of “GET” requests.

    I think Google are doing the right thing here, and blackhats have probably been using this technique for years (whilst ignoring robots.txt, etc.). See the GET-only sketch after these comments.

  • http://www.blahblahtech.com/ Wayne Smallman

    Hi Andy, I appreciate what you’re saying, and that less reputable people may have been using these techniques.

    But what these guys never had was Google’s scale and reach…

  • http://www.goodnightmoonfuton.com Futon-Matt

    I like that idea, though I only have one form on my site.

  • http://seologia.com.ar Seologia

    Is there a list of the sites they’re experimenting on?

  • http://www.trbr.net Seomotion

    It is a good idea, but I am worried about server bandwidth.

  • http://www.thevanblog.com Steven Bradley

    Like most things, this comes with both pros and cons. I think more pros, which is good, but we will have to pay a little more attention to preventing thank-you pages from being indexed. There could also be duplicate content issues, depending on your forms.

    Overall though, this is good news.

  • http://www.seoresults.co.za Web Marketing Man

    This should prove to be helpful, especially if webmasters use robots.txt to block the Googlebot where it isn’t supposed to go. At the end of the day, the more information made available in the returned results, the better the experience will be for the end user.

  • 9tlwyg

    I think the big question now is: where is Googlebot getting its search queries? Google’s post says it is getting them from “the site”. If search queries are stored on the site (via Google Analytics or whatever), then it’s a big privacy violation.

    And I don’t see the point of Google expanding its index, as we have too many already.
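
Picking up on andymurd’s GET/POST point in the comments above, below is a small sketch of how a crawler could restrict itself to forms that use GET, the only method Google says it submits. The HTML snippet is invented for the example, and the parsing uses only Python’s standard library.

    # A sketch (not Google's code) of skipping POST forms while crawling.
    from html.parser import HTMLParser

    class FormMethodFinder(HTMLParser):
        """Record the method of every <form> tag (GET is the HTML default)."""
        def __init__(self):
            super().__init__()
            self.methods = []

        def handle_starttag(self, tag, attrs):
            if tag == "form":
                # attrs is a list of (name, value) pairs; method may be absent
                self.methods.append(dict(attrs).get("method", "get").lower())

    page = """
    <form action="/search" method="get"><input name="q"></form>
    <form action="/signup" method="post"><input name="email"></form>
    """

    finder = FormMethodFinder()
    finder.feed(page)
    for method in finder.methods:
        print(method, "->", "candidate for crawling" if method == "get" else "skip")

Skipping POST keeps the crawler on the safe side of HTTP semantics: GET requests are meant to be cacheable and side-effect free, while POST submissions may write to a database, so a well-behaved bot leaves them alone.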