Discovering the Rest of the Internet Iceberg
Did you know that the visible portion of an iceberg represents only about 10% of its actual size? Beneath the water's surface lies the other 90%. Imagine if the captain of the Titanic had had that tidbit of information. Well, the Internet is similar in many ways. The portion of the Internet that is still inaccessible to the search engines and their crawlers is quite amazing. Even though Google indexed its one trillionth (with a T) web address last summer, it appears that there is so much more out there.
A New York Times article introduces this concept like this:
Beyond those trillion pages lies an even vaster Web of hidden data: financial information, shopping catalogs, flight schedules, medical research and all kinds of other material stored in databases that remain largely invisible to search engines.
The challenges that the major search engines face in penetrating this so-called Deep Web go a long way toward explaining why they still can’t provide satisfying answers to questions like “What’s the best fare from New York to London next Thursday?” The answers are readily available — if only the search engines knew how to find them.
Since there is so much more out there, you would suspect that there are folks trying to find it. There are, and of course Google is among them. Some wonder, though, what Google would do with the information. It is speculated that it may require a different presentation for search results, a page that until now has been untouchable save the occasional intrusion of universal search and personalized results.
“Google faces a real challenge,” said Chris Sherman, executive editor of Search Engine Land. “They want to make the experience better, but they have to be supercautious with making changes for fear of alienating their users.”
While this may not be at the front of the news and on everyone’s mind all the time, it is very real. When you learn that a company like Kosmix, which is doing work in this area, is backed in part by Jeff Bezos of Amazon, it’s hard not to raise an eyebrow and think that the next generation of search may be closer than we thought. Even if it isn’t close, there is a race on to get there first, and winning it could mean ridiculous amounts of money and power.
Looking a little deeper, it appears that the information traditional crawlers are not currently finding sits in the databases of the Deep Web. Personally, I find it hard to grasp the sheer amount of data that exists on the web and how it is presented right now. Taking a look at this whole matter, though, is certainly interesting. We have been trained to treat what Google search gives us as the definitive answer (which it is at the moment), and we even ignore anything past the first 5 or so results as not worthy. With this impatient approach to results, will it even matter if we are given more data? Or will it mean that the first 20 results are SO relevant that we have to start slowing down and making decisions on our own rather than having the engines think for us?
Read the article and get the details, because it is certainly interesting. Here’s a final thought to leave you with.
“The huge thing is the ability to connect disparate data sources,” said Mike Bergman, a computer scientist and consultant who is credited with coining the term Deep Web. Mr. Bergman said the long-term impact of Deep Web search had more to do with transforming business than with satisfying the whims of Web surfers.
The whims of web surfers? Is that all we are? Well, actually I guess it is in this context. How’s that for making you feel significant on this fine Monday!