It’s not easy to index millions (well, let’s make that billions) of tweets, especially since the nuances of language can make context a difficult thing to understand.
Twitter knows that its search must improve, and its latest efforts are laid out in painstaking detail in a post on the Twitter engineering blog. We’ll get to the words in a minute. First, you may (and I emphasize may) want to watch the video that is part of the post. If you want the 3.5 minutes of your life back after watching it, don’t complain to us. You have been warned.
Now to the words. Essentially, Twitter is using real-time work by real humans to help categorize the mountains of tweets, which, in turn, helps make Twitter search more effective. The trick is doing the categorization as close as possible to the trending ‘spike’ in tweets that occurs around subjects and events in real time. The post tells us:
From a search and advertising perspective, however, these sudden events pose several challenges:
1. The queries people perform have probably never before been seen, so it’s impossible to know without very specific context what they mean. How would you know that #bindersfullofwomen refers to politics, and not office accessories, or that people searching for “horses and bayonets” are interested in the Presidential debates?
2. Since these spikes in search queries are so short-lived, there’s only a small window of opportunity to learn what they mean.
So an event happens, people instantly come to Twitter to search for the event, and we need to teach our systems what these queries mean as quickly as we can — because in just a few hours, the search spike will be gone.
How do we do this? We’ve built a real-time human computation engine to help us identify search queries as soon as they’re trending, send these queries to real humans to be judged, and then incorporate the human annotations into our back-end models.
The post mentions the Storm topology used and explains the Turkers (those silly humanoids), who appear to be of the in-house variety due to the complexities of this operation and more. It appears to be quite an undertaking.
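For the curious, here is a rough idea of what that loop might look like in miniature. This is only a sketch of the three steps the post describes (spot a spiking query, get human judgments, fold the result back in), not Twitter's actual code. The function names, the thresholds, and the sample counts and judgments are all made up for illustration.

```python
from collections import Counter

def find_spiking_queries(baseline, current, min_count=50, ratio=5.0):
    """Return queries whose count in the current window is at least `ratio`
    times their baseline count. Unseen queries are treated as baseline 1.
    Both thresholds are illustrative assumptions."""
    spiking = []
    for query, count in current.items():
        if count < min_count:
            continue  # too few searches to count as a spike
        if count / max(baseline.get(query, 0), 1) >= ratio:
            spiking.append(query)
    return sorted(spiking)

def majority_label(judgments):
    """Pick the most common category among the human judgments."""
    return Counter(judgments).most_common(1)[0][0]

# Hypothetical query counts: a normal hour versus the current hour.
baseline = Counter({"weather": 900})
current = Counter({
    "weather": 950,
    "#bindersfullofwomen": 4000,
    "horses and bayonets": 30,
})

spiking = find_spiking_queries(baseline, current)

# Hypothetical answers from three human judges for the spiking query.
judgments = {"#bindersfullofwomen": ["politics", "politics", "office supplies"]}

# The "annotation" a back-end model would consume: query -> category.
labels = {q: majority_label(judgments[q]) for q in spiking if q in judgments}
print(labels)
```

In the real system the counting would presumably happen continuously inside the Storm topology rather than on two fixed snapshots, and the judgments would arrive from the Turkers rather than from a hard-coded dictionary.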
But it’s one of the final sentences in the post that is the most telling, where the author thanks those involved in getting this project underway. Look who is first in line (although not mentioned anywhere else in the post).
Many thanks to the Revenue and Storm teams, as well as our Turkers, for their help in launching this project.
No surprise here, because while this whole thing appears to be about search and Turkers and better results, it’s really about revenue. Isn’t everything? Now we’ll see just how this activity turns into advertising products for Twitter and how Twitter will pay to scale this operation with your ad dollars. Makes sense to me.