The Challenge of Parsing Language and Curating Streams
Aggregating accurate social data for The Weather Channel (TWC) required determining, with precision, whether a tweet is actually about the weather. That is not so easy considering that “hot”, for instance, has many connotations.
Initial Filtering based on Queries
Consider this tweet: “It’s hot as a frying pan outside!” “Hot” and “outside” are two terms that cue us in to a discussion of weather. The same words could be used in a completely different light, however. “I think Kanye & Jay Z’s new album is hot. Can I get an outside opinion?” matches the same query, “hot AND outside”, but is irrelevant to TWC. A keyword query can therefore only narrow the stream; it cannot decide whether a tweet is about the weather.
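To make the limitation concrete, here is a minimal sketch of an AND query applied to the two example tweets. The function name and the tokenization rule are assumptions made for illustration, not the query engine used in the project.

```python
import re

def matches_all_terms(tweet, terms):
    """Return True if every query term appears as a whole word in the tweet.

    Hypothetical helper: matching is case-insensitive and splits on anything
    that is not a letter, digit, or apostrophe.
    """
    words = set(re.findall(r"[a-z0-9']+", tweet.lower()))
    return all(term.lower() in words for term in terms)

weather_tweet = "It's hot as a frying pan outside!"
music_tweet = "I think Kanye & Jay Z's new album is hot. Can I get an outside opinion?"

# Both tweets satisfy "hot AND outside", although only the first is about weather.
for tweet in (weather_tweet, music_tweet):
    print(matches_all_terms(tweet, ["hot", "outside"]), "-", tweet)
```

Both tweets pass the filter, which is why a second, learned classification step is needed after the query.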
We built an AI that could be hand-trained to classify the narrowed-down tweets as either weather or not weather. The classifier is based on the Maximum Entropy algorithm. Because results get better with larger training sets, we wanted training to be as simple and intuitive as possible, to encourage constant training. We created a web-based interface that displays a single tweet in a large font with three buttons: Weather, Not Weather, and Skip. When the user clicks one, another tweet is displayed instantly. The interface also shows the user two running counts: the total number of tweets trained overall, and the number trained by that user.
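For a two-class problem, a Maximum Entropy classifier is equivalent to logistic regression. The sketch below illustrates that idea on a handful of hand-labeled tweets. It is not the production classifier: the bag-of-words features, the training loop, the hyperparameters, and the sample tweets are all assumptions chosen to keep the example short.

```python
import math
import re
from collections import defaultdict

def features_of(text):
    """Bag-of-words features plus a bias term (an assumed feature set)."""
    return set(re.findall(r"[a-z']+", text.lower())) | {"__bias__"}

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def train_maxent(examples, epochs=200, learning_rate=0.5):
    """Fit binary MaxEnt (logistic regression) weights by stochastic gradient ascent.

    examples: list of (text, label) pairs, label 1 = weather, 0 = not weather.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:
            feats = features_of(text)
            predicted = sigmoid(sum(weights[f] for f in feats))
            # Gradient of the log-likelihood for binary features: (label - predicted).
            for f in feats:
                weights[f] += learning_rate * (label - predicted)
    return weights

def probability_weather(weights, text):
    """Estimated probability that a tweet is about the weather."""
    return sigmoid(sum(weights.get(f, 0.0) for f in features_of(text)))

# Stand-ins for labels collected through the Weather / Not Weather buttons.
hand_labeled = [
    ("It's hot as a frying pan outside", 1),
    ("So hot outside today, the sun is brutal", 1),
    ("Freezing rain outside all morning", 1),
    ("Kanye's new album is hot, can I get an outside opinion", 0),
    ("That new album is so hot", 0),
    ("Need an outside opinion on this movie", 0),
]

weights = train_maxent(hand_labeled)
for tweet in ("Brutal sun and freezing rain today", "Need an opinion on this new album"):
    print(round(probability_weather(weights, tweet), 3), "-", tweet)
```

Each click in the training interface would add one more `(text, label)` pair to a set like `hand_labeled`, which is why a fast, low-friction interface translates directly into a better model.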
Our next task was deciphering the location of weather-related tweets. Knowing that “it’s hot as a frying pan outside!” is meaningless if we don’t know where “outside” is, so we needed to assign a location to as many tweets as possible. Some tweets are already geo-tagged, which made it easy for us to pass along coordinates to TWC. Unfortunately, only 1-3 percent of tweets are attached to any geolocation. To determine location for the rest, we looked at the user’s self-reported location, as well as certain location indicators within the tweet.
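The fallback logic can be sketched as a short chain of lookups. Everything specific here is an assumption for illustration: the field names (`coordinates`, `user_location`, `text`), the tiny gazetteer, the naive substring matching, and the choice to try the profile before the tweet text.

```python
# Hypothetical mini-gazetteer; a real system would need a far larger place database.
GAZETTEER = {
    "atlanta": (33.749, -84.388),
    "chicago": (41.878, -87.630),
    "new york": (40.713, -74.006),
}

def lookup_place(text):
    """Return coordinates for the first known place name found in the text, else None."""
    lowered = text.lower()
    for name, coords in GAZETTEER.items():
        if name in lowered:  # naive substring match, kept simple on purpose
            return coords
    return None

def resolve_location(tweet):
    """Return (coordinates, source) using the most reliable signal available."""
    # 1. An explicit geotag needs no interpretation.
    if tweet.get("coordinates"):
        return tuple(tweet["coordinates"]), "geotag"
    # 2. Otherwise try the self-reported profile location, then the tweet text.
    for field, source in (("user_location", "profile"), ("text", "tweet text")):
        coords = lookup_place(tweet.get(field) or "")
        if coords:
            return coords, source
    return None, "unknown"

print(resolve_location({"text": "So hot outside!", "user_location": "Chicago, IL"}))
print(resolve_location({"text": "Hot as a frying pan in Atlanta!", "user_location": ""}))
```

Returning the source alongside the coordinates lets downstream consumers treat an inferred location with less confidence than a geotag.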
Tweets must be free of profanity to be cleared for broadcast on TWC. We took a very aggressive approach to weeding out profanity: we match an extensive profanity dictionary against each tweet, searching within words and screen names, and flag any tweet that contains profane language.
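A minimal sketch of this kind of aggressive matching follows. The blocklist uses mild stand-in words, and the function names are hypothetical; the point is only that substring matching catches terms embedded inside longer words and screen names.

```python
# Mild stand-ins for the entries of a real profanity dictionary (assumed for illustration).
BLOCKLIST = {"darn", "heck"}

def contains_profanity(text, blocklist=BLOCKLIST):
    """Flag text if any blocklisted term appears anywhere, even inside a longer word."""
    lowered = text.lower()
    return any(term in lowered for term in blocklist)

def is_broadcast_safe(tweet_text, screen_name):
    """A tweet is cleared only if both its text and its author's screen name are clean."""
    return not (contains_profanity(tweet_text) or contains_profanity(screen_name))

print(is_broadcast_safe("Sunny and 75 today", "weatherfan"))
print(is_broadcast_safe("Nice day out", "DarnGoodForecasts"))
print(is_broadcast_safe("Check the radar before you leave", "stormwatch"))
```

The last call is rejected because “check” contains “heck”. Searching within words trades some false positives like this one for a lower risk of profanity reaching the air, which is a reasonable trade for broadcast.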
Synthesizing and Curating the Pulse of the People
What TWC and Twitter have done is the most ambitious on-air/online/in-app curation effort that a network has realized to date, articulating the world’s largest and oldest narrative – the weather.
We enjoyed the language processing challenge and the overall scale of bringing this project to launch, and we are excited to apply the know-how we gleaned to support news organizations as they clear the hurdle of accurate Twitter-based reporting. Social data helps crowdsource the news, turning every Twitter user into a potential source for on-the-ground, first-hand reporting. This next step in marrying social networking with news dissemination will create a compelling integrated social user experience, improve field reporting, push dissemination of news and events to real time, advance SEO with changing content and context, and continue to unlock the value of the Twitter firehose.