How We Did It: The Development of Trendrr v3’s Sentiment Analysis Engine
A feature we’re excited about in Trendrr v3 is a revamped sentiment analysis engine that processes the Twitter conversation. Sentiment analysis automatically determines whether a tweet expresses positive, negative or neutral sentiment towards a brand or product. Our engine estimates the relative frequencies of positive, negative and neutral sentiment with accuracy above the 90th percentile.
Our engineering team drew on Go, Bhayani and Huang’s “Twitter Sentiment Classification using Distant Supervision” in developing our approach. Machine learning algorithms for this type of problem follow a basic pattern: collect some pre-categorized data (training data), and use its features to set the parameters of a model. The model takes unknown data as input, parses it for the same features found in the training data, and outputs the most likely category. The model’s parameters determine what this prediction will be. In general, the more training data obtained, the more accurate the model.
Similar to the methods of Go, Bhayani and Huang, we used the presence of single words or pairs of words (e.g. whether a tweet contains the word “bad” or the phrase “not bad”) as our features, and built our model with the Maximum Entropy algorithm. The Maximum Entropy algorithm attempts to classify data by making as few assumptions as possible while respecting the observations from the training data. We gathered training data by collecting tweets containing emoticons and treating each emoticon as an indicator of the tweet’s sentiment, which allowed us to obtain a large training set without labeling tweets by hand. This training data is particularly useful because it is drawn from actual Twitter posts, which employ a distinctive vocabulary.
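The feature extraction and emoticon-based labeling described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the emoticon sets, the tokenizer, and the function names `emoticon_label` and `extract_features` are assumptions made for the example, as is the choice to strip emoticons from the features so that the label does not leak into them.

```python
import re

# Hypothetical emoticon sets; the actual lists used are not given in this post.
POSITIVE_EMOTICONS = {":)", ":-)", ":D", "=)"}
NEGATIVE_EMOTICONS = {":(", ":-(", "=("}
ALL_EMOTICONS = POSITIVE_EMOTICONS | NEGATIVE_EMOTICONS


def emoticon_label(tweet):
    """Return 'positive', 'negative', or None (no emoticon, or mixed signals)."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE_EMOTICONS for t in tokens)
    has_neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
    if has_pos == has_neg:
        return None  # neither kind present, or both: unusable as training data
    return "positive" if has_pos else "negative"


def extract_features(tweet):
    """Binary presence features for single words and adjacent word pairs."""
    # Remove emoticon tokens first so the label is not also a feature.
    kept = [t for t in tweet.split() if t not in ALL_EMOTICONS]
    words = re.findall(r"[a-z0-9']+", " ".join(kept).lower())
    features = {("word", w) for w in words}
    features |= {("pair", a, b) for a, b in zip(words, words[1:])}
    return features


if __name__ == "__main__":
    tweet = "Not bad :)"
    print(emoticon_label(tweet), sorted(extract_features(tweet)))
```

Each labeled tweet then becomes one training example for the Maximum Entropy model: a set of present features paired with the label inferred from its emoticon.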
All of this training data, however, is taken from texts that contain opinion. To accurately identify neutral tweets we had to augment our training data, as our models were operating without ever having seen a tweet that did not express any sentiment. As such, we decided to use a technique used in sentiment classification for movie reviews called the “Hierarchical Classifier.” The classifier first determines whether a tweet is objective or subjective, and then categorizes only those tweets that express an opinion.
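The two-stage structure can be sketched as a small function that takes the two classifiers as arguments. The names `classify_hierarchical`, `toy_is_subjective` and `toy_polarity` are hypothetical, and the keyword-based stand-ins below exist only to make the example runnable; in practice both stages would be trained models.

```python
def classify_hierarchical(tweet, is_subjective, polarity):
    """Two-stage classification: subjectivity first, then polarity.

    is_subjective: callable returning True if the tweet expresses an opinion.
    polarity: callable returning 'positive' or 'negative' for opinionated tweets.
    """
    if not is_subjective(tweet):
        return "neutral"
    return polarity(tweet)


# Toy stand-ins for trained models, for illustration only.
POSITIVE_WORDS = {"love", "great"}
NEGATIVE_WORDS = {"hate", "awful"}


def toy_is_subjective(tweet):
    words = set(tweet.lower().split())
    return bool(words & (POSITIVE_WORDS | NEGATIVE_WORDS))


def toy_polarity(tweet):
    words = set(tweet.lower().split())
    return "positive" if words & POSITIVE_WORDS else "negative"


if __name__ == "__main__":
    for text in ["I love this phone", "The store opens at nine"]:
        print(text, "->", classify_hierarchical(text, toy_is_subjective, toy_polarity))
```

One consequence of this design is that the polarity model never has to learn what a neutral tweet looks like, so it can still be trained on emoticon-labeled data alone.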
We made other adjustments to fine-tune our models for use with Twitter, such as accounting for URLs and usernames. We standardized all URLs and usernames (i.e. replaced them with the words URL or USERNAME) so that they would be recognized as recurring features. We also removed retweets from the training data, as these duplicates would introduce some bias.
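A rough sketch of this preprocessing, using regular expressions, is shown below. The patterns and the retweet heuristic (looking for the "RT @user" convention) are assumptions for illustration and are likely simpler than what a production pipeline would need.

```python
import re

# Illustrative patterns; real tweets may need more careful handling.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
USERNAME_PATTERN = re.compile(r"@\w+")
RETWEET_PATTERN = re.compile(r"(^|\s)RT @\w+")


def normalize(tweet):
    """Replace every URL with 'URL' and every @mention with 'USERNAME'."""
    tweet = URL_PATTERN.sub("URL", tweet)  # URLs first, since they may contain '@'
    return USERNAME_PATTERN.sub("USERNAME", tweet)


def is_retweet(tweet):
    """Heuristic: treat tweets using the 'RT @user' convention as retweets."""
    return RETWEET_PATTERN.search(tweet) is not None


def prepare_training_tweets(tweets):
    """Drop retweets, then normalize what remains."""
    return [normalize(t) for t in tweets if not is_retweet(t)]


if __name__ == "__main__":
    raw = [
        "Check http://t.co/abc via @trendrr",
        "RT @bob: great show",
    ]
    print(prepare_training_tweets(raw))
```

After normalization, every link contributes the same feature (URL) and every mention the same feature (USERNAME), instead of thousands of one-off tokens.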
We also added the ability to restrict our training data by language. One word may have an opposite meaning in a different language, potentially reducing the accuracy of our model. Twitter provides language metadata on its tweets, but we found it to be inaccurate. Instead, we employed another language classification technique that uses character frequency to determine the most likely language. This approach is ideal for our purposes because its accuracy is not compromised by short-form text like tweets. After making adjustments to account for Twitter-specific idioms, we were able to verify that this language classifier met our expectations.
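To give a sense of how character-frequency language identification works, here is a deliberately tiny sketch. It is not the classifier we deployed: the two sample sentences, the single-letter profiles, and the cosine-similarity comparison are all assumptions chosen to keep the example short, and a real system would build its profiles from far larger corpora.

```python
import math
from collections import Counter

# Hypothetical miniature "corpora", for illustration only.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog while the others watch",
    "es": "el veloz zorro marrón salta sobre el perro perezoso mientras los otros miran",
}


def char_profile(text):
    """Relative frequency of each letter in the text (case-insensitive)."""
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    return {c: n / total for c, n in Counter(letters).items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(c, 0.0) for c, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def detect_language(text, profiles):
    """Return the language whose profile is most similar to the text's."""
    profile = char_profile(text)
    return max(profiles, key=lambda lang: cosine(profile, profiles[lang]))


PROFILES = {lang: char_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(detect_language("the dog watches the fox", PROFILES))
```

Because the method only counts characters, it needs no dictionary and no tokenizer, which is part of why it copes with the abbreviations and slang common in tweets.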
To measure the effectiveness of our categorizer, we must compare its results with tweets that have been classified by an actual human. The simplest metric of effectiveness is to compute the percentage of tweets categorized correctly. Having classified the tweets by hand, we know the relative frequencies of each category. As such, we can compute our expected accuracy if we were to categorize tweets at random according to this same distribution. With the distribution of our test data, the expected success rate would be 38%. The model, on the other hand, correctly classifies 60% of all tweets. This represents a significant improvement over the random case.
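The random baseline works as follows: if a tweet’s true category occurs with frequency p and we guess that category with the same probability p, the chance of a correct guess in that category is p × p, and the overall expected accuracy is the sum of these squares over all categories. The sketch below illustrates the calculation. The distribution in it is hypothetical: our actual test distribution is not reproduced in this post, and these numbers were picked only because they happen to give a baseline of 38%.

```python
def chance_accuracy(distribution):
    """Expected accuracy when guessing at random according to `distribution`.

    distribution: mapping from category to its relative frequency (sums to 1).
    """
    total = sum(distribution.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("relative frequencies must sum to 1")
    return sum(p * p for p in distribution.values())


# Hypothetical class frequencies, NOT the real test data.
hypothetical = {"positive": 0.2, "negative": 0.3, "neutral": 0.5}

if __name__ == "__main__":
    baseline = chance_accuracy(hypothetical)
    model_accuracy = 0.60  # figure reported in this post
    print(f"chance baseline: {baseline:.2f}, model: {model_accuracy:.2f}")
```

Note that the baseline is lowest, at one third, when the three categories are equally common, and rises as the distribution becomes more lopsided.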
Our users will be more interested in how close the aggregate sentiment frequencies are to their ‘true’ values than in the sentiment of any one given tweet. The questions we are trying to answer accurately for our clients include: what fraction of tweets is positive, what fraction is negative, and what fraction is neutral? These relative frequencies can be viewed as a discrete distribution on one random variable taking three possible values. Thus, the three relative frequencies can each take any value between 0 and 1 so long as they sum to 1. Geometrically, the set of allowable relative frequencies describes a surface in the unit cube (three variables with one constraint). Specifically, the portion of the surface of the unit sphere with all coordinates positive represents the set of all allowable distributions.
What does this have to do with our accuracy? The point associated with the human-measured frequencies and the point associated with the frequencies generated by our automatic classifier lie 0.057 apart on the surface of the sphere. Are we willing to accept this as close enough to the true frequency distribution? The set of points that are within 0.057 of the point for the human-measured frequencies makes up less than 0.6% of the total area of the spherical section that represents all possible frequency distributions. To account for possible error in human measurement, we must consider that the true distribution may not be equal to the human-measured distribution. A reasonable estimate of this error puts our classifier’s distribution at most twice this distance from the true distribution, but possibly much closer.
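The area comparison can be approximated with the standard formula for a spherical cap. The sketch below rests on two assumptions that are not spelled out above: that the distance is measured as a great-circle angle on the unit sphere, and that the cap around the reference point lies entirely inside the all-positive section (clipping at the boundary is ignored). The function name `cap_fraction_of_octant` is hypothetical.

```python
import math


def cap_fraction_of_octant(radius):
    """Fraction of the all-positive eighth of the unit sphere covered by a
    spherical cap of the given angular radius (in radians).

    Assumes the cap lies entirely inside the octant.
    """
    cap_area = 2.0 * math.pi * (1.0 - math.cos(radius))
    octant_area = math.pi / 2.0  # one eighth of the full sphere area, 4*pi
    return cap_area / octant_area


if __name__ == "__main__":
    distance = 0.057  # separation reported in this post
    for r in (distance, 2 * distance):
        print(f"radius {r:.3f}: {100 * cap_fraction_of_octant(r):.2f}% of the octant")
```

Under these assumptions the single distance covers roughly 0.65% of the section and the doubled distance roughly 2.6%. The small difference from the figure quoted above may come from rounding of the distance or from a slightly different distance measure in the original calculation.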
Allowing for this doubled distance, we can estimate that the frequency distribution produced by our categorizer is in the 97th percentile: it lies closer to the true distribution than roughly 97% of all possible distributions. While this exact number may depend on the data set in question, any results near this accuracy paint a clear picture of overall sentiment across the web. We find this very encouraging, and expect that our results will only improve as we expand our training data set. Furthermore, now that we have laid the groundwork in our implementation, we can easily apply the Maximum Entropy algorithm to other classification problems, and hope to extend it to deeper problems such as consumer intent.
References
1. Alec Go, Richa Bhayani and Lei Huang, “Twitter Sentiment Classification using Distant Supervision,” http://www.stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf
2. Bo Pang and Lillian Lee, “A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts,” http://www.cs.cornell.edu/home/llee/papers/cutsent.home.html





