Over the course of the past few weeks I have been so tied up working on a new technology stack that I haven’t formally introduced my PHPAzureContest (and university dissertation project) ‘Twitter Sentiment Engine.’
Today I took an hour or two to catch up on the written side of this project and put together an introduction, rationale, requirements and draft functional spec. I say ‘draft’ because at the moment the specification is more than a little vague. In the next week (when I drag myself away from Azure) I’m going to fill this out more.
Twitter Sentiment Introduction And Rationale
Twitter Sentiment Engine (TSE) is a tool for analysing the way people feel towards an event, brand or product on the social network Twitter. Twitter is one of the fastest growing social networks around the world. Users make status updates (tweets) about a variety of subjects. There is a clear commercial rationale for this product for companies who wish to answer the question “How are we perceived right now?” Answering this question would help companies quickly respond to emerging trends on this fast paced and influential social network.
The aim of this project is to deliver a working prototype of TSE. It should accept one or more keywords, gather a sample of positive and negative tweets about the keyword and use this sample to produce a filter that can monitor sentiment on an ongoing basis. The tool may initially consist of a reporting website, but a key objective is that it should expose an API that lets developers access the product and integrate it with their own social media monitoring solutions. A secondary objective of this project is that TSE should be a highly scalable solution: it should be able to expand its processing power and throughput by adding more hardware.
Twitter Sentiment Engine Requirements Document
The product will be capable of inferring sentiment (positive or negative) on a Tweet.
The product will be capable of tracking keywords or terms on Twitter.
Tracked terms should have regular sentiment statistics available for them (i.e. how do people feel about this tracked term right now).
Sentiment history should be available for tracked terms.
The product should be capable of serving numerous endpoints / end applications.
The product should provide a means of authentication for the endpoints.
Tracking and the NLP processing should occur ‘off request’, i.e. asynchronously.
Twitter Sentiment Engine Functional Specification (draft)
Section One: Sentiment Detection / Natural Language Processing
The product will make use of the naïve Bayes filtering technique to infer sentiment on tweets.
- A training corpus is created. The corpus consists of an equal sample of positive and negative Tweets.
- Every Tweet in the corpus is ‘normalised’. Punctuation is removed, letters are lower cased.
- Each tweet is split into its component words.
- For each sample class (+tive / -tive) the instances of words are counted and recorded.
- Tweets are gathered from the Twitter API and classified.
- Every tweet undergoes the same normalisation process as in the training corpus.
- Tweets are split into their component words. Bayes’ formula is used to combine the per-word probabilities into an overall probability that the tweet is positive or negative.
- Providing that the probability for one class (+tive / -tive) exceeds a given certainty threshold, the tweet is assigned to the class with the greater probability.
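The steps above can be sketched in a few lines. This is an illustrative Python sketch, independent of the actual implementation: the Laplace smoothing and the 0.75 threshold default are my assumptions, not part of the spec.

```python
import math
import string
from collections import Counter

def normalise(tweet):
    """Lower-case the tweet and strip punctuation, as per the spec."""
    return tweet.lower().translate(str.maketrans("", "", string.punctuation))

def train(positive_tweets, negative_tweets):
    """Count word occurrences for each sample class of the training corpus."""
    counts = {"pos": Counter(), "neg": Counter()}
    for tweet in positive_tweets:
        counts["pos"].update(normalise(tweet).split())
    for tweet in negative_tweets:
        counts["neg"].update(normalise(tweet).split())
    return counts

def classify(tweet, counts, threshold=0.75):
    """Return 'pos', 'neg', or None if neither class clears the threshold."""
    vocab = set(counts["pos"]) | set(counts["neg"])
    log_prob = {}
    for cls in ("pos", "neg"):
        total = sum(counts[cls].values())
        # Laplace (add-one) smoothing avoids zero probabilities for unseen words.
        log_prob[cls] = sum(
            math.log((counts[cls][w] + 1) / (total + len(vocab)))
            for w in normalise(tweet).split()
        )
    # Turn the two log scores into a normalised probability of the positive class.
    p_pos = 1 / (1 + math.exp(log_prob["neg"] - log_prob["pos"]))
    if max(p_pos, 1 - p_pos) < threshold:
        return None  # not certain enough; discard rather than misclassify
    return "pos" if p_pos > 1 - p_pos else "neg"
```

Returning `None` below the threshold matches the spec’s intent that a tweet is only classified when one class clearly dominates.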
Reid (2005) describes the process of using a ‘noisy indicator’ to broadly cast a sample into one of a number of possible classes. He used the example of the presence of a smiley face glyph in small text samples to broadly infer the author’s sentiment / feelings. This process will be used in TSE to gather sample data from the Twitter API.
- The Twitter search API will be used to gather tweets with the tracked term and a smiley face glyph.
- Positive Glyphs are:
- Negative Glyphs are:
- Tweets with more than one glyph are discarded (as this may symbolise conflicting sentiment or irony).
- The sample should be as large as possible. Twitter currently supports a sample of 1,500 tweets per search term. Two searches are used (one per sentiment class), totalling a sample size of 3,000 tweets.
- The training corpus will be persisted using a database.
- A term may have more than one training corpus associated with it (in the case of a retraining).
- The Twitter API will be used to query, on a regular basis, for tweets containing the search term used in the training corpus.
- The persisted training corpus will be used to calculate the sentiment of tweets as described above.
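The sample-gathering step above can be sketched as query construction against the Twitter search API. In this illustrative Python sketch the endpoint URL, the `rpp`/`page` parameter names (100 results per page over 15 pages gives the 1,500-tweet cap) and the `:)` / `:(` glyphs are all my assumptions; the spec’s own glyph lists are not reproduced here.

```python
from urllib.parse import urlencode

# Assumed endpoint for the (then-current) Twitter search API.
SEARCH_URL = "http://search.twitter.com/search.json"

def sample_queries(term, pages=15, rpp=100):
    """Build the two emoticon-filtered query URL lists for a tracked term.

    One list per sentiment class: the glyph acts as the 'noisy indicator'
    that labels each returned tweet as positive or negative sample data.
    """
    queries = {"pos": [], "neg": []}
    for glyph, cls in ((":)", "pos"), (":(", "neg")):
        for page in range(1, pages + 1):
            params = urlencode({"q": f"{term} {glyph}", "rpp": rpp, "page": page})
            queries[cls].append(f"{SEARCH_URL}?{params}")
    return queries
```

A worker would fetch each URL, discard tweets containing more than one glyph, and persist the remainder as the labelled training corpus.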
Section Two: API & Infrastructure
The product will have an internal and an external API. The external API will deal with tracking requests and sentiment statistics requests. The internal API will expose an event-based queuing system used by worker nodes to process the sample-gathering and sentiment-processing jobs.
Public REST API
The public REST API accepts tracking requests from authorised web clients. It also provides statistics on tracked terms. The public API utilises the AtomPub protocol for publishing data about the tracked terms. The service is hypermedia-driven:
- A single AtomPub workspace exists containing the category ‘tracking’.
- An AtomPub feed provides hypermedia links to entries which represent tracked terms.
- Tracked term entries contain links to a paginated resource representing the historical data held on that entity.
- The API will provide a search resource to obtain specific tracked item entries.
- Authentication is achieved using HTTP Basic authentication over SSL, utilising API keys.
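As a minimal sketch of the API-key scheme above (the key and secret values are illustrative placeholders, and this is independent of the actual implementation), a client would send its key pair in a standard Basic `Authorization` header on every HTTPS request:

```python
import base64

def basic_auth_header(api_key, api_secret):
    """Build the HTTP Basic Authorization header from an API key pair."""
    token = base64.b64encode(f"{api_key}:{api_secret}".encode("ascii")).decode("ascii")
    return {"Authorization": f"Basic {token}"}

# A client would attach this header to every HTTPS request,
# e.g. when fetching the 'tracking' AtomPub feed.
headers = basic_auth_header("my-api-key", "my-api-secret")
```

Note that Basic auth is only safe here because the spec mandates SSL; the base64 encoding is reversible, not a form of encryption.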
Internal REST API
- The internal API exposes a single workspace containing the category ‘tracking’.
- An AtomPub feed provides hypermedia links to draft entries (marked with AtomPub’s app:draft element) representing unfulfilled tracking requests. Workers monitor the feed and fulfil the jobs represented by each entry. Concurrency is managed using HTTP cache-control headers and ETags, which are checked before a resource is PUT back to the service on job fulfilment.
- A similar feed is exposed to marshal the sentiment-statistics-gathering jobs.
- Authentication is achieved using HTTP Basic authentication over SSL with API keys. IP-based blocking should also be in place.
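The ETag-based concurrency scheme above can be sketched with an in-memory stand-in for the internal feed. Real workers would do this over HTTP with an `If-Match` header; the class, the state names and the integer status codes here are illustrative assumptions, not the spec’s API.

```python
import uuid

class JobStore:
    """In-memory stand-in for the internal feed, illustrating ETag checks."""

    def __init__(self):
        self._jobs = {}  # job_id -> (etag, state)

    def create(self, job_id):
        """Register an unfulfilled tracking job and return its initial ETag."""
        etag = uuid.uuid4().hex
        self._jobs[job_id] = (etag, "draft")
        return etag

    def get(self, job_id):
        """Return the (etag, state) pair a worker reads from the feed."""
        return self._jobs[job_id]

    def put(self, job_id, new_state, if_match):
        """Conditional PUT: succeeds only if the caller holds the current ETag."""
        current_etag, _ = self._jobs[job_id]
        if if_match != current_etag:
            return 412  # Precondition Failed: another worker got there first
        self._jobs[job_id] = (uuid.uuid4().hex, new_state)
        return 200
```

If two workers read the same entry, the first conditional PUT succeeds and rotates the ETag, so the second worker’s PUT fails with 412 and it moves on to another job; no entry is fulfilled twice.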