Sentiment Analysis with Twitter

Recently, I’ve been learning the basics of performing sentiment analysis on social media data with R. In particular, I used the twitteR library – written and generously shared by Jeff Gentry – to pull tweets mentioning companies competing in the digital environment from the Twitter API, analyse their content using text mining methodologies, and plot their sentiment against each other.

This method can be helpful to benchmark the perception people have of a company against its competitors, and to understand what specific things people like and dislike about them. The best part of it is that it can all be done for free.

The aim of this post is not to provide a comprehensive guide to performing sentiment analysis on Twitter data, but to explain step by step one of the simplest methods to do so, which can serve as a starting point for a more advanced analysis in line with your company strategy.

Most of the code for this article has been taken from the Mining Twitter for Airline Consumer Sentiment article.

Result

So, what is the expected result of a “sentiment analysis”? We’ll start by showing an example output of the analysis, and then we’ll present the details of the process and the code used to get it.

Sentiment Graphs

The first and most visual result is a series of histograms that show, for each company in the analysis, the number of tweets at each level of sentiment. Tweets with a score lower than 0 are considered negative, those with a score of 0 are neutral, and those with a score of 1 or higher are deemed positive:

Histogram showing the sentiment of tweets mentioning companies in the digital space that have participated in the last edition of DBi’s Barcelona Data Summit.

In the previous series of histograms, we can see that most companies in the analysis have a majority of neutral mentions. However, Google Analytics has more positive mentions than neutral ones, while Facebook and Optimizely have a higher proportion of negative mentions.

Bear in mind that this is only a basic approximation and makes no claim to accuracy. To assess whether the tweets are truly negative or positive, we need to read their content.

Score Table

The next output is a table that shows the total score of each company we included. The score is calculated as the ratio between the number of tweets with a very positive sentiment score and the total number of tweets with a strong sentiment (tweets with very positive or very negative scores):

Table showing the sentiment scores of the companies in the study


As we can see, companies with a small number of mentions may not have any tweets with very positive or very negative sentiment scores.

Also, Facebook and Optimizely get a lower score, as they have a higher proportion of negative mentions compared to the total.

Tweets with sentiment

All the above is great, but how do we know that these tweets are actually positive or negative? And even if we trust these scores, how can we know what they are talking about? Well, we can get a list of the tweets with very positive sentiment scores. As an example, we’ll look at the tweets mentioning Optimizely with a score higher than 2. The result is the following:

[1] RT @Optimizely: Honored to be in such good company for the Bay Area’s Best Places to Work 2016 @newrelic @eventbrite @zillow https://t.co/Y…

In this case, the tweet corresponds to a grateful employee. It is good to know that people are happy to work with us.

Now it is the turn of the bad opinions. In this case, we’ll use the Facebook ones:

[1] @facebook… for f***s sake… Can’t find a post posted 3mins ago due to your stupid “non chronological” timeline…
[2] My friends are unable to send me messages it keeps saying error? @facebook
[3] In other news , deletion from @facebook doesn’t hurt , nothing compares to being blocked on @twitter
[4] Friend requests on @facebook I’m still confused
[5] Facebook wants us to hide nipples but cant f***ing delete shocking and violent photos of dead animals. @facebook F***ING FIX UR PRIORITIES
[6] @facebook so where did all of our links go on our fan pages? Taken down for 0 reason? Facebook is awful.
[7] @facebook my ad account is temporarily disabled and I want to remove all card details saved. 2/2
[8] @jiminmokena @2AFight @ChristiChat @facebook I guess they have missed mine somehow. Maybe a class action law suit by those wronged?
[9] Facebook loses first round in suit over storing biometric data – https://t.co/eutAOjYUsT via https://t.co/S7aCD42ucs @facebook #tech
[10] RT @TonyTGarnett: Goodbye @Facebook? Facebook blocked my website and blog. I’m still waiting on feedback from them. It’s a mystery. https:/…
[11] RT @mutludc: . @Facebook In Dispute w/ Pro-Kurdish Activists Over Deleted Posts @saramayspary @BuzzFeedUK https://t.co/XCUewn7ed6 https://t…
[12] @facebook I lost my Facebook page of Indain Boi. Link: https://t.co/PyyTMSn1bG its not showing in my fb account. Please reply asap.
[13] RT @facebook: @KickingK Hi there. Thanks for flagging. Please report this to us by following these Help Center steps: https://t.co/DwCfhOMC…
[14] Anybody else just recently getting a ridiculous amount of notifications from @facebook ?? Getting to the point where I’ll switch em all off

We can see that the first tweet complains about the new timeline, the second about problems with the messaging system, others about the censorship policy, and we could keep going. Some, like tweet 13, are not actually negative, which shows that the algorithm needs to be revised and improved. Nonetheless, there is a reasonably strong correlation between the score and the actual sentiment, and clearly defined problems that Facebook can choose (or not) to address based on this analysis.

First Step: Prepare environment

Congratulations! If you are reading this it means you are actually interested in learning how to perform this sentiment analysis for your company and its competitors.

We’ll assume that you have R installed on your computer and have basic knowledge of how to add libraries and use the RStudio environment.

Install libraries

To begin with, make sure you have the following R libraries installed on your computer; we will be using them in the code later on:

  • twitteR
  • ggplot2
  • plyr
  • stringr
  • doBy

You can do this by calling the install.packages() function from the R console, or by navigating to “Tools > Install Packages…” in the RStudio menu.
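For reference, the whole setup can be done in a single call from the R console (this requires an internet connection):

```r
# One-off setup: install every package used in this post
install.packages(c('twitteR', 'ggplot2', 'plyr', 'stringr', 'doBy'))
```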

Download Opinion Dictionaries

In order to know whether a tweet has a “positive” or a “negative” connotation, we need to use dictionaries. There are standard ones that will be very handy as a starting point; they can be downloaded from the University of Illinois at Chicago’s Computer Science Department.

Please bear in mind that these dictionaries are generic, so you will probably want to refine them by adding vocabulary specific to your industry and/or removing irrelevant words.
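As a sketch of that refinement, once the dictionaries are loaded you can append industry terms and drop misleading entries with plain vector operations. The words below are hypothetical examples, not part of the real lexicons:

```r
# Small stand-in dictionaries (the real ones come from the lexicon files)
positive_words <- c('good', 'great', 'love')
negative_words <- c('bad', 'awful', 'cloud')

# Add industry-specific vocabulary with a positive connotation
positive_words <- c(positive_words, 'frictionless', 'scalable')

# Remove words that are neutral in our industry despite being in the lexicon
negative_words <- setdiff(negative_words, 'cloud')
```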

Twitter credentials

In order to use the twitteR library you will need a Twitter account and access to its API credentials. You have to create a Twitter application and copy the credentials from its interface. As I don’t want to repeat myself too much and create content that already exists, there are plenty of guides online about how to do it.

The only struggle I had was finding the page where to create the app from. You can try with this URL: Create an App

Second Step: Writing the code

Now that we have everything we need on our computer, we can proceed to write the code that will authenticate against the Twitter API, pull the data, give a sentiment score to each tweet, and return the results.

Require the libraries

To begin with, we need to include the necessary libraries to our environment:

  • twitteR: This library encapsulates all the functionality we need to access and process Twitter data. It also contains some classes that will be used to manage tweets.
  • ggplot2: We will use ggplot2 to visualise our information. It is one of the most popular visualisation libraries, and it is highly customisable.
  • plyr: This library is used to apply functions to arrays of data, data.frames and other R data structures.
  • stringr: We will use the functionality provided by this library to process and manipulate strings.
  • doBy: This provides functionality to present results and group them by particular criteria. We will mainly use it for its orderBy() function.

The code to include the libraries looks like the following:

library(twitteR)
library(ggplot2)
library(plyr)
library(stringr)
require(doBy)

Load the dictionaries

The next thing will be to read the dictionaries we downloaded on the first step. To do so, we’ll execute the following code:

positive_words <- scan('/Users/lluis gasso/Downloads/opinion-lexicon-English/positive-words.txt', what='character', comment.char=';')
negative_words <- scan('/Users/lluis gasso/Downloads/opinion-lexicon-English/negative-words.txt', what='character', comment.char=';')

Make sure you update the first parameter of the scan call with the path to the place where you uncompressed the .zip file.
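To see what scan() does with these files, here is a minimal, self-contained demonstration: lines starting with “;” (like the header of the lexicon files) are skipped thanks to the comment.char parameter:

```r
# Write a tiny lexicon-style file and read it back with scan()
tmp <- tempfile()
writeLines(c(';; header comment, as in the real lexicon files', 'good', 'great'), tmp)
words <- scan(tmp, what = 'character', comment.char = ';')
print(words)  # "good" "great"
```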

Score sentiment function

The following function has been taken from the Mining Twitter for Airline Consumer Sentiment article.

It will help us find the matches of our positive and negative words in the tweets and assign a sentiment score to each tweet based on the number of matches of each type.

# Create the score sentiment function. The first parameter is the list of tweets.
# The second and third parameters are the lists of positive and negative words.
# The last parameter indicates that we don't want to see a progress bar while the function executes.
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  
  # laply is from the plyr package; it applies the function passed as the second
  # argument to each element of the list passed as the first. The "l" means the input
  # is a list, the "a" that it returns an array, and "ply" is common to this family of functions.
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    
    # Clean up the sentences with gsub() to allow word matching and convert them to lower case
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub("[^[:alnum:]///' ]", '', sentence)
    sentence = gsub('\\d+', '', sentence)
    sentence = tolower(sentence)
    
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    
    # Converting matches to True or False
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    
    return(score)
  }, pos.words, neg.words, .progress=.progress )
  
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}
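To make the cleaning-and-matching steps concrete, here is a self-contained walk-through of the same logic on a single made-up tweet, with tiny stand-in dictionaries (the tweet and the word lists are illustrative, not from the real data):

```r
# A made-up tweet with one positive and one negative word
sentence <- "Loving the new dashboard, but the export is SO slow..."

# Same clean-up as in score.sentiment(): strip punctuation, convert to lower case
sentence <- tolower(gsub('[[:punct:]]', '', sentence))
words <- unlist(strsplit(sentence, '\\s+'))

# Tiny stand-in dictionaries
pos.words <- c('loving', 'love', 'great')
neg.words <- c('slow', 'broken', 'awful')

# One positive match ('loving') minus one negative match ('slow')
score <- sum(!is.na(match(words, pos.words))) - sum(!is.na(match(words, neg.words)))
print(score)  # 0
```

A tweet like this one, with mixed sentiment, ends up scored as neutral, which is one of the known limitations of the word-counting approach.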

Authenticate

Luckily for us, the TwitteR library handles the OAuth calls for us (phew!). This means that we only need to execute the code below to authenticate. You need to make sure you replace the credentials with the values obtained during the first step:

# Create Twitter account credentials
consumer_key <- "####"
consumer_secret <- "####"
access_token <- "####"
access_secret <- "####"

# Authenticate against UOC Mining application
setup_twitter_oauth(consumer_key, consumer_secret, access_token=access_token, access_secret=access_secret)

Get and process tweets

Now it is time to pull the tweets that mention our company. To do that, we will use the searchTwitter() function, passing as the first parameter the content we want to search for, as the second the maximum number of tweets, and as the third the content language.

We can update these parameters to fit our needs but, if we change the language, we will need new dictionaries that contain positive and negative words in that language.

Finally, we label the results to visualise them later on.

# Pull the tweets in English that mention the DBi UK Twitter account and process them.
dbi.tweets <- searchTwitter('@DBiUK', n=100, lang= 'en')
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 100 tweets were requested but the
## API can only return 2
dbi.text <- laply(dbi.tweets, function(t) t$getText())
dbi.score <- score.sentiment(dbi.text, positive_words, negative_words)
dbi.score$name = 'DBi'
dbi.score$code = 'DBi'

Once we have the tweets mentioning our company, we can start pulling the ones mentioning the rest of the companies in our analysis. You will probably have to spend some time investigating which Twitter accounts are best to use for this exercise, but it is worth the time.

Use the following code, replacing the accounts with the ones most relevant to you. We are just repeating the process above for each competitor:

# Get tweets mentioning our competitors
omc.tweets = searchTwitter('@OracleMktgCloud', n=100, lang = 'en')
ga.tweets <- searchTwitter('@googleanalytics', n=100, lang= 'en')
kd.tweets <- searchTwitter('@KruxDigital', n=100, lang= 'en')
op.tweets <- searchTwitter('@Optimizely', n=100, lang= 'en')
mfg.tweets <- searchTwitter('@mfg_labs', n=100, lang= 'en')
fb.tweets <- searchTwitter('@facebook', n=100, lang= 'en')

# Process the text
omc.text <- laply(omc.tweets, function(t) t$getText())
ga.text <- laply(ga.tweets, function(t) t$getText())
kd.text <- laply(kd.tweets, function(t) t$getText())
op.text <- laply(op.tweets, function(t) t$getText())
mfg.text <- laply(mfg.tweets, function(t) t$getText())
fb.text <- laply(fb.tweets, function(t) t$getText())

# Get scores and label results
omc.score <- score.sentiment(omc.text, positive_words, negative_words)
omc.score$name = 'Oracle Marketing Cloud'
omc.score$code = 'OMC'
ga.score <- score.sentiment(ga.text, positive_words, negative_words)
ga.score$name = 'Google Analytics'
ga.score$code = 'GA'
kd.score <- score.sentiment(kd.text, positive_words, negative_words)
kd.score$name = 'Krux Digital'
kd.score$code = 'Krux'
op.score <- score.sentiment(op.text, positive_words, negative_words)
op.score$name = 'Optimizely'
op.score$code = 'OP'
mfg.score <- score.sentiment(mfg.text, positive_words, negative_words)
mfg.score$name = 'MFG Labs'
mfg.score$code = 'MFG'
fb.score <- score.sentiment(fb.text, positive_words, negative_words)
fb.score$name = 'Facebook'
fb.score$code = 'FB'
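The repetitive pull-process-score pattern above can be collapsed into a small helper function. This is a sketch that assumes score.sentiment(), the dictionaries and an authenticated session already exist, exactly as set up earlier:

```r
# Pull, score and label the tweets for one account in a single call
get.company.score <- function(handle, name, code, n = 100) {
  tweets <- searchTwitter(handle, n = n, lang = 'en')
  text <- laply(tweets, function(t) t$getText())
  scores <- score.sentiment(text, positive_words, negative_words)
  scores$name <- name
  scores$code <- code
  scores
}

# Example usage (same accounts as above):
# fb.score <- get.company.score('@facebook', 'Facebook', 'FB')
# op.score <- get.company.score('@Optimizely', 'Optimizely', 'OP')
```

This keeps the script shorter and makes it trivial to add or remove competitors later on.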

Third Step: Present the results

Now that we have all the information we need, it is just a matter of presenting the results to understand what is going on. Here, we’ll repeat the output of the Result section, but we’ll explain how to obtain it.

Benchmarking histogram

We will get the first graph by plotting one histogram for each company in the analysis, with the number of tweets at each sentiment level. For this, we will combine the results into a single data.frame and use the ggplot2 library to generate the graph.

Make sure you pass the rbind() function all of the score datasets obtained in the previous step.

# Aggregate and Visualise all the results
all.scores = rbind(ga.score, kd.score, op.score, mfg.score, fb.score, dbi.score)
g = ggplot(data=all.scores, mapping=aes(x=score, fill=name))
g = g + geom_histogram(binwidth=1) # use geom_bar(binwidth=1) on ggplot2 versions older than 2.0
g = g + facet_grid(name~.)
g = g + theme_bw()
g

Score table

First, we’ll select the mentions that have a very positive or a very negative connotation: the ones with a sentiment score higher than 2 or lower than 0, respectively.

all.scores$very.pos.bool <- all.scores$score > 2
all.scores$very.neg.bool <- all.scores$score < 0

Next, we will aggregate the number of very positive and very negative mentions for each of the companies we are taking into account. We will also add a calculated indicator that represents the ratio between very positive mentions and total mentions with strong sentiment.

twitter.df <- ddply(all.scores, c('name', 'code'), summarise, very.pos.count=sum( very.pos.bool ), very.neg.count=sum( very.neg.bool ))
twitter.df$very.tot = twitter.df$very.pos.count + twitter.df$very.neg.count
twitter.df$score = round( 100 * twitter.df$very.pos.count / twitter.df$very.tot )
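As a quick sanity check of the score formula, with hypothetical counts (not taken from the study): a company with 9 very positive and 3 very negative mentions gets a score of 75:

```r
# Hypothetical counts, just to illustrate the formula
very.pos.count <- 9
very.neg.count <- 3
very.tot <- very.pos.count + very.neg.count
score <- round(100 * very.pos.count / very.tot)
print(score)  # 75: three quarters of the strong-sentiment tweets are positive
```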

Now we are ready to present the information as a table by printing the data.frame, sorted by the ratio of positive comments.

orderBy(~-score, twitter.df)
Table showing the sentiment scores of the companies in the study

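The orderBy() call above is equivalent to sorting with base R’s order(); here is a toy illustration on a made-up data.frame (the names and scores are invented):

```r
# Toy data.frame to illustrate sorting by descending score
df <- data.frame(name = c('A', 'B', 'C'), score = c(50, 100, 75),
                 stringsAsFactors = FALSE)
sorted <- df[order(-df$score), ]
print(sorted$name)  # "B" "C" "A"
```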

Mentions with sentiment

To get the messages with very positive sentiment we will filter based on the score. We will look at unique tweets for a simple reason: retweets contain the same message, so we would otherwise get repeated values:

unique(op.score[op.score$score>2,'text'])
[1] RT @Optimizely: Honored to be in such good company for the Bay Area's Best Places to Work 2016 @newrelic @eventbrite @zillow https://t.co/Y…

This lets us see what people like about the company we are analysing. In this case, the only very positive comment comes from someone who is glad to work for Optimizely. Potentially, we could use this positive feeling among our staff to our advantage, using events, social media or other platforms to encourage employees to share their enthusiasm.

We will proceed in exactly the same way to get the negative mentions, although this time we will keep scores smaller than 0, and use the Facebook tweets:

unique(fb.score[fb.score$score<0,'text'])
[1] @facebook… for f***s sake… Can’t find a post posted 3mins ago due to your stupid “non chronological” timeline…
[2] My friends are unable to send me messages it keeps saying error? @facebook
[3] In other news , deletion from @facebook doesn’t hurt , nothing compares to being blocked on @twitter
[4] Friend requests on @facebook I’m still confused
[5] Facebook wants us to hide nipples but cant f***ing delete shocking and violent photos of dead animals. @facebook F***ING FIX UR PRIORITIES
[6] @facebook so where did all of our links go on our fan pages? Taken down for 0 reason? Facebook is awful.
[7] @facebook my ad account is temporarily disabled and I want to remove all card details saved. 2/2
[8] @jiminmokena @2AFight @ChristiChat @facebook I guess they have missed mine somehow. Maybe a class action law suit by those wronged?
[9] Facebook loses first round in suit over storing biometric data – https://t.co/eutAOjYUsT via https://t.co/S7aCD42ucs @facebook #tech
[10] RT @TonyTGarnett: Goodbye @Facebook? Facebook blocked my website and blog. I’m still waiting on feedback from them. It’s a mystery. https:/…
[11] RT @mutludc: . @Facebook In Dispute w/ Pro-Kurdish Activists Over Deleted Posts @saramayspary @BuzzFeedUK https://t.co/XCUewn7ed6 https://t…
[12] @facebook I lost my Facebook page of Indain Boi. Link: https://t.co/PyyTMSn1bG its not showing in my fb account. Please reply asap.
[13] RT @facebook: @KickingK Hi there. Thanks for flagging. Please report this to us by following these Help Center steps: https://t.co/DwCfhOMC…
[14] Anybody else just recently getting a ridiculous amount of notifications from @facebook ?? Getting to the point where I’ll switch em all off

It is quite easy to see that most complaints about Facebook relate to its content policies, notifications and other service characteristics.

Conclusion

If you got to this part, you can relax: the technical bit is over. We will now wrap up with some conclusions, tips and pitfalls about the process and its results.

Overall, the method is pretty straightforward for anyone familiar with R. Once you have decided which accounts you are going to use for your analysis, it is a matter of updating the code with them and running it using RStudio or R directly. Our recommendation is to use R Markdown in RStudio, as it allows you to get a clean output in HTML or PDF format that can easily be shared.

Deciding which competitor accounts to use is not trivial and can be somewhat time-consuming. The main things to look at are:

  • Make sure that the account(s) selected have enough mentions. This analysis doesn’t make a lot of sense unless there are sufficient recent tweets. There isn’t a particular number that is right or wrong; you will get a feel for it as you pull the tweets and look at the content and the results.
  • Use relevant accounts. For example, if you are an online retailer, make sure you compare yourself to similarly sized competitors with a similar type and quality of product; if you are a publisher, compare yourself with publishers that use a similar format, content, etc. In general, use competitors that your customers could replace you with.
  • Be aware of country and language. Some accounts are global, others are country specific, others are region focused (e.g. EMEA, APAC, etc.). This is very important, as it will affect the decisions in the first two points and, more importantly, the language used for the retweets and the mentions. It is a good idea to use Twitter Analytics and other similar tools to explore the accounts and the tweets before running the code.

You probably won’t get it completely right the first time, and it will take some trial and error to extract valuable information, but the process is easy, and you will quickly learn how to adjust and filter the calls.

One final note about the positive and negative word dictionaries. It may be helpful to identify words that are not positive or negative in your industry and remove them, as well as to add any words that are specific to your sector and carry a connotation, whether good or bad. For example, “bull” and “bear” may have no connotation in many industries, but they have a clear meaning in the stock market; it would be interesting to check whether they help when analysing the sentiment of a comment.

What I like most about this approach is that the data is open to everyone, for free! So it is just a matter of identifying how you can apply it to your business: start pulling and analysing tweets, refine your methods, and apply the learnings to your company.

 

by Lluis Gassó
