*This analysis was performed in mid-February.*

While the Democratic primary race was hot and heavy earlier this year, my partner, Julia, and I thought it would be a great idea to do our Natural Language Processing (NLP) project on political Tweets during our Data Science bootcamp. Since Twitter is a very popular forum for news and people to share their opinions, we wanted to see if we could uncover any insights about how the race was going to shake out based on Twitter activity.

This is how we did it:

The Data

In order to get Tweets, we used an advanced Twitter scraping tool called Twint, which allows you to scrape Tweets without using the Twitter API. To avoid obvious biases, we focused on Tweets that mentioned the candidates as opposed to Tweets written by the candidates themselves. We filtered and cleaned the Tweets as follows:

  1. Kept only Tweets that contained a single candidate's name and/or Twitter handle
  2. Used Tweets from verified accounts only, to keep the dataset at a manageable size
  3. Removed links and images from the Tweets, and filtered out non-English Tweets using langdetect
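The link and image cleanup in step 3 can be sketched with a regular expression. The `clean_tweet` helper and its pattern below are illustrative, not the project's actual code, and the langdetect filter is shown only as a comment since it needs the third-party `langdetect` package:

```python
import re

def clean_tweet(text):
    """Strip URLs and Twitter image links from a Tweet's text (illustrative sketch)."""
    # Remove http(s) links, including pic.twitter.com image URLs
    text = re.sub(r"https?://\S+|pic\.twitter\.com/\S+", "", text)
    # Collapse the whitespace left behind
    return re.sub(r"\s+", " ", text).strip()

# Non-English Tweets could then be dropped with langdetect, e.g.:
#   from langdetect import detect
#   tweets = [t for t in tweets if detect(t) == "en"]

print(clean_tweet("Biden leads in the polls https://t.co/abc123 pic.twitter.com/xyz"))
```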

Sentiment Analysis

In order to run sentiment analysis — extracting feeling/emotion from text — we used VADER Sentiment Analysis (Valence Aware Dictionary and sEntiment Reasoner). As stated by its documentation, VADER is "a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media."

  1. Obtained the positive, negative, neutral, and compound score for Tweets about each candidate:
# Create sentiment analysis functions
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def positive_sentiment(row):
    scores = analyzer.polarity_scores(row['tweet'])
    return scores['pos']

def neutral_sentiment(row):
    scores = analyzer.polarity_scores(row['tweet'])
    return scores['neu']

def negative_sentiment(row):
    scores = analyzer.polarity_scores(row['tweet'])
    return scores['neg']

def compound_sentiment(row):
    scores = analyzer.polarity_scores(row['tweet'])
    return scores['compound']

  2. Got the average polarity score in each category for each candidate
  3. Marked each Tweet as either positive or negative depending on which score was higher
  4. Created a dictionary with the average number of positive and negative Tweets for each candidate to use for a visualization:
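Steps 3 and 4 can be sketched in plain Python. The function names and the per-candidate score pairs below are hypothetical; in the project, the scores would come from the VADER functions above:

```python
def label_tweet(pos_score, neg_score):
    """Mark a Tweet positive or negative based on whichever VADER score is higher."""
    return "positive" if pos_score >= neg_score else "negative"

def share_positive(labels):
    """Fraction of a candidate's Tweets labeled positive."""
    return sum(1 for label in labels if label == "positive") / len(labels)

# Hypothetical per-candidate (positive, negative) score pairs
candidate_scores = {
    "Biden": [(0.4, 0.1), (0.2, 0.3), (0.5, 0.0)],
    "Sanders": [(0.1, 0.2), (0.6, 0.1)],
}

# Dictionary of positive-Tweet shares per candidate, ready for plotting
positive_share = {
    name: share_positive([label_tweet(p, n) for p, n in scores])
    for name, scores in candidate_scores.items()
}
print(positive_share)
```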

(President Trump included in the analysis for comparison)

Overall, we see that the majority of Tweets about each candidate were deemed 'positive,' with an uptick in 'negative' sentiment surrounding the more controversial candidates.

Subjectivity Analysis

For our subjectivity analysis — extracting personal opinions/views/beliefs — we turned to TextBlob, which is a Python library for processing textual data. TextBlob "provides a consistent API for diving into common Natural Language Processing (NLP) tasks."

  1. Obtained sentiment for each Tweet
  2. Used regular expressions to capture float values and then converted the strings to floats
Subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective
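Step 2 above — pulling the subjectivity value out of TextBlob's `Sentiment(polarity=…, subjectivity=…)` string with a regular expression — can be sketched as follows; `extract_subjectivity` is an illustrative helper, not the project's actual code:

```python
import re

def extract_subjectivity(sentiment_str):
    """Capture the subjectivity float from a TextBlob sentiment string."""
    match = re.search(r"subjectivity=([\d.]+)", sentiment_str)
    return float(match.group(1)) if match else None

print(extract_subjectivity("Sentiment(polarity=0.25, subjectivity=0.8)"))  # 0.8
```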

As we can see, most Tweets about Bernie Sanders seemed to be very objective, while the remaining candidates' subjectivity scores were roughly normally distributed.

TF–IDF

TF–IDF (Term Frequency–Inverse Document Frequency), in this case, reflects how important a word is to a Tweet in a whole collection of Tweets. "The TF–IDF value increases proportionally to the number of times a word appears in the [Tweet] and is offset by the number of [Tweets] in the [collection of Tweets] that contain the word." For our TF–IDF analysis, we used Scikit-Learn's feature extraction method.

  1. Tokenized and lemmatized all of the Tweets using NLTK functions, and made all of the tokens lowercase
  2. Removed the words that are part of NLTK's stop words list
  3. Converted our collection of Tweets to a matrix of token counts
  4. Used TfidfVectorizer(), TfidfTransformer(), CountVectorizer(), fit_transform(), and fit() functions to get a TF–IDF vector for each candidate's Tweets
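To make the quoted TF–IDF relationship concrete, here is a minimal pure-Python version of the standard formula (tf × log(N / df)), using a tiny made-up corpus of tokenized Tweets. The project itself used Scikit-Learn's vectorizers, which add smoothing and normalization, so the exact numbers differ:

```python
import math

def tf_idf(term, doc, corpus):
    """Term frequency in `doc`, offset by how many docs in `corpus` contain `term`."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)  # document frequency
    return tf * math.log(len(corpus) / df) if df else 0.0

tweets = [
    ["biden", "rally", "tonight"],
    ["biden", "poll", "lead"],
    ["rally", "crowd"],
]
# "biden" appears in 2 of 3 Tweets, so its weight is modest...
print(tf_idf("biden", tweets[0], tweets))
# ...while "tonight" appears in only 1 of 3, so it is weighted more heavily
print(tf_idf("tonight", tweets[0], tweets))
```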

From the TF–IDF analysis, we gain insight into words that are important in the Tweets that we collected.

Word Clouds

Word Clouds were great visualizations to see which words appeared most frequently in each candidate's collection of Tweets. We leveraged NLTK's FreqDist() function to make our Word Clouds:

from nltk import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Count the 100 most common tokens in the candidate's Tweet corpus
word_freqdist = FreqDist(biden_bank)
most_common_biden = word_freqdist.most_common(100)

# Convert the (word, count) pairs to a dictionary for the Word Cloud
biden_dictionary = dict(most_common_biden)

# Create the Word Cloud:
wordcloud = WordCloud(colormap='Spectral').generate_from_frequencies(biden_dictionary)

# Display the generated image w/ Matplotlib:
plt.figure(figsize=(10, 10), facecolor='k')
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

The bigger the word appears in the 'cloud,' the more commonly it was used in Tweets:


Similar to TF–IDF, we gained insights from the words that reoccurred across each candidate's collection of Tweets.

Topic Modeling with LDA

LDA, or Latent Dirichlet Allocation, separates the tokens in the Tweets based on underlying, unobserved similar topics to help users interpret them. For our LDA, we used pyLDAvis, which is a Python library for topic modeling visualization.

  1. Created one list of all of the Tweets across candidates and eliminated given and chosen stop words
  2. Ran CountVectorizer() and TfidfVectorizer() functions on the Tweets, followed by the LatentDirichletAllocation() function
  3. Attempted to create topic models with seven different topics (one for each of the Democratic primary candidates and one for President Trump) using either just TF or TF–IDF.
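The pipeline in steps 1–3 can be sketched with Scikit-Learn as follows; the tiny stand-in corpus and parameter values here are illustrative, not the project's actual data or settings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in corpus: each string would be one cleaned Tweet
tweets = [
    "biden wins the debate tonight",
    "sanders rally draws a huge crowd",
    "warren releases a new healthcare plan",
    "trump responds on twitter",
]

# Step 2: build a term-frequency matrix with stop words removed
vectorizer = CountVectorizer(stop_words="english")
tf_matrix = vectorizer.fit_transform(tweets)

# Step 3: fit an LDA model; the project used seven topics, one per candidate
lda = LatentDirichletAllocation(n_components=7, random_state=42)
doc_topics = lda.fit_transform(tf_matrix)

# Each row is one Tweet's distribution over the seven topics
print(doc_topics.shape)
```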

Overall, the Tweets mentioning the candidates were all very similar and thus difficult to differentiate from one another. Twitter is oftentimes a very polarizing platform, with opinions conveyed from both ends of the spectrum, possibly resulting in an overall neutrality. Julia and I felt that it would be interesting future work to compare this analysis with Tweets from the same point in the 2016 presidential election, or perhaps to build a classification model that tries to differentiate Tweets by the candidates they are about. For now, however, it could be interesting to just look at Tweets referring to Biden and Trump, as they are the last two standing in the 2020 presidential election race.

If you're interested in looking at the full code from our project, check it out on my GitHub!