Hi, I'm Raj. A week ago, I had a discussion about the relevance of Bernie Sanders among millenials - and so, I set out to get a rough idea by looking at Twitter. This document is a summary of my findings, including data tables and graphs, as well as the code needed to generate them.
I don't explain any of the code in this document. If you just want to see the results, you can safely ignore the code snippets. If you're interested in going further with this data, I've posted the source and the dataset at .
If you have any comments or suggestions (on either the code or the analysis), please let me know at .
Enjoy!
First, I used Tweepy to pull down 20,000 tweets about each of Hillary Clinton, Bernie Sanders, Rand Paul, and Jeb Bush [retrieve_tweets.py
].
I've also already done some calculations, specifically of polarity, subjectivity, influence, influenced polarity, and longitude and latitude (all explained later) [preprocess.py
].
from arrows.preprocess import load_df
Just adding some imports and setting graph display options.
from textblob import TextBlob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import cartopy
pd.set_option('display.max_colwidth', 200)
pd.options.display.mpl_style = 'default'
matplotlib.style.use('ggplot')
sns.set_context('talk')
sns.set_style('whitegrid')
% matplotlib inline
Let's look at our data!
load_df
loads it in as a pandas.DataFrame
, excellent for statistical analysis and graphing.
df = load_df('arrows/data/results.csv')
df.info()
We'll be looking primarily at candidate
, created_at
, lang
, place
, user_followers_count
, user_time_zone
, polarity
, and influenced_polarity
, and text
.
df[['candidate', 'created_at', 'lang', 'place', 'user_followers_count',
'user_time_zone', 'polarity', 'influenced_polarity', 'text']].head(1)
First I'll look at sentiment, calculated with TextBlob using the text
column. Sentiment is composed of two values, polarity - a measure of the positivity or negativity of a text - and subjectivity. Polarity is between -1.0 and 1.0; subjectivity between 0.0 and 1.0.
TextBlob('Tear down this wall!').sentiment
Unfortunately, it doesn't work too well on anything other than English.
TextBlob('Radix malorum est cupiditas.').sentiment
TextBlob has a translate()
function that uses Google Translate to take care of that for us, but we won't be using it here - just because tweets include a lot of slang and abbreviations that can't be translated very well.
sentence = TextBlob('Radix malorum est cupiditas.').translate()
print(sentence)
print(sentence.sentiment)
All right - let's figure out the most (positively) polarized English tweets.
english_df = df[df.lang == 'en']
english_df.sort('polarity', ascending = False).head(3)[['candidate', 'polarity', 'subjectivity', 'text']]
Extrema don't mean much. We might get more interesting data with mean polarities for each candidate. Let's also look at influenced polarity, which takes into account the number of retweets and followers.
candidate_groupby = english_df.groupby('candidate')
candidate_groupby[['polarity', 'influence', 'influenced_polarity']].mean()
I used the formula influence = sqrt(followers + 1) * sqrt(retweets + 1)
. You can experiment with different functions if you like: maybe it's the case that influenced_polarity
should be exponential with respect to polarity
[preprocess.py:influence
].
So tweets about Jeb Bush, on average, aren't as positive as the other candidates, but the people tweeting about Bush get more retweets and followers.
We can look at the most influential tweets about Jeb Bush to see what's up.
jeb = candidate_groupby.get_group('Jeb Bush')
jeb_influence = jeb.sort('influence', ascending = False)
jeb_influence[['influence', 'polarity', 'influenced_polarity', 'user_name', 'text', 'created_at']].head(5)
Side note: you can see that sentiment analysis isn't perfect - the last tweet is certainly negative toward Jeb Bush, but it was actually assigned a positive polarity. Over a large number of tweets, though, sentiment analysis is more meaningful.
As to the high influence of tweets about Bush: it looks like Donald Trump (someone with a lot of followers) has been tweeting a lot about Bush - one possible reason for Jeb's greater influenced_polarity
.
df[df.user_name == 'Donald J. Trump'].groupby('candidate').size()
Looks like our favorite toupéed candidate hasn't even been tweeting about anyone else!
What else can we do? We know the language each tweet was (tweeted?) in.
language_groupby = df.groupby(['candidate', 'lang'])
language_groupby.size()
That's a lot of languages! Let's try plotting to get a better idea, but first, I'll remove smaller language/candidate groups.
By the way, each lang
value is an IANA language tag - you can look them up at https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry.
largest_languages = language_groupby.filter(lambda group: len(group) > 10)
I'll also remove English, since it would just dwarf all the other languages.
non_english = largest_languages[largest_languages.lang != 'en']
non_english_groupby = non_english.groupby(['lang', 'candidate'], as_index = False)
sizes = non_english_groupby.text.agg(np.size)
sizes = sizes.rename(columns={'text': 'count'})
sizes_pivot = sizes.pivot_table(index='lang', columns='candidate', values='count', fill_value=0)
plot = sns.heatmap(sizes_pivot)
plot.set_title('Number of non-English Tweets by Candidate', family='Ubuntu')
plot.set_ylabel('language code', family='Ubuntu')
plot.set_xlabel('candidate', family='Ubuntu')
plot.figure.set_size_inches(12, 7)
Looks like Spanish and Portuguese speakers mostly tweet about Jeb Bush, while Francophones lean more liberal.
Clinton attracts the largest variety of tweeters, in sharp contrast to Bernie Sanders and Rand Paul.
We also have the time-of-tweet information - I'll plot influenced polarity over time for each candidate. I'm going to resample the influenced_polarity
values to 1 hour intervals to get a smoother graph.
mean_polarities = df.groupby(['candidate', 'created_at']).influenced_polarity.mean()
plot = mean_polarities.unstack('candidate').resample('60min').plot()
plot.set_title('Influenced Polarity over Time by Candidate', family='Ubuntu')
plot.set_ylabel('influenced polarity', family='Ubuntu')
plot.set_xlabel('time', family='Ubuntu')
plot.figure.set_size_inches(12, 7)
Since I only took the last 20,000 tweets for each candidate, I didn't receive as large a timespan from Clinton (a candidate with many, many tweeters) compared to Rand Paul. You can get a rough idea of tweets per day for each candidate: in descending order, Clinton, Sanders, Bush, and Paul.
We can still analyze the data in terms of hour-of-day. I'd like to know when tweeters in each language tweet each day, and I'm going to use percentages instead of raw number of tweets so I can compare across different languages easily.
By the way, the times in the dataframe are in UTC.
language_sizes = df.groupby('lang').size()
threshold = language_sizes.quantile(.75)
top_languages_df = language_sizes[language_sizes > threshold]
top_languages = set(top_languages_df.index) - {'und'}
top_languages
df['hour'] = df.created_at.apply(lambda datetime: datetime.hour)
for language_code in top_languages:
lang_df = df[df.lang == language_code]
normalized = lang_df.groupby('hour').size() / lang_df.lang.count()
plot = normalized.plot(label = language_code)
plot.set_title('Tweet Frequency in non-English Languages by Hour of Day', family='Ubuntu')
plot.set_ylabel('normalized frequency', family='Ubuntu')
plot.set_xlabel('hour of day (UTC)', family='Ubuntu')
plot.legend()
plot.figure.set_size_inches(12, 7)
Note that English, French, and Spanish are significantly flatter than the other languages - this means that there's a large spread of speakers all over the globe.
But why is Portuguese spiking at 11pm Brasilia time / 3 am Lisbon time? Let's find out! My first guess was that maybe there's a single person making a ton of posts at that time.
df_of_interest = df[(df.hour == 2) & (df.lang == 'pt')]
print('Number of tweets:', df_of_interest.text.count())
print('Number of unique users:', df_of_interest.user_name.unique().size)
So that's not it. Maybe there was a major event everyone was retweeting?
df_of_interest.text.head(25).unique()
Seems to be a lot of these Jeb Bush diz que foi atingido... tweets. How many? We can't just count unique ones because they all are different slightly, but we can check for a large-enough substring.
df_of_interest[df_of_interest.text.str.contains('Jeb Bush diz que foi atingido')].text.count()
That's it!
By searching the text, I found a news article from a Brazilian website (http://jconline.ne10.uol.com.br/canal/mundo/internacional/noticia/2015/07/05/jeb-bush-diz-que-foi-atingido-por-criticas-de-trump-a-mexicanos-188801.php) that happened to get a lot of retweets at that time period. Still slightly suspicious, but I'm a programmer, not an investigative journalist.
A similar article in English is at http://www.nytimes.com/politics/first-draft/2015/07/04/an-angry-jeb-bush-says-he-takes-donald-trumps-remarks-personally/.
Since languages can span across different countries, we might get better results if we search by location, rather than just language.
We don't have very specific geolocation information other than timezone, so let's try plotting candidate sentiment over the 4 major U.S. timezones (Los Angeles, Denver, Chicago, and New York). This is also a good opportunity to look at a geographical map.
tz_df = english_df.dropna(subset=['user_time_zone'])
us_tz_df = tz_df[tz_df.user_time_zone.str.contains('US & Canada')]
us_tz_candidate_groupby = us_tz_dff.groupby(['candidate', 'user_time_zone'])
us_tz_candidate_groupby.influenced_polarity.mean()
Now to plot it on a map. I downloaded the timezone Shapefile from http://efele.net/maps/tz/world/. First, I read in the Shapefile with Cartopy.
tz_shapes = cartopy.io.shapereader.Reader('arrows/world/tz_world_mp.shp')
tz_records = list(tz_shapes.records())
tz_translator = {
'Eastern Time (US & Canada)': 'America/New_York',
'Central Time (US & Canada)': 'America/Chicago',
'Mountain Time (US & Canada)': 'America/Denver',
'Pacific Time (US & Canada)': 'America/Los_Angeles',
}
american_tz_records = {
tz_name: next(filter(lambda record: record.attributes['TZID'] == tz_id, tz_records))
for tz_name, tz_id
in tz_translator.items()
}
Next, I have to choose a projection and plot it (again using Cartopy). The Albers Equal-Area is good for maps of the U.S. I'll also download some featuresets from the Natural Earth dataset to display state borders.
albers_equal_area = cartopy.crs.AlbersEqualArea(-95, 35)
plate_carree = cartopy.crs.PlateCarree()
states_and_provinces = cartopy.feature.NaturalEarthFeature(
category='cultural',
name='admin_1_states_provinces_lines',
scale='50m',
facecolor='none'
)
cmaps = [matplotlib.cm.Blues, matplotlib.cm.Greens,
matplotlib.cm.Reds, matplotlib.cm.Purples]
norm = matplotlib.colors.Normalize(vmin=0, vmax=30)
candidates = df['candidate'].unique()
for index, candidate in enumerate(candidates):
plt.figure()
plot = plt.axes(projection=albers_equal_area)
plot.set_extent((-125, -66, 20, 50))
plot.add_feature(cartopy.feature.LAND)
plot.add_feature(cartopy.feature.COASTLINE)
plot.add_feature(cartopy.feature.BORDERS)
plot.add_feature(states_and_provinces, edgecolor='gray')
plot.add_feature(cartopy.feature.LAKES, facecolor='#00BCD4')
for tz_name, record in american_tz_records.items():
tz_specific_df = us_tz_df[us_tz_df.user_time_zone == tz_name]
tz_candidate_specific_df = tz_specific_df[tz_specific_df.candidate == candidate]
mean_polarity = tz_candidate_specific_df.influenced_polarity.mean()
plot.add_geometries(
[record.geometry],
crs=plate_carree,
color=cmaps[index](norm(mean_polarity)),
alpha=.8
)
plot.set_title('Influenced Polarity toward {} by U.S. Timezone'.format(candidate), family='Ubuntu')
plot.figure.set_size_inches(6, 3.5)
plt.show()
print()
My friend Gabriel Wang pointed out that U.S. timezones other than Pacific don't mean much since each timezone covers both blue and red states, but the data is still interesting.
As expected, midwestern states lean toward Jeb Bush. I wasn't expecting Jeb Bush's highest polarity-tweets to come from the East; this is probably Donald Trump (New York, New York) messing with our data again.
What are tweeters outside the U.S. saying about our candidates?
Outside of the U.S., if someone is in a major city, the timezone is often that city itself. Here are the top (by number of tweets) non-American 25 timezones in our dataframe.
american_timezones = ('US & Canada|Canada|Arizona|America|Hawaii|Indiana|Alaska'
'|New_York|Chicago|Los_Angeles|Detroit|CST|PST|EST|MST')
foreign_tz_df = tz_df[~tz_df.user_time_zone.str.contains(american_timezones)]
foreign_tz_groupby = foreign_tz_df.groupby('user_time_zone')
foreign_tz_groupby.size().sort(inplace = False, ascending = False).head(25)
I also want to look at polarity, so I'll only use English tweets.
(Sorry, Central/South Americans - my very rough method of filtering out American timezones gets rid of some of your timezones too. Let me know if there's a better way to do this.)
foreign_english_tz_df = foreign_tz_df[foreign_tz_df.lang == 'en']
Now we have a dataframe containing (mostly) world cities as time zones. Let's get the top cities by number of tweets for each candidate, then plot polarities.
foreign_tz_groupby = foreign_english_tz_df.groupby(['candidate', 'user_time_zone'])
top_foreign_tz_df = foreign_tz_groupby.filter(lambda group: len(group) > 40)
top_foreign_tz_groupby = top_foreign_tz_df.groupby(['user_time_zone', 'candidate'], as_index = False)
mean_influenced_polarities = top_foreign_tz_groupby.influenced_polarity.mean()
pivot = mean_influenced_polarities.pivot_table(
index='user_time_zone',
columns='candidate',
values='influenced_polarity',
fill_value=0
)
plot = sns.heatmap(pivot)
plot.set_title('Influenced Polarity in Major Foreign Cities by Candidate', family='Ubuntu')
plot.set_ylabel('city', family='Ubuntu')
plot.set_xlabel('candidate', family='Ubuntu')
plot.figure.set_size_inches(12, 7)
Exercise for the reader: why is Rand Paul disliked in Athens? You can probably guess, but the actual tweets causing this are rather amusing.
Greco-libertarian relations aside, the data shows that London and Amsterdam are among the most influential of cities, with the former leaning toward Jeb Bush and the latter about neutral.
In India, Clinton-supporters reside in New Delhi while Chennai tweeters back Rand Paul. By contrast, in 2014, New Delhi constituents voted for the conservative Bharatiya Janata Party while Chennai voted for the more liberal All India Anna Dravida Munnetra Kazhagam Party - so there seems to be some kind of cultural shift between the voters of 2014 and the tweeters of today.
Last thing I thought was interesting: Athens has the highest mean polarity for Bernie Sanders, the only city for which this is the case. Modern Greece tends to lean toward social democracy, the same as Bernie Sanders.
Finally, I'll look at specific geolocation (latitude and longitude) data. Since only about 750 out of 80,000 tweets had geolocation enabled, this data can't be used for sentiment analysis, but we can still get a good idea of international spread.
First I'll plot everything on a world map, then break it up by candidate in the U.S.
df_place = df.dropna(subset=['place'])
mollweide = cartopy.crs.Mollweide()
plot = plt.axes(projection=mollweide)
plot.set_global()
plot.add_feature(cartopy.feature.LAND)
plot.add_feature(cartopy.feature.COASTLINE)
plot.add_feature(cartopy.feature.BORDERS)
plot.scatter(
list(df_place.longitude),
list(df_place.latitude),
transform=plate_carree,
zorder=2
)
plot.set_title('International Tweeters with Geolocation Enabled', family='Ubuntu')
plot.figure.set_size_inches(14, 9)
plot = plt.axes(projection=albers_equal_area)
plot.set_extent((-125, -66, 20, 50))
plot.add_feature(cartopy.feature.LAND)
plot.add_feature(cartopy.feature.COASTLINE)
plot.add_feature(cartopy.feature.BORDERS)
plot.add_feature(states_and_provinces, edgecolor='gray')
plot.add_feature(cartopy.feature.LAKES, facecolor='#00BCD4')
candidate_groupby = df_place.groupby('candidate', as_index = False)
colors = ['#1976d2', '#7cb342', '#f4511e', '#7b1fa2']
for index, (name, group) in enumerate(candidate_groupby):
longitudes = group.longitude.values
latitudes = group.latitude.values
plot.scatter(
longitudes,
latitudes,
transform=plate_carree,
color=colors[index],
label=name,
zorder=2
)
plot.set_title('U.S. Tweeters by Candidate', family='Ubuntu')
plt.legend(loc='lower left')
plot.figure.set_size_inches(12, 7)
As expected, U.S. tweeters are centered around L.A., the Bay Area, Chicago, New York, and Boston. Rand Paul and Bernie Sanders tweeters are more spread out over the country.
That's all I have for now. Back to the original question about Sanders' relevance among millenials: it's clear that the people who tweet about Sanders are very positive toward him, relative to the other candidates (polarity
).
Bernie is behind only Clinton in number of tweets per day, but if Sanders wants to receive the nomination, he needs to get large media and news outlets constantly writing about him (influence
) to outreach to swing states.
Sanders also needs to appeal to a wider variety of voters; specifically Spanish ones, who tweet about him least of all 4 candidates.
In the coming months, I'll be closely following the 2016 election and these candidates. Will Twitter statistics reflect on the general population? We'll find out next November.
The most interesting thing I've learned here (that may seem obvious), is that Internet reactions to events don't occur as a result of the event, but rather media coverage of that event. The New York Times article about Jeb Bush and Trump was published over a day before the Portuguese one, but Brazilian tweeters only reacted after the Portuguese article was published. This is one reason why it's important to have news coverage in as many languages as possible. "The medium is the message."
If you found this analysis interesting and are curious for more, I encourage you to download the dataset (or get your own dataset based on your interests) and share your findings.
Source code is at , and I can be reached at for any questions, comments, or criticism. Looking forward to hearing your feedback!