arrows: Yet Another Twitter/Python Data Analysis

Geospatially, Temporally, and Linguistically Analyzing Tweets about Top U.S. Presidential Candidates with Pandas, TextBlob, Seaborn, and Cartopy


Hi, I'm Raj. A week ago, I had a discussion about the relevance of Bernie Sanders among millennials - and so, I set out to get a rough idea by looking at Twitter. This document is a summary of my findings, including data tables and graphs, as well as the code needed to generate them.

I don't explain any of the code in this document. If you just want to see the results, you can safely ignore the code snippets. If you're interested in going further with this data, I've posted the source and the dataset at

If you have any comments or suggestions (on either the code or the analysis), please let me know at



First, I used Tweepy to pull down 20,000 tweets about each of Hillary Clinton, Bernie Sanders, Rand Paul, and Jeb Bush [].

I've also already done some calculations, specifically of polarity, subjectivity, influence, influenced polarity, and longitude and latitude (all explained later) [].

In [1]:
from arrows.preprocess import load_df

Just adding some imports and setting graph display options.

In [2]:
from textblob import TextBlob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import cartopy
pd.set_option('display.max_colwidth', 200)
pd.options.display.mpl_style = 'default''ggplot')
%matplotlib inline

Let's look at our data!

load_df loads it in as a pandas.DataFrame, excellent for statistical analysis and graphing.
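I won't show load_df's internals here (they're in the companion source), but as a rough sketch it presumably does little more than read the results CSV and parse the timestamps - something like this hypothetical stand-in:

```python
import pandas as pd

def load_df(path):
    # Hypothetical stand-in for arrows.preprocess.load_df:
    # read the results CSV and parse tweet timestamps so that
    # created_at becomes datetime64[ns], as in the summary below.
    return pd.read_csv(path, parse_dates=['created_at'])
```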

In [3]:
df = load_df('arrows/data/results.csv')
In [4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 80000 entries, 0 to 80025
Data columns (total 21 columns):
candidate               80000 non-null object
coordinates             87 non-null object
created_at              80000 non-null datetime64[ns]
favorite_count          80000 non-null object
geo                     87 non-null object
id                      80000 non-null float64
lang                    80000 non-null object
place                   743 non-null object
retweet_count           80000 non-null float64
text                    80000 non-null object
user_followers_count    79994 non-null float64
user_location           53628 non-null object
user_name               79973 non-null object
user_screen_name        79974 non-null object
user_time_zone          50386 non-null object
polarity                80000 non-null float64
subjectivity            80000 non-null float64
influence               79994 non-null float64
influenced_polarity     79994 non-null float64
latitude                743 non-null float64
longitude               743 non-null float64
dtypes: datetime64[ns](1), float64(9), object(11)
memory usage: 13.4+ MB

We'll be looking primarily at candidate, created_at, lang, place, user_followers_count, user_time_zone, polarity, influenced_polarity, and text.

In [5]:
df[['candidate', 'created_at', 'lang', 'place', 'user_followers_count', 
    'user_time_zone', 'polarity', 'influenced_polarity', 'text']].head(1)
candidate created_at lang place user_followers_count user_time_zone polarity influenced_polarity text
0 Bernie Sanders 2015-07-06 01:52:42 en NaN 1642 Eastern Time (US & Canada) 0.285714 16.378184 RT @DrTomMartinPhD: BERNIE SANDERS QUOTE ON #BILLMAHER, "Hillary Clinton &amp; I Have The Right Message. We're Both Speaking The Truth." http:/…


First I'll look at sentiment, calculated with TextBlob using the text column. Sentiment is composed of two values, polarity - a measure of the positivity or negativity of a text - and subjectivity. Polarity is between -1.0 and 1.0; subjectivity between 0.0 and 1.0.

In [6]:
TextBlob('Tear down this wall!').sentiment
Sentiment(polarity=-0.19444444444444448, subjectivity=0.2888888888888889)

Unfortunately, it doesn't work too well on anything other than English.

In [7]:
TextBlob('Radix malorum est cupiditas.').sentiment
Sentiment(polarity=0.0, subjectivity=0.0)

TextBlob has a translate() function that uses Google Translate to take care of that for us, but we won't be using it here - tweets include a lot of slang and abbreviations that don't translate well.

In [8]:
sentence = TextBlob('Radix malorum est cupiditas.').translate()
print(sentence)
sentence.sentiment
The root of evil.
Sentiment(polarity=-1.0, subjectivity=1.0)

All right - let's figure out the most (positively) polarized English tweets.

In [30]:
english_df = df[df.lang == 'en']
english_df.sort('polarity', ascending = False).head(3)[['candidate', 'polarity', 'subjectivity', 'text']]
candidate polarity subjectivity text
2287 Bernie Sanders 1 1.0 Republicans Welcomed Bernie Sanders to Wisconsin By Calling Him an Extremist. His Response? Perfect.
810 Bernie Sanders 1 0.3 BEST OF SUNDAY TALK CNN SOTU #Sanders draws 2016 record crowd in Iowa-
31467 Hillary Clinton 1 0.3 @whitehouse but there is one thing i want to be known by all the world: my best wish goes to Lady Hillary Clinton. it's said and done

Extrema don't mean much. We might get more interesting data with mean polarities for each candidate. Let's also look at influenced polarity, which takes into account the number of retweets and followers.

In [10]:
candidate_groupby = english_df.groupby('candidate')
candidate_groupby[['polarity', 'influence', 'influenced_polarity']].mean()
polarity influence influenced_polarity
Bernie Sanders 0.096348 162.142172 14.758500
Hillary Clinton 0.037577 176.315714 7.561452
Jeb Bush 0.026713 318.453703 16.174172
Rand Paul 0.086817 144.550312 10.042045

I used the formula influence = sqrt(followers + 1) * sqrt(retweets + 1). You can experiment with different functions if you like: maybe it's the case that influenced_polarity should be exponential with respect to polarity [].
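To make the formulas concrete, here they are applied to a tiny hand-made frame (the follower and retweet numbers are invented purely for illustration; influenced_polarity is polarity times influence, as in the tables above):

```python
import numpy as np
import pandas as pd

# Invented numbers, purely to illustrate the formulas.
toy = pd.DataFrame({
    'polarity': [0.5, -0.2],
    'user_followers_count': [100, 10000],
    'retweet_count': [3, 0],
})
toy['influence'] = np.sqrt(toy.user_followers_count + 1) * np.sqrt(toy.retweet_count + 1)
toy['influenced_polarity'] = toy.polarity * toy.influence
print(toy[['influence', 'influenced_polarity']])
```

Note how quickly retweets compound: the first row's 100-follower account with 3 retweets ends up with twice the influence of the same account with none.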

So tweets about Jeb Bush, on average, aren't as positive as the other candidates, but the people tweeting about Bush get more retweets and followers.

We can look at the most influential tweets about Jeb Bush to see what's up.

In [11]:
jeb = candidate_groupby.get_group('Jeb Bush')
jeb_influence = jeb.sort('influence', ascending = False)
jeb_influence[['influence', 'polarity', 'influenced_polarity', 'user_name', 'text', 'created_at']].head(5)
influence polarity influenced_polarity user_name text created_at
52023 89594.614397 0.500000 44797.307199 CNN Breaking News Jeb Bush on Donald Trump: "His views are way out of the mainstream of what most Republicans think." 2015-07-05 02:14:26
55849 68470.590942 0.000000 0.000000 The New York Times Jeb Bush, whose wife is Mexican, says he takes Donald Trump’s remarks personally 2015-07-04 20:10:06
47246 53754.716258 -0.066667 -3583.647751 Donald J. Trump Flashback – Jeb Bush says illegal immigrants breaking our laws is an “act of love” He will never secure the border. 2015-07-05 15:23:20
50459 53641.142046 0.000000 0.000000 CNN Jeb Bush: Trump comments meant 'to draw attention.'\n 2015-07-05 03:55:25
47616 51601.878338 0.200000 10320.375668 Donald J. Trump Jeb Bush will never secure our border or negotiate great trade deals for American workers. Jeb doesn't see &amp; can't solve the problems. 2015-07-05 15:02:22

Side note: you can see that sentiment analysis isn't perfect - the last tweet is certainly negative toward Jeb Bush, but it was actually assigned a positive polarity. Over a large number of tweets, though, sentiment analysis is more meaningful.

As to the high influence of tweets about Bush: it looks like Donald Trump (someone with a lot of followers) has been tweeting a lot about Bush - one possible reason for Jeb's greater influenced_polarity.

In [12]:
df[df.user_name == 'Donald J. Trump'].groupby('candidate').size()
Jeb Bush    4
dtype: int64

Looks like our favorite toupéed candidate hasn't even been tweeting about anyone else!


What else can we do? We know the language each tweet was written in.

In [13]:
language_groupby = df.groupby(['candidate', 'lang'])
candidate        lang
Bernie Sanders   ar          1
                 da          2
                 de         55
                 el         33
                 en      19208
                 es         55
                 et          1
                 fr        447
                 in          6
                 it          1
                 ko          1
                 nl         47
                 no          1
                 pl          7
                 pt          8
                 sk          2
                 sl          1
                 sv          7
                 tl          2
                 tr          5
                 und       107
                 vi          3
Hillary Clinton  ar          5
                 de        168
                 en      18100
                 es        841
                 et          2
                 fa          4
                 fr        202
                 hi         31
Jeb Bush         sl          1
                 tl          1
                 tr          1
                 und        67
                 vi          7
                 zh          2
Rand Paul        da          3
                 de         14
                 en      19607
                 es        165
                 et         12
                 fi          1
                 fr         42
                 hi          2
                 ht          6
                 in          7
                 it          5
                 ja          6
                 ko          1
                 lv          2
                 nl          2
                 pl          3
                 pt         15
                 ru          9
                 sk          2
                 sv          2
                 th          2
                 tl          1
                 tr          4
                 und        87
dtype: int64

That's a lot of languages! Let's try plotting to get a better idea, but first, I'll remove smaller language/candidate groups.

By the way, each lang value is an IANA language tag - you can look them up at

In [14]:
largest_languages = language_groupby.filter(lambda group: len(group) > 10)

I'll also remove English, since it would just dwarf all the other languages.

In [15]:
non_english = largest_languages[largest_languages.lang != 'en']
non_english_groupby = non_english.groupby(['lang', 'candidate'], as_index = False)

sizes = non_english_groupby.text.agg(np.size)
sizes = sizes.rename(columns={'text': 'count'})
sizes_pivot = sizes.pivot_table(index='lang', columns='candidate', values='count', fill_value=0)

plot = sns.heatmap(sizes_pivot)
plot.set_title('Number of non-English Tweets by Candidate', family='Ubuntu')
plot.set_ylabel('language code', family='Ubuntu')
plot.set_xlabel('candidate', family='Ubuntu')
plot.figure.set_size_inches(12, 7)

Looks like Spanish and Portuguese speakers mostly tweet about Jeb Bush, while Francophones lean more liberal.

Clinton attracts the largest variety of tweeters, in sharp contrast to Bernie Sanders and Rand Paul.


We also have the time-of-tweet information - I'll plot influenced polarity over time for each candidate. I'm going to resample the influenced_polarity values to 1 hour intervals to get a smoother graph.

In [16]:
mean_polarities = df.groupby(['candidate', 'created_at']).influenced_polarity.mean()
plot = mean_polarities.unstack('candidate').resample('60min').plot()
plot.set_title('Influenced Polarity over Time by Candidate', family='Ubuntu')
plot.set_ylabel('influenced polarity', family='Ubuntu')
plot.set_xlabel('time', family='Ubuntu')
plot.figure.set_size_inches(12, 7)

Since I only took the last 20,000 tweets for each candidate, I didn't receive as large a timespan from Clinton (a candidate with many, many tweeters) compared to Rand Paul. You can get a rough idea of tweets per day for each candidate: in descending order, Clinton, Sanders, Bush, and Paul.

We can still analyze the data in terms of hour-of-day. I'd like to know when tweeters in each language tweet each day, and I'm going to use percentages instead of raw number of tweets so I can compare across different languages easily.

By the way, the times in the dataframe are in UTC.

In [17]:
language_sizes = df.groupby('lang').size()
threshold = language_sizes.quantile(.75)

top_languages_df = language_sizes[language_sizes > threshold]
top_languages = set(top_languages_df.index) - {'und'}
top_languages
{'de', 'en', 'es', 'fr', 'in', 'it', 'nl', 'pt'}
In [18]:
df['hour'] = df.created_at.apply(lambda datetime: datetime.hour) 
for language_code in top_languages:
    lang_df = df[df.lang == language_code]
    normalized = lang_df.groupby('hour').size() / lang_df.lang.count()
    plot = normalized.plot(label = language_code)

plot.set_title('Tweet Frequency by Hour of Day and Language', family='Ubuntu')
plot.set_ylabel('normalized frequency', family='Ubuntu')
plot.set_xlabel('hour of day (UTC)', family='Ubuntu')
plot.figure.set_size_inches(12, 7)

Note that English, French, and Spanish are significantly flatter than the other languages - this means that there's a large spread of speakers all over the globe.

But why is Portuguese spiking at 11pm Brasilia time / 3 am Lisbon time? Let's find out! My first guess was that maybe there's a single person making a ton of posts at that time.
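(Quick sanity check on that local-time arithmetic, using Python's standard-library zoneinfo module - not part of the original analysis: 02:00 UTC during the collection window does land at 11 pm in Brasília and 3 am in Lisbon.)

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# 02:00 UTC on one of the collection dates (July 2015)
utc_moment = datetime(2015, 7, 6, 2, 0, tzinfo=timezone.utc)
for zone in ('America/Sao_Paulo', 'Europe/Lisbon'):
    local = utc_moment.astimezone(ZoneInfo(zone))
    print(zone, local.strftime('%H:%M'))
# America/Sao_Paulo 23:00 (on July 5), Europe/Lisbon 03:00
```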

In [19]:
df_of_interest = df[(df.hour == 2) & (df.lang == 'pt')]

print('Number of tweets:', df_of_interest.text.count())
print('Number of unique users:', df_of_interest.user_name.unique().size)
Number of tweets: 446
Number of unique users: 407

So that's not it. Maybe there was a major event everyone was retweeting?

In [20]:
df_of_interest.text.unique()
array([ 'Desabafo de garoto homossexual com medo do futuro comove Hillary Clinton via @UOLNoticias @UOL',
       'Hillary Clinton eleva tom contra a China de olho na Presidência dos EUA Governo chinês tem discurso bem menos ácido.',
       'Desabafo de garoto homossexual com medo do futuro comove Hillary Clinton - Notícias - Internacional',
       'RT @elpais_brasil: Facebook censurou o sofrimento de um garoto gay e Hillary Clinton saiu em sua defesa…',
       'RT @ReutersBrazil: Hillary Clinton diz que Irã continuará a ser ameaça a Israel apesar de acordo nuclear',
       'RT @folha: Jeb Bush diz que foi atingido por críticas de Trump a mexicanos.',
       'RT @jr140797: #betacaralhudosan Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-can... #betac…',
       'RT deigmar: Jeb Bush diz que foi atingido por críticas de Trump a mexicanos',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candidato republicano à Casa Branca Jeb...',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candidato republicano à Casa Branca Jeb...',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candidato republicano à Casa Branca Jeb...',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candidato republicano à Casa Branca Jeb...',
       '[FOLHA S.PAULO. BRA] Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candi... vía J.A.M.V',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candidato republicano à Casa Branca Jeb...',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candidato republicano à Casa Branca Jeb...',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candidato republicano à Casa Branca Jeb...',
       'RT @thiago_beta51: #SegueSigoDeVolta Jeb Bush diz que foi atingido por críticas de Trump a mexicanos:',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos:',
       '@DREWXAVECAO @EXFLOP Jeb Bush diz que foi atingido por críticas de Trump a mexicanos:',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos:',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos #folheando a @folha',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos:',
       'Jeb Bush diz que foi atingido por críticas de Trump a\xa0mexicanos'], dtype=object)

Seems to be a lot of these Jeb Bush diz que foi atingido... tweets ("Jeb Bush says he was hit by Trump's criticism of Mexicans"). How many? We can't just count unique ones because they're all slightly different, but we can check for a large-enough shared substring.
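On a toy Series (these three texts are made up), str.contains catches the near-duplicates that a unique-count misses:

```python
import pandas as pd

# Made-up texts: two near-duplicate headlines plus one unrelated tweet.
texts = pd.Series([
    'Jeb Bush diz que foi atingido por críticas de Trump',
    'RT @folha: Jeb Bush diz que foi atingido #folheando',
    'something else entirely',
])
print(texts.nunique())                                            # 3 distinct strings
print(texts.str.contains('Jeb Bush diz que foi atingido').sum())  # 2 share the headline
```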

In [21]:
df_of_interest[df_of_interest.text.str.contains('Jeb Bush diz que foi atingido')].text.count()

That's it!

By searching the text, I found a news article from a Brazilian website that happened to get a lot of retweets in that time period. Still slightly suspicious, but I'm a programmer, not an investigative journalist.

A similar article in English is at

Since languages can span across different countries, we might get better results if we search by location, rather than just language.


We don't have very specific geolocation information other than timezone, so let's try plotting candidate sentiment over the 4 major U.S. timezones (Los Angeles, Denver, Chicago, and New York). This is also a good opportunity to look at a geographical map.

In [22]:
tz_df = english_df.dropna(subset=['user_time_zone'])
us_tz_df = tz_df[tz_df.user_time_zone.str.contains('US & Canada')]
us_tz_candidate_groupby = us_tz_df.groupby(['candidate', 'user_time_zone'])
candidate        user_time_zone             
Bernie Sanders   Central Time (US & Canada)     18.694226
                 Eastern Time (US & Canada)     20.221507
                 Mountain Time (US & Canada)    13.683829
                 Pacific Time (US & Canada)     16.496358
Hillary Clinton  Central Time (US & Canada)      3.302260
                 Eastern Time (US & Canada)     22.731770
                 Mountain Time (US & Canada)     0.196556
                 Pacific Time (US & Canada)      5.486306
Jeb Bush         Central Time (US & Canada)     14.766734
                 Eastern Time (US & Canada)     28.625515
                 Mountain Time (US & Canada)     6.356858
                 Pacific Time (US & Canada)     16.676979
Rand Paul        Central Time (US & Canada)      6.798783
                 Eastern Time (US & Canada)     15.359912
                 Mountain Time (US & Canada)    10.780279
                 Pacific Time (US & Canada)     12.918267
Name: influenced_polarity, dtype: float64

Now to plot it on a map. I downloaded the timezone Shapefile from . First, I read in the Shapefile with Cartopy.

In [23]:
tz_shapes ='arrows/world/tz_world_mp.shp')
tz_records = list(tz_shapes.records())
tz_translator = {
     'Eastern Time (US & Canada)': 'America/New_York',
     'Central Time (US & Canada)': 'America/Chicago',
     'Mountain Time (US & Canada)': 'America/Denver',
     'Pacific Time (US & Canada)': 'America/Los_Angeles',
}
american_tz_records = {
    tz_name: next(filter(lambda record: record.attributes['TZID'] == tz_id, tz_records))
    for tz_name, tz_id
    in tz_translator.items()
}
Next, I have to choose a projection and plot it (again using Cartopy). The Albers Equal-Area is good for maps of the U.S. I'll also download some featuresets from the Natural Earth dataset to display state borders.

In [24]:
albers_equal_area =, central_latitude=35)
plate_carree =

states_and_provinces = cartopy.feature.NaturalEarthFeature(
    category='cultural',
    name='admin_1_states_provinces_lines',
    scale='50m',
    facecolor='none')

cmaps = [,,,]
norm = matplotlib.colors.Normalize(vmin=0, vmax=30) 

candidates = df['candidate'].unique()

for index, candidate in enumerate(candidates):
    plot = plt.axes(projection=albers_equal_area)
    plot.set_extent((-125, -66, 20, 50))
    plot.add_feature(states_and_provinces, edgecolor='gray')
    plot.add_feature(cartopy.feature.LAKES, facecolor='#00BCD4')

    for tz_name, record in american_tz_records.items():
        tz_specific_df = us_tz_df[us_tz_df.user_time_zone == tz_name]
        tz_candidate_specific_df = tz_specific_df[tz_specific_df.candidate == candidate]
        mean_polarity = tz_candidate_specific_df.influenced_polarity.mean()

        # shade each timezone polygon by its mean influenced polarity
        plot.add_geometries([record.geometry], plate_carree,
                            facecolor=cmaps[index](norm(mean_polarity)))

    plot.set_title('Influenced Polarity toward {} by U.S. Timezone'.format(candidate), family='Ubuntu')
    plot.figure.set_size_inches(6, 3.5)
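The Normalize instance is what maps a mean influenced polarity in [0, 30] onto the [0, 1] scale a colormap expects; a minimal standalone sketch (using Blues as a stand-in colormap):

```python
from matplotlib import colormaps
from matplotlib.colors import Normalize

norm = Normalize(vmin=0, vmax=30)
print(float(norm(15)))      # midpoint of the scale -> 0.5

blues = colormaps['Blues']  # registry lookup by name
rgba = blues(norm(15))      # RGBA tuple for a mid-intensity blue
```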

My friend Gabriel Wang pointed out that U.S. timezones other than Pacific don't mean much since each timezone covers both blue and red states, but the data is still interesting.

As expected, midwestern states lean toward Jeb Bush. I wasn't expecting Jeb Bush's highest polarity-tweets to come from the East; this is probably Donald Trump (New York, New York) messing with our data again.


What are tweeters outside the U.S. saying about our candidates?

Outside of the U.S., if someone is in a major city, the timezone is often that city itself. Here are the top 25 non-American timezones (by number of tweets) in our dataframe.

In [25]:
american_timezones = ('US & Canada|Canada|Arizona|America|Hawaii|Indiana|Alaska')
foreign_tz_df = tz_df[~tz_df.user_time_zone.str.contains(american_timezones)]

foreign_tz_groupby = foreign_tz_df.groupby('user_time_zone')
foreign_tz_groupby.size().sort(inplace = False, ascending = False).head(25)
Quito                  1719
London                  967
Amsterdam               571
Athens                  368
Bangkok                 249
Beijing                 201
Brasilia                192
New Delhi               182
Tehran                  164
Jakarta                 164
Sydney                  134
Chennai                 133
Paris                   133
West Central Africa     130
Casablanca              129
Baghdad                 119
Dublin                  118
Tijuana                 117
Caracas                 103
Bucharest               100
Berlin                   99
Rome                     96
Madrid                   89
Greenland                86
Belgrade                 79
dtype: int64

I also want to look at polarity, so I'll only use English tweets.

(Sorry, Central/South Americans - my very rough method of filtering out American timezones gets rid of some of your timezones too. Let me know if there's a better way to do this.)

In [26]:
foreign_english_tz_df = foreign_tz_df[foreign_tz_df.lang == 'en']

Now we have a dataframe containing (mostly) world cities as time zones. Let's get the top cities by number of tweets for each candidate, then plot polarities.

In [27]:
foreign_tz_groupby = foreign_english_tz_df.groupby(['candidate', 'user_time_zone'])
top_foreign_tz_df = foreign_tz_groupby.filter(lambda group: len(group) > 40)

top_foreign_tz_groupby = top_foreign_tz_df.groupby(['user_time_zone', 'candidate'], as_index = False)

mean_influenced_polarities = top_foreign_tz_groupby.influenced_polarity.mean()

pivot = mean_influenced_polarities.pivot_table(
    index='user_time_zone', columns='candidate',
    values='influenced_polarity', fill_value=0)

plot = sns.heatmap(pivot)
plot.set_title('Influenced Polarity in Major Foreign Cities by Candidate', family='Ubuntu')
plot.set_ylabel('city', family='Ubuntu')
plot.set_xlabel('candidate', family='Ubuntu')
plot.figure.set_size_inches(12, 7)

Exercise for the reader: why is Rand Paul disliked in Athens? You can probably guess, but the actual tweets causing this are rather amusing.

Greco-libertarian relations aside, the data shows that London and Amsterdam are among the most influential of cities, with the former leaning toward Jeb Bush and the latter about neutral.

In India, Clinton-supporters reside in New Delhi while Chennai tweeters back Rand Paul. By contrast, in 2014, New Delhi constituents voted for the conservative Bharatiya Janata Party while Chennai voted for the more liberal All India Anna Dravida Munnetra Kazhagam Party - so there seems to be some kind of cultural shift between the voters of 2014 and the tweeters of today.

Last thing I thought was interesting: Athens has the highest mean polarity for Bernie Sanders, the only city for which this is the case. Modern Greece tends to lean toward social democracy, the same as Bernie Sanders.


Finally, I'll look at specific geolocation (latitude and longitude) data. Since only about 750 out of 80,000 tweets had geolocation enabled, this data can't be used for sentiment analysis, but we can still get a good idea of international spread.

First I'll plot everything on a world map, then break it up by candidate in the U.S.

In [28]:
df_place = df.dropna(subset=['place'])
mollweide =

plot = plt.axes(projection=mollweide)
plot.coastlines()
plot.scatter(df_place.longitude, df_place.latitude, transform=plate_carree)

plot.set_title('International Tweeters with Geolocation Enabled', family='Ubuntu')
plot.figure.set_size_inches(14, 9)
In [29]:
plot = plt.axes(projection=albers_equal_area)

plot.set_extent((-125, -66, 20, 50))

plot.add_feature(states_and_provinces, edgecolor='gray')
plot.add_feature(cartopy.feature.LAKES, facecolor='#00BCD4')

candidate_groupby = df_place.groupby('candidate', as_index = False)

colors = ['#1976d2', '#7cb342', '#f4511e', '#7b1fa2']
for index, (name, group) in enumerate(candidate_groupby):
    longitudes = group.longitude.values
    latitudes = group.latitude.values
    plot.scatter(longitudes, latitudes, transform=plate_carree,
                 color=colors[index], label=name)

plot.set_title('U.S. Tweeters by Candidate', family='Ubuntu')
plt.legend(loc='lower left')
plot.figure.set_size_inches(12, 7)

As expected, U.S. tweeters are centered around L.A., the Bay Area, Chicago, New York, and Boston. Rand Paul and Bernie Sanders tweeters are more spread out over the country.


That's all I have for now. Back to the original question about Sanders' relevance among millennials: it's clear that the people who tweet about Sanders are very positive toward him, relative to the other candidates (polarity).

Bernie is behind only Clinton in number of tweets per day, but if Sanders wants to receive the nomination, he needs large media and news outlets constantly writing about him (influence) to reach voters in swing states.

Sanders also needs to appeal to a wider variety of voters - specifically Spanish-speaking ones, who tweet about him the least of all four candidates.

In the coming months, I'll be closely following the 2016 election and these candidates. Will Twitter statistics reflect the general population? We'll find out next November.

The most interesting thing I've learned here (though it may seem obvious) is that Internet reactions to events follow not the event itself, but media coverage of that event. The New York Times article about Jeb Bush and Trump was published over a day before the Portuguese one, yet Brazilian tweeters only reacted after the Portuguese article came out. This is one reason it's important to have news coverage in as many languages as possible. "The medium is the message."

If you found this analysis interesting and are curious for more, I encourage you to download the dataset (or get your own dataset based on your interests) and share your findings.

Source code is at, and I can be reached at for any questions, comments, or criticism. Looking forward to hearing your feedback!