Harvard CSCI S-109A - Twitter Bot Detection

Project Team: Eumar Assis, Andrew Caide, Mark Carlebach, and Jiang Yusheng

Data Collection

Tweepy Data Collection

We collected data on known verified users from Twitter using the free, standard Twitter API via the Tweepy library, documented at http://www.tweepy.org.

The verified users were identified manually and somewhat 'at random' by browsing Twitter's desktop application. We did not discover a more automated, scalable way of doing this. In the end, we identified approximately 200 such accounts. For each account, we used Tweepy's 'tweepy.Cursor' and 'api.user_timeline' methods to collect about 150 tweets per user.

We originally also collected tweets for users of unknown status (i.e., un-verified) but removed them from the analysis, focusing on binary classification only (i.e., known-bot vs. known-verified users).

Known Bot Collection

We obtained a massive collection of over 200,000 tweets from known Russian trolls, published by NBC News: https://www.nbcnews.com/tech/social-media/now-available-more-200-000-deleted-russian-troll-tweets-n844731

Data Fields

Diagram: Simple Data Model

The above diagram shows the data fields we collected for the user and tweet objects from the two sources listed above. As the diagram depicts, we initially stored our data in two separate entities that can be joined on screen_name.
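The join between the two entities can be sketched in pandas. The toy data below is hypothetical; aside from screen_name, the field names are illustrative rather than our exact schema.

```python
import pandas as pd

# Toy user entity: one row per account.
users = pd.DataFrame({
    "screen_name": ["alice", "bot1"],
    "verified": [True, False],
})

# Toy tweet entity: one row per tweet, keyed by the author's screen_name.
tweets = pd.DataFrame({
    "screen_name": ["alice", "alice", "bot1"],
    "text": ["a", "b", "c"],
})

# Join tweets to their authors on the shared screen_name key.
joined = tweets.merge(users, on="screen_name", how="inner")
```

An inner join keeps only tweets whose author appears in the user table, which is the behavior we want when both entities were collected together.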

The fields in bold are the fields we relied upon in our work. As noted in our conclusions and future considerations, more could be done with the fields that we did not make use of. Here is a description of the bolded fields that we did use in our analysis (and how we used them):

Cleaning

In our preliminary review of our data, we encountered the following issues which we resolved as described:

The final, resulting data frame was used by both analysis groups: Microsoft Azure and NLTK.
