Harvard CSCI S-109A - Twitter Bot Detection

Project Team: Eumar Assis, Andrew Caide, Mark Carlebach, and Jiang Yusheng

MODELING

Overview

The modeling we conducted mirrored the approach we used in our later course assignments and involved many classes of models. The diagram below gives an overview of our approach, which is described in the sections that follow:

Diagram: Overview of Modeling Efforts

Data Sets for Modeling with User Level Data

As explained above, the size of our data decreased substantially as we cleaned it and rolled it up to the Twitter user level. As noted many times in this report, more data would make our results more meaningful. Nonetheless, we tried to follow the correct process at each step as closely as possible.

After we had the enriched user level data, we split our data into a training set and a testing set. Given the small number of users, we had to split our data 50/50. This resulted in data sets with the following sizes:

Each data set had the same features for each Twitter user, as described above and re-stated below:

Prior to modeling, we discovered lingering ‘NaN’ values in the followers_count column. We filled in these holes with the mean of the other rows in that column.
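For reference, here is a minimal sketch of these two preprocessing steps. The followers_count column comes from our data; the label column name ("bot") and the use of stratification are assumptions for illustration.

    # Minimal sketch: mean-impute followers_count, then split the users 50/50.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    def prepare_user_data(users_df: pd.DataFrame, label_col: str = "bot"):
        users_df = users_df.copy()
        # Fill lingering NaN values in followers_count with the column mean.
        users_df["followers_count"] = users_df["followers_count"].fillna(
            users_df["followers_count"].mean()
        )
        X = users_df.drop(columns=[label_col])
        y = users_df[label_col]
        # 50/50 split because of the small number of users; stratify to keep class balance.
        return train_test_split(X, y, test_size=0.5, stratify=y, random_state=109)

    # users_df is the enriched user-level DataFrame described above.
    X_train, X_test, y_train, y_test = prepare_user_data(users_df)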

Default Model Bake-off

To get the ball rolling, we trained each of 7 models on the training data using the default settings provided by sklearn for each model class. We then scored each model on the testing data and reported the training and testing results for each model with default settings. The results are shown in the diagram below:

Diagram: Results of Default Model Bake-off
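The seven model classes appear in the diagram rather than in the text, so the sketch below assumes a typical sklearn line-up; the bake-off loop itself is the point.

    # Default-settings bake-off: fit each model and compare train vs. test accuracy.
    # X_train, X_test, y_train, y_test come from the 50/50 split above.
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB

    default_models = {
        "Logistic Regression": LogisticRegression(),
        "Decision Tree": DecisionTreeClassifier(),
        "Random Forest": RandomForestClassifier(),
        "AdaBoost": AdaBoostClassifier(),
        "kNN": KNeighborsClassifier(),
        "SVM": SVC(),
        "Naive Bayes": GaussianNB(),
    }

    for name, model in default_models.items():
        model.fit(X_train, y_train)
        print(f"{name}: train={model.score(X_train, y_train):.2f} "
              f"test={model.score(X_test, y_test):.2f}")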

As expected, each model performed better on the training data than on the testing data (i.e., without tuning, the models overfit the training data). The biggest drop-off between training and testing was with the X model.

The overall winner in the bake-off among the default models was AdaBoost with a score of 94%.

Tuning Models

Each of the different classifiers in sklearn has a myriad of tuning options, some more than others. Best practice is to optimize or tune each model by comparing/scoring a number of parameter combinations with cross-validation data. A very powerful way to make this comparison across a large parameter space is GridSearchCV, which accepts an estimator and a flexible grid (or list of grids) of candidate parameter values. GridSearchCV performs cross-validation for each combination of parameters and reports back the relevant results.
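As one illustration, the sketch below tunes AdaBoost with GridSearchCV. The grid shown is illustrative rather than the exact grid we searched, though it does include the sklearn default values.

    # Tune one model class over a small grid with 5-fold cross-validation.
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "n_estimators": [50, 100, 200],    # 50 is the sklearn default
        "learning_rate": [0.1, 0.5, 1.0],  # 1.0 is the sklearn default
    }

    grid = GridSearchCV(AdaBoostClassifier(), param_grid, cv=5, scoring="accuracy")
    grid.fit(X_train, y_train)

    print("best parameters:", grid.best_params_)
    print("test accuracy:", grid.score(X_test, y_test))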

We used GridSearchCV to find the best combination of parameters for each model type. We then fit the ‘best-in-class’ version of each model type on the training data and scored it on the testing data. The chart below shows the accuracy scores for these tuned models versus their default/untuned counterparts described above:

Diagram: Results of Model Tuning

As the diagram shows, in some instances tuning did improve the performance of the model over the model with sklearn's default parameters. In other cases, however, we saw a slight decline when comparing the accuracy score of the tuned model selected by GridSearchCV with that of the model using the defaults. We did ensure that the set of parameters searched by GridSearchCV included sklearn's default options. The root problem is likely the small amount of training data from which GridSearchCV creates its cross-validation folds.

The overall winner in the bake-off among the tuned models was AdaBoost with a score of 89%.

Ensembling

In an effort to improve upon the predictive ability of any one model, we created a meta-model trained on the predictions from the 7 ‘best-in-class’ models described in the previous section. The following describes our approach in more detail:

The meta-model achieved an accuracy score of 90% on the testing data, just slightly higher than the best contributing model in the ensemble.
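As an illustration only, the sketch below shows one way a meta-model can be stacked on the base-model predictions; the logistic regression meta-learner and the use of cross-validated training predictions are assumptions, not necessarily what we implemented.

    # Stack a meta-model on top of the 7 tuned base models (tuned_models is assumed
    # to be a dict of the fitted 'best-in-class' estimators from GridSearchCV).
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    # Cross-validated predictions on the training data avoid leaking training labels.
    train_preds = np.column_stack([
        cross_val_predict(m, X_train, y_train, cv=5) for m in tuned_models.values()
    ])
    test_preds = np.column_stack([m.predict(X_test) for m in tuned_models.values()])

    meta_model = LogisticRegression()
    meta_model.fit(train_preds, y_train)
    print("ensemble test accuracy:", meta_model.score(test_preds, y_test))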

NLTK Results

Diagram: Results of Hand-Made NLP Features

Applying the ensemble to the hand-engineered NLP features, we achieved a maximum score of about 85% with the Random Forest classifier, followed by the AdaBoost classifier. These results are not nearly as strong as the same ensemble run with the Microsoft Azure generated features, which scored around 90% across all classifiers. While the hand-made features are an excellent option for a student on a budget, the features obtained from Microsoft Azure's NLP API were much richer and produced results that were better by 5 to 15 percentage points.

On a small scale, such as a final project, these results may be fine. However, for an advertiser purchasing expensive ad space to market products to specific potential clients, we would recommend using Microsoft Azure, since its richer features improved targeting by a minimum of 5% and a maximum of 15%.

Neural Networks

An additional modeling effort was to create a neural network using the Keras library. In this implementation, we made a small number of design decisions:

We fit the model on the scaled training data for 50 epochs and scored accuracy on the scaled testing data. After re-running several times, we noticed the accuracy score was rather volatile, varying with the random weight initialization used by Keras. The overall accuracy score of 69% was the lowest of the models in this project, confirming our intuition that neural network success depends on significantly larger amounts of training data.
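A sketch consistent with this description is shown below; since our design decisions are listed elsewhere, the layer sizes and activations here are assumptions.

    # Scale the features, then fit a small Keras network for 50 epochs.
    from sklearn.preprocessing import StandardScaler
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    scaler = StandardScaler()
    X_train_s = scaler.fit_transform(X_train)
    X_test_s = scaler.transform(X_test)

    model = Sequential([
        Dense(16, activation="relu", input_shape=(X_train_s.shape[1],)),
        Dense(8, activation="relu"),
        Dense(1, activation="sigmoid"),  # binary output: bot vs. human
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    model.fit(X_train_s, y_train, epochs=50, verbose=0)
    _, test_accuracy = model.evaluate(X_test_s, y_test, verbose=0)
    print("test accuracy:", test_accuracy)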

MLP Classifier

Lastly, we also applied sklearn's MLPClassifier to our data. This model is based on the same concept as the neural network described above, and it gave us a comparable classification accuracy (0.69). Like the Keras model, and unlike the earlier models, its results varied significantly from run to run.
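For completeness, a minimal version of this run might look like the following; the hidden-layer sizes are assumptions.

    # sklearn's MLPClassifier on the same scaled data used for the Keras model.
    from sklearn.neural_network import MLPClassifier

    mlp = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000)
    mlp.fit(X_train_s, y_train)
    print("MLP test accuracy:", mlp.score(X_test_s, y_test))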
