Project Team: Eumar Assis, Andrew Caide, Mark Carlebach, and Jiang Yusheng
In our final bakeoff, our highest prediction accuracy was achieved by the Ensemble model trained on user-level enriched data. The difference between this best score and the second-best score, from the tuned AdaBoost model, was very small. In fact, in many runs the ranking was reversed.
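When two models trade places from run to run, it can help to summarize the paired scores across repeated runs rather than rely on a single bakeoff. The snippet below is a minimal sketch of that idea, not the procedure used in our notebooks: the function name `compare_runs` is hypothetical, and the accuracy lists are made-up placeholders for illustration, not results from this project.

```python
from statistics import mean, stdev


def compare_runs(scores_a, scores_b):
    """Summarize paired accuracy scores from repeated runs of two models.

    scores_a[i] and scores_b[i] are assumed to come from the same run
    (same train/test split or random seed), so they can be compared pairwise.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return {
        "mean_a": mean(scores_a),
        "mean_b": mean(scores_b),
        "mean_diff": mean(diffs),      # average advantage of model A over B
        "stdev_diff": stdev(diffs),    # run-to-run spread of that advantage
        "a_wins": sum(d > 0 for d in diffs),
        "b_wins": sum(d < 0 for d in diffs),
        "ties": sum(d == 0 for d in diffs),
    }


# Placeholder accuracies for illustration only; NOT results from this project.
model_a_runs = [0.91, 0.89, 0.92, 0.90, 0.90]
model_b_runs = [0.90, 0.91, 0.91, 0.90, 0.89]
summary = compare_runs(model_a_runs, model_b_runs)
for key, value in summary.items():
    print(key, value)
```

If the mean difference is small relative to its run-to-run spread and the win counts are close, the two models are best treated as roughly equivalent on this data.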
Our hand-made system was somewhat less successful, reaching a maximum accuracy of 80%. Below are the results from the analysis conducted using NLTK:
Overall, our methods were fairly successful. The neural network developed in-house certainly wasn’t optimal, perhaps hindered by a lack of data. From a business perspective, if budget allows, Azure would make a great addition to an NLP pipeline, adding an additional 10% accuracy to the models we employed. Otherwise, a hand-coded NLP classifier can reach about 80% accuracy.
It is interesting that our models were able to predict classifications. It is also interesting that, with these non-parametric models, our ability to explain why or how they reach their predictions seems limited, which drives home a point made repeatedly in class.
Future Considerations
This project would clearly benefit from additional time. We found the effort very educational. We also know there are many things we could improve upon with more time and experience in the future. Here are some of those areas for improvement:
Resources
The notebooks and data files developed as part of this project are submitted separately for examination and are available on our GitHub site listed here: https://github.com/eumarassis/Harvard-s109-TwitterBotDetection
The diagram below provides a graphical view of how the notebooks and data are organized. The items in yellow represent the data and files that can be easily loaded and run to produce the modeling output above.
Diagram: Notebooks and Data Pipeline
If you wish to run the notebooks that collect data directly from Twitter, include in your directory a file called ‘twitter_credentials.py’ with the following entries for your Twitter access tokens (the values shown are non-valid samples):
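As a minimal sketch of such a file, assuming the four standard OAuth 1.0a values that Twitter issues (consumer key, consumer secret, access token, access token secret): the variable names below are hypothetical, so check the data collection notebooks for the exact names they import. Every value is a non-valid placeholder.

```python
# twitter_credentials.py
# Hypothetical layout: the variable names are assumptions for illustration,
# not necessarily the names the notebooks expect.
# All values are non-valid placeholders; replace them with your own tokens.
CONSUMER_KEY = "YOUR_CONSUMER_KEY"
CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_TOKEN_SECRET = "YOUR_ACCESS_TOKEN_SECRET"
```

Keep this file out of version control so that real tokens are never published with the repository.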
We read the material at the following links as research during the planning and EDA phases of our project: