Project Team: Eumar Assis, Andrew Caide, Mark Carlebach, and Jiang Yusheng
Diagram: Enriched User Data
While the data collection and cleaning focused on tweets, we pursued one path of analysis that focused on engineering user-related fields derived from the underlying tweet data (which contained the Azure NLP attributes) for each user. To perform this step, we grouped the de-normalized tweet-user data structure by screen_name and aggregated fields as described below to create the user-level features on which we did the modeling:
Below is the code snippet for generating these features:
import numpy as np
# Assuming the de-normalized tweet-user data has already been loaded as 'tweet_df'

def calculate_avg_delta(x):
    '''Inter-tweet feature for each user: the spread (standard deviation) of the gaps
    between consecutive tweets, in hours.'''
    # Return a large sentinel value for a user with only one tweet (no gaps to measure)
    if len(x) == 1:
        return 24 * 60
    x_sorted = np.sort(x)
    one_hour_delta = np.timedelta64(1, 'h')
    # Differences between consecutive timestamps, expressed in hours
    array_deltas = np.diff(x_sorted) / one_hour_delta
    return np.std(array_deltas)

def calculate_avg_tweet_length(x):
    '''Average tweet text length (in characters) for each user.'''
    return np.mean([len(item) for item in x])

tweet_df_grouped_user = tweet_df.groupby(['screen_name']).agg({
    'screen_name': np.min,
    'id': 'count',
    'is_bot': np.min,
    'nlp_count_key_phrases': np.mean,
    'nlp_sentiment_score': np.mean,
    'is_tweet': np.mean,
    'followers_count': np.mean,
    'created_at': calculate_avg_delta,
    'text': calculate_avg_tweet_length
}).rename(columns={
    'id': 'count_tweets',
    'created_at': 'avg_intertweet_time',
    'text': 'avg_text_length'
})
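As a quick sanity check, the two helper functions can be exercised on small, made-up inputs (the timestamps and strings below are purely illustrative, not project data):
import pandas as pd

# Three illustrative timestamps with gaps of 1 hour and 2 hours between consecutive tweets
sample_times = pd.Series(pd.to_datetime([
    '2020-01-01 00:00', '2020-01-01 01:00', '2020-01-01 03:00'
]))
# Gaps are [1.0, 2.0] hours, so their standard deviation is 0.5
print(calculate_avg_delta(sample_times))                              # 0.5

# Average character length of two made-up tweets: (11 + 2) / 2 = 6.5
print(calculate_avg_tweet_length(pd.Series(['hello world', 'hi'])))   # 6.5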
Another path we pursued was to analyze the tweets with the NLTK API. Two key features were engineered for this branch of analysis: the average length of the tweets per user, and the 10 most bot-favored word choices per user.
# Assuming the data has already been loaded as 'text_data'
# Tokenizer breaks tweet down into str objects inside lists.
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
# Remove NAs
text_data.dropna(subset=['text'], inplace=True)
# Turn tweets into lists of str objects
text_data['tokens'] = text_data['text'].apply(tt.tokenize)
# Count the strings in the list
text_data['tweet_length'] = text_data['tokens'].str.len()
# Aggregate by name, compute the mean tweet lengths, and sort!
text_data.groupby(['name']).tweet_length.mean().sort_values(ascending=False)
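The per-user mean tweet lengths computed here feed the join further below (via tweet_len_by_bot and tweet_len_by_usr, each carrying a mean_tweet_length column); that intermediate step is not shown, so the following is only a sketch of how those tables could be assembled from text_data:
# Assumed construction of the per-user mean tweet length tables referenced later;
# the variable names and the 'mean_tweet_length' column simply mirror how they are used below.
mean_lengths = text_data.groupby(['name']).agg(
    mean_tweet_length=('tweet_length', 'mean'),
    known_bot=('known_bot', 'max'))
tweet_len_by_bot = mean_lengths.loc[mean_lengths.known_bot == True, ['mean_tweet_length']]
tweet_len_by_usr = mean_lengths.loc[mean_lengths.known_bot == False, ['mean_tweet_length']]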
# Assuming the data has already been loaded as 'text_data'
# FreqDist is a powerful tool to count frequency of strings
# stopwords is a collection of stop-words in the nltk library
from nltk import FreqDist
from nltk.corpus import stopwords
import string
# This was an earlier step to segregate the data by user-type
bot_texts = text_data.loc[text_data.known_bot == True][useful_cols]
real_texts = text_data.loc[text_data.known_bot == False][useful_cols]
# cluster words by twitter name
bot_words = bot_texts.groupby(['name']).tokens.agg(sum)
usr_words = real_texts.groupby(['name']).tokens.agg(sum)
# Clean up the two arrays created above: insert into dataframes, label, and remove stopwords
bot_words = pd.DataFrame(bot_words)
usr_words = pd.DataFrame(usr_words)
bot_words.columns = ['words']
usr_words.columns = ['words']
stop_words = stopwords.words('english') + list(string.punctuation) + [' ','rt',"\'", "...", "..","`",'\"', '–', '’', "I'm", '…','""','“','”']
# Construct list of cleaned words
usr_words['cleaned_words'] = [[word for word in words if word.lower() not in stop_words]
for words in usr_words['words']]
bot_words['cleaned_words'] = [[word for word in words if word.lower() not in stop_words]
for words in bot_words['words']]
# Count the cleaned words (stopwords removed) across all users in each of the two groups.
freq_per_usr = FreqDist(list([a for b in usr_words.cleaned_words.tolist() for a in b]))
freq_per_bot = FreqDist(list([a for b in bot_words.cleaned_words.tolist() for a in b]))
# Most common words, clean
common_words_bot = pd.DataFrame(freq_per_bot.most_common())
common_words_usr = pd.DataFrame(freq_per_usr.most_common())
cols = ["Words", "Count"]
common_words_bot.columns = cols
common_words_usr.columns = cols
# Normalize each count by the number of distinct words in its group
common_words_usr['Frequency'] = common_words_usr['Count']/len(common_words_usr)
common_words_bot['Frequency'] = common_words_bot['Count']/len(common_words_bot)
# Remove short words (fewer than 3 characters), which are often noise (emojis, 'hi', etc.)
filter1 = (common_words_usr['Words'].str.len()>=3)
filter2 = (common_words_bot['Words'].str.len()>=3)
filtered_usr = common_words_usr.loc[filter1]
filtered_bot = common_words_bot.loc[filter2]
The filtered_usr and filtered_bot tables hold the most-used ‘important’ words in our tweet collections. Next, the top 10 bot words are selected, and a column is constructed for each of them in the original dataframe (text_data), defaulted to 0 to indicate the word has never been used.
naughty_words = filtered_bot[:10]
# Set these to 0
for word in naughty_words['Words']:
    text_data[word] = 0
# Count the instances at which these words occur across all tokens
for word in naughty_words['Words']:
    text_data[word] = text_data.apply(lambda row: row['tokens'].count(word), axis=1)
# IMPORTANT!
# Now the word-use frequency is calculated per user.
# This matters because there are more bots than real people in this collection; to keep the
# datasets proportionate, the frequency with which a specific user utters these words in their
# tweets is more informative than the total number of times they mention them.
# Consider this:
# Person A debating politics vs. Person B shouting "trump trump trump trump": Person A may
# mention trump more often in total than Person B, yet Person A uses the word less frequently
# (as part of an actual conversation rather than spam).
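# Illustrative (made-up) numbers for the normalization applied below: if Person A mentions
# "trump" 12 times with a mean tweet length of 24 tokens, their rate is 12 / 24 = 0.5, while
# Person B mentioning it 8 times with a mean tweet length of 4 tokens scores 8 / 4 = 2.0.
# Person A has more total mentions, but Person B uses the word far more densely.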
# Sum all of the instances these select words were used by each twitter user. Also merge in the known_bot status (soon to be the endogenous variable).
text_by_names = text_data.groupby(['name']).sum()[naughty_words['Words']]
to_join = text_data[['name','known_bot']].drop_duplicates().set_index('name')
text_by_names=text_by_names.join(to_join, how='inner').drop_duplicates()
# To compute the frequency at which these words are used by each Twitter user, we need their average tweet length
bot_texts2 = text_by_names.loc[text_by_names.known_bot == True].join(tweet_len_by_bot, how='inner')
usr_texts2 = text_by_names.loc[text_by_names.known_bot == False].join(tweet_len_by_usr, how='inner')
# Remember: if it sounds like a bot it might just be a bot.
for word in naughty_words['Words']:
    usr_texts2[word + "_freq"] = usr_texts2[word] / usr_texts2['mean_tweet_length']
    bot_texts2[word + "_freq"] = bot_texts2[word] / bot_texts2['mean_tweet_length']
The next step is to scale the relevant features and split the data into training and test sets.
# sklearn provides utilities for splitting data into train/test sets and for feature scaling.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
# Preserving important columns for labeling
cols = list(naughty_words['Words']+"_freq")
cols.append("known_bot")
cols.append('mean_tweet_length')
# Binding the bots and real users
all_res = pd.concat([bot_texts2, usr_texts2])
all_res = all_res[cols]
# Separating into train-test sets
X_train, X_test = train_test_split(all_res, test_size=.2, stratify=all_res['known_bot'])
# Extract our endogenous variable, make it binary (0's [false, not a bot] 1's [true, a bot])
y_train = X_train['known_bot']*1
y_test = X_test['known_bot']*1
# Rebuild the feature column labels (without 'known_bot', which is dropped below)
cols = list(naughty_words['Words']+"_freq")
cols.append("mean_tweet_length")
# Drop our endogenous variable from our exogenous variables.
X_train = X_train.drop('known_bot', axis=1)
X_test = X_test.drop('known_bot', axis=1)
# Scale!
scaler = MinMaxScaler().fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train))
X_test = pd.DataFrame(scaler.transform(X_test))
# Label the columns
X_train.columns = cols
X_test.columns = cols
And now the data is ready for the custom NLP processing!
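As a minimal sketch of how the prepared matrices could be consumed downstream, assuming a simple baseline classifier (the model choice here is illustrative only, not necessarily what the project used):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Illustrative baseline only: fit on the scaled word-frequency features
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
print(accuracy_score(y_test, baseline.predict(X_test)))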