Social Media Analytics

--- A sentiment analysis of tweets about airlines


Introduction

Purpose

  • The purpose of this project is to predict whether the sentiment of a tweet about an airline is negative (a complaint) or not.

About the dataset

  • This project uses three datasets:
  • Two CSV files containing 1,700 complaints and 1,700 non-complaints, respectively
  • One CSV file containing 4,555 unlabeled tweets

Method

  • First, I converted the tweets into vectors using scikit-learn's CountVectorizer, which converts a collection of text documents into a matrix of token counts.

  • I chose TF-IDF as the text mining method to weight the resulting count matrix.

  • I used scikit-learn's Pipeline to chain the vectorizer, the TF-IDF transformer, and the classifier, so all transformations are set up at once.
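To make the two pipeline steps concrete, here is a toy, pure-Python sketch of what CountVectorizer and TfidfTransformer compute. The helper names `count_vectorize` and `tfidf_weight` are illustrative, not scikit-learn APIs; the idf formula mirrors TfidfTransformer's smoothed default, but the sketch skips the L2 normalization the real transformer applies.

```python
import math
from collections import Counter

def count_vectorize(docs):
    """Toy stand-in for CountVectorizer: build a vocabulary and token-count matrix."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        matrix.append([counts.get(tok, 0) for tok in vocab])
    return vocab, matrix

def tfidf_weight(matrix):
    """Toy stand-in for TfidfTransformer: smoothed idf, no L2 normalization."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    return [[tf * idf[j] for j, tf in enumerate(row)] for row in matrix]

docs = ["flight delayed again", "great flight"]
vocab, counts = count_vectorize(docs)
weights = tfidf_weight(counts)
print(vocab)   # ['again', 'delayed', 'flight', 'great']
print(counts)  # [[1, 1, 1, 0], [0, 0, 1, 1]]
```

A term like "flight" that appears in every document gets a lower idf weight than a rarer term like "delayed", which is exactly why TF-IDF helps the classifier focus on discriminative words.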

Model

  • I trained three models: a Naive Bayes classifier (MultinomialNB), a Random Forest classifier, and a Light Gradient Boosting Machine (LightGBM).

  • Using GridSearchCV, I passed in a grid of candidate values for each hyperparameter of each model, compared the roc_auc scores, and chose the best parameters for each model.

  • For each tuned model, I found the optimal classification threshold, computed accuracy on the test set, and plotted the ROC curve.

  • I chose MultinomialNB as the final model because it achieved the highest accuracy (75.74%) and AUC (0.8283).
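The "optimal threshold" here is the score cutoff that maximizes tpr − fpr (Youden's J statistic), the same rule the `evaluate` function uses later. A minimal pure-Python sketch, with `best_threshold` as a hypothetical helper and toy labels/scores:

```python
def best_threshold(y_true, scores):
    """Choose the score cutoff that maximizes tpr - fpr (Youden's J statistic)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / pos - fp / neg
        if j > best_j:
            best_j, best_t = j, t
    return best_t

# Toy labels (1 = non-complaint) and predicted probabilities
y = [0, 0, 1, 1, 1]
s = [0.1, 0.4, 0.35, 0.8, 0.9]
print(best_threshold(y, s))  # 0.8
```

In the notebook itself the same quantity is computed from the `fpr`/`tpr` arrays that `sklearn.metrics.roc_curve` returns, which is more efficient on real data.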


Import Useful Packages

In [10]:
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from nltk.corpus import stopwords
import string
from tqdm import tqdm
import re

Load Data

In [11]:
# Take a quick look at the raw files
with open('/Users/apple/Desktop/complaint.csv') as f:
    complaints = [line.rstrip() for line in f]

with open('/Users/apple/Desktop/noncomplaint.csv') as f:
    noncomplaints = [line.rstrip() for line in f]
In [12]:
# The data has three columns
complaints[0]
Out[12]:
'"id","airline","tweet"'
In [13]:
complaints = pd.read_csv('/Users/apple/Desktop/complaint.csv', sep=',')
noncomplaints = pd.read_csv('/Users/apple/Desktop/noncomplaint.csv', sep=',')

Exploratory Data Analysis

  • Which airline receives the most complaints?
In [19]:
complaints['airline'].value_counts()
Out[19]:
American         502
United           486
SouthWest        231
Delta            190
JetBlue          173
VirginAmerica     60
AlaskaAir         58
Name: airline, dtype: int64
In [26]:
plt.figure(figsize=(10,6))
sns.countplot(x=complaints['airline'],palette = 'rainbow')
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x11fdf6208>
In [20]:
noncomplaints['airline'].value_counts()
Out[20]:
JetBlue          363
American         343
United           320
SouthWest        255
VirginAmerica    179
AlaskaAir        143
Delta             97
Name: airline, dtype: int64
In [27]:
plt.figure(figsize=(10,6))
sns.countplot(x=noncomplaints['airline'],palette = 'rainbow')
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x12016b198>

Data Preprocessing

Labeling

  • Set the label: 0 for complaints, 1 for non-complaints
In [25]:
complaints['label']=0
noncomplaints['label']=1
In [26]:
data = pd.concat([complaints,noncomplaints])
In [29]:
data['tweet'] = data['tweet'].apply(lambda s:s.lower())
In [30]:
data.head()
Out[30]:
id airline tweet label
0 80938 United @united i'm having issues. yesterday i rebooke... 0
1 10959 United @united kinda feel like the $6.99 you charge f... 0
2 130813 SouthWest livid in vegas, delayed, again&amp; again&amp;... 0
3 146589 United @united the most annoying man on earth is on m... 0
4 117579 United @united the last 2 weeks i've flown wit u, you... 0

Text processing

  • The text data is still quite dirty. Mentions and punctuation are irrelevant to sentiment, so I remove them. I also remove English stopwords; however, negations are important for determining the sentiment of a tweet, so I keep them via a whitelist. The function is shown below:
In [31]:
def text_process(tweet):
    """
    Takes in a string of text, then:
    1. Removes all mentions
    2. Removes all punctuation
    3. Removes all stopwords (except whitelisted negations)
    4. Returns a list of the cleaned tokens
    """
    # Negation words that signal sentiment are kept via a whitelist.
    # Note: punctuation is stripped before tokenizing, so "n't" can never
    # survive as a token (e.g. "don't" becomes "dont"); in practice only
    # "not" and "no" are rescued by the whitelist.
    whitelist = ["n't", "not", "no"]
    
    # Remove @mentions
    tweet = re.sub(r'@\w+', '', tweet)
    
    # Drop punctuation characters and rejoin into a string
    nopunc = ''.join(char for char in tweet if char not in string.punctuation)
    
    # Remove stopwords, keeping whitelisted words and tokens longer than one character
    return [word for word in nopunc.split() if (word.lower() not in stopwords.words('english')
            or word.lower() in whitelist) and len(word) > 1]
In [32]:
data['tweet'].head(5).apply(text_process)
Out[32]:
0    [im, issues, yesterday, rebooked, 24, hours, s...
1    [kinda, feel, like, 699, charge, flight, wifi,...
2    [livid, vegas, delayed, againamp, againampagai...
3                 [annoying, man, earth, flight, help]
4    [last, weeks, ive, flown, wit, given, reasons,...
Name: tweet, dtype: object

Modeling

Train Test Split

In [33]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data['tweet'], data['label'], test_size=0.2)

Set The Pipeline

In [34]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
def getPipeline(model):
    """
    apply model using the pipeline
    """
    pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', model),  # train on TF-IDF vectors
    ])
    return pipeline

GridSearchCV for Model Tuning

In [35]:
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, roc_curve, auc, roc_auc_score
def gridSearchCV(model, params):
    """
    @param    model: sklearn estimator
    @param    params (dict): Dictionary of candidate parameter values
    
    @return   cv_results (DataFrame)
    
    Note: the CountVectorizer/TF-IDF steps are fit on all of X_train before the
    classifier is cross-validated, so all folds share a single vocabulary.
    """
    model_cv = GridSearchCV(model, param_grid=params, scoring='roc_auc', cv=5)
    
    pipeline = getPipeline(model_cv)
    pipeline.fit(X_train, y_train)
    cv_results = pd.DataFrame(model_cv.cv_results_)[['params', 'mean_test_score']]
    
    return cv_results

Model Evaluation

In [36]:
def evaluate(model, plotROC=False):
    """
    Fit the pipeline on the training set, then:
    1. Print AUC, the optimal threshold (max tpr - fpr), and accuracy on the test set
    2. Optionally plot the ROC curve
    """
    pipeline = getPipeline(model)
    pipeline.fit(X_train, y_train)
    probs = pipeline.predict_proba(X_test)
    preds = probs[:,1]
    fpr, tpr, threshold = roc_curve(y_test, preds)
    roc_auc = auc(fpr, tpr)
    print(f'AUC: {roc_auc:.4f}')
    
    # Find optimal threshold
    rocDf = pd.DataFrame({'fpr': fpr, 'tpr':tpr, 'threshold':threshold})
    rocDf['tpr - fpr'] = rocDf.tpr - rocDf.fpr
    optimalThreshold = rocDf.threshold[rocDf['tpr - fpr'].idxmax()]
    print(optimalThreshold)
    
    # Get accuracy over the test set
    y_pred = np.where(preds >= optimalThreshold, 1, 0)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'Accuracy: {accuracy*100:.2f}%')
    
    # Plot ROC AUC
    if plotROC:
        plt.title('Receiver Operating Characteristic')
        plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
        plt.legend(loc = 'lower right')
        plt.plot([0, 1], [0, 1],'r--')
        plt.xlim([0, 1])
        plt.ylim([0, 1])
        plt.ylabel('True Positive Rate')
        plt.xlabel('False Positive Rate')
        plt.show()

1. MultinomialNB

In [37]:
from sklearn.naive_bayes import MultinomialNB
In [38]:
params = {'alpha': [1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4]}
mnb = MultinomialNB(fit_prior=True, class_prior=None)
print(gridSearchCV(mnb, params))
           params  mean_test_score
0  {'alpha': 1.6}         0.819147
1  {'alpha': 1.7}         0.819421
2  {'alpha': 1.8}         0.819686
3  {'alpha': 1.9}         0.819826
4  {'alpha': 2.0}         0.819972
5  {'alpha': 2.1}         0.820207
6  {'alpha': 2.2}         0.820256
7  {'alpha': 2.3}         0.820270
8  {'alpha': 2.4}         0.820289
In [39]:
mnb = MultinomialNB(alpha=2.3, fit_prior=True, class_prior=None)
evaluate(mnb, plotROC=True)
AUC: 0.8283
0.45179398066892024
Accuracy: 75.74%

2. Random Forest

In [288]:
from sklearn.ensemble import RandomForestClassifier
In [300]:
params = {'n_estimators': [175,200,500]}
rfc = RandomForestClassifier(random_state=1)
print(gridSearchCV(rfc, params))
                  params  mean_test_score
0  {'n_estimators': 175}         0.800777
1  {'n_estimators': 200}         0.800467
2  {'n_estimators': 500}         0.800628
In [303]:
rfc = RandomForestClassifier(n_estimators=500,random_state=1)
evaluate(rfc,plotROC=True)
AUC: 0.7925
Accuracy: 73.68%

3. LightGBM

In [214]:
from lightgbm import LGBMClassifier
params1 = {'learning_rate': [0.01,0.05,0.1]}
params2 = {'n_estimators': [100,200,500]}
params3 = {'num_leaves': [5,6,7,8,9,10]}
params4 = {'min_data_in_leaf': [3,4,5,6,7,8]}
params5 = {'max_depth': [5, 6, 7 ,8, 9]}
params6 = {'max_bin': [45,50,55,60,65]}
params7 = {'bagging_fraction': [0.7,0.72,0.75,0.78,0.8]}
params8 = {'bagging_freq': [3,4,5,6,7,8,9]}
params9 = {'feature_fraction': [0.2,0.22,0.24,0.26]}
params10 = {'feature_fraction_seed': [5,6,7,8,9]}
params11 = {'bagging_seed': [5,6,7,8,9]}
params12 = {'min_sum_hessian_in_leaf': [7,8,9,10,11,12]}

lightGBM = LGBMClassifier(learning_rate=0.01,
                          n_estimators=500,
                          num_leaves=5,
                          min_data_in_leaf=5,
                          max_depth=5,
                          max_bin=55,
                          bagging_fraction=0.8,
                          bagging_freq=4,
                          feature_fraction=0.22,
                          feature_fraction_seed=9,
                          bagging_seed=7,
                          min_sum_hessian_in_leaf=7)
print(gridSearchCV(lightGBM, params11))
                params  mean_test_score
0  {'bagging_seed': 5}         0.735416
1  {'bagging_seed': 6}         0.734045
2  {'bagging_seed': 7}         0.735735
3  {'bagging_seed': 8}         0.735136
4  {'bagging_seed': 9}         0.733867
In [341]:
lightGBM = LGBMClassifier(learning_rate=0.01,
                          n_estimators=500,
                          num_leaves=5,
                          min_data_in_leaf=5,
                          max_depth=5,
                          max_bin=55,
                          bagging_fraction=0.8,
                          bagging_freq=4,
                          feature_fraction=0.22,
                          feature_fraction_seed=9,
                          bagging_seed=7,
                          min_sum_hessian_in_leaf=7)
evaluate(lightGBM, plotROC=True)
AUC: 0.7575
0.5590903396065869
Accuracy: 68.68%

Load Test Data

In [40]:
mytest_set = pd.read_csv('/Users/apple/Downloads/test_data.csv', sep=',')
In [41]:
mytest_set.head()
Out[41]:
Unnamed: 0 id tid_not_to_be_used airline tag tweet
0 1 12 5.062364e+17 United rO(x6c15'Can @jordanhembrough @united I just flew them to P...
1 2 122 5.063613e+17 United rO(x6c15'Can Another @united flight = another long delay. S...
2 3 170 5.064473e+17 American rO(x6c15'Can @AmericanAir pd $129 in advc for main cabin xt...
3 4 181 5.064553e+17 American rO(x6c15'Can @AmericanAir Flight 1242 is delayed, 1 gate ag...
4 5 187 5.064594e+17 VirginAmerica rO(x6c15'Can @IIJERiiCHOII @VirginAmerica What. The. Fuck.
In [42]:
mytest_set['tweet'] = mytest_set['tweet'].apply(lambda s:s.lower())
In [43]:
mytest_set['tweet'].head(5).apply(text_process)
Out[43]:
0    [flew, phoenix, got, bag, locks, also, ripped,...
1    [another, flight, another, long, delay, sittin...
2    [pd, 129, advc, main, cabin, xtra, seat, bcn, ...
3    [flight, 1242, delayed, gate, agent, handling,...
4                                               [fuck]
Name: tweet, dtype: object
In [44]:
test = mytest_set['tweet']
In [45]:
test.head()
Out[45]:
0    @jordanhembrough @united i just flew them to p...
1    another @united flight = another long delay. s...
2    @americanair pd $129 in advc for main cabin xt...
3    @americanair flight 1242 is delayed, 1 gate ag...
4        @iijeriichoii @virginamerica what. the. fuck.
Name: tweet, dtype: object

Predict

In [58]:
X = data['tweet']
y = data['label']
final_model = getPipeline(mnb).fit(X, y)
In [59]:
# classify with a threshold of 0.61 (stricter than the ROC-optimal 0.4518,
# favoring precision on predicted non-complaints)
predictions = final_model.predict_proba(test)[:, 1]
predictions = np.where(predictions >= 0.61,1,0)
In [60]:
mytest_set['predictions'] = predictions
In [61]:
mytest_set['predictions'].value_counts()
Out[61]:
0    4316
1     239
Name: predictions, dtype: int64
In [67]:
output = mytest_set[mytest_set['predictions']==1].drop(['Unnamed: 0','tid_not_to_be_used',
                                                        'airline','tag'],axis=1).reset_index(drop=True)
In [63]:
output.head()
Out[63]:
id tweet predictions
0 210 @iijeriichoii @virginamerica we all knew there... 1
1 297 @united likes to lie to its customer, i advise... 1
2 1203 not only guilted into giving up aisle for a mi... 1
3 1284 mom fought with @united to switch the ticket s... 1
4 1288 @clutch08 you're going to wish that flight was... 1
In [64]:
mytest_set[mytest_set['predictions']==1]['tweet'].head(5).apply(text_process)
Out[64]:
6     [knew, one, pax, travel, horror, story, sorry,...
9     [likes, lie, customer, advise, fly, watch, gam...
29    [not, guilted, giving, aisle, middle, seat, fl...
33    [mom, fought, switch, ticket, go, nola, see, s...
34    [youre, going, wish, flight, longer, rocks, mu...
Name: tweet, dtype: object
In [69]:
output.to_csv('/Users/apple/Desktop/output2.csv', sep=',', encoding='utf-8')

Result

  • After manually reviewing the 239 tweets the model predicted as non-complaints in the test set, 180 of the predictions were correct.

  • The precision of the model is therefore about 75.3% (180/239).
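The precision figure follows directly from the review counts; a quick check:

```python
reviewed = 239  # tweets the model flagged as non-complaints
correct = 180   # confirmed non-complaints after manual review
precision = correct / reviewed
print(f'{precision:.1%}')  # 75.3%
```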