First, I converted each tweet into a vector using scikit-learn's CountVectorizer, which turns a collection of text documents into a matrix of token counts. I then applied TF-IDF weighting to that count matrix, so that tokens appearing in many tweets carry less weight than distinctive ones.
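For illustration, here is a minimal sketch of these two steps on a toy corpus (toy strings, not the project data):
# Toy illustration of CountVectorizer -> TF-IDF weighting
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
toy = ["flight delayed again", "great flight crew"]
counts = CountVectorizer().fit_transform(toy)       # documents -> token-count matrix
weights = TfidfTransformer().fit_transform(counts)  # counts -> TF-IDF weights
print(weights.toarray().round(2))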
I used scikit-learn's Pipeline to chain these steps into a single workflow, which let me set up all the transformations and the classifier at once.
I trained three models: a Multinomial Naive Bayes classifier, a Random Forest classifier, and a LightGBM (Light Gradient Boosting Machine) classifier.
Using scikit-learn's GridSearchCV, I passed in a grid of candidate values for each model's hyperparameters, compared the five-fold mean roc_auc scores, and chose the best parameters for each model.
For each tuned model, I found the optimal classification threshold, computed the accuracy over the test set, and plotted the ROC curve. The threshold is chosen as the ROC point that maximizes TPR - FPR (Youden's J statistic).
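To make the threshold rule concrete, here is a toy sketch (illustrative numbers only, not the models' actual ROC output):
import numpy as np
# Pick the threshold at the ROC point maximizing TPR - FPR (Youden's J)
fpr = np.array([0.0, 0.2, 0.5, 1.0])
tpr = np.array([0.0, 0.7, 0.9, 1.0])
thresholds = np.array([1.0, 0.61, 0.40, 0.0])
best_threshold = thresholds[np.argmax(tpr - fpr)]  # 0.61 in this toy case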
I chose MultinomialNB as the final model because it achieved the highest accuracy (75.74%) and AUC (0.8283).
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from nltk.corpus import stopwords
import string
from tqdm import tqdm
import re
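# The stopword list requires the NLTK corpus; on a fresh environment,
# run nltk.download('stopwords') once before using stopwords.words('english').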
# Take a look at what the raw data looks like
with open('/Users/apple/Desktop/complaint.csv') as f:
    complaints = [line.rstrip() for line in f]
with open('/Users/apple/Desktop/noncomplaint.csv') as f:
    noncomplaints = [line.rstrip() for line in f]
# The data has three columns
complaints[0]
complaints = pd.read_csv('/Users/apple/Desktop/complaint.csv', sep=',')
noncomplaints = pd.read_csv('/Users/apple/Desktop/noncomplaint.csv', sep=',')
complaints['airline'].value_counts()
plt.figure(figsize=(10,6))
sns.countplot(x=complaints['airline'],palette = 'rainbow')
noncomplaints['airline'].value_counts()
plt.figure(figsize=(10,6))
sns.countplot(x=noncomplaints['airline'],palette = 'rainbow')
# Label complaints as 0 and noncomplaints as 1, then combine the two sets
complaints['label'] = 0
noncomplaints['label'] = 1
data = pd.concat([complaints, noncomplaints])
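As a sanity check, the combined class balance can be inspected:
data['label'].value_counts()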
data['tweet'] = data['tweet'].apply(lambda s:s.lower())
data.head()
def text_process(tweet):
    """
    Takes in a string of text, then performs the following:
    1. Removes all @-mentions
    2. Removes all punctuation
    3. Removes all stopwords (except a small whitelist)
    4. Returns a list of the cleaned tokens
    """
    # Keep a few stopwords that often signal sentiment
    whitelist = ["n't", "not", "no"]
    # Remove @-mentions
    tweet = re.sub(r'@\w+', '', tweet)
    # Drop punctuation characters and rejoin into a single string
    nopunc = ''.join(char for char in tweet if char not in string.punctuation)
    # Remove stopwords, keeping whitelisted words and tokens longer than one character
    stop_words = set(stopwords.words('english'))
    return [word for word in nopunc.split()
            if (word.lower() not in stop_words or word.lower() in whitelist)
            and len(word) > 1]
data['tweet'].head(5).apply(text_process)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data['tweet'], data['label'], test_size=0.2)
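# Note: this split is unseeded, so results vary from run to run; passing a fixed
# random_state (and stratify=data['label']) would make it reproducible and
# preserve the class ratio across train and test.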
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
def getPipeline(model):
    """
    Build the full text-classification pipeline around the given model.
    """
    pipeline = Pipeline([
        ('bow', CountVectorizer(analyzer=text_process)),  # strings to token integer counts
        ('tfidf', TfidfTransformer()),                    # integer counts to weighted TF-IDF scores
        ('classifier', model),                            # train on TF-IDF vectors
    ])
    return pipeline
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, roc_curve, auc, roc_auc_score
def gridSearchCV(model, params):
    """
    @param model: sklearn estimator
    @param params (dict): dictionary of candidate parameter values
    @return cv_results (DataFrame): tried parameters and their mean ROC AUC
    """
    model_cv = GridSearchCV(model, param_grid=params, scoring='roc_auc', cv=5)
    pipeline = getPipeline(model_cv)
    pipeline.fit(X_train, y_train)
    cv_results = pd.DataFrame(model_cv.cv_results_)[['params', 'mean_test_score']]
    return cv_results
def evaluate(model, plotROC=False):
    """
    Fit the pipeline on the training set, then:
    1. Report the test-set AUC (and optionally plot the ROC curve)
    2. Find the optimal threshold and report test-set accuracy at that threshold
    """
    pipeline = getPipeline(model)
    pipeline.fit(X_train, y_train)
    probs = pipeline.predict_proba(X_test)
    preds = probs[:, 1]
    fpr, tpr, threshold = roc_curve(y_test, preds)
    roc_auc = auc(fpr, tpr)
    print(f'AUC: {roc_auc:.4f}')
    # Find the optimal threshold: the ROC point maximizing TPR - FPR (Youden's J)
    rocDf = pd.DataFrame({'fpr': fpr, 'tpr': tpr, 'threshold': threshold})
    rocDf['tpr - fpr'] = rocDf.tpr - rocDf.fpr
    optimalThreshold = rocDf.threshold[rocDf['tpr - fpr'].idxmax()]
    print(f'Optimal threshold: {optimalThreshold:.2f}')
    # Get accuracy over the test set at the optimal threshold
    y_pred = np.where(preds >= optimalThreshold, 1, 0)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'Accuracy: {accuracy*100:.2f}%')
    # Plot the ROC curve
    if plotROC:
        plt.title('Receiver Operating Characteristic')
        plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
        plt.legend(loc='lower right')
        plt.plot([0, 1], [0, 1], 'r--')
        plt.xlim([0, 1])
        plt.ylim([0, 1])
        plt.ylabel('True Positive Rate')
        plt.xlabel('False Positive Rate')
        plt.show()
from sklearn.naive_bayes import MultinomialNB
params = {'alpha': [1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4]}
mnb = MultinomialNB(fit_prior=True, class_prior=None)
print(gridSearchCV(mnb, params))
mnb = MultinomialNB(alpha=2.3, fit_prior=True, class_prior=None)
evaluate(mnb, plotROC=True)
from sklearn.ensemble import RandomForestClassifier
params = {'n_estimators': [175,200,500]}
rfc = RandomForestClassifier(random_state=1)
print(gridSearchCV(rfc, params))
rfc = RandomForestClassifier(n_estimators=500, random_state=1)
evaluate(rfc, plotROC=True)
from lightgbm import LGBMClassifier
params1 = {'learning_rate': [0.01,0.05,0.1]}
params2 = {'n_estimators': [100,200,500]}
params3 = {'num_leaves': [5,6,7,8,9,10]}
params4 = {'min_data_in_leaf': [3,4,5,6,7,8]}
params5 = {'max_depth': [5, 6, 7 ,8, 9]}
params6 = {'max_bin': [45,50,55,60,65]}
params7 = {'bagging_fraction': [0.7,0.72,0.75,0.78,0.8]}
params8 = {'bagging_freq': [3,4,5,6,7,8,9]}
params9 = {'feature_fraction': [0.2,0.22,0.24,0.26]}
params10 = {'feature_fraction_seed': [5,6,7,8,9]}
params11 = {'bagging_seed': [5,6,7,8,9]}
params12 = {'min_sum_hessian_in_leaf': [7,8,9,10,11,12]}
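These twelve grids were searched one hyperparameter at a time; only the final bagging_seed search is shown below. As a hedged sketch (an illustration with a hypothetical fresh base estimator, not the notebook's actual run), such a sequential sweep could be scripted like this, fixing each best value before searching the next:
base_lgbm = LGBMClassifier()  # hypothetical starting point for the sweep
for params in [params1, params2, params3, params4, params5, params6,
               params7, params8, params9, params10, params11, params12]:
    results = gridSearchCV(base_lgbm, params)
    best = results.loc[results['mean_test_score'].idxmax(), 'params']
    base_lgbm.set_params(**best)  # fix the best value before the next search
    print(best)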
# LightGBM with the hyperparameters chosen by the one-at-a-time search
lightGBM = LGBMClassifier(learning_rate=0.01,
                          n_estimators=500,
                          num_leaves=5,
                          min_data_in_leaf=5,
                          max_depth=5,
                          max_bin=55,
                          bagging_fraction=0.8,
                          bagging_freq=4,
                          feature_fraction=0.22,
                          feature_fraction_seed=9,
                          bagging_seed=7,
                          min_sum_hessian_in_leaf=7)
print(gridSearchCV(lightGBM, params11))
# Re-instantiate a fresh (unfitted) estimator with the chosen hyperparameters
lightGBM = LGBMClassifier(learning_rate=0.01,
                          n_estimators=500,
                          num_leaves=5,
                          min_data_in_leaf=5,
                          max_depth=5,
                          max_bin=55,
                          bagging_fraction=0.8,
                          bagging_freq=4,
                          feature_fraction=0.22,
                          feature_fraction_seed=9,
                          bagging_seed=7,
                          min_sum_hessian_in_leaf=7)
evaluate(lightGBM, plotROC=True)
mytest_set = pd.read_csv('/Users/apple/Downloads/test_data.csv', sep=',')
mytest_set.head()
mytest_set['tweet'] = mytest_set['tweet'].apply(lambda s:s.lower())
mytest_set['tweet'].head(5).apply(text_process)
test = mytest_set['tweet']
test.head()
X = data['tweet']
y = data['label']
final_model = getPipeline(mnb).fit(X, y)
predictions = final_model.predict_proba(test)[:, 1]
# Apply the optimal threshold found during evaluation (0.61) to get hard labels
predictions = np.where(predictions >= 0.61, 1, 0)
mytest_set['predictions'] = predictions
mytest_set['predictions'].value_counts()
output = mytest_set[mytest_set['predictions'] == 1].drop(
    ['Unnamed: 0', 'tid_not_to_be_used', 'airline', 'tag'],
    axis=1).reset_index(drop=True)
output.head()
mytest_set[mytest_set['predictions']==1]['tweet'].head(5).apply(text_process)
output.to_csv('/Users/apple/Desktop/output2.csv', sep=',', encoding='utf-8')
After manually reading all the tweets my model predicted as noncomplaints on this test set, 180 of the 239 predictions were correct.
The precision of my model is therefore around 75.3%.
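As a quick check:
print(f'Precision: {180/239:.1%}')  # 180 correct out of 239 -> 75.3%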