# import useful packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# read the raw data
Data = pd.read_csv("HMAHCC_COMP.csv")
testData = pd.read_csv("holdOut_processed.csv")
# Keep the unique id lists so training and testing rows can be separated again later
testID = pd.DataFrame({'id':testData.id.unique()})
trainID = pd.DataFrame({'id':Data.id.unique()})
print(trainID.shape)
print(testID.shape)
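# Quick sanity check (illustrative, assuming 'id' identifies a patient in both
# files): the training and holdout id lists should be disjoint
print(len(set(trainID.id) & set(testID.id)), "overlapping ids between train and holdout")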
# We only need patient behavior on or before the opioid-naive event (Days <= 0), since only pre-event data can be used for prediction
trainData = Data[Data.Days <= 0]
fullData = pd.concat([trainData, testData], axis=0)
print(fullData.shape)
label = pd.read_csv("label.csv")
label.shape
# Merge train data with label
trainDataLabel = trainData.merge(label, on='id')
def analyze_attr(attr, df):
    """Count ids per (attr value, LTOP class) and rank attr values by the class difference."""
    df1 = df[['id', attr, 'LTOP']].copy().groupby(['id', attr]).mean().reset_index()
    df2 = df1.groupby([attr, 'LTOP']).count().reset_index()
    df2.columns = [attr, 'LTOP', 'count']
    df3 = df2.pivot_table(values='count', index=[attr], columns=['LTOP']).reset_index()
    df3.columns = [attr, 'LTOP_0', 'LTOP_1']
    df3['diff'] = df3.LTOP_1 - df3.LTOP_0
    return df3.sort_values('diff', ascending=False)
def plot_attr(attr, df, size):
    """Count plot of attr values per id, split by LTOP class."""
    df1 = df[['id', attr, 'LTOP']].copy().groupby(['id', attr]).mean().reset_index()
    plt.figure(figsize=(size, 4))
    sns.countplot(x=attr, hue='LTOP', palette='coolwarm', data=df1)
EDA_data = Data.merge(label, on='id')
# Opioid Rx claims: paid Rx rows that carry a day-supply count
opioid = EDA_data[pd.notnull(EDA_data['PAY_DAY_SUPPLY_CNT']) & (EDA_data['event_descr'] == 'RX Claim - Paid')]
#analyze_attr('event_attr5',trainDataLabel).head(5)
analyze_attr('event_attr1',opioid).head(5)
plot_attr('event_attr1',opioid,12)
analyze_attr('event_attr5',opioid)
analyze_attr('Specialty',opioid)
plot_attr('event_attr5',opioid,50)
plot_attr('Specialty',opioid,50)
opioidDay0Supplies = opioid[opioid.Days==0][['id','PAY_DAY_SUPPLY_CNT','LTOP']].groupby('id').max().reset_index() #7
opioidDay0Supplies.columns = ['id','opioidDay0Supplies','LTOP']
opioidDay0Supplies.head()
sns.boxplot(x='LTOP',y='opioidDay0Supplies',data=opioidDay0Supplies,palette='coolwarm')
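# A numeric companion to the boxplot above (a quick sketch): summary statistics
# of the day-0 supply counts per LTOP class
opioidDay0Supplies.groupby('LTOP')['opioidDay0Supplies'].describe()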
OtherRX = EDA_data[(EDA_data['event_descr']=='RX Claim - Paid')& (EDA_data['Days']<=0)]
OtherRX.head()
def getcountdiff(attr, df, name):
    """Average per-id event counts by attr value and LTOP class; return the top-5 values by absolute class difference."""
    df1 = df[['id', attr, 'LTOP', 'event_attr5']].copy().groupby(['id', attr, 'LTOP']).count().reset_index()
    df2 = df1.copy().groupby([attr, 'LTOP']).mean().reset_index()
    df2.columns = [attr, 'LTOP', 'count']
    df2.sort_values('count', ascending=False, inplace=True)
    df3 = df2.pivot(index=attr, columns='LTOP', values='count').reset_index()
    df3.columns = [name, 'LTOP_0', 'LTOP_1']
    df3['diff'] = abs(df3['LTOP_0'] - df3['LTOP_1'])
    df3.sort_values('diff', ascending=False, inplace=True)
    return df3.head(5)
def plotcountdiff(attr, df, name):
    """Bar plot of the top-5 values returned by getcountdiff."""
    df4 = getcountdiff(attr, df, name).melt(id_vars=name, value_vars=['LTOP_0', 'LTOP_1'])
    return sns.barplot(y=name, x='value', hue='variable', data=df4, palette='coolwarm')
getcountdiff('event_attr1',OtherRX,'drugclass')
plotcountdiff('event_attr1',OtherRX,'Drug Class')
getcountdiff('event_attr6',OtherRX,'Drug Group')
plotcountdiff('event_attr6',OtherRX,'Drug Group')
FullyPaid = EDA_data[(EDA_data['event_descr']=='Fully Paid Claim')& (EDA_data['Days']<=0)]
FullyPaid.head()
getcountdiff('event_attr1',FullyPaid,'diagnosis')
plotcountdiff('event_attr1',FullyPaid,'Diagnosis')
getcountdiff('event_attr2',FullyPaid,'Treatment Place')
plotcountdiff('event_attr2',FullyPaid,'Treatment Place')
Call = EDA_data[EDA_data['event_descr'].isin(['Inbound Call by Mbr',
                                              'Inbound Call by Prov',
                                              'Inbound Call by Other'])
                & (EDA_data['Days'] <= 0)]
getcountdiff('event_attr2',Call,'Call Inquiry')
plotcountdiff('event_attr2',Call,'Call Inquiry')
getcountdiff('event_attr3',Call,'Disposition')
plotcountdiff('event_attr3',Call,'Disposition')
# Keep only the rows for RX Claim - Paid events
rxClaimPaid = fullData[fullData.event_descr=='RX Claim - Paid'].copy()
rxClaimPaid.shape
# Rows with a day-supply count correspond to opioid prescriptions
opioid = rxClaimPaid[rxClaimPaid.PAY_DAY_SUPPLY_CNT >= 0].copy()
opioid.shape
len(opioid.id.unique())
def getDummies(col, data):
"""
Get dummy variables for specified column
"""
df = data[['id', col]].copy()
df = pd.get_dummies(df.set_index('id'), drop_first=True).groupby(['id']).max().reset_index()
return df
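# Illustrative toy example (hypothetical data) of what getDummies produces:
# each id is flagged for every value it ever appears with (first category dropped)
toy = pd.DataFrame({'id': [1, 1, 2], 'Specialty': ['PAIN', 'ORTHO', 'PAIN']})
getDummies('Specialty', toy)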
def getCount(col, data):
"""
    Return a wide table: one row per id, one column per value of `col`,
    with each cell holding that id's event count
"""
df = data.groupby(['id', col])['event_descr'].count().reset_index()
df = df.pivot(index='id', columns=col, values='event_descr').reset_index().fillna(0)
return df
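# Similarly, a toy example (hypothetical data) of the wide count table from getCount:
toy = pd.DataFrame({'id': [1, 1, 1, 2],
                    'event_attr1': ['A', 'A', 'B', 'A'],
                    'event_descr': ['RX Claim - Paid'] * 4})
getCount('event_attr1', toy)  # -> id 1: A=2, B=1; id 2: A=1, B=0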
opioidDrugClass = getCount('event_attr1', opioid) # 1
opioidBrandName = getCount('event_attr5', opioid) # 2
opioidSpecialty = getCount('Specialty', opioid) # 3
print(opioidDrugClass.shape)
opioidDrugClass.head(3)
print(opioidBrandName.shape)
opioidBrandName.head(3)
print(opioidSpecialty.shape)
opioidSpecialty.head(3)
opioidCount = opioid.groupby('id')['event_descr'].count().reset_index() # 4
opioidCount.columns = ['id', 'opioidCount']
print(opioidCount.shape)
opioidCount.head(3)
opioidDaysSupplied = opioid.groupby('id')['PAY_DAY_SUPPLY_CNT'].sum().reset_index() # 5
opioidDaysSupplied.columns = ['id', 'opioidDaysSupplied']
print(opioidDaysSupplied.shape)
opioidDaysSupplied.head(3)
opioidDay0Specialty = getDummies('Specialty', opioid[opioid.Days==0]) # 6
print(opioidDay0Specialty.shape)
opioidDay0Specialty.head(3)
opioidDay0Supplies = opioid[opioid.Days==0][['id','PAY_DAY_SUPPLY_CNT']].groupby('id').max().reset_index() #7
opioidDay0Supplies.columns = ['id', 'opioidDay0Supplies']
print(opioidDay0Supplies.shape)
opioidDay0Supplies.head(3)
fullData.event_descr.value_counts()
fullyPaidClaim = fullData[fullData.event_descr=='Fully Paid Claim'].copy()
fullyPaidClaim.shape
def getAmount(event):
"""
Get Count, ChargeAmount, NetPaidAmount and ResponsibleAmount
for specified event
"""
    # Filter rows for the specified event
    data = fullData[fullData.event_descr == event].copy()
    # Convert amount columns to float
    data['event_attr3'] = data['event_attr3'].astype(float)
    data['event_attr4'] = data['event_attr4'].astype(float)
    data['event_attr5'] = data['event_attr5'].astype(float)
# Group by sum of charged amount
groupbyData = data.groupby('id').agg({'event_attr3': ['count', 'sum'],
'event_attr4': 'sum',
'event_attr5': 'sum'})
groupbyData.columns = [
'Count', 'ChargeAmount', 'NetPaidAmount', 'ResponsibleAmount']
groupbyData = groupbyData.reset_index()
return groupbyData
claimAmount = getAmount('Fully Paid Claim') #8
print(claimAmount.shape)
claimAmount.head(3)
fullyPaidClaim = fullData[fullData.event_descr=='Fully Paid Claim'].copy()
fullyPaidClaimTrainLabel = trainDataLabel[trainDataLabel.event_descr=='Fully Paid Claim'].copy()
def getDifference(col, data, n):
    """Rank values of `col` by the absolute difference in mean per-id event counts between LTOP classes; also return the top-n list."""
    df0 = data.groupby(['id', col, 'LTOP'])['event_descr'].count().reset_index().fillna(0)
    df1 = df0.groupby([col, 'LTOP'])['event_descr'].mean().reset_index()
    df1 = df1.pivot(index=col, columns='LTOP', values='event_descr').reset_index()
    df1.columns = [col, 'LTOP_0', 'LTOP_1']
    df1['diff'] = abs(df1.LTOP_0 - df1.LTOP_1)
    df1 = df1.sort_values('diff', ascending=False)
    differenceList = list(df1[col].head(n))
    return df1, differenceList
def getPopular(col, data, n):
    """Rank values of `col` by the number of distinct ids; also return the top-n list."""
    df = data.groupby([col])['id'].nunique().reset_index().sort_values('id', ascending=False)
    popularList = list(df[col].head(n))
    return df, popularList
def getImportant(col, data, n1, n2):
    """Values of `col` that are both highly class-discriminating (top n1) and popular (top n2)."""
    df1, differenceList = getDifference(col, data, n1)
    df2, popularList = getPopular(col, data, n2)
    return [event for event in differenceList if event in popularList]
def getImportantCount(col, importantList, data):
    """Per-id counts (via getCount) restricted to the important values of `col`."""
    return getCount(col, data[data[col].isin(importantList)])
importantList = getImportant('event_attr2', fullyPaidClaimTrainLabel, 30, 30) #9
treatmentPlace = getImportantCount('event_attr2', importantList, fullyPaidClaim)
print(treatmentPlace.shape)
treatmentPlace.head(3)
# getDifference('event_attr1', fullyPaidClaimTrain, 30)[0]
# getPopular('event_attr1', fullyPaidClaimTrain, 30)[0]
# getImportant('event_attr1', fullyPaidClaimTrain, 1000, 1000)
importantList = getDifference('event_attr1', fullyPaidClaimTrainLabel, 100)[1]
diagnosisClaim = getImportantCount('event_attr1', importantList, fullyPaidClaim) #10
print(diagnosisClaim.shape)
diagnosisClaim.head(3)
rxClaimPaidLabel = rxClaimPaid.merge(label, on='id')
drugClass = getImportantCount('event_attr1', getImportant('event_attr1', rxClaimPaidLabel, 200, 200), rxClaimPaid) #11
print(drugClass.shape)
drugClass.head(3)
drugGroup = getCount('event_attr6', rxClaimPaid) #12
print(drugGroup.shape)
drugGroup.head()
rxClaimPaid.shape
def getRxCost():
"""
Get total RxCost and NetPaidAmount for Rx Claim Paid
"""
    # get Rx Claim - Paid data
    data = rxClaimPaid.copy()
    # Convert amount columns to float
    data['event_attr3'] = data['event_attr3'].astype(float)
    data['event_attr4'] = data['event_attr4'].astype(float)
# Group by sum of charged amount
groupbyData = data.groupby('id').agg({'event_attr3': ['count', 'sum'],
'event_attr4': 'sum'})
groupbyData.columns = [
'Count', 'RxCost', 'NetPaidAmount']
groupbyData = groupbyData.reset_index()
return groupbyData
rxCost = getRxCost() #13
print(rxCost.shape)
rxCost.head(3)
inboundCall = fullData[fullData['event_descr'].isin(['Inbound Call by Mbr',
                                                     'Inbound Call by Prov',
                                                     'Inbound Call by Other'])].copy()
inboundCallLabel = inboundCall.merge(label, on='id')
callCategory = getImportantCount('event_attr1',
getImportant('event_attr1', inboundCallLabel, 200, 200), inboundCall) # 14
print(callCategory.shape)
callCategory.head(3)
# getDifference('event_attr2', inboundCallLabel, 30)[0]
# getPopular('event_attr1', rxClaimPaidLabel, 30)[0]
# getImportant('event_attr2', inboundCallLabel, 100, 100)
# callCategory = getImportantDummies('event_attr1', getImportant('event_attr1', inboundCallLabel, 200, 200), inboundCall) # 14
callInquiry = getImportantCount('event_attr2',
getImportant('event_attr2', inboundCallLabel, 200, 200), inboundCall) # 15
print(callInquiry.shape)
callInquiry.head(3)
callDisposition = getImportantCount('event_attr3',
getImportant('event_attr3', inboundCallLabel, 200, 200), inboundCall) # 16
print(callDisposition.shape)
callDisposition.head(3)
# event_attr5 holds 'City, State'; fill missing values with a 'Na, Na'
# sentinel so the split below always yields a state token
inboundCall['state'] = inboundCall['event_attr5'].fillna(
    'Na, Na').apply(lambda location: location.split(', ')[1])
callState = getDummies('state', inboundCall) # 17
print(callState.shape)
callState.head(3)
fullData.event_descr.value_counts()
surgeryAmount = getAmount('Surgery') # 18
print(surgeryAmount.shape)
surgeryAmount.head(3)
newCPDAmount = getAmount('New diagnosis - CPD') # 19
print(newCPDAmount.shape)
newCPDAmount.head(3)
newHyperAmount = getAmount('New diagnosis - Hypertension') # 20
print(newHyperAmount.shape)
newHyperAmount.head(3)
newTop5Amount = getAmount('New diagnosis - Top 5') # 21
print(newTop5Amount.shape)
newTop5Amount.head(3)
newCHFAmount = getAmount('New diagnosis - CHF') # 22
print(newCHFAmount.shape)
newCHFAmount.head(3)
newDiabAmount = getAmount('New diagnosis - Diabetes') # 23
print(newDiabAmount.shape)
newDiabAmount.head(3)
features = [opioidDrugClass, opioidBrandName, opioidSpecialty, opioidCount,
opioidDaysSupplied, opioidDay0Specialty, opioidDay0Supplies, claimAmount,
treatmentPlace, diagnosisClaim, drugClass, drugGroup,
rxCost, callCategory, callInquiry, callDisposition, callState,
surgeryAmount, newCPDAmount, newHyperAmount, newTop5Amount, newCHFAmount, newDiabAmount]
from functools import reduce
# Outer-merge all per-id feature tables into one wide frame keyed on id
mergedDf = reduce(lambda left, right: pd.merge(
    left, right, on=['id'], how='outer'), features)
# Impute missing values with 0: an id absent from a feature table simply had no events of that type
mergedDf = mergedDf.fillna(0)
mergedDf.shape
# Split train data and test data
trainDf = mergedDf.merge(trainID, on='id')
testDf = mergedDf.merge(testID, on='id')
# Add label to train data
trainDf = trainDf.merge(label, how='outer', on='id').fillna(0)
# Save data (testID is saved as well because it is reloaded below)
trainDf.to_csv('trainDf.csv', index=False)
testDf.to_csv('testDf.csv', index=False)
testID.to_csv('testID.csv', index=False)
# import useful packages
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_columns = None
X = trainDf.drop(['LTOP','id'], axis=1)
target = trainDf['LTOP']
model = ExtraTreesClassifier()
model.fit(X, target)
# visualize the important features
dset = pd.DataFrame()
dset['attr'] = X.columns
dset['importance'] = model.feature_importances_
dset = dset.sort_values(by='importance', ascending=False)
dplot = dset.head(10)
plt.figure(figsize=(16, 8))
plt.barh(y=dplot['attr'], width=dplot['importance'], color='cornflowerblue')
plt.title('Feature Importances', fontsize=20, fontweight='bold', pad=20)
plt.xlabel('Importance', fontsize=14, labelpad=20)
plt.show()
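# RFECV is imported above but never applied; a minimal sketch of how it could
# be used here (left commented out because recursive elimination over this many
# columns is slow, hence the coarse step and small fold count):
# rfecv = RFECV(ExtraTreesClassifier(n_estimators=100), step=10,
#               cv=StratifiedKFold(3), scoring='roc_auc')
# rfecv.fit(X, target)
# print("Optimal number of features:", rfecv.n_features_)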
trainDf = pd.read_csv("/content/trainDf.csv")
testDf = pd.read_csv("/content/testDf.csv")
testID = pd.read_csv("/content/testID.csv")
print(trainDf.shape)
print(testDf.shape)
allData = pd.concat([trainDf.drop(['LTOP'], axis=1).set_index('id'), testDf.set_index('id')], axis=0)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
allData = pd.DataFrame(scaler.fit_transform(allData), columns=allData.columns)
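# Note: the scaler above is fit on train and holdout rows together, which leaks
# holdout statistics into preprocessing. A leakage-free variant (sketch) would
# fit on the training rows only before transforming everything:
# scaler.fit(allData.iloc[:trainDf.shape[0]])
# allData = pd.DataFrame(scaler.transform(allData), columns=allData.columns)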
# Split the scaled matrix back into train and holdout rows by position
n_train = trainDf.shape[0]  # 13997 in this dataset
X = allData.iloc[:n_train, ]
X_testDf = allData.iloc[n_train:, ]
y = trainDf.LTOP
print(X.shape)
print(y.shape)
print(X_testDf.shape)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=2019)
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_curve, auc, roc_auc_score
from sklearn.model_selection import KFold, cross_val_score
def getAccuracy(model, threshold):
    """Fit the model, binarize probabilities at `threshold`, and report confusion matrix, classification report and accuracy."""
    model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
y_pred = np.where(probs >= threshold, 1, 0)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\n")
print("Classification Report:\n",classification_report(y_test, y_pred))
return accuracy_score(y_test, y_pred)
n_folds = 10
def cv_ROC_AUC(model):
"""
Return the average AUC score.
"""
# Set KFold to shuffle data before the split
kf = KFold(n_folds, shuffle=True, random_state=42)
# Get accuracy score
roc_auc = cross_val_score(model, X, y, scoring="roc_auc", cv=kf)
return roc_auc.mean()
def get_ROC_AUC(model):
"""
Return AUC score
"""
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:,1]
roc_auc = roc_auc_score(y_test, probs)
return roc_auc
def plotROC(model):
"""
1. Plot ROC AUC
2. Return the best threshold
"""
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)
# Plot ROC AUC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
    # Find the threshold that maximizes Youden's J statistic (tpr - fpr)
rocDf = pd.DataFrame({'fpr': fpr, 'tpr':tpr, 'threshold':threshold})
rocDf['tpr - fpr'] = rocDf.tpr - rocDf.fpr
optimalThreshold = rocDf.threshold[rocDf['tpr - fpr'].idxmax()]
return optimalThreshold
def evaluate(model):
optimalThreshold = plotROC(model)
print("Optimal Threshold: ", optimalThreshold)
print("\n")
print("Accuracy: ", getAccuracy(model, optimalThreshold))
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=1000)
print("AUC score: ", get_ROC_AUC(rfc))
evaluate(rfc)
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression(penalty='l1', C=0.5, solver='liblinear', max_iter=1000)
print("AUC score: ", get_ROC_AUC(logit))
evaluate(logit)
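# With the L1 penalty many coefficients shrink to exactly zero; a quick look
# at the surviving features (illustrative; logit was fit inside evaluate above)
coefs = pd.Series(logit.coef_[0], index=X.columns)
print("Non-zero coefficients:", (coefs != 0).sum())
coefs[coefs != 0].abs().sort_values(ascending=False).head(10)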
from lightgbm import LGBMClassifier
# Model tuning
paraList = [500, 1000, 1500]
roc_auc_list = []
for i in paraList:
lgb = LGBMClassifier(objective='binary',
learning_rate=0.049,
n_estimators=i, # 1462
num_leaves=8,
min_data_in_leaf=4,
max_depth=3,
max_bin=41,
bagging_fraction=0.845,
bagging_freq=5,
feature_fraction=0.24,
feature_fraction_seed=9,
bagging_seed=9,
min_sum_hessian_in_leaf=11)
roc_auc = get_ROC_AUC(lgb)
roc_auc_list.append(roc_auc)
print("Parameter: ", i, "\tAUC score: ", roc_auc)
plt.plot(np.array(paraList),
np.array(roc_auc_list))
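# An alternative to sweeping n_estimators by hand is LightGBM early stopping.
# The exact API depends on the installed lightgbm version (older releases take
# fit(..., early_stopping_rounds=50); newer ones use callbacks), so this is
# left as a commented sketch:
# from lightgbm import early_stopping
# lgb = LGBMClassifier(objective='binary', n_estimators=5000, learning_rate=0.03)
# lgb.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric='auc',
#         callbacks=[early_stopping(50)])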
# Best model chosen after tuning
lgb = LGBMClassifier(objective='binary',
learning_rate=0.03,
n_estimators=1130,
num_leaves=8,
min_data_in_leaf=1,
max_depth=10,
max_bin=80,
bagging_fraction=0.8,
bagging_freq=5,
feature_fraction=0.24,
feature_fraction_seed=9,
bagging_seed=9,
min_sum_hessian_in_leaf=12)
cv_ROC_AUC(lgb)
evaluate(lgb)
# Use test data to make predictions
lgb.fit(X, y)
predictions = lgb.predict_proba(X_testDf)[:, 1]
output = pd.DataFrame({'ID': testDf.id,
'SCORE': predictions})
testID.columns = ['ID']
# Outer merge so holdout ids with no extracted features still receive a (zero) score
output = output.merge(testID, on='ID', how='outer').fillna(0)
output['RANK'] = output.SCORE.rank(method='min', ascending=False)
output = output.sort_values('SCORE', ascending=False)
output.head(10)
pd.Series(np.where(output.SCORE >= 0.5, 1, 0)).value_counts()
y.value_counts()
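# Compare the predicted positive rate on the holdout with the training
# prevalence (illustrative; 0.5 is an arbitrary cutoff, not the tuned threshold)
print("Predicted positive rate:", (output.SCORE >= 0.5).mean())
print("Training positive rate:", y.mean())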