Introduction

  • Long-Term Opioid Therapy (LTOT) can post a series of severe physical and mental issues for patients. Long Term Opioid Therapy is defined as the continuous use of opioid medication with 90% of days covered over six months.

image.png

  • To assist Humana in identifying patients likely to overuse opioids and make a timely intervention, first, we used the given dataset to identify and label patients having LTOT. Then we compared LTOT and non-LTOT patients to discover the most different features (e.g. opioid use history, the sum of claim charged amount). After that, we performed feature engineering to get essential features for each patient. Next, using our preprocessed dataset, we used cross-validation to tune, evaluate, and choose the best model among Logistic Regression, Random Forest, and Light Gradient Boosting models. Lastly, we trained our chosen LightGB model on the whole train dataset and make predictions on the holdout dataset.

About the dataset

  • The training dataset covers 4-year medical records of 14,000 patients and has six million rows, each of which provides details corresponding to an event for the particular member. The holdout dataset covers 6,000 patients and only includes data up to the first opioid naïve event. In this dataset the four main event categories are: Claim, Diagnosis, Call, and RxClaim.

Load the data

In [0]:
# import useful packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [0]:
# read the raw data
Data = pd.read_csv("HMAHCC_COMP.csv")
testData = pd.read_csv("holdOut_processed.csv")
/Users/apple/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3057: DtypeWarning: Columns (8) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [0]:
# serve as mark in order to distinguish training and testing data
testID = pd.DataFrame({'id':testData.id.unique()})
trainID = pd.DataFrame({'id':Data.id.unique()})
In [0]:
print(trainID.shape)
print(testID.shape)
(14000, 1)
(6000, 1)
In [0]:
# We only need to care about patient behavior before opoid naive because we are here to make predictions
trainData = Data[Data.Days <= 0]
In [0]:
fullData = pd.concat([trainData, testData], axis=0)
In [0]:
print(fullData.shape)
(4320622, 20)

Labeling LTOT

  • To correctly identify and label the event of interest, we subsetted the opioid paid RX claim data. Firstly, we sorted the data by patient id, drug types, and days to avoid overlapping. Then, we identify opioid naïve events of each patient. After that, based on days supplied, we calculated how many days each patient use opioid drugs within 180 days after the last opioid naïve event. This number is reset whenever a patient has a new opioid naive event. Within 180 days after each opioid naive event, if a patient has more than 162 days of opioid supplies, she or he will be labeled as LTOT.
  • To predict the probability of having LTOT for a patient, we focused on analyzing opioid usage history and each of the main event categories to learn from their past behaviors.
In [0]:
label = pd.read_csv("label.csv")
In [0]:
label.shape
Out[0]:
(13949, 2)
In [0]:
# Merge train data with label
trainDataLabel = trainData.merge(label, on='id')

Exploratory Data Analysis

image.png

  • This functions is created to show the difference between LTOT and non-LTOT of a particular attribute by calculating the total amount
In [0]:
def analyze_attr(attr, df):
    df1 = df[['id', attr, 'LTOP']].copy().groupby(['id', attr]).mean().reset_index()
    df2 = df1.groupby([attr, 'LTOP']).count().reset_index()
    df2.columns = [attr, 'LTOP', 'count']

    df3 = df2.pivot_table(values='count', index=[attr], columns=['LTOP']).reset_index()
    df3.columns = [attr, 'LTOP_0', 'LTOP_1']
    df3['diff'] = df3.LTOP_1 - df3.LTOP_0
    return df3.sort_values('diff', ascending=False)
  • This function is created to visualize the formal analysis in a countplot
In [0]:
def plot_attr(attr, df, size):
    df1 = df[['id', attr, 'LTOP']].copy().groupby(['id', attr]).mean().reset_index()
    plt.figure(figsize=(size, 4))
    sns.countplot(x=attr, hue='LTOP', palette='coolwarm', data=df1)

1. Opioid RX Claim

  • When labeling LTOT, we realized that the history of opioid usage might be highly correlated with the likelihood of future LTOT. So we decided to discover opioid usage history first. We tried to explore whether LTOT and non-LTOT patients would have different patterns in various attributes provided in the data.
In [0]:
EDA_data = Data.merge(label, on='id')
In [0]:
opioid = EDA_data[(pd.notnull(EDA_data['PAY_DAY_SUPPLY_CNT']))&(EDA_data['event_descr']=='RX Claim - Paid')]
#analyze_attr('event_attr5',trainDataLabel).head(5)
In [0]:
analyze_attr('event_attr1',opioid).head(5)
Out[0]:
event_attr1 LTOP_0 LTOP_1 diff
1 OPIOID AGONISTS 4184.0 4876.0 692.0
2 OPIOID ANTAGONISTS 72.0 302.0 230.0
4 OPIOID PARTIAL AGONISTS 69.0 157.0 88.0
3 OPIOID COMBINATIONS 6574.0 5127.0 -1447.0
0 ANTIPERISTALTIC AGENTS NaN 1.0 NaN
In [0]:
plot_attr('event_attr1',opioid,12)
  • From the count plot above, we found out that patients who had LTOT would use certain drug classes more than non-LTOT patients, indicating the drug class description may have predictive power in distinguishing the LTOT risk. For example, Opioid Agonist is more likely to cause LTOT.
  • With a similar approach, we can also find other potential descriptive attributes to distinguish between LTOT and non-LTOT patients which are GPI Drug Class Description and Speciality.
In [0]:
analyze_attr('event_attr5',opioid)
analyze_attr('Specialty',opioid)
plot_attr('event_attr5',opioid,50)
plot_attr('Specialty',opioid,50)
  • Moreover, the number of opioid prescriptions and the number of opioid drug supplies are also different between LTOT and non-LTOT patients. Meanwhile, we found a significant difference between LTOT and non-LTOT patients of how many drugs they received at the opioid naive day,
In [0]:
opioidDay0Supplies = opioid[opioid.Days==0][['id','PAY_DAY_SUPPLY_CNT','LTOP']].groupby('id').max().reset_index() #7
opioidDay0Supplies.columns = ['id','opioidDay0Supplies','LTOP']
opioidDay0Supplies.head()
sns.boxplot(x='LTOP',y='opioidDay0Supplies',data=opioidDay0Supplies,palette='coolwarm')
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x12cb83080>

2. Other RX Claims

  • Besides RX Claims related to opioids, there might exist other attributes that we can add to our predictors. By plotting the average count of Rx Claim of LTOT and non-LTOT patients, we found some drug classes and drug groups that are used differently among LTOT and non-LTOT patients. Patients having LTOT uses a lot more pulmonary fibrosis agents, pulmonary hypertension, immunomodulators than non-LTOT patients.
In [0]:
OtherRX = EDA_data[(EDA_data['event_descr']=='RX Claim - Paid')& (EDA_data['Days']<=0)]
OtherRX.head()
Out[0]:
id event_descr event_attr1 event_attr2 event_attr3 event_attr4 event_attr5 event_attr6 event_attr7 event_attr8 ... event_attr10 Days PAY_DAY_SUPPLY_CNT PAYABLE_QTY MME DRUG_TYPE Specialty Specialty2 Specialty3 LTOP
0 ID10010854159 RX Claim - Paid BIGUANIDES NaN 5 2.35 METFORMIN HCL DIABETES NaN METFORMIN HCL TAB 500 MG ... 27250050.0 -664 NaN NaN NaN NaN NaN NaN NaN 0
1 ID10010854159 RX Claim - Paid BETA BLOCKERS CARDIO-SELECTIVE NaN 5 2.35 ATENOLOL CARDIO NaN ATENOLOL TAB 25 MG ... 33200020.0 -659 NaN NaN NaN NaN NaN NaN NaN 0
3 ID10010854159 RX Claim - Paid FLUOROQUINOLONES NaN 5.8 3.15 CIPROFLOXACIN HCL INF-ANTIBIOTICS NaN CIPROFLOXACIN HCL TAB 500 MG (BASE EQUIV) ... 5000020.0 -646 NaN NaN NaN NaN NaN NaN NaN 0
4 ID10010854159 RX Claim - Paid HMG COA REDUCTASE INHIBITORS NaN 13 10.35 PRAVASTATIN SODIUM CARDIO NaN PRAVASTATIN SODIUM TAB 20 MG ... 39400065.0 -646 NaN NaN NaN NaN NaN NaN NaN 0
5 ID10010854159 RX Claim - Paid THIAZIDES AND THIAZIDE-LIKE DIURETICS NaN 5 2.35 HYDROCHLOROTHIAZIDE CARDIO NaN HYDROCHLOROTHIAZIDE TAB 25 MG ... 37600040.0 -644 NaN NaN NaN NaN NaN NaN NaN 0

5 rows × 21 columns

  • This functions is created to show the difference between LTOT and non-LTOT of a particular attribute by calculating the average value of each patient
In [0]:
def getcountdiff(attr,df,name):
    df1 = df[['id',attr, 'LTOP','event_attr5']].copy().groupby(['id',attr,'LTOP']).count().reset_index()
    df2 =  df1.copy().groupby([attr,'LTOP']).mean().reset_index()
    df2.columns=[attr,'LTOP','count']
    df2.sort_values('count',ascending=False,inplace=True)
    df3 = df2.pivot(index=attr, columns='LTOP', values='count').reset_index()
    df3.columns=[name,'LTOP_0','LTOP_1']
    df3['diff']=abs(df3['LTOP_0']-df3['LTOP_1'])
    df3.sort_values('diff',ascending=False,inplace=True)
    return df3.head(5)
  • This function is created to visualize the formal analysis in a countplot
In [0]:
def plotcountdiff(attr,df,name):
    df1 = df[['id',attr, 'LTOP','event_attr5']].copy().groupby(['id',attr,'LTOP']).count().reset_index()
    df2 =  df1.copy().groupby([attr,'LTOP']).mean().reset_index()
    df2.columns=[attr,'LTOP','count']
    df2.sort_values('count',ascending=False,inplace=True)
    df3 = df2.pivot(index=attr, columns='LTOP', values='count').reset_index()
    df3.columns=[name,'LTOP_0','LTOP_1']
    df3['diff']=abs(df3['LTOP_0']-df3['LTOP_1'])
    df3.sort_values('diff',ascending=False,inplace=True)
    df4 = df3.head(5).melt(id_vars=name,value_vars=['LTOP_0','LTOP_1'])
    return sns.barplot(y=name, x='value', hue='variable', data=df4,palette='coolwarm')
In [0]:
getcountdiff('event_attr1',OtherRX,'drugclass')
Out[0]:
drugclass LTOP_0 LTOP_1 diff
197 GROWTH HORMONES 52.000000 8.000000 44.000000
344 RESTLESS LEG SYNDROME (RLS) AGENTS 30.000000 6.000000 24.000000
333 PULMONARY FIBROSIS AGENTS 3.333333 19.500000 16.166667
335 PULMONARY HYPERTENSION - PHOSPHODIESTERASE INH... 5.125000 17.333333 12.208333
215 IMMUNOMODULATORS 5.166667 16.000000 10.833333
In [0]:
plotcountdiff('event_attr1',OtherRX,'Drug Class')
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x11f3290f0>
In [0]:
getcountdiff('event_attr6',OtherRX,'Drug Group')
In [0]:
plotcountdiff('event_attr6',OtherRX,'Drug Group')
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x11eee6a58>

3. Fully Paid Claim

  • For fully paid claim events, we tried to identify what diagnoses and treatment place are the most different between LTOT and non-LTOT patients.
In [0]:
FullyPaid = EDA_data[(EDA_data['event_descr']=='Fully Paid Claim')& (EDA_data['Days']<=0)]
FullyPaid.head()
Out[0]:
id event_descr event_attr1 event_attr2 event_attr3 event_attr4 event_attr5 event_attr6 event_attr7 event_attr8 ... event_attr10 Days PAY_DAY_SUPPLY_CNT PAYABLE_QTY MME DRUG_TYPE Specialty Specialty2 Specialty3 LTOP
7 ID10010854159 Fully Paid Claim ALLERGIC RHINITIS, CAUSE UNSPECIFIED OFFICE 100.0 0.0 67.99 NaN NaN NaN ... NaN -630 NaN NaN NaN NaN NaN NaN NaN 0
17 ID10010854159 Fully Paid Claim UNSPECIFIED ESSENTIAL HYPERTENSION ON CAMPUS-HOSPITAL OUTPATIENT 312.0 28.11 13.58 NaN NaN NaN ... NaN -595 NaN NaN NaN NaN NaN NaN NaN 0
18 ID10010854159 Fully Paid Claim UNSPECIFIED ESSENTIAL HYPERTENSION OFFICE 100.0 0.0 67.99 NaN NaN NaN ... NaN -588 NaN NaN NaN NaN NaN NaN NaN 0
23 ID10010854159 Fully Paid Claim URINARY TRACT INFECTION, SITE NOT SPECIFIED OFFICE 100.0 0.0 67.99 NaN NaN NaN ... NaN -568 NaN NaN NaN NaN NaN NaN NaN 0
34 ID10010854159 Fully Paid Claim ACUTE FRONTAL SINUSITIS OFFICE 450.0 143.61 5.23 NaN NaN NaN ... NaN -498 NaN NaN NaN NaN NaN NaN NaN 0

5 rows × 21 columns

In [0]:
getcountdiff('event_attr1',FullyPaid,'diagnosis')
Out[0]:
diagnosis LTOP_0 LTOP_1 diff
7769 OTHER COMPLICATIONS OF SKIN GRAFT (ALLOGRAFT) ... 79.0 16.0 63.0
6096 MALIGNANT NEOPLASM OF MIDDLE THIRD OF ESOPHAGUS 5.0 61.0 56.0
6066 MALIGNANT NEOPLASM OF LEFT OVARY 2.0 56.0 54.0
7702 OTHER CHRONIC HEMATOGENOUS OSTEOMYELITIS, RIGH... 1.0 44.0 43.0
13439 VESICOINTESTINAL FISTULA 5.0 44.0 39.0
In [0]:
plotcountdiff('event_attr1',FullyPaid,'Diagnosis')
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x123eb9978>
In [0]:
getcountdiff('event_attr2',FullyPaid,'Treatment Place')
Out[0]:
Treatment Place LTOP_0 LTOP_1 diff
7 COMPREHENSIVE OUTPATIENT REHAB FACILITY 22.000000 7.000000 15.000000
16 INPATIENT HOSPITAL 13.014461 17.645380 4.630919
6 COMPREHENSIVE INPATIENT REHAB FACILITY 9.722222 6.483871 3.238351
8 CUSTODIAL CARE FACILITY 3.500000 6.655172 3.155172
29 PSYCHIATRIC FACILITY PARTIAL HOSPITAL 3.400000 6.444444 3.044444
In [0]:
plotcountdiff('event_attr2',FullyPaid,'Treatment Place')
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x120c04978>

4. Inbound Calls

  • The same method is used to explore inbound call category; we find five important attributes: call category, inquiry reasons, disposition, origin, and the state of each inbound call. Patients having LTOT make more “reprocessed-allowed” and “payment plan” call inquiry.
In [0]:
Call = EDA_data[(EDA_data['event_descr']=='Inbound Call by Mbr')|
                (EDA_data['event_descr']=='Inbound Call by Prov')|
                (EDA_data['event_descr']=='Inbound Call by Other')&
                (EDA_data['Days']<=0)]
In [0]:
getcountdiff('event_attr2',Call,'Call Inquiry')
Out[0]:
Call Inquiry LTOP_0 LTOP_1 diff
846 OVER 4 DAY AGED 3.870347 5.236211 1.365863
34 ACTIVITY SUBMISSION 1.000000 2.333333 1.333333
964 PREMIUM QUESTION 1.000000 2.000000 1.000000
1053 REFERRAL LETTER INQUIRY 2.000000 1.000000 1.000000
416 ESRD 1.000000 2.000000 1.000000
In [0]:
plotcountdiff('event_attr2',Call,'Call Inquiry')
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x129adb710>
In [0]:
getcountdiff('event_attr3',Call,'Diposition')
Out[0]:
Diposition LTOP_0 LTOP_1 diff
265 ERROR/CALL CENTER 9.000000 1.000000 8.000000
562 PAYMENT PLAN ADJUSTED 8.000000 1.000000 7.000000
64 ARRANGED DELIVERY 7.202312 5.875776 1.326536
563 PAYMENT PLAN SETUP 2.000000 1.000000 1.000000
381 INSULIN PUMP DEPENDENT 1.000000 2.000000 1.000000
In [0]:
plotcountdiff('event_attr3',Call,'Diposition')
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x12a603898>

Feature Engineering

image.png

1. Opioid data

In [0]:
# filter out the data only related to RX Claim - Paid 
rxClaimPaid = fullData[fullData.event_descr=='RX Claim - Paid'].copy()
In [0]:
rxClaimPaid.shape
Out[0]:
(1899090, 20)
In [0]:
opioid = rxClaimPaid[rxClaimPaid.PAY_DAY_SUPPLY_CNT >= 0].copy()
In [0]:
opioid.shape
Out[0]:
(138549, 20)
In [0]:
len(opioid.id.unique())
Out[0]:
19882

1.1. Get dummies

  • define a function to get categorical variables
In [0]:
def getDummies(col, data):
    """
    Get dummy variables for specified column
    """
    df = data[['id', col]].copy()
    df = pd.get_dummies(df.set_index('id'), drop_first=True).groupby(['id']).max().reset_index()
    return df
  • define a function to get count of events of each patient
In [0]:
def getCount(col, data):
    """
    Get count of events for each id
    """
    df = data.groupby(['id', col])['event_descr'].count().reset_index()
    df = df.pivot(index='id', columns=col, values='event_descr').reset_index().fillna(0)
    return df
In [0]:
opioidDrugClass = getCount('event_attr1', opioid) # 1
opioidBrandName = getCount('event_attr5', opioid) # 2
opioidSpecialty = getCount('Specialty', opioid) # 3
In [0]:
print(opioidDrugClass.shape)
opioidDrugClass.head(3)
(19882, 7)
Out[0]:
event_attr1 id ANTIPERISTALTIC AGENTS ANTISPASMODICS OPIOID AGONISTS OPIOID ANTAGONISTS OPIOID COMBINATIONS OPIOID PARTIAL AGONISTS
0 ID10006701904 0.0 0.0 4.0 0.0 0.0 0.0
1 ID10010854159 0.0 0.0 1.0 0.0 0.0 0.0
2 ID10013863216 0.0 0.0 1.0 0.0 0.0 0.0
In [0]:
print(opioidBrandName.shape)
opioidBrandName.head(3)
(19882, 76)
Out[0]:
event_attr5 id ACETAMINOPHEN/CAFFEINE/DI ACETAMINOPHEN/CODEINE ACETAMINOPHEN/CODEINE #3 ACETAMINOPHEN/CODEINE #4 ACETAMINOPHEN/CODEINE PHO ASCOMP/CODEINE AVINZA BELBUCA BELLADONNA ALKALOIDS & OP ... TREZIX ULTRAM VICODIN VICODIN ES VICODIN HP VICOPROFEN XARTEMIS XR XTAMPZA ER ZOHYDRO ER ZUBSOLV
0 ID10006701904 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 ID10010854159 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 ID10013863216 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3 rows × 76 columns

In [0]:
print(opioidSpecialty.shape)
opioidSpecialty.head(3)
(19839, 178)
Out[0]:
Specialty id Acupuncturist Allergy & Immunology Anesthesiology Anesthesiology Pain Medicine Chiropractor Clinic/Center Federally Qualif Hlth Ctr Clinic/Center Medical Specialty Clinic/Center Physical Therapy Clinic/Center Primary Care ... Surgery Plastic & Reconstructive Surgery Surgery Surgical Critical Care Surgery Surgical Oncology Surgery Trauma Surgery Surgery Vascular Surgery Surgery of the Hand TRANSPLANT HEPATOLOGY Technician, Pathology Phlebotomy Thoracic Surgery Cardiothoracic Vascular Urology
0 ID10006701904 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 ID10010854159 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 ID10013863216 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3 rows × 178 columns

1.4. opioidCount (Count of opioid prescriptions)

In [0]:
opioidCount = opioid.groupby('id')['event_descr'].count().reset_index() # 4
opioidCount.columns = ['id', 'opioidCount']
print(opioidCount.shape)
opioidCount.head(3)
(19882, 2)
Out[0]:
id opioidCount
0 ID10006701904 4
1 ID10010854159 1
2 ID10013863216 1

1.5. opioidDaysSupplied (Sum of opioid days supplied)

In [0]:
opioidDaysSupplied = opioid.groupby('id')['PAY_DAY_SUPPLY_CNT'].sum().reset_index() # 5
opioidDaysSupplied.columns = ['id', 'opioidDaysSupplied']
print(opioidDaysSupplied.shape)
opioidDaysSupplied.head(3)
(19882, 2)
Out[0]:
id opioidDaysSupplied
0 ID10006701904 240.0
1 ID10010854159 5.0
2 ID10013863216 90.0

1.6. opioidDay0Specialty (dummies of Specialty on Day 0)

In [0]:
opioidDay0Specialty = getDummies('Specialty', opioid[opioid.Days==0]) # 6
print(opioidDay0Specialty.shape)
opioidDay0Specialty.head(3)
(19693, 132)
Out[0]:
id Specialty_Allergy & Immunology Specialty_Anesthesiology Specialty_Anesthesiology Pain Medicine Specialty_Chiropractor Specialty_Clinic/Center Primary Care Specialty_Clinical Nurse Specialist Gerontology Specialty_Colon & Rectal Surgery Specialty_Counselor Mental Health Specialty_Dentist ... Specialty_STUDENT IN ORGANIZED HEALTHCARE EDU/TRAI Specialty_Specialist Specialty_Surgery Specialty_Surgery Plastic & Reconstructive Surgery Specialty_Surgery Surgical Critical Care Specialty_Surgery Vascular Surgery Specialty_Surgery of the Hand Specialty_TRANSPLANT HEPATOLOGY Specialty_Thoracic Surgery Cardiothoracic Vascular Specialty_Urology
0 ID10006701904 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 ID10010854159 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 ID10013863216 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

3 rows × 132 columns

1.7. opioidDay0Supplies (opioid supplies on day 0)

In [0]:
opioidDay0Supplies = opioid[opioid.Days==0][['id','PAY_DAY_SUPPLY_CNT']].groupby('id').max().reset_index() #7
opioidDay0Supplies.columns = ['id', 'opioidDay0Supplies']
print(opioidDay0Supplies.shape)
opioidDay0Supplies.head(3)
(19693, 2)
Out[0]:
id opioidDay0Supplies
0 ID10006701904 60.0
1 ID10010854159 5.0
2 ID10013863216 90.0

2. Fully Paid Claim

In [0]:
fullData.event_descr.value_counts()
Out[0]:
RX Claim - Paid                     1899090
Fully Paid Claim                    1170330
RX Claim - Rejected                  515133
Inbound Call by Mbr                  248800
RX Claim - New Drug                  192671
New provider                         133964
Inbound Call by Prov                  91264
Inbound Call by Other                 24250
Surgery                               20992
RX Claim - First Time Mail Order       8020
New diagnosis - CPD                    3534
New diagnosis - Hypertension           3291
New diagnosis - Top 5                  3254
New diagnosis - CAD                    2210
New diagnosis - Diabetes               2124
New diagnosis - CHF                    1695
Name: event_descr, dtype: int64
In [0]:
fullyPaidClaim = fullData[fullData.event_descr=='Fully Paid Claim'].copy()
In [0]:
fullyPaidClaim.shape
Out[0]:
(1170330, 20)

2.1. Sum of Amount

  • This function is used for calculating the total amount of money each patient spent
In [0]:
def getAmount(event):
    """
    Get Count, ChargeAmount, NetPaidAmount and ResponsibleAmount
        for specified event
    """
    # get fully paid claim data
    data = fullData[fullData.event_descr==event].copy()
    
    # Convert amount to float
    data['event_attr3'] = data['event_attr3'].apply(float)
    data['event_attr4'] = data['event_attr4'].apply(float)
    data['event_attr5'] = data['event_attr5'].apply(float)

    # Group by sum of charged amount
    groupbyData = data.groupby('id').agg({'event_attr3': ['count', 'sum'],
                                          'event_attr4': 'sum',
                                          'event_attr5': 'sum'})

    groupbyData.columns = [
        'Count', 'ChargeAmount', 'NetPaidAmount', 'ResponsibleAmount']
    groupbyData = groupbyData.reset_index()
    return groupbyData
In [0]:
claimAmount = getAmount('Fully Paid Claim') #8
print(claimAmount.shape)
claimAmount.head(3)
(19899, 5)
Out[0]:
id Count ChargeAmount NetPaidAmount ResponsibleAmount
0 ID10006701904 81 57383.01 16921.91 1457.72
1 ID10010854159 43 23227.94 4274.76 1778.33
2 ID10013863216 20 18580.46 2893.69 602.36

2.2. Place of Treatment

In [0]:
fullyPaidClaim = fullData[fullData.event_descr=='Fully Paid Claim'].copy()
In [0]:
fullyPaidClaimTrainLabel = trainDataLabel[trainDataLabel.event_descr=='Fully Paid Claim'].copy()
  • This functions is created to show the difference between LTOT and non-LTOT of a particular attribute by calculating the average value of each patient
In [0]:
def getDifference(col, data, n):
    df0 = data.groupby(['id', col, 'LTOP'])['event_descr'].count().reset_index().fillna(0)
    df1 = df0.groupby([col, 'LTOP'])['event_descr'].mean().reset_index()
    df1 = df1.pivot(index=col, columns='LTOP', values='event_descr').reset_index()
    df1.columns = [col, 'LTOP_0', 'LTOP_1']

    df1['diff'] = abs(df1.LTOP_0 - df1.LTOP_1)
    df1 = df1.sort_values('diff', ascending=False)
    
    differenceList = list(df1[col].head(n))
    
    return df1, differenceList
  • This function is used for selecting the values that appear most frequently in a particular attribute
In [0]:
def getPopular(col, data, n):
    df = data.groupby([col])['id'].nunique().reset_index().sort_values('id', ascending=False)
    popularList = list(df[col].head(n))
    
    return df, popularList
  • This function is used for creating a list of those most important values we found in the former step
In [0]:
def getImportant(col, data, n1, n2):
    df1, differenceList = getDifference(col, data, n1)
    df2, popularList = getPopular(col, data, n2)

    importantList = []
    for event in differenceList:
        if event in popularList:
            importantList.append(event)
            
    return importantList
  • This function is used for getting the count of events of each patient
In [0]:
def getImportantCount(col, importantList, data):
    return getCount(col, data[data[col].apply(lambda event: event in importantList)])
In [0]:
importantList = getImportant('event_attr2', fullyPaidClaimTrainLabel, 30, 30) #9
treatmentPlace = getImportantCount('event_attr2', importantList, fullyPaidClaim)

print(treatmentPlace.shape)
treatmentPlace.head(3)
(19898, 29)
Out[0]:
event_attr2 id AMBULANCE - LAND AMBULATORY SURGICAL CENTER COMMUNITY MENTAL HEALTH CENTER COMPREHENSIVE INPATIENT REHAB FACILITY CUSTODIAL CARE FACILITY EMERGENCY ROOM - HOSPITAL END STAGE RENAL DISEASE TX FACILITY FEDERALLY QUALIFIED HEALTH CENTER HOME ... ON CAMPUS-HOSPITAL OUTPATIENT OTHER/UNKNOWN POT-CD PHARMACY PSYCHIATRIC FACILITY PARTIAL HOSPITAL RURAL HEALTH CLINIC SKILLED NURSING FACILITY STATE OR LOCAL PUBLIC HEALTH CLINIC TEMPORARY LODGING URGENT CARE FACILITY WALK-IN RETAIL HEALTH CLINIC
0 ID10006701904 1.0 0.0 0.0 0.0 0.0 8.0 0.0 0.0 7.0 ... 9.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1 ID10010854159 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 10.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 ID10013863216 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 9.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3 rows × 29 columns

2.3. Diagnosis

In [0]:
# getDifference('event_attr1', fullyPaidClaimTrain, 30)[0]
# getPopular('event_attr1', fullyPaidClaimTrain, 30)[0]
# getImportant('event_attr1', fullyPaidClaimTrain, 1000, 1000)
In [0]:
importantList = getDifference('event_attr1', fullyPaidClaimTrainLabel, 100)[1] 
diagnosisClaim = getImportantCount('event_attr1', importantList, fullyPaidClaim) #10

print(diagnosisClaim.shape)
diagnosisClaim.head(3)
(791, 101)
Out[0]:
event_attr1 id ACUTE MYOCARDIAL INFARCTION OF ANTEROLATERAL WALL, EPISODE OF CARE UNSPECIFIED ADRENOGENITAL DISORDER, UNSPECIFIED ALCOHOLIC LIVER DISEASE, UNSPECIFIED ALLERGIC PURPURA ANTINEOPLASTIC CHEMOTHERAPY INDUCED PANCYTOPENIA ATAXIA FOLLOWING UNSPECIFIED CEREBROVASCULAR DISEASE CAUSALGIA OF UNSPECIFIED LOWER LIMB CHRONIC ANGLE-CLOSURE GLAUCOMA, BILATERAL, SEVERE STAGE CHRONIC INFLAMMATORY DEMYELINATING POLYNEURITIS ... UNSPECIFIED FRACTURE OF LEFT LOWER LEG, INITIAL ENCOUNTER FOR CLOSED FRACTURE UNSPECIFIED FRACTURE OF SHAFT OF RIGHT ULNA, INITIAL ENCOUNTER FOR CLOSED FRACTURE UNSPECIFIED INJURY OF BLADDER, INITIAL ENCOUNTER UNSPECIFIED INTRACAPSULAR FRACTURE OF UNSPECIFIED FEMUR, INITIAL ENCOUNTER FOR CLOSED FRACTURE UNSPECIFIED MYCOSIS UNSPECIFIED OPEN WOUND OF RIGHT GREAT TOE WITHOUT DAMAGE TO NAIL, INITIAL ENCOUNTER UNSPECIFIED OPEN WOUND OF UNSPECIFIED BUTTOCK, SUBSEQUENT ENCOUNTER UNSPECIFIED PARALYSIS VESICOINTESTINAL FISTULA WEGENER'S GRANULOMATOSIS WITHOUT RENAL INVOLVEMENT
0 ID1002482139 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 ID10161175490 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 ID10201747254 0.0 0.0 11.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3 rows × 101 columns

3. RX Claim

3.1. Drug Class

In [0]:
rxClaimPaidLabel = rxClaimPaid.merge(label, on='id')
In [0]:
drugClass = getImportantCount('event_attr1', getImportant('event_attr1', rxClaimPaidLabel, 200, 200), rxClaimPaid) #11
print(drugClass.shape)
drugClass.head(3)
(19981, 96)
Out[0]:
event_attr1 id 5-HT3 RECEPTOR ANTAGONISTS ACNE PRODUCTS AMINOGLYCOSIDES AMPHETAMINES ANALGESIC COMBINATIONS ANDROGENS ANGIOTENSIN II RECEPTOR ANTAGONISTS ANOREXIANTS NON-AMPHETAMINE ANTACIDS - BICARBONATE ... STEROID INHALANTS STIMULANTS - MISC. SURFACTANT LAXATIVES THROMBIN INHIBITORS URINARY ANTISPASMODIC - ANTIMUSCARINICS (ANTICHOLINERGIC) URINARY ANTISPASMODICS - BETA-3 ADRENERGIC AGONISTS VAGINAL ANTI-INFECTIVES VALPROIC ACID WATER SOLUBLE VITAMINS XANTHINES
0 ID10006701904 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 ID10010854159 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 ID10013863216 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3 rows × 96 columns

3.2. Drug Group

In [0]:
drugGroup = getCount('event_attr6', rxClaimPaid) #12
print(drugGroup.shape)
drugGroup.head()
(19982, 29)
Out[0]:
event_attr6 id CANCER CARDIO DERM DIABETES GASTRO GENDER-FEMALE GENDER-MALE INF-ANTIBIOTICS LAXATIVE/BLADDER ... OTH-THYROID OTHER OTHER ANTI INFECTIVES PAIN PSYCH PSYCH-ANX PSYCH-DEP RESPIRATORY SENSE ORGANS EARS/EYES/MOUTH SUPPLIES/SUPPLEMENTS/TESTS
0 ID10006701904 0.0 22.0 3.0 0.0 10.0 0.0 0.0 3.0 0.0 ... 7.0 18.0 2.0 7.0 0.0 0.0 0.0 0.0 0.0 0.0
1 ID10010854159 0.0 40.0 0.0 9.0 3.0 0.0 0.0 6.0 0.0 ... 0.0 1.0 1.0 8.0 0.0 0.0 0.0 1.0 3.0 0.0
2 ID10013863216 0.0 2.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
3 ID10020514442 0.0 46.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 4.0 0.0 0.0 0.0 0.0 0.0 26.0
4 ID10024447278 0.0 16.0 0.0 5.0 0.0 0.0 0.0 2.0 0.0 ... 0.0 1.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 14.0

5 rows × 29 columns

3.2. RxCost

In [0]:
rxClaimPaid.shape
Out[0]:
(1899090, 20)
  • This function is created to get the total number and costs of RX Claim
In [0]:
def getRxCost():
    """
    Get total RxCost and NetPaidAmount for Rx Claim Paid
    """
    # get fully paid claim data
    data = rxClaimPaid.copy()
    
    # Convert amount to float
    data['event_attr3'] = data['event_attr3'].apply(float)
    data['event_attr4'] = data['event_attr4'].apply(float)

    # Group by sum of charged amount
    groupbyData = data.groupby('id').agg({'event_attr3': ['count', 'sum'],
                                          'event_attr4': 'sum'})

    groupbyData.columns = [
        'Count', 'RxCost', 'NetPaidAmount']
    groupbyData = groupbyData.reset_index()
    return groupbyData
In [0]:
rxCost = getRxCost() #13
print(rxCost.shape)
rxCost.head(3)
(19982, 4)
Out[0]:
id Count RxCost NetPaidAmount
0 ID10006701904 80 11765.44 11590.77
1 ID10010854159 76 1335.77 1170.11
2 ID10013863216 9 125.12 65.94

4. Inbound Call

In [0]:
inboundCall = fullData[fullData['event_descr'].apply(
    lambda event: event in ['Inbound Call by Mbr',
                            'Inbound Call by Prov',
                            'Inbound Call by Other'])].copy()
In [0]:
inboundCallLabel = inboundCall.merge(label, on='id')

4.1. Call category

In [0]:
callCategory = getImportantCount('event_attr1',
                                   getImportant('event_attr1', inboundCallLabel, 200, 200), inboundCall)  # 14
print(callCategory.shape)
callCategory.head(3)
(18310, 185)
Out[0]:
event_attr1 id ACCESS TO CARE(NOT IN SYSTEM) ANNUAL NOTIFICATION OF CHANGE (ANOC) AUTH-REFERRAL AUTH-REFERRAL (MCRU) AUTH-REFERRAL (RSO) AUTH-REFERRAL UPDATED AUTH/REFERRAL AUTHORIZATIONS AUTHORIZATIONS RX ... VENDOR VERIFICATION VOB VOB RX WEB WEB-MOBILE WEB/MOBILE WEBSITE (CT & TM) WELL DINE WELLNESS
0 ID10006701904 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 11.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 ID10010854159 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 ... 0.0 0.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 ID10013863216 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3 rows × 185 columns

4.2. Inquiry

In [0]:
# getDifference('event_attr2', inboundCallLabel, 30)[0]
# getPopular('event_attr1', rxClaimPaidLabel, 30)[0]
# getImportant('event_attr2', inboundCallLabel, 100, 100)
# callCategory = getImportantDummies('event_attr1', getImportant('event_attr1', inboundCallLabel, 200, 200), inboundCall) # 14

callInquiry = getImportantCount('event_attr2',
                                   getImportant('event_attr2', inboundCallLabel, 200, 200), inboundCall)  # 15
print(callInquiry.shape)
callInquiry.head(3)
(16624, 51)
Out[0]:
event_attr2 id AMBULATORY PYMT CLASS AUTH/REFERRAL NOT OBTAINED AUTHORIZATION BALANCE INQUIRY BENEFITS/NEW RX BILLING ISSUES CHECK INFO/RECOUPMENT CLAIM DENIED CLAIM PENDED ... SMARTSUMMARY SNF/HOME HEALTH TECH CONSULT TEST TRANSFERRED TRANSPORTATION VACATION OVERRIDE VERIFICATION WEB ACCESS WEB SITE
0 ID10006701904 0.0 0.0 0.0 4.0 2.0 0.0 1.0 3.0 2.0 ... 0.0 2.0 13.0 0.0 2.0 0.0 0.0 1.0 0.0 0.0
1 ID10010854159 0.0 0.0 0.0 1.0 3.0 1.0 0.0 1.0 1.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0.0 0.0
2 ID10024447278 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 2.0 0.0 0.0 1.0 0.0 0.0

3 rows × 51 columns

4.3. Disposition

In [0]:
callDisposition = getImportantCount('event_attr3',
                                      getImportant('event_attr3', inboundCallLabel, 200, 200), inboundCall)  # 16
print(callDisposition.shape)
callDisposition.head(3)
(17939, 86)
Out[0]:
event_attr3 id ACH UPDATED/CHANGED ADDRESSED/ADMINISTRATION ADDRESSED/DRUG-DRUG INTERACTIONS ADDRESSED/OTHER ADVISED NO POA ON FILE/CONSENT FORM ON FILE ARRANGED DELIVERY AUTHORIZATION ENTERED AUTHORIZED BALANCE INQUIRY ... TRANSFERRED TO HEALTHHELP TRANSFERRED TO MEDICAL CCR TRANSFERRED TO ORTHONET TRANSFERRED TO OTC TRANSFERRED TO PHARMACIST TRANSFERRED TO RIGHTSOURCE TRANSFERRED TO RSSP UNABLE TO UPDATE INFORMATION WEB EMULATION WEB NAVIGATION EDUCATION PROVIDED
0 ID10006701904 0.0 0.0 0.0 1.0 0.0 0.0 0.0 4.0 0.0 ... 0.0 3.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0
1 ID10010854159 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 ... 0.0 2.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
2 ID10013863216 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3 rows × 86 columns

4.4. Location

In [0]:
inboundCall['state'] = inboundCall['event_attr5'].fillna(
    'Na, Na').apply(lambda location: location.split(', ')[1])

callState = getDummies('state', inboundCall) # 17
print(callState.shape)
callState.head(3)
(18316, 16)
Out[0]:
id state_AL state_AZ state_FL state_ID state_KS state_KY state_NC state_Na state_OH state_PR state_TN state_TX state_UT state_WA state_WI
0 ID10006701904 0 1 0 0 0 1 1 1 1 0 0 1 1 0 1
1 ID10010854159 0 1 0 0 0 1 0 1 1 0 0 1 0 0 0
2 ID10013863216 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

5. Sum of Charge

In [0]:
fullData.event_descr.value_counts()
Out[0]:
RX Claim - Paid                     2711881
Fully Paid Claim                    1735868
RX Claim - Rejected                  743715
Inbound Call by Mbr                  308020
RX Claim - New Drug                  227173
New provider                         171288
Inbound Call by Prov                 107528
Surgery                               30995
Inbound Call by Other                 27051
RX Claim - First Time Mail Order       7405
New diagnosis - CPD                    3705
New diagnosis - Hypertension           2775
New diagnosis - Top 5                  2656
New diagnosis - CAD                    2526
New diagnosis - CHF                    2193
New diagnosis - Diabetes               2190
Name: event_descr, dtype: int64
In [0]:
surgeryAmount = getAmount('Surgery') # 18
print(surgeryAmount.shape)
surgeryAmount.head(3)
(9080, 5)
Out[0]:
id Count ChargeAmount NetPaidAmount ResponsibleAmount
0 ID10006701904 1 2138.90 197.10 275.0
1 ID10010854159 1 2314.00 619.08 250.0
2 ID10013863216 1 12337.42 1396.98 275.0
In [0]:
newCPDAmount = getAmount('New diagnosis - CPD') # 19
print(newCPDAmount.shape)
newCPDAmount.head(3)
(3534, 5)
Out[0]:
id Count ChargeAmount NetPaidAmount ResponsibleAmount
0 ID1002482139 1 137.0 0.00 0.0
1 ID10052944491 1 115.0 99.00 15.0
2 ID1007116120 1 1431.1 187.44 0.0
In [0]:
newHyperAmount = getAmount('New diagnosis - Hypertension') # 20
print(newHyperAmount.shape)
newHyperAmount.head(3)
(3291, 5)
Out[0]:
id Count ChargeAmount NetPaidAmount ResponsibleAmount
0 ID10112532931 1 640.0 59.23 0.0
1 ID10189080475 1 1078.0 213.53 0.0
2 ID10230809175 1 80.0 47.03 20.0
In [0]:
newTop5Amount = getAmount('New diagnosis - Top 5') # 21
print(newTop5Amount.shape)
newTop5Amount.head(3)
(3254, 5)
Out[0]:
id Count ChargeAmount NetPaidAmount ResponsibleAmount
0 ID10112532931 1 5333.00 0.00 0.0
1 ID10169051312 1 0.01 0.00 0.0
2 ID10230809175 1 80.00 47.03 20.0
In [0]:
newCHFAmount = getAmount('New diagnosis - CHF') # 22
print(newCHFAmount.shape)
newCHFAmount.head(3)
(1695, 5)
Out[0]:
id Count ChargeAmount NetPaidAmount ResponsibleAmount
0 ID10006701904 1 2296.74 0.00 0.0
1 ID10020514442 1 118.00 62.90 0.0
2 ID10074598346 1 185.00 23.52 50.0
In [0]:
newDiabAmount = getAmount('New diagnosis - Diabetes') # 23
print(newDiabAmount.shape)
newDiabAmount.head(3)
(2124, 5)
Out[0]:
id Count ChargeAmount NetPaidAmount ResponsibleAmount
0 ID10006701904 1 6.52 0.00 0.0
1 ID10112532931 1 2513.00 1188.32 0.0
2 ID10119459922 1 20.55 0.00 0.0

6. Merge

In [0]:
features = [opioidDrugClass, opioidBrandName, opioidSpecialty, opioidCount,
           opioidDaysSupplied, opioidDay0Specialty, opioidDay0Supplies, claimAmount,
           treatmentPlace, diagnosisClaim, drugClass, drugGroup,
           rxCost, callCategory, callInquiry, callDisposition, callState,
           surgeryAmount, newCPDAmount, newHyperAmount, newTop5Amount, newCHFAmount, newDiabAmount]
In [0]:
from functools import reduce
mergedDf = reduce(lambda left, right: pd.merge(
    left, right, on=['id'], how='outer'), features)
In [0]:
# Impute missing value
mergedDf = mergedDf.fillna(0)
In [0]:
mergedDf.shape
Out[0]:
(19982, 1009)
In [0]:
# Split train data and test data
trainDf = mergedDf.merge(trainID, on='id')
testDf = mergedDf.merge(testID, on='id')
In [0]:
# Add label to train data
trainDf = trainDf.merge(label, how='outer', on='id').fillna(0)
In [0]:
# Save data
trainDf.to_csv('trainDf.csv', index=False)
testDf.to_csv('testDf.csv', index=False)

Recursive Feature Elimination

  • Recursive Feature Elimination with ensembles of Extra Trees is used as the estimator to compute the relative importance of each attribute.
In [ ]:
# import useful packages
from sklearn import datasets
from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV, RFE
import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_columns = None
In [0]:
X = trainDf.drop(['LTOP','id'], axis=1)
target = trainDf['LTOP']
In [0]:
from sklearn.ensemble import ExtraTreesClassifier
In [0]:
model = ExtraTreesClassifier()
model.fit(X, target)
Out[0]:
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
                     oob_score=False, random_state=None, verbose=0,
                     warm_start=False)
In [0]:
# visualize the important features
dset = pd.DataFrame()
dset['attr'] = X.columns
dset['importance'] = model.feature_importances_
dset = dset.sort_values(by='importance', ascending=False)
dplot = dset.head(10)

plt.figure(figsize=(16, 8))
plt.barh(y=dplot['attr'], width=dplot['importance'], color='cornflowerblue')
plt.title('Feature Importances', fontsize=20, fontweight='bold', pad=20)
plt.xlabel('Importance', fontsize=14, labelpad=20)
plt.show()
  • According to the chart above, the most important features are Opioid Day 0 Supplies (opioid supplies on the first opioid naive day), Opioid Days Supplied (opioid supplied days before first opioid naive day), Pain (whether patient use pain drug), Opioid Agonists (opioid type), Specialty_Anesthesiology Pain Medicine (Specialty of doctor giving opioid on Day 0).
  • Nearly all of the important features are in opioid data, which totally makes sense because this category is most related to LTOT. Patients' behavior in the future highly depends on their historical opioid usage.

Recommendations based on the result

image.png

1. Opioid Data

  • The first two features are Opioid Day 0 Supplies, which refers to the opioid drug supplies on day 0 and Opioid Days Supplied, which means the sum of opioid days supplied. They convey the information about how deeply the patients relied on opioid drugs. If patients get a lot more opioid drugs on the first day, they are more likely to order more in the future, while opioid days supplied are a direct indicator of LTOT, which undoubtedly greatly contributes to the predictions.
  • Opioid Agonists and Opioid Combinations belong to opioid drug class, which implies that patients who order these two kinds of drugs are more likely to have LTOT. Also, in Opioid Combinations, Hydrocodone/Acetaminophen may be the most important brand.
  • As for the specialty on day 0, Specialty_Anesthesiology Pain Medicine and Speciaty_Family Practice stand out. Patients who receive opioids from doctors who specialized in Anesthesiology Pain Medicine and Family Practice have a higher probability of having LTOT.
  • Opioid Count is the count of opioid prescriptions. If a patient has more RX claims related to opioids, he/she may get more chances to have LTOT in the future.

2. Other Categories

  • Pain belongs to drug groups in other RX claims. It is quite reasonable because opioids are primarily used for pain relief medically.

Modeling

image.png

1. Load Data

In [0]:
trainDf = pd.read_csv("/content/trainDf.csv")
testDf = pd.read_csv("/content/testDf.csv")
In [0]:
testID = pd.read_csv("/content/testID.csv")
In [7]:
print(trainDf.shape)
print(testDf.shape)
(13997, 1010)
(5998, 1009)

2. Feature Scaling

  • Feature scaling basically helps to normalise the data within a particular range.It also helps in speeding up the calculations in logistic algorithm.
In [0]:
allData = pd.concat([trainDf.drop(['LTOP'], axis=1).set_index('id'), testDf.set_index('id')], axis=0)
In [0]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
allData = pd.DataFrame(scaler.fit_transform(allData), columns=allData.columns)
In [0]:
X = allData.iloc[:13997, ]
X_testDf = allData.iloc[13997:, ]

y = trainDf.LTOP
In [13]:
print(X.shape)
print(y.shape)
print(X_testDf.shape)
(13997, 1008)
(13997,)
(5998, 1008)

3. Model Tuning

  • In the modeling process, we tried three algorithms: Logistic Regression, Random Forest, and Light Gradient Boosting. We used 10-fold cross-validation, and the AUC score to tune and evaluate our models.
In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2019)
In [0]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_curve, auc, roc_auc_score
from sklearn.model_selection import KFold, cross_val_score
  • This function is created to show the confusion matrix and classification report of the model
In [0]:
def getAccuracy(model, threshold):
    
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]
    y_pred = np.where(probs >= threshold, 1, 0)
    
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("\n")
    print("Classification Report:\n",classification_report(y_test, y_pred))
    return accuracy_score(y_test, y_pred)
  • This function is used for excuting a 10-fold cross validation
In [0]:
n_folds = 10

def cv_ROC_AUC(model):
    """
    Return the average AUC score.
    """
    # Set KFold to shuffle data before the split
    kf = KFold(n_folds, shuffle=True, random_state=42)
    
    # Get accuracy score
    roc_auc = cross_val_score(model, X, y, scoring="roc_auc", cv=kf)
    
    return roc_auc.mean()
  • This fuction is used for getting the ROC score of the model
In [0]:
def get_ROC_AUC(model):
    """
    Return AUC score
    """
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:,1]
    roc_auc = roc_auc_score(y_test, probs)
    return roc_auc
  • This function is created to visualize the ROC AUC and find the optimal threshold of the model
In [0]:
def plotROC(model):
    """
    1. Plot ROC AUC
    2. Return the best threshold
    """
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)
    preds = probs[:,1]
    fpr, tpr, threshold = roc_curve(y_test, preds)
    roc_auc = auc(fpr, tpr)
    
    # Plot ROC AUC
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
    
    # Find optimal threshold
    rocDf = pd.DataFrame({'fpr': fpr, 'tpr':tpr, 'threshold':threshold})
    rocDf['tpr - fpr'] = rocDf.tpr - rocDf.fpr
    optimalThreshold = rocDf.threshold[rocDf['tpr - fpr'].idxmax()]
    
    return optimalThreshold
  • This function is created to display the results of modeling
In [0]:
def evaluate(model):
    optimalThreshold = plotROC(model)
    print("Optimal Threshold: ", optimalThreshold)
    print("\n")
    print("Accuracy: ", getAccuracy(model, optimalThreshold))

3.1 Random Forest

In [0]:
from sklearn.ensemble import RandomForestClassifier
In [0]:
rfc = RandomForestClassifier(n_estimators=1000)
print("AUC score: ", get_ROC_AUC(rfc))
AUC score:  0.8643438495519058
In [0]:
evaluate(rfc)
Optimal Threshold:  0.539


Confusion Matrix:
 [[1301  293]
 [ 250  956]]


Classification Report:
               precision    recall  f1-score   support

         0.0       0.84      0.82      0.83      1594
         1.0       0.77      0.79      0.78      1206

    accuracy                           0.81      2800
   macro avg       0.80      0.80      0.80      2800
weighted avg       0.81      0.81      0.81      2800

Accuracy:  0.8060714285714285

3.2 Logistic Regression

In [0]:
from sklearn.linear_model import LogisticRegression
In [0]:
logit = LogisticRegression(penalty='l1', C=0.5, max_iter=1000)
print("AUC score: ", get_ROC_AUC(logit))
AUC score:  0.8342988112553085
In [0]:
evaluate(logit)
Optimal Threshold:  0.39689265894596626


Confusion Matrix:
 [[1232  362]
 [ 257  949]]


Classification Report:
               precision    recall  f1-score   support

         0.0       0.83      0.77      0.80      1594
         1.0       0.72      0.79      0.75      1206

    accuracy                           0.78      2800
   macro avg       0.78      0.78      0.78      2800
weighted avg       0.78      0.78      0.78      2800

Accuracy:  0.7789285714285714

3.3 LightGBM

In [0]:
from lightgbm import LGBMClassifier
In [22]:
# Model tuning
paraList = [500, 1000, 1500]
roc_auc_list = []

for i in paraList:
    lgb = LGBMClassifier(objective='binary',
                         learning_rate=0.049,
                         n_estimators=i, # 1462
                         num_leaves=8,
                         min_data_in_leaf=4,
                         max_depth=3,
                         max_bin=41,
                         bagging_fraction=0.845,
                         bagging_freq=5,
                         feature_fraction=0.24,
                         feature_fraction_seed=9,
                         bagging_seed=9,
                         min_sum_hessian_in_leaf=11)
    roc_auc = get_ROC_AUC(lgb)
    roc_auc_list.append(roc_auc)
    print("Parameter: ", i, "\tAUC score: ", roc_auc)

plt.plot(np.array(paraList),
         np.array(roc_auc_list))
Parameter:  500 	AUC score:  0.8708725298642713
Parameter:  1000 	AUC score:  0.871080607002628
Parameter:  1500 	AUC score:  0.8698815624928474
Out[22]:
[<matplotlib.lines.Line2D at 0x7fd6d3437320>]
In [23]:
# Best model chosen after tuning
lgb = LGBMClassifier(objective='binary',
                        learning_rate=0.03,
                        n_estimators=1130,
                        num_leaves=8,
                        min_data_in_leaf=1,
                        max_depth=10,
                        max_bin=80,
                        bagging_fraction=0.8,
                        bagging_freq=5,
                        feature_fraction=0.24,
                        feature_fraction_seed=9,
                        bagging_seed=9,
                        min_sum_hessian_in_leaf=12)
cv_ROC_AUC(lgb)
Out[23]:
0.8741430509656014
In [24]:
evaluate(lgb)
Optimal Threshold:  0.37778002376412156


Confusion Matrix:
 [[1206  388]
 [ 176 1030]]


Classification Report:
               precision    recall  f1-score   support

         0.0       0.87      0.76      0.81      1594
         1.0       0.73      0.85      0.79      1206

    accuracy                           0.80      2800
   macro avg       0.80      0.81      0.80      2800
weighted avg       0.81      0.80      0.80      2800

Accuracy:  0.7985714285714286

Predict

choose LightGBM as the final model

In [0]:
# Use test data to make predictions
lgb.fit(X, y)
predictions = lgb.predict_proba(X_testDf)[:, 1]
  • Calculate the probability of LTOT for each patient and rank them by descending order
In [27]:
output = pd.DataFrame({'ID': testDf.id,
                       'SCORE': predictions})

testID.columns = ['ID']
output = output.merge(testID, on='ID', how='outer').fillna(0)
output['RANK'] = output.SCORE.rank(method='min', ascending=False)
output = output.sort_values('SCORE', ascending=False)
output.head(10)
Out[27]:
ID SCORE RANK
5026 ID85903555148 0.995055 1.0
1365 ID3065072932 0.992463 2.0
4319 ID74889635841 0.989188 3.0
4261 ID74172758671 0.987919 4.0
5063 ID86509078362 0.986938 5.0
1977 ID40152538353 0.986662 6.0
644 ID19551240941 0.986612 7.0
735 ID21086130021 0.986352 8.0
2245 ID44040914598 0.986309 9.0
5192 ID88456216402 0.985301 10.0
In [28]:
pd.Series(np.where(output.SCORE >= 0.5, 1, 0)).value_counts()
Out[28]:
1    3019
0    2981
dtype: int64
In [29]:
y.value_counts()
Out[29]:
0.0    7670
1.0    6327
Name: LTOP, dtype: int64

image.png