Google Play Apps Rating Predictions¶

What kind of Android apps are most likely to get high ratings?

Load Data¶

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

raw = pd.read_csv("/content/drive/My Drive/Home/bibina Google play/googleplaystore.csv")

raw.head()

raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB

# take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
data = raw.dropna().copy()

EDA¶

Category¶

First, let's take a look at the top 5 categories with the highest ratings

# select Rating before averaging; the other columns are still strings at this point
data.groupby('Category')[['Rating']].mean().sort_values(by='Rating',ascending=False).head(5)

plt.figure(figsize=(16,8))
fig = sns.boxplot(x="Category",y="Rating",data=data,palette = "Set1")
fig.set_xticklabels(fig.get_xticklabels(),rotation=90)
plt.title('Distribution of Ratings in Each Category',fontsize = 20)

Text(0.5, 1.0, 'Distribution of Ratings in Each Category')

From this box plot, we can see that ratings in the "Events" and "Education" categories are fairly steady: even the lowest rating is above 3.5.
Ratings in the "Business", "Dating", "Finance", "Lifestyle" and "Tools" categories vary a lot.
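To put numbers behind "steady" versus "vary a lot", one option is to summarize the spread of ratings per category. The sketch below is not part of the original notebook: `rating_spread` is a hypothetical helper, and the small `demo` frame is made-up data that only mimics the `Category` and `Rating` columns.

```python
import pandas as pd

def rating_spread(df):
    """Per-category min, median and standard deviation of Rating, widest spread first."""
    stats = df.groupby("Category")["Rating"].agg(["min", "median", "std"])
    return stats.sort_values(by="std", ascending=False)

# Made-up ratings, for illustration only
demo = pd.DataFrame({
    "Category": ["EVENTS", "EVENTS", "EVENTS", "TOOLS", "TOOLS", "TOOLS"],
    "Rating": [4.4, 4.5, 4.6, 1.5, 4.0, 5.0],
})
spread = rating_spread(demo)
print(spread)
```

On the real data this would be called as `rating_spread(data)`.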

Reviews¶

pd.to_numeric(data.Reviews)

0           159
1           967
2         87510
3        215644
4           967
          ...  
10834         7
10836        38
10837         4
10839       114
10840    398307
Name: Reviews, Length: 9360, dtype: int64

plt.figure(figsize=(10,5))
sns.regplot(x=pd.to_numeric(data.Reviews),y=data.Rating,data=data)
plt.title('Rating vs Reviews',fontsize = 20)

Text(0.5, 1.0, 'Rating vs Reviews')

The number of reviews seems to be slightly positively correlated with ratings: apps with more reviews tend to have somewhat higher ratings.
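Because review counts span several orders of magnitude, a correlation coefficient computed on a log scale can complement the regression plot. This is a minimal sketch rather than the notebook's own method: `log_review_corr` is a hypothetical helper and the `demo` values are invented.

```python
import numpy as np
import pandas as pd

def log_review_corr(df):
    """Pearson correlation between log(1 + Reviews) and Rating."""
    log_reviews = np.log1p(pd.to_numeric(df["Reviews"]))
    return log_reviews.corr(df["Rating"])

# Invented values, for illustration only
demo = pd.DataFrame({
    "Reviews": ["10", "100", "1000", "10000", "100000"],
    "Rating": [3.8, 4.0, 4.1, 4.3, 4.5],
})
r = log_review_corr(demo)
print(round(r, 3))
```

On the real data, `log_review_corr(data)` would give a single number to set beside the "slightly correlated" reading of the plot.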

Size¶

Take a rough look at the range of size.

data.Size.unique()

array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '5.5M', '17M', '39M', '31M',
       '4.2M', '23M', '6.0M', '6.1M', '4.6M', '9.2M', '5.2M', '11M',
       '24M', 'Varies with device', '9.4M', '15M', '10M', '1.2M', '26M',
       '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k', '3.6M', '5.7M',
       '8.6M', '2.4M', '27M', '2.7M', '2.5M', '7.0M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '22M', '6.4M', '3.2M', '8.2M', '4.9M', '9.5M', '5.0M',
       '5.9M', '13M', '73M', '6.8M', '3.5M', '4.0M', '2.3M', '2.1M',
       '42M', '9.1M', '55M', '23k', '7.3M', '6.5M', '1.5M', '7.5M', '51M',
       '41M', '48M', '8.5M', '46M', '8.3M', '4.3M', '4.7M', '3.3M', '40M',
       '7.8M', '8.8M', '6.6M', '5.1M', '61M', '66M', '79k', '8.4M',
       '3.7M', '118k', '44M', '695k', '1.6M', '6.2M', '53M', '1.4M',
       '3.0M', '7.2M', '5.8M', '3.8M', '9.6M', '45M', '63M', '49M', '77M',
       '4.4M', '70M', '9.3M', '8.1M', '36M', '6.9M', '7.4M', '84M', '97M',
       '2.0M', '1.9M', '1.8M', '5.3M', '47M', '556k', '526k', '76M',
       '7.6M', '59M', '9.7M', '78M', '72M', '43M', '7.7M', '6.3M', '334k',
       '93M', '65M', '79M', '100M', '58M', '50M', '68M', '64M', '34M',
       '67M', '60M', '94M', '9.9M', '232k', '99M', '624k', '95M', '8.5k',
       '41k', '292k', '80M', '1.7M', '10.0M', '74M', '62M', '69M', '75M',
       '98M', '85M', '82M', '96M', '87M', '71M', '86M', '91M', '81M',
       '92M', '83M', '88M', '704k', '862k', '899k', '378k', '4.8M',
       '266k', '375k', '1.3M', '975k', '980k', '4.1M', '89M', '696k',
       '544k', '525k', '920k', '779k', '853k', '720k', '713k', '772k',
       '318k', '58k', '241k', '196k', '857k', '51k', '953k', '865k',
       '251k', '930k', '540k', '313k', '746k', '203k', '26k', '314k',
       '239k', '371k', '220k', '730k', '756k', '91k', '293k', '17k',
       '74k', '14k', '317k', '78k', '924k', '818k', '81k', '939k', '169k',
       '45k', '965k', '90M', '545k', '61k', '283k', '655k', '714k', '93k',
       '872k', '121k', '322k', '976k', '206k', '954k', '444k', '717k',
       '210k', '609k', '308k', '306k', '175k', '350k', '383k', '454k',
       '1.0M', '70k', '812k', '442k', '842k', '417k', '412k', '459k',
       '478k', '335k', '782k', '721k', '430k', '429k', '192k', '460k',
       '728k', '496k', '816k', '414k', '506k', '887k', '613k', '778k',
       '683k', '592k', '186k', '840k', '647k', '373k', '437k', '598k',
       '716k', '585k', '982k', '219k', '55k', '323k', '691k', '511k',
       '951k', '963k', '25k', '554k', '351k', '27k', '82k', '208k',
       '551k', '29k', '103k', '116k', '153k', '209k', '499k', '173k',
       '597k', '809k', '122k', '411k', '400k', '801k', '787k', '50k',
       '643k', '986k', '516k', '837k', '780k', '20k', '498k', '600k',
       '656k', '221k', '228k', '176k', '34k', '259k', '164k', '458k',
       '629k', '28k', '288k', '775k', '785k', '636k', '916k', '994k',
       '309k', '485k', '914k', '903k', '608k', '500k', '54k', '562k',
       '847k', '948k', '811k', '270k', '48k', '523k', '784k', '280k',
       '24k', '892k', '154k', '18k', '33k', '860k', '364k', '387k',
       '626k', '161k', '879k', '39k', '170k', '141k', '160k', '144k',
       '143k', '190k', '376k', '193k', '473k', '246k', '73k', '253k',
       '957k', '420k', '72k', '404k', '470k', '226k', '240k', '89k',
       '234k', '257k', '861k', '467k', '676k', '552k', '582k', '619k'],
      dtype=object)

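Convert the size strings to numbers in kilobytes: strip the 'k' and 'M' suffixes (multiplying 'M' values by 1000) and treat 'Varies with device' as missing.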

# 'Varies with device' has no numeric size, so treat it as missing
data.Size = data.Size.replace('Varies with device', np.nan)
# numeric part times unit multiplier: sizes end up in kilobytes (k -> 1, M -> 1000)
data.Size = (data.Size.replace(r'[kM]+$', '', regex=True).astype(float) *
             data.Size.str.extract(r'[\d\.]+([kM]+)', expand=False)
             .map({'k': 1, 'M': 1000}).fillna(1))

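The size conversion can also be written as a small standalone function, which makes it easier to test on a few values. This is a sketch under the same assumption as above (sizes are expressed in kilobytes, with 'M' counted as 1000 k); `size_to_kb` is a hypothetical name, not something defined in the notebook.

```python
import pandas as pd

def size_to_kb(sizes):
    """Convert strings such as '19M' or '201k' to kilobytes; anything else becomes NaN."""
    parts = sizes.str.extract(r"^([\d.]+)([kM])$")
    number = pd.to_numeric(parts[0], errors="coerce")
    factor = parts[1].map({"k": 1.0, "M": 1000.0})
    return number * factor

demo_sizes = pd.Series(["19M", "201k", "Varies with device", "8.5k"])
converted = size_to_kb(demo_sizes)
print(converted)
```

Values that do not match the pattern, such as 'Varies with device', come out as NaN rather than raising an error.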

plt.figure(figsize=(10,5))
sns.regplot(x=pd.to_numeric(data.Size),y=data.Rating,data=data)
plt.title('Rating vs Size',fontsize = 20)

Text(0.5, 1.0, 'Rating vs Size')

Size seems to have little effect on ratings, although larger apps have slightly higher ratings on average.

Installs¶

data.Installs.unique()

array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '100+', '500+', '10+',
       '5+', '50+', '1+'], dtype=object)

# average only the Rating column; the remaining columns are not all numeric
ins = data.groupby('Installs')[['Rating']].mean()

plt.figure(figsize=(16,8))
fig = sns.barplot(x=ins.index,y=ins['Rating'],data=data,palette = "Set1")
fig.set_xticklabels(fig.get_xticklabels(),rotation=60)
plt.title('Ratings vs Install Times',fontsize = 20)

Text(0.5, 1.0, 'Ratings vs Install Times')

Apps with very few installs, as well as apps with a very large number of installs, have relatively higher ratings.
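The install buckets are strings, so grouping by them orders the bars alphabetically rather than by magnitude. One way to read the pattern more easily is to convert the labels to integers before grouping. The sketch below is an assumption about how that could be done, not part of the original analysis: `installs_to_int` is a hypothetical helper and `demo` is made-up data.

```python
import pandas as pd

def installs_to_int(installs):
    """Turn labels such as '10,000+' into integers such as 10000."""
    return installs.str.replace(r"[,+]", "", regex=True).astype("int64")

# Made-up rows, for illustration only
demo = pd.DataFrame({
    "Installs": ["1,000+", "5+", "1,000,000,000+", "5+"],
    "Rating": [4.0, 4.6, 4.4, 4.2],
})
demo["InstallsNum"] = installs_to_int(demo["Installs"])
# groupby sorts the numeric keys, so buckets now run from fewest to most installs
ordered = demo.groupby("InstallsNum")["Rating"].mean()
print(ordered)
```

Plotting `ordered` instead of the string-indexed means would place the low-install and high-install buckets at opposite ends of the x-axis.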

Price¶

data.Price = data.Price.replace(r'[$]+', '', regex=True).astype(float)


plt.figure(figsize=(10,5))
sns.regplot(x=data.Price,y=data.Rating)
plt.title('Rating vs Price',fontsize = 20)

Text(0.5, 1.0, 'Rating vs Price')

plt.figure(figsize=(10,5))
sns.regplot(x='Price',y='Rating',data=data[data.Price<=50])
plt.title('Rating vs Price(less than $50)',fontsize = 20)

Text(0.5, 1.0, 'Rating vs Price(less than $50)')

Apps with higher prices tend to have higher ratings, but many free apps also get a full five-star rating. Let's take a look at them.

fivestar = data[(data.Price==0)&(data.Rating==5.0)]

fivestar.shape

(246, 13)

plt.figure(figsize=(16,8))
fig = sns.countplot(x='Category', data=fivestar,palette='Set1')
fig.set_xticklabels(fig.get_xticklabels(),rotation=60)
plt.title('5 Star Free Apps in Each Category ',fontsize = 20)

Text(0.5, 1.0, '5 Star Free Apps in Each Category ')

There are nearly 60 five-star free apps in the "Family" category. The "Medical" and "Lifestyle" categories also did pretty well.

Predictive Analysis¶

The raw.info() output above shows that 1,474 of the 10,841 ratings are missing. So, next, I'm going to use the rows with complete information to predict those missing ratings.

Feature Engineering¶

Now, it's time to choose useful features for the predictive model. I will drop the following columns:

App: works like an ID, so it should be excluded from the features
Last Updated: by intuition, it seems to have nothing to do with ratings
Current Ver: the format is not consistent
Type: duplicates information already in Price
Genres: duplicates information already in Category
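The claim that Type duplicates Price can be checked directly before dropping the column. The following is only a sketch: `type_matches_price` is a hypothetical helper, the `demo` frame is invented, and it assumes prices are strings such as '0' or '$4.99' as in the raw file.

```python
import pandas as pd

def type_matches_price(df):
    """True if Type == 'Free' holds exactly for the rows whose price is 0."""
    price = df["Price"].str.replace("$", "", regex=False).astype(float)
    is_free = df["Type"] == "Free"
    return bool(((price == 0) == is_free).all())

# Invented rows, for illustration only
demo = pd.DataFrame({"Type": ["Free", "Paid", "Free"], "Price": ["0", "$4.99", "0"]})
inconsistent = pd.DataFrame({"Type": ["Paid"], "Price": ["0"]})
print(type_matches_price(demo), type_matches_price(inconsistent))
```

A result of False on the real data would mean some rows disagree, in which case dropping Type would lose information.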

raw.drop(['App','Last Updated','Current Ver','Type','Genres'],axis=1,inplace=True)

raw = raw[raw.Size!='1,000+']

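# drop apps without a fixed size; copy so the assignment below acts on an independent frame
raw = raw[raw.Size!='Varies with device'].copy()
# numeric part times unit multiplier: sizes end up in kilobytes (k -> 1, M -> 1000)
raw.Size = (raw.Size.replace(r'[kM]+$', '', regex=True).astype(float) *
            raw.Size.str.extract(r'[\d\.]+([kM]+)', expand=False)
            .map({'k': 1, 'M': 1000}).fillna(1))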

raw.Reviews = pd.to_numeric(raw.Reviews)

Divide features into numerical and categorical ones.

num_col = [cname for cname in raw.columns if raw[cname].dtype in ['int64','float64']]

cat_col = [cname for cname in raw.columns if raw[cname].dtype=='object']

cat_col

['Category', 'Installs', 'Price', 'Content Rating', 'Android Ver']

num_col

['Rating', 'Reviews', 'Size']

Perform One Hot Encoding on categorical variables.

raw.head()

new = pd.get_dummies(raw, prefix=cat_col, drop_first=True)

new
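A toy example may help show what `drop_first=True` does to the encoded columns. The frame below is made up and only borrows two category names from the dataset; it is a sketch, not the notebook's actual data.

```python
import pandas as pd

# Made-up frame with one categorical and one numerical column
demo = pd.DataFrame({
    "Category": ["ART_AND_DESIGN", "BEAUTY", "ART_AND_DESIGN"],
    "Reviews": [159, 182, 967],
})
# The first level (ART_AND_DESIGN) is dropped and becomes the implicit baseline
encoded = pd.get_dummies(demo, prefix=["Category"], drop_first=True)
print(list(encoded.columns))
```

With `drop_first=True`, a row whose dummy columns for a feature are all zero belongs to the dropped first level of that feature.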

Set target variable: Rating

#filter out the test set
test = new[new.Rating.isna()]

label = new.Rating.dropna()

label

0        4.1
1        3.9
2        4.7
3        4.5
4        4.3
        ... 
10833    4.8
10834    4.0
10836    4.5
10837    5.0
10840    4.5
Name: Rating, Length: 7729, dtype: float64

data = new.dropna().drop('Rating',axis=1)

Modeling¶

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Standard Scaler

scaler = StandardScaler()
X = scaler.fit_transform(data)

Train Test Split

X_train, X_test, y_train, y_test = train_test_split( X, label, test_size=0.2, random_state=101)
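Fitting the scaler on all rows before splitting lets the held-out rows influence the scaling statistics. An alternative is to wrap the scaler and the model in a pipeline, so the scaler is refit on the training part of every fold. The sketch below illustrates that idea on synthetic data and swaps in `Ridge` as a lightweight stand-in for the regressor; `X_demo`, `y_demo` and `pipe` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([0.5, -0.2, 0.0, 0.1, 0.3]) + rng.normal(scale=0.1, size=200)

# The scaler is fit inside each CV fold, on that fold's training rows only
pipe = make_pipeline(StandardScaler(), Ridge())
mae_scores = -cross_val_score(pipe, X_demo, y_demo, cv=5,
                              scoring="neg_mean_absolute_error")
print(mae_scores.mean())
```

Tree-based models are largely insensitive to feature scaling, so the difference is likely small here, but the pipeline pattern carries over unchanged to other estimators.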

XGBoost Regressor

from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

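def getAccuracy_cv(model):
    """Fit on the training split, then print the mean 10-fold CV MAE over all labeled rows."""
    from sklearn.model_selection import cross_val_score
    model.fit(X_train,y_train)
    # hold-out predictions (computed here but not scored)
    preds = model.predict(X_test)
    # cross_val_score returns negative MAE, so flip the sign
    scores = -1 * cross_val_score(model,X,label,cv=10,scoring = 'neg_mean_absolute_error')
    print(scores.mean())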

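# Hyperparameter Tuning: compare different numbers of boosting rounds
paraList = [500, 1000, 1500]
for i in paraList:
    model = XGBRegressor(
        objective='reg:squarederror',  # explicit objective; avoids the reg:linear deprecation warning
        learning_rate=0.1,
        n_estimators=i,
        max_depth=5,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        nthread=4,
        scale_pos_weight=1,
        seed=27)
    getAccuracy_cv(model)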

0.36100240595439537
0.36911859160298965
0.37501136239983757
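A mean absolute error is easier to judge next to a naive baseline, such as always predicting the mean rating. The sketch below shows the calculation on the first five labels printed earlier; `baseline_mae` is a hypothetical helper, and the result for five rows says nothing about the baseline on the full data.

```python
import numpy as np

def baseline_mae(y):
    """MAE obtained by always predicting the mean of y."""
    y = np.asarray(y, dtype=float)
    return float(np.mean(np.abs(y - y.mean())))

# First five ratings from the label series shown earlier
demo_labels = [4.1, 3.9, 4.7, 4.5, 4.3]
print(baseline_mae(demo_labels))
```

Calling `baseline_mae(label)` on the full label series would give the figure to compare against the cross-validated MAE of the model.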

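# n_estimators=500 gave the lowest cross-validated MAE in the comparison above
final_model = XGBRegressor(
    objective='reg:squarederror',  # explicit objective; avoids the reg:linear deprecation warning
    learning_rate=0.1,
    n_estimators=500,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    nthread=4,
    scale_pos_weight=1,
    seed=27)
getAccuracy_cv(final_model)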

0.36100240595439537

Predict¶

Repeat the preprocessing steps on test data

test = test.drop('Rating',axis=1)
test.head()
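Repeating the preprocessing on the unlabeled rows mainly means giving them the same columns, in the same order, and the same scaling as the training features. The sketch below shows that step on made-up frames; `train_demo`, `test_demo` and `scaler_demo` are hypothetical names, and the final prediction call is left as a comment because it depends on the fitted model.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up feature frames; the test frame has its columns in a different order
train_demo = pd.DataFrame({"Reviews": [159, 967, 87510],
                           "Size": [19000.0, 14000.0, 8700.0]})
test_demo = pd.DataFrame({"Size": [7000.0, 5700.0], "Reviews": [61, 182]})

# Fit the scaler on the training features only
scaler_demo = StandardScaler().fit(train_demo)

# Align the test columns with the training columns (missing dummies would become 0)
test_aligned = test_demo.reindex(columns=train_demo.columns, fill_value=0)
test_scaled = scaler_demo.transform(test_aligned)
print(test_scaled.shape)

# With a fitted model, predictions would then follow, e.g.:
# predictions = final_model.predict(test_scaled)
```

Reusing the scaler fitted on the training features, rather than fitting a new one on the test rows, keeps both sets on the same scale.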

Top 5 categories by mean rating (output of the groupby in the Category section):

Category	Rating
EVENTS	4.435556
EDUCATION	4.389032
ART_AND_DESIGN	4.377049
BOOKS_AND_REFERENCE	4.346067
PERSONALIZATION	4.335256

Output of test.head() (middle columns elided):

	Reviews	Size	Category_BEAUTY	Installs_10,000+	...	Price_0	Content Rating_Everyone	Content Rating_Everyone 10+	Content Rating_Mature 17+	Content Rating_Teen	Android Ver_2.3 and up	Android Ver_3.0 and up	Android Ver_4.1 and up
23	61	7000.0	0	0	...	1	1	0	0	0	0	0	1
113	182	5700.0	1	0	...	1	0	1	0	0	0	1	0
123	119	3700.0	1	0	...	1	1	0	0	0	0	0	1
126	654	7400.0	1	0	...	1	0	0	0	1	0	0	1
129	77	2900.0	1	1	...	1	0	0	1	0	1	0	0

Output of the first raw.head() (Load Data section):

	App	Category	Rating	Reviews	Size	Installs	Type	Content Rating	Genres	Last Updated	Current Ver	Android Ver
0	Photo Editor & Candy Camera & Grid & ScrapBook	ART_AND_DESIGN	4.1	159	19M	10,000+	Free	Everyone	Art & Design	January 7, 2018	1.0.0	4.0.3 and up
1	Coloring book moana	ART_AND_DESIGN	3.9	967	14M	500,000+	Free	Everyone	Art & Design;Pretend Play	January 15, 2018	2.0.0	4.0.3 and up
2	U Launcher Lite – FREE Live Cool Themes, Hide ...	ART_AND_DESIGN	4.7	87510	8.7M	5,000,000+	Free	Everyone	Art & Design	August 1, 2018	1.2.4	4.0.3 and up
3	Sketch - Draw & Paint	ART_AND_DESIGN	4.5	215644	25M	50,000,000+	Free	Teen	Art & Design	June 8, 2018	Varies with device	4.2 and up
4	Pixel Draw - Number Art Coloring Book	ART_AND_DESIGN	4.3	967	2.8M	100,000+	Free	Everyone	Art & Design;Creativity	June 20, 2018	1.1	4.4 and up

Output of raw.head() after dropping columns and converting Size (Feature Engineering section):

	Category	Rating	Reviews	Size	Installs	Content Rating	Android Ver
0	ART_AND_DESIGN	4.1	159	19000.0	10,000+	Everyone	4.0.3 and up
1	ART_AND_DESIGN	3.9	967	14000.0	500,000+	Everyone	4.0.3 and up
2	ART_AND_DESIGN	4.7	87510	8700.0	5,000,000+	Everyone	4.0.3 and up
3	ART_AND_DESIGN	4.5	215644	25000.0	50,000,000+	Teen	4.2 and up
4	ART_AND_DESIGN	4.3	967	2800.0	100,000+	Everyone	4.4 and up