Quarantine drives me nuts. Over the last two months I got obsessed with watching YouTube videos, spending around five hours on YouTube every day. My favorite categories are fitness, food tours, and cooking. I'm curious about which videos are most popular across the whole US, and why people get addicted to YouTube so easily!
from google.colab import drive
drive.mount('/content/drive')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import json
raw = pd.read_csv('/content/drive/My Drive/Home/YouTube playplay bibina/USvideos.csv',index_col='video_id')
raw.head(3)
# trending_date is stored as yy.dd.mm, so parse it with that format
raw.trending_date = pd.to_datetime(raw.trending_date, format='%y.%d.%m')
raw.columns
raw.drop(['publish_time', 'thumbnail_link', 'comments_disabled', 'ratings_disabled',
          'video_error_or_removed'], axis=1, inplace=True)
raw.category_id.max()
# Map numeric category_id codes to readable category names from the JSON file
with open('/content/drive/My Drive/Home/YouTube playplay bibina/US_category_id.json') as f:
    data = json.load(f)
ids = [int(item['id']) for item in data['items']]
cats = [item['snippet']['title'] for item in data['items']]
cat_info = pd.DataFrame({'id': ids, 'name': cats})
next = pd.merge(raw,cat_info,how='inner',left_on='category_id',right_on='id')
min(next.trending_date)
max(next.trending_date)
def plotbar(metric):
    # Sum the metric by category and plot the 10 categories with the highest totals
    df = next.groupby('name').sum().reset_index().sort_values(by=metric, ascending=False)
    plt.figure(figsize=(16, 4))
    sns.barplot(x='name', y=metric, data=df.head(10), palette='Set2')
    plt.title("Top 10 Categories with the most " + metric, fontsize=20)
plotbar('views')
plotbar('likes')
plotbar('dislikes')
plotbar('comment_count')
channel = (next.groupby('channel_title').sum()
               .sort_values(by='views', ascending=False)
               .head(10)
               .drop(['category_id', 'id'], axis=1))
channel
Note: ibighit is a K-pop music channel, Dude Perfect is a sports channel, and FoxStarHindi is a Bollywood movie channel.
(next[next.name == 'Travel & Events'].groupby('channel_title').sum()
     .sort_values(by='views', ascending=False)
     .head(10)
     .drop(['category_id', 'id'], axis=1))
next['month'] = next['trending_date'].dt.month
next['dayofweek'] = next['trending_date'].dt.dayofweek
month = next.groupby('month').sum()
dayofweek = next.groupby('dayofweek').sum()
plt.figure(figsize=(10,4))
plt.title('Views in every month',fontsize=20)
sns.barplot(x=month.index,y=month.views,data=month,palette = 'Set1')
plt.figure(figsize=(10,4))
plt.title('Views in days of week',fontsize=20)
sns.barplot(x=dayofweek.index, y=dayofweek.views, data=dayofweek, palette='Set1')
import os
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import string
import re
def word_length(data, t):
    # Number of words in column t for each row
    return data[t].apply(lambda s: len(str(s).split()))
def text_process(t):
    """
    Takes in a Series of text, then performs the following:
    1. Remove all special characters, numbers and punctuation
    2. Remove all stopwords
    3. Returns a list of the cleaned words
    """
    stop_words = set(stopwords.words('english'))
    # Keep letters only, replacing every other character with a space
    cleaned = [re.sub(r"[^a-zA-Z]+", ' ', str(s)) for s in t]
    # Split into words and drop stopwords
    return [word for text in cleaned for word in text.split()
            if word.lower() not in stop_words]
next['title_length'] = word_length(next,'title')
next['tag_length'] = word_length(next,'tags')
def plot_length(x):
    plt.figure(figsize=(10, 4))
    plt.title('Relationship between ' + x + ' and views', fontsize=20)
    sns.lineplot(x=x, y='views', data=next)
plot_length('title_length')
plot_length('tag_length')
title = pd.DataFrame(text_process(next['title'].apply(lambda s: s.lower())))
title.columns=['words']
wordlist1 = title['words'].value_counts(ascending=False).head(50)
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import random
def grey_color_func(word, font_size, position, orientation,
                    random_state=None, **kwargs):
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)
def getwordcloud(wl):
    # Join the words into one string and render it as a word cloud
    word = ' '.join(wl.index)
    wordcloud = WordCloud(width=1600, height=800,
                          background_color='black',
                          min_font_size=10).generate(word)
    plt.figure(figsize=(10, 5), facecolor=None)
    plt.imshow(wordcloud.recolor(color_func=grey_color_func, random_state=3),
               interpolation="bilinear")
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
getwordcloud(wordlist1)
tags = pd.DataFrame(text_process(next['tags'].apply(lambda s: s.lower())))
tags.columns=['words']
wordlist2 = tags['words'].value_counts(ascending=False).head(50)
getwordcloud(wordlist2)
YouTube is renowned for its recommendation system. It basically uses a two-stage deep neural network: one network generates candidate recommendations, and a second network ranks those candidates.
In the candidate generation stage, the input features only include user history and context; I would guess features like search history, watch history, subscriptions and saved videos, watch time, and users' demographic information are all taken into account.
In the ranking stage, the inputs cover a much wider range of information: besides user history, they also include video features and other candidate sources. The purpose of this second deep neural net is to score each video selected in the candidate generation stage and rank the candidates by that score.
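To make that generate-then-rank flow concrete, here is a minimal sketch. The embeddings, the dot-product similarity, and the "freshness" feature are all made-up stand-ins for the real neural networks and feature set; it only illustrates the two-stage idea.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical learned embeddings: one row per video, one vector for the user
video_embeddings = rng.normal(size=(100_000, 32))
user_embedding = rng.normal(size=32)

# Stage 1 - candidate generation: cheaply pull a few hundred roughly relevant
# videos out of the whole corpus (cast a wide net)
similarity = video_embeddings @ user_embedding
candidate_ids = np.argsort(similarity)[-200:]

# Stage 2 - ranking: score only those candidates with a richer (here, made-up)
# feature set and sort them, so the best videos end up on top
freshness = rng.random(len(candidate_ids))  # stand-in for the extra ranking features
ranking_scores = similarity[candidate_ids] + 0.1 * freshness
ranked = candidate_ids[np.argsort(ranking_scores)[::-1]]
print("Videos recommended to the user:", ranked[:10])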
Intuitively, the probability is high that people will watch videos similar to the ones they have searched for and watched, so user history and context provide the most relevant signals. The candidate generation network aims for high recall: among all the videos a user would actually enjoy, recall measures what proportion we manage to retrieve, so at this stage it is fine for some candidates to be not very relevant as long as we are not missing the most relevant items. If we only kept the handful of videos we were absolutely sure about, we would miss a lot of good matches. The ranking network then takes care of precision: among the videos we finally show, precision measures what proportion the user actually wants to watch, so the second network scores every candidate and puts the best ones at the top.
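A tiny hand-calculated example with made-up numbers may help to see the trade-off:
# Suppose the user would genuinely enjoy 50 videos in the whole corpus.
# Candidate generation retrieves 200 videos, 40 of which the user would enjoy.
recall = 40 / 50        # 0.8  - only 10 good videos are missed
precision = 40 / 200    # 0.2  - but most candidates are noise

# Ranking then keeps the 20 highest-scoring candidates, 15 of them good ones.
precision_after_ranking = 15 / 20   # 0.75 - the final list is much cleaner
print(recall, precision, precision_after_ranking)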
Now it's time to check out the recommendations at the top of your YouTube homepage. Has YouTube's recommendation system won your heart?