Trending YouTube Videos (USA)


Quarantine drives me nuts. During the last two months, I got obsessed with watching YouTube videos and spent around 5 hours every day on YouTube. My favorite categories are fitness, food tours, and cooking. I'm curious about which videos are the most popular among people across the whole US, and why people get addicted to YouTube so easily!!!


In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Import Packages

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
In [ ]:
import json

Load and Transform Data

In [ ]:
raw = pd.read_csv('/content/drive/My Drive/Home/YouTube playplay bibina/USvideos.csv',index_col='video_id')
In [ ]:
raw.head(3)
Out[ ]:
trending_date title channel_title category_id publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description
video_id
2kyS6SvSYSE 17.14.11 WE WANT TO TALK ABOUT OUR MARRIAGE CaseyNeistat 22 2017-11-13T17:13:01.000Z SHANtell martin 748374 57527 2966 15954 https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg False False False SHANTELL'S CHANNEL - https://www.youtube.com/s...
1ZAPwfrtAFY 17.14.11 The Trump Presidency: Last Week Tonight with J... LastWeekTonight 24 2017-11-13T07:30:00.000Z last week tonight trump presidency|"last week ... 2418783 97185 6146 12703 https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg False False False One year after the presidential election, John...
5qpjK5DgCt4 17.14.11 Racist Superman | Rudy Mancuso, King Bach & Le... Rudy Mancuso 23 2017-11-12T19:05:24.000Z racist superman|"rudy"|"mancuso"|"king"|"bach"... 3191434 146033 5339 8181 https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg False False False WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
  • Standardize the date format and drop columns we won't use
In [ ]:
raw.trending_date = pd.to_datetime(raw.trending_date,format='%y.%d.%m',errors='ignore')
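  • A quick sanity check of the format string (a hypothetical value, parsed by hand): the trending dates are year.day.month, so '17.14.11' should become November 14, 2017.
In [ ]:
# '%y.%d.%m' reads '17.14.11' as year=2017, day=14, month=11
pd.to_datetime('17.14.11', format='%y.%d.%m')
# Timestamp('2017-11-14 00:00:00')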
In [ ]:
raw.columns
Out[ ]:
Index(['trending_date', 'title', 'channel_title', 'category_id',
       'publish_time', 'tags', 'views', 'likes', 'dislikes', 'comment_count',
       'thumbnail_link', 'comments_disabled', 'ratings_disabled',
       'video_error_or_removed', 'description'],
      dtype='object')
In [ ]:
raw.drop(['publish_time', 'thumbnail_link', 'comments_disabled',
          'ratings_disabled', 'video_error_or_removed'],
         axis=1, inplace=True)
In [ ]:
raw.category_id.max()
Out[ ]:
43
  • Extract the category information from the JSON file and merge it with the raw data frame
In [ ]:
with open('/content/drive/My Drive/Home/YouTube playplay bibina/US_category_id.json') as f:
  data = json.load(f)
In [ ]:
# Category ids from the JSON 'items' list
ids = [item['id'] for item in data['items']]
In [ ]:
# Matching category names
cat = [item['snippet']['title'] for item in data['items']]
In [ ]:
# Ids are strings in the JSON, so cast to int to match category_id
cat_info = pd.DataFrame({'id': [int(i) for i in ids], 'name': cat})
In [ ]:
next = pd.merge(raw,cat_info,how='inner',left_on='category_id',right_on='id')
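  • As an aside, the same lookup could be done without a merge, via a plain dict and Series.map. This is only a sketch of the alternative; the merged `next` frame is what the rest of the notebook uses.
In [ ]:
# Alternative (not used below): map category_id -> name directly
cat_map = {int(item['id']): item['snippet']['title'] for item in data['items']}
# raw['name'] = raw['category_id'].map(cat_map)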

EDA

In [ ]:
min(next.trending_date)
Out[ ]:
Timestamp('2017-11-14 00:00:00')
In [ ]:
max(next.trending_date)
Out[ ]:
Timestamp('2018-06-14 00:00:00')

Category

  • First, let's find out the Top 10 categories, measured on several different scales
In [ ]:
def plotbar(metric):
  df = next.groupby('name').sum().reset_index().sort_values(by=metric,ascending=False)
  plt.figure(figsize=(16,4))
  sns.barplot(x='name',y=metric,data=df.head(10),palette='Set2')
  plt.title("Top 10 Categories with the most "+metric, fontsize=20)
In [ ]:
plotbar('views')
  • We can see that music videos racked up about 4 billion views in these 8 months, which doesn't surprise me at all, because singers always release their music videos on YouTube first. Ariana Grande's single music video 'Side to Side' alone has a humongous view count of more than 1.7 billion.


In [ ]:
plotbar('likes')
  • Education is not in the Top 10 categories by views, but it gets a relatively large number of likes. It seems that people appreciate learning about what they do not know.
In [ ]:
plotbar('dislikes')
In [ ]:
plotbar('comment_count')
  • People & Blogs gets more likes, dislikes, and comments relative to its views. Internet influencers are more likely to create topics: they have fanatical followers and also haters, leading to more discussion under their videos.

Channel

In [ ]:
channel = (next.groupby('channel_title').sum()
               .sort_values(by='views', ascending=False)
               .head(10)
               .drop(['category_id', 'id'], axis=1))
  • The Top 10 most popular channels are listed below. Let's take a closer look at each of them
In [ ]:
channel
Out[ ]:
views likes dislikes comment_count
channel_title
ChildishGambinoVEVO 3758488765 96700818 6054434 10151289
ibighit 2235906679 199247121 3467306 31817464
Dude Perfect 1870085178 60275557 1501477 4009163
Marvel Entertainment 1808998971 55873344 1031250 6453560
ArianaGrandeVevo 1576959172 52170970 1931230 4295333
MalumaVEVO 1551515831 23278380 1757948 1227634
jypentertainment 1486972132 44900910 2482131 7575510
Sony Pictures Entertainment 1432374398 30106808 1414686 3533551
FoxStarHindi 1238609854 23762509 910745 1782776
BeckyGVEVO 1182971286 19185287 1616616 1176862
  • Almost all of them are music and entertainment channels! We can easily recognize some singers' and entertainment companies' official channels, and they all have tremendous numbers of followers

Note: ibighit is a K-pop music channel, Dude Perfect is a sports channel, and FoxStarHindi is a Bollywood movie channel.

  • Next, I would like to check out what I am personally interested in. I'm not quite sure what exactly the category names stand for, so I'm going to try "Travel & Events", which seems closest to what I usually watch.
In [ ]:
(next[next.name == 'Travel & Events'].groupby('channel_title').sum()
     .sort_values(by='views', ascending=False)
     .head(10)
     .drop(['category_id', 'id'], axis=1))
Out[ ]:
views likes dislikes comment_count
channel_title
Turkish Airlines 86666242 22641 1826 2433
Strictly Dumpling 69479392 1403067 49109 199585
The Food Ranger 33575837 733155 64278 161080
FOOD INSIDER 23971240 193669 27579 58686
Condé Nast Traveler 18099467 533200 25677 153842
INSIDER 16297287 306600 10681 28188
Sam Chui 11779203 102016 10193 9967
Horizon Productions SFL 11218319 68631 4059 18463
Mark Wiens 10950511 294676 13484 46535
Lyft 10113686 9969 417 944
  • Ta-da! I found what I expected to see! I am a loyal fan of Strictly Dumpling and FOOD INSIDER, but I have no idea why Turkish Airlines ranks the highest.


  • I highly suspect that once I watch one video from a particular category, YouTube recommends the best creators in that category to keep me hooked, so that I can hardly get away from it!!!

Time

In [ ]:
next['month'] = next['trending_date'].dt.month
In [ ]:
next['dayofweek'] = next['trending_date'].dt.dayofweek
In [ ]:
month = next.groupby('month').sum()
dayofweek = next.groupby('dayofweek').sum()
In [ ]:
plt.figure(figsize=(10,4))
plt.title('Views in every month',fontsize=20)
sns.barplot(x=month.index,y=month.views,data=month,palette = 'Set1')
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f254e5edf98>
In [ ]:
plt.figure(figsize=(10,4))
plt.title('Views in days of week',fontsize=20)
sns.barplot(x=dayofweek.index, y=dayofweek.views, data=dayofweek, palette='Set1')
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f254e5edac8>
  • May got far more views than the other months, and Monday got the most views within the week (pandas encodes Monday as dayofweek 0). However, I'm not sure about the reason...
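  • To double-check the weekday labels, we can group by the day name directly; a small verification sketch:
In [ ]:
# Grouping by day_name() makes the weekday labels explicit
next.groupby(next['trending_date'].dt.day_name())['views'].sum().sort_values(ascending=False)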

Content

  • Now I'm going to use some NLP techniques to deal with the titles and tags
In [ ]:
import os
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import string
import re
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
In [ ]:
def word_length(data,t):
  return data[t].apply(lambda s: len(str(s).split()))
In [ ]:
def text_process(t):
    """
    Takes in a Series of text strings, then:
    1. Removes all special characters, numbers, and punctuation
    2. Removes all stopwords
    3. Returns a list of the cleaned words
    """
    # Replace every run of non-letters with a space; this also strips
    # punctuation and numbers in a single pass
    cleaned = [re.sub(r"[^a-zA-Z]+", ' ', str(s)) for s in t]

    # Join everything back into a single string
    text = ' '.join(cleaned)

    # Remove any stopwords (build the set once instead of once per word)
    stop_words = set(stopwords.words('english'))
    return [word for word in text.split() if word.lower() not in stop_words]
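  • A quick illustration of text_process on a toy Series (made-up strings, not from the dataset):
In [ ]:
# Numbers, punctuation, and the stopwords 'the'/'is' should all be dropped
text_process(pd.Series(["The trailer is OFFICIAL!!! 2018", "best makeup challenge"]))
# expected: ['trailer', 'OFFICIAL', 'best', 'makeup', 'challenge']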
In [ ]:
next['title_length'] = word_length(next,'title')
next['tag_length'] = word_length(next,'tags')
  • Will the length of titles and tags affect views?
In [ ]:
def plot_length(x):
  plt.figure(figsize=(10,4))
  plt.title('Relationship between '+x+' and views',fontsize=20)
  sns.lineplot(x=x,y='views',data=next)
In [ ]:
plot_length('title_length')
In [ ]:
plot_length('tag_length')
  • As you can see, nearly all titles are within 20 words, and a title that is either too short or too long hurts a video's views. The optimal length seems to be fewer than 10 words.
  • The length of tags has a larger variance; tags may carry more of the information YouTubers want to convey. Also, more tags can cover a wider range of topics and increase the probability of being noticed (a quick numeric check follows below).
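  • A rough numeric check of those eyeballed trends (only a sketch; a weak linear correlation is expected, since the lineplots suggest a non-linear sweet spot rather than a straight-line relationship):
In [ ]:
# Correlation between the length features and views
next[['title_length', 'tag_length', 'views']].corr()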


  • Next, I will dive into the content of titles and tags to see what specific things caught people's eyes.
In [ ]:
title = pd.DataFrame(text_process(next['title'].apply(lambda s: s.lower())))
In [ ]:
title.columns=['words']
In [ ]:
wordlist1 = title['words'].value_counts().head(50)
In [ ]:
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import random
def grey_color_func(word, font_size, position, orientation, random_state=None,
                    **kwargs):
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)
def getwordcloud(wl):
  # Join the Top-50 words into a single string for the word cloud
  word = ' '.join(wl.index)
  wordcloud = WordCloud(width = 1600, height = 800, 
                background_color ='black', 
                min_font_size = 10).generate(word) 
  plt.figure(figsize = (10, 5), facecolor = None) 
  plt.imshow(wordcloud.recolor(color_func=grey_color_func, random_state=3),
           interpolation="bilinear") 
  plt.axis("off") 
  plt.tight_layout(pad = 0) 
  plt.show() 
In [ ]:
getwordcloud(wordlist1)
  • Hooray! I extracted the Top 50 most frequent words in YouTube video titles and put them into the word cloud. No doubt the most popular categories are music and entertainment: there are plenty of words related to music videos and movies like "official", "trailer", "teaser", "lyric", and of course "music" and "movie".

  • Besides these, we can also find some topics that intuitively attract people's attention, like "makeup", "challenge", "best", and "test", which are closely related to our lives.

  • Now let's do the same thing with the tags
In [ ]:
tags = pd.DataFrame(text_process(next['tags'].apply(lambda s: s.lower())))
tags.columns=['words']
wordlist2 = tags['words'].value_counts().head(50)
  • Here come the tags. There's not much difference, but we can see a clearer picture of what people like, with more concrete descriptions.
In [ ]:
getwordcloud(wordlist2)

Recommendation System

YouTube is renowned for its amazing recommendation system. It basically uses a two-stage deep neural network: one network generates candidate recommendations, and a second network ranks those candidates.


In the candidate generation process, the input features only include user history and content. I guess features like search history, watch history, subscriptions and saves, watch time, and users' demographic information should all be considered.

In the ranking stage, by contrast, the input features take in a much wider range of information: besides user history and content, the network also considers video features and other candidate sources. The purpose of this second deep neural net is to score each video selected by the candidate generation process and rank them by that score.
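
To make the two-stage idea concrete, here is a toy sketch in numpy (all embeddings, sizes, and scoring functions are made up for illustration and are nowhere near YouTube's actual networks): candidate generation narrows a large corpus down to a few hundred videos with a cheap similarity search, and the ranking model scores only those candidates with richer features.

In [ ]:
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: 10,000 "videos" with 16-dim embeddings (made up here);
# the user embedding would come from watch/search history in practice
video_emb = rng.normal(size=(10_000, 16))
user_emb = rng.normal(size=16)

# Stage 1 - candidate generation: cheap dot-product similarity, keep the top 100.
# This stage is tuned for recall: don't miss potentially relevant videos.
scores = video_emb @ user_emb
candidates = np.argsort(scores)[-100:]

# Stage 2 - ranking: a richer (here, fake) scorer looks only at the 100 candidates,
# folding in extra video-level features; this stage is tuned for precision.
extra_feats = rng.normal(size=(10_000, 8))
def rank_score(vid):
    return scores[vid] + extra_feats[vid].sum()

ranked = sorted(candidates, key=rank_score, reverse=True)
print(ranked[:10])  # the ten videos that would be shown first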

Intuitively, the probability is high that people will watch videos similar to what they have searched for, so user history and content provide the most relevant signals. The candidate generation network is tuned for recall: among all the videos a user would actually watch, it should miss as few as possible, even if that means pulling in some less relevant candidates; predicting only a few "sure" Trues would omit a lot of chances to get an exact match. The ranking network then takes care of precision, which indicates, among all the predictions we predict as True, what proportion are actually True. It's okay for candidate generation to surface some not-so-relevant videos, as long as it isn't missing the most relevant items, because the ranking stage will sort them to the top.
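
A tiny worked example of these two metrics (made-up labels, just to illustrate the definitions):

In [ ]:
# Suppose, out of 10 videos, the user would actually watch these 4:
relevant = {1, 3, 5, 8}

# A cautious recommender predicts only 2 videos:
picked = {1, 3}
print(len(picked & relevant) / len(picked),    # precision: 2/2 = 1.00, very precise
      len(picked & relevant) / len(relevant))  # recall:    2/4 = 0.50, misses half

# A broader recommender predicts 6 videos:
picked = {1, 2, 3, 5, 7, 8}
print(len(picked & relevant) / len(picked),    # precision: 4/6 = 0.67
      len(picked & relevant) / len(relevant))  # recall:    4/4 = 1.00, misses nothing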


It's time to check out the row of recommendations at the top of your YouTube homepage. Does YouTube's recommendation system capture your heart?
