The data comes from Kaggle and contains information on 911 calls (Fire, Traffic, EMS) for Montgomery County, PA.
The data contains the following fields:
The purpose of this project is to summarize the raw data and figure out which factors are likely to affect the number of 911 calls (further discussion needed).
The first part is a descriptive analysis. It gives an idea of the distribution of the data and identifies associations among variables.
Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Read and display the information of raw data
df = pd.read_csv('911.csv')
df.info()
df.head(10)
First, I want to know which regions made the most 911 calls.
The output below shows the top 5 zip codes for 911 calls. Among them, Norristown, PA (19401) has the largest number of emergency calls, 6979.
df['zip'].value_counts().head()
The output below shows the top 5 townships for 911 calls. Among them, LOWER MERION has the largest number of emergency calls, 8443.
df['twp'].value_counts().head()
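Raw counts are easier to compare when they are shown next to each township's share of all calls. The following is a minimal sketch on made-up toy data (the `toy` frame and the `summary` table are illustrative names, not part of the original notebook); on the real data the same calls would be applied to `df['twp']`.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'twp' column of 911.csv
toy = pd.DataFrame({'twp': ['LOWER MERION', 'ABINGTON', 'LOWER MERION',
                            'NORRISTOWN', 'LOWER MERION', 'ABINGTON']})

# Absolute counts, largest first (same call as in the notebook)
counts = toy['twp'].value_counts()

# Relative share of all calls per township
shares = toy['twp'].value_counts(normalize=True)

# Put both views side by side
summary = pd.DataFrame({'calls': counts, 'share': shares.round(3)})
print(summary.head())
```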
The graph shows that the most common reason for a 911 call is EMS; traffic is less common, and fire is the least common.
df['reason']= df['title'].apply(lambda s:s.split(':')[0])
df['reason'].value_counts()
sns.countplot(x='reason',data=df)
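The reason can also be extracted with pandas' vectorized string methods instead of `apply` with a lambda. This is only a sketch: the titles below are made-up examples in the style of the `title` column, and `toy` is an illustrative name.

```python
import pandas as pd

# Hypothetical titles in the "<reason>: <description>" format
toy = pd.DataFrame({'title': ['EMS: BACK PAINS/INJURY',
                              'Fire: GAS-ODOR/LEAK',
                              'Traffic: VEHICLE ACCIDENT -',
                              'EMS: CARDIAC EMERGENCY']})

# Split once on the first colon and keep the part before it
toy['reason'] = toy['title'].str.split(':', n=1).str[0].str.strip()

print(toy['reason'].value_counts())
```

Passing `n=1` limits the split to the first colon, so any later colon in the description does not matter.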
Next, I will process "timeStamp" to find out whether people make more 911 calls at a particular time of year.
Cleaning up the time information
# Check the type of the stored values (type('timeStamp') would only report the type of the string literal)
type(df['timeStamp'].iloc[0])
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
df['Hour'] = df['timeStamp'].apply(lambda t : t.hour)
df['Month']= df['timeStamp'].apply(lambda t : t.month)
df['Week'] = df['timeStamp'].apply(lambda t : t.weekday())
df['Week'] = df['Week'].map({0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'})
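Once the column is a datetime, the same three features can be derived with the `.dt` accessor, which avoids calling a lambda on every row. A minimal sketch on a few made-up timestamps (the `toy` frame is illustrative only):

```python
import pandas as pd

# Hypothetical timestamps in the same text format as the raw column
toy = pd.DataFrame({'timeStamp': ['2015-12-10 17:40:00',
                                  '2016-01-23 08:15:00',
                                  '2016-07-04 23:05:00']})
toy['timeStamp'] = pd.to_datetime(toy['timeStamp'])

toy['Hour'] = toy['timeStamp'].dt.hour
toy['Month'] = toy['timeStamp'].dt.month
# day_name() gives full English names; keep the first three letters
toy['Week'] = toy['timeStamp'].dt.day_name().str[:3]

print(toy)
```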
We start with the days of the week. Let's look at 911 calls on different days of the week for different reasons. There were fewer 911 calls on weekends than on weekdays. EMS calls were most numerous on Friday and traffic calls on Tuesday. For fire, the number remained roughly constant.
sns.countplot(x='Week',hue='reason',data=df, palette='deep')
plt.legend(bbox_to_anchor=(0, -0.15, 1, 0), loc=2, ncol=2, borderaxespad=0)
Now look at the monthly data. Apparently, the distribution has no data for September, October and November, so we should turn to other methods.
sns.countplot(x='Month',hue='reason',data=df,palette='deep')
plt.legend(bbox_to_anchor=(0, -0.15, 1, 0), loc=2, ncol=2, borderaxespad=0)
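One way to make the gaps explicit is to reindex the monthly counts over all twelve months, so that months without any rows show up as zero. This is a sketch on made-up month values, not the notebook's actual counts; `toy`, `full_year` and `missing` are illustrative names.

```python
import pandas as pd

# Hypothetical 'Month' values with several months absent
toy = pd.DataFrame({'Month': [1, 1, 2, 7, 7, 7, 8, 12]})

# Calls per month, ordered by month number
per_month = toy['Month'].value_counts().sort_index()

# Reindex over all 12 months; months with no rows appear as 0
full_year = per_month.reindex(range(1, 13), fill_value=0)

# List the months that have no calls at all
missing = full_year[full_year == 0].index.tolist()
print(missing)
```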
This graph shows the monthly trend of 911 calls. At the beginning of the year (January), the number of 911 calls was the largest. It gradually decreased until March and then increased again. It reached a peak in July before dropping sharply in August. There was no information for September, October and November. Finally, in December, the number was the smallest of the whole year.
byMonth = df.groupby('Month').count()
byMonth['lat'].plot(lw=3, color='purple')
plt.title('911 Calls per month')
From this graph, you can see the exact data points, and the overall trend of the calls.
byMonth = byMonth.reset_index()
sns.lmplot(x='Month',y='title',data=byMonth)
plt.ylabel('Number of calls')
Then we take a look at the numbers of 911 calls on each day.
df['Date']=df['timeStamp'].apply(lambda t: t.date())
byDate = df.groupby('Date').count()
byDate = byDate.reset_index()
byDate['title'].plot()
plt.tight_layout()
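To see which date stands out in a daily series, the busiest day can be looked up directly with `idxmax`. The sketch below uses a handful of made-up timestamps, and it counts rows with `size()` rather than `count()`, which is a swapped-in choice: `count()` skips missing values in each column, while `size()` counts every row.

```python
import pandas as pd

# Hypothetical timestamps; three of them fall on the same day
toy = pd.DataFrame({'timeStamp': pd.to_datetime([
    '2016-01-22 09:00', '2016-01-23 10:00', '2016-01-23 11:30',
    '2016-01-23 18:45', '2016-01-24 07:20'])})

# Calls per calendar day; size() counts rows regardless of missing values
per_day = toy.groupby(toy['timeStamp'].dt.date).size()

# Date with the largest number of calls, and that number
busiest_day = per_day.idxmax()
print(busiest_day, per_day.max())
```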
Generally speaking, the distribution of 911 calls is quite regular, except for a very large number on one day in January 2016. Now create 3 separate plots, each representing one reason for the 911 call.
# Daily call counts per reason; reset_index() turns the grouped 'Date' index back into a column
Traffic = df[df['reason']=='Traffic']
byDate1 = Traffic.groupby('Date').count().reset_index()
EMS = df[df['reason']=='EMS']
byDate2 = EMS.groupby('Date').count().reset_index()
Fire = df[df['reason']=='Fire']
byDate3 = Fire.groupby('Date').count().reset_index()
For Traffic
byDate1.plot.line(x='Date',y='title')
plt.title('Traffic')
For EMS
byDate2.plot.line(x='Date',y='title')
plt.title('EMS')
For Fire
byDate3.plot.line(x='Date',y='title')
plt.title('Fire')
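The three per-reason tables can also be built in one step by grouping on both the date and the reason and then pivoting the reason into columns. This is a sketch on made-up rows (the `toy` frame and `daily` table are illustrative names); each column of the result could then be plotted on its own.

```python
import datetime
import pandas as pd

# Hypothetical rows with a calendar date and a call reason
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2016-01-22', '2016-01-22', '2016-01-23',
                            '2016-01-23', '2016-01-23', '2016-01-24']).date,
    'reason': ['EMS', 'Traffic', 'Traffic', 'Traffic', 'EMS', 'Fire'],
})

# One row per date, one column per reason, 0 where a reason did not occur
daily = toy.groupby(['Date', 'reason']).size().unstack(fill_value=0)
print(daily)

# A single reason is then just one column, e.g. daily['Traffic']
traffic_on_23rd = daily.loc[datetime.date(2016, 1, 23), 'Traffic']
```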
# Count calls for each (day of week, hour) pair, then pivot the hours into columns
dayHour = df.groupby(by=['Week','Hour']).count()['reason'].unstack()
Finally, I want to find out at what time of day the number of 911 calls is larger. I create two heatmaps to show the level; a brighter color means more calls.
This graph shows the number of 911 calls in hours based on different days of week.
plt.figure(figsize=(10,6))
sns.heatmap(dayHour,linecolor='white',linewidths=1,cmap='magma')
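Because the day labels are strings, grouping sorts them alphabetically (Fri, Mon, Sat, ...), so the heatmap rows do not follow the calendar week. A possible fix, sketched on made-up data, is to reindex the pivoted table with an explicit day order; `toy`, `day_order` and `day_hour` are illustrative names.

```python
import pandas as pd

# Hypothetical day-of-week and hour values
toy = pd.DataFrame({
    'Week': ['Mon', 'Mon', 'Fri', 'Sun', 'Fri', 'Wed'],
    'Hour': [8, 17, 17, 2, 17, 8],
})

day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Count calls per (day, hour), pivot hours into columns,
# then put the rows in calendar order (days with no calls become 0)
day_hour = (toy.groupby(['Week', 'Hour']).size()
               .unstack(fill_value=0)
               .reindex(day_order, fill_value=0))
print(day_hour)
```

The reindexed table can be passed to `sns.heatmap` in the same way as `dayHour`.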
This graph shows the number of 911 calls in hours based on different months
dayMonth = df.groupby(by=['Month','Hour']).count()['reason'].unstack()
plt.figure(figsize=(10,6))
sns.heatmap(dayMonth, linecolor='white',linewidths=1,cmap='magma')
Clearly, many more calls happen during the daytime than around midnight or early in the morning.