I found a really neat data source online on unwanted robocalls that the FCC (Federal Communications Commission, a United States government agency) has created and published openly. It lists the times and dates of unwanted robocalls that consumers have reported to the FCC. We could use it to find out all kinds of things, but today we will be content with finding out at what time of day households are most likely to receive robocalls.
First, we need to fire up our trusty IPython notebook and download the data.
The data is freely available on the fcc.gov website and is encoded as a .csv
(comma separated values) file. Fortunately for us, we don’t even need to leave
IPython. We can execute bash commands right here!
%%bash
wget https://consumercomplaints.fcc.gov/hc/\
theme_assets/513073/200051444/Telemarketing_RoboCall_Weekly_Data.csv
Reading the Data
Using the convenient read_csv method, we can automatically turn a .csv file into a pandas DataFrame.
from pandas import read_csv
s = read_csv('Telemarketing_RoboCall_Weekly_Data.csv')
Excellent! As we can see, the DataFrame contains a lot of useful information: the caller's number, the type of call, the reason for reporting it, the US state in which it happened, and finally the time and date.
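If you want to poke at the DataFrame yourself before going further, something like the following might help. It is only a sketch on a tiny made-up sample: 'Time of Issue' is the one column name taken from this post, while 'Phone Number', 'State' and all the values are invented for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature of the FCC file. Only 'Time of Issue' is a real
# column name from the post; the other columns and all values are made up.
sample_csv = StringIO(
    "Phone Number,State,Time of Issue\n"
    "555-0100,CA,10:30 AM\n"
    "555-0101,NY,2:15 p.m.\n"
    "555-0102,TX,-\n"
)
sample = pd.read_csv(sample_csv)

print(sample.head())             # the first few rows
print(sample.columns.tolist())   # the available column names
print(sample['Time of Issue'].value_counts())  # spot odd spellings early
```

Running the same three calls on s instead of sample should give a quick feel for how messy each column is.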
Extracting Information
Now, in order to retrieve the time of day when the unwanted robocall was received, we need to take a closer look at the Time of Issue column.
s['Time of Issue']
The column is not well formatted: ‘AM’ and ‘PM’ appear in different spellings, and sometimes irregular characters such as “?” and “>” show up. Perhaps this is OCR gone awry. In order to extract useful information, we need to parse the time information first and ignore a lot of the spelling/encoding errors.
from datetime import datetime, time
def parse_time(raw_time):
    # A lone dash marks a missing time
    if raw_time == '-':
        return None
    # This is why we can't have nice things
    raw_time = raw_time.upper().replace(".", "").replace(",", "").replace(
        ":", "").replace(">", "").replace("?", "").replace(
        "  ", " ").replace("MM", "M").zfill(7)  # "  " -> " " collapses double spaces
    if raw_time[:2] == "00":
        raw_time = raw_time[1:]
    # Need to use datetime, time has no "strptime"
    dt = datetime.strptime(
        raw_time,
        "%I%M %p",
    )
    return dt.hour  # All this sweat and labor for an hour! Oh heavens!
actual_times = s['Time of Issue'].apply(parse_time)
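For comparison, here is a rough sketch of a different way to do the same cleanup with a regular expression instead of a chain of replace calls. This is not the approach used above: parse_hour and the sample strings are hypothetical, and I am assuming the messy values all boil down to "hour, optional minutes, then an A or P marker".

```python
import re


def parse_hour(raw_time):
    """Return the hour (0-23) of a messy 12-hour time string, or None."""
    if not isinstance(raw_time, str):
        return None  # e.g. NaN for missing values
    # Keep only digits and the letters that can form AM/PM.
    cleaned = re.sub(r"[^0-9APM]", "", raw_time.upper())
    # Hour, optional two-digit minutes, A or P, then any number of M's.
    match = re.fullmatch(r"(\d{1,2})(\d{2})?([AP])M*", cleaned)
    if match is None:
        return None
    hour = int(match.group(1))
    if not 1 <= hour <= 12:
        return None
    # 12 AM is hour 0 and 12 PM is hour 12 on the 24-hour clock.
    return hour % 12 + (12 if match.group(3) == "P" else 0)


# Made-up examples of the kinds of spellings described in the post.
samples = ["10:30 AM", "2:15 p.m.", "9.45 PMM", "12:05 am?", "-"]
hours = [parse_hour(t) for t in samples]
print(hours)
```

It could be dropped in the same way as parse_time, via s['Time of Issue'].apply(parse_hour). The trade-off is that anything the pattern does not recognise silently becomes None instead of raising an error, so it is worth counting the None values afterwards.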
Wasn’t that fun? Dealing with messy data is a challenge, but it can be mastered through continuous application of brute force and being in denial about reality. A steady supply of coffee helps as well.
Now that we have the actual times, I would like to retrieve the most frequent hour. Let’s do this!
%matplotlib inline
actual_times.hist(bins=24)
Aha! Most robocalls get placed between 10 and 11 AM!
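If you would rather not eyeball the histogram, the busiest hour can also be read off directly by counting. A minimal sketch, using a handful of made-up hours in place of the real actual_times Series:

```python
import pandas as pd

# Made-up hours standing in for the parsed 'Time of Issue' values;
# None plays the role of rows that could not be parsed.
hours = pd.Series([10, 11, 10, 14, None, 10, 11, 9])

# Drop the missing values, count how often each hour occurs,
# and order the result by hour of day.
counts = hours.dropna().astype(int).value_counts().sort_index()

busiest_hour = counts.idxmax()  # the hour with the highest count
print(counts)
print(busiest_hour)
```

On the real data, the same two lines applied to actual_times should point at the peak seen in the plot. As a side note, passing bins=range(25) to hist lines the bin edges up with whole hours, which may make the plot a little easier to read.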