I believe that being a data scientist is about communication first, and data science second.
1. Introduction
A data scientist needs two abilities:
- The ability to gain insight and knowledge using data
- The ability to successfully communicate what they learn to stakeholders
I have always wanted to share some of my ideas for improving those abilities and have finally learned enough to know what to write. I have written down some of the most important lessons that I have learned on my path to becoming a better data science communicator.
Since Jupyter, IPython and Pandas are my main tools, the article is written from the perspective of a user of these tools. Three topics are covered in this article.
- Communication with Purpose
- Notebook structure
- Clean Processing
We discuss how to communicate with purpose. We see how to justify why a data science document, that is, a notebook, exists, and what question it answers. A notebook would not be complete without answering that question and providing tangible results.
Notebook structure plays a big role when discussing notebooks with your client. I share my experience with you and we learn how to structure notebooks effectively.
Clean processing can mean the difference between a good notebook, and a great notebook. It’s not only about the right method call to parse a CSV column, it’s also about documenting how you found the data, and what you did with it. Again, we see what to do and what to avoid.
2. Communicate with purpose
2.1. What question are you answering?
Writing happens for a reason. This includes writing IPython notebooks, and data science writing in general. Think about the last time you wrote a notebook. Why did you write it?
Let me name a few reasons why I create Data Science notebooks:
- A client approaches me with their data and a question. I answer their question using their data and guide them on a business decision making process.
- I develop a machine learning model and rate its performance. I want to explain how the model works and how it’s rated, and how confident I am in the result.
- I analyze data quality and want to summarize my findings. This lets me better estimate the quality of any data analysis based on that data.
- I want to understand and teach a new data science concept. It could be an addition to the Pandas API, it could be a new idea on Data Engineering, or it could be an interesting machine learning model.
I am sure there are many more reasons for using notebooks to share data science ideas and results. When you know the purpose of a document, you can decide what to include and what not to include.
For example, you can add an endless number of sub-segments when doing a segment analysis on user data. At the same time, looking at more data than necessary does not help answer the original question of a document. When I realize that I am no longer working toward the goal, I return to answering the document's core question.
Perhaps I am analyzing how the conversion rates of new and returning users differ, but get sidetracked by some interesting subsegment of returning users. That may be valuable by itself, but it also deserves its own independent treatment. In that instance, I like mentioning interesting findings to the people I work with and I suggest that these findings are worth analysing further in a new document. This is always received with enthusiasm. Working closely with data gives you an edge at identifying further interesting topics to research.
Furthermore, I believe that focusing on one main purpose maximizes the impact of the notebook’s core result. Just providing one main chart that demonstrates an important finding has more value than a notebook containing interesting but weaker findings. The presence of many results dilutes the impact of one result.
Consider the opposite: a document could lack interesting findings and never get close to answering any question. In that case, returning to answering the core question and dedicating all effort to effectively sharing results should have the highest priority.
To sum it up, a reader should always know what was asked and what was answered in a notebook.
2.2. How well could you answer the question?
A notebook is an answer to a question. The way a notebook answers this question comes in the form of a result. This result may be a chart, a table, or even a simple yes or no. What all those results have in common is something intangible: quality. Not only does the result itself matter, but also the quality of that result. I would like to show a few examples of quality in notebooks, and where quality can still improve.
Delivering a histogram demonstrating an insight about user behavior has a strong quality. Furthermore, if the data is cleaned thoroughly, the “split, apply, combine” process is reproducible, the charts are well-formatted, and the explanation and summary are easy to understand, then the document is valuable for the business.
Delivering a weak yes to the question “Do our users exhibit behavior A” lacks quality. Maybe the data source contained format errors, maybe the event data could only partially be joined with the user data, and maybe the explanation and summary do not describe the process. All these factors together make such a document not useful.
And yes, problems with data source quality cannot always be avoided. Maybe the client changed the data storage format in the middle of the year, maybe a metric changed and it wasn’t properly documented.
Something can be done about this: Document data quality issues with the data at hand. Make the client aware that there are problems. But most importantly, present your results in such a way that quality issues and potential caveats are transparent.
We conclude that strong results are a product of
- Clean data processing,
- Reproducible and transparent computation, and
- Clear result communication.
That answers one half of how well could you answer the question?, the question of result quality.
2.3. Can it be answered?
Let us move on to the other half of answering how well could you answer the question?, the question of feasibility. Before we start working on answering a question, we need to ask ourselves whether
- the question can be solved using the tools and skills that are available, and
- whether the question is answerable at all.
It’s not always possible to know whether a question can be answered when you start working on it. At the same time, you should always ask yourself how confident you are given what you have learned so far about the problem.
Here are some questions that a client may ask. Whether they can be answered needs to be critically examined.
Question: We have been collecting consumer buy data for 2 weeks now, and we would like to examine whether our users like chocolate or hard candy.
Answer: this is hard to answer since 2 weeks of data does not account for seasonality and sales campaigns.
Question: What do our users feel when they put an item in their shopping basket?
Answer: we do not yet have the technology to infer feelings from event logs.
Furthermore, there are questions that are unlikely to be answered successfully. This may not be clear from the outset.
Question: Can you predict sales data for Quarter 1/2018, given sales data for the past 10 Quarter 1 numbers?
Answer: it depends on whether an appropriate and robust model can be found. It also depends on whether the future can be predicted.
Question: We have had a data loss in August and September, can you still perform an A/B test analysis using partial data for October?
Answer: it depends on whether the amount of data is enough to get to a statistically significant result.
What’s our takeaway here?
- If a result could not be found: Be upfront about it
- If a result was found: Be transparent about result quality
- If you achieve a great result: Celebrate
Giving the client feedback when you have good results encourages them to improve their data quality even further. It also lets them know what kind of questions you can answer. This creates a positive feedback cycle between you and your client and ensures a long and fruitful collaboration.
3. Structure your notebook
3.1. Separate setup and analysis
Almost every notebook that I write can be divided in two big parts: the setup and the analysis.
The setup is concerned with
- reading in data,
- selecting rows by date ranges,
- joining user records,
- performing QA sampling of the data,
- counting the total amount of data that was read,
- setting up distributed workers,
- performing groupbys,
- and so on.
To summarize: it’s everything that a good data scientist requires to become productive with data. What it should not contain is the actual analysis.
In my experience, someone who is just interested in results doesn’t immediately spot the difference between an IPython cell that produces an insightful result, and a line that just imports Matplotlib. This has nothing to do with a lack of tech skills. It has everything to do with the fact that a reader is interested in seeing the results first, and reading through code second. This is why I strongly believe that separating those two concerns in a document is important.
Where the setup step uses code to retrieve and prepare data, the analysis step uses data and code to generate insight. Here the density of code should be as low as possible. If you can restrict yourself to 1 line of code per insight that you generate, you produce a lean and easy to read notebook.
As a guideline, I recommend refactoring commonly used data transformations into setup subsections. For example, if you catch yourself writing
Analysis 1
df.query('purchase_value > 100').groupby('user_id').hist()
and soon thereafter
Analysis 2
df.query('purchase_value > 50').groupby('user_id').hist()
and so on, you can refactor this as follows:
Setup
def purchase_segment_hist(df, value):
    df.query('purchase_value > @value').groupby('user_id').hist()
Analysis 1
purchase_segment_hist(df, 100)
Analysis 2
purchase_segment_hist(df, 50)
What you’ll gain is further clarity in the analysis phase for the small price of four lines of code in the setup phase.
I use the following rule of thumb: if you can, put all your code into the setup phase and all your insight into the analysis phase.
While no analysis can be performed without code, minimizing the amount of it increases readability of results. Having all critical code in the setup phase eases maintenance and understandability of data processing code.
3.2. Be generous with sections
Imagine this: you’re on the phone with a client, it’s 15 minutes until their next board meeting. They are in a hurry and are sifting through the notebook that you have provided for them. Just yesterday they saw an important number and now they really need it.
On the phone, you start hearing your client flipping through a paper document and realize that they are looking at a printed version of your notebook. It would be useless to tell them to scroll down to the middle. What do you do? If you have added sections and a table of contents, nothing is easier: you can simply tell them that the important number is in section 2.3.4.
This has happened to me countless times and I can quickly answer these questions when I structure my notebooks well.
To make everyone’s lives easier, it’s important to add the following structural elements to your data science notebook:
- Table of Contents (TOC)
- Section Header for every major part of the notebook
- Subsection, or Sub-subsection for every result and other important insight or step
This has the following advantages:
- It’s easy to walk someone through a notebook when everyone can navigate it with ease
- It makes sharing a document and referring to parts of it easy
- Navigating a notebook on different media becomes easy (paper vs. HTML vs. Jupyter IPython)
- It lets you and your client compare notebook versions (“Where did histogram X move?” “It’s in section 3.2 now”)
Clear and easy to follow communication makes everyone happy.
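If your notebook tooling does not generate a table of contents for you, one can be derived from the Markdown headings. The following is only a minimal sketch: the function name notebook_toc and the sample notebook are made up for illustration, and it assumes the usual notebook JSON layout with a "cells" list whose Markdown cells carry their text in "source". Lines starting with # inside fenced code in a Markdown cell would be picked up by mistake.

```python
def notebook_toc(notebook):
    """Return (level, title) pairs for every Markdown heading in a notebook dict."""
    toc = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # The source can be stored as one string or as a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        for line in text.splitlines():
            if line.startswith("#"):
                level = len(line) - len(line.lstrip("#"))
                toc.append((level, line.lstrip("#").strip()))
    return toc


# Hypothetical notebook content; a real one could be loaded with json.load().
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Setup\n", "Some text\n"]},
        {"cell_type": "code", "source": ["import pandas as pd\n"]},
        {"cell_type": "markdown", "source": "## Read data"},
        {"cell_type": "markdown", "source": ["# Analysis\n"]},
    ]
}

toc = notebook_toc(notebook)
for level, title in toc:
    print("  " * (level - 1) + "- " + title)
```

Printing the result as an indented list gives a TOC that can be pasted into the first cell of the notebook.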
4. Clean processing
One important step in creating transparent and easy to understand notebooks is taking extra care when loading and processing data. Throughout my work I have identified two important qualities that can be found in a clean notebook.
- Reproducibility: Data is read in as cleanly as possible and results are calculated deterministically.
- Transparency: Parse errors, imprecisions and compounding errors are transparently communicated
If our whole processing pipeline creates reproducible results, we have a system that we can trust and count on even when we are in a hurry. We can easily create further notebooks that use the same data, and we can reliably compare results from different notebooks using the same data foundation. It also lets us rerun notebooks, for example after adding a new chart, and trust that the results are going to look the same.
While reproducibility is concerned with how results are computed, transparency creates trust between a data scientist and a notebook’s reader. With Transparency, both parties can understand how results come to be and how to judge them. In a transparent document we explain what could be processed and what could not be processed.
We see how these two qualities can be implemented in the following sections.
4.1. What data was used? What period was observed?
Perhaps the most important factor in cleanly and transparently processing your data is to exactly document:
- The start and end date of the data you analyse.
- How you filter data by date ranges.
- File paths, S3 paths, and similar, that you import data from.
- The number of data rows you read.
With this information you can compare notebook results to other notebooks that have been created with the same data.
I note at this point that one subtle way results might change is if dates are filtered inconsistently. The following line
df[start_date <= df.index]
is different from
df[start_date < df.index]
The first one looks at rows starting with and including the start_date, the second only considers rows with an index larger than the start_date. While both can be used, it’s important to communicate this with clarity. This lets everyone independently verify these results.
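To make the difference concrete, here is a small sketch. It assumes that the DataFrame is indexed by date (a DatetimeIndex); the sample data and the column name purchase_value are invented for illustration.

```python
import pandas as pd

# Hypothetical data: one row per day, indexed by date.
df = pd.DataFrame(
    {"purchase_value": [10, 20, 30, 40]},
    index=pd.date_range("2017-01-01", periods=4, freq="D"),
)
start_date = pd.Timestamp("2017-01-02")

# Inclusive: rows on the start date itself are kept.
inclusive = df[start_date <= df.index]
# Exclusive: rows on the start date itself are dropped.
exclusive = df[start_date < df.index]

print(f"inclusive filter: {len(inclusive)} rows, first date {inclusive.index.min().date()}")
print(f"exclusive filter: {len(exclusive)} rows, first date {exclusive.index.min().date()}")
```

Printing the first date and the row count next to the filter documents which of the two variants was used.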
Depending on confidentiality requirements, it’s not always possible to include file paths in notebooks. Where it is possible, documenting the exact source of data is useful. It lets the reader of a notebook verify whether it uses outdated data. Perhaps the company has loaded new data into the data warehouse and has not yet had a chance to inform you. If your results deviate from what is expected, it becomes an easy fix.
Finally, a really simple marker, a sort of data fingerprint, is to write down the amount of rows that have been read into a Pandas DataFrame. Nothing is easier than writing a quick
len(df)
in your notebook. Should the number change from run to run, you know that something is up. This resembles canaries used in mining and lets you immediately know that there is trouble ahead. This also plays a role in merging and joining data, as you can see in the following section. Ideally, try printing row counts after every important processing step.
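One way to print row counts after every step without cluttering the analysis is a small helper used with DataFrame.pipe. This is a sketch only: the helper log_rows and the sample data are my own invention, not part of Pandas.

```python
import pandas as pd


def log_rows(df, step):
    """Print the row count after a processing step and return df unchanged."""
    print(f"{step}: {len(df)} rows")
    return df


# Hypothetical purchase data.
df = pd.DataFrame(
    {"user_id": [1, 1, 2, 3], "purchase_value": [120, 40, 75, 5]}
)

result = (
    df.pipe(log_rows, "read")
    .query("purchase_value > 50")
    .pipe(log_rows, "filtered by purchase_value")
)
```

Because the helper returns the DataFrame unchanged, it can be dropped into any method chain in the setup phase.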
4.2. How was the data cleaned?
Data cleaning needs to be reproducible. It’s crucial to be as precise and intentional as possible when selecting and processing data. A common process for data analysis is to pull CSV, JSON, or similar data from a Data Warehouse, pre-process it to a binary format like parquet, and then use it in the actual data analysis notebook. This is done for the reason that this intermediate result can then be reused in many other notebooks.
These are the critical points in this process:
Preprocessing could subtly change data that was analysed. If further data is added for a particular month without your knowledge, then rerunning the preprocessing step could potentially add data that changes the result in your data analysis notebooks. That’s why it’s important to version your data and document which version you use.
Parsing columns can make a big difference. Always be aware of the data format each column is supposed to have. Let your client know of any issues that you encounter while parsing data. Document how you parse columns and be explicit about number formats. Never leave parsing integers or floating point numbers to chance, not even once.
Dropping rows with n/a values or similar needs special attention and care. Sometimes you might accidentally drop too many rows because there was a parse error. Instead of dropping those rows, stop to investigate why the rows are incorrectly parsed. Other times it’s perfectly acceptable to drop rows, for example when only rows with a non-empty URL should be considered. In either case, document row counts before and after.
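The last two points can be sketched together. The example below is an illustration under assumptions: the CSV content and column names are invented, and reading everything as strings first is just one possible way to avoid leaving number parsing to chance.

```python
import io

import pandas as pd

# Hypothetical raw export; "n/a" and "abc" stand in for real parse problems.
raw = io.StringIO("user_id,purchase_value\n1,10.5\n2,n/a\n3,abc\n4,7\n")

# Read every column as a string so that Pandas neither guesses types
# nor silently turns values like "n/a" into NaN.
df = pd.read_csv(raw, dtype=str, keep_default_na=False)

# Parse explicitly and keep track of the values that fail.
parsed = pd.to_numeric(df["purchase_value"], errors="coerce")
unparseable = df.loc[parsed.isna(), "purchase_value"]
print(f"could not parse {len(unparseable)} of {len(df)} values: {unparseable.tolist()}")

df["user_id"] = df["user_id"].astype("int64")
df["purchase_value"] = parsed

# Document row counts before and after dropping rows.
rows_before = len(df)
df_clean = df.dropna(subset=["purchase_value"])
rows_after = len(df_clean)
print(f"rows before: {rows_before}, after: {rows_after}, dropped: {rows_before - rows_after}")
```

Looking at the unparseable values before dropping anything shows whether the rows are genuinely empty or whether a format problem needs to be reported to the client.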
4.3. How was data joined?
Many times you read user data from an event log and join it with user rows from a database. We want to make sure that this process is reproducible.
It’s vital to record the following things:
- Which columns were used to join?
- How many rows on each side could be joined?
- How many rows on each side could not be joined?
Sometimes there is an issue with user data being incomplete. If the full user data set becomes available, the amount of user data that can be joined with event rows suddenly rises. This can have an effect on results, and that’s why properly documenting the amount of joined rows is important.
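Pandas can report these numbers directly when merging with indicator=True. The following sketch uses invented event and user tables; the join column user_id is an assumption for the example.

```python
import pandas as pd

# Hypothetical event log and user table.
events = pd.DataFrame(
    {"user_id": [1, 1, 2, 4], "event": ["view", "buy", "view", "view"]}
)
users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["DE", "JP", "US"]})

# An outer join keeps unmatched rows from both sides, and indicator=True
# adds a "_merge" column that says where each row came from.
merged = events.merge(users, on="user_id", how="outer", indicator=True)
join_counts = merged["_merge"].value_counts()

print("joined on: user_id")
print(join_counts)
```

Here "both" counts the rows that could be joined, "left_only" the event rows without a user, and "right_only" the users without any event. After recording the counts, the analysis can continue with only the rows marked "both".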
4.4. What is the quality of data?
Getting a feel for the data is not only important to understand and analyze it, it’s also a perfect opportunity to review the quality of the underlying input data. We want to be as transparent as possible in our evaluation.
Even assuming that all data could be parsed without any errors, we might still encounter the following problems:
- A column containing category data might only contain one category because of a data source problem.
- A numerical column might contain test data. Sometimes a developer accidentally leaves in a 0xBEEF.
- A url column may contain User Agent strings.
- A string column might contain test values like qux, foo, and bar.
None of these problems are obvious in the beginning. These problems may sneak in as subtle errors that can cause big cumulative errors in the end. I recommend you take the following precautions:
- Look at category value distributions using a simple Pandas .value_counts(). If you see only one value, you know that something is wrong.
- For numerical data, create a histogram and understand its distribution. If you know how the data is supposed to be distributed, you can immediately spot errors.
- Try to parse string columns that are supposed to follow some syntax. In the case of URLs, urllib.parse in Python 3 can quickly help out there. This is helpful even if you only work on a small uniformly sampled subset of the data you are working with.
- Test values in string columns are difficult to handle. Manually sampling 100 to 200 rows is often necessary in this case. Again, it’s important to take a uniform sample.
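A few of these precautions can be sketched in code. The data, the column names, and the helper looks_like_url are assumptions made for this example; in particular, requiring an http or https scheme and a host is just one possible definition of a valid URL.

```python
from urllib.parse import urlparse

import pandas as pd

# Hypothetical data with two typical problems: a single category value
# and a url column that contains a User Agent string and a test value.
df = pd.DataFrame(
    {
        "category": ["a", "a", "a", "a"],
        "url": [
            "https://example.com/x",
            "Mozilla/5.0 (X11; Linux x86_64)",
            "http://example.org",
            "foo",
        ],
    }
)

# Category distribution: a single value hints at a data source problem.
category_counts = df["category"].value_counts()
if len(category_counts) == 1:
    print("warning: category column contains only one value")


def looks_like_url(value):
    """Return True if value is a string with an http(s) scheme and a host."""
    if not isinstance(value, str):
        return False
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


suspicious = df[~df["url"].map(looks_like_url)]
print(f"{len(suspicious)} of {len(df)} url values do not look like URLs")

# Uniform sample for manual inspection; a fixed seed keeps it reproducible.
sample = df.sample(n=2, random_state=0)
```

On real data the sample size would be closer to the 100 to 200 rows mentioned above.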
I am confident that there are many more useful techniques for quality evaluation. It’s always good to keep a lookout for problems with your data. Never take results for granted and be ready to justify every single step. This makes your data processing pipeline as robust as possible.
5. Conclusion
How do you write notebooks and what strategies do you use? What do you use to communicate your intent and your results in your notebooks? I would be excited to learn more.