Building a Data Pipeline from Scratch

This is the story of my first project as a Data Scientist: fighting with databases, Excel files, APIs and cloud storage. If you have ever had to build something like this, you know exactly what I’m talking about.

‘Why are you using data and pipeline in the same sentence?’

For those who don’t know, a data pipeline is a set of actions that extract data (or directly produce analytics and visualizations) from various sources.

Extract, Transform, Load

It is an automated process: take these columns from this database, merge them with these columns from this API, subset rows according to a value, substitute NAs with the median and load the result into this other database. This is known as a “job”, and pipelines are made of many jobs.
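To make the idea of a job concrete, here is a minimal sketch of the steps just listed, using only the Python standard library. Everything in it is made up for illustration: the `customers` and `customer_spend` tables, the column names, and `fetch_api_rows`, which stands in for a real API call. Take it as one possible shape for a job, not as the way it has to be done.

```python
import sqlite3
from statistics import median


def fetch_api_rows():
    # Hypothetical stand-in for a real API call, keyed by customer_id.
    return {1: {"plan": "basic"}, 2: {"plan": "pro"},
            3: {"plan": "pro"}, 4: {"plan": "pro"}}


def run_job(source, target, api_rows, plan="pro"):
    # Extract: take only the columns we care about from the source database.
    rows = source.execute(
        "SELECT customer_id, monthly_spend FROM customers"
    ).fetchall()
    # Transform 1: merge with the API columns on customer_id.
    merged = [(cid, spend, api_rows[cid]["plan"])
              for cid, spend in rows if cid in api_rows]
    # Transform 2: subset rows according to a value.
    subset = [row for row in merged if row[2] == plan]
    # Transform 3: substitute missing values (NULL/None) with the median.
    observed = [spend for _, spend, _ in subset if spend is not None]
    fill = median(observed) if observed else None
    cleaned = [(cid, fill if spend is None else spend, p)
               for cid, spend, p in subset]
    # Load: write the result into the other database.
    target.execute(
        "CREATE TABLE IF NOT EXISTS customer_spend "
        "(customer_id INTEGER PRIMARY KEY, monthly_spend REAL, plan TEXT)"
    )
    target.executemany(
        "INSERT OR REPLACE INTO customer_spend VALUES (?, ?, ?)", cleaned
    )
    target.commit()
    return len(cleaned)


# Two in-memory databases play the role of source and warehouse.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE customers "
    "(customer_id INTEGER, monthly_spend REAL, signup_channel TEXT)"
)
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 20.0, "web"), (2, 50.0, "web"), (3, None, "store"), (4, 70.0, "web")],
)
target = sqlite3.connect(":memory:")

loaded = run_job(source, target, fetch_api_rows())
result = target.execute(
    "SELECT customer_id, monthly_spend FROM customer_spend "
    "ORDER BY customer_id"
).fetchall()
print(loaded, result)
```

Because the load step uses `INSERT OR REPLACE` on a primary key, running the job twice leaves the target table in the same state, which is handy once a scheduler starts calling it.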

Why do we need an automated pipeline, you ask?

  • First, we will have most of the data we care about in one place and in the same format
  • Second, we don’t have to do it ourselves by hand every time
  • Third, it is reproducible (more on reproducibility later)

Business needs != Research Questions

Big Data, Machine Learning, AI and Data Science are just buzzwords, right? Mmh. Probably not. Or better: yes. The gist is that as long as businesses invest in these “things”, people will work on them. Full stop.

The fact is that these buzzwords are often misused, especially in a business context. We can have small Big Data (just use HBase for less than a terabyte of data), we can have “Artificial Stupidity” (apply the same algorithm to all business problems) and we can have Data Ignorance (bad communication is enough to let ignorance spread like a disease).

In my opinion the important thing is to not lose focus: tools are means to an end, and not vice versa. Hadoop is neither bad nor good per se, it is just a way to store and retrieve semi-structured data. Saying otherwise would be like saying that chainsaws are bad because someone uses them to kill people.

We have come to the point. Questions are the important thing, and the tools are the means to reach the answers to these questions.

From http://xkcd.com/

Businesses are facing tough times not only because of an elusive lack of talent, but also because they need to adopt processes and practices typical of academia. This is disruptive for them. For decades business people mocked academia: it’s slow, it’s bureaucratic, it isn’t cost effective, etc.

Now they’re mocking it less, or at least they should. Businesses are starting to understand that “How do we make more money?” is not the only question: it is the objective, and many other questions are connected to it.

“What’s the cost connected to the retention of churning customers?” “Can we reduce it by focusing on customers before they churn?” “Which customers are going to churn?” These are questions that can be answered with data, but many people are not used to stating issues in this way.

So the first problem when building a data pipeline is that you need a translator. This translator is going to try to understand what the real questions tied to business needs are. This can be a slow and long process, especially if starting from scratch.

Infrastructure: or how I learned to stop worrying and love SQL

On the internet you’ll find countless resources about pipeline and warehouse infrastructure possibilities. You won’t find as many resources on the process to follow or on best practices. This is bad.

A bit dated, but always good

Here’s the second risk: many will focus on tools and technology, not on ends. This is bad as well.

Instead of thinking about what would be the easiest, fastest and most efficient solution to get an answer to business questions, some people are going to focus on raw power and huge amounts of storage.

Fun fact: working with big data is relatively easy (once it’s all set up); the really difficult thing is working with small data. Or worse, with non-existent data.

Usually you’ll have someone sitting at a table talking about scalability and about the limitless possibilities of some tool. It’s easy to be dragged into the discussion without thinking that in order to have Big Data one has to have, well, a lot of data.

To use Neural Networks you need a huge amount of cats… ahem, data…

It can happen that some information is lost in businesses because it is not collected at all, or not collected consistently. When you find out that some data you would like to analyze are not collected consistently, you’ll have a small data problem (or worse, a zero data problem).

Sure, you can envision a new data collection process, but for some time you’ll be stuck with little or no data. In this case tools won’t be able to help you. Either you’re able to perform robust (literally robust, not the methods) statistical analysis or you’ll have to wait.

The point is that the infrastructure must be simple and efficient: one can even think of integrating missing information through a new collection process directly within the pipeline.

Process — Project Every Tiny Detail

The process is the most important step. You will define what, where and how data are collected, transformed and loaded. Though we keep hearing every day about AI and its endless possibilities, there is still at least one thing it cannot do yet: decide the pipeline process.

This means that you’ll need to manually pick every field, table, data source, transformation, join, etc. The good news is that if you do it right you’ll have to do it just once. Afterwards everything will be automated.

Vincent Vega unable to reproduce some calculation in Excel

Why is automation so important? Well, of course because you won’t have to do the same thing over and over, saving a lot of time. But probably the most important reason is that to have automation you need to think, plan and write down somewhere, in some sort of language, the whole process.

This makes it reproducible, which means it can be reproduced by almost anyone and nearly everywhere (only if they have access to the data, of course). Security and the ability to back up the process are important too, but the major feature is that you can debug it.

In this case debugging refers not only to the code, but to the whole process. What if you have a transformation of a categorical variable from 0–1 to Male-Female, but later on you find out that it is not 0 = Male, but 0 = Female?

The proper reaction when you find out the coding of the variable is wrong

If you have a well built and structured pipeline, you can find where the wrong transformation is, change it and that’s it. If you don’t have a pipeline, either you go changing the coding in every analysis, transformation, merging, data whatever, or you have to consider every analysis made before void.

There are clear issues with both “no-pipeline-no-party” solutions. Businesses must understand that it is much better to spend a bit more time up front, when building the pipeline, than to risk losing months and/or even money because of wrong decisions later on.
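As a small sketch of what “change it and that’s it” can look like: the coding of the variable lives in exactly one place, and every downstream step goes through it. The names `SEX_CODES` and `decode_sex` are hypothetical, and the mapping shown is the corrected one from the example above.

```python
# Single source of truth for the coding of the variable (hypothetical name).
# If the coding turns out to be wrong, this is the only line to change.
SEX_CODES = {0: "Female", 1: "Male"}


def decode_sex(rows, codes=SEX_CODES):
    # Return new rows with the numeric code replaced by its label.
    # An unknown code raises KeyError instead of passing through silently.
    return [{**row, "sex": codes[row["sex"]]} for row in rows]


raw = [{"id": 1, "sex": 0}, {"id": 2, "sex": 1}]
decoded = decode_sex(raw)
print(decoded)
```

Failing loudly on an unknown code is a deliberate choice here: a wrong label that flows quietly into every report is exactly the situation the pipeline is supposed to prevent.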

Data Driven Decision Making: a matter of culture

What comes first: the evidence based decision making or the data culture?

It’s the usual chicken-and-egg problem and it can be tricky to solve. The usual answer is that the data culture has to come from somewhere. It’s not unusual to find people in many businesses who were never exposed to structured data and never had to make decisions based on evidence.

For these people even talking about averages, medians, distributions and other simple descriptive statistics can be overwhelming. During the startup phase it’s important to not overload people with data: there’s a reason downloading a database to a file is called a “dump”.

You have to carefully decide which metrics really matter for every business area and which ones have the best chance of making something click in people’s heads. The only way to find out is to talk to them, both managers and employees. Expect the unexpected, especially if you’re undertaking this process from zero.

It’s almost certain you’ll find duplication of data, as well as the lack of potentially vital information. During this phase you can also start to present your objectives and let people start to get into a new mindset.

This is important because the best solutions are those that make employees virtually independent in accessing, visualizing and analyzing data. Independence is a goal, and to reach it you’ll have to guide and help people. As for the pipeline, it will take more time at the beginning but it is going to pay off in the long run.

For many people this would be a legit working dashboard

At the beginning don’t use fancy modeling: simple insights and descriptive statistics will be more than enough to uncover many major patterns.

I find that carefully explaining to people the difference between mean and median, and why they should almost always use the median, is already a good first step. For some this will already be pretty difficult to understand, since we have means ingrained very deeply in our reasoning.
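A tiny example I find useful for this conversation, with made-up order values: five ordinary orders and one huge one. The numbers are purely illustrative.

```python
from statistics import mean, median

# Hypothetical order values: five typical orders and one outlier.
order_values = [20, 22, 25, 27, 30, 1000]

avg = mean(order_values)    # pulled far upwards by the single large order
mid = median(order_values)  # stays with the typical orders

print(f"mean = {avg:.2f}, median = {mid}")
```

The mean comes out around 187, a value that no actual order is anywhere near, while the median sits at 26, right among the typical orders. One outlier is enough to make the mean describe nobody.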

To take root, a data culture must be grown slowly but firmly: the first step is to get people to trust the data. Accuracy is always better than precision: don’t be afraid to give ranges and confidence intervals, and explain them carefully.
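If you want to give a range around a median, one option among several is a percentile bootstrap. The sketch below is my own choice of technique for illustration, using only the standard library; the function name `bootstrap_median_ci`, its parameters and the delivery times are all hypothetical.

```python
import random
from statistics import median


def bootstrap_median_ci(values, n_resamples=2000, level=0.95, seed=0):
    # Percentile bootstrap: resample with replacement, take the median of
    # each resample, then read the interval off the sorted medians.
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    medians = sorted(
        median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    low = medians[int(tail * n_resamples)]
    high = medians[int((1 - tail) * n_resamples) - 1]
    return low, high


# Hypothetical delivery times in days.
delivery_days = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 9, 14]
low, high = bootstrap_median_ci(delivery_days)
print(f"median = {median(delivery_days)}, 95% interval = [{low}, {high}]")
```

Saying “the typical delivery takes between this and that many days” is less precise than a single number, but it is a more honest summary, and it is usually easier for people to trust.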

Finally! Some Analysis

The whole pipeline process must be designed around the analysis you would like to perform and present. Visualization is also an important goal, and I can never stress enough how important it is to present data in a good way.

A good example of what you shouldn’t do

You can use the fanciest models, the latest convolutional neural network and get the best possible results, but if you’re unable to communicate them effectively you’ll have a tough time convincing people of their value.

Besides, most of a business can be run efficiently with very simple metrics, and more advanced models are usually left for production or one-shot analyses. Just a few of them will ever enter the pipeline, and it is good practice to put them as close as possible to the last step.

The reason is that you will change and tweak them many times, and the less they are ingrained in the pipeline, the better.

After the analysis the process restarts: it is a loop. You’ll find something interesting and you’ll want to dig deeper. After discovering a new trend or correlation you’ll want to monitor it constantly, and a new process in the pipeline is born.

When you reach this step you’ll congratulate yourself for all the hard work spent on building a pipeline!
