Avi's Virtual Enclave

Common Pitfalls in Data Science

When I first started learning data science, I focused a good amount of attention on learning the finer points of machine learning to answer all the theoretical questions that tend to come up in interviews. To practice with those concepts, I primarily tinkered with building models on standard datasets from kaggle.com, such as the dataset of Titanic survivors. And since a lot of machine learning algorithms are already implemented in libraries like scikit-learn, it felt like “plug and chug” at first.

As I attended more talks and started my first paid roles, I realized that building machine learning models is about a lot more than just knowing the algorithms and which one to use. And when I only focused on the algorithmic aspect of it, there were a number of key factors that I could overlook.

Here, I want to discuss some of the common pitfalls I’ve observed, both to build awareness of them and to avoid them going forward.

Skewed Distributions

It’s pretty common to look at what happens “on average” for a number of datasets. And for most people, that equates to the arithmetic mean.

I currently work in financial technology (fintech), so that often means handling data for monetary transactions; and if there’s one thing I learned, it’s that monetary data is astoundingly skewed by outliers. On any financial exchange, there are “whale traders” who spend 100x+ as much as other traders, and those amounts pull the mean upward. In other words, this small proportion of traders handles such large amounts of capital that it affects the behavior of the entire population.

One alternative to using the mean is to use the median, which is robust to this type of imbalance. It’s the exact same reason that house values are reported by the median as well.
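
As a rough illustration of that robustness, here is a minimal sketch using Python's standard `statistics` module. The trade sizes are made-up numbers chosen only to mimic nine ordinary traders plus one whale; they are not real exchange data.

```python
from statistics import mean, median

# Hypothetical trade sizes in dollars (illustrative values, not real data):
# nine ordinary traders and one "whale" trading far more than the rest.
trade_sizes = [100, 120, 90, 110, 95, 105, 130, 85, 115, 50_000]

avg = mean(trade_sizes)    # dragged far upward by the single whale
mid = median(trade_sizes)  # stays close to what a typical trader spends

print(f"mean:   {avg}")
print(f"median: {mid}")
```

With these numbers the mean lands well above every ordinary trade, while the median sits in the middle of them.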

Another technique I’ve used is to consider the logarithm of the value rather than the value itself, since that mathematically compresses the distance between larger values. For instance, the distance between $1,000 and $1,000,000 is 1000x the distance between $1,000 and $1, but when taking the logarithm of all the values, the two distances are the same.
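
The arithmetic in that example can be checked with a short sketch. It assumes base-10 logarithms for readability; any base gives the same conclusion, since changing the base only rescales both gaps by the same factor.

```python
import math

small, middle, large = 1, 1_000, 1_000_000

# Gaps on the raw dollar scale.
raw_gap_low = middle - small    # $1 to $1,000
raw_gap_high = large - middle   # $1,000 to $1,000,000

# Gaps after taking the base-10 logarithm of each value.
log_gap_low = math.log10(middle) - math.log10(small)
log_gap_high = math.log10(large) - math.log10(middle)

print("raw gap ratio:", raw_gap_high / raw_gap_low)
print("log gaps:", log_gap_low, log_gap_high)
```

On the raw scale the upper gap is 1000 times the lower one, while on the log scale both gaps come out to about 3.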

Data Leakage

Data leakage occurs when models are trained on data that is not accessible at the point when a prediction is made using the model. This can be data about the target variable, or data about the holdout set that’s used to evaluate the quality of the model.

For instance, loan and job applications have several stages in their respective processes. And if predictions on applicant quality are made at Stage 2, then it’s not appropriate to use data from Stage 3.

The consequences generally show up in model performance. A leaky model can score well during training and evaluation, then deliver poor results once it’s deployed to production, where the leaked information is no longer available.

This article documents more examples of data leakage. One option for minimizing data leakage is to timestamp data points, pinpoint the time of the model’s prediction, then remove data from after that timestamp.
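
The timestamp approach might look something like the sketch below. The record layout, field names, dates, and the `records_available_at` helper are all hypothetical, invented here to mirror the multi-stage application example; a real pipeline would have its own schema.

```python
from datetime import datetime

# Hypothetical records for one applicant (field names and dates are assumptions).
records = [
    {"stage": 1, "field": "application_form", "timestamp": datetime(2021, 3, 1)},
    {"stage": 2, "field": "phone_screen_notes", "timestamp": datetime(2021, 3, 8)},
    {"stage": 3, "field": "final_interview_score", "timestamp": datetime(2021, 3, 20)},
]

# The moment at which the model is asked for its prediction (during Stage 2).
prediction_time = datetime(2021, 3, 10)


def records_available_at(records, cutoff):
    """Keep only the records that already existed at the cutoff time."""
    return [r for r in records if r["timestamp"] <= cutoff]


usable = records_available_at(records, prediction_time)
print([r["field"] for r in usable])
```

Here the Stage 3 record is dropped because it is timestamped after the prediction, so it never reaches training or inference.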

Data leakage is an issue because we can’t truly predict the future if we’re using data from the future. In other disciplines, that would be called “cheating”.

Personal Bias

Oftentimes, there’s pressure to deliver certain results. Nobody likes to deliver bad news.

Consequently, that can manifest as subconscious bias towards interpreting the data in ways that deliver good results. This can happen even if the data scientist in question has the best intentions and is aware of all the other pitfalls.

Theranos was an extreme example of this, where they regularly cherry-picked data in order to pass quality control tests.

As a data scientist, it’s important to support the long-term interests of the organization over the short-term interests of any single project. And to do that requires the ability to take a stand against project stakeholders.

Data scientists can be thought of as advisors who base their recommendations on numerical patterns. And the best advisors tell the truth, rather than what the interested party wants to hear.

Parting Thoughts

Because very few people will (or even can) check your work thoroughly, I’ve noticed a tendency for numbers reported by data scientists to be trusted almost blindly, even by other data scientists. As a colleague of mine put it, “who will guard the guards?”

There’s ample opportunity for data scientists to drive the narrative within an organization. But that opportunity rests on a lot of trust.

Ultimately, data science is about storytelling, and the most convincing stories are ones without plot holes.