3 Data Problems You Might Not Even Know You Have (and How to Fix Them)
Photo by Carlos Muza on Unsplash

According to the Economist and industry leaders such as Andrew Ng and Google’s CEO Sundar Pichai, data has become the new oil and machine learning the new electricity.

This has encouraged organizations to launch their own data science initiatives to make sense of the treasure trove of data they’ve built up over the years and apply it in profitable ways. However, in many cases there is so much data that organizations struggle to assess whether they are holding oil or mud.

This short article will outline how to solve three core data problems and turn your organization into a value-generating data refinery.

Many data problems stem from organizations using inputs that are either ill-structured for machine learning or simply don’t contain any predictive power.

For example, if you are not consistently collecting the same customer data (e.g. occupation), machine learning models cannot learn consistent relationships between the inputs and the target you are predicting (e.g. sales).
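One quick way to spot inconsistent collection is to measure how often each field is actually filled in. The snippet below is a minimal sketch in plain Python; the `fill_rates` helper and the sample customer records are hypothetical and only there to illustrate the idea.

```python
def fill_rates(records, fields):
    """Return the fraction of records that hold a usable value for each field."""
    total = len(records)
    rates = {}
    for field in fields:
        # A missing key, None, or an empty string all count as "not collected".
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        rates[field] = present / total if total else 0.0
    return rates


# Hypothetical customer records, made up for illustration.
customers = [
    {"age": 34, "occupation": "engineer", "sales": 120.0},
    {"age": 51, "occupation": None, "sales": 80.0},
    {"age": 29, "sales": 200.0},
    {"age": 42, "occupation": "teacher", "sales": 95.0},
]

rates = fill_rates(customers, ["age", "occupation", "sales"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} filled")
```

A field with a low fill rate, like `occupation` here, is a candidate for either fixing at the point of collection or leaving out of the model.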

It might also be that you are collecting data that isn’t relevant and therefore doesn’t need to go into the model. It might not be immediately apparent which data sources are irrelevant, but exploring the feature relevance of your models can help you identify and eliminate those sources.

Although machine learning models are good at handling many dimensions, increasing the signal-to-noise ratio of your features makes training more efficient and lets the model extract more predictive power from the data, which ultimately means better accuracy.
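As a rough first pass, before looking at a trained model's own feature relevance, you can rank features by how strongly they correlate with the target. The sketch below assumes numeric features and uses plain Pearson correlation, which only captures linear relationships; the feature names and values are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target):
    """Sort feature names by absolute correlation with the target, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: "visits" tracks sales closely, "shoe_size" does not.
features = {
    "visits": [1, 2, 3, 4, 5, 6],
    "shoe_size": [42, 38, 44, 38, 42, 41],
}
sales = [10, 19, 31, 42, 50, 61]

ranking = rank_features(features, sales)
for name, score in ranking:
    print(f"{name}: |r| = {score:.2f}")
```

A low score is a hint rather than a verdict: a feature can matter through a non-linear effect or an interaction, which is why model-based relevance is still worth checking.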

Dashboards are extremely useful for getting a feel for your data problems and deriving model insights, but they can easily do the opposite if they are not designed and used properly. More is not better, which is why it is vital to be selective about what you choose to visualize and how refined those outputs will be.

For example, viewing summaries of your data (e.g. histograms) is more useful than seeing a screen of tabs. However, many summary statistics offer the same insight, which is why you only need to visualize the one you find most useful.

It’s important to remember that dashboards are reactive by nature. They visualize what has been collected and what the previous state of that data was. Yes, dashboards can unlock insights by showing you correlations and patterns, but this sort of analysis can also keep you from unlocking the greater potential of the data you collect.

When you have visualizations in front of you, it is unlikely you will be able to see beyond them to make greater connections. This is where the power of machine learning comes in, helping you detect what further actionable insights your goldmine of data holds.

This can be achieved by visualizing machine learning models through performance statistics and interpretability metrics. There are many such metrics available, which is why you should stick to the ones that offer distinct but complementary insights. For example, Mind Foundry provides a weighting of the model’s feature relevance as well as partial dependence plots of these features, which highlight how their values impact the forecasts.
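To make the idea of partial dependence concrete, here is a minimal sketch of the generic technique: force one feature to each value on a grid, keep everything else as observed, and average the model's predictions. This is not Mind Foundry's implementation; `toy_model` is a hypothetical stand-in for a trained model, and the feature names are invented.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value in turn."""
    curve = []
    for value in grid:
        # Copy each row so the original data is left untouched.
        modified = [{**row, feature: value} for row in rows]
        average = sum(predict(row) for row in modified) / len(modified)
        curve.append((value, average))
    return curve


def toy_model(row):
    """Hypothetical stand-in for a trained model's predict function."""
    return 2.0 * row["visits"] + 0.5 * row["tenure_years"]


rows = [
    {"visits": 1, "tenure_years": 2},
    {"visits": 3, "tenure_years": 4},
    {"visits": 5, "tenure_years": 6},
]

curve = partial_dependence(toy_model, rows, "visits", grid=[0, 1, 2])
for value, average in curve:
    print(f"visits={value}: average forecast {average:.1f}")
```

Plotting the resulting pairs gives the partial dependence curve for that feature: a steep curve suggests the forecast is sensitive to it, a flat one suggests it is not.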

Collecting irrelevant inputs is a curse for machine learning models, but so is failing to leverage the predictive power of relevant data!

The best way to find out if your data has any predictive power is to ask questions and train a machine learning model to provide you with answers. It might be that only one of your questions can be answered, which is why you shouldn’t discard the data after your first attempt.

For example, if you are trying to predict multiple customer subscription outcomes (e.g. cancellation, non-renewal, renegotiation, renewal), you might not have enough predictive power in your customer data to accurately forecast each outcome. However, by breaking down the question and grouping the customers (those who cancelled or didn’t renew in a “churn” class, versus remaining customers in a “didn’t churn” class), you will most likely improve the performance of your model because the data problem has been simplified.
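In practice, this simplification is just a relabeling step before training. The sketch below assumes the four outcome names from the example and treats renegotiation as "didn't churn", following the grouping described above; the label strings and the `to_binary_label` helper are hypothetical.

```python
from collections import Counter

# Assumed grouping: cancellations and non-renewals count as churn,
# everyone else (renewals and renegotiations) counts as remaining.
CHURN_OUTCOMES = {"cancellation", "non_renewal"}
KNOWN_OUTCOMES = CHURN_OUTCOMES | {"renegotiation", "renewal"}


def to_binary_label(outcome):
    """Collapse a detailed subscription outcome into churn / did_not_churn."""
    if outcome not in KNOWN_OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return "churn" if outcome in CHURN_OUTCOMES else "did_not_churn"


# Hypothetical outcome column for five customers.
outcomes = ["renewal", "cancellation", "renegotiation", "non_renewal", "renewal"]
labels = [to_binary_label(o) for o in outcomes]
counts = Counter(labels)
print(counts)
```

Checking the class counts after relabeling is worthwhile, since a heavily imbalanced binary target brings its own modeling problems.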

The same approach can be applied in life sciences, where you might want to predict multiple patient outcomes. A simplified question could be, for example, whether the clinical trial was positive or negative for the patient.

Using machine learning tools to augment this process is a cost-effective way to ask questions and refine them in real time, efficiently validating the predictive power of your data and tackling the biggest data problems head-on.

Mind Foundry is an Oxford University company. Operating at the intersection of innovation, research, and usability, we empower teams with AI built for the real world.
