Addressing machine learning's dirty little secret

The following is a guest article from Eric Johnson, CIO of Talend.

If an organization is using machine learning and hoping to be profitable, fair warning: A garbage in, garbage out approach could plague them if they're not taking care of the data fed into algorithms.

Machine learning + bad data = Public enemy No. 1.

Investment in machine learning and artificial intelligence is growing, as 60% of organizations are expected to increase their investments in AI by 50%, according to a 2018 study by Constellation Research.

Despite the popularity, projects stall or are deemed incomplete because of challenges that organizations tend to ignore, including data quality. If an organization is working with bad data, its machine learning is going to make bad decisions too.

What may seem like an obvious issue to address is also a huge obstacle that impacts the accuracy needed for machine learning and automation.

Ensuring data quality isn't just an IT-driven initiative, but rather something that all lines of business need to have a vested interest in. There should be a partnership between IT and business users to ensure all data is trustworthy, accurate and safe for use in algorithms.

Getting data clean, trustworthy and healthy is a never-ending process — once it starts, it should never stop, but rather evolve.

False starts

As machine learning becomes more pervasive in decision making, its future is critically dependent on data quality.

Despite this, many organizations still struggle with measuring the success of machine learning algorithms and confirming that the data they feed them is good. Without a process to ensure clean, high-quality data they can trust, organizations have nothing to back up the reports and insights behind their data-driven decisions.

Organizations are beginning to leverage machine learning more frequently to support digital transformation. But, many times, they fail to realize that if the data they're using to feed the machine learning is dirty or untrusted, they're actually more likely to slow or even stop digital transformation initiatives.

Many data ecosystems today are composed of multiple components and silos. Meanwhile, externally available data is exploding across global networks. This means organizations need to design flexibly and modularly, maintaining a view of the wider ecosystem.

Without first creating an overview of data silos and external data resources that highlights inefficiencies, an organization will have an organization-wide machine learning problem right out of the gate.

Data quality programs are hard to drive cross-functionally because the same data assets are often used by several functions at once. This makes it difficult to come to a consensus on data ownership.

Additionally, organizations attempt to address the data quality challenge as a technology challenge, failing to put an equal focus on the process and the people.

Data quality is not a discrete project with a start and an end date, but rather, it is an ongoing program that needs to be prioritized and staffed to drive long-term success.

Proper steps

Data quality is just one piece of the larger machine learning puzzle. Organizations also need to define the data they have, determine and assign ownership, and sort out data governance issues.

To make machine learning work properly, the strategy needs to be comprehensive. It's important to grasp where you currently are, capture opportunities for improvement and formulate a plan to mature from there.

For example, data modeling is a great starting point. It highlights the most critical high-level concepts underlying the data from the stakeholders' standpoint and precisely documents the contents, data types and rules governing individual columns in a database.

It is an integral part of the planning stage for machine learning projects.
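
As a rough illustration of what documenting the contents, data types and rules of individual columns can look like in practice, here is a minimal Python sketch. The table, the column names and the rules are all hypothetical, chosen only for the example; a real data model would come out of conversations with the business stakeholders who own the data.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ColumnSpec:
    """One documented column: its name, expected type and a business rule."""
    name: str
    dtype: type
    rule: Callable[[Any], bool]
    description: str


# Hypothetical model of a customer table; every name and rule is illustrative.
CUSTOMER_MODEL = [
    ColumnSpec("customer_id", int, lambda v: v > 0, "Unique positive identifier"),
    ColumnSpec("email", str, lambda v: "@" in v, "Contact address containing '@'"),
    ColumnSpec("monthly_spend", float, lambda v: v >= 0.0, "Spend, never negative"),
]


def validate_row(row: dict, model: list) -> list:
    """Return a list of human-readable violations for one record."""
    problems = []
    for col in model:
        # A column that is absent or None counts as missing.
        if row.get(col.name) is None:
            problems.append(f"{col.name}: missing value")
            continue
        value = row[col.name]
        # Check the documented type before applying the business rule.
        if not isinstance(value, col.dtype):
            problems.append(f"{col.name}: expected {col.dtype.__name__}")
            continue
        if not col.rule(value):
            problems.append(f"{col.name}: violates rule ({col.description})")
    return problems
```

Calling `validate_row` on each incoming record before it reaches a training set gives both IT and business users the same written definition of what "good" data means for that table.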

From there, work to solve the data governance issues. A modern approach should be able to cope with the velocity and variety of data internally and externally.

Organizations need to set up a framework to provision trusted data; otherwise, they will see none of the benefits.

This framework should organize people, processes and technologies and create a paradigm of how data is managed, ultimately bringing clarity, transparency and accessibility to all data assets.

To get on track with data quality, organizations and all lines of business need to provide context on why data is important — for example, how data impacts key business operations — and what the current data quality looks like.

This helps key stakeholders understand why data quality is important and the scope of the challenges the organization is facing related to it. Use this knowledge to build a team of cross-functional stakeholders who are committed to the data quality program and can drive it within their respective functions.
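
One simple way to show stakeholders what the current data quality looks like is to put a few numbers on it. The Python sketch below is an assumption about how such a snapshot might be computed, not a prescribed method: the function name, the choice of metrics (completeness per field and duplicate rate on a key) and the sample records are all hypothetical.

```python
def profile_quality(records, required_fields, key_field):
    """Summarize completeness per field and the duplicate rate on a key."""
    total = len(records)
    if total == 0:
        return {"rows": 0, "completeness": {}, "duplicate_rate": 0.0}

    completeness = {}
    for field in required_fields:
        # A value counts as filled unless it is absent, None or an empty string.
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        completeness[field] = filled / total

    # Every record beyond the first with the same key counts as a duplicate.
    keys = [r.get(key_field) for r in records]
    duplicates = total - len(set(keys))

    return {
        "rows": total,
        "completeness": completeness,
        "duplicate_rate": duplicates / total,
    }


# Illustrative sample: one blank email, one missing email, one repeated id.
sample = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": None},
]
report = profile_quality(sample, ["id", "email"], "id")
```

A report like this, tracked over time, gives the cross-functional team a shared baseline to improve against.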

This is not an IT program. This is an enterprise program championed across the business.

The last thing organizations should focus on when implementing machine learning is the technology. An agreed-upon data program is the first priority and will help everyone understand what they will ultimately feed into the machine learning solution.

When done right, machine learning has incredible power. Implementing it can help a business identify previously unknown correlations in data that can provide a competitive advantage — for example, the propensity to purchase more or chance of churn before it happens.

Getting predictive insights before an event happens allows organizations to take the right steps to ensure the overall success of every program.