Four Steps To Keep Data Scientists In Their Happy Place

Imagine someone has thrown the pieces of five different jigsaw puzzles into a single pile. Before you can start solving the puzzles, you must first separate all the pieces. Then you discover that key pieces are missing. You know they are somewhere in the house, but you must go hunting for them.

Data scientists working in traditional IT environments face this kind of challenge every day. Before they can even begin their analytical work, they must complete four basic tasks:

• Gather data: hunting for it across multiple systems.

• Validate data: ensuring they have the right data from the right data set, and that it is accurate, complete and up to date.

• Connect data: structuring and combining disparate data types in ways that answer business questions consistently and accurately.

• Select data: creatively combining data sets in the search for powerful predictive models (also known as feature engineering).

This process can be painstaking, repetitive and often highly manual: A survey of data scientists found that 76% thought of data preparation as the least enjoyable part of their job.

As a software engineer at Facebook, I faced exactly this challenge as we tried to stitch together predictive scores with customer profile information to optimize ad delivery. ETL-based data-wrangling solutions could help, but only to a certain extent. So I kept asking myself what capabilities would let data scientists locate the latest, greatest data instantly and start modeling it right away. Here are the four steps I believe are required:

1. Gather data instantly.

Finding data sources across multiple systems, each with its own protocol, is painstaking. Some businesses consolidate data in data lakes or enterprise data warehouses. However, both approaches are of limited help to data scientists gathering the specific data they need. Data lakes lack any inherent organization, so finding data is very difficult. By contrast, enterprise data warehouses store data with such rigid order that data becomes hard to find without a deep understanding of the warehouse’s design.

An alternate path begins with a schemaless database approach, allowing raw, granular data to be ingested at speed without requiring complex upfront ETL work. However, the data obviously requires some kind of order if data types are going to be transparent (and usable) for data scientists. So how do you balance these opposing requirements?

I believe it is possible to create a data structure that is both highly flexible and transparent. In the case of customer data, that means storing data the same way marketers think about it. You can make customer IDs serve as the prime logical unit, to which you associate a clearly defined array of data types that help define that customer. This approach makes it easy for data scientists to find and precisely access the data sets and attributes they need.
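As a minimal sketch of this idea (the field names and in-memory dictionary are hypothetical illustrations; a real store would be a database), customer IDs key a flexible map of typed record lists:

```python
# Sketch: a schemaless, customer-ID-keyed store.
# Data types ("email", "purchase", ...) are illustrative assumptions.
customer_store = {}

def ingest(customer_id, data_type, record):
    """Attach a raw record of a given type to a customer ID."""
    customer = customer_store.setdefault(customer_id, {})
    customer.setdefault(data_type, []).append(record)

def fetch(customer_id, data_type):
    """Pull exactly the data set a data scientist needs."""
    return customer_store.get(customer_id, {}).get(data_type, [])

# Raw, granular data is ingested as-is, with no upfront ETL:
ingest("cust-42", "email", {"address": "jane@example.com"})
ingest("cust-42", "purchase", {"sku": "A1", "amount": 19.99})
```

The point is not the dictionary itself but the shape: flexible enough to accept new data types at speed, yet organized around the one unit every question ultimately comes back to, the customer.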

2. Validate data instantly.

Businesses are constantly collecting new types of data and introducing new data sources. However, these changes are often invisible to data scientists, making their data validation process a nightmare. They must be on guard for such changes, or risk producing faulty findings.

However, you can sidestep this risk with a data store that can separate out business definitions from the raw data provided by source systems. For example, the business definition of a valid purchase typically requires two different pieces of source data: purchase information, of course, but also return information. However, this definition may change over time. If data scientists are not aware of these underlying definitional changes, they may end up comparing apples and oranges.

On the other hand, if the business definition of a valid purchase remains in a separate layer, it is easy to build in automatic warnings to let data science teams know that there has been a change in the business definition so that they can take corrective measures accordingly, and so that results remain consistent over time.
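To make the separate-layer idea concrete, here is a hedged sketch (the version numbers, rule, and `valid_purchases` helper are all hypothetical): business definitions live in their own versioned layer, and consumers are warned automatically when a definition changes out from under them.

```python
import warnings

# Hypothetical sketch: business definitions sit in a versioned layer,
# separate from the raw purchase and return records they interpret.
DEFINITIONS = {
    "valid_purchase": {
        "version": 2,
        # A purchase is valid if it was never returned.
        "rule": lambda purchase, returned_ids: purchase["id"] not in returned_ids,
    }
}

def valid_purchases(purchases, returned_ids, expected_version):
    defn = DEFINITIONS["valid_purchase"]
    if defn["version"] != expected_version:
        # Automatic warning: the definition changed underneath the analysis.
        warnings.warn(
            f"'valid_purchase' is now v{defn['version']}; "
            f"this analysis was built against v{expected_version}"
        )
    return [p for p in purchases if defn["rule"](p, returned_ids)]
```

A data scientist whose model was built against version 1 gets an explicit signal to re-validate, rather than silently comparing apples and oranges.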

3. Connect data instantly.

Data scientists must be able to work with data generated by multiple systems using multiple models that cannot “talk” to each other. This means the data science team must structure disparate data sets in ways that logically connect them before any business analysis is possible.

Data scientists can reduce the volume and complexity of customer data to two essential dimensions: 1) attributes that define an individual customer, including profile information such as an email or a physical address, as well as demographic data (gender, neighborhood, age, etc.); and 2) customer behaviors, such as purchases, store visits, web browsing history, etc., each of which can be associated with one of those individual customers.

With this logic in place in the data store, the data scientist can instantly connect data sets in a logical way, with no further data manipulations required.
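The two-dimension model above can be sketched as follows (the class and field names are illustrative assumptions, not a prescribed schema): attributes define the customer, and behavior events from any source system attach back to that customer by ID.

```python
from dataclasses import dataclass, field

# Sketch of the two essential dimensions of customer data.
@dataclass
class Customer:
    customer_id: str
    attributes: dict = field(default_factory=dict)  # email, age, etc.
    behaviors: list = field(default_factory=list)   # purchases, visits, etc.

def connect(customers, events):
    """Attach behavior events from disparate systems to customers by ID."""
    by_id = {c.customer_id: c for c in customers}
    for event in events:
        customer = by_id.get(event["customer_id"])
        if customer:
            customer.behaviors.append(event)
    return by_id

# Events from two systems that cannot "talk" to each other still connect:
jane = Customer("c1", attributes={"email": "jane@example.com"})
connect([jane], [
    {"customer_id": "c1", "type": "store_visit", "store": "SoHo"},
    {"customer_id": "c1", "type": "purchase", "amount": 19.99},
])
```

Because both source systems agree on nothing except the customer ID, the ID is the join key that makes the connection logical rather than ad hoc.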

4. Select data on the fly.

When building predictive models, data scientists can’t know exactly which data sets are relevant until they begin interacting with them. They do their best work when they can quickly interact with and creatively combine a wide variety of data sets until they find the best possible combination — a process known as feature engineering.

The capabilities we have discussed above solve this problem, too. The extreme flexibility of the model makes it easy to get all the relevant data into one place. And the business-specific order makes it possible to select any number of variables and iterate potential combinations very quickly.

When it comes to business data, you always want to cut out the intermediary and put the right data into the right hands at the right time. When you can do this for data scientists, you don’t just remove the cost and complexity of data prep; you also can unleash the creativity of data scientists, making them both happier and more productive.

Originally published by the Forbes Technology Council
