What is Data Preparation?
Data preparation is the process of collecting data from a number of (usually disparate) data sources, and then profiling, cleansing, enriching, and combining that data into a derived dataset for use in a downstream process. A typical use case is combining data from disparate sources for use in a BI tool or Excel. Data preparation is also often referred to as data prep or data wrangling, although data wrangling is more precisely a subset of data prep, because it focuses on shaping a single raw dataset into a consumable dataset for analytical use.
Self-service data preparation is data preparation designed to be done by non-technical users such as data analysts or power users. By providing a visual, interactive interface with built-in algorithms that help profile or join datasets, self-service data prep tools empower end users to perform these tasks themselves instead of relying on scarce IT resources. IT teams also benefit, because they can focus on more strategic initiatives and do not get bogged down in iterative business requests for data.
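The stages named above (profile, cleanse, enrich, combine) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library. The sample rows, column names, and function names are all hypothetical, and a real data prep tool would do far more at each stage.

```python
from collections import Counter, defaultdict

# Hypothetical sample data standing in for two disparate sources.
crm_rows = [
    {"customer_id": "1", "name": " Alice Smith ", "country": "us"},
    {"customer_id": "2", "name": "Bob Jones", "country": "US"},
    {"customer_id": "2", "name": "Bob Jones", "country": "US"},  # exact duplicate
    {"customer_id": "3", "name": "Carla Diaz", "country": ""},
]
order_rows = [
    {"customer_id": "1", "amount": "120.50"},
    {"customer_id": "2", "amount": "80.00"},
    {"customer_id": "1", "amount": "19.50"},
]


def profile(rows, column):
    """Profile: count how often each raw value occurs in a column."""
    return Counter(row[column] for row in rows)


def cleanse(rows):
    """Cleanse: trim whitespace, normalize case, drop exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        fixed = {
            "customer_id": row["customer_id"].strip(),
            "name": row["name"].strip(),
            "country": row["country"].strip().upper() or "UNKNOWN",
        }
        key = tuple(fixed.items())
        if key not in seen:
            seen.add(key)
            cleaned.append(fixed)
    return cleaned


def enrich_and_combine(customers, orders):
    """Enrich and combine: attach the total order amount to each customer."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["customer_id"]] += float(order["amount"])
    return [
        {**c, "total_amount": round(totals.get(c["customer_id"], 0.0), 2)}
        for c in customers
    ]


country_profile = profile(crm_rows, "country")
prepared = enrich_and_combine(cleanse(crm_rows), order_rows)
```

Here `country_profile` exposes the inconsistent raw values ("us", "US", and an empty string) that the cleansing step then normalizes, and `prepared` is the derived dataset that would be published downstream.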
Common Use Cases for Self-Service Data Preparation
A common misconception is that data prep is only relevant for analytical or BI use cases. Below are some of the more common use cases and patterns we see emerging.
Self-service data preparation for analytics at scale
Traditional analytical use cases mostly involved analyzing historical data (e.g. retail transactions) to answer a series of predefined questions, for example “how much did we sell last month?” Modern analytical styles are more exploratory. For example, “what caused the spike in online sales on Wednesday last week? Was it raining?” Approaching this style of analytics with our old techniques, tools, and people has led to a scenario where 80% of the analytical effort is spent on finding, cleansing, and preparing the dataset for analytical purposes.
Self-service data preparation tools overcome this by allowing business users, who have the right business context and data understanding, to perform this task themselves. The resulting dataset usually gets published to be used in a BI tool or Excel. Self-service analytics starts with self-service data prep.
Accelerating value from data lakes
Big data and Apache Hadoop have brought tremendous advances both in the ability to process massive volumes of data and in doing so at a cost efficiency not matched by traditional technologies.
While adoption of big data technology is growing very fast, the business success it has delivered has been limited, according to industry analysts. A key challenge is the very technical nature of these environments and the lack of skills. Self-service data preparation tools can bring the same business user self-service to the data lake, provided the tool can access this data and can do so at the data volumes required. The resulting datasets often get published to be used in a BI tool or Excel. Increasingly, though, data scientists use this process as well and publish the dataset for use in a notebook or analytical workbench.
360-degree views and mastering customer, product, employee data
Beyond analytics, many organizations are beginning to look for opportunities to use the acumen and context of business users to perform more advanced data tasks, like combining data from multiple sources to create a single view of a customer, product, or employee. Master Data Management platforms are notoriously hard and expensive to implement.
Self-service data preparation tools can assist if they have intelligent facilities built in that help match data attributes across disparate datasets and combine them. For instance, suppose one dataset has columns for FIRST NAME and LAST NAME and another dataset has a column called CUSTOMER that seems to hold a first and last name combined. Intelligent algorithms should be able to determine a way to match these and join the datasets to get a single view of the customer. The resulting data could still be used for BI, but it often goes into another application or a marketplace where the new dataset is used by external parties (e.g. a product catalog that is published for suppliers).
When replacing an older application with a new one, moving the data from the old to the new is often a massive undertaking, especially if the effort is the result of M&A or involves replacing three older systems with one new app. Relying on IT staff to do the migration by themselves leads to a very time-consuming process.
Self-service data preparation tools can add major value in this regard, and they draw on most of the capabilities previously covered. Intelligently matching common data such as customer or product is key. Profiling data to understand value distributions, and having intelligent ways to resolve problems or duplicates, are critical. Typical migration projects can be accelerated while also achieving higher data quality.
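The FIRST NAME / LAST NAME versus CUSTOMER example can be sketched as a key-normalization join. The following is a simplified assumption of how such matching might work, not the algorithm of any particular tool. The sample records, the extra `email` and `segment` columns, and the helper names are hypothetical, and a production matcher would likely add fuzzy or probabilistic matching on top of exact normalized keys.

```python
def normalize_name(text):
    """Lowercase and collapse whitespace so 'ann  LEE' and 'Ann Lee' compare equal."""
    return " ".join(text.lower().split())


def customer_key(value):
    """Build a match key from a combined name, also handling 'Last, First' order."""
    if "," in value:
        last, first = value.split(",", 1)
        value = f"{first} {last}"
    return normalize_name(value)


def join_on_name(contacts, accounts):
    """Join accounts to contacts by normalized full name.

    Returns (joined_rows, unmatched_accounts) so unmatched records can be reviewed.
    """
    index = {
        normalize_name(f"{c['FIRST NAME']} {c['LAST NAME']}"): c for c in contacts
    }
    joined, unmatched = [], []
    for account in accounts:
        contact = index.get(customer_key(account["CUSTOMER"]))
        if contact is None:
            unmatched.append(account)
        else:
            joined.append({**contact, **account})
    return joined, unmatched


# Hypothetical sample datasets.
contacts = [
    {"FIRST NAME": "Ann", "LAST NAME": "Lee", "email": "ann@example.com"},
    {"FIRST NAME": "Raj", "LAST NAME": "Patel", "email": "raj@example.com"},
]
accounts = [
    {"CUSTOMER": "ann  LEE", "segment": "retail"},
    {"CUSTOMER": "Patel, Raj", "segment": "wholesale"},
    {"CUSTOMER": "Mia Wong", "segment": "retail"},
]

joined, unmatched = join_on_name(contacts, accounts)
```

Returning the unmatched rows separately is a deliberate choice in this sketch: records that fail to match are exactly the ones a business user with the right context would want to inspect by hand.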
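As a rough illustration of profiling value distributions and resolving duplicates before a migration, consider the sketch below. The legacy records, column names, and the "keep the most complete record" rule are assumptions made for this example. Real migration projects typically need richer survivorship rules agreed with the business.

```python
from collections import Counter

# Hypothetical records exported from legacy systems.
legacy_customers = [
    {"id": "A-1", "email": "Ann@Example.com", "phone": "", "status": "active"},
    {"id": "B-7", "email": "ann@example.com ", "phone": "555-0100", "status": "Active"},
    {"id": "C-3", "email": "raj@example.com", "phone": "555-0111", "status": "inactive"},
]


def value_distribution(rows, column):
    """Profile a column: count occurrences of each trimmed, lowercased value."""
    return Counter(row[column].strip().lower() for row in rows)


def completeness(row):
    """Score a record by how many of its fields are non-empty."""
    return sum(1 for value in row.values() if value.strip())


def resolve_duplicates(rows, key_column):
    """Keep one record per normalized key, preferring the most complete one."""
    best = {}
    for row in rows:
        key = row[key_column].strip().lower()
        if key not in best or completeness(row) > completeness(best[key]):
            best[key] = row
    return list(best.values())


status_profile = value_distribution(legacy_customers, "status")
migrated = resolve_duplicates(legacy_customers, "email")
```

In this sample the two "Ann" records share an email address once case and whitespace are normalized, so only the record that also carries a phone number survives into `migrated`.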
Data marketplaces or data monetization
Many organizations want to find ways to monetize their data. In some cases, data could be the primary business (e.g. publishing product catalogs or performing financial analytics). Often this leads to a need to onboard data from external parties into your catalog, or to provide a facility for external parties to use the data you provide and easily combine it with some of their own data, for example to perform risk analysis.
Self-service data preparation tools can be a major enabler of these styles of operations. Giving the external party (customer, supplier) access to the self-service data preparation tool, with proper security and governance, can add tremendous value to both parties.