By: Piet Loubser
For many organizations, the thirst for better and faster data to support business decisions cannot be quenched with traditional IT-centric data integration tools, Excel, or BI tools, all of which lack critical capabilities. A self-service data preparation application that gives business analysts and data scientists a visual, interactive experience to find, prep, and publish data by themselves is a must-have. However, many data prep tools, while easy enough to use, lack the sophistication to work with large volumes of data. Product-enforced small data samples only go so far as a proxy for data irregularities; beyond that point, the work breaks down into a headache of multi-iteration workflows. Giving users the choice and control to define the interactive data window size based on use case, resource constraints, or time sensitivity is a powerful way to right-size your data prep solution and its cost of ownership. Adaptive Workload Management for data prep is a way to bring the right processing and interactive data window configurations to your organization.
Product-Defined Sample Sizes or User Choice?
Most of the self-service data prep tools on the market rely on sampling the dataset; the sample sizes are fixed and hard-wired into the product. The challenge with this approach is that the user is forced into an iterative loop: work on a small sample, fix only what you see, ask IT to code and run your script on the full dataset, re-sample, fix what you see now, run on the full dataset, re-sample… you get the picture.
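To see why a small fixed sample can hide problems, here is a rough back-of-the-envelope sketch in Python. It is not part of any product; the function name and the numbers are made up for illustration, and it assumes rows are sampled uniformly at random and independently (a reasonable approximation when the dataset is much larger than the sample).

```python
def chance_sample_misses(irregular_fraction: float, sample_rows: int) -> float:
    """Probability that a random sample contains no irregular rows at all.

    Assumption (illustrative only): each sampled row is drawn independently
    and is irregular with probability `irregular_fraction`.
    """
    if not 0.0 <= irregular_fraction <= 1.0:
        raise ValueError("irregular_fraction must be between 0 and 1")
    if sample_rows < 0:
        raise ValueError("sample_rows must be non-negative")
    # Each row is clean with probability (1 - p); all n rows must be clean.
    return (1.0 - irregular_fraction) ** sample_rows


# Hypothetical numbers: 1 malformed row per 100,000, viewed through a 10,000-row sample.
miss = chance_sample_misses(1e-5, 10_000)
print(f"Chance the sample shows no malformed rows: {miss:.1%}")
```

With these made-up numbers, the sample would show no malformed rows roughly nine times out of ten, so the problem would only surface once the script runs on the full dataset.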
A far better approach is to let users or administrators define a reasonable interactive data window size that matches the use case. If you are working on something where you need an output within the next day, you do not want to work on a small sample that requires three iterations to fix. You want it all in one go. And if you are working on a lower-priority use case, or one where you know the data well, then a subset can be a good approach and will not require all the supporting infrastructure.
Paxata Introduces Adaptive Workload Management
In our latest release, Paxata Fall 2018, we are releasing a new capability called Adaptive Workload Management (AWM). AWM allows you to define varying interactive data window sizes. With user control over the interactive data window, you can optimize for specific user groups, use cases, or infrastructure configurations.
An additional benefit of Adaptive Workload Management is the ability to automate your Paxata projects to run on the full dataset, either immediately or on a schedule, without requesting assistance from IT or admin staff. The required batch resources automatically spin up on your cluster for the duration of the job and then terminate, freeing those resources upon completion. This can result in massive cost benefits in terms of infrastructure consumption, especially when running in private cloud environments.