Adaptive Data Preparation™

While it seems every vendor in the data or Business Intelligence market uses the same language about self-service data preparation, Gartner recently published a Market Guide for Self-Service Data Preparation for Analytics which puts vendor tools in two categories:

  • Those with data prep capabilities specifically designed to support a core BI and analytics tool, and
  • Stand-alone data preparation solutions, which are independent of the analytic tools being used

With that definition in mind, there are other considerations which help distinguish various offerings:

  • Who is the primary user, and what skills do they need to have?
  • What sort of data is being worked on? How big, how varied, how complex?
  • How critical are automation, governance, and reusability?

Paxata is a stand-alone self-service data preparation solution designed for business analysts who work with highly varied datasets for multiple analytical tools and use cases – without coding, scripting, modeling or sampling.

Our approach is differentiated by the five pillars of Paxata Adaptive Data Preparation™:

  • Automated data integration capabilities which eliminate the need to write code
  • Semantic data quality exposed through a visually interactive graphical interface
  • Contextual enrichment to make AnswerSets™ richer for each analytic use case
  • Ad-hoc collaboration that promotes IT and Business sharing and visibility across every data prep project
  • Transparent governance, including data lineage and tracking, with the ability to replay, reuse or undo every step in a data prep project

What does “Adaptive” mean?
The notion of “adaptive” addresses two aspects of data preparation.

The first is that, with Paxata, business analysts can now adjust more rapidly to the iterative business requests that come in on a daily basis. It is very typical that, as data is being understood and analyzed, users need more data to answer their questions. So they go back to their analysts to repeat the data prep cycle again. At that point, analysts spend weeks and months in spreadsheets and home-grown data marts trying to combine their clean data with additional data from raw or outside sources, hoping that this next answer set will be what the business needs. Until now, that back-and-forth was the most painful part of every analytic exercise.

The second describes the most powerful aspect of the Paxata solution: the machine learning that leverages proven technologies from consumer search and social media, namely intelligent indexing, textual pattern recognition, and statistical graph analysis. By applying proprietary, patent-pending algorithms to the linguistic content of both structured and unstructured data, the Paxata solution automatically builds a comprehensive and flexible data model in the form of a graph, reflecting similarities and associations amongst data items. The system uses associations between the data to detect and resolve both syntactic and semantic data quality issues, rapidly improving the quality of large data sets. As more data sources are added, the expanded associations amongst the data are leveraged to further improve the quality of the data.
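To make the idea of an association graph concrete, here is a minimal sketch of one way such a graph could be built: link data items whose word tokens overlap, so that near-duplicate values (a common syntactic quality issue) end up connected. This is purely illustrative and is not Paxata's patented algorithm; the normalization, similarity measure, and threshold are all assumptions.

```python
from itertools import combinations

def tokens(value):
    # Illustrative normalization: lower-case, drop periods, split on whitespace.
    return set(value.lower().replace(".", "").replace(",", " ").split())

def jaccard(a, b):
    # Token-overlap similarity between two data items.
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def similarity_graph(values, threshold=0.3):
    # Build a graph whose edges link items that look like the same entity.
    edges = {}
    for a, b in combinations(set(values), 2):
        score = jaccard(a, b)
        if score >= threshold:
            edges.setdefault(a, []).append((b, score))
            edges.setdefault(b, []).append((a, score))
    return edges

# Hypothetical dirty column: three spellings of one company, plus an unrelated one.
companies = ["Acme Corp", "acme corp.", "Acme Corporation", "Globex Inc"]
graph = similarity_graph(companies)
```

Once the variants are connected in the graph, a cleanup step could collapse each connected component to a single canonical value.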

There are key capabilities required to be “Adaptive,” including:

Emergent Data Models

The first is a system that is largely schema- and model-free. Because the system imposes no fixed model or schema, the structure of the data can change as business requirements change, using algorithmic techniques to semantically type data, to semantically enrich data, and to semantically join data together.
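One simple way to picture semantic typing is pattern-based inference over a column's values, with no schema declared up front. The detectors and threshold below are assumptions for illustration only, not Paxata's actual technique:

```python
import re

# Illustrative semantic detectors; a real system would use many more signals.
SEMANTIC_TYPES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
    "date_iso": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def infer_semantic_type(column, min_match=0.8):
    # Return the semantic type matching most values; tolerate some dirty rows.
    for name, pattern in SEMANTIC_TYPES.items():
        hits = sum(1 for value in column if pattern.match(value))
        if hits / len(column) >= min_match:
            return name
    return "text"
```

Because the type is inferred from the data itself, adding a new column tomorrow requires no model change, only another pass of inference.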

Automatic relationship and pattern detection

The next important innovation is what we call automatic relationship detection. If we know what type of data we have and what other attributes we want to bring in, then the next thing we want to do is combine or fuse data on the fly to assemble an AnswerSet quickly. Because no models or join logic need to be determined in advance, information can be combined very quickly to build that AnswerSet. So, we can fuel accurate and complete analytics in tools like Tableau, Qlik, Excel, and others.
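The flavor of automatic relationship detection can be sketched as ranking column pairs across two tables by value overlap, so that likely join keys surface without any predeclared model. The tables, column names, and overlap threshold here are hypothetical:

```python
def candidate_join_keys(left, right, min_overlap=0.8):
    # Rank column pairs by the share of left-column values found on the right.
    candidates = []
    for lname, lvals in left.items():
        for rname, rvals in right.items():
            overlap = len(set(lvals) & set(rvals)) / len(set(lvals))
            if overlap >= min_overlap:
                candidates.append((lname, rname, overlap))
    return sorted(candidates, key=lambda c: -c[2])

# Hypothetical tables as column -> values mappings.
sales = {"cust_id": ["C1", "C2", "C3"], "amount": ["10", "20", "30"]}
crm = {"customer": ["C1", "C2", "C3", "C4"], "region": ["E", "W", "N", "S"]}
```

The top-ranked pair then becomes the suggested key for fusing the two sources into one AnswerSet.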

Contextual Data Enrichment

The next algorithmic innovation is to be able to actually enrich data. Once you know the meaning of your data, you may need additional data sets that provide attributes you would like to see in your data set, based on the context of the answer set that you’re trying to assemble. For example, if an analyst is working with geographic information, and they are doing market sizing, they might want to bring in population demographics from the US Census to add that context. So, enrichment is another key innovation that allows us to adaptively expand the AnswerSet we have to make our analytics perspective complete.
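In the market-sizing example above, enrichment amounts to joining a context attribute onto each record by a shared key. This is a minimal sketch with made-up figures; a real enrichment would pull actual US Census demographics:

```python
def enrich(rows, lookup, key, new_field):
    # Append a context attribute (e.g. population) to each row by key.
    return [dict(row, **{new_field: lookup.get(row[key])}) for row in rows]

# Hypothetical sales data and population figures (not real Census numbers).
sales_by_state = [{"state": "CA", "revenue": 120}, {"state": "NV", "revenue": 15}]
population = {"CA": 39_000_000, "NV": 3_100_000}

enriched = enrich(sales_by_state, population, "state", "population")
```

With the population column attached, the analyst can compute revenue per capita directly in their BI tool.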

Getting smarter over time

The final capability is what we call reinforcement learning, the interaction between humans and the algorithms of our computers, in order to get to the right AnswerSet as quickly as possible. Every time we build a project, every action is captured, creating a transparent governance “recipe” which can be reused, replayed, re-ordered or undone as needed.
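The recipe idea can be illustrated with a small sketch: every step is recorded as data, so the sequence can be replayed on new input, and stopping the replay early acts as an undo. The class and step names are hypothetical, not Paxata's implementation:

```python
class Recipe:
    """Minimal governance 'recipe': every action is recorded and replayable."""

    def __init__(self):
        self.steps = []  # (description, function) pairs, in order of capture

    def add(self, description, fn):
        self.steps.append((description, fn))

    def replay(self, data, upto=None):
        # Re-run the recorded steps; replaying only the first `upto` steps
        # effectively undoes everything after that point.
        for _, fn in self.steps[:upto]:
            data = fn(data)
        return data

recipe = Recipe()
recipe.add("trim whitespace", lambda rows: [r.strip() for r in rows])
recipe.add("uppercase", lambda rows: [r.upper() for r in rows])

raw = ["  acme ", " globex"]
```

Because steps are just recorded transformations, reordering or reusing a recipe on a fresh dataset is a matter of replaying the list.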