Adaptive Data Preparation™

What does “Adaptive” mean?
The notion of “adaptive” addresses two aspects of data preparation.

The first is that, with Paxata, business analysts can adjust more rapidly to the iterative business requests that come in on a daily basis. Typically, as data is understood and analyzed, business users find they need more data to answer their questions, so they go back to their analysts to repeat the data prep cycle. At that point, analysts spend weeks or months in spreadsheets and home-grown data marts trying to combine their clean data with additional data from raw or outside sources, hoping that this next answer set will be what the business needs. Until now, that back-and-forth was the most painful part of every analytic exercise.

The second aspect is the most powerful part of the Paxata solution: machine learning that builds on proven technologies from consumer search and social media, namely intelligent indexing, textual pattern recognition, and statistical graph analysis. By applying proprietary, patent-pending algorithms to the linguistic content of both structured and unstructured data, the Paxata solution automatically builds a comprehensive and flexible data model in the form of a graph that reflects similarities and associations among data items. The system uses these associations to detect and resolve both syntactic and semantic data quality issues, rapidly improving the quality of large data sets. As more data sources are added, the expanded associations are used to improve data quality further.
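
To make the graph idea concrete, here is a minimal sketch of grouping values by similarity. It is not the Paxata algorithm, which is proprietary; the function names `normalize` and `cluster_similar`, the 0.85 threshold, and the use of Python's standard `difflib` string matcher are all assumptions chosen for illustration. Values are nodes, an edge connects two values whose normalized spellings are similar enough, and each connected group is treated as one real-world item.

```python
import re
from difflib import SequenceMatcher


def normalize(value):
    """Lower-case the value and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def cluster_similar(values, threshold=0.85):
    """Group values whose normalized forms are similar (hypothetical helper).

    Pairs at or above the threshold are linked, and linked values are merged
    into connected components with a small union-find structure.
    """
    parent = list(range(len(values)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    norm = [normalize(v) for v in values]
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            score = SequenceMatcher(None, norm[i], norm[j]).ratio()
            if score >= threshold:
                parent[find(j)] = find(i)  # link the two groups

    clusters = {}
    for i, value in enumerate(values):
        clusters.setdefault(find(i), []).append(value)
    return list(clusters.values())


groups = cluster_similar(["Acme Corp", "ACME Corp.", "acme corp", "Globex"])
```

In this sketch the three spellings of the first company fall into one group and "Globex" stays on its own. A production system would likely need smarter similarity measures and an index to avoid comparing every pair.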

There are key capabilities required to be “Adaptive,” including:

Emergent Data Models

The first is a system that is largely schema-free and model-free. Because the system does not depend on a fixed schema or model, the structure of the data can change as business requirements change. Algorithmic techniques semantically type the data, semantically enrich it, and semantically join it together.
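
As a rough illustration of semantic typing, the sketch below guesses what a column contains by looking at its values rather than at a declared schema. The pattern table, the type names, the 0.8 threshold, and the function `infer_semantic_type` are hypothetical; an actual system would rely on much richer detection than a handful of regular expressions.

```python
import re

# Hypothetical patterns, checked in order (more specific types first).
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "number": re.compile(r"^-?\d+(\.\d+)?$"),
}


def infer_semantic_type(values, threshold=0.8):
    """Return the first type that matches at least `threshold` of the values."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return "unknown"
    for name, pattern in PATTERNS.items():
        hits = sum(1 for v in cleaned if pattern.match(v))
        if hits / len(cleaned) >= threshold:
            return name
    return "text"  # nothing matched well enough; treat as free text


zip_type = infer_semantic_type(["94105", "10001", "60614-1234"])
num_type = infer_semantic_type(["3.14", "42", "-7"])
```

Because the type is inferred from the values, a new column or a renamed one needs no model change. Note that a column of five-digit numbers would be typed as a ZIP code here, which is the kind of ambiguity a real system has to resolve with more context.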

Automatic Relationship and Pattern Detection

The next important innovation is what we call automatic relationship detection. Once we know what type of data we have and what other attributes we want to bring in, the next step is to combine, or fuse, data on the fly to assemble an AnswerSet quickly. With no pre-determined models or algorithms to set up, we can combine information very quickly and build that AnswerSet, which then fuels accurate and complete analytics in tools like Tableau, Qlik, Excel, and others.
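
One simple way to picture relationship detection is to score how much the values of two columns overlap and to propose the best-scoring pairs as join keys. The sketch below uses Jaccard overlap for that purpose. The technique, the helper names `jaccard` and `suggest_join_keys`, the 0.5 cutoff, and the sample tables are assumptions for illustration, not a description of how Paxata detects relationships.

```python
def jaccard(a, b):
    """Share of distinct values that two columns have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_join_keys(left, right, min_score=0.5):
    """Rank column pairs from two tables by value overlap (hypothetical helper).

    Each table is a dict mapping a column name to its list of values.
    """
    candidates = []
    for left_name, left_values in left.items():
        for right_name, right_values in right.items():
            score = jaccard(left_values, right_values)
            if score >= min_score:
                candidates.append((left_name, right_name, score))
    return sorted(candidates, key=lambda c: c[2], reverse=True)


# Made-up sample tables with differently named key columns.
orders = {"cust_id": [1, 2, 3, 4], "amount": [10, 20, 30, 40]}
customers = {"id": [1, 2, 3, 5], "name": ["Ann", "Bob", "Cy", "Di"]}
suggestions = suggest_join_keys(orders, customers)
```

Here the only suggestion is the pair `cust_id` and `id`, even though the two columns have different names, because the pairing is driven by the data rather than by a pre-built model.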

Contextual Data Enrichment

The next algorithmic innovation is the ability to enrich data. Once you know the meaning of your data, you may need additional data sets that supply attributes you would like to see, based on the context of the AnswerSet you are trying to assemble. For example, an analyst who is working with geographic information for market sizing might want to bring in population demographics from the US Census to add that context. Enrichment is therefore another key innovation: it lets us adaptively expand the AnswerSet so that our analytic perspective is complete.
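
The market-sizing example can be sketched as a lookup that adds reference attributes to each row. The function `enrich`, the field names, and the data are hypothetical, and the population numbers are placeholders for illustration only, not real Census figures.

```python
def enrich(rows, lookup, key, new_fields):
    """Add fields from a reference data set to each row (hypothetical helper).

    Rows without a match in the lookup keep all their original fields and
    receive None for each new field, so no row is dropped.
    """
    enriched = []
    for row in rows:
        extra = lookup.get(row.get(key), {})
        merged = dict(row)  # copy so the input rows are left unchanged
        for field in new_fields:
            merged[field] = extra.get(field)
        enriched.append(merged)
    return enriched


# Placeholder reference data standing in for a demographics source.
demographics = {
    "CA": {"population": 1000},
    "TX": {"population": 800},
}
stores = [
    {"store": "S1", "state": "CA"},
    {"store": "S2", "state": "NV"},
]
sized = enrich(stores, demographics, key="state", new_fields=["population"])
```

Keeping unmatched rows, rather than discarding them, makes gaps in the reference data visible to the analyst instead of silently shrinking the result.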

Getting Smarter Over Time

The final capability is what we call reinforcement learning: the interaction between humans and the algorithms in order to get to the right AnswerSet as quickly as possible. Every time we build a project, every action is captured, creating a transparent governance “recipe” that can be reused, replayed, re-ordered, or undone as needed.
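
The recipe idea can be illustrated with a small log of named steps. The sketch below covers only the capture, replay, re-order, and undo behavior described above, not the learning itself, and the `Recipe` class with its method names is a hypothetical design rather than the product's actual interface.

```python
class Recipe:
    """An ordered log of named, row-level preparation steps (hypothetical)."""

    def __init__(self):
        self.steps = []  # list of (name, function) pairs, in applied order

    def record(self, name, func):
        """Capture an action; func takes a row dict and returns a row dict."""
        self.steps.append((name, func))

    def undo(self):
        """Remove and return the most recent step, or None if there is none."""
        return self.steps.pop() if self.steps else None

    def move(self, old_index, new_index):
        """Re-order the recipe by moving one step to a new position."""
        self.steps.insert(new_index, self.steps.pop(old_index))

    def step_names(self):
        """List the captured steps, which serves as a readable audit trail."""
        return [name for name, _func in self.steps]

    def replay(self, rows):
        """Apply every step, in order, to copies of the raw rows."""
        result = [dict(row) for row in rows]
        for _name, func in self.steps:
            result = [func(dict(row)) for row in result]
        return result


recipe = Recipe()
recipe.record("trim name", lambda r: {**r, "name": r["name"].strip()})
recipe.record("uppercase state", lambda r: {**r, "state": r["state"].upper()})

raw = [{"name": " Ann ", "state": "ca"}]
prepared = recipe.replay(raw)
```

Because the raw rows are never modified, undoing a step amounts to dropping it from the log and replaying the rest, and the same recipe can be reused on a new data set.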