Adaptive Data Preparation™

Over the last ten years, technology has accelerated our ability to gather and share information in ways we never expected. Instead of bringing data to our desktops, the Web has enabled us to go to the data, wherever it is. Mobile devices make it possible for everyone to access information, wherever they are. And Cloud-based application delivery means we can work however we want to. These new capabilities have spawned a generation of individuals of all ages who want to exploit the data around us to work smarter, move faster and live better.

In response to that movement, innovations have taken place in most parts of the analytic supply chain. Big Data technologies like Hadoop and MongoDB have completely disrupted the data management world, making it possible to collect and store more raw data than we ever imagined, at a fraction of the cost and in a fraction of the time we once lived with. At the same time, a relentless focus on the end user has driven companies like Tableau and Qlik to deliver new ways for everyone in the business to digest and explore information how they want to. With all of this at our fingertips, why do so many organizations still rely on intuition or gut-driven decisions?

Adaptive Data Prep

[Figure: The modern Adaptive Data Preparation platform]

Data preparation is the next big area of innovation in enterprise analytics. Paxata was built from the ground up to address data prep – and only data prep. We don’t want to be another front-end analytics tool, and we don’t want to dabble in any other part of the BI stack. Our focus areas:

  • Semantic data integration capabilities powered by IntelliFusion™
  • Semantic data quality exposed through a visually interactive graphical interface
  • Semantic enrichment so data can be supplemented with 3rd party data sets easily
  • Unified data sharing so teams can collaborate throughout the entire process
  • Emergent governance, including data lineage/tracking and history so you can see who made what changes throughout the life of an AnswerSet. You can even roll it back, and re-run it on new data sets!

What does "Adaptive" mean?

The notion of "adaptive" addresses two aspects of data preparation.

The first is that, with Paxata, business analysts can now adjust more rapidly to the iterative business requests that come in on a daily basis. It is very typical that, as data is being understood and analyzed, users need more data to complete their line of questioning. So they go back to their analysts to repeat the data prep cycle again. At that point, analysts spend weeks and months in spreadsheets and home-grown data marts trying to combine their clean data with additional data from raw or outside sources, hoping that this next answer set will be what the business needs. Until now, that back-and-forth was the most painful part of every analytic exercise.

The second describes the most powerful aspect of the Paxata solution: the machine learning that leverages proven technologies from consumer search and social media, namely intelligent indexing, textual pattern recognition, and statistical graph analysis. By applying proprietary, patent-pending algorithms to the linguistic content of both structured and unstructured data, the Paxata solution automatically builds a comprehensive and flexible data model in the form of a graph, reflecting similarities and associations amongst data items. The system uses associations between the data to detect and resolve both syntactic and semantic data quality issues, rapidly improving the quality of large data sets. As more data sources are added, the expanded associations amongst the data are leveraged to further improve the quality of the data.
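
To make the graph idea concrete, here is a toy sketch (ours, not Paxata's actual algorithm) of how associations between similar values can surface and resolve quality issues: near-duplicate strings in a column are linked into a graph, and each connected cluster is collapsed onto its most frequent spelling. The similarity measure and threshold are illustrative choices.

```python
# Illustrative sketch only, not Paxata's algorithm: link near-duplicate
# values in a column and propose a canonical spelling for each cluster.
from difflib import SequenceMatcher
from collections import defaultdict, Counter

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] based on longest matching subsequences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve_variants(values, threshold=0.85):
    """Cluster near-duplicate strings and map each to the most frequent form."""
    counts = Counter(values)
    uniques = list(counts)
    # Build an adjacency list: an edge means "probably the same real-world value".
    graph = defaultdict(set)
    for i, a in enumerate(uniques):
        for b in uniques[i + 1:]:
            if similarity(a, b) >= threshold:
                graph[a].add(b)
                graph[b].add(a)
    # Walk connected components; the most common spelling wins.
    canonical, seen = {}, set()
    for start in uniques:
        if start in seen:
            continue
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.append(node)
            stack.extend(graph[node])
        best = max(component, key=lambda v: counts[v])
        for v in component:
            canonical[v] = best
    return canonical

fixes = resolve_variants(["Acme Corp", "ACME Corp.", "Acme Corp", "Initech"])
# {'Acme Corp': 'Acme Corp', 'ACME Corp.': 'Acme Corp', 'Initech': 'Initech'}
```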

Nenshad Bardoliwalla, Co-Founder and VP of Products at Paxata, spoke at Strata Santa Clara 2014. After his session, several folks came up to talk about the Paxata concept of “adaptive.” Those conversations led him to write a post and make a video on what “adaptive” means from a technology perspective. You'll find both below. Please ask any questions you have about adaptive data in the comments section at the bottom of this page.

What is Adaptive Data Preparation - Video Transcript

From a business perspective, Adaptive Data Preparation is the notion that, as the dynamics of the business are changing, as the market is changing, as competitors are moving very quickly, as economic structures are rearranging themselves in front of us, we can very rapidly create the data we need to answer analytical questions in our visualization tools of choice.

However, in the world of technology, adaptive starts with understanding the data in a new way. What we've seen over the last 30 years is that we've approached information from the database's perspective. We type information in terms of dates or integers or strings, which makes a lot of sense for the database, but not a lot of sense for humans.

In new systems, like Paxata, we now start to type information in a way that makes sense to people. When we look at a column of data, we can say, "This is a postal abbreviation," or, "This is a product," or, "This is a customer name." The fact that our systems can recognize what type of data we have in our data sets – and the meaning of that data – is very transformative in providing a much higher level of intelligence and flexibility in the way people prepare their data today.
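
As a rough illustration of semantic typing, consider a rule-based sketch that checks a column's values against a few recognizers. The recognizers, dictionaries, and the `min_share` threshold are invented for this example; a production system would rely on learned models and far larger reference data.

```python
# A minimal sketch of semantic column typing with simple rule-based
# recognizers; the dictionaries and patterns here are illustrative only.
import re

US_STATES = {"AL", "AK", "AZ", "CA", "NY", "TX", "WA"}  # truncated for brevity

RECOGNIZERS = {
    "us_postal_abbreviation": lambda v: v.upper() in US_STATES,
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "date_iso": lambda v: re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,
}

def infer_semantic_type(column, min_share=0.9):
    """Return the semantic type matching at least `min_share` of non-blank values."""
    values = [v for v in column if v]
    if not values:
        return "unknown"
    for name, matches in RECOGNIZERS.items():
        share = sum(matches(v) for v in values) / len(values)
        if share >= min_share:
            return name
    return "unknown"

print(infer_semantic_type(["CA", "NY", "tx", "WA"]))  # us_postal_abbreviation
```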

The other general point to note is that the notion of a pre-governed, pre-determined, pre-defined world is very rapidly becoming obsolete. What we find instead is that models – models of data, models of the transformations on data, models of the meaning of data – are now all becoming emergent. They need to come out of the various activities people are doing with their data sets, and leverage the power and flexibility of technologies like schema-on-read, where structure is imposed as raw data sets are read rather than before any data is loaded into a data management system.

What we have found, working with our customers, is that these two approaches – sophisticated data typing and emergent data modeling – allow us to rapidly look at the relationships and patterns across the various data sets that they bring to us. For example, a customer might upload data from an SAP system. They might upload data from one of their Oracle systems, from their SalesForce system, from third-party data providers like Nielsen and D&B, and even from the spreadsheets on their desktop. What they want is to very quickly have Paxata identify the patterns in the data. What are the types of data? How do those data types cluster or relate to each other? But where we see the true transformation in adaptivity is really around the notion of using machine-learning semantic technologies, text analytics, and statistics…all woven together to make the systems far more intelligent.

Adaptive technology really comes down to a few fundamental elements.

Emergent Data Models

The first is a system that is largely schema-free and model-free. By having a model-free and schema-free system, as your business requirements change, you can change the structure of your data too, using algorithmic techniques to semantically type data, to semantically enrich data, and to semantically join data together.
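
Here is a minimal sketch of the schema-on-read idea: no structure is declared up front, and a working schema emerges from whatever records arrive. The field names are invented for the example.

```python
# A toy illustration of "schema on read": the schema is the union of
# fields actually observed across incoming JSON records.
import json

def emerge_schema(raw_lines):
    """Union the fields seen across JSON records into an evolving schema."""
    schema = {}
    for line in raw_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

lines = [
    '{"customer": "Acme", "region": "CA"}',
    '{"customer": "Initech", "region": "NY", "revenue": 120000}',
]
print(emerge_schema(lines))
# {'customer': {'str'}, 'region': {'str'}, 'revenue': {'int'}}
```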

On-the-fly Data Enrichment

The next algorithmic innovation is the ability to enrich data on the fly. Once you know the meaning of your data, you want the system to recommend additional attributes that you would like to see in your data set, based on the context of the answer set you're trying to assemble. For example, if I have geographic information, and I'm doing some marketing work, I might want population demographics from the US Census fed to me automatically, because the system knows that I'm a professional in marketing and that I'm working with types of data that would be well enriched by that kind of information. So, semantic enrichment is another key innovation that allows us to adaptively grow the answer set and make it richer with context.
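
A hypothetical sketch of that recommendation step: once columns carry semantic types, suggesting enrichment sources becomes a lookup against a catalog. The catalog entries below are invented; a real system would draw on curated third-party data sets such as census demographics.

```python
# Hypothetical enrichment catalog keyed by semantic type; names invented.
ENRICHMENT_CATALOG = {
    "us_postal_abbreviation": ["census_population_by_state", "state_median_income"],
    "zip_code": ["census_demographics_by_zip"],
    "company_name": ["dnb_firmographics"],
}

def recommend_enrichments(column_types):
    """Suggest enrichment data sets for each semantically typed column."""
    suggestions = {}
    for column, semantic_type in column_types.items():
        if semantic_type in ENRICHMENT_CATALOG:
            suggestions[column] = ENRICHMENT_CATALOG[semantic_type]
    return suggestions

print(recommend_enrichments({"state": "us_postal_abbreviation", "name": "text"}))
# {'state': ['census_population_by_state', 'state_median_income']}
```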

Automatic relationship and pattern detection

The next important innovation is the notion of what we call automatic relationship detection. If I know what type of data you have and what other attributes I want to bring in, then the next thing I want to be able to do is combine or fuse data on the fly to assemble my answer set as rapidly as I need to. With no pre-determined models or algorithms, I can fuse my information together very quickly and build that answer set. So, we can fuel accurate and complete analytics in tools like Tableau, Qlik, Excel, and others.
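
To illustrate one simple way relationship detection can work, the sketch below scores candidate join keys by how much their value sets overlap (Jaccard similarity). This is a toy stand-in for Paxata's approach, with tables represented as plain dictionaries of column name to values.

```python
# A small sketch of automatic relationship detection: rank column pairs
# whose value sets overlap enough to suggest a join key.
def candidate_joins(left, right, min_overlap=0.5):
    """Rank column pairs by Jaccard similarity of their value sets."""
    candidates = []
    for lcol, lvals in left.items():
        lset = set(lvals)
        for rcol, rvals in right.items():
            rset = set(rvals)
            union = lset | rset
            if not union:
                continue
            jaccard = len(lset & rset) / len(union)
            if jaccard >= min_overlap:
                candidates.append((lcol, rcol, round(jaccard, 2)))
    return sorted(candidates, key=lambda c: -c[2])

crm = {"account": ["Acme", "Initech", "Hooli"], "owner": ["kim", "raj", "ana"]}
erp = {"customer": ["Acme", "Initech", "Globex"], "terms": ["NET30", "NET60", "NET30"]}
print(candidate_joins(crm, erp))  # [('account', 'customer', 0.5)]
```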

Getting smarter over time

The final capability is what we call reinforcement learning. It is the interaction between us as people and the algorithms of our computers, in order to get to the right answer set as quickly as possible. When the system recommends to me additional attributes that I might want to see in an enrichment scenario, what I want to do is be able to say, "Yes. I would like this population information, but perhaps I don't want any weather information or geographic information." So, I am reinforcing to the computer what I am really interested in seeing in my answer set. In the same way, I am also adapting because as I'm giving feedback to the computer, it's giving me new, more relevant recommendations, a lot like how Amazon recommends to me other interesting books that I'd like to read. So, this combination of technologies allows us to provide a completely different way to prepare data… so that just as your business needs to be ready, your data needs to be ready as well.
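
A deliberately simple sketch of that feedback loop: accepted suggestions raise a category's score, rejected ones lower it, and future recommendations are ranked by the learned scores. A real system would use proper bandit or recommender algorithms; the update rule here is purely illustrative.

```python
# Toy feedback-driven ranker: scores drift toward 1 on accepts, 0 on rejects.
class FeedbackRanker:
    def __init__(self, categories, learning_rate=0.2):
        self.scores = {c: 0.5 for c in categories}
        self.lr = learning_rate

    def record(self, category, accepted: bool):
        """Nudge the category's score toward 1 (accept) or 0 (reject)."""
        target = 1.0 if accepted else 0.0
        self.scores[category] += self.lr * (target - self.scores[category])

    def ranked(self):
        """Recommendation order, best-scoring categories first."""
        return sorted(self.scores, key=self.scores.get, reverse=True)

ranker = FeedbackRanker(["population", "weather", "geography"])
ranker.record("population", accepted=True)   # "Yes, I want population data"
ranker.record("weather", accepted=False)     # "No weather information, please"
ranker.record("geography", accepted=False)
print(ranker.ranked())  # ['population', 'weather', 'geography']
```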

The importance of an adaptive data preparation solution is that the ability to iterate, the flexibility with which models emerge from the data itself, the meaning we can derive from those data sets, and the interaction between us as people and the computer all combine to allow very rapid cycles of data preparation in the life cycle of analytics.

Comments

What are the differences between data preprocessing and adaptive data preparation? Are you really just doing advanced data preprocessing?

Data pre-processing is a step in the traditional data mining process, usually managed by a highly technical team. They first look at the data and decide how best to "rationalize" it into a clean, verified data set. The process imposes a schema on the data, which means you need to know how you plan to use the data to begin with. 

In the agile BI world, many times business teams are doing analytics to try to answer questions they didn't even know they would have when they started the discovery process. Tableau, Qlik and other tools make it easy to visualize that data in powerful ways, but almost always, the minute they see the data, they realize they will need more data. Adaptive data preparation makes it possible to bring data into the ad-hoc analytics work stream in real time.

You don't have to stop the querying process to submit the data into a pre-processing system, then specify to the technical team how you are going to incorporate that data with the other data being used. The Paxata system automatically tells you what the data is about in a very visual and interactive way...then recommends ways you can combine those data sets - on the fly - as you continue down the path of analyzing the data.

  • Pre-processing: Technical team must run this step
  • Adaptive data prep: Business analyst can do this without technical skills
  • Pre-processing: Data schema will need to be defined 
  • Adaptive data prep: System will provide visual understanding of what the data is about and you decide what you want to use
  • Pre-processing: Business team must explain how the data will get used 
  • Adaptive data prep: Business analyst can look at all the recommended ways data should be cleaned, merged, etc. and take actions
  • Pre-processing: Time-to-completion could be slow if there are resource bottlenecks
  • Adaptive data prep: This can be done on the fly, without breaking the ad-hoc analytics work stream

Hope this helps address the differences...

 

I think I get the difference now. Adaptive data preparation seems to be more about data preparation for analytics, not data pre-processing for data mining. So as a business analyst you consider the data cleaning I do myself as data preparation and your software automates that process for me. Very interesting.
