Journey to Apache Spark

In a previous blog post, I mentioned that Shachar Harussi and I discussed the lessons we learned building the Apache Spark-based architecture for the Paxata platform at DataVersity in Chicago this week. I can’t wait to talk about this at Strata NYC this week. Come by booth #301 and find me to ask more about these topics!

Here is a second peek into what we talked about – refer to this post to get some context.

Why use Apache Spark for data preparation?

Data preparation includes gathering and exploring millions of records, cleaning up missing data and invalid formats, transforming and combining data with other datasets, correcting mistakes and understanding outliers, and filtering and segmenting datasets down to the data that matters.

So, why use Spark? When Paxata was founded, we knew that there were important things to consider for how people want to prepare their data:

  1. Business analysts need to see their data in rows and columns, not a bunch of boxes and arrows.
  2. Business analysts need to prepare all of their data for their eventual analysis, not just some of it.
  3. As business analysts prepare their data (cleaning, combining, fixing, re-shaping) they need to see changes reflected in the data immediately.

So, we needed to build an experience that would show lots of data, react quickly, calculate changes across huge datasets, and scale to millions (billions!) of rows. We designed an interactive spreadsheet-like experience to address the first point. Now, to be smart, scalable, and speedy, we needed to investigate available open-source projects in distributed computing.

The image below demonstrates the interactive experience of the Paxata self-service data preparation app with semantic column profiles, numerical range filtering, and text searching for millions of records.


Spark vs the Other Distributed Systems Projects

We considered projects along the spectrum of storage to pure computing.


Distributed file systems like Alluxio (previously Tachyon) and Ceph addressed scalability concerns for “big data” problems, but did not address an analyst’s needs to interact with the data and make changes.

Databases like Apache HBase and Apache Cassandra offered scalable data storage as well as real-time querying and filtering, which was an improvement on pure-storage. These projects showed potential; they could accommodate an analyst zooming around their data, scrolling across millions of records with relative ease. But what would happen if data had to be reshaped, aggregated, and pivoted?

The image below demonstrates the interactive experience of the Paxata self-service data preparation app with a number of shaping options, including deduplicate, group by, transpose, pivot, and depivot. Depivot is shown here, with new values dynamically calculated on the fly based on the chosen parameters. 
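
For readers unfamiliar with the term, here is a minimal sketch of what a depivot (wide-to-long reshape) does, written in plain Python on a tiny in-memory table. This is only an illustration of the operation, not how Paxata or Spark implements it; the function name `depivot`, its parameters, and the sample sales data are all made up for this example.

```python
from typing import Dict, List

Row = Dict[str, object]


def depivot(
    rows: List[Row],
    id_columns: List[str],
    value_columns: List[str],
    name_column: str = "attribute",
    value_column: str = "value",
) -> List[Row]:
    """Turn each listed value column into its own row (wide -> long).

    Hypothetical helper for illustration only.
    """
    long_rows: List[Row] = []
    for row in rows:
        for column in value_columns:
            # Carry the identifying columns over unchanged.
            new_row: Row = {key: row[key] for key in id_columns}
            # The old column name becomes data, next to its cell value.
            new_row[name_column] = column
            new_row[value_column] = row[column]
            long_rows.append(new_row)
    return long_rows


# Made-up sample data: one row per region, one column per quarter.
wide = [
    {"region": "East", "q1": 10, "q2": 12},
    {"region": "West", "q1": 7, "q2": 9},
]

long_rows = depivot(wide, ["region"], ["q1", "q2"], "quarter", "sales")
for row in long_rows:
    print(row)
```

Each input row fans out into one output row per depivoted column, so the row count grows with the number of columns chosen. On a large dataset, that is exactly the kind of reshaping that needs real computation rather than just storage and lookup.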


After continued investigation, we decided on Apache Spark. Not only is it scalable to the data volumes we anticipated, but it also has a simple, robust, and flexible way of expressing computations as an immutable stream of data, the Resilient Distributed Dataset (RDD). For more information about Apache Spark, I highly recommend reading about it on the Databricks website or in this tutorial.
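
To give a rough feel for what "expressing computations over immutable data" means, here is a toy, single-process sketch in plain Python. It is not Spark's API and has no distribution or fault tolerance; the class `ToyDataset` and the sample names are hypothetical. It only mimics two ideas: a transformation returns a new dataset instead of modifying the old one, and nothing is computed until a result is requested.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Callable]


class ToyDataset:
    """Hypothetical stand-in for an immutable, lazily evaluated dataset."""

    def __init__(self, source: Sequence, steps: Tuple[Step, ...] = ()):
        self._source = source
        self._steps = steps  # recorded transformations, oldest first

    def map(self, fn: Callable) -> "ToyDataset":
        # Return a new dataset; the current one is left untouched.
        return ToyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate: Callable) -> "ToyDataset":
        return ToyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self) -> List:
        # Only now are the recorded steps replayed over the source.
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


# Made-up sample: messy names to trim, drop if empty, and title-case.
raw = ToyDataset([" alice ", "", "BOB", " carol"])
cleaned = raw.map(str.strip).filter(bool).map(str.title)

print(cleaned.collect())
print(raw.collect())  # the original dataset is unchanged
```

Because each dataset is just its source plus a recorded list of steps, any result can be recomputed from scratch, which is loosely the idea behind the "resilient" part of the name.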

The Paxata team will be available at Strata NYC. Come by booth #301 to find out more!

Please note that Apache HBase, Apache Cassandra, and Apache Spark are trademarks of the Apache Software Foundation.