Opinion: Tips for making your data lake thrive
Originally published on Information Management
Manan Goel is a senior director at Paxata
Big data offers tremendous opportunities to outsmart your competition and obtain insights on your business. By transforming big data into actionable information, you can open your organization up to new opportunities by identifying additional markets and customer segments, and by capitalizing on product innovation.
One of the leading principles for enabling enterprise big data is to develop a data lake strategy using a data architecture based on Apache Hadoop. With the right information management principles in place, data lakes can deliver the scale, flexibility, and cost-effectiveness you need to manage your big data.
But why have data lakes emerged as a viable IT architecture pattern to capture and provision big data? It really comes down to three things:
- Commodity hardware-based scale-out architecture
- Massive parallel processing
- Schema-free data management
However, despite the promise of big data, significant challenges remain before it can deliver any economic benefits. Most enterprises think of data lakes as cheap storage for bulky, low-value data; consequently, raw data tends to get dumped as-is, leaving the data consumer to assess data quality and cleanse the data before it can be turned into useful information.
To guarantee your data lakes can effectively manage the complexity, volume, and variety of your critical business data, it’s important to understand the primary data lake information management challenges and solutions:
1. Data access, exploration, and discovery: Finding useful information in a typical data lake can feel like finding a needle in a haystack. Fortunately, several data lake vendors offer specialized technical tools to help in this process. These tools provide business consumers with interactive, visual, click-based solutions powered by metadata to simplify information access, exploration, and discovery.
Start by investing in self-service solutions that enable business metadata definition through tagging, annotations, and descriptions, and be sure they also feature search, visual data lake exploration, and assisted intelligence. Machine learning, text analytics, and semantic analytics are other key criteria, as they facilitate quick data access, exploration, and discovery.
Pervasive data lake adoption also requires easy access for data experts and business consumers alike. Increase the utility of your data lake with visual access interfaces such as workspaces, libraries, catalogs, and integration with visualization tools.
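To make the idea of metadata-powered discovery concrete, here is a minimal Python sketch of a searchable catalog. It is an illustration only, not any vendor's product: the `Catalog` and `DatasetEntry` classes, the dataset names, and the tags are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Business metadata for one dataset in the lake (hypothetical model)."""
    name: str
    description: str = ""
    tags: set = field(default_factory=set)


class Catalog:
    """A toy in-memory catalog that supports tagging and keyword search."""

    def __init__(self):
        self._entries = {}

    def register(self, name, description="", tags=()):
        # Tags are stored in lowercase so that search is case-insensitive.
        self._entries[name] = DatasetEntry(
            name, description, {tag.lower() for tag in tags}
        )

    def search(self, term):
        """Return names of datasets whose name, description, or tags match."""
        term = term.lower()
        return sorted(
            entry.name
            for entry in self._entries.values()
            if term in entry.name.lower()
            or term in entry.description.lower()
            or term in entry.tags
        )


catalog = Catalog()
catalog.register("web_clicks_raw", "Raw clickstream events", tags=["Marketing"])
catalog.register("crm_accounts", "Customer accounts from the CRM", tags=["Sales"])
matches = catalog.search("marketing")
```

Even a simple index like this shows why tags and descriptions matter: a business consumer can find `web_clicks_raw` by searching for "marketing" without knowing the dataset's name or location.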
2. Data preparation: Data lakes forgo data pre-processing to facilitate fast ingest. The result is unusable data flowing into the data lake.
To keep raw and unsafe data from accumulating in your data lake, provide data prep solutions that work with the scale and complexity of your data. For faster time to value, make sure that the data prep solution offers out-of-the-box options to explore, profile, clean, shape, combine, enrich, classify, and publish data.
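As a rough illustration of two of these steps, profiling and cleaning, the following Python sketch works on a small list of records. It is a simplified example under stated assumptions: the `profile` and `clean` functions and the sample records are hypothetical, and a real data prep tool would operate at far greater scale.

```python
def profile(rows):
    """Count missing (None or blank) values per column."""
    missing_counts = {}
    for row in rows:
        for column, value in row.items():
            is_missing = value is None or str(value).strip() == ""
            missing_counts[column] = missing_counts.get(column, 0) + int(is_missing)
    return missing_counts


def clean(rows, required):
    """Trim whitespace, drop rows missing a required field, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        normalized = {
            column: value.strip() if isinstance(value, str) else value
            for column, value in row.items()
        }
        if any(normalized.get(column) in (None, "") for column in required):
            continue  # a required field is missing
        fingerprint = tuple(sorted(normalized.items(), key=lambda item: item[0]))
        if fingerprint in seen:
            continue  # exact duplicate of an earlier row
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned


# Hypothetical raw records as they might land in a data lake.
raw_rows = [
    {"id": "1", "email": " a@example.com "},
    {"id": "1", "email": "a@example.com"},
    {"id": "2", "email": ""},
    {"id": "3", "email": None},
]
missing = profile(raw_rows)
usable = clean(raw_rows, required=["email"])
```

Profiling first tells you how bad the problem is (here, two of the four records lack an email address); cleaning then leaves a single usable record.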
3. Semantics and assisted intelligence: Over 90 percent of big data is derived from social media, sensors, logs, clickstreams, mobile, and cloud applications. Machine-generated unstructured data is unmanageable by people or manual processes.
To make sense of this data, be sure to leverage artificial intelligence and algorithmic techniques that automatically manage complex data at scale. Techniques like machine learning and natural language processing drive semantic and syntactic data understanding. With assisted intelligence, business consumers can quickly clean, combine, curate, and contextualize different multi-structured data for higher business value.
4. Governance: Data lakes must balance seamless information ingest, management, and access with governance and control.
To achieve this, offer comprehensive tracking, lineage, auditing, and versioning to ensure accuracy and traceability. Implement multi-layer authorization regimes to balance access with control. After all, governance drives trustworthiness, which drives data lake adoption.
5. Security: Many data lakes sit unused, waiting to move from the experimentation phase to production use.
Enterprise-grade security, including authentication, authorization, and encryption, is the key to production usage. Data lakes need to co-exist with existing IT investments, so adopt enterprise standards for security such as LDAP, SAML, PKI, SSO, and encryption for seamless integration into existing IT infrastructure.
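One way to picture lineage, auditing, and versioning is an append-only log of events. The Python sketch below is a minimal illustration under assumed names: `LineageLog`, `LineageEvent`, and the dataset and actor names are hypothetical and do not reflect any specific governance product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEvent:
    """One immutable audit record: who did what to which dataset, and from what."""
    dataset: str
    version: int
    action: str
    actor: str
    sources: tuple
    timestamp: str


class LineageLog:
    """A toy append-only log that supports versioning and lineage queries."""

    def __init__(self):
        self._events = []

    def record(self, dataset, action, actor, sources=()):
        # Each new event on a dataset bumps that dataset's version number.
        version = 1 + sum(1 for event in self._events if event.dataset == dataset)
        event = LineageEvent(
            dataset, version, action, actor, tuple(sources),
            datetime.now(timezone.utc).isoformat(),
        )
        self._events.append(event)
        return event

    def upstream(self, dataset):
        """Return every dataset that `dataset` was derived from, transitively."""
        found = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for event in self._events:
                if event.dataset != current:
                    continue
                for source in event.sources:
                    if source not in found:
                        found.add(source)
                        stack.append(source)
        return sorted(found)


log = LineageLog()
log.record("raw_clicks", "ingest", actor="etl_job")
log.record("clean_clicks", "clean", actor="analyst", sources=["raw_clicks"])
latest = log.record("clean_clicks", "enrich", actor="analyst", sources=["raw_clicks"])
log.record("weekly_report", "publish", actor="analyst", sources=["clean_clicks"])
report_lineage = log.upstream("weekly_report")
```

With such a record in place, anyone who questions a number in `weekly_report` can trace it back through `clean_clicks` to `raw_clicks` and see who changed what, and when.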