Data Lake and Data Warehouse

What is Data Lake?

Introduction

Data volumes are growing at an unprecedented pace. The volume, variety, velocity and veracity of the data coming from sensors, social media and other sources are far outstripping the traditional data warehousing approach. With all this new data connecting us, we should be sailing smoothly. Unfortunately, we are drowning in our own data. Forward-looking organizations are trying to harness these new sources productively to create new value and gain a competitive advantage.

What is Data Lake?

As of now, there is no clear industry definition of a data lake. For some, a data lake is a repository for large quantities and varieties of data, both structured and unstructured. For others, it is an architectural strategy and an architectural destination. Either way, the data lake concept is emerging as a popular way to organize and build the next generation of systems that face big data challenges. The need for a data lake arose because organizations had to capture and exploit new types of data.

Capabilities and Salient Features of Data Lake:

  • Captures and stores huge amounts of raw data at low cost, and scales easily.
  • Supports advanced analytics. It utilizes large quantities of coherent data and facilitates the use of various algorithms (e.g. deep learning) for analytics.
  • Allows schema-less write and schema-based read. The schema is applied only when the data is consumed, which is very handy for consumers.
  • Does not force data modeling at the time of data ingestion. Modeling can be done at the time of consumption.
  • Can store data from diverse sources and in various formats, e.g. sensor data, social media data, XML and more.
  • Accommodates high-speed data in conjunction with additional tools like Kafka and Flume.
  • Supports single-subject analytics based on specific use cases.
  • With Hadoop 2.0 and YARN, a data lake overcomes the earlier limitations of batch-oriented processing and a single means of user interaction with the data.
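To make the "schema-less write, schema-based read" point concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the "lake" is a local temporary directory of JSON-lines files, and the names ingest_raw and read_with_schema are hypothetical helpers, not part of Hadoop or any data lake product.

```python
import json
import tempfile
from pathlib import Path


def ingest_raw(lake_dir: Path, source: str, records: list[dict]) -> Path:
    """Append records as JSON lines without validating or modeling them."""
    path = lake_dir / f"{source}.jsonl"
    with path.open("a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return path


def read_with_schema(path: Path, schema: dict[str, type]) -> list[dict]:
    """Project each raw record onto the schema chosen by the consumer."""
    rows = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            raw = json.loads(line)
            row = {}
            for field, cast in schema.items():
                value = raw.get(field)
                # Missing fields become None instead of failing the read.
                row[field] = cast(value) if value is not None else None
            rows.append(row)
    return rows


# Hypothetical lake location: a throwaway local directory.
lake = Path(tempfile.mkdtemp())

# Ingestion accepts records of different shapes as they arrive.
path = ingest_raw(lake, "sensor", [
    {"device": "t-1", "temp": "21.5", "unit": "C"},
    {"device": "t-2", "temp": 19, "firmware": "2.0"},
])

# The consumer decides at read time which fields and types matter.
readings = read_with_schema(path, {"device": str, "temp": float})
```

Nothing is rejected at write time; a second consumer could read the same file with a completely different schema.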

Data Lake vs. Data warehouse

Both have their own sweet spot. The enterprise data warehouse was designed to create a single version of the truth that can be reused again and again. Its model is based on schema-on-write, which demands a lot of time for design and modeling up front and makes it less flexible. On the other hand, if you need fast response times, high concurrency, consistent performance, easily consumable data and cross-functional analysis, the enterprise data warehouse is the option to go with.
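As a contrast, schema-on-write can be sketched with Python's built-in sqlite3 module standing in for a warehouse. This is only an illustration: the sales table, its columns and the load helper are assumptions made for the example, and a real enterprise data warehouse would of course be a different system.

```python
import sqlite3

# An in-memory SQLite database stands in for the warehouse.
conn = sqlite3.connect(":memory:")

# The model is fixed before any data arrives.
conn.execute("""
    CREATE TABLE sales (
        order_id INTEGER PRIMARY KEY,
        region   TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
""")


def load(row: tuple) -> bool:
    """Try to insert one row; return False if it violates the schema."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sales (order_id, region, amount) VALUES (?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = load((1, "EMEA", 120.0))   # conforms to the model
rejected = load((2, None, 75.0))      # region is missing, so it is refused

# Everything that made it in is immediately queryable with plain SQL.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

The rejected row shows the trade-off: the design effort is paid before loading, and in exchange every consumer sees consistent, easily consumable data.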

Let’s summarize a few of the differences between Data Lake and Data Warehouse:

 

| Aspect | Data Lake | Data Warehouse |
|---|---|---|
| Data | Raw; structured, semi-structured and unstructured | Structured, processed |
| Storage | Low-cost storage | Expensive for large data volumes |
| Processing | Schema-on-read | Schema-on-write |
| Agility | Configurable and reconfigurable as and when needed | Fixed configuration |
| Security | Work in progress | Mature |
| User | Meant for data scientists | Business and technical users |
| Analytics support | Excels at utilizing large volumes of coherent data | Limited |
| As-is data format | Data modeling is not required at the time of ingestion; it can be done at the time of consumption | Typically, data is modeled as a cube during ingestion |
| Access methods | Data is accessed through programs created by developers and through SQL-like systems; there is no standard or predefined way | Data is accessed through standard SQL and BI tools |

With its complementary set of features and properties, the data lake concept has had an impact on organizations that traditionally used only a data warehouse. One visible extension of the data lake's role in such organizations is using it to prepare data for analysis in a data warehouse.

It can be used as a “scale-out ETL” environment for big data, getting the data into a form that can be loaded into a warehouse for wider use. By doing so, organizations run ETL not only against data from enterprise applications but also against big data sources at the same time.

Many organizations that own both a data lake and an enterprise data warehouse use the two environments in a distributed fashion for analytics. Media files such as video, audio and images are stored in the data lake's filesystem and exposed to various analytics tools to extract insights. Other data, which may be unstructured or semi-structured, is also stored in the filesystem but exposed to a separate set of analytics tools. Once processed, the results of the analytics are distilled further and moved to the enterprise data warehouse for a wider audience.

In short, organizations are trying to use the data lake and the enterprise data warehouse as a hybrid, unified system that can fulfill their data discovery and data exploration needs, allowing them to visualize the data when and in the form they want. A hybrid solution lets users take what is relevant and leave the rest.
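The ETL role described above can be sketched in a few lines of Python. This is a toy, single-process illustration under assumptions: raw_events stands in for files in the lake, an in-memory SQLite database stands in for the warehouse, and the transform function and page_stats table are hypothetical names. A real scale-out job would distribute the same extract, transform and load steps across a cluster.

```python
import json
import sqlite3
from collections import defaultdict

# Extract: raw lines as they might sit in the lake, including a bad one.
raw_events = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/home", "ms": 80}',
    'not-json',
    '{"user": "a", "page": "/cart", "ms": 200}',
]


def transform(lines):
    """Drop unusable lines and aggregate views and time spent per page."""
    totals = defaultdict(lambda: [0, 0])  # page -> [views, total_ms]
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # raw data may be dirty; skip what cannot be parsed
        if not isinstance(event, dict) or "page" not in event:
            continue
        totals[event["page"]][0] += 1
        totals[event["page"]][1] += int(event.get("ms", 0))
    return [(page, views, ms) for page, (views, ms) in sorted(totals.items())]


# Load: the distilled result goes into a modeled warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE page_stats (page TEXT PRIMARY KEY, views INTEGER, total_ms INTEGER)"
)
rows = transform(raw_events)
with warehouse:
    warehouse.executemany("INSERT INTO page_stats VALUES (?, ?, ?)", rows)
```

Only the small, cleaned aggregate reaches the warehouse; the raw events stay in the lake for other uses.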

Fig: Generic landscape for Data Lake (please ignore the company-specific flavors)


Points to consider while creating a Data Lake:

Depending on our current situation, the road to a data lake may differ. As always, we first need to answer: “Where do we stand now, and where do we want to go with the data lake?” The general recommendation is to follow your data.

Steps and parameters:

Step 01 (Start Point): Know the volume, variety, velocity and veracity of your data. For any organization, it’s important to learn and make sure that Hadoop works the way it wants in its own context. This is very important from a future perspective. At this stage the organization should normally limit itself to simple analytics.

Step 02: The focus moves from learning to improving analytics capability. In this stage the organization should look for suitable tools and skill sets to acquire more data and build applications on top of it. Transforming the data and co-creating hybrid scenarios with the data warehouse should also be explored and worked on.

Step 03: Democratize the data by providing access to as many people as possible. The data lake and the enterprise data warehouse start playing their respective roles.

Step 04 (Long Running phase): Apply governance, compliance and auditing. Which governance concepts you can apply depends on the maturity level of your data lake.

Data Lake Maturity: 

The data lake will fill with new data slowly and will not impact the existing models. The data lake foundation includes a big data repository, metadata management, and an application framework to capture and contextualize end user feedback. The increasing value of analytics is then directly correlated to increases in user adoption across the enterprise.

  1. Consolidated and categorized raw data.
  2. Attribute-level metadata tagging and linking.
  3. Data Set extraction and Analysis.
  4. Business-specific tagging, synonym identification, and links.
  5. Convergence of Meaning within Context.
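The tagging stages in the list above can be pictured with a very small metadata catalog in Python. This is a sketch under assumptions, not how any particular metadata tool works: the Attribute and DataSet classes, the SYNONYMS table and the find function are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    tags: set[str] = field(default_factory=set)


@dataclass
class DataSet:
    name: str
    attributes: dict[str, Attribute] = field(default_factory=dict)

    def tag(self, attribute: str, *tags: str) -> None:
        """Attach tags to one attribute, creating its entry on first use."""
        attr = self.attributes.setdefault(attribute, Attribute(attribute))
        attr.tags.update(tags)


# Business-specific synonyms mapped to one canonical tag (illustrative values).
SYNONYMS = {"client": "customer", "cust": "customer"}


def find(datasets, term):
    """Return (data set, attribute) pairs whose tags match the term."""
    canonical = SYNONYMS.get(term.lower(), term.lower())
    return sorted(
        (ds.name, attr.name)
        for ds in datasets
        for attr in ds.attributes.values()
        if canonical in attr.tags
    )


crm = DataSet("crm_export")
crm.tag("cust_id", "customer", "identifier")

web = DataSet("web_logs")
web.tag("uid", "customer")
web.tag("url", "page")

# A business user searching for "Client" is linked to both raw data sets.
hits = find([crm, web], "Client")
```

Attribute-level tags link columns across data sets, and the synonym table is one simple way to let different business terms converge on the same meaning.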

Another school of thought defines data lake maturity as a four-stage model:

Stage 1 – Evaluating Technology
Stage 2 – Reactionary
Stage 3 – Proactive
Stage 4 – Core Competency

As the organization progresses from stage 1 to stage 4, its data lake transforms from technology infrastructure into business value. In the course of this transformation, the organization gains IT efficiency, analytical capabilities and hybrid usage of the enterprise data warehouse and the data lake.

Conclusion

The data lake has already been accepted across the industry as a viable, integrated component to consider in a data strategy, though the rate of adoption has not been as high as expected. The slow adoption can be attributed to the absence of a clear definition of the data lake and its components. Governance and security remain key concerns for most organizations and are challenges that need to be addressed in the coming days. Success stories at organizations big and small will definitely boost the concept and its adoption.

Please Note: this is a compilation of the research and reading I did for one of my projects and assignments. Most of the content is not my own, but the experience I am sharing is, including the similar issues that I faced.
