Data Lake and Data Warehouse

December 29, 2016

Introduction

Data volumes are growing at an unprecedented pace. The volume, variety, velocity and veracity of data coming from sensors, social media and other sources are far outstripping the traditional data warehousing approach. With all this new data connecting us, we should be sailing smoothly. Unfortunately, we are drowning in our own data. Forward-looking organizations are trying to harness these new sources productively to gain value and competitive advantage.

What is a Data Lake?

As of now, there is no clear industry definition of a data lake. For some, a data lake is a repository for large quantities and varieties of data, both structured and unstructured. For others, it is an architectural strategy and an architectural destination. Either way, the concept is emerging as a popular way to organize and build the next generation of systems to face big data challenges. The need for data lakes arose because new types of data needed to be captured and exploited by organizations.

Capabilities and Salient Features of Data Lake:

Captures and stores huge amounts of raw data at low cost, and can be scaled very easily.
Supports advanced analytics. Utilizes large quantities of coherent data and facilitates the use of various algorithms (e.g. deep learning) for analytics.
Allows schema-less write and schema-based read. This is very handy at the time of data consumption.
Does not require data modeling at the time of data ingestion. Modeling can be done at the time of consumption.
Can store data from diverse sources and in various formats, e.g. sensor data, social media data, XML and more.
Accommodates high-speed data in conjunction with additional tools like Kafka and Flume.
Performs single-subject analytics based on specific use cases.
A data lake built on Hadoop 2.0 with YARN overcomes the limitations of being batch-oriented and offering only a single means of user interaction with data.
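
To make the "schema-less write, schema-based read" idea concrete, here is a minimal sketch in plain Python. It is only an illustration: the in-memory list stands in for lake storage, and the names raw_store, ingest and read_with_schema are hypothetical, not part of Hadoop or any other product.

```python
import json

# Hypothetical stand-in for lake storage: payloads are kept exactly as received.
raw_store = []


def ingest(payload):
    """Schema-less write: no parsing, validation, or modeling at ingestion time."""
    raw_store.append(payload)


def read_with_schema(schema):
    """Schema-on-read: each consumer projects the raw data onto its own schema."""
    rows = []
    for payload in raw_store:
        try:
            record = json.loads(payload)
        except json.JSONDecodeError:
            continue  # non-JSON payloads stay in the lake; this reader skips them
        if not isinstance(record, dict):
            continue
        row = {}
        for name, cast in schema.items():
            value = record.get(name)
            try:
                row[name] = cast(value) if value is not None else None
            except (TypeError, ValueError):
                row[name] = None
        # Keep only rows that carry at least one field this consumer asked for.
        if any(value is not None for value in row.values()):
            rows.append(row)
    return rows


# Three differently shaped payloads are all accepted at write time.
ingest('{"sensor_id": "s1", "temp": "21.5", "unit": "C"}')
ingest('{"user": "alice", "text": "hello"}')
ingest('<reading sensor="s2"/>')

# Two consumers apply two different schemas to the same raw data.
temperature_view = read_with_schema({"sensor_id": str, "temp": float})
social_view = read_with_schema({"user": str, "text": str})
```

Nothing is rejected at ingestion, and the decision about which fields matter, and what type they have, is deferred to each reader.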

Data Lake vs. Data warehouse

Both have their own sweet spots. The enterprise data warehouse was designed to create a single version of the truth that can be reused again and again. The model is based on schema-on-write, which demands a lot of time for design and modeling up front and makes it less flexible. On the other hand, if you need fast response times, high concurrency, consistent performance, easily consumable data and cross-functional analysis, the enterprise data warehouse is the option to go with.
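
As a rough illustration of schema-on-write, the sketch below uses an in-memory SQLite database as a stand-in for a warehouse table. This is an assumption made purely for the example; the table name readings and the helper load are hypothetical. The point is that the structure is fixed before any data arrives, and rows that do not fit are rejected at load time.

```python
import sqlite3

# In-memory SQLite database used only as a small stand-in for a warehouse.
conn = sqlite3.connect(":memory:")

# The schema is designed up front; every row must conform before it is stored.
conn.execute(
    """
    CREATE TABLE readings (
        sensor_id TEXT NOT NULL,
        temp_c    REAL NOT NULL CHECK (typeof(temp_c) = 'real')
    )
    """
)


def load(row):
    """Try to write one row; return False if the schema rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO readings (sensor_id, temp_c) VALUES (?, ?)", row
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = load(("s1", 21.5))               # fits the schema
bad_type = load(("s2", "not-a-number"))     # fails the CHECK constraint
missing_key = load((None, 19.0))            # fails NOT NULL

stored = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

Only the conforming row is stored, which is what makes the data easy to consume later and also what makes the model slower to change.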

Let’s try to summarize a few of the differences between Data Lake and Data Warehouse:

	Data Lake	Data Warehouse
Data	Raw: structured, semi-structured and unstructured	Structured, processed
Storage	Low-cost storage	Expensive for large data volumes
Processing	Schema-on-read	Schema-on-write
Agility	Configurable and reconfigurable as and when needed	Fixed configuration
Security	Work in progress	Mature
Users	Meant for data scientists	Business and technical users
Analytics support	Excels at utilizing large volumes of coherent data	Limited
As-is data format	Data modeling is not required at ingestion; it can be done at the time of consumption	Typically, data is modeled as a cube during ingestion
Access methods	Data is accessed through programs created by developers and through SQL-like systems. There is no standard or predefined way.	Data is accessed through standard SQL and BI tools.

With a few complementary features and properties, the data lake concept has had an impact on organizations that traditionally used only a data warehouse. One visible extension of the data lake's role in such organizations is using the data lake to prepare data for analysis in a data warehouse.

It can be used as a “scale-out ETL” environment for big data, getting the data into a form that can be loaded into a warehouse for wider use. By doing so, organizations run ETL not only against data from enterprise applications but also against big data sources at the same time.
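
The lake-to-warehouse flow described above can be sketched as a tiny extract, transform and load pipeline. This is a single-process toy and not a scale-out implementation: the file names, field names and the region_summary table are all made up for the example, and SQLite again stands in for the warehouse.

```python
import json
import sqlite3
from collections import defaultdict

# Hypothetical raw files in the lake: one from an enterprise application,
# one from a big data source. Contents are invented for illustration.
raw_lake_files = {
    "app/orders.jsonl": (
        '{"order_id": 1, "region": "EU", "amount": "120.50"}\n'
        '{"order_id": 2, "region": "US", "amount": "80"}\n'
    ),
    "web/clicks.jsonl": (
        '{"region": "EU", "clicks": 3}\n'
        '{"region": "EU", "clicks": 2}\n'
        'not json\n'
    ),
}


def extract(text):
    """Parse raw lines, skipping anything that is not valid JSON."""
    for line in text.splitlines():
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def transform(files):
    """Combine both sources into one row per region."""
    summary = defaultdict(lambda: {"revenue": 0.0, "clicks": 0})
    for record in extract(files["app/orders.jsonl"]):
        summary[record["region"]]["revenue"] += float(record["amount"])
    for record in extract(files["web/clicks.jsonl"]):
        summary[record["region"]]["clicks"] += int(record["clicks"])
    return [
        (region, values["revenue"], values["clicks"])
        for region, values in sorted(summary.items())
    ]


def load(rows, conn):
    """Write the prepared rows into a fixed-schema warehouse table."""
    conn.execute(
        "CREATE TABLE region_summary ("
        "region TEXT PRIMARY KEY, revenue REAL NOT NULL, clicks INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO region_summary VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(raw_lake_files), warehouse)
result = warehouse.execute(
    "SELECT region, revenue, clicks FROM region_summary ORDER BY region"
).fetchall()
```

In a real deployment the transform step would run on the cluster and only the distilled rows would travel to the warehouse; the shape of the flow stays the same.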

Many organizations owning both a data lake and an enterprise data warehouse use the two environments in a distributed fashion for analytics. Media files such as video, audio and images are stored in the filesystem of the data lake and exposed to various analytics tools to extract insights. Other data, which could be unstructured or semi-structured, is also stored in the filesystem but exposed to separate sets of analytics tools. Once processed, the results of analytics are distilled further and moved to the enterprise data warehouse for a wider audience.

In short, organizations are trying to use the data lake and the enterprise data warehouse as a hybrid, unified system that can fulfill their data discovery and data exploration needs, allowing them to visualize the data when and in the form they want. A hybrid solution lets users take what is relevant and leave the rest.

Fig: Generic landscape for Data Lake (please ignore the company-specific flavors)

Points to consider while creating a Data Lake:

Depending upon our current situation, the road to a data lake may differ. As always, we first need to answer: “Where do we stand now, and where do we want to go with the data lake?” The general recommendation is to follow your data.

Steps	Parameters
Step 01 (Start Point)	Know the volume, variety, velocity and veracity of your data. It is important for any organization to learn how Hadoop works and make sure it works the way they desire in their context. This is very important from a future perspective. Normally, at this stage the organization should engage in simple analytics.
Step 02	The focus moves from learning to improving analytics capability. In this stage the organization should look for suitable tools and skill sets to acquire more data and build applications on top of it. Transforming the data and co-creating hybrid scenarios along with the data warehouse should also be explored and worked on.
Step 03	Democratize the data: provide access to as many people as possible. The data lake and the enterprise data warehouse start playing their respective roles.
Step 04 (Long-Running Phase)	Apply governance, compliance and auditing. Depending upon the maturity level of your data lake, you can apply the governance concepts.

Data Lake Maturity:

The data lake will fill with new data slowly and will not impact the existing models. The data lake foundation includes a big data repository, metadata management, and an application framework to capture and contextualize end user feedback. The increasing value of analytics is then directly correlated to increases in user adoption across the enterprise.

Consolidated and categorized raw data.
Attribute-level metadata tagging and linking.
Data set extraction and analysis.
Business-specific tagging, synonym identification, and links.
Convergence of Meaning within Context.
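
Attribute-level tagging and synonym identification can be pictured with a very small catalog. The sketch below is an assumption-heavy illustration, not a real metadata management tool: the Catalog class, the dataset and column names, and the tags are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """One column of one data set, with the metadata tags attached to it."""
    name: str
    tags: set = field(default_factory=set)


@dataclass
class Catalog:
    attributes: dict = field(default_factory=dict)  # "dataset.column" -> Attribute
    synonyms: dict = field(default_factory=dict)    # business term -> canonical tag

    def tag(self, dataset, column, *tags):
        """Attribute-level tagging: attach one or more tags to a column."""
        key = f"{dataset}.{column}"
        attribute = self.attributes.setdefault(key, Attribute(key))
        attribute.tags.update(tags)

    def add_synonym(self, term, canonical_tag):
        """Business-specific tagging: map a business term to a canonical tag."""
        self.synonyms[term.lower()] = canonical_tag

    def find(self, term):
        """Resolve a term through the synonyms and return all linked columns."""
        tag = self.synonyms.get(term.lower(), term.lower())
        return sorted(
            key for key, attribute in self.attributes.items() if tag in attribute.tags
        )


catalog = Catalog()
catalog.tag("crm_export", "cust_no", "customer_id")
catalog.tag("web_logs", "uid", "customer_id")
catalog.tag("web_logs", "ts", "event_time")
catalog.add_synonym("Client Number", "customer_id")

# A business user searching for "client number" finds both technical columns.
matches = catalog.find("client number")
```

Linking differently named columns to one shared meaning is a small-scale version of the "convergence of meaning within context" that the last item describes.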

There is another school of thought which defines data lake maturity as a four-stage model:

Stage 1 – Evaluating Technology
Stage 2 – Reactionary
Stage 3 – Proactive
Stage 4 – Core Competency

As the organization progresses from stage 1 to stage 4, its data lake transforms from technology infrastructure into business value. In the course of this transformation, the organization gains IT efficiency, analytical capabilities and hybrid usage of the enterprise data warehouse and the data lake.

Conclusion

The data lake has already been accepted across the industry as a viable, integrated component to consider in a data strategy, though the rate of adoption has not been as high as expected. The slow adoption can be attributed to the absence of a clear definition of the data lake and its components. Governance and security remain key concerns for most organizations and are challenges that need to be addressed in the coming days. Success stories at organizations big and small will definitely boost the concept and its adoption.

Please Note: This is a compilation of the research and reading I did for one of my projects and assignments. Most of the content is not my own, but the experience I am sharing is. This includes issues similar to those I faced.

4 COMMENTS

Amutha Vasudevan September 27, 2017 At 1:09 am

It is neet and good explanation. it helps me lot. thanks a lot. i appreciate more and more.

- SAP Yard September 27, 2017 At 6:23 am
  
  Thanks Amutha. Please keep visiting and providing your feedback.
  
  Regards,
  Team SAPYard.
  
Voice December 30, 2016 At 9:22 am

Very good article. Good read and nice information.

- SAP Yard December 30, 2016 At 2:34 pm
  
  True Bhav. Vinay’s topics are very niche and interesting.. Heard this term “Data Lake” for the first time.. 🙂
