User:Bhushanvc

From Wikipedia, the free encyclopedia

What is a data lake?

A data lake is a method of storing data in a centralized store that accommodates all forms of data: structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, and newer formats such as JSON), unstructured data (emails, documents, PDFs), and binary data such as images, audio, and video. The data is usually stored on Hadoop, Azure Storage, or Amazon S3.

The idea of a data lake is to have a single store for all data in the enterprise, ranging from raw data (an exact copy of the source system data) to transformed data used for tasks such as reporting, visualization, analytics, and machine learning.
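
The split between raw and transformed data can be illustrated with a minimal sketch. It assumes a local temporary directory standing in for the lake's storage (in practice this would be HDFS, Azure Storage, or Amazon S3), and the zone names `raw` and `transformed`, the source name `sales_db`, the sample order data, and the helper functions are all hypothetical, chosen only for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path


def ingest_raw(lake, source_name, filename, payload):
    """Store an exact, byte-for-byte copy of source data in the raw zone."""
    target = lake / "raw" / source_name / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def transform_orders(lake, raw_file):
    """Derive a reporting-friendly JSON summary from a raw CSV file."""
    with raw_file.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    totals = {}
    for row in rows:
        # Sum order amounts per customer (hypothetical column names).
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + float(row["amount"])
    target = lake / "transformed" / "orders" / "totals_by_customer.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(totals, sort_keys=True))
    return target


def demo():
    """Ingest a small CSV into the raw zone, then build a transformed view."""
    source = b"customer,amount\nalice,10.5\nbob,4\nalice,2.5\n"
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)  # stand-in for the lake's storage root
        raw_file = ingest_raw(lake, "sales_db", "orders.csv", source)
        # The raw zone keeps the source bytes unchanged.
        assert raw_file.read_bytes() == source
        summary = transform_orders(lake, raw_file)
        return json.loads(summary.read_text())


if __name__ == "__main__":
    print(demo())
```

In this sketch the raw copy is never modified. The transformed file is derived from it, so it could be rebuilt at any time from the raw zone.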

Early data lakes built on Hadoop 1.0 had limited capabilities, because batch-oriented MapReduce was the only processing paradigm available. Interacting with the data lake required expertise in Java and MapReduce, or in higher-level tools such as Pig and Hive (which were themselves batch-oriented). With Hadoop 2.0, resource management was separated out and taken over by YARN (Yet Another Resource Negotiator), and new processing paradigms such as streaming, interactive, and online processing became available on Hadoop and the data lake.