A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). A data lake can be established "on premises" (within an organization's data centers) or "in the cloud" (using cloud services from vendors such as Amazon, Microsoft, or Google).
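The idea of keeping data in its raw format and interpreting it only when it is read can be illustrated with a minimal sketch. It assumes a local directory standing in for the object store; the `raw/<source>/` layout, the helper names `ingest_raw` and `read_records`, and the sample payloads are all hypothetical and are not drawn from any particular data lake product.

```python
import csv
import io
import json
import tempfile
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Store the payload byte-for-byte under raw/<source>/ (hypothetical layout)."""
    target = lake_root / "raw" / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing or cleansing at write time
    return target


def read_records(path: Path):
    """Interpret a stored object only when it is read, based on its extension."""
    data = path.read_bytes()
    if path.suffix == ".json":
        return json.loads(data)
    if path.suffix == ".csv":
        return list(csv.DictReader(io.StringIO(data.decode("utf-8"))))
    return data  # unstructured or binary data is returned untouched


# A temporary directory stands in for the lake's storage in this sketch.
lake = Path(tempfile.mkdtemp())
orders = ingest_raw(lake, "crm", "orders.csv", b"id,total\n1,9.50\n2,12.00\n")
event = ingest_raw(lake, "sensors", "reading.json", b'{"device": "t-17", "celsius": 21.4}')
image = ingest_raw(lake, "uploads", "logo.png", b"\x89PNG\r\n")

print(read_records(orders))  # structured rows parsed at read time
print(read_records(event))   # semi-structured document parsed at read time
print(read_records(image))   # binary data returned as raw bytes
```

In this sketch the stored bytes are never altered, so a consumer with different needs could read the same object with a different parser.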
A data swamp is a deteriorated and unmanaged data lake that is either inaccessible to its intended users or provides little value.
James Dixon, then chief technology officer at Pentaho, coined the term "data lake" to contrast it with the data mart, which is a smaller repository of interesting attributes derived from raw data. In promoting data lakes, he argued that data marts have several inherent problems, such as information siloing. PricewaterhouseCoopers (PwC) said that data lakes could "put an end to data silos." In their study on data lakes, they noted that enterprises were "starting to extract and place data for analytics into a single, Hadoop-based repository." Hortonworks, Google, Oracle, Microsoft, Zaloni, Teradata, Impetus Technologies, Cloudera, MongoDB, and Amazon now all have data lake offerings.
Many companies use cloud storage services such as Google Cloud Storage and Amazon S3, or a distributed file system such as the Apache Hadoop distributed file system (HDFS). Academic interest in the concept of data lakes is gradually growing. For example, Personal DataLake at Cardiff University is a new type of data lake that aims to manage the big data of individual users by providing a single point for collecting, organizing, and sharing personal data. An earlier data lake (Hadoop 1.0) had limited capabilities because batch-oriented processing (MapReduce) was the only processing paradigm associated with it. Interacting with the data lake required expertise in Java with MapReduce and in higher-level tools such as Apache Pig, Apache Spark and Apache Hive (which were themselves batch-oriented).
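The batch-oriented MapReduce paradigm mentioned above can be sketched in a few lines of plain Python. This is an illustration of the map, shuffle, and reduce steps only, not the Hadoop API: the function names, the in-memory `shuffle`, and the sample log format are assumptions made for the example, and a real cluster would distribute each phase across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit one (level, 1) pair per log line; the log format is hypothetical."""
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1  # second token is assumed to be the log level


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the values collected for one key into a single count."""
    return key, sum(values)


def run_batch(lines: Iterable[str]) -> Dict[str, int]:
    """Run the whole job over a complete input set; results exist only at the end."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


log_lines = [
    "day-1 ERROR disk full",
    "day-1 INFO job started",
    "day-2 ERROR timeout",
]
counts = run_batch(log_lines)
print(counts)
```

The sketch shows why the paradigm is considered batch-oriented: no result is available until every input line has passed through all three phases.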
In June 2015, David Needle characterized "so-called data lakes" as "one of the more controversial ways to manage big data". PwC was also careful to note in their research that not all data lake initiatives are successful. They quote Sean Martin, CTO of Cambridge Semantics:
"We see customers creating big data graveyards, dumping everything into Hadoop distributed file system (HDFS) and hoping to do something with it down the road. But then they just lose track of what’s there. The main challenge is not creating a data lake, but taking advantage of the opportunities it presents."
The PwC report describes companies that build successful data lakes as gradually maturing their lake as they figure out which data and metadata are important to the organization. Another criticism is that the concept is fuzzy and arbitrary: the term has been applied to any tool or data management practice that does not fit into the traditional data warehouse architecture. The data lake has variously been referred to as a particular technology, labeled a raw data reservoir or a hub for ETL offload, and defined as a central hub for self-service analytics. Because the concept has been overloaded with so many meanings, the usefulness of the term has been called into question.
Although critiques of data lakes are warranted, many are overly broad and could be applied to technology endeavors in general and to data projects in particular. For example, the term "data warehouse" currently suffers from the same opaque and changing definition as "data lake". Not all data warehouse efforts have been successful either. In response to various critiques, McKinsey noted that the data lake should be viewed as a service model for delivering business value within the enterprise, not a technology outcome.
- "The growing importance of big data quality". The Data Roundtable. Retrieved 1 June 2020.
- "What is a data lake?". aws.amazon.com. Retrieved 12 October 2020.
- Campbell, Chris. "Top Five Differences between DataWarehouses and Data Lakes". Blue-Granite.com. Retrieved 19 May 2017.
- Olavsrud, Thor. "3 keys to keep your data lake from becoming a data swamp". CIO. Retrieved 5 July 2017.
- Woods, Dan (21 July 2011). "Big data requires a big architecture". Tech. Forbes.
- Dixon, James (14 October 2010). "Pentaho, Hadoop, and Data Lakes". James Dixon’s Blog. James. Retrieved 7 November 2015.
If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
- Stein, Brian; Morrison, Alan (2014). Data lakes and the promise of unsiloed data (PDF) (Report). Technology Forecast: Rethinking integration. PricewaterhouseCoopers.
- Weaver, Lance (10 November 2016). "Why Companies are Jumping into Data Lakes". blog.equinox.com. Retrieved 19 May 2017.
- Tuulos, Ville (22 September 2015). "Petabyte-Scale Data Pipelines with Docker, Luigi and Elastic Spot Instances".
- Walker, Coral; Alrehamy, Hassan (2015). "Personal Data Lake with Data Gravity Pull". 2015 IEEE Fifth International Conference on Big Data and Cloud Computing. pp. 160–167. doi:10.1109/BDCloud.2015.62. ISBN 978-1-4673-7183-4. S2CID 18024161.
- Needle, David (10 June 2015). "Hadoop Summit: Wrangling Big Data Requires Novel Tools, Techniques". Enterprise Apps. eWeek. Retrieved 1 November 2015.
Walter Maguire, chief field technologist at HP's Big Data Business Unit, discussed one of the more controversial ways to manage big data, so-called data lakes.
- "Are Data Lakes Fake News?". Sonra. 8 August 2017. Retrieved 10 August 2017.
- "A smarter way to jump into data lakes". McKinsey. 1 August 2017.