lakeFS

From Wikipedia, the free encyclopedia
Original author(s): Einat Orr, Oz Katz
Developer(s): Treeverse
Initial release: August 3, 2020
Stable release: 0.104.0
Repository: https://github.com/treeverse/lakeFS
Written in: Go
Type: Data version control
License: Apache 2.0
Website: lakefs.io

lakeFS is free and open-source software developed by Treeverse.[1][2] It provides scalable, format-agnostic version control for data lakes,[3] using Git-like semantics to create and access different versions of the data.[1][2]

First released in August 2020, its features include data version tracking, isolated development and testing, repository rollback, and continuous data integration and deployment.

History

lakeFS was developed by Oz Katz and Einat Orr in 2020.[4][5]

Its first public release, v0.8.1, was published by Treeverse in August 2020. It offered Git-like operations for any file format, compatibility with AWS S3 storage, and a versioning engine based on multiversion concurrency control (MVCC).[6]

In 2021, the versioning engine was replaced by Graveler, allowing lakeFS to handle billions of objects with limited performance impact.[7]

In July 2021, Treeverse, the company behind lakeFS, raised $23 million in a Series A funding round led by Dell Technologies Capital, Norwest Venture Partners, and Zeev Ventures.[5][8][9]

In June 2022, lakeFS Cloud was introduced as a managed service to facilitate versioning in cloud data lakes.[1][3] This service helps mitigate challenges related to tracking data changes and reverting to previous versions.[3]

Software

Overview

lakeFS is a data versioning engine that manages data in a way similar to code. By using operations such as branching, committing, merging, and reverting, which resemble those found in Git, it facilitates the handling of data and its corresponding schema throughout the entire data life cycle.[10]
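The Git-like operations named above can be illustrated with a toy in-memory model. This is only a sketch: the `Repo` class and its method names are hypothetical and are not the lakeFS API. It assumes that each commit is an immutable snapshot mapping paths to content, and it uses a deliberately simplified merge with no conflict detection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot (path -> content) linked to its parent commit."""
    snapshot: dict
    message: str
    parent: "Commit | None" = None


class Repo:
    """Hypothetical toy repository for illustration; not the lakeFS API."""

    def __init__(self):
        root = Commit(snapshot={}, message="initial")
        self.branches = {"main": root}  # branch name -> head commit
        self.staged = {"main": {}}      # branch name -> uncommitted writes

    def branch(self, name, source="main"):
        # A new branch is just another name for an existing commit.
        self.branches[name] = self.branches[source]
        self.staged[name] = {}

    def write(self, branch, path, content):
        self.staged[branch][path] = content

    def commit(self, branch, message):
        head = self.branches[branch]
        snapshot = {**head.snapshot, **self.staged[branch]}
        self.branches[branch] = Commit(snapshot, message, parent=head)
        self.staged[branch] = {}

    def merge(self, source, target):
        # Simplified: paths from the source win; no conflict detection.
        src, dst = self.branches[source], self.branches[target]
        merged = {**dst.snapshot, **src.snapshot}
        self.branches[target] = Commit(merged, f"merge {source}", parent=dst)

    def revert(self, branch):
        # Restore the parent's snapshot as a new commit, keeping history.
        head = self.branches[branch]
        if head.parent is None:
            raise ValueError("nothing to revert")
        restored = dict(head.parent.snapshot)
        self.branches[branch] = Commit(restored, f"revert {head.message!r}", parent=head)

    def read(self, branch, path):
        return self.branches[branch].snapshot.get(path)


repo = Repo()
repo.write("main", "users.csv", "v1")
repo.commit("main", "add users")

repo.branch("dev")
repo.write("dev", "users.csv", "v2")
repo.commit("dev", "update users")
print(repo.read("main", "users.csv"))  # main is unaffected by work on dev

repo.merge("dev", "main")
print(repo.read("main", "users.csv"))  # main now sees the change from dev

repo.revert("main")
print(repo.read("main", "users.csv"))  # back to the pre-merge content
```

In this model a revert is itself a commit, so the history of the branch is preserved rather than rewritten.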

Features

lakeFS provides an interface to object stores such as S3 and to data management systems such as AWS Glue and Databricks.[1] It delegates the actual data storage to backend services such as AWS, handles branch tracking itself, and supports multiple storage providers.[1]

lakeFS simplifies creating, tracking, merging, and deleting branches as required.[1] Because experimental modifications are isolated on a branch, testing does not require a complete copy of the dataset.[1] lakeFS also integrates with continuous integration and deployment pipelines via webhooks.[1]

When dealing with arbitrary object storage, lakeFS processes data blocks via API calls.[1] It stores branching information as metadata, enabling efficient subsequent object management as needed.[1]
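One way to picture why branching does not require duplicating a dataset: if each object is stored once under a key, a branch can be a small metadata mapping from logical paths to those keys. The sketch below assumes this simplified design. `ObjectStore`, `BranchIndex`, and the SHA-256 keying scheme are hypothetical names and choices made for illustration; they do not describe the actual internals of lakeFS.

```python
import hashlib


class ObjectStore:
    """Stand-in for a storage backend: keeps each distinct blob once."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self.blobs[key]


class BranchIndex:
    """Metadata only: branch name -> {logical path -> object key}."""

    def __init__(self, store: ObjectStore):
        self.store = store
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # Copies pointers to objects, never the objects themselves.
        self.branches[name] = dict(self.branches[source])

    def upload(self, branch, path, data: bytes):
        self.branches[branch][path] = self.store.put(data)

    def read(self, branch, path) -> bytes:
        return self.store.get(self.branches[branch][path])

    def delete_branch(self, name):
        # Dropping a branch only discards its metadata.
        del self.branches[name]


store = ObjectStore()
index = BranchIndex(store)
index.upload("main", "events/2023.parquet", b"original rows")

index.create_branch("experiment")
print(len(store.blobs))  # still a single stored blob after branching

index.upload("experiment", "events/2023.parquet", b"modified rows")
print(index.read("main", "events/2023.parquet"))        # main is untouched
print(index.read("experiment", "events/2023.parquet"))  # branch sees its own version
```

Only the object that actually changed on the branch adds to storage; everything else is shared between branches through the metadata.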

lakeFS hooks

lakeFS hooks run checks and validations before key lifecycle events.[10] Unlike Git hooks, they call a remote server to run the tests.[10] They can be configured to check table schemas when data is merged from development or test branches into production; if validation fails, the merge is blocked.[10] This makes hooks a tool for enforcing schemas and applying consistent rules across data sources and producers.[10]

Events that can trigger hooks include commits, branch merges, branch creation, and changes to tags.[11] During a merge, a pre-merge hook runs on the source branch before the merge is finalized.[11]
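The pre-merge flow can be sketched as a small event dispatcher. This is a simplified local model that assumes hooks are plain Python callables rather than calls to a remote server. The `HookRegistry` class, the `pre-merge` event name as used here, and the dictionary-based schema are all hypothetical and chosen for illustration; they are not taken from lakeFS configuration.

```python
class HookFailed(Exception):
    """Raised by a hook to block the operation that triggered it."""


class HookRegistry:
    """Hypothetical dispatcher: event name -> list of hook callables."""

    def __init__(self):
        self._hooks = {}

    def register(self, event, hook):
        self._hooks.setdefault(event, []).append(hook)

    def run(self, event, context):
        # Any hook may raise HookFailed, which aborts the operation.
        for hook in self._hooks.get(event, []):
            hook(context)


def make_schema_check(expected):
    """Build a hook that rejects a source branch whose schema has drifted."""

    def check(context):
        actual = context["source_schema"]
        drifted = sorted(
            col for col in expected.keys() | actual.keys()
            if expected.get(col) != actual.get(col)
        )
        if drifted:
            raise HookFailed(f"schema mismatch in columns: {drifted}")

    return check


def merge_branches(registry, source, target, source_schema):
    # The hook inspects the source branch before the merge is finalized.
    registry.run("pre-merge", {
        "source": source,
        "target": target,
        "source_schema": source_schema,
    })
    return f"merged {source} into {target}"


registry = HookRegistry()
registry.register("pre-merge", make_schema_check({"id": "int", "email": "string"}))

# A source branch with the expected schema merges normally.
print(merge_branches(registry, "dev", "main", {"id": "int", "email": "string"}))

# A source branch with a changed column type is blocked.
try:
    merge_branches(registry, "dev", "main", {"id": "string", "email": "string"})
except HookFailed as err:
    print("merge blocked:", err)
```

Because the check runs before the merge is recorded, a failed validation leaves the target branch unchanged.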

References

  1. Wayner, Peter (June 27, 2022). "LakeFS brings branching to data lakes". VentureBeat.
  2. Borck, James R. (October 18, 2021). "The best open source software of 2021". InfoWorld.
  3. Kerner, Sean Michael (June 22, 2022). "Treeverse set to launch lakeFS cloud data lake service". TechTarget. Retrieved June 27, 2023.
  4. Goldberg, Niva (July 29, 2021). "Israeli Startup Treeverse Secures $23 Million for Open Source Technology". Jewish Business News.
  5. Sawers, Paul (July 28, 2021). "Treeverse raises $23M to bring Git-like version control to data lakes". VentureBeat. Retrieved June 27, 2023.
  6. "v0.8.1". GitHub. Retrieved June 27, 2023.
  7. "lakeFS Architecture".
  8. Orbach, Meir (July 28, 2021). "Treeverse raises $15 million Series A to leverage lakeFS". Calcalist.
  9. Martin, Noga (July 28, 2021). "Open source technology lakeFS secures $23M in funding". Israel Hayom.
  10. Hemo, Yaniv Ben (February 3, 2023). "How To Avoid 'Schema Drift'".
  11. Avneri, Iddo (June 27, 2023). "Managing Schema Validation in a Data Lake Using Data Version Control".