Introduction

The need for a better system to version massive amounts of data has been around for years.

While Git does an excellent job of managing codebases, it sucks at versioning binary files. Even the creator of Git, Linus Torvalds, admits this:

And yes, then there's the "big file" issues. I really don't know what to do about huge files. We suck at them, I know.

Once you push a large binary file to a repository, it is there in the history for everyone to download. Anyone who clones or forks the repo gets a copy of the file on their disk, not to mention all the duplication across branches.

This topic has been discussed at length by many; you can find discussions on this StackOverflow thread and here.

These problems are challenging in many data-heavy fields. For example, machine learning engineers push massive datasets through ML pipelines, and making their workflows reproducible in a simple way is hard.

Specialists try technologies such as DVC or MLflow to solve their versioning issues, but these tools are often not stable enough to adopt at an enterprise level. So, what is the solution?

lakeFS to the Rescue

lakeFS sets out to solve these data versioning problems. It prides itself on bringing Git-level manageability to data with no duplication.

The lakeFS ecosystem introduces solutions for many challenging issues such as managing data access for multiple users, providing safe data ingestion and experimentation environments for both data consumers and engineers, running ML pipelines with any complexity level, and many more.

It also works seamlessly with cloud object storage systems such as Amazon S3 and Google Cloud Storage. I especially like that lakeFS provides both a UI and a CLI, a combination not yet available in its competitors. You can learn about many more benefits of the tool from the official blog and the documentation.

Today, we will talk about the basics of lakeFS to give you an idea of how it works by interacting with local repositories through the lakeFS Command Line Interface. If you want to try it out using data stored on the cloud, refer to this page of the docs.

Getting Started With lakeFS: Docker Installation

From personal experience, Docker Compose is the best way to install lakeFS and run local instances.

First of all, ensure that you have Docker installed with compose version 1.25.04 or higher. If you don't have Docker installed, here are links for installation guides: macOS, Windows, Linux Distros.
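
If you are not sure which Compose version you have, a quick check from the shell (assuming docker-compose is already on your PATH) looks like this:

# print the installed Docker Compose version
docker-compose --version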

You can verify that you have correctly installed Docker by running docker version on the shell:

>>> docker version
Client: Docker Engine - Community
 Cloud integration: 1.0.2
 Version:           19.03.13
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        4484c46d9d
 Built:             Wed Sep 16 17:00:27 2020
 OS/Arch:           windows/amd64
 Experimental:      false

Next, start a lakeFS instance with a single command:

curl https://compose.lakefs.io | docker-compose -f - up

If your output is anything like this, you are on the right track:

demonstrative screenshot showing the desired output

You can also see the image running on your Docker Desktop Console:

demonstrative screenshot showing the desired output

When you run the docker-compose command for the first time, you should set up an admin user by opening http://127.0.0.1:8000/setup in your browser. It will open up this page:

screenshot of the page

Enter a username of your choice, and it will give you one-time-only credentials. Store them securely in a file somewhere because we will need them later.

Next, proceed to http://127.0.0.1:8000/login, where you will be able to log in using your credentials. As soon as you log in, you will land on your repositories page. Think of it like your GitHub account but with lakeFS:

screenshot of login on lakeFS

This page is your UI for interacting with all of your repos and your user account.

However, for peeps who love the shell, lakeFS provides an even more powerful Command Line Interface, which we will cover in the next sections.

Installing the lakeFS CLI (Command Line Interface)

The lakeFS CLI is distributed as a standalone binary. First, go to the lakeFS GitHub releases page, click on the latest release, and scroll to the bottom. You will find download options depending on your OS:

screenshot of download options

Download yours and place it somewhere in your PATH. If you want to run the CLI for a single project, you can extract it directly into the root directory of your project:

screenshot of CLI
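
As a rough sketch on Linux or macOS, placing the binary on your PATH looks something like this (the archive name is illustrative and depends on the release and platform you downloaded):

# extract the downloaded release archive (name varies by version and platform)
tar -xzf lakeFS_*_Linux_x86_64.tar.gz
# move the lakectl binary somewhere on your PATH
sudo mv lakectl /usr/local/bin/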

Before running the CLI commands, check that you are still running the lakeFS image from your Docker Console. Then, run this to check if the CLI is working:

lakectl help

If the shell displays a help page, congratulations, you are running the CLI on your local machine!

lakeFS namespace

Before we move on, it is important that you know the lakeFS namespace. Different operations all reference components of lakeFS repositories through the lakefs:// keyword. Here is a reference list of patterns for referring to different components:

  • Repositories: lakefs://<repo-name>
  • Commits: lakefs://<repo-name>@<commit-id>
  • Branches: lakefs://<repo-name>@<branch-id>
  • Files (objects): lakefs://<repo-name>@<branch-id>/<object path>

Important note for later sections: When you work with file and directory paths, make sure to end them with a forward slash / for the commands to work as expected.
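
For example, with the repo, branch, and data directory we create later in this article, the URIs look like this:

lakefs://example                # the repository itself
lakefs://example@master         # the master branch
lakefs://example@master/data/   # the data directory on the master branch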

lakectl authorization

To start interacting with repositories under your account, you should first authorize lakectl with your credentials. Start by running lakectl config, which should show this output:

Screenshot of output

Copy and paste the Access Key ID you saved in the earlier section. Do the same for your secret key:

Screenshot of output after entering Access Key ID

It does not ask for each field twice; it is just displayed that way. The Server Endpoint URL is http://127.0.0.1:8000/api/v1. After you enter these values into the fields, you will be authorized and able to control pretty much everything related to lakeFS.
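
Put together, a configuration session looks roughly like this (the key values below are made up; paste the credentials you saved earlier):

lakectl config
# prompts for your credentials and the server endpoint
Access key ID: AKIAJEXAMPLEKEYID
Secret access key: ****************************************
Server endpoint URL: http://127.0.0.1:8000/api/v1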

You can check if you are authorized by running this command:

lakectl repo list
Screenshot of output

It should give you an empty table since we did not create any repos. So, shall we?

Working With Repos in General

To create a repository, we will use the `lakectl repo` command, which gives access to all commands to control repositories:

lakectl repo create lakefs://example local://storage

The above command will create a repo named example in local storage since we are using the local:// keyword. The storage part is arbitrary:

Screenshot of repo creation

From now on, this repository can be referenced with the lakefs://example URI (Uniform Resource Identifier). If we run lakectl repo list, we should be able to see it now:

lakectl repo list
Screenshot of output

Just like Git, each repo has a default master branch when created.
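
You can verify this right after creating the repo; listing the branches of our example repo should show only master:

# list all branches of the example repository
lakectl branch list lakefs://example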

You can delete repositories with the delete keyword:

lakectl repo delete lakefs://repo-name

For full repository commands, check out the CLI reference of lakeFS.

Loading Data To Repositories

At this point, we start interacting with our data. Remember the main aim of using lakeFS: we want to manage data of any magnitude just like we manage our code. So, you will find it useful to integrate lakeFS with git itself.

The idea is that we track all data-related changes through lakeFS and manage our code with git. To achieve this, add the data file extensions to your .gitignore file so that git no longer tracks them.

Now, say we want to upload some audio files, stored inside the data directory, to our lakeFS repo:

Screenshot of upload

Since they have the .wma extension, make sure you add *.wma as a new line to .gitignore.
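
A quick way to do this from the shell, assuming you are in the root of your git project:

# keep git from tracking the audio files that lakeFS will manage
echo "*.wma" >> .gitignore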

Let's upload all the files in data. Just like the lakectl repo command, lakectl fs gives access to commands that manipulate files and objects. We will use the upload command, which has this pattern:

lakectl fs upload --recursive --source path/ lakefs://repo-name@branch-name/extra-path/

The above command works for uploading either a single file or many files from a given directory. Provide the source path after the --source flag. For the destination, you must include the repository name followed by a branch. It is also very important to end both source and destination paths with a /; otherwise, the command fails.

Here is the sample command to upload the 4 audio files:

lakectl fs upload --recursive --source data/ lakefs://example@master/data/
Screenshot of output

lakectl fs upload is the equivalent of git add. To list the contents of a directory, we can run:

lakectl fs ls lakefs://repo@branch/path/
Screenshot of output

The above command is equivalent to the shell's ls command. As with the other commands, the pathname should start with lakefs:// followed by the repository name, branch name, and path.
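
For the files we just uploaded, listing the data directory on master looks like this:

# list the objects uploaded under data/ on the master branch
lakectl fs ls lakefs://example@master/data/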

Making Commits With lakectl

We just uploaded new files to our repository. Notice we did not write any code, so we do not need to make a commit through git. However, to save the changes to our lakeFS repo, we should make one through lakectl.

lakectl commit commands generally follow this pattern:

lakectl commit lakefs://repo-name@branch-name --message "Commit Message"

But before committing, it is usually helpful to see the changes we have made since the last commit. Just like git diff, lakectl has a similar command that follows this pattern:

lakectl diff lakefs://repo-name@branch-name
Screenshot of output

The diff command shows all the uncommitted changes made to the lakeFS repository on the branch you specify.

Now, after making sure everything is good, we can commit our changes:
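
For the example repository used in this article, the command looks like this (the commit message is just illustrative):

lakectl commit lakefs://example@master --message "Add audio data files"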

Screenshot of output

We will get a success message once the changes are committed, along with details such as the commit ID and timestamp.

You can see the list of commits on your repo with this command:

lakectl log lakefs://repo-name@branch
Screenshot of output

Working With Branches

The real power of lakeFS shows itself when working with branches. Creating branches in git allows you to duplicate your codebase and work on it in isolation to try out experiments and features. However, doing this for repositories with enormous amounts of data is not feasible, both storage-wise and time-wise.

lakeFS solves this problem elegantly. If you create a branch in your lakeFS repository, the task is performed instantaneously and without duplicating the data. Creating a branch at a particular point of your repo's commit history will create a snapshot of the repo's state at that particular commit, again without duplication. The official website says that it is all about metadata management under the hood.

Before we get to creating branches, I will upload some more data and make a few commits for example purposes.

Screenshot of additional data and commits

Next, we will create a new branch at the head, meaning from the latest commit. First, get yourself acquainted with the command to create branches:

lakectl branch create lakefs://repo-name@new-branch-name --source lakefs://repo-name@source-branch

When you create a branch, you should specify both the new branch's name and the one it should branch off from. This means that you can create branches from existing ones; it does not have to be master.

lakectl branch create lakefs://example@new --source lakefs://example@master

This will create a branch named new, and you can list the existing branches with this command:

lakectl branch list lakefs://example

Now, suppose you did some experimentation with your data and tested out new features. Once satisfied, you may want to merge this newly-created branch back to master.

First of all, make sure that you commit any unsaved changes to your new branch:

lakectl commit lakefs://example@new --message "Tested some new features"

Before merging, you may want to see what will be modified when the two branches are merged. In this scenario, you can use the diff command again. The usage below yields the difference between the two branches:

lakectl diff lakefs://example@master lakefs://example@new

Now merge the branches with:

lakectl merge lakefs://example@new lakefs://example@master

Note that any uncommitted changes will be committed when two branches merge.

One final point for working with branches: If you are unsatisfied with the changes in any branch, you can always revert them with lakectl. The CLI provides 4 options depending on the situation. I won't list them out here but you can always learn about them from the CLI reference.
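
As a single illustration, discarding all uncommitted changes on the new branch would look roughly like this; the exact syntax and flags depend on your lakectl version, so treat this as an assumption and check the reference:

# discard uncommitted changes on the branch (assumed syntax; see the CLI reference)
lakectl branch revert lakefs://example@new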