Introduction
The need for a better system for versioning massive amounts of data has been around for years.
While Git does an excellent job of managing codebases, it sucks at versioning binary files. Even the creator of Git, Linus Torvalds, admits this:
And yes, then there's the "big file" issues. I really don't know what to do about huge files. We suck at them, I know.
Once you upload a large binary file to a repository, it is there for everyone to download. Anyone who clones or forks the repo will have a copy of the file saved on disk, not to mention all the duplication across branches.
This topic has been discussed at length by many; you can find discussions in this StackOverflow thread and here.
These problems are especially painful in data-heavy fields. For example, machine learning engineers work with massive datasets through ML pipelines, and introducing reproducibility into their workflows in an easy manner is hard.
Specialists try technologies such as DVC or MLflow to solve their versioning issues, but these tools do not provide enough stability to be adopted at an enterprise level. So, what is the solution?
lakeFS to the Rescue
lakeFS provides all the solutions for data versioning problems. It prides itself on introducing Git-level manageability of data with no duplication.
The lakeFS ecosystem introduces solutions for many challenging issues such as managing data access for multiple users, providing safe data ingestion and experimentation environments for both data consumers and engineers, running ML pipelines with any complexity level, and many more.
It also works seamlessly with many cloud-based object storage services such as Amazon S3 and Google Cloud Storage. I especially like that lakeFS provides both a UI and a CLI to work with, a combination not yet available in its competitors. You can learn about many more benefits of the tool from the official blog and the documentation.
Today, we will talk about the basics of lakeFS to give you an idea of how it works by interacting with local repositories through the lakeFS Command Line Interface. If you want to try it out using data stored on the cloud, refer to this page of the docs.
Getting Started With lakeFS: Docker Installation
From personal experience, Docker Compose is the best way to install lakeFS and run local instances.
First of all, ensure that you have Docker installed with compose version 1.25.04 or higher. If you don't have Docker installed, here are links for installation guides: macOS, Windows, Linux Distros.
You can verify that you have correctly installed Docker by running docker version on the shell:
>>> docker version
Client: Docker Engine - Community
Cloud integration: 1.0.2
Version: 19.03.13
API version: 1.40
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:00:27 2020
OS/Arch: windows/amd64
Experimental: false
Next, start the lakeFS instance with a single command:
curl https://compose.lakefs.io | docker-compose -f - up
If your output is anything like this, you are on the right track:
You can also see the image running on your Docker Desktop Console:
When you run the docker-compose command for the first time, you should set up an admin user by opening http://127.0.0.1:8000/setup in your browser. It will open up this page:
Enter a username of your choice, and it will give you one-time-only credentials. Store them securely in a file somewhere because we will need them later.
Next, proceed to http://127.0.0.1:8000/login, where you will be able to log in using your credentials. As soon as you log in, you will land on your repositories page. Think of it like your GitHub account but with lakeFS:
This page is your UI for interacting with all of your repos and your user account.
However, for those who love the shell, lakeFS provides an even more powerful Command Line Interface, which we will cover in the next sections.
Installing the lakeFS CLI (Command Line Interface)
The lakeFS CLI is distributed as a standalone binary. First, go to the GitHub releases page of lakeFS. Click on the latest release and scroll to the bottom. You will find download options depending on your OS:
Download yours and place it somewhere in your PATH. If you want to run the CLI for a single project, you can extract it directly into the root directory of your project:
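On a 64-bit Linux machine, for example, the steps might look like this (the archive name varies by release and OS, so adjust accordingly):
tar -xzf lakeFS_*_Linux_x86_64.tar.gz
sudo mv lakectl /usr/local/bin/
This extracts the archive and moves the lakectl binary onto your PATH.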
Before running the CLI commands, check that you are still running the lakeFS image from your Docker Console. Then, run this to check if the CLI is working:
lakectl help
If the shell displays a help page, congratulations, you are running the CLI on your local machine!
lakeFS namespace
Before we move on, it is important that you know the lakeFS namespace. Different operations all reference components of lakeFS repositories through the lakefs:// keyword. Here is a reference list of patterns for referring to different components:
- Repositories: lakefs://<repo-name>
- Commits: lakefs://<repo-name>@<commit-id>
- Branches: lakefs://<repo-name>@<branch-id>
- Files (objects): lakefs://<repo-name>@<branch-id>/<object path>
Important note for later sections: When you work with file and directory paths, make sure to end them with a forward slash (/) for the commands to work as expected.
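To make these patterns concrete, here is how they will look for the repository we create later in this article (the commit ID is made up for illustration):
lakefs://example
lakefs://example@f1b4c0d
lakefs://example@master
lakefs://example@master/data/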
lakectl authorization
To start interacting with repositories under your account, you should first authorize your session (this is done every time you start a new one). Start by running lakectl config.
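The prompts look roughly like this (the exact wording may vary between lakeFS versions):
>>> lakectl config
Access key ID: <your-access-key-id>
Secret access key: <your-secret-access-key>
Server endpoint URL: http://127.0.0.1:8000/api/v1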
Copy and paste the Access Key ID you saved from the earlier section, then do the same for your secret key. The Server Endpoint URL to enter is http://127.0.0.1:8000/api/v1. After you enter these values, you will be authorized and will be able to control pretty much everything related to lakeFS.
You can check if you are authorized by running this command:
lakectl repo list
It should give you an empty table since we did not create any repos. So, shall we?
Working With Repos in General
To create a repository, we will use the `lakectl repo` command, which gives access to all commands to control repositories:
lakectl repo create lakefs://example local://storage
The above command will create a repo named example in local storage, since we are using the local:// keyword. The storage word is arbitrary:
From now on, this repository can be referenced only with the lakefs://example URI (Uniform Resource Identifier). If we run lakectl repo list, we should be able to see it now:
lakectl repo list
Just like Git, each repo has a default master branch when created.
You can delete repositories with the delete keyword:
lakectl repo delete lakefs://repo-name
For full repository commands, check out the CLI reference of lakeFS.
Loading Data To Repositories
At this point, we start to interact with our data. Remember our main aim in using lakeFS: we want to manage data of any magnitude just like we manage our code. So, what you will find useful is to integrate lakeFS with git itself.
The idea is that we control any data-related changes through lakeFS and manage our code with git. To achieve this, you should put all the data file extensions in the .gitignore file so they won't be tracked by git afterward.
Now, say we want to upload some audio files, stored inside the data directory, to our lakeFS repo:
Since they have the .wma extension, make sure you add *.wma as a new line to .gitignore.
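One quick way to do that from the shell:
echo "*.wma" >> .gitignore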
Let's upload all the files in data. Just like the lakectl repo command, lakectl fs gives access to commands that manipulate files and objects. We will use the upload command, which has this pattern:
lakectl fs upload --recursive --source path/ lakefs://repo-name@branch-name/extra-path/
The above command works for uploading either a single file or many files from a given directory. You should provide the path after the --source flag. For the destination, you must include the repository name followed by a branch. It is also very important to end both the source and destination paths with a /; otherwise, the command fails.
Here is the sample command to upload the 4 audio files:
lakectl fs upload --recursive --source data/ lakefs://example@master/data/
lakectl fs upload is the equivalent of git add. To list the contents of a directory, we can run:
lakectl fs ls lakefs://repo@branch/path/
The above command is equivalent to the shell's ls command. When you give the pathname, just like the others, it should start with lakefs:// followed by the repository name, branch name, and path.
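For instance, to verify our upload to the example repo:
lakectl fs ls lakefs://example@master/data/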
Making Commits With lakectl
We just uploaded new files to our repository. Notice we did not write any code, so we do not need to make a commit through git. However, to save the changes to our lakeFS repo, we should make one through lakectl.
lakectl commit commands generally follow this pattern:
lakectl commit lakefs://repo-name@branch-name --message "Commit Message"
But before committing, it is usually helpful to see the changes we have made since the last commit. Just like git diff, a similar command exists for lakectl and follows this pattern:
lakectl diff lakefs://repo-name@branch-name
The diff command shows all the uncommitted changes made to the lakeFS repository on the branch you specify.
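In our case, that would be:
lakectl diff lakefs://example@master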
Now, after making sure everything is good, we can commit our changes:
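For instance (the commit message is up to you):
lakectl commit lakefs://example@master --message "Upload audio files"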
We will get a success message once the changes are committed. It also gives some details such as commit ID and timestamp.
You can see the list of commits on your repo with this command:
lakectl log lakefs://repo-name@branch
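For our example repo:
lakectl log lakefs://example@master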
Working With Branches
The real power of lakeFS can be seen in branches. Creating branches in git allows you to duplicate your codebase and work with it in isolation to try out experiments and features. However, doing this for repositories with an enormous amount of data is not feasible, both storage-wise and time-wise.
lakeFS solves this problem elegantly. If you create a branch for your lakeFS repository, the task is performed instantaneously and without duplicating any data. Creating a branch at a particular point in your repo's commit history will create a snapshot of the repo's state at that particular commit, again without duplication. The official website says that it is all about metadata management under the hood.
Before we get to creating branches, I will upload some more data and make a few commits for example purposes.
Next, we will create a new branch at the head, meaning from the latest commit. First, get yourself acquainted with the command to create branches:
lakectl branch create lakefs://repo-name@new-branch-name --source lakefs://repo-name@source-branch
When you create a branch, you should specify both the new branch's name and the one it should branch out from. This means that you can create branches from existing ones; it does not have to be master.
lakectl branch create lakefs://example@new --source lakefs://example@master
This will create a branch named new, and you can list out existing ones with this command:
lakectl branch list lakefs://example
Now, suppose you did some experimentation with your data and tested out new features. Once satisfied, you may want to merge this newly-created branch back into master.
First of all, make sure that you commit any unsaved changes to your new branch:
lakectl commit lakefs://example@new --message "Tested some new features"
Before merging, you may want to see what is getting modified when you merge two branches. In this scenario, you can use the diff command again. The below usage of the command will yield the difference between the two branches:
lakectl diff lakefs://example@master lakefs://example@new
Now merge the branches with:
lakectl merge lakefs://example@new lakefs://example@master
Note that any uncommitted changes will be committed when two branches merge.
One final point for working with branches: if you are unsatisfied with the changes in any branch, you can always revert them with lakectl. The CLI provides 4 options depending on the situation. I won't list them out here, but you can always learn about them from the CLI reference.