January 26, 2018

Quick Guide: Git Large File Storage (LFS) for Excel

Posted by Björn Stiel

As a distributed version control system, Git by default copies the entire repository history to the client when you do a git clone, and every file version you do not have yet when you do a git fetch. For repositories containing large files and/or long commit histories, cloning can take a long time, as every file version that ever existed has to be downloaded.

For Excel workbook repositories, it is only a question of time before this becomes a real problem. If your repository contains one 5MB workbook that you commit once every business day, you end up with roughly 250 x 5MB = 1.25GB worth of workbook versions after one year. This figure is an upper bound, as Git compresses files, but it gives you an idea of how much data has to be downloaded when you (or someone else) perform a git clone or git fetch; and it will only grow as time goes by.
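The back-of-the-envelope figure above is easy to reproduce. The following is only a rough sketch: it ignores compression, and the function name and the assumption of about 250 business days per year are mine, purely for illustration.

```python
def uncompressed_history_mb(workbook_mb: float, commits: int) -> float:
    """Upper bound on history size: every commit stores a full workbook copy."""
    return workbook_mb * commits


# One 5 MB workbook, committed once per business day (about 250 per year).
total_mb = uncompressed_history_mb(workbook_mb=5, commits=250)
print(f"{total_mb:.0f} MB, i.e. about {total_mb / 1000:.2f} GB")
```

Change the two arguments to match your own workbook size and commit frequency.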

What is Git Large File Storage?

Git Large File Storage (LFS) is an open-source Git extension developed by Atlassian, GitHub, and other contributors. Git LFS reduces the impact of large files by downloading them lazily: files tracked by LFS are downloaded only when you check out a specific version, rather than every file version that ever existed being downloaded during the clone or fetch process.

This helps tremendously when managing our Excel workbook repositories: When you do a git clone or git pull, Git only downloads the head version so that you can get to work straight away (instead of having to wait for all those old versions that you are probably not interested in anyway at that point).

How does Git LFS work?

Git LFS handles large files by replacing them with tiny pointer files. These pointer files act as references to the actual files which are stored somewhere else. For your normal Git operations, you never get to see these pointer files as Git LFS handles them automatically:

  1. When you add a file via git add, Git LFS replaces its contents with a pointer, and writes the file contents to a local Git LFS cache.

  2. When you push new commits to the server, any Git LFS files referenced by the newly pushed commits are transferred from your local Git LFS cache to the remote Git LFS store tied to your Git repository.

  3. When you check out a commit that contains Git LFS pointers, they are replaced with files from your local Git LFS cache, or downloaded from the remote Git LFS store.

In your local working copy you only see your actual file content. You can use git checkout, git add and git commit as normal; there is no change to your usual Git workflow. git clone and git pull operations are faster because Git only downloads the versions of large files referenced by the commits you check out.
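To give you a feel for how tiny a pointer is, here is a minimal sketch that builds one in Python. It follows the three-line layout of the Git LFS pointer format (version, SHA-256 object id, size in bytes); the function name and the stand-in content are hypothetical, and in real life Git LFS writes these pointers for you.

```python
import hashlib


def make_lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# Stand-in bytes; a real workbook would be read with open(path, "rb").
workbook_bytes = b"pretend this is a 5 MB workbook"
pointer = make_lfs_pointer(workbook_bytes)
print(pointer)
print(f"pointer is {len(pointer.encode())} bytes")
```

Whatever the size of the workbook, the pointer stays well below 200 bytes, which is why commits that only contain pointers are so cheap to clone and fetch.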

To use Git LFS server-side, you need a Git LFS-aware host such as GitHub, GitHub Enterprise, Bitbucket Cloud, Bitbucket Server, hosted GitLab or self-hosted GitLab.

Depending on your Git server system, Git LFS is either enabled by default or you might need to enable Git LFS manually. Please check the documentation for your Git host system.

Install Git LFS

Git LFS is a Git extension that you only need to install once. Once installed and initialised, Git LFS will bootstrap itself automatically when you clone a Git LFS repository.

Download the Git Large File Storage extension from the Git LFS project website and install it by double-clicking git-lfs-windows-<version>.exe (or follow the Mac installation instructions if you are on a Mac).

Run git lfs install to initialize Git LFS:

C:\Users\Bjoern>git lfs install
Git LFS initialized.
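If you want to check from a script whether the extension is installed before relying on it, something along these lines should do. It is a small sketch that only looks for a git-lfs executable on your PATH; the function name is my own, and it does not verify that git lfs install has been run.

```python
import shutil


def git_lfs_available() -> bool:
    """Return True if a git-lfs executable can be found on the PATH."""
    return shutil.which("git-lfs") is not None


if git_lfs_available():
    print("git-lfs found; 'git lfs install' should work")
else:
    print("git-lfs not found; install the extension first")
```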

Create a new Git LFS repository

To create a new Git LFS-aware repository from scratch, you need to run git lfs install after creating the repository:

C:\Users\Bjoern\Developer\workbooks>mkdir lfs

C:\Users\Bjoern\Developer\workbooks>cd lfs

C:\Users\Bjoern\Developer\workbooks\lfs>git init
Initialized empty Git repository in C:/Users/Bjoern/Developer/workbooks/lfs/.git/

C:\Users\Bjoern\Developer\workbooks\lfs>git lfs install
Updated git hooks.
Git LFS initialized.

git lfs install sets up a handful of special Git hooks in your repository (pre-push, post-checkout, post-commit, post-merge). These hooks take care of the Git LFS relevant operations so that they get executed when using the standard Git commands.
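If you are curious whether the hooks are in place, you can look into the .git/hooks folder of your repository. Here is a minimal sketch of such a check, assuming the default hooks location; the function name is hypothetical, and it only tests that the files exist, not what they contain.

```python
from pathlib import Path

LFS_HOOKS = ("pre-push", "post-checkout", "post-commit", "post-merge")


def missing_lfs_hooks(repo_dir: str) -> list[str]:
    """List the Git LFS hook names that have no file in .git/hooks."""
    hooks_dir = Path(repo_dir) / ".git" / "hooks"
    return [name for name in LFS_HOOKS if not (hooks_dir / name).is_file()]


# "." is the current directory; point this at your own repository.
print(missing_lfs_hooks("."))
```

An empty list means all four hooks are present; anything else suggests running git lfs install again inside that repository.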

Track Excel files with Git LFS

Git LFS is now initialised for your repository. The next step is to tell Git LFS, using git lfs track, that we want to track Excel files (workbooks and add-ins):

C:\Users\Bjoern\Developer\workbooks\lfs>git lfs track "*.xls*"
Tracking "*.xls*"

C:\Users\Bjoern\Developer\workbooks\lfs>git lfs track "*.xla*"
Tracking "*.xla*"

Note that the quotes around "*.xls*" and "*.xla*" are important (otherwise the wildcard will be expanded by your shell, and individual entries will be created for each .xls* file in your current directory).
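To see which file names the two patterns cover, you can try them out in Python. This is only an approximation: it uses Python's fnmatchcase rather than Git's own matcher, and the file names are made up for the example.

```python
from fnmatch import fnmatchcase

TRACKED_PATTERNS = ["*.xls*", "*.xla*"]


def is_lfs_tracked(filename: str) -> bool:
    """Approximate check: does the file name match any tracked pattern?"""
    return any(fnmatchcase(filename, pattern) for pattern in TRACKED_PATTERNS)


for name in ["Book1.xlsb", "Book2.xls", "Report.xlsx", "Tools.xlam", "notes.txt"]:
    print(name, is_lfs_tracked(name))
```

The four Excel files match one of the patterns, while notes.txt does not and would stay a regular Git file.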

The patterns supported by Git LFS are largely the same as those supported by .gitignore. For instance, if you store all your Excel workbooks in a Workbooks subfolder within your repository, you can track every file directly inside that subfolder (use "Workbooks/**" instead if it contains nested subfolders) using:

C:\Users\Bjoern\Developer\workbooks\lfs>git lfs track "Workbooks/*"

After running git lfs track, you will notice a new file named .gitattributes:

C:\Users\Bjoern\Developer\workbooks\lfs>git status
On branch master

Initial commit

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        .gitattributes

nothing added to commit but untracked files present (use "git add" to track)

Git LFS stores the tracked patterns in .gitattributes, which is itself a special Git file. You do not need to worry about the content of .gitattributes, but you do need to make sure that you always add, commit and push it (if you use git add . this is taken care of automatically).
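In case you do want to peek inside: each git lfs track call adds one line to .gitattributes that assigns the lfs filter to the pattern. The sketch below parses such content and lists the LFS-tracked patterns. The function name is hypothetical, and the parser is deliberately simplistic (it just splits on whitespace).

```python
def lfs_patterns(gitattributes_text: str) -> list[str]:
    """Return the patterns whose attributes include filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


# What .gitattributes typically contains after the two track commands above.
example = (
    "*.xls* filter=lfs diff=lfs merge=lfs -text\n"
    "*.xla* filter=lfs diff=lfs merge=lfs -text\n"
)
print(lfs_patterns(example))
```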

Commit and push

Create a remote repository (for example on GitHub cloud which is LFS-enabled by default) and configure the remote of your local repository. I use our example Excel LFS repository in the following example. Feel free to do the same for cloning and pulling, or simply fork the repository so that you can write to it, too.

C:\Users\Bjoern\Developer\workbooks\lfs>git remote add origin https://github.com/ZoomerAnalytics/workbooks-lfs.git

You can commit and push as normal to a repository that contains Git LFS content. The only difference you see is some additional output from git push as the Git LFS content is transferred to the server. Create two workbooks named Book1.xlsb and Book2.xls, then git add, git commit and git push it all:

C:\Users\Bjoern\Developer\workbooks\lfs>git add .

C:\Users\Bjoern\Developer\workbooks\lfs>git commit -m "First commit"
[master (root-commit) 13cfa90] First commit
warning: CRLF will be replaced by LF in .gitattributes.
The file will have its original line endings in your working directory.
 3 files changed, 7 insertions(+)
 create mode 100644 .gitattributes
 create mode 100644 Book1.xlsb
 create mode 100644 Book2.xls

C:\Users\Bjoern\Developer\workbooks\lfs>git push -u origin master
Git LFS: (2 of 2 files) 14.72 MB / 14.72 MB
Counting objects: 5, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (5/5), 568 bytes | 0 bytes/s, done.
Total 5 (delta 0), reused 0 (delta 0)
To https://github.com/ZoomerAnalytics/workbooks-lfs.git
 * [new branch]      master -> master
Branch master set up to track remote branch master from origin.

As you can see from the output, we pushed 14.72MB worth of workbooks to the remote server. In order to see the difference Git LFS makes, let’s delete most of the workbook content to bring down the workbook file size and push a new commit:

C:\Users\Bjoern\Developer\workbooks\lfs>git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   Book1.xlsb
        modified:   Book2.xls

no changes added to commit (use "git add" and/or "git commit -a")

C:\Users\Bjoern\Developer\workbooks\lfs>git add .

C:\Users\Bjoern\Developer\workbooks\lfs>git commit -m "Deleted a lot of stuff"
[master 3453320] Deleted a lot of stuff
 2 files changed, 4 insertions(+), 4 deletions(-)

C:\Users\Bjoern\Developer\workbooks\lfs>git push origin master
Git LFS: (2 of 2 files) 205.84 KB / 205.84 KB
Counting objects: 4, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 561 bytes | 0 bytes/s, done.
Total 4 (delta 0), reused 0 (delta 0)
To https://github.com/ZoomerAnalytics/workbooks-lfs.git
   13cfa90..3453320  master -> master

Clone a Git LFS repository

The head commit now contains two relatively small workbook files, whereas its parent commit holds nearly 15MB of workbooks. Without LFS, a git clone downloads the entire repository with both versions. As we now have Git LFS installed, Git should only download the significantly smaller head commit by default, resulting in a much faster git clone.
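Using the sizes reported by the two pushes above, we can estimate the difference before trying it out. This is a rough sketch only: it ignores compression and the few hundred bytes of regular Git objects, and it assumes that 1 MB means 1024 KB in the Git LFS output.

```python
FIRST_COMMIT_KB = 14.72 * 1024   # 14.72 MB reported by the first push
HEAD_COMMIT_KB = 205.84          # 205.84 KB reported by the second push

without_lfs_kb = FIRST_COMMIT_KB + HEAD_COMMIT_KB  # full workbook history
with_lfs_kb = HEAD_COMMIT_KB                       # head commit only

print(f"without LFS: about {without_lfs_kb / 1024:.2f} MB")
print(f"with LFS:    about {with_lfs_kb / 1024:.2f} MB")
print(f"ratio:       about {without_lfs_kb / with_lfs_kb:.0f}x")
```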

Clone the remote repository into lfs-2 (or any other available folder). You can start straight away with our example Excel LFS repository:

C:\Users\Bjoern\Developer\workbooks>git clone https://github.com/ZoomerAnalytics/workbooks-lfs.git lfs-2
Cloning into 'lfs-2'...
remote: Counting objects: 9, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 9 (delta 0), reused 9 (delta 0), pack-reused 0
Unpacking objects: 100% (9/9), done.
Checking connectivity... done.
Downloading Book1.xlsb (8.0 KB)
Downloading Book2.xls (203 KB)

Voilà! As expected, Git LFS makes git clone download only the Excel workbook files in the head commit. It’s worth mentioning, though, that there is also an explicit git lfs clone command, which can give you even better performance if you’re cloning a repository with a large number of LFS files (newer Git LFS releases mark it as deprecated, because plain git clone has since caught up):

C:\Users\Bjoern\Developer\workbooks>git lfs clone https://github.com/ZoomerAnalytics/workbooks-lfs.git lfs-3
Cloning into 'lfs-3'...
remote: Counting objects: 9, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 9 (delta 0), reused 9 (delta 0), pack-reused 0
Unpacking objects: 100% (9/9), done.
Checking connectivity... done.
Git LFS: (2 of 2 files) 205.84 KB / 205.84 KB

The git lfs clone command waits until the checkout is complete, and then downloads any required Git LFS files as a batch. This takes advantage of parallelised downloads, and dramatically reduces the number of HTTP requests and processes spawned. This is especially important for improving performance on Windows.
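To illustrate why batching cuts down the number of HTTP requests: the Git LFS batch API lets a client ask for many objects in a single request. The sketch below only builds the JSON body of such a request and sends nothing. The object ids are placeholders, the function name is my own, and I am not claiming that this is exactly what git lfs clone does under the hood.

```python
import json


def build_batch_request(objects: list[tuple[str, int]]) -> str:
    """Build the JSON body of one Git LFS batch download request."""
    body = {
        "operation": "download",
        "transfers": ["basic"],
        "objects": [{"oid": oid, "size": size} for oid, size in objects],
    }
    return json.dumps(body, indent=2)


# Placeholder object ids; real ones are the sha256 values from pointer files.
wanted = [("a" * 64, 8192), ("b" * 64, 207872)]
print(build_batch_request(wanted))
```

One request describes both workbooks, so the client does not have to negotiate each file separately.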

Pull and check out

Just as with git clone, you can pull from a Git LFS repository using the normal git pull command. Any required Git LFS files are downloaded as part of the automatic checkout process once the pull completes. No explicit commands are needed to retrieve Git LFS content. Should the checkout fail for an unexpected reason, you can download any missing Git LFS content for the current commit with git lfs pull.
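A failed checkout typically leaves pointer files where your workbooks should be (Excel will refuse to open them). Here is a small sketch that looks for such leftovers by checking whether a file starts with the Git LFS pointer header. The function name and the default pattern are assumptions for this example.

```python
from pathlib import Path

POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def unresolved_pointers(folder: str, pattern: str = "*.xls*") -> list[str]:
    """Return matching files that still hold a pointer instead of content."""
    found = []
    for path in sorted(Path(folder).rglob(pattern)):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            if handle.read(len(POINTER_PREFIX)) == POINTER_PREFIX:
                found.append(str(path))
    return found


# An empty list suggests every workbook was checked out properly.
print(unresolved_pointers("."))
```

If the list is not empty, running git lfs pull in the repository should replace the listed pointers with the actual workbooks.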

Next steps

Git LFS is an essential tool for managing and maintaining a healthy Excel workbook repository. It ensures a stable, predictable and smooth user experience and integrates seamlessly into your existing workflow. If you do not use Git LFS for your Excel repositories, I highly recommend it.

If you want to migrate from an existing non-LFS repository to an LFS-aware repository, please check back here, as we will cover this in depth in one of our next blog posts. If you have any questions, please comment below or get in touch: bjoern.stiel@zoomeranalytics.com.

© 2018 Zoomer Analytics LLC. All rights reserved.