Integrating DVC and Git LFS via libgit2 filters

Learn how the latest DVC release (version 3.31.0) now supports reading Git LFS objects, allowing users to import files from platforms like Hugging Face without additional dependencies. The implementation utilizes the Dulwich and pygit2 libraries, providing compatibility with Git LFS within DVC and enhancing its capabilities in managing datasets across Git repositories.

  • Peter Rowlands
  • January 03, 20245 min read
Hero Picture

DVC, Dulwich and libgit2

One of the main features provided by DVC is the ability to import and download files from any Git repository. In prior releases this came with the caveat where projects which use Git LFS were unsupported. As of version 3.31.0, DVC now supports reading Git LFS objects, so you can now dvc import upstream datasets from platforms like Hugging Face which use Git LFS, without needing to install any additional dependencies! Read on for an overview on how the DVC Git LFS client was implemented.

To get started using DVC with Hugging Face, please refer to the DVC integrations documentation

DVC builds on top of Git's versioning capabilities using the open source libraries Dulwich and pygit2 (which provides Python bindings for the C library libgit2). Using these libraries provides DVC with access to Git functionality without requiring a traditional command line Git installation, which can be particularly useful in containerized environments. When integrating support for Git LFS support into DVC, we wanted to keep the same approach, so DVC users could simply install DVC, and then import and download files from any Git repository, regardless of whether or not that repository uses Git LFS. Neither Dulwich nor libgit2/pygit2 support Git LFS natively, but libgit2 does provide an API for the low level Git filters functionality used by Git LFS. We have contributed to pygit2 so that pygit2 users (like DVC) can now write libgit2 filters purely in Python, without needing to use the lower level libgit2 C API.

DVC's Git client library (which wraps Dulwich and pygit2) is available here

Git filters overview

Git supports using attribute filters to manipulate how objects are stored internally in Git compared to how they are stored in your workspace. One commonly used built-in filter is the CRLF filter, which will adjust line endings in text files. The CRLF filter is typically used to ensure that files are checked out into the workspace using the appropriate line endings for the user's platform (linefeed on Unix and carriage return + linefeed on Windows), but are only stored in Git with Unix-style line endings.

Git LFS also works by using Git filters. When you add a file with the filter=lfs attribute to Git, The Git LFS filter generates a "pointer" for Git to store internally. The LFS pointer is a small text file containing a SHA256 LFS object ID for the original file. The Git LFS filter places the original file in Git LFS storage, and then outputs the pointer to Git (instead of the original file). Upon checkout, Git passes the pointer to the Git LFS filter, which then reads the LFS object ID and checks out the appropriate original file into your workspace.

version https://git-lfs.github.com/spec/v1
oid sha256:b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c
size 4

Example Git LFS pointer

libgit2 and pygit2 filters

When saving objects in Git and when checking them back out to the workspace, libgit2 runs a chain of registered filters. Each filter in the chain modifies the object data as needed, and then passes the modified result into the next filter. While writing a libgit2 filter in C is fairly complex and requires implementing multiple levels of callback structs for handling the underlying buffered write streams in addition to the filter itself, this is simplified by our newly contributed support for Python filters in pygit2. The low level libgit2 APIs are abstracted away, and a subclassed pygit2.Filterimplementation only needs to implement three basic methods, check(), write() and close().

  • Filter.check() is called prior to processing any object with Git attributes which match the registered filter, and the filter can verify whether or not it should be used with the given object, or indicate that the filter does not need to be applied.
  • Filter.write() is called one or more times and is used to “write” input data chunks to the filter.
  • Filter.close() is called after all of the input data has been written to the filter.

The filter can send output data chunks to the next filter in the chain as needed via the write_next() callback.

Note: in Git, smudge filters are run when checking out objects from the Git object database into the workspace, and clean filters are run when saving objects from the workspace into the Git object database. In libgit2 and pygit2, a single filter is registered which is used in both cases, and the direction is indicated by the mode parameter.

The scmrepo Git LFS filter

Thanks to this higher level abstraction in pygit2, implementing the Git LFS smudge filter in Python is straightforward:

 def check(self, src: "FilterSource", attr_values: List[str]):
    if attr_values[0] == "lfs":
        if src.mode != GIT_FILTER_CLEAN:
            self._smudge_buf = io.BytesIO()
            self._smudge_root = src.repo.workdir or src.repo.path
            return
    raise Passthrough

In check(), the first element in attr_values will contain the object’s filter Git attribute. We verify that the object has filter=lfs set and that we are in smudge mode (our filter is currently read-only and does not need to implement clean mode). When in smudge mode we initialize an internal buffer which will be used for reading the pointer data from Git, as well as storing the original Git repository root path (which will be needed later).

def write(
    self, data: bytes, src: "FilterSource", write_next: Callable[[bytes], None]
):
    …
    self._smudge_buf.write(data)

In write() we append the input chunk to our buffer and then return. We do not write to the next filter, since Git LFS smudge depends on reading the entire pointer input before we can output any data.

def close(self, write_next: Callable[[bytes], None]):
    …
    self._smudge(write_next)

def _smudge(self, write_next: Callable[[bytes], None]):
    …
    self._smudge_buf.seek(0)
    with Git(self._smudge_root) as scm:
        try:
            url = get_fetch_url(scm)
        except InvalidRemote:
            url = None
        fobj = smudge(scm.lfs_storage, self._smudge_buf, url=url)
        data = fobj.read(io.DEFAULT_BUFFER_SIZE)
        try:
            while data:
                write_next(data)
                data = fobj.read(io.DEFAULT_BUFFER_SIZE)
        except KeyboardInterrupt:
            return

In close(), we get the configured Git LFS remote URL (if it is set) and then run our actual smudge() implementation. scmrepo’s smudge() method will return a Python file-like object stream for the original file (and not the internal pointer). We then just need to do a series of chunked reads and writes to send the original file data to the next filter in the chain.

Since Git LFS smudge behavior is well defined by the Git LFS specification we will not go into a detailed explanation of our Python implementation here. In short, smudge() verifies that the input data is a valid Git LFS pointer, reads the Git LFS object ID from the pointer, and then loads the appropriate object from Git LFS storage. If the specified object ID is not available in the local Git LFS storage, it will be fetched from the remote Git LFS server.

The complete source code for our scmrepo Git LFS filter is available on Github: filter.py, smudge.py

Conclusion

This recent update to DVC marks a significant milestone by eliminating the prior limitation associated with Git LFS incompatibility. With version 3.31.0, DVC users can seamlessly import files from Git repositories, including platforms like Hugging Face, without needing extra dependencies. The integration of Git LFS support, facilitated by the Dulwich and pygit2 libraries, streamlines managing datasets and large objects in a Git repository.

This reinforces DVC's commitment to providing a versatile and user-friendly open-source version control solution for diverse Git repositories.

Back to blog