Version Control for Data Science in 5 Minutes

Amongst our open source products is DVC, or Data Version Control. Data versioning for data science can be explained quickly and efficiently with no code needed.

Machine learning models are unusual in that they reflect both the code used to specify them and the dataset they are trained on. Because both components matter, every combination of a change to the data processing, the dataset or the code represents its own experiment. Counting all of the models those combinations can produce, the complexity of a machine learning project quickly becomes massive.

This complexity makes it extremely difficult to log, reproduce or share what has been done, which creates the need for tools to manage it. Ideally, a flexible solution would work with any programming language, machine learning framework, and so on, while staying straightforward for others to learn alongside the tools they already use. We will demonstrate the idea using common household items.

Imagine a stack of index cards representing project files: a script to pull data from storage, a script to clean it, a script to process it and build features, and a script to model it. Index cards (that is, these project files) are “lightweight,” or easy to handle. A Git commit is a snapshot built for small files, so we use Git to take a snapshot of everything in the project at chosen points in time. This lets us always revert to a previous point. It works wonderfully until the files get really big. In data science, datasets and models are big, effectively turning the handful of index cards in our demonstration into a 300+ page novel.

This novel isn't going to fit into a Git repository, so what now? A simple solution is to take one of the index cards (that is, a small file tracked by Git) and write on it where the dataset is stored. Now you have a simple way to reach your dataset without fitting it into your Git repository.
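For readers who would like to see the index-card idea in code, here is a minimal Python sketch. It is not DVC's implementation: the function names, the JSON pointer layout and the local `storage_dir` folder (standing in for wherever the large file really lives) are all assumptions made for illustration.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large dataset never has to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def track_large_file(data_path, storage_dir, pointer_path):
    """Copy the big file ("the novel") to storage and write a small pointer ("the index card")."""
    data_path, storage_dir = Path(data_path), Path(storage_dir)
    digest = file_digest(data_path)
    storage_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical layout: the stored copy is named after its content hash.
    shutil.copyfile(data_path, storage_dir / digest)
    pointer = {
        "path": data_path.name,
        "sha256": digest,
        "size": data_path.stat().st_size,
    }
    # Only this tiny text file would be committed to Git.
    Path(pointer_path).write_text(json.dumps(pointer, indent=2))
    return pointer


def restore_large_file(pointer_path, storage_dir, target_path):
    """Read the pointer and fetch the matching file back from storage."""
    pointer = json.loads(Path(pointer_path).read_text())
    target_path = Path(target_path)
    shutil.copyfile(Path(storage_dir) / pointer["sha256"], target_path)
    return target_path
```

Because the pointer records a hash of the content, checking out an older commit gives you an older pointer, and that pointer leads back to exactly the version of the dataset that was in use at the time.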

DVC: An Open Source Project

That is the gist of DVC. People have their own preferences for how to do this, but it is a fairly popular approach.

The point of the DVC open source project is to standardize the procedure, so that people don't need to engineer it themselves each time and can use Git to version and keep track of large files such as models and datasets. DVC also includes several other capabilities:

Pipelines - Tie together datasets to scripts to models, and version as a pipeline
Metrics - Use to compare model performance across commits
Plots - Use to visualize how a model has changed across commits
Continuous Integration - A foundational idea in DevOps for automating frequent tests of your project or machine learning models.
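
To make the Metrics idea concrete, here is a small Python sketch of comparing model performance between two commits. It does not use DVC; the `diff_metrics` helper and the metric names and values are hypothetical, and it assumes each commit stores its metrics as a simple name-to-number mapping.

```python
def diff_metrics(old, new):
    """Return {name: (old_value, new_value, change)} for metrics found in both commits."""
    return {
        name: (old[name], new[name], new[name] - old[name])
        for name in sorted(old.keys() & new.keys())
    }


# Hypothetical metrics recorded at two different commits.
before = {"accuracy": 0.5, "loss": 1.0}
after = {"accuracy": 0.75, "loss": 0.5}

for name, (old_value, new_value, change) in diff_metrics(before, after).items():
    print(f"{name}: {old_value} -> {new_value} ({change:+})")
```

Keeping the metrics in a small text file next to the code means each Git commit carries its own scores, so a comparison like this needs nothing more than two checkouts.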

DVC is an open source community with a lot of active development. Check out the project repository for updates on ongoing work to ease the challenges of managing large files, datasets, machine learning models, metrics and code, so that ML models are shareable and reproducible.

Connect with us for questions, feedback or to share your own ideas surrounding DVC.
