Every month we share news, findings, interesting reads, community takeaways, and everything else along the way. Look here for updates about DVC, our journey as a startup, projects by our users and big ideas about best practices in ML and data science.
Welcome to the March Heartbeat! Here are some highlights from our team and community this past month:
DVC is STILL growing! In February, Senior Software Engineer Guro Bokum joined DVC. He's previously contributed to the core DVC code base and brings several years of full-stack engineering expertise to the team. Welcome, Guro!
New feature alert! We've received many requests for monorepo support in DVC. As of DVC release 0.87.0, users can version data science projects within a monorepo! The new `dvc init --subdir` functionality is designed to allow multiple DVC repositories within a single Git repository. Don't forget to upgrade and check out the latest docs.
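For a flavor of what this enables, here is a minimal sketch of two independent DVC projects living inside one Git repository (the directory names are illustrative, not from the release notes):

```shell
# Minimal sketch: two independent DVC projects inside one Git monorepo.
# Directory names (monorepo, project-a, project-b) are illustrative.
git init monorepo && cd monorepo

mkdir -p project-a project-b

cd project-a
dvc init --subdir    # creates project-a/.dvc inside the shared Git repo

cd ../project-b
dvc init --subdir    # a second, independent DVC project in the same repo
```

Each subdirectory gets its own `.dvc/` directory, so the two projects can be versioned and reproduced independently while sharing one Git history.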
First, there's an intriguing discussion evolving in the DVC repo about how machine learning hyperparameters (such as learning rate, number of layers in a deep neural network, etc.) can be tracked. Right now, hyperparameters are tracked as source code (i.e., with Git). Could we use some kind of abstraction to separate hyperparameters from source code in a DVC-managed project? Read on and feel free to jump into this discussion, largely helmed by software developer and DVC contributor Helge Munk Jacobsen.
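To give a flavor of the idea being debated: one direction is to keep hyperparameters in a standalone file that can be diffed and tracked separately from training code. A hypothetical sketch, not a DVC feature — the file name and keys here are purely illustrative:

```shell
# Hypothetical sketch: hyperparameters kept in a standalone file rather
# than hard-coded in training source. File name and keys are illustrative.
cat > params.yaml <<'EOF'
train:
  learning_rate: 0.01
  num_layers: 4
EOF

# Training code would read this file, so tuning a value no longer
# means editing (and committing) source code itself.
grep "learning_rate" params.yaml
```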
Another discussion we appreciated happened on Twitter:
We give tools like Slack and Zoom a lot of credit for making remote work possible, and I think Git and every hosted DVC system should equally get the same credit. Imagine life for a second without version control. Think about that. — Celestine (@cyberomin) February 1, 2020
Elsewhere on the internet, DVC made the cut in a much-shared blog, Five Interesting Data Engineering Projects by Dmitry Ryaboy (VP of Engineering at biotech startup Zymergen, and formerly Twitter). Dmitry wrote:
To be honest, I’m a bit of a skeptic on “git for data” and various automated data / workflow versioning schemes: various approaches I’ve seen in the past were either too partial to be useful, or required too drastic a change in how data scientists worked to get a realistic chance at adoption. So I ignored, or even explicitly avoided, checking DVC out as the buzz grew. I’ve finally checked it out and… it looks like maybe this has legs? Metrics tied to branches / versions are a great feature. Tying the idea of git-like branches to training multiple models makes the value prop clear. The implementation, using Git for code and datafile index storage, while leveraging scalable data stores for data, and trying to reduce overall storage cost by being clever about reuse, looks sane. A lot of what they have to say in https://dvc.org/doc/understanding-dvc rings true.
Check out the full blog here:
One of the areas that DVC is growing into is continuous integration and continuous deployment (CI/CD), a part of the nascent field of MLOps. Naturally, we were thrilled to discover that CI/CD with DVC is taught in a new Packt book, "Learn Python by Building Data Science Applications" by David Katz and Philipp Kats.
In the authors' words, the goal of this book is to teach data scientists and engineers "not only how to implement Python in data science projects, but also how to maintain and design them to meet high programming standards." Needless to say, we are considering starting a book club. Grab a copy here:
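As a rough illustration of what CI/CD with DVC can look like in practice, a build step might re-run the pipeline and fail the job if any stage breaks. This is a hypothetical sketch, not an excerpt from the book:

```shell
# Hypothetical CI step for a DVC project (not taken from the book):
# fetch tracked artifacts, re-run the pipeline, fail the job on errors.
pip install dvc
dvc pull     # download data/model artifacts from remote storage
dvc repro    # re-execute pipeline stages whose dependencies changed
```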
Last year in Mexico, DVC contributor Ramón Valles gave a talk about reproducible machine learning workflows at Data Day Monterrey—and a video of his presentation is now online! In this Spanish-language talk, Ramón gives a thorough look at DVC, particularly building pipelines for reproducible ML.
Finally, DVC data scientist Elle (that's me!) released a new public dataset of posts from the Reddit forum r/AmItheAsshole, and reported some preliminary analyses. We're inviting anyone and everyone to play with the data, make some hypotheses and share their findings. Check it out here: