January '22 Community Gems

A roundup of technical Q&As from the DVC and CML communities. This month: DVC Studio data, DVC for non-ML projects, getting started with CML, and more.

  • Milecia McGregor
  • January 31, 2022 · 3 min read

Is it possible to stream objects to and from remote caches?

Thanks for asking @mihaj!

You can stream files using the DVC API. There are two methods that you'll likely want to check out. First there's dvc.api.open(). This opens a file tracked by DVC and generates a corresponding file object. Here's a quick example:

import dvc.api

with dvc.api.open(
        'get-started/data.xml',
        repo='https://github.com/iterative/dataset-registry',
        ) as fd:
    # do things with the file object here, e.g. fd.read() or iterate over lines
    ...

The simplest way to get the contents of a DVC-tracked file is to use dvc.api.read(). The returned content is a string or bytes, depending on the mode you pass. Here's a little example of this being used:

import pickle
import dvc.api

model = pickle.loads(
    dvc.api.read(
        'model.pkl',
        repo='https://github.com/iterative/example-get-started',
        mode='rb'
        )
    )

One of the steps in my DVC pipeline uses a pip installed package. What is the best way to make sure that DVC re-runs the steps that depend on that package?

Thanks for the question @alphaomega!

The best way to handle any package dependencies is to include a requirements.txt file with the specific versions your pipeline needs.
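For example, you could pin the versions in requirements.txt and list that file as a stage dependency, so DVC re-runs the stage whenever a pin changes. Here's a minimal sketch (the stage name and train.py are just placeholders):

stages:
  train:
    cmd: python train.py
    deps:
      - requirements.txt   # pinned package versions; changing a pin invalidates the stage
      - train.py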

Another approach is to have a stage that dumps the package version as an intermediate output. That file doesn't need to be saved in Git or DVC because it's easy to reproduce, and DVC will use it to detect whether the package changed and only re-run the downstream stages when it did. Here's an example of a stage that does this.

stages:
  package_version:
    cmd: pip freeze | grep "package_name==" > package_name_version.txt
    outs:
      - package_name_version.txt
  train:
    cmd: python train.py
    deps:
      - package_name_version.txt

Does DVC save dependencies which are in the dvc.yaml pipeline to the cache?

Thanks for another great question @rie!

DVC doesn't track the pipeline dependencies in the cache or storage, only the outputs. If you want DVC to track a pure data dependency that's not an output of a different stage, you need to track it with dvc add ...

Code dependencies like script.py normally live in Git, and stage outputs are tracked automatically through dvc.yaml and dvc.lock. A pure data dependency that isn't produced by any stage, though, won't be stored anywhere unless you track it with dvc add, which creates a .dvc file (like data.dvc) and saves the data to the cache.
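As a minimal sketch, assuming a raw input at data/raw.csv (the path is just a placeholder):

$ dvc add data/raw.csv              # creates data/raw.csv.dvc and caches the file
$ git add data/raw.csv.dvc data/.gitignore
$ dvc push                          # uploads the cached copy to remote storage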

What is the difference between Kubeflow pipelines and DVC pipelines?

This is a fantastic question! Thanks for asking @ramakrishnamamidi!

A major difference is that DVC focuses primarily on ML development and adding lightweight functionality on top of existing projects, which may be reusable in deployment in some cases.

Kubeflow focuses on deployment and building on top of Kubernetes, which could be used during development but requires more up-front effort.

Could DVC be a good alternative to LFS for game development?

Thanks for such an interesting question @CB!

Yes! We have community members that use DVC to handle their large files in game development.

There are several other use cases we've seen for DVC outside of machine learning and data science. Some people have used DVC to track build artifacts for deployment systems and to track performance data alongside design iterations and simulation tools.

You should check out our #beyond-ml Discord channel to stay up to date with the other use cases the community is coming up with!

Does DVC run on JSON/YAML configuration files for all things?

This is a great question about large projects with a lot of dependencies from @SolemnSimulacrum!

All of the dependencies you list in dvc run are in fact configured in the dvc.yaml file. dvc run is a convenience for adding a pipeline stage to this file and then doing dvc repro on that stage. It's completely acceptable and even encouraged to directly edit dvc.yaml if that's easier.

For example, if you are currently executing a command like this:

$ dvc run -n prune \
        -o model.pt \
        -d ./DepFiles_0/ \
        -d ./DepFiles_1/ \
        -d ./DepFiles_2/ \
        -d ./src/*.py \
        -d ./packages/*.py \
        -d ./scripts/*.py \
        -d ./data/*.npy \
        python script.py

You could add those directly to the dvc.yaml like this:

stages:
  prune:
    cmd: python script.py
    deps:
      - ./DepFiles_0/
      - ./DepFiles_1/
      - ./DepFiles_2/
      - ./src/*.py
      - ./packages/*.py
      - ./scripts/*.py
      - ./data/*.npy
    outs:
      - model.pt

I'm setting up MLOps at my company from scratch and we use GitLab and Cloudera DS workbench. What are the best resources to get started with CML?

This is a great question from @dvc!

We recommend you start with the CML docs website.

You can find some tutorials on our blog.

Or you can check out the videos on our YouTube channel.

And of course, you can always ask questions in the Discord community!
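Since you're on GitLab, here's a rough sketch of what a first CML job in .gitlab-ci.yml could look like. The script names and metrics file are placeholders, and CML needs a REPO_TOKEN CI variable configured to post comments:

train-and-report:
  image: iterativeai/cml:0-dvc2-base1   # container image with CML, DVC, and Python preinstalled
  script:
    - pip install -r requirements.txt
    - python train.py                   # assumed to write metrics.txt
    - echo "## Model metrics" > report.md
    - cat metrics.txt >> report.md
    - cml send-comment report.md        # posts the report as a comment on the commit/MR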

I understand that DVC Studio is a discoverability layer over my DVC repo in GitHub. Will any of my data be stored on your servers?

This is a great question about DVC Studio from @johnnyaug!

DVC Studio only stores metrics, plots, and metadata about your pipelines in its database so that it can serve them as a table. We don't read your actual data and we don't store your code.

An important thing to note is that if you have plots from dvc plots show that are images, JSON files, or Vega specs, those may be saved on our end as well so we can serve them to the UI.

We're working on documentation for this as well!


Join us in Discord to get all your DVC and CML questions answered!
