December '21 Community Gems

A roundup of technical Q&As from the DVC community. This month: comparing experiments, working with data, working with pipelines, and more.

  • Milecia McGregor
  • December 21, 2021 · 4 min read

I'm using Google Drive as remote storage and accidentally entered the verification code from the wrong Google account. How can I fix that?

No problem @fireballpoint1! This happens sometimes.

You should be able to run the following command in your terminal and then re-enter your credentials.

$ rm .dvc/tmp/gdrive-user-credentials.json

That should give you a chance to enter the correct credentials the next time you run dvc pull.

Can I add a dvc remote which refers to NAS by IP so I don't have to mount on every computer?

That's a new question for us @Krzysztof Begiedza!

If you enable the SSH service on your NAS, you can configure DVC to use it as an SSH remote with dvc remote add.
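As a sketch, the setup could look something like this (the IP, path, and username below are placeholders for your own NAS details):

$ dvc remote add -d storage ssh://<nas-ip>/path/to/storage
$ dvc remote modify storage user <username>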

There are also DSM (Synology DiskStation Manager) packages for WebDAV, if you prefer that over SSH. Just make sure that when you run dvc remote add -d storage <URL>, your remote storage URL looks similar to this.

webdav://<ip>/<path>

Can you selectively dvc pull data files?

Great question @Clemens!

You can run dvc pull <file> to get just the files you want. You can also use the --glob option on dvc pull so DVC only pulls the files that match a pattern.

The command for that pull would be similar to this.

$ dvc pull path/to/specific/file
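If you go the --glob route, the command would look something like this (the pattern here is just an illustration):

$ dvc pull --glob 'data/raw/*.csv'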

You could also make a data registry and use dvc import in other projects to get a specific dataset. That way you don't have to do a granular pull.
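As a sketch, importing a dataset from a registry looks something like this (the registry URL and dataset path are placeholders you'd replace with your own):

$ dvc import https://github.com/<org>/dataset-registry data/mydataset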

What is the fastest way to get the specific value of a metric of an experiment based on experiment id?

That's a really good question @Kwon-Young!

You can always look through experiment metrics with dvc exp show, which shows you all of the experiments you've run.
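To scan everything in the terminal first, run it from the repo root:

$ dvc exp show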

To get the metrics for a specific experiment or set of experiments, you'll need the experiment ids and then you can use the Python API like this example.

from dvc.repo import Repo

dvc = Repo(".")  # or Repo("path/to/repo/dir")
metrics = dvc.metrics.show(revs=["exp-name1", "exp-name2"])  # experiment names or ids

This returns a Python dictionary containing what gets displayed by dvc metrics show --json, except you're able to specify the experiments you want.
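The exact nesting of that dictionary can vary between DVC versions, so it's worth inspecting it before digging for one value; a minimal sketch:

# inspect the structure first, then drill into the metric you need
for rev, files in metrics.items():
    print(rev, files)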

Is it possible to run the whole pipeline but only for one element of the foreach?

Another great question from @vgodie!

For example, if your stages look something like this:

stages:
  cleanups:
    foreach: # List of simple values
      - raw1
      - labels1
      - raw2
    do:
      cmd: clean.py "${item}"
      outs:
        - ${item}.cln
  train:
    foreach:
      - epochs: 3
        thresh: 10
      - epochs: 10
        thresh: 15
    do:
      cmd: python train.py ${item.epochs} ${item.thresh}

You should try the following command:

$ dvc repro cleanups@labels1

This will run your whole pipeline, but only with labels1 in the cleanups stage.

Is it possible to pull experiments from the remote without checking out the base commit of those experiments?

Thanks for the question @mattlbeck!

You should be able to do this with dvc exp pull origin exp-name.

If you have experiments with the same name on different commits, using exp-name alone won't work, since DVC defaults to selecting the one based on your current commit when there are duplicate names.

To work around this, you can use the full refname, like refs/exps/e7/78ad744e8d0cd59ddqc65d5d698cf102533f85/exp-6cb7, to specify the experiments that you want to work with.
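Using the example refname above, the full command would look like this:

$ dvc exp pull origin refs/exps/e7/78ad744e8d0cd59ddqc65d5d698cf102533f85/exp-6cb7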

How should I handle checkpoints in PyTorch Lightning with DVCLive?

This is a really good question that came from one of our Office Hours talks! Thanks Ilia Sirotkin!

We have an open issue we encourage you to follow for more details and to even contribute!

PyTorch Lightning handles checkpoints differently from other libraries. This affects the way metrics logging is executed and how models are saved.

The workaround we suggest is to write a custom callback that controls saving everything, and then track it with DVC. You can implement the after_save_checkpoint method and save the model file there.

The way this works is by breaking your training process into small stages. You should specify the stage’s checkpoint as the output of the stage and set it as a dependency for the next stage. That way if something breaks, the dvc repro command will resume your experiment from the last stage.

Your pipeline might look something like this:

stages:
  stage_0:
    cmd: python train.py
    outs:
      - checkpoints/checkpoint_epoch=0.ckpt
  next:
    foreach:
      - prev: 0
        next: 1
      - prev: 1
        next: 2
    do:
      cmd: python train.py --checkpoint ${item.prev}
      deps:
        - checkpoints/checkpoint_epoch=${item.prev}.ckpt
      outs:
        - checkpoints/checkpoint_epoch=${item.next}.ckpt

Then you'll need to reuse the ModelCheckpoint that is included in pytorch_lightning to capture the checkpoints. Here's a snippet of what that could look like in your training script:

import os
import pytorch_lightning as pl

# set checkpoint path
ckpt_path = os.path.abspath(os.path.join(os.path.dirname(__file__), "checkpoints"))

# checkpoints will be saved to checkpoints/checkpoint_epoch={epoch_number}.ckpt
cp = pl.callbacks.model_checkpoint.ModelCheckpoint(
    monitor="train_loss_epoch",
    save_top_k=1,
    dirpath=ckpt_path,
    filename="checkpoint_{epoch}")
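Then you'd hand that callback to your Trainer so checkpoints land in the DVC-tracked directory. A minimal sketch, assuming model is your LightningModule and the max_epochs value is just an example:

# `model` is your LightningModule; one epoch per pipeline stage
trainer = pl.Trainer(callbacks=[cp], max_epochs=1)
trainer.fit(model)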

Is there a feature for DVC to only sample and cache a subset of the tracked dataset, e.g. 10000 lines of a large file?

Really great question @Abdi!

You should be able to use the streaming capability of the DVC API to achieve this goal.

Here is an example of a Python script that would do this:

from dvc.api import open as dvcopen

# repo_url points at the Git repo that tracks the data;
# stream the file and stop after the first 10,000 lines
with dvcopen("data", repo=repo_url) as fd:
    for i, line in enumerate(fd):
        if i >= 10000:
            break
        print(line)

At our January Office Hours Meetup we will be looking at machine learning workflows and the Neovim-DVC plugin! RSVP for the Meetup here to stay up to date with specifics as we get closer to the event!

Join us in Discord to get all your DVC and CML questions answered!
