November '21 Community Gems

A roundup of technical Q&A's from the DVC and CML community. This month: CML runners, working with data, DVC Studio, and more.

  • Milecia McGregor
  • November 30, 20214 min read
Hero Picture

What would be the cleanest, most Pythonic way to run DVC commands from inside a Python script if we want to avoid calling the subprocess library?

That's a really good question @mihaj!

If you want to run DVC commands in a Python script, you have a couple of options.

You can work with the main module from the dvc library. This is the more CLI-like option. An example of running an experiment would look something like this.

from dvc.main import main

main(["exp", "run"])

The other option you have is to use the Repo API. This API is largely undocumented at the moment, but it closely mirrors the CLI commands. One exception is that they will return internal data structures instead of exit codes.

Here's an example of running an experiment with the Repo API.

from dvc.repo import Repo

repo = Repo()
repo.experiments.run()
repo.experiments.show()
# etc...

How can you check if a DVC tracked directory has changes?

Good question from @edran!

You can check which directories have been changed by running:

$ dvc status

This will give you an output similar to this in your terminal:

train:
  changed deps:
    modified: src/train.py
  changed outs:
    deleted: model.pkl
evaluate:
  changed deps:
    deleted: model.pkl

We're working on adding granularity support for this command and should have a release for this in the next few months.

Is there a way to look at all of the experiments I've run and see the metrics and parameters associated with them?

Thanks for asking @GuyAR! This is a common question that comes up.

You can see all of your experiments and the associated metrics and parameters in a table in the terminal by running the following command:

$ dvc exp show

This will give you a table that looks similar to this with all of this information.

 ────────────────────────────────────────────────────────────────────────────────────────────
  neutral:**Experiment**                metric:**step**       metric:**acc**   metric:**val_acc**      metric:**loss**   metric:**val_loss**   param:**lr**      param:**momentum**
 ────────────────────────────────────────────────────────────────────────────────────────────
  **workspace**                    **3**   **0.91389**      **0.87**   **0.20506**    **0.66306**   **0.001**   **0.09**
  **data-change**                  **-**         **-**         **-**         **-**          **-**   **0.001**   **0.09**
  │ ╓ 9405575 [exp-54e8a]      3   0.91389      0.87   0.20506    0.66306   0.001   0.09
  │ ╟ 856d80f                  2   0.90215   0.87333   0.27204    0.61631   0.001   0.09
  │ ╟ 23dc98f                  1   0.87671      0.86   0.35964    0.61713   0.001   0.09
  ├─╨ 99a3c34                  0   0.71429      0.82   0.67674    0.62798   0.001   0.09
  │ ╓ 3b3a2a2 [exp-23593]      3   0.86885      0.46   0.31573     3.7067   0.001   0.09
  │ ╟ 93d015d                  2   0.83197   0.41333   0.36851     3.4259   0.001   0.09
  │ ╟ d474c42                  1   0.79918   0.43333   0.46612      3.286   0.001   0.09
  ├─╨ 1582b4b                  0   0.52869      0.39   0.94102     2.5967   0.001   0.09
 ────────────────────────────────────────────────────────────────────────────────────────────

Great question @MadsO!

This works the exact same as when you've added data with dvc add. So to remove data, you would run this command:

$ dvc remove

When using a CML, are GitHub Actions, GitLab, and BitBucket the only options for CI?

Currently, cml runner does not support CircleCI or droneCI self–hosted runners and you would have to deploy them manually.

You can still use cml send-comment, cml pr, and the other CML tools with any CI platform.

Thanks for this awesome question @tpietruszka!

When I run the dvc remove command, does it only remove .dvc files?

A really good question from @flowy!

That is correct. Running dvc remove only removes DVC tracked files and directories. It will also remove the entry from .gitignore and handles the dvc.yaml.

For example, if you run something like dvc remove folder_name/file.dvc, only the .dvc file will be removed. So your updated directory would likely still have folder_name/file since that was the file being tracked.

If you wanted to remove the tracked file as well, you would need to run dvc remove --outs. This command removes the outputs of any target.

If there is nothing else in the folder, you'll be left with an empty directory. You can remove it and stop tracking in Git with a command like:

$ git rm -r folder_name

Can DVC Studio be connected to a self-managed GitLab repo?

Very good question about Studio @Sra!

Right now this only works if it's an on-premises network or a private VPC network.

We are working on bringing custom-domain GitLab as a feature very soon! You can follow this GitHub issue and leave comments for anything you'd like to see!

Is there a way to extend default job execution time for a CML runner?

There is definitely a way to do this!

You can extend the max time in your CI by adding something like this:

train:
  timeout-minutes: 5000

If you're using GitLab, the same update would look similar to this:

train:
  timeout: 72 hours

Thanks for this question @evergreengt!


At our December Office Hours Meetup we will be doing a new feature demo you won't want to miss! RSVP for the Meetup here to stay up to date with specifics as we get closer to the event!

Join us in Discord to get all your DVC and CML questions answered!

Back to blog