A roundup of technical Q&As from the DVC and CML community. This month: CML updates, working with multiple datasets, using DVC stages, and more.
Great question from @tina_rey!
The dvc repro command doesn't automatically push any outputs or data to your remote. The outputs are stored in the cache until you run dvc push, which then pushes them from your cache to your remote.
dvc dag based on outs, so that a stage that depends on the output of another stage will always be executed after the former has finished?
This is a good question from @johnysku!
That is correct! If the pipelines are independent or the stages are independent, they may run in any order. Without explicit dependency linkage, stages could be executed in an unexpected order.
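To make the ordering rule concrete, here is a minimal Python sketch of how an execution order can be derived from dependencies and outputs. It is a simplified model rather than DVC's actual implementation, and the stage names and file paths are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages; the names and paths are illustrative only.
stages = {
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
    "train": {"deps": ["data/clean.csv"], "outs": ["model.pkl"]},
    "report": {"deps": ["notes.md"], "outs": ["report.html"]},
}


def execution_order(stages):
    """Return one valid order in which the stages could run."""
    # Map every output path to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage must run after every stage that produces one of its deps.
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)
```

In this sketch "prepare" always comes before "train" because the output of one is a dependency of the other. The "report" stage is linked to nothing, so its position in the order is not guaranteed.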
dvc repro, is there a way I can use glob patterns to create the list DVC needs to iterate over?
Another interesting question from @copah!
If you have mystage which uses foreach, you can do dvc repro to iterate over every
Really good question from @Meme Philosopher!
DVC will fail when you try to pull files that have been deleted from the remote and notify you that those files are missing in remote storage.
Thanks for the question @Atsu!
This is supported out-of-the-box! Here's how it works:
cml runner --labels=cml --cloud=gcp/
runs-on: [self-hosted, cml] and the maximum allowable
You can follow along with this doc to get started.
The key is requesting GitHub's
This signals to CML to restart the workflow just before the timeout. You'll also have to write your code to cache results so that the restarted workflow will use previous results (e.g. use
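As a rough illustration of code that reuses previous results after a restart, the Python sketch below saves its progress to a checkpoint file after every step and resumes from it on the next run. The function name, the JSON checkpoint format, and the stop_after parameter (which only simulates an interrupted run) are assumptions made for this example. How the checkpoint file is persisted between workflow runs is left open here.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoint(total_steps, checkpoint, stop_after=None):
    """Run up to total_steps, resuming from the checkpoint file if present."""
    state = {"step": 0, "total": 0}
    if checkpoint.exists():
        # An earlier run left results behind, so pick up where it stopped.
        state = json.loads(checkpoint.read_text())
    steps_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            break  # simulates the workflow stopping before it is finished
        state["total"] += state["step"]  # stand-in for real work
        state["step"] += 1
        # Persist progress after every step so a restart loses nothing.
        checkpoint.write_text(json.dumps(state))
        steps_this_run += 1
    return state


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    first = run_with_checkpoint(10, ckpt, stop_after=4)  # interrupted run
    second = run_with_checkpoint(10, ckpt)  # restarted run resumes at step 4
```

The second call does not repeat the first four steps; it only finishes the remaining ones, which is the behavior a restarted workflow needs.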
Awesome question about Studio from @Benjamin-Etheredge!
To show the experiment results in Studio, you have to commit and push the results as part of your CI (continuous integration) action. Here's an example GitHub Actions script that does this.
We understand that it is not ideal to end up with two commits, one with your changes and one with the results. We have been thinking about how this can be improved, and it would be great to hear any thoughts or ideas you have!
Good question from @qubvel!
You can use SSH to handle this and run the following command:
$ dvc import email@example.com:<repository location> <data_path>
Very interesting question from @cajoek!
The data will not be symlinked from the remote to the cache.
The cache can be treated as temporary storage, so it may accumulate a lot of data that will never be used again, for example from failed experiments. In that case, a local remote is a good way to keep track of the important data for the important versions of your project.
That way, when your cache later grows too big and the project takes up too much space, you can remove .dvc/cache and download the latest important version from
At our May Office Hours Meetup, Matt Squire of Fuzzy Labs will join us to share his views on open source MLOps tools! RSVP for the Meetup here to stay up to date with specifics as we get closer to the event!
Join us in Discord to get all your DVC and CML questions answered!