When I run `dvc repro` on a stage, does it automatically push any outputs to my remote?

Great question from @tina_rey!
The `dvc repro` command doesn't automatically push any outputs or data to your remote. The outputs are stored in the cache until you run `dvc push`, which then pushes them from your cache to your remote.
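For example, a quick sketch of that two-step flow (the stage name `train` is just an illustration, and a default remote is assumed to be configured):

```bash
# Reproduce the stage; its outputs only land in the local cache (.dvc/cache)
$ dvc repro train

# Nothing has been uploaded yet -- push the cached outputs to the default remote
$ dvc push
```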
Does DVC run the stages in the order shown by `dvc dag`, based on their `deps` and `outs`, so that a stage that depends on the output of another stage will always be executed after the former has finished?
This is a good question from @johnysku!
That is correct! If the pipelines or the stages are independent of each other, they may run in any order; without an explicit dependency linking them, stages could be executed in an unexpected order.
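For example, in a minimal `dvc.yaml` sketch like the one below (stage and file names are made up), `train` lists `data/prepared` in its `deps` while `prepare` produces it as an `out`, so `dvc repro` will always run `prepare` before `train`:

```yaml
stages:
  prepare:
    cmd: python prepare.py
    deps:
      - data/raw
    outs:
      - data/prepared
  train:
    cmd: python train.py
    deps:
      - data/prepared   # produced by `prepare`, so `train` runs after it
    outs:
      - model.pkl
```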
When using the `foreach` utility with `dvc repro`, is there a way I can use glob patterns to create the list DVC needs to iterate over?
Another interesting question from @copah!
If you have `mystage`, which uses `foreach`, you can run `dvc repro mystage` to iterate over every generated `mystage` stage.
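For illustration, here is a minimal `foreach` sketch (the stage name `mystage`, the list items, and `process.py` are assumptions, not a prescribed layout):

```yaml
stages:
  mystage:
    foreach:                 # the list DVC iterates over
      - alpha
      - beta
    do:
      cmd: python process.py --split ${item}
      deps:
        - data/${item}
      outs:
        - results/${item}
```

Running `dvc repro mystage` then reproduces both `mystage@alpha` and `mystage@beta`.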
What happens if I try to pull files that have been deleted from the remote?

Really good question from @Meme Philosopher!

DVC will fail when you try to pull files that have been deleted from the remote and notify you that those files are missing in remote storage.
Can CML handle training jobs that run longer than GitHub Actions' maximum workflow timeout?

Thanks for the question @Atsu!
This is supported out-of-the-box! Here's how it works: launch a runner with `cml runner --labels=cml --cloud=gcp` (or `--cloud=aws`), then set `runs-on: [self-hosted, cml]` and the maximum allowable `timeout-minutes: 4320` in your training job. You can follow along with this doc to get started.

The key is requesting GitHub's maximum `timeout-minutes: 4320`. This signals to CML to restart the workflow just before the timeout. You'll also have to write your code to cache results so that the restarted workflow will use previous results (e.g. see https://dvc.org/doc/user-guide/experiment-management/checkpoints#caching-checkpoints and https://github.com/iterative/dvc/issues/6823).
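To make that setup concrete, here is a minimal workflow sketch; the job names, the training script, and the choice of GCP are assumptions, and the cloud credentials that `cml runner` needs are omitted:

```yaml
name: train
on: push

jobs:
  launch-runner:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: iterative/setup-cml@v1
      - name: Launch a cloud runner
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          # cloud credentials (GCP or AWS) would also be passed in as secrets here
        run: cml runner --labels=cml --cloud=gcp

  train:
    needs: launch-runner
    runs-on: [self-hosted, cml]   # picked up by the runner launched above
    timeout-minutes: 4320         # GitHub's maximum; CML restarts the workflow just before it
    steps:
      - uses: actions/checkout@v3
      - name: Train
        run: |
          # train.py should checkpoint/cache its progress so a restarted
          # workflow can resume from previous results
          python train.py
```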
How do I get my experiment results to show up in Studio when my experiments run in CI?

Awesome question about Studio from @Benjamin-Etheredge!
In order to show the experiment results in Studio, you would have to commit and push the results as part of your CI (continuous integration) action. Here's an example GitHub Actions script that does this.

We understand it's not ideal to have two commits, one with your changes and one with the results. We have been thinking about how this can be improved, and it would be great to hear if you have any thoughts or ideas!
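As a rough sketch of that idea (the result file names and the git identity below are assumptions), the CI step that commits the results back to the repository could look something like this:

```yaml
steps:
  # ... training and evaluation steps above ...
  - name: Commit experiment results
    run: |
      git config user.name "github-actions"
      git config user.email "github-actions@users.noreply.github.com"
      git add dvc.lock metrics.json         # assumed result files
      git commit -m "ci: record experiment results"
      git push
```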
Can I use `dvc import` with a repository that is only accessible over SSH?

Good question from @qubvel!
You can use SSH to handle this and run the following command:

$ dvc import git@<hostname>:<repository> <data_path>
If I set up a local remote, will the data be symlinked from the remote to the cache?

Very interesting question from @cajoek!

The data will not be symlinked from the remote to the cache.
Sometimes we can treat the cache as something temporary: a lot of data that will never be used can end up there from failed experiments, etc. In this case, having a local remote to keep track of the important data for important versions of your project is a good idea. That way, later, when your cache is too big and the project takes up too much space, you can remove `.dvc/cache` and download the latest important version from the remote.
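A minimal sketch of that workflow, assuming a directory on a local drive (`/mnt/storage/dvc-remote` is just a placeholder) serves as the default remote:

```bash
# Use a local directory as the default remote and back up the important versions
$ dvc remote add -d mylocal /mnt/storage/dvc-remote
$ dvc push

# Later, when the cache has grown too large:
$ rm -rf .dvc/cache
$ dvc pull   # re-downloads only what the current version of the project needs
```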
At our May Office Hours Meetup, Matt Squire of Fuzzy Labs will join us to share his view on open source MLOps tools! RSVP for the Meetup here to stay up to date on the specifics as we get closer to the event!
Join us in Discord to get all your DVC and CML questions answered!