This is a good scenario to try out one of these use cases:
You can have a single storage location mounted on each workstation to serve as a central cache.
That way all of your machine learning engineers can work with the same data in a central location.
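As a rough sketch of that setup, assuming the shared storage is mounted at the same path on every workstation, each project could be pointed at it like this. The function name `use_shared_cache` and the path in the usage note are placeholders, not something from the original question.

```shell
# Hypothetical helper: point the current DVC project at a shared cache.
# The directory you pass in is an assumption; use your own mount point.
use_shared_cache() {
  cache_dir="$1"
  # Keep cached data in the shared location instead of .dvc/cache
  dvc cache dir "$cache_dir"
  # Make newly cached files group-writable so teammates can reuse them
  dvc config cache.shared group
  # Link files into the workspace rather than copying them out of the cache
  dvc config cache.type symlink
}
```

Calling `use_shared_cache /mnt/dvc-cache` inside a repo writes these settings to `.dvc/config`, which you can commit so the rest of the team picks them up.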
Thanks for the question @fchpriani!
Thanks for bringing this up @mattlbeck!
Right now DVC just uses whichever remote is configured in a respective commit that you've checked out.
To clarify things a bit more, if you run dvc push/pull in a workspace with a new remote, that new remote will be used for --all-branches, --all-tags, and --all-commits.
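To make that concrete, here is a minimal sketch. The helper name, remote name, and URL are all hypothetical; the point is only that the push uses the remote configured in the workspace you run it from.

```shell
# Hypothetical helper: register a new default remote in this workspace,
# then push the data referenced by every branch and tag to it.
push_everything_to() {
  name="$1"
  url="$2"
  # -d makes this the default remote in the current workspace's config
  dvc remote add -d "$name" "$url"
  # Data for all branches and tags goes to the workspace's remote
  dvc push --all-branches --all-tags
}
```

For example, `push_everything_to newremote s3://my-bucket/dvc-store` would send data from every branch and tag to the new bucket, assuming that bucket exists and you have access to it.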
You can freeze the stages that you do not want to be executed. dvc freeze and dvc unfreeze help you do this. Or you can use dvc repro --glob pattern* together with -s to match the stages you want to run.
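Both options might look something like the sketch below. The helper names and the stage names in the usage note are made up for illustration.

```shell
# Hypothetical helper: freeze the given stages, then reproduce the pipeline.
# Frozen stages are skipped until you run dvc unfreeze on them.
repro_without() {
  dvc freeze "$@"
  dvc repro
}

# Hypothetical helper: run only the stages whose names match a glob pattern.
repro_matching() {
  # -s (--single-item) runs just the matched stages, not their dependencies
  dvc repro --glob -s "$1"
}
```

For instance, `repro_without prepare featurize` skips two expensive stages, and `repro_matching 'train*'` runs only stages whose names start with "train". Quote the pattern so your shell doesn't expand it against file names first.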
Thanks for the question @LucZ!
dvc checkout for each experiment?
This brings up a good point, so thanks @dmh!
If you usually run experiments with dvc repro, you'll notice that it doesn't check out any files. That's because the experiment is running in the current workspace.
When you queue experiments with dvc exp run --queue and then run them with dvc exp run --run-all, each experiment runs in its own separate temp workspace, so files have to be checked out into those workspaces. Check out the notes in this reference doc on queueing and parallel execution for more details.
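Here is a small sketch of the queue-based flow. The helper and the parameter name in the usage note are hypothetical; -S is the dvc exp run option for overriding a parameter value.

```shell
# Hypothetical helper: queue one experiment per parameter value,
# then run everything in the queue.
queue_and_run() {
  param="$1"
  shift
  for value in "$@"; do
    # --queue only records the experiment; nothing runs yet
    dvc exp run --queue -S "$param=$value"
  done
  # Each queued experiment runs in its own temporary workspace,
  # which is why files get checked out for every one of them
  dvc exp run --run-all
}
```

Something like `queue_and_run train.lr 0.01 0.1` would queue two experiments and then run them, assuming your params file has a `train.lr` entry.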
dvc push to the remote storage without pulling data from all the directories?
This is definitely possible. The most common way to handle this is by working in the specific folder. You can dvc pull -R from the sub-directory, then make your changes in the sub-directory, and dvc add the changes. Then you can do a git commit and dvc push to manage those changes.
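The workflow above could be sketched as two small helpers, one for the pull and one for publishing a change. The function names and paths are hypothetical, and this version assumes you run it from the repository root and pass the sub-directory as a target rather than changing into it.

```shell
# Hypothetical helper: fetch only the data tracked under one directory.
pull_subdir() {
  dvc pull -R "$1"
}

# Hypothetical helper: record and upload a change to one data file.
publish_change() {
  path="$1"      # data file or directory that changed
  message="$2"   # Git commit message
  dvc add "$path"
  # dvc add writes <path>.dvc and keeps a .gitignore next to the data
  git add "$path.dvc" "$(dirname "$path")/.gitignore"
  git commit -m "$message"
  # Upload only the data behind this .dvc file
  dvc push "$path.dvc"
}
```

You might run `pull_subdir data/images`, edit files there, and then call `publish_change data/images/train.csv "Update training data"`. Data in the other directories never has to be downloaded.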
You can also use a Git sub-repo and a DVC sub-repo to do this if each folder has a distinct project. Use git init and dvc init in the project folders and then you can pull them down, modify, commit, and push the commits back.
Really good question @ross.tsenov!
Thanks for asking @Masmoudi!
When you need to retrieve data, you can run dvc pull to get it from the S3 bucket. If you run into an error with this, try adding a step with uses: iterative/setup-dvc@v1 to your workflow before the step that runs dvc pull. The error can happen because the default CML action doesn't install DVC.
If you want more details on how CML works in GitHub, check out the docs!
You can use pipeline schedules to train your model periodically and dvc push the results.
Good question @mihaj!
At our October Office Hours Meetup we will be going over how to get started with data version control. RSVP for the Meetup here to stay up to date with specifics as we get closer to the event!
Join us in Discord to get all your DVC and CML questions answered!