A roundup of technical Q&As from the DVC community. This month, we discuss the DVC cache, cloud storage options, and concurrency.
Here are some Q&As from our Discord channel that we think are worth sharing.
To list the files that are currently being tracked in a project repository by
DVC and Git, you can use dvc list. This will display the contents of that
repository, including .dvc files. To download the contents corresponding to a
.dvc file, use dvc get.
Let's consider an example using both commands. Assume we're working with DVC's data registry example repository. To list the files present, run:
$ dvc list -R https://github.com/iterative/dataset-registry
.gitignore
README.md
get-started/.gitignore
get-started/data.xml
get-started/data.xml.dvc
...
Note the -R flag, which enables dvc list to display the contents of
directories inside the repository. Now assume you want to download
get-started/data.xml, which we can see is being tracked by DVC. To download
the dataset to your local workspace, you would then run
$ dvc get https://github.com/iterative/dataset-registry get-started/data.xml
I want to restrict who can run dvc gc --cloud so users can't accidentally delete files in the remote. Will it be sufficient to restrict deletion in the remote's settings?
You're right to be careful, because dvc gc --cloud can be dangerous in the
wrong hands: it'll remove any unused files in your remote (for more info,
see our docs). To prevent users from having this power, setting your bucket
policy to block object deletions should do the trick. How to do this will
depend on your cloud storage provider, so check their documentation on
restricting delete permissions.
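For instance, on Amazon S3 one way to block deletions is a bucket policy that denies s3:DeleteObject. Here's a rough sketch, assuming the AWS CLI and a hypothetical bucket name (you may want to scope the Principal more narrowly than "*"):

$ cat > deny-delete.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BlockObjectDeletion",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::my-dvc-remote/*"
    }
  ]
}
EOF
$ aws s3api put-bucket-policy --bucket my-dvc-remote --policy file://deny-delete.json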
For the full list of supported remote storage types, see our documentation.
There's no need for a DVC server. Our remote storage works on top of most kinds of cloud storage by default, including S3, Google Cloud Storage, Azure, Google Drive, and Aliyun, with no additional infrastructure required. As for GitHub (or Bitbucket, or GitLab, etc.), this is only needed if you're interested in sharing your project with others over that channel. We like sharing projects on GitHub, but you don't have to. Any Git repository, even a local one, will do.
So a "minimal" DVC project for you might consist of a local workspace with Git enabled (which you do need), a local Git repository, and your S3 remote storage. Check out our use cases to see some examples of infrastructure and workflow for teams.
There are a few ways for concurrency to occur: multiple jobs running in parallel on the same machine, or different users on different machines. But in any case, the answer is the same: there's nothing to worry about! When pushing a file to a DVC remote, all operations are non-destructive and atomic.
We support granular operations on DVC project repositories! Say your project's
DVC remote contains several .dvc files corresponding to different stages of a
pipeline, such as process_data_stage.dvc and 2_train_model.dvc. If you're only
interested in the files output by the final stage of the pipeline
(2_train_model.dvc), you can run:
$ dvc pull 2_train_model.dvc
You can also use
dvc pull at the level of individual files. This might be
needed if your DVC pipeline file creates 10 outputs, for example, and you only
want to pull one (say,
model.pkl, your trained model) from remote DVC storage.
You'd simply run
$ dvc pull model.pkl
Sometimes, you realize you don't want to put a file under DVC tracking after
all. That's okay, and easy to fix. Simply remove the .dvc file like any other
file: rm <file>.dvc. DVC will then stop tracking the file, and the associated
target file will still be in your local workspace. Note that a copy of the file
will still be in your local DVC cache unless you clear it with dvc gc.
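A minimal sketch, assuming a dataset tracked by a hypothetical data.xml.dvc file:

$ rm data.xml.dvc   # DVC stops tracking data.xml; the file itself stays in the workspace
$ dvc gc -w         # optional: drop unreferenced copies from .dvc/cache, keeping only what the current workspace uses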
The dvc move command is used to rename a file or directory and simultaneously
update its corresponding .dvc file. It's handy because it keeps you from
renaming a DVC-tracked file in your local workspace without updating DVC to
the change (see an example here).
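A quick sketch with hypothetical file names:

$ dvc add data/raw.csv                   # tracked via data/raw.csv.dvc
$ dvc move data/raw.csv data/input.csv   # renames the file and its .dvc file in one step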
The command doesn't work on "stage files" from DVC pipelines. There's not
currently an easy way to safely move them, and it's an open issue we're
working on.
Until then, you can manually update dvc.yaml, or make a new one in the desired
location.
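One possible manual sequence, as a sketch only (the paths are hypothetical, and dvc commit may ask for confirmation):

$ mv models/model.pkl artifacts/model.pkl   # move the stage output by hand
# edit the stage's outs: entry in dvc.yaml to point at artifacts/model.pkl, then:
$ dvc commit                                # record the new location without re-running the pipeline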
When I dvc push files to remote cloud storage, the directory in my remote looks like my DVC cache, not my local workspace directory. Is this right?
Yep, that's exactly how it should be! In order to provide deduplication and some
other optimizations, your DVC remote's directory structure will mirror the DVC
cache (which is by default in your local workspace under .dvc/cache).
Effectively, DVC uses your Git repository to store DVC files, which are keys for
cache files on your remote. So looking inside your remote won't be particularly
enlightening if you're looking for human-readable filenames: the file names will
look like hashes (because, well, they are). Luckily, DVC handles all the
conversions between the filenames in your local workspace and these hashes.
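To make that concrete, here is roughly what the cache (and, mirrored, the remote) might look like after tracking a single file. The hash value is illustrative, and the exact layout can vary a bit between DVC versions:

$ tree .dvc/cache
.dvc/cache
└── a3
    └── 04afb96060aad90176268345e10355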
To get some more intuition about this, check out some of our docs about how DVC organizes files.