A roundup of technical Q&A's from the DVC community. This month, we discuss getting started with CML, configuring your DVC cache, and how to request a tutorial video.
Here are some of our top Q&A's from around the community. With the launch of CML earlier in the month, we've got some new ground to cover!
create a global DVC remote,
a config file will be created in
~/.config/dvc/config instead of your project
Note that on a Windows system, the config file will be created at
dvc pull to sync my local workspace with the project repository. Then, I try running
dvc repro, but get an error:
dvc.yaml does not exist. No one else on my team is having this issue. Any ideas?
This error suggests there is no
dvc.yaml file in your project. Most likely,
this means your teammates are using DVC version 0.94 or earlier, before the
dvc.yaml standard was introduced. Meanwhile, it sounds like you're using
version 1.0 or later. You can check by running
$ dvc version
The best solution is for your whole team to upgrade to the latest version, and there's an easy migration script to help you make the move. If for some reason this won't work for your team, you can either downgrade to a previous version or use a workaround:
$ dvc repro <.dvc file>
substituting the appropriate
.dvc file for your pipeline. DVC 1.0 is backwards
compatible, so pipelines created with previous versions will still run.
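As a rough illustration of the workaround, the following Python sketch checks whether a project has a dvc.yaml and, if not, lists the .dvc files that could be passed to dvc repro. The helper name `legacy_repro_candidates` is hypothetical and not part of DVC, and the sketch assumes the legacy stage files sit at the top level of the project directory.

```python
from pathlib import Path


def legacy_repro_candidates(project_dir):
    """Suggest targets for `dvc repro` in a project directory.

    Returns an empty list when dvc.yaml exists, since a plain `dvc repro`
    should then work. Otherwise returns the names of top-level .dvc files,
    one of which is probably the final stage of the old pipeline.
    """
    root = Path(project_dir)
    if (root / "dvc.yaml").is_file():
        return []
    # Keep regular files only: the hidden `.dvc` directory also matches `*.dvc`.
    return sorted(p.name for p in root.glob("*.dvc") if p.is_file())


if __name__ == "__main__":
    for name in legacy_repro_candidates("."):
        print(f"dvc repro {name}")
```

The list may also include files that only track data (such as data.csv.dvc), so it is a starting point for choosing the right target rather than a definitive answer.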
If you're installing DVC from a binary, such as the one available for download on the DVC homepage, all the standard dependencies are included. You shouldn't need to use pip to install extra packages (such as boto for S3 storage).
When DVC adds a file to a remote repository (such as an S3 bucket, or an SSH file server), there's only one change happening: DVC calculates an md5 for the file and renames it with that md5. In technical terms, it's storing files in a "content-addressable way". That means if you know the hash of a file, you can locate it in your DVC remote and manually download it.
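To make "content-addressable" concrete, here is a minimal Python sketch that computes a file's MD5 and derives a storage key from it. The two-character directory prefix in `remote_key` is an assumption about the remote layout and may differ between DVC versions. DVC's own hashing may also differ in details such as line-ending handling for text files, so treat the hash recorded by DVC as the source of truth.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remote_key(md5_hex):
    """Map a hash to a storage key.

    Assumed layout: the first two hex characters form a directory name
    and the remaining characters form the file name.
    """
    return f"{md5_hex[:2]}/{md5_hex[2:]}"
```

Because the key depends only on the file's contents, two identical files map to the same key no matter which project they come from.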
To find the hash for a given file, say
data.csv, you can look in the
corresponding DVC file:
$ cat data.csv.dvc
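If you would rather extract the hash programmatically, a small Python sketch like the one below can pull it out of the DVC file's text. It assumes the file lists its outputs under an `outs:` key, each with an `md5:` entry, and it uses a regular expression instead of a YAML parser to stay dependency-free. The function name `output_hashes` is hypothetical.

```python
import re

# Matches lines such as "- md5: <32 hex chars>" or "  md5: <32 hex chars>".
MD5_LINE = re.compile(r"^\s*-?\s*md5:\s*([0-9a-f]{32})\b")


def output_hashes(dvc_file_text):
    """Return the MD5 hashes listed under `outs:` in a DVC file's text."""
    hashes = []
    in_outs = False
    for line in dvc_file_text.splitlines():
        if line.startswith("outs:"):
            in_outs = True
            continue
        # Any other top-level key ends the `outs:` section.
        if in_outs and line and line[0] not in " -":
            in_outs = False
        if in_outs:
            match = MD5_LINE.match(line)
            if match:
                hashes.append(match.group(1))
    return hashes
```

Limiting the search to the `outs:` section avoids picking up any other `md5:` entry the file may contain.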
Another approach is using a built-in DVC function:
$ dvc get --show-url . data.csv
You can read more about
dvc get --show-url in
our docs. Note that this
functionality is also part of our Python API, so you can locate the path to a
file in your remote within a Python environment.
Check out our API docs!
Yes, we hear from many users who have created a shared cache. Because of the way DVC uses content-addressable filenames, you won't encounter issues like accidentally overwriting files from one project with another.
A possible issue is that a shared cache will grant all teammates working on a given project access to the data from all other projects using that cache. If you have sensitive data, you can create different caches for projects involving private and public data.
To learn more about setting your cache directory location, see our docs.
The first release of CML is compatible with GitHub and GitLab. We've seen many requests for Bitbucket support, and we're actively investigating how to add this. Stay tuned.
There is a slight difference in the way CML handles credentials for AWS and GCP, and that means you'll have to modify your workflow file slightly. We've added an example workflow for GCP to our project README.
We've updated our cloud compute use case repository docs to cover a GCP example.
Note that for Azure, the workflow will be the same as for AWS. You'll only have
to change the arguments to
Nope, there's no installation unless you wish to install CML in your own Docker image. As long as you are using GitHub Actions or GitLab CI with the CML Docker images, no other steps are needed.
If you're creating your own Docker image to be used in a GitHub Action or GitLab CI pipeline, you can add CML to your image via npm:
$ npm i -g @dvcorg/cml
CML is designed to integrate with lots of tools that ML teams are already familiar with. For example, we set up a wrapper to use CML with TensorBoard, so you get a link to your TensorBoard in a PR whenever your model is training (check out the use case).
While we haven't yet tried to create a use case with MLflow in particular, we think a similar approach could work. We could imagine using MLflow for hyperparameter searching, for example, and then checking in your best model with Git to a CI system for evaluation in a production-like environment. CML could help you orchestrate compute resources for model evaluation in your custom environment, pulling the model and any validation data from cloud storage, and reporting the results in a PR.
If this is something you're interested in, open an issue on our project repository to tell us more about your project and needs. That lets us know it's a priority in the community.
Yes! We recently launched our first CML tutorial video, and a lot of folks let us know they want more. We're aiming to release a new video every week or so in the coming months. Topics will include:
We really want to know what use cases, questions, and issues are most important to you. This will help us make videos that are most relevant to the community! If you have a suggestion or idea, no matter how small, we want to know. Leave a comment on our videos, reach out on Twitter, or ping us in Discord.