DVC and Hydra integration

How to use Hydra and DVC in the same project has been a recurring question in our community. We decided to address it by providing a deeper integration that lets users benefit from the best of both tools.

David de la Iglesia, Gema Parreno, Dave Berenbaum
October 04, 2022 · 3 min read

A mixture of DeeVee and Hydra, according to Stable Diffusion

Hydra has become one of the most popular tools for managing the configuration of research projects and complex applications, thanks to its ability to compose and override configuration from both the command line and files.

These features complement DVC's core capabilities: data versioning, data pipelines, and experiment management.

To make the two tools work well together, we built a deeper integration: DVC now uses Hydra internals, so users can benefit from the best of both tools.

This post gives an overview of what users of each tool gain from the integration.

What DVC users gain from the integration

Use Hydra composition to configure DVC experiments

Previously, DVC did not provide a way to compose configuration from multiple sources, which is convenient in several use cases, such as switching between different model architectures. The Hydra docs provide a great overview of common patterns where this composition is useful.

DVC can now use Hydra Composition to configure entire DVC pipelines and run DVC experiments.

You can learn more about this feature on the Hydra Composition page of the user guide.
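
As a rough illustration of what composition means, the sketch below picks one option per config group and assembles a single configuration from the choices. It uses plain dictionaries rather than Hydra itself; `CONFIG_GROUPS`, `compose`, and the group and option names are hypothetical stand-ins for YAML files in a config directory, not part of either tool's API.

```python
from copy import deepcopy

# Hypothetical stand-ins for config group files such as
# conf/model/resnet.yaml or conf/optimizer/sgd.yaml.
CONFIG_GROUPS = {
    "model": {
        "resnet": {"name": "resnet", "depth": 50},
        "efficientnet": {"name": "efficientnet", "width": 1.0},
    },
    "optimizer": {
        "sgd": {"name": "sgd", "lr": 0.1},
        "adam": {"name": "adam", "lr": 0.001},
    },
}


def compose(defaults, base=None):
    """Build one config by selecting a single option for each group."""
    config = deepcopy(base) if base else {}
    for group, option in defaults.items():
        config[group] = deepcopy(CONFIG_GROUPS[group][option])
    return config


# Switching architectures only changes the selection, not the files.
baseline = compose({"model": "resnet", "optimizer": "sgd"}, base={"epochs": 10})
variant = compose({"model": "efficientnet", "optimizer": "sgd"}, base={"epochs": 10})
```

Both configurations share the same optimizer and base settings; only the `model` section differs, which is the pattern that makes swapping architectures cheap.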

Appending and removing parameters on the fly

Previously, DVC offered only limited support for modifying configuration through exp run --set-param.

--set-param (short form: -S) now accepts Hydra’s Basic Override syntax, which adds operations such as appending and removing parameters in arbitrary parameter files.

When Hydra’s composition is enabled, the same syntax can be used to override values in the Config Groups and Defaults list.

# Append new param
$ dvc exp run -S '+trainer.gradient_clip_val=0.001'
# Remove existing param
$ dvc exp run -S '~model.dropout'
# Target arbitrary files
$ dvc exp run -S 'train_config.json:+train.weight_decay=0.001'
# Modify the defaults list
$ dvc exp run --set-param 'train/model=efficientnet'
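
To make the three basic operations concrete, here is a minimal sketch that applies append (`+key=value`), remove (`~key`), and change (`key=value`) overrides to a nested dictionary. It is a simplified model of the override semantics, not the code DVC or Hydra runs; `apply_override` and `parse_value` are hypothetical helpers, and value parsing is reduced to integers, floats, and strings.

```python
def parse_value(raw):
    """Convert a raw string to int or float when possible, else keep it."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_override(config, override):
    """Apply one override string to a nested dict, in place."""
    if override.startswith("~"):
        # Remove an existing key, e.g. '~model.dropout'.
        *parents, leaf = override[1:].split(".")
        node = config
        for key in parents:
            node = node[key]
        del node[leaf]
        return config

    append = override.startswith("+")
    path, raw_value = (override[1:] if append else override).split("=", 1)
    *parents, leaf = path.split(".")
    node = config
    for key in parents:
        # Appending may create missing parents; changing may not.
        node = node.setdefault(key, {}) if append else node[key]
    if append and leaf in node:
        raise KeyError(f"{path} already exists; drop the '+' to change it")
    if not append and leaf not in node:
        raise KeyError(f"{path} does not exist; prefix it with '+' to add it")
    node[leaf] = parse_value(raw_value)
    return config


params = {"model": {"dropout": 0.5, "learning_rate": 0.01}, "trainer": {}}
apply_override(params, "+trainer.gradient_clip_val=0.001")  # append
apply_override(params, "~model.dropout")                    # remove
apply_override(params, "model.learning_rate=0.02")          # change
```

Separating "append" from "change" means a typo in a parameter name fails loudly instead of silently adding a new, unused key.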

Grid Search of parameters

Previously, dvc exp run could queue only a single experiment at a time.

exp run --set-param can now use Hydra's Choice and Range syntax for adding multiple experiments to the queue and performing a grid search:

$ dvc exp run -S 'model.learning_rate=range(0.01, 0.5, 0.01)' --queue
Queueing with "{'params.yaml': ['model.learning_rate=0.01']}".
Queued experiment '84e89be' for future execution.
Queueing with "{'params.yaml': ['model.learning_rate=0.02']}".
Queued experiment 'd7708fa' for future execution.
Queueing with "{'params.yaml': ['model.learning_rate=0.03']}".
Queued experiment '5494d5c' for future execution.
Queueing with "{'params.yaml': ['model.learning_rate=0.04']}".
Queued experiment '2e16c1f' for future execution.
Queueing with "{'params.yaml': ['model.learning_rate=0.05']}".
Queued experiment '7c7a615' for future execution.
...

$ dvc queue start
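
A grid search of this kind amounts to expanding each sweep into a list of values and taking the Cartesian product. The sketch below shows that expansion under simplifying assumptions: it handles only `range(start, stop, step)` with a positive step and an exclusive stop, plus comma-separated choices. `expand` and `grid` are hypothetical helpers, the option name `resnet` is made up for the example, and `Decimal` is used here to avoid floating-point drift in the generated values.

```python
from decimal import Decimal
from itertools import product


def expand(value):
    """Expand 'range(start, stop, step)' or 'a,b,c' into a list of strings."""
    if value.startswith("range(") and value.endswith(")"):
        start, stop, step = (Decimal(p.strip()) for p in value[6:-1].split(","))
        if step <= 0:
            raise ValueError("this sketch only supports a positive step")
        values = []
        current = start
        while current < stop:  # stop is exclusive, as in Python's range
            values.append(str(current))
            current += step
        return values
    return [v.strip() for v in value.split(",")]


def grid(overrides):
    """Return one list of concrete overrides per experiment in the grid."""
    keys, sweeps = [], []
    for override in overrides:
        key, value = override.split("=", 1)
        keys.append(key)
        sweeps.append(expand(value))
    return [
        [f"{key}={val}" for key, val in zip(keys, combo)]
        for combo in product(*sweeps)
    ]


experiments = grid([
    "model.learning_rate=range(0.01, 0.06, 0.01)",
    "train/model=resnet,efficientnet",
])
```

Here five learning rates combined with two model choices give ten experiments, each of which would be queued as a separate run.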

What Hydra users gain from the integration

Git-based versioning and caching

Hydra relies on folder-based versioning for managing multiple runs.

With the DVC and Hydra integration, you can version runs as DVC experiments, which enables a more Git-friendly workflow and adds caching, so runs are not recomputed unnecessarily.
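
The general idea behind this kind of caching can be sketched as hashing everything that can change a run's result (the command, the dependency contents, and the parameters) and reusing the stored result when the hash has been seen before. This is a conceptual illustration only, not DVC's actual implementation; `fingerprint` and `RunCache` are hypothetical names.

```python
import hashlib
import json


def fingerprint(cmd, dep_contents, params):
    """Hash the command, dependency contents, and parameters of a run."""
    payload = {
        "cmd": cmd,
        "deps": {
            path: hashlib.sha256(data).hexdigest()
            for path, data in dep_contents.items()
        },
        "params": params,
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


class RunCache:
    """Execute a run only if its fingerprint has not been seen before."""

    def __init__(self):
        self._results = {}
        self.executions = 0

    def run(self, cmd, dep_contents, params, execute):
        key = fingerprint(cmd, dep_contents, params)
        if key not in self._results:
            self.executions += 1
            self._results[key] = execute(params)
        return self._results[key]


def train(params):
    # Placeholder for real training work.
    return {"n_est": params["train.n_est"]}


cache = RunCache()
deps = {"src/train.py": b"print('training')"}
cache.run("python src/train.py", deps, {"train.n_est": 100}, train)
cache.run("python src/train.py", deps, {"train.n_est": 100}, train)  # reused
cache.run("python src/train.py", deps, {"train.n_est": 200}, train)  # recomputed
```

Only the first and third calls execute `train`; the second is served from the cache because nothing that affects the result changed.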

Multi-step, language-agnostic pipelines

Hydra's scope is limited to a single Python script wrapped with the @hydra.main decorator.

By using the DVC and Hydra integration, you can use Hydra to configure entire DVC pipelines, which can be composed of multiple stages running arbitrary commands.

stages:
  featurize:
    cmd: python src/featurization.py data/prepared data/features
    deps:
      - data/prepared
      - src/featurization.py
    params:
      - featurize.max_features
      - featurize.ngrams
    outs:
      - data/features
  train:
    cmd: python src/train.py data/features model.pkl
    deps:
      - data/features
      - src/train.py
    params:
      - train.min_split
      - train.n_est
    outs:
      - model.pkl

$ dvc exp run -S 'featurize.max_features=200' -S 'train.n_est=100'
Running stage 'featurize':
> python src/featurization.py data/prepared data/features

Running stage 'train':
> python src/train.py data/features model.pkl

Conclusion

Starting with DVC 2.25.0, you can use the features described in this post to efficiently combine Hydra and DVC in your projects.

To get a deeper understanding of all the parts involved, you can read the Hydra Composition page of the DVC user guide.
