DVC is looking for students to take part in Google Summer of Code 2020.
Announcement, announcement! After a successful experience with Google Season of Docs in 2019, we're putting out a call for students to apply to work with DVC as part of Google Summer of Code. If you want to make a dent in open source software development with mentorship from our team, read on.
Besides the general requirements to apply to Google Summer of Code, there are a few skills we look for in applicants.
If you like our mission but aren't sure if you're sufficiently prepared, please be in touch anyway. We'd love to hear from you.
Below are several project ideas that are an immediate priority for the core DVC team. Of course, we welcome students to create their own proposals, even if they differ from our ideas. Projects will be primarily mentored by co-founders Dmitry Petrov and Ivan Shcheklein.
Migrate to the latest v3 API to improve Google Drive support. Our
organization is a co-maintainer of the PyDrive library in collaboration with
a team at Google. The PyDrive library is now several years old and still
relies on the v2 protocol. We would like to migrate to v3, which we expect
will boost performance for many DVC use cases (for example, through the
ability to filter the fields retrieved from the API). For this project, we’re
looking for a student to work with us to prepare the next major version of the
PyDrive library and to make important changes to the core DVC code to
support it. Because PyDrive is broadly used outside of DVC, this project is a
chance to work on a library of widespread interest to the Python community.
Skills required: Python, Git, experience with APIs
Difficulty rating: Beginner-Medium
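To make the field-filtering idea concrete, here is a minimal sketch of a Drive v3 `files.list` request URL that asks only for the fields a client needs. It is an illustration, not PyDrive or DVC code: the helper name `build_list_url`, the default field selection, and the folder ID are assumptions made for this example, and a real request would also need OAuth credentials.

```python
from urllib.parse import urlencode

# Public endpoint of the Drive v3 files.list method.
DRIVE_V3_FILES = "https://www.googleapis.com/drive/v3/files"


def build_list_url(folder_id, fields=("id", "name", "md5Checksum")):
    """Build a files.list URL that requests only the given file fields.

    `build_list_url` is a hypothetical helper written for this post; the
    default field selection is an assumption about what a client might need.
    """
    params = {
        # List the non-trashed children of one folder.
        "q": f"'{folder_id}' in parents and trashed = false",
        # Partial response: return only the named fields for each file.
        "fields": "nextPageToken, files({})".format(", ".join(fields)),
    }
    return DRIVE_V3_FILES + "?" + urlencode(params)


# "FOLDER_ID" is a placeholder, not a real Drive folder.
print(build_list_url("FOLDER_ID"))
```

Requesting a narrow set of fields keeps each response small, which matters when listing folders that contain many files.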
Introducing parallelism to DVC. One of DVC’s features is the ability to
create pipelines, linking data repositories with code to process data, train
models, and evaluate model metrics. Once a DVC pipeline is created, the
pipeline can be shared and re-run in a systematic and entirely reproducible
way. Currently, DVC executes pipelines sequentially, even though some steps
could be run in parallel (such as data preprocessing). We would like to support
parallelization for pipeline steps specified by the user. Furthermore, we’ll
need to add flags to DVC commands that specify the level of
parallelization (CPU, GPU or memory).
Skills required: Python, Git. Some experience with parallelization and/or
scientific computing would be helpful but not required.
Difficulty rating: Advanced
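As a rough illustration of the idea (not DVC's implementation), the sketch below runs the steps of a small dependency graph in parallel using only the Python standard library: a step starts as soon as all of its dependencies have finished. The function `run_pipeline`, the `jobs` parameter, and the stage names are hypothetical and chosen for this example; `graphlib` requires Python 3.9 or newer.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # Python 3.9+


def run_pipeline(stages, deps, jobs=2):
    """Run every stage once, starting each as soon as its dependencies finish.

    stages: dict mapping a stage name to a zero-argument callable
    deps:   dict mapping a stage name to the set of stages it depends on
    jobs:   maximum number of stages running at the same time
    Returns the stage names in the order they were recorded as finished.
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in stages})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished = []
    running = {}  # future -> stage name
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        while sorter.is_active():
            # Submit every stage whose dependencies are all done.
            for name in sorter.get_ready():
                running[pool.submit(stages[name])] = name
            # Block until at least one running stage completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raise any exception from the stage
                finished.append(name)
                sorter.done(name)  # may unlock dependent stages
    return finished


# Hypothetical pipeline: two independent preprocessing steps feed training.
stages = {
    "prep_images": lambda: print("preprocessing images"),
    "prep_labels": lambda: print("preprocessing labels"),
    "train": lambda: print("training model"),
}
deps = {"train": {"prep_images", "prep_labels"}}
print(run_pipeline(stages, deps, jobs=2))
```

Threads keep the sketch short; CPU-bound or GPU-bound steps would more likely run as separate processes, which is part of what this project would need to design.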
Developing use cases for data registries and ML model zoos. A new DVC
functionality that we’re particularly excited about is summon, a method
that can turn remotely-hosted machine learning artifacts such as datasets,
trained models, and more into objects in the user’s local environment (such
as a Jupyter notebook). This is a foundation for creating data catalogs of
data-frames and machine learning model zoos on top of Git repositories and
cloud storages (like GCS or S3). We need to identify and implement model zoos
(think PyTorch Hub, the Caffe Model Zoo, or the TensorFlow DeepLab Model Zoo)
and data registries for types that are not supported by DVC yet. Currently,
we’ve tested summon with PyTorch image segmentation models and Pandas
dataframes. We’re looking for students to explore other possible use cases.
Skills required: Python, Git, and some machine learning or
data science experience
Difficulty rating: Beginner-Medium
Continuous delivery for JetBrains TeamCity. Continuous integration and
continuous delivery (CI/CD) for ML projects is an area where we see
DVC making a big impact, specifically by delivering datasets and ML models
into CI/CD pipelines.
While there are many cases when DVC is used inside GitHub Actions and GitLab
CI, you will be transferring this experience to another type of CI/CD system,
JetBrains TeamCity. We're working to
integrate DVC's model and dataset versioning into TeamCity's CI/CD toolkit.
This project would be ideal for a student looking to explore the growing
field of MLOps, an offshoot of DevOps with the specifics of ML projects at
the center.
Skills required: Python, Git, bash scripting. It
would be nice, but not necessary, to have some experience with CI/CD tools
and developer workflow automation.
Difficulty rating: Medium-Advanced
DVC performance testing framework. Performance is a core value of DVC. We
will be creating a performance monitoring and testing framework to which new
scenarios (e.g., unit testing) can be added. The framework should reflect
all performance improvements and degradations for each of the DVC releases.
It would be especially compelling if testing could be integrated with our
GitHub workflow (CI/CD). This is a great opportunity for a student to learn
about DVC and versioning in-depth and contribute to its stability.
Skills required: Python, Git, bash scripting.
Difficulty rating: Medium-Advanced
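One possible starting point for such a framework, sketched here under our own assumptions rather than taken from existing DVC code, is a pair of helpers: one that times a scenario, and one that compares the timings of a new release against a stored baseline. The names `benchmark` and `compare`, the 10% tolerance, and the sample timings are all hypothetical.

```python
import time
from statistics import median


def benchmark(func, repeats=5):
    """Return the median wall-clock time, in seconds, of calling func()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return median(timings)


def compare(baseline, current, tolerance=0.10):
    """Label each scenario in `current` relative to `baseline`.

    Both arguments map a scenario name to a duration in seconds. A scenario
    counts as changed only if it moved by more than `tolerance` (a fraction).
    """
    report = {}
    for name, new in current.items():
        old = baseline.get(name)
        if old is None:
            report[name] = "new"
        elif new > old * (1 + tolerance):
            report[name] = "regression"
        elif new < old * (1 - tolerance):
            report[name] = "improvement"
        else:
            report[name] = "unchanged"
    return report


# Made-up timings for two releases, for illustration only.
previous_release = {"add": 1.0, "push": 2.0}
this_release = {"add": 1.5, "push": 1.0, "pull": 3.0}
print(compare(previous_release, this_release))
```

Using the median of several runs and a tolerance band reduces noise from shared CI machines; a real framework would also need to store results per release and choose representative scenarios.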
Please refer to the Google Summer of Code application guides for specifics of the program. Students looking to know more about DVC and our worldwide community of contributors will learn the most by visiting our Discord channel, GitHub repository, and Forum. We are available to discuss project proposals from interested students and can be reached by email or on our Discord channel.