Using GCP Remotes in DVC
When you’re working on a data science project that has huge datasets, it’s common to store them in cloud storage. You’ll also be working with different versions of the same datasets to train a model, so it’s crucial to have a tool that enables you to do this quickly and easily. That’s why we’re going to do a quick walkthrough of how to set up a remote in a GCP storage bucket and handle data versioning with DVC.
We’ll start by creating a new storage bucket in our GCP account, then we’ll show how you can add DVC to your project, and finally, we’ll make updates to the dataset with DVC commands. We’ll be working with this repo if you want an example to play with. By the time you finish, you should be able to create this setup for any machine learning project using a GCP remote.
Make sure that you already have a GCP account. You’ll need a valid credit card to create a new account. Once you’re logged into your account, you should see a screen listing some of the services GCP offers.
Note: Remember, GCP does have a free tier if you just want to try it out.
From here, you'll need to create a new project. Search for "create a project" and click the "IAM & Admin" option. Enter the name of the project, which is Bicycle Project in this example, choose the organization and location, and click the Create button. This will take you to your project dashboard and show you all of the stats and settings you have available.
Then go to Cloud Storage in the left sidebar to create a bucket to store the data. When you get to the Cloud Storage page, click the Create Bucket button.
The Bucket page has a lot of configurations you can set, but you can leave the settings in the default state if there’s nothing you need to customize. We have named this example bucket updatedbikedata. Now you can save your changes, and you’ll be redirected to a page where you’ll see the bucket you just created.
Since you have the bucket created, we need to get the credentials to connect the
GCP remote to the project. Go to the
IAM & Admin service and go to
Service Accounts in the left sidebar.
Click the Create Service Account button to create a new service account that you'll use to connect to the DVC project in a bit. Now you can add the name and ID for this service account and keep all the default settings. We've chosen bicycle-service-account for the name and bicycle-account for the ID. Click Create and Continue and it will show the permissions settings. Select a role in the dropdown that lets the account read and write to the bucket, then continue to the next step. Add your user to have access to the service account and confirm. Finally, you'll be redirected to the Service accounts page.
You’ll see your service account, and you’ll be able to click on Actions and go to Manage keys for this service account.
Once you’ve been redirected, click the
Add Key button and this will bring up
the credentials you need to authenticate your GCP account with your project. Go
ahead and download the credentials file and store it somewhere safe.
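If you prefer the terminal to the console, the same resources can be sketched with the gcloud CLI. This is a minimal sketch under a few assumptions, not the method used in the steps above: it assumes the gcloud CLI is installed and logged in, the function name create_gcp_resources and the default key file name bicycle-key.json are made up for illustration, and you pass in your own project ID. It covers only the bucket, the service account, and the key, so the service account's permissions still need to be set as described above.

```shell
# Hypothetical helper: creates the bucket, service account, and key from
# this walkthrough with the gcloud CLI instead of the console.
# Usage: create_gcp_resources <project-id> [key-file]
create_gcp_resources() {
  project_id="$1"
  key_file="${2:-bicycle-key.json}"   # assumed file name, pick your own
  sa_email="bicycle-account@${project_id}.iam.gserviceaccount.com"

  # Create the storage bucket with default settings.
  gcloud storage buckets create gs://updatedbikedata \
    --project="$project_id" || return 1

  # Create the service account (ID and display name from the steps above).
  gcloud iam service-accounts create bicycle-account \
    --display-name="bicycle-service-account" \
    --project="$project_id" || return 1

  # Download a JSON key for the service account; store it somewhere safe.
  gcloud iam service-accounts keys create "$key_file" \
    --iam-account="$sa_email" \
    --project="$project_id"
}
```

Calling `create_gcp_resources <your-project-id>` runs the three commands in order and stops at the first one that fails.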
That’s it for setting up your storage bucket and getting the credentials you need! Now let’s add DVC to our demo repo and set up the remote.
First, add DVC as a requirement to your project with the following installation command:
$ pip install 'dvc[gs]'
Then you can initialize DVC in your own project with the following command:
$ dvc init
This will add all of the DVC internals needed to start versioning your data and tracking experiments. Now we need to set up the remote to connect our project data stored in GCP to the DVC repo.
You can make the GCP bucket the default remote for the project with the following command:
$ dvc remote add -d bikes gs://updatedbikedata
This creates a default remote called bikes that connects to the updatedbikedata bucket we made earlier, which is where any data for the model will be stored.
In order for DVC to be able to push and pull data from the remote, you need to have valid GCP credentials.
If you are already using the GCP CLI (google-cloud-sdk), you should be able to run gcloud auth application-default login. This method doesn't require a service account.
You can also authenticate with the service account we created earlier. There are a couple of ways to do this with the credentials file we downloaded.
You can run the following command with your service account email.
$ gcloud auth activate-service-account email@example.com --key-file=../tonal-history-154018-e62a79baf90f.json
If you don't have the GCP CLI installed and you want to use the service account,
you can set the
GOOGLE_APPLICATION_CREDENTIALS environment variable to point
to the credentials file, like this:
$ export GOOGLE_APPLICATION_CREDENTIALS='../tonal-history-154018-e62a79baf90f.json'
Or you can add the credentials file location with the following command:
$ dvc remote modify --local bikes credentialpath '../tonal-history-154018-e62a79baf90f.json'
You can check out more about authentication here in the GCP docs.
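Before pointing DVC at the key, it can help to confirm the file is where you think it is. The sketch below wraps the environment variable approach in a small function. The name use_gcp_key is hypothetical (it is not part of DVC or gcloud), and the path you pass is your own.

```shell
# Hypothetical helper: export GOOGLE_APPLICATION_CREDENTIALS only if the
# key file exists and is readable.
# Usage: use_gcp_key <path-to-key.json>
use_gcp_key() {
  key_file="$1"
  if [ ! -f "$key_file" ] || [ ! -r "$key_file" ]; then
    echo "Key file not found or not readable: $key_file" >&2
    return 1
  fi
  # Resolve to an absolute path so the variable still works after `cd`.
  key_dir="$(cd "$(dirname "$key_file")" && pwd)" || return 1
  GOOGLE_APPLICATION_CREDENTIALS="${key_dir}/$(basename "$key_file")"
  export GOOGLE_APPLICATION_CREDENTIALS
}
```

Because the function has to change your current shell's environment, define it in your shell (or source the file that contains it) rather than running it as a separate script.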
Now you can push data from your local machine to the GCP remote! First, add the data you want DVC to track with the following command:
$ dvc add data
This will allow DVC to track the entire data directory so it will note when any changes are made. Then you can push that data to your GCP remote with this command:
$ dvc push
Once the push succeeds, you can see the uploaded data in the updatedbikedata bucket in GCP.
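DVC stores the data itself in the remote, while Git tracks the small pointer and config files that DVC writes. Here is a sketch of the full add-and-push step, assuming the project is a Git repository and the data lives in a top-level directory such as data. The name track_and_push is hypothetical and the commit message is only an example.

```shell
# Hypothetical helper: track a top-level directory with DVC, record the
# pointer file in Git, and upload the data to the default remote.
# Usage: track_and_push <data-dir>
track_and_push() {
  data_dir="${1%/}"   # strip a trailing slash, e.g. data/ -> data

  # Writes <data-dir>.dvc and adds the directory to .gitignore.
  dvc add "$data_dir" || return 1

  # Stage the pointer file plus the DVC config that holds the remote URL.
  git add "${data_dir}.dvc" .gitignore .dvc/config || return 1

  # A failed commit (for example, nothing changed) should not block the push.
  git commit -m "Track ${data_dir} with DVC" \
    || echo "Nothing new to commit" >&2

  # Upload the tracked data to the default remote.
  dvc push
}
```

With the pointer file committed, each Git commit records which version of the dataset it was built against.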
Then if you move to a different machine or someone else needs to use that data, it can be accessed by cloning or forking the project repo, setting up the remote and running:
$ dvc pull
Note: Depending on the authentication method being used, there might be some required extra steps, such as making sure users actually have the permissions to read/write to the bucket.
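On a fresh machine, the clone-and-pull steps might look like the sketch below. It assumes the remote settings in .dvc/config were committed to the repo, so only the credentials need to be set locally. The name fetch_project_data is hypothetical, and the repo URL, target directory, and key path are placeholders you supply.

```shell
# Hypothetical helper: clone a DVC project and pull its data from the
# GCP remote using a service account key.
# Usage: fetch_project_data <repo-url> <target-dir> <absolute-key-path>
# The body runs in a subshell (note the parentheses), so the caller's
# working directory is left unchanged.
fetch_project_data() (
  repo_url="$1"
  target_dir="$2"
  key_file="$3"   # absolute path, since we change directory below

  git clone "$repo_url" "$target_dir" || exit 1
  cd "$target_dir" || exit 1

  # Install DVC with Google Cloud Storage support.
  pip install 'dvc[gs]' || exit 1

  # --local writes to .dvc/config.local, which DVC keeps out of Git,
  # so the key path is never committed.
  dvc remote modify --local bikes credentialpath "$key_file" || exit 1

  # Download the data referenced by the .dvc files in the repo.
  dvc pull
)
```

If you authenticate with gcloud auth application-default login or the GOOGLE_APPLICATION_CREDENTIALS variable instead, the dvc remote modify line can be dropped.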
That’s it! Now you can connect any DVC project to a GCP storage bucket. If you run into any issues, make sure to check that your credentials are valid, check if your user has MFA enabled, and check that the user has the right level of permissions.