We recently published a tutorial on using Terraform Provider Iterative (TPI) to move a machine learning experiment from your local computer to a more powerful cloud machine. We covered how to use Terraform & TPI to provision infrastructure, sync data, and run training scripts. To simplify the setup, we used a pre-configured Ubuntu/NVIDIA image. However, we could instead use custom Docker images, which are often recommended in machine learning as well as in traditional software development.
Using Docker to manage dependencies (e.g. Python packages) does not remove all other setup requirements. You'll still need Docker itself installed, as well as GPU runtime drivers if applicable. Happily, TPI sets up all of this by default.
When confronted with cloud infrastructure and dependencies, people often think "oh no, not again" (much like the petunias in the cover image). To solve this, separating dependencies into Docker images gives more control over software versions, and also makes it painless to switch between cloud providers — currently Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, and Kubernetes. Your Docker image is cloud provider-agnostic. There are thousands of pre-defined Docker images online too.
In this tutorial, we'll use an existing Docker image that comes with most of our requirements already installed. We'll then add a few more dependencies on top and run our training pipeline in the cloud as before!
If you haven't read the previous tutorial, you should check out the basics there first, including how to let Terraform know about TPI and the essential commands.
The only modification from the previous tutorial is the `script` part of the `main.tf` config file.
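For reference, a minimal `main.tf` in the spirit of the previous tutorial might look roughly like this (the `cloud`, `machine`, and `storage` values here are illustrative; consult the TPI documentation for the full schema):

```hcl
terraform {
  required_providers {
    iterative = { source = "iterative/iterative" }
  }
}

provider "iterative" {}

resource "iterative_task" "example" {
  cloud   = "aws"    # or: az, gcp, k8s
  machine = "m+t4"   # an illustrative GPU machine type

  storage {
    workdir = "."           # upload the current directory
    output  = "results-gpu" # download this directory on destroy
  }

  script = <<-END
    #!/bin/bash
    # ... the script block discussed below ...
  END
}
```

Only the `script` value changes in what follows; everything else stays as it was.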
Let's say we've found a carefully prepared Docker image suitable for data science and machine learning: in this case, `iterativeai/cml:0-dvc2-base1-gpu`. This image comes loaded with Ubuntu 20.04, Python 3.8, NodeJS, CUDA 11.0.3, cuDNN 8, Git, CML, DVC, and other essentials for full-stack data science.
The `script` block is now:
```hcl
script = <<-END
  #!/bin/bash
  docker run --gpus all -v "$PWD:/tpi" -w /tpi -e TF_CPP_MIN_LOG_LEVEL \
    iterativeai/cml:0-dvc2-base1-gpu /bin/bash -c "
      pip install -r requirements.txt tensorflow==2.8.0
      python train.py --output results-gpu/metrics.json
    "
END
```
Yes, it's quite long for a one-liner. Let's look at the components:
- `docker run`: Download the specified image, create a container from it, and run it.
- `--gpus all`: Expose GPUs to the container.
- `-v "$PWD:/tpi"`: Expose our current working directory (`$PWD`) within the container (as path `/tpi`).
- `-w /tpi`: Set the working directory of the container (to `/tpi`).
- `-e TF_CPP_MIN_LOG_LEVEL`: Expose the environment variable `TF_CPP_MIN_LOG_LEVEL` to the container (in this case to control TensorFlow's verbosity).
- `iterativeai/cml:0-dvc2-base1-gpu`: The image we want to download & run a container from.
- `/bin/bash -c "pip install -r requirements.txt ... python train.py ..."`: Commands to run within the container's working directory; in this case, install the dependencies and run the training script.
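Incidentally, `-e TF_CPP_MIN_LOG_LEVEL` is given without an `=value`: in that form, `docker run` forwards the variable's value from the host environment into the container. A quick sketch of how you'd set it (the levels below are TensorFlow's documented meanings):

```shell
# `docker run -e NAME` (no `=value`) copies NAME's value from the host
# environment into the container, so setting it on the host is enough.
# TensorFlow reads TF_CPP_MIN_LOG_LEVEL at import time:
#   0 = all messages, 1 = hide INFO, 2 = also hide WARNING, 3 = also hide ERROR
export TF_CPP_MIN_LOG_LEVEL=1
```

This keeps the `script` block itself free of hard-coded log settings.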
We can now call `export TF_LOG_PROVIDER=INFO`, and then `terraform apply` to provision infrastructure, upload our data and code, set up the cloud environment, and run the training process. If you'd like to tinker with this example, you can find it on GitHub.
Don't forget to run `terraform refresh && terraform show` to check the status, and `terraform destroy` to download results & shut everything down.
Now you know the basics of using convenient Docker images together with TPI for provisioning your MLOps infrastructure!
If you have a lot of custom dependencies that rarely change (e.g. a large `requirements.txt` that is seldom updated), it's a good idea to build them into your own custom Docker image. Let us know if you'd like a tutorial on this!
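As a rough sketch of that idea (the file and tag names here are assumptions), such a Dockerfile could extend the very image we just used:

```dockerfile
# Build on the same full-stack data science image used above
FROM iterativeai/cml:0-dvc2-base1-gpu

# Bake the rarely-changing Python dependencies into the image itself,
# so the cloud task no longer needs to install them on every run
COPY requirements.txt .
RUN pip install -r requirements.txt tensorflow==2.8.0
```

You'd then push this image to a registry and reference its tag in the `script` block in place of `iterativeai/cml:0-dvc2-base1-gpu`, shrinking the in-script command to little more than `python train.py`.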