End-to-End Computer Vision API, Part 1: Data Versioning and ML Pipelines

This is the first part of a three-part series of posts:

  • Part 1: Data Versioning and ML Pipelines (this post)
  • _Part 2: [Local Experiments](/blog/end-to-end-computer-vision-api-part-2-local-experiments)_
  • _Part 3: [Remote Experiments & CI/CD For Machine Learning](/blog/end-to-end-computer-vision-api-part-3-remote-exp-ci-cd)_

Lately, training a well-performing Computer Vision (CV) model in Jupyter Notebooks has become fairly straightforward. You can use pre-trained Deep Learning models and high-level libraries like fastai, keras, and pytorch-lightning that abstract away much of the complexity. However, it's still hard to incorporate these models into a maintainable production application in a way that brings value to customers and the business.

Below we'll present tools that naturally integrate with your git repository and make this part of the process significantly easier.

  • Alex Kim
  • May 03, 2022 · 5 min read

Introduction

In this series of posts, we'll describe an approach that streamlines the lifecycle stages of a typical Computer Vision project, from proof-of-concept to configuration and parameter tuning and, finally, to deployment in the production environment.

Automatic defect detection is a common problem encountered in many industries, especially manufacturing. A typical setup would include a conveyor belt that moves products along the production line and a camera installed above the conveyor. The camera takes pictures of the products moving below and is connected to a computer that controls it. This computer needs to send raw images to a defect detection service and receive back information about the location and size of any defects; it may even control what happens to a defective product through a robotic arm connected via a PLC (programmable logic controller).

As our demo project, we've selected a very common deployment pattern for this setup: a CV model wrapped in a web API service. Specifically, we'll perform an image segmentation task on a magnetic tiles dataset first introduced in this paper and available in this GitHub repository.

A sample from the image segmentation dataset we used for this project. Top row: images of magnetic tile surfaces. Bottom row: segmentation masks (white pixels show defective areas)

  • This post (part 1) introduces the concepts of data versioning and ML pipelines as they apply to Computer Vision projects.
  • Part 2 will focus on experiment tracking and management - key components needed for effective collaboration between team members.
  • In part 3, you’ll learn how to easily move your model training workloads from a local machine to cloud infrastructure and set up proper CI/CD workflows for ML projects.

Target Audience

We assume the target audience of this post to be technical folks who are familiar with general Computer Vision concepts, CI/CD processes, and cloud infrastructure. Familiarity with the Iterative ecosystem of tools such as DVC, CML, and Studio is not required but will help with understanding the nuances of our solution.

Summary of the Solution

All the code for the project is stored in this GitHub repository.

The CV API solution that we are proposing can be summarized in the following steps:

  • The client service will submit an image to our API endpoint
  • The image will be preprocessed to adhere to the specifications that our model expects
  • The CV model will ingest the processed image and output its predicted image mask
  • Some postprocessing will be applied to the image mask
  • The API will reply to the client with the output mask

The repository also contains code for the web application itself, which can be found in the app directory. While the web application is very simple, its implementation is beyond the scope of this blog post. In short, it's based on the FastAPI library, and we deploy it to the Heroku platform through a Docker container defined in this Dockerfile.
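
To make the request flow above concrete, here's a minimal sketch of what such an endpoint could look like. Note that this is not code from the project repository: the route name, the 224x224 preprocessing size, and the threshold-based placeholder standing in for the real segmentation model are all assumptions.

# Illustrative sketch of the API flow described above
# (not the project's actual app code).
from io import BytesIO

import numpy as np
from PIL import Image
from fastapi import FastAPI, File, Response, UploadFile

app = FastAPI()


def preprocess(raw_bytes: bytes) -> np.ndarray:
    # Decode the upload and resize it to the input size the model expects
    # (224x224 grayscale is an assumption, not the project's setting).
    img = Image.open(BytesIO(raw_bytes)).convert("L").resize((224, 224))
    return np.asarray(img)


def predict_mask(image: np.ndarray) -> np.ndarray:
    # Placeholder for the real model; the actual service would load a trained
    # fastai learner from a pickle file and run inference here.
    return (image > image.mean()).astype(np.uint8) * 255


@app.post("/analyze")
async def analyze(image: UploadFile = File(...)):
    mask = predict_mask(preprocess(await image.read()))
    buf = BytesIO()
    Image.fromarray(mask).save(buf, format="PNG")
    return Response(content=buf.getvalue(), media_type="image/png")

You could serve this locally with, e.g., uvicorn main:app (assuming the file is saved as main.py) and POST an image to /analyze to get the predicted mask back as a PNG.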

Simplified diagram of the CV API solution

Prerequisites for Reproduction

Feel free to fork the repository if you'd like to replicate our steps and deploy your own API service. Keep in mind that you'll need to set up and configure a few things yourself.

For security reasons, you'll need to set up all keys and tokens through GitHub secrets. You'll also need to change the remote location (and its name) in the DVC config file for versioning data and other artifacts.
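
For example, pointing DVC at your own bucket could look like the following; the remote name and the S3 URL here are placeholders, not the project's actual values.

$ dvc remote add -d myremote s3://my-bucket/path/to/dvc-storage
$ git add .dvc/config
$ git commit -m "Switch DVC remote to my own bucket"

DVC will also need credentials for that bucket, which is where the GitHub secrets mentioned above come in for CI jobs.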

Proof-of-Concept in Jupyter Notebooks

A typical ML project would start with data collection and/or labeling, but we are skipping all this hard work because it was done for us by the researchers who published the dataset.

We'll get right to the exciting part: training CV models in Jupyter notebooks, which you can find here. In short, there are three notebooks:

  1. 1_ProcessData.ipynb downloads, processes, and organizes the data for easy loading into the training process later
  2. 2_TrainSegmentationModel.ipynb uses fastai Deep Learning framework to train an image segmentation model
  3. 3_Evaluate.ipynb computes model performance on the test dataset

Jupyter Notebook is by far the most popular tool for quick exploratory work when it comes to data analysis and modeling. However, it's not without limitations. One of the biggest issues with Jupyter is that it has no guardrails to ensure reproducibility: variables and objects can carry hidden state, and cells can be run out of order. While several projects attempt to alleviate some of these issues (notably, nodebook, papermill, nbdime, nbval, nbstripout, and nbQA), they don't solve them completely.

That's where the concepts of data versioning and ML pipelines come in.

Data Versioning

In most ML projects, training data changes gradually over time as new training instances (images in our case) get added while older ones might be removed. Simply creating snapshots of our training data at training time (e.g. labeling data directories with dates) quickly becomes unsustainable since these snapshots will contain many duplicates. Additionally, tracking which data directory was used to train each model quickly becomes hard to manage, and linking data versions and models to their respective code versions complicates things even further.

A much better approach is to:

  1. track only the deltas between different versions of the datasets; and

  2. have the project's git repository store only reference links to the data, while the actual data is stored in remote storage

This is exactly what DVC lets us do with only a couple of commands. Under the hood, DVC handles all the complexity of managing data versions, deduplicating files, pushing to and pulling from different remote storage solutions, and more.
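
As a quick illustration, the basic workflow boils down to something like this (the data directory name is an assumption):

$ dvc add data                  # start tracking the dataset; creates data.dvc
$ git add data.dvc .gitignore   # commit the small reference file, not the data itself
$ git commit -m "Track dataset with DVC"
$ dvc push                      # upload the actual files to the configured remote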

Check out this tutorial to learn more about data and model versioning with DVC.

Diagram of how DVC performs data versioning

In this project, AWS S3 is our remote storage configured in the .dvc/config file. In other words, we store the images in an AWS bucket while only keeping references to those files in our git repository.
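
The relevant part of that config file looks roughly like this; the remote name and bucket URL below are placeholders rather than the project's actual values.

[core]
    remote = storage
['remote "storage"']
    url = s3://my-bucket/path/to/dvc-storage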

Refactoring Jupyter code into an ML pipeline

Another powerful set of DVC features is ML pipelines. An ML pipeline is a way to codify and automate the workflow used to reproduce a machine learning model. A pipeline consists of a sequence of stages.

First, we refactored our Jupyter code into individual, self-contained modules:

  • data_load.py downloads raw data locally
  • data_split.py splits data into train and test subsets
  • train.py uses the fastai library to train a UNet model with a ResNet-34 encoder and saves it to a pickle file
  • eval.py evaluates the model's performance on the test subset

Specific execution commands, dependencies, and outputs of each stage are defined in the pipeline file dvc.yaml (more about pipeline files here).
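
To give a feel for the format, here's a trimmed-down sketch of what such a dvc.yaml could look like. The stage commands match the modules above, but the dependency and output paths are illustrative assumptions, not the project's exact contents.

stages:
  data_load:
    cmd: python src/stages/data_load.py --config=params.yaml
    deps:
      - src/stages/data_load.py
    outs:
      - data/raw
  train:
    cmd: python src/stages/train.py --config=params.yaml
    deps:
      - src/stages/train.py
      - data/processed
    outs:
      - models/model.pkl
  # data_split and evaluate follow the same pattern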

We've also added an optional check_packages stage that freezes the environment into a requirements.txt file listing all installed Python packages and their versions. We enabled the always_changed field in this stage's configuration to ensure DVC reruns it every time. All other stages have this file as a dependency, so the entire pipeline will be rerun if anything about our Python environment changes.
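
The corresponding stage entry in dvc.yaml could look roughly like this; always_changed is the field described above, while the cache: false setting for requirements.txt is just one plausible choice, not necessarily what the project uses.

  check_packages:
    cmd: python src/stages/check_pkgs.py --config=params.yaml
    outs:
      - requirements.txt:
          cache: false       # keep the frozen package list in git rather than the DVC cache
    always_changed: true     # tell DVC to treat this stage as changed on every run

Every other stage then lists requirements.txt under its deps, which is what makes the whole pipeline sensitive to environment changes.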

We can see the whole dependency graph (directed acyclic graph, to be exact) using the dvc dag command:

$ dvc dag
                               +----------------+
                               | check_packages |
                          *****+----------------+
                    *****         *    **    **
               ****             **       **     ***
            ***               **            **      ***
+-----------+               **               *         ***
| data_load |             **                 *             *
+-----------+           **                   *             *
           ***        **                     *             *
              *     **                       *             *
               **  *                         *             *
          +------------+                     *             *
          | data_split |***                  *             *
          +------------+   ***               *             *
                  *            ***           *             *
                  *               ***        *             *
                  *                   **     *             *
                  **                    +-------+        ***
                    ***                 | train |     ***
                        ***             +-------+   ***
                            ***        **        ***
                               ***   **      ***
                                   **    ***
                                  +----------+
                                  | evaluate |
                                  +----------+

The entire pipeline can be easily reproduced with the dvc exp run command:

$ dvc exp run
Running stage 'check_packages':
> python src/stages/check_pkgs.py --config=params.yaml
...
Running stage 'data_load':
> python src/stages/data_load.py --config=params.yaml
...
Running stage 'data_split':
> python src/stages/data_split.py --config=params.yaml
...
Running stage 'train':
> python src/stages/train.py --config=params.yaml
...
Running stage 'evaluate':
> python src/stages/eval.py --config=params.yaml
...

Summary

In this first part of the blog post, we talked about the following:

  • Common difficulties when building a Computer Vision Web API for defect detection
  • Pros and cons of exploratory work in Jupyter Notebooks
  • Versioning data in remote storage with DVC
  • Moving and refactoring the code from Jupyter Notebooks into DVC pipeline stages

In the second part, we’ll see how to get the most out of experiment tracking and management by seamlessly integrating DVC, DVCLive, and Iterative Studio.
