This is the first part of a three-part series of posts:
Lately, training a well-performing Computer Vision (CV) model in Jupyter
Notebooks has become fairly straightforward. You can use pre-trained Deep Learning
models and high-level libraries
that abstract away much of the complexity. However, it's still hard to
incorporate these models into a maintainable production application in a way
that brings value to the customers and business.
Below, we'll present tools that integrate naturally with your git repository and make this part of the process significantly easier.
In this series of posts, we'll describe an approach that streamlines the lifecycle stages of a typical Computer Vision project, going from proof-of-concept to configuration and parameter tuning and, finally, to deployment in the production environment.
Automatic defect detection is a common problem encountered in many industries, especially manufacturing. A typical setup would include a conveyor belt that moves some products along the production line and a camera installed above the conveyor. The camera takes pictures of the products moving below and connects to a computer that controls it. This computer needs to send raw images to some defect detection service, receive information about the location and size of the defects, if any, and may even control what happens to a defective product by being connected to a robotic arm via a PLC (programmable logic controller).
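To make "location and size of the defects" concrete, here is a minimal sketch of how a binary segmentation mask could be reduced to a bounding box and a pixel area. The function name `summarize_defect` and the returned field names are hypothetical and are not taken from the project's API; the mask is assumed to be a 2D grid where 1 marks a defective pixel.

```python
from typing import Optional


def summarize_defect(mask: list[list[int]]) -> Optional[dict]:
    """Return the bounding box and pixel area of the defect in a binary mask.

    `mask` is a 2D grid where 1 marks a defective pixel and 0 marks background.
    Returns None when the mask contains no defective pixels.
    """
    # Collect (row, column) coordinates of every defective pixel.
    coords = [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value
    ]
    if not coords:
        return None

    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return {
        "top": min(rows),
        "left": min(cols),
        "bottom": max(rows),
        "right": max(cols),
        "area_px": len(coords),  # hypothetical field name
    }
```

A client on the production line could use a summary like this to decide whether a product should be diverted, without having to process the full mask itself.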
As our demo project, we've selected a very common deployment pattern for this setup: a CV model wrapped in a web API service. Specifically, we'll perform an image segmentation task on a magnetic tiles dataset first introduced in this paper and available in this GitHub repository.
We assume the target audience of this post to be technical folks who are familiar with the general Computer Vision concepts, CI/CD processes, and Cloud infrastructure. Familiarity with the Iterative ecosystem of tools such as DVC, CML, and Studio is not required but would help with understanding the nuances of our solution.
All the code for the project is stored in this GitHub repository.
The CV API solution that we are proposing can be summarized in the following steps:
The repository also contains the code for the web application itself.
While the web application is very simple, its implementation is
beyond the scope of this blog post. In short, it's based on the
FastAPI library, and we deploy it to the
Heroku platform through a Docker container defined in the repository.
Feel free to fork the repository if you'd like to replicate our steps and deploy your own API service. Keep in mind that you'll need to set up and configure the following:
For security reasons, you'll need to set up all keys and tokens through GitHub secrets. You'll also need to change the remote location (and its name) in the DVC config file for versioning data and other artifacts.
A typical ML project would start with data collection and/or labeling, but we are skipping all this hard work because it was done for us by the researchers who published the dataset.
We'll get right to the exciting part of training CV models in Jupyter notebooks which you can find here. In short, there we have three notebooks:
1_ProcessData.ipynb downloads, processes, and organizes the data for easy loading into the training process later
the second notebook uses the fastai Deep Learning framework to train an image segmentation model
3_Evaluate.ipynb computes model performance on the test dataset
Jupyter Notebook is by far the most popular tool for quick exploratory work when
it comes to data analysis and modeling. However, it has
its limitations. One of the
biggest issues with Jupyter is that it has no guardrails to ensure
reproducibility: variables and objects can carry hidden state, and
cells can be run out of order. While there are several projects that
attempt to alleviate some of these issues (notably,
nbQA), they don’t solve them completely.
That's where the concepts of data versioning and ML pipelines come in.
In most ML projects, training data changes gradually over time as new training instances (images in our case) get added while older ones might be removed. Simply creating snapshots of our training data at the time of training (e.g. labeling data directories with dates) quickly becomes unsustainable since these snapshots will contain many duplicates. Additionally, tracking which data directory was used to train each model becomes hard to manage very fast; and linking data versions and models to their respective code versions complicates things even further.
A much better approach is to:
track only the deltas between different versions of the datasets; and
have the project’s git repository store only the reference links to the data while the actual data is stored in a remote storage
This is exactly what we can do with DVC by running only a couple of DVC commands. In turn, DVC handles all the underlying complexity of managing data versions, performing file deduplication, pushing and pulling to/from different remote storage solutions and more.
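To illustrate the idea behind deduplication and lightweight references, here is a simplified sketch of content-addressed storage. It is not DVC's actual implementation: the function names `add_to_cache` and `snapshot`, the cache layout, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def add_to_cache(file_path: Path, cache_dir: Path) -> str:
    """Copy a file into a content-addressed cache and return its hash."""
    digest = hashlib.sha256(file_path.read_bytes()).hexdigest()
    target = cache_dir / digest[:2] / digest[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(file_path, target)
    return digest


def snapshot(data_dir: Path, cache_dir: Path) -> dict[str, str]:
    """Map each file's relative path to the hash of its content.

    The returned mapping is small enough to commit to git, while the
    file contents live in the cache (or in remote storage).
    """
    return {
        path.relative_to(data_dir).as_posix(): add_to_cache(path, cache_dir)
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }
```

Two snapshots of a dataset that share most of their images would share most of their cache entries too, so only new or changed files take up additional space.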
Check out this tutorial to learn more about data and model versioning with DVC.
In this project, AWS S3 is our remote storage, configured in the DVC config
file. In other words, we store the images in an AWS bucket while only keeping
references to those files in our git repository.
Another powerful set of DVC features is ML pipelines. An ML pipeline is a way to codify and automate the workflow used to reproduce a machine learning model. A pipeline consists of a sequence of stages.
First, we did some refactoring of our Jupyter code into individual and self-contained modules:
data_load.py downloads raw data locally
data_split.py splits data into train and test subsets
train.py uses the fastai library to train a UNet model with a ResNet-34 encoder and saves it into a pickle file
eval.py evaluates the model's performance on the test subset
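As an illustration of what a self-contained split stage might do, here is a minimal sketch of a reproducible train/test split. The function name `split_train_test`, its parameters, and the default values are hypothetical; the project's actual data_split.py may work differently.

```python
import random


def split_train_test(items: list[str], test_size: float = 0.2, seed: int = 42):
    """Shuffle items deterministically and split them into (train, test) lists."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0 and 1")
    # Sort first so the result doesn't depend on the order of the input.
    shuffled = sorted(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]
```

Fixing the seed and sorting the input means the same set of files always produces the same split, which matters once the stage becomes part of a reproducible pipeline.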
We've also added an optional check_packages
stage that freezes the environment into a
requirements.txt file containing all
Python packages and their versions installed in the environment. In the
configuration of this stage, we enabled the field that makes DVC rerun the stage every
time. All other stages have this text file as a dependency. Thus, the entire
pipeline will be rerun if anything about our Python environment changes.
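A stage that freezes the environment could be sketched roughly as follows. This is an assumption about how such a script might look, not the project's actual code: the function names are hypothetical, and the real stage may simply call `pip freeze`. We also assume the "always rerun" behavior comes from DVC's `always_changed` stage option, since the field name is not shown above.

```python
from importlib import metadata


def freeze_environment() -> list[str]:
    """Return sorted 'name==version' lines for every installed distribution."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs with missing metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


def write_requirements(path: str = "requirements.txt") -> None:
    """Write the frozen environment to a requirements file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(freeze_environment()) + "\n")
```

Because the output file only changes when a package is added, removed, or upgraded, downstream stages that depend on it are invalidated only when the environment actually changes.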
We can see the whole dependency graph (a directed acyclic graph, to be exact)
with the dvc dag command:
    $ dvc dag
    [ASCII diagram of the pipeline with five stages arranged top to bottom: check_packages, data_load, data_split, train, evaluate. check_packages connects to every other stage.]
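The same graph can be expressed as a mapping from each stage to the stages it depends on, and a valid execution order then follows from a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the exact set of edges is an assumption inferred from the stage descriptions, and this is not how DVC itself stores or resolves the pipeline.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
# The exact edges are an assumption inferred from the stage descriptions.
PIPELINE = {
    "check_packages": set(),
    "data_load": {"check_packages"},
    "data_split": {"check_packages", "data_load"},
    "train": {"check_packages", "data_split"},
    "evaluate": {"check_packages", "data_split", "train"},
}


def execution_order(graph: dict[str, set[str]]) -> list[str]:
    """Return stages in an order where every dependency runs first."""
    return list(TopologicalSorter(graph).static_order())
```

Calling `execution_order(PIPELINE)` yields the stages in dependency order; `TopologicalSorter` raises a `CycleError` if the graph contains a cycle, which is why pipelines are required to be acyclic.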
The entire pipeline can be easily reproduced with the
dvc exp run command:
    $ dvc exp run
    Running stage 'check_packages':
    > python src/stages/check_pkgs.py --config=params.yaml
    ...
    Running stage 'data_load':
    > python src/stages/data_load.py --config=params.yaml
    ...
    Running stage 'data_split':
    > python src/stages/data_split.py --config=params.yaml
    ...
    Running stage 'train':
    > python src/stages/train.py --config=params.yaml
    ...
    Running stage 'evaluate':
    > python src/stages/eval.py --config=params.yaml
    ...
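A pipeline runner only needs to rerun a stage when one of its dependencies has changed. The sketch below shows one simple way to make that decision by comparing file hashes against a previously recorded state. It is a simplified illustration rather than DVC's actual logic, and the function names `record_state` and `needs_rerun` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_state(deps: list[Path]) -> dict[str, str]:
    """Remember the current hash of every dependency after a successful run."""
    return {str(dep): file_hash(dep) for dep in deps}


def needs_rerun(deps: list[Path], recorded: dict[str, str]) -> bool:
    """True if any dependency is missing from the record or has changed."""
    return any(recorded.get(str(dep)) != file_hash(dep) for dep in deps)
```

With requirements.txt listed as a dependency of every stage, a change in the Python environment would make `needs_rerun` return True for all of them.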
In this first part of the blog post, we talked about the following: