This month you will find:
🧱 AI infrastructure is becoming modular,
🔎 Distribution drift with Chip Huyen,
🥰 Tutorials and workflows from the Community,
🗣 IRL events,
💻 Online course(s) updates,
🚀 Quebec becoming an Iterative hub, and more!
This month's Heartbeat image is inspired by Community member Gudmundur Heimisson. Gudmundur submitted some great PRs to update WebHDFS docs pending some other issues in the DVC repo.
This image reflects his Paris-area team's view of Château de Vincennes from their company windows!
We are grateful for all our Community members' contributions from all around the world!
Welcome to April! We have lots to ingest from the AI World and the Community so let's get started with all the building blocks for success!
Casber Wang of Sapphire VC recently wrote a piece on Medium about the necessary trend of AI infrastructure tooling becoming modular. He identifies three types of AI users: "Off-the-shelfers," "Bet-the-Farmers," and "Rocket Scientists." As the industry matures, he makes the case (and we concur) that modular infrastructure tooling gives AI teams the most flexible approach as they fine-tune their advancing and ever-growing processes.
Where organizations used to seek all-in-one solutions to operationalize machine learning (ML) due to limited in-house resources and expertise, we’re seeing a rise in the demand for modular, best-in-class tooling that equips today’s more robust ML teams with the ability to flexibly run highly-custom and performant ML workloads.
Clayton Christensen's Modularity Theory (Source link)
Soon, large data teams will turn to modular toolkits with dozens of solutions that manage different stages of the AI lifecycle. This will be particularly true of the “bet-the-farmers”, who will need customized, best-in-class tools that provide the flexibility that can match their exact challenge.
Wang describes the different toolchain groupings in the AI Lifecycle and discusses some of the players in each of them. DVC shows up in the Model Evaluation & Experiment Tracking group, but soon you will see that our tools deliver flexible, modular building blocks for some other pieces of the puzzle.
Chip Huyen's most recent blog post, created for the Stanford course CS 329S: Machine Learning Systems Design, goes into detail on data distribution shifts and how to monitor them. The piece provides great examples for understanding concepts such as natural labels, the types of distribution shifts, causes of ML system failure, and the metrics needed to determine when your model is no longer producing the desired results. She discusses tools that can help identify these shifts, including logs, dashboards, and alerts, acknowledging the pluses and minuses of each approach. Finally, she discusses why the term observability is increasingly favored over monitoring: it is a stronger concept, concerned with determining what went wrong in the internal states of a system by observing its external outputs.
Drift detection algorithms by open-source package alibi-detect (Source link)
Preventing Stale Models in Production (Source link)
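To make the idea of monitoring for distribution shift a bit more concrete, here is a minimal sketch of one common check: comparing a feature's training-time values against recent production values with a two-sample Kolmogorov-Smirnov statistic. This is our own illustration, not code from the post. The function names and the `alpha` default are hypothetical, the threshold uses the standard large-sample approximation, and the code sticks to the Python standard library. In a real project you would more likely reach for an existing implementation such as `scipy.stats.ks_2samp` or a package like alibi-detect.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def drift_detected(reference, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds the approximate critical value."""
    n, m = len(reference), len(current)
    # Large-sample approximation: c(alpha) * sqrt((n + m) / (n * m)).
    c_alpha = math.sqrt(-math.log(alpha / 2) / 2)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


if __name__ == "__main__":
    training = list(range(100))              # stand-in for training-time values
    production = [x + 50 for x in training]  # the same feature, shifted upward
    print(drift_detected(training, training))
    print(drift_detected(training, production))
```

A check like this only looks at one feature at a time, so it is a starting point for an alert rather than a full monitoring setup.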
The team at xpresso.ai created this short post about all the facets that make up MLOps. While the tried and true CRISP-DM model for Data Science takes us right up to production, MLOps encompasses considerably more processes that keep and maintain a model in production over time. You can see all of these things highlighted in their image below, providing lots to ponder!
Machine Learning Operations (Source link)
If your head is spinning from all the facets of the MLOps world outlined in xpresso.ai's diagram above, and you are wondering where you fit in it or in the AI world in general, Kaushik Shakkari wrote this article dividing the AI space into three environments: Research, Development, and Production. He goes into detail about the type of work, skillsets, and roles found in each. This breakdown can help readers zero in on where they may best fit and be fulfilled in this vast and often confusing space, as well as determine a pathway for their career.
Yahaswi Nayak writes a wonderful guide for data scientists and engineers, filled with great story-telling and fun images created by the author about using CML to provide CI/CD to ML projects. He discusses the usual software development cycle using Git and then follows with the complexities introduced by ML projects. He identifies the reasons why CML is needed in the ML space, and how CML works.
Yahaswi gives the scenario of a team working on a classifier problem and how CML would work for different team members tackling different parts of the problem. He details all the questions a CML.yml file answers and takes care of in the workflow. Finally, he lists a number of use cases for readers to try out with CML. We'd love to see some Community members write about some of these use cases that they've put into action!
CML workflow (Source link)
If you've ever struggled with setting up your Azure Blob Storage with DVC, or you know you will need to in the near future, you're in luck! Zoumana Keita shows you how to do just that in this post in Towards Data Science. He was recently struggling with the same problem, and team member David de la Iglesia Castro came to the rescue on our Discord server. Zoumana was kind enough to write a blog article on the detailed steps for the benefit of the Community.
At this point in this Heartbeat, you probably grasp the importance of data, model, and experiment versioning and how DVC easily versions large files in conjunction with Git, which Zoumana describes. But he then takes you on a detailed journey with screenshots of all the steps to get DVC set up with Azure Blob Storage. Many thanks for this tutorial! 🙏🏼
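For readers who are new to the idea, the versioning approach mentioned above can be sketched in a few lines of Python. This is a simplified illustration of content-addressed storage with a small pointer file that could be committed to Git in place of the large file. It is not DVC's actual implementation: `track_file`, `restore_file`, and the `.pointer.json` format are hypothetical names made up for this example.

```python
import hashlib
import json
import shutil
from pathlib import Path


def _cache_path(cache_dir: Path, digest: str) -> Path:
    # Store each version under a path derived from its content hash.
    return cache_dir / digest[:2] / digest[2:]


def track_file(data_path: Path, cache_dir: Path) -> Path:
    """Copy a file into the cache and write a small pointer file next to it."""
    digest = hashlib.md5(data_path.read_bytes()).hexdigest()
    cached = _cache_path(cache_dir, digest)
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_path, cached)
    # The pointer is tiny, so it is cheap to keep under Git.
    pointer = data_path.parent / (data_path.name + ".pointer.json")
    pointer.write_text(json.dumps({"md5": digest, "path": data_path.name}))
    return pointer


def restore_file(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    target = pointer.parent / meta["path"]
    shutil.copy2(_cache_path(cache_dir, meta["md5"]), target)
    return target
```

Checking out an older commit brings back an older pointer file, and `restore_file` then recovers the matching data from the cache. In this sketch the cache is a local folder; a remote such as Azure Blob Storage plays the same role for a whole team.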
Ahmed Abdullah wrote this tutorial on Medium about how to get DVC set up to version your data and models with Google Drive. He takes you through the steps in detail and discusses many of the reasons why this versioning is important to your success as an ML engineer, including ever-changing data, effective collaboration with teammates, and the need to keep data separated from code for security reasons.
In-person conferences are going on and we are excited to be able to see the Community in person again!
We've grown from 250 students last month to 450 right now! 🎉 We are so happy to see you all on the platform learning! What's coming:
Thank you to all who have provided feedback after each course module! We are going through this feedback, making adjustments, and keeping it in mind for the next course!
Dan Martinec joins us from the Czech Republic as a Field Data Scientist. Dan first learned about Iterative through using DVC in his work as an ML Engineer. Dan originally studied Control Engineering at CTU in Prague. He graduated with a PhD and has worked in various fields (C++ development at Porsche, mathematical optimization in a small start-up, ML engineer at Avast). When not working, Dan enjoys hobby projects in the garden such as building his own storage lodge for firewood, building a wooden composter, implementing a wireless water level reader in the water tank, etc. And after that hard work, he is known to appreciate a good movie. Welcome, Dan!
Yury Kasimov also joins us from Prague, the Czech Republic, as a Field Data Scientist. He studied Robotics during his Bachelor's studies and then Artificial Intelligence for his Master's degree. Yury worked for some time as part of a university group that helps protect NGOs from different cyber attacks. Prior to joining the team, he spent the last 4 years as an ML engineer at Avast. In his free time, Yury plays a lot of tennis and is learning to play the drums. He speaks English, Czech, Russian, and a bit of Spanish. Bienvenido, Yury!
Chaz Black joins us as an Account Executive from Atlanta, Georgia. Most recently he worked at H2O.ai leading their business development team for 3 years. When Chaz is not helping clients, you may find him checking out the ever-growing Atlanta food scene and hunting new and exciting coffees and brewing styles. He is also a big audiophile and like many on our team, Chaz enjoys board and video games when he has the time, with his two cats looking over his shoulder. Welcome, Chaz!
Many in our Community already know our latest hire, Daniel Barnes, as he has already been a great contributor to our tools! We are excited to welcome him officially to the team as a Software Engineer. Daniel is based in the Seattle, Washington area, having recently moved back after two years in Korea. He has had a varied career path: IT security, programming, work as a medic, then cyber in the US military, and then PACCAR, where he discovered our open-source community! When not solving complex software engineering challenges, Daniel has been noted as a bit of an adrenaline junkie with "hobbies" like skydiving, paragliding, and motorcycles. Welcome, Daniel!
Maxim Aginsky joins the team as a Senior Product Designer from Montreal, Canada, marking our 4th employee from the Province of Quebec! Maxim has worn many hats over the years working on Product Development and most recently was the Director of Design for a Montreal Fintech company. You can explore his portfolio here. Welcome, Maxim!
Even with our amazing new additions to the team, we're still hiring! Use this link to find details of all the positions and share with anyone you think may be interested! 🚀
Iterative is Hiring (Source link)
We've been following along on Anna's journey through #100daysofcode to learn DVC. And now she's working on a project of her own using Amazon Best Seller data.
This ensures:
- Serves as a basis for perfoming CI/CD for ML.
👉 Since I love pipelines, I have been creating my own using a dataset and a notebook from Kaggle. It is still in progress, but here is the link 👇: https://t.co/E09g6WSYor
Anna 🐍 🌹 (@__anavc__), March 26, 2022
Do you have any use case questions or need support? Join us in Discord!
Head to the DVC Forum to discuss your ideas and best practices.