According to Camilla Montonen, the challenges in building machine learning systems primarily revolve around creating and maintaining models. MLOps platforms and solutions include the components needed to build machine learning systems, but MLOps is not about tools. It is a culture, a set of customs. Montonen suggests that we need to bridge the gap between data science and machine learning engineering practices.
Camilla Montonen spoke about building machine learning systems at NDC Oslo 2023.
Challenges associated with deploying machine learning systems into production include how to clean, curate, and manage model training data; how to train and evaluate models efficiently; and how to ensure that models perform well in production, which, Montonen said, includes ways to measure whether they continue to maintain their performance over time. Other challenges include how to compute and deliver the predictions a model makes on new data, how to handle missing or corrupted data and edge cases, when and how to efficiently retrain the model, and how to version models and store those different versions, she added.
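The handling of missing or corrupted data mentioned above can be illustrated with a minimal validation step. This is a sketch, not something from the talk: the `validate_rows` function, its field names, and its sanity bounds are all illustrative assumptions.

```python
def validate_rows(rows, required_fields, numeric_ranges):
    """Split incoming rows into clean rows and rejects.

    rows: list of dicts, e.g. raw feature records.
    required_fields: fields that must be present and non-None.
    numeric_ranges: {field: (lo, hi)} sanity bounds for numeric fields.
    """
    clean, rejects = [], []
    for row in rows:
        problems = [f for f in required_fields if row.get(f) is None]
        for field, (lo, hi) in numeric_ranges.items():
            value = row.get(field)
            if value is not None and not (lo <= value <= hi):
                problems.append(field)
        (rejects if problems else clean).append(row)
    return clean, rejects


rows = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": -5},     # corrupted: age outside sane range
    {"user_id": None, "age": 40},  # missing required field
]
clean, rejects = validate_rows(rows, ["user_id"], {"age": (0, 120)})
```

Quarantining rejects rather than silently dropping them keeps the corrupted rows available for the diagnosis step Montonen recommends.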
Montonen explained that machine learning systems typically share a set of common components: a feature store, an experiment tracking system that allows data scientists to easily version the different models they create, a model registry or model versioning system that tracks which models are currently deployed in production, and data quality monitoring systems that detect potential data quality issues. These components are now part of many of the MLOps platforms and solutions available on the market, she added.
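The model registry component can be sketched as a toy in-memory class; this is illustrative only (real registries persist their state and add access control), and the model name and artifact URIs are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Toy in-memory model registry: versions models and tracks
    which version is currently deployed in production."""
    _versions: dict = field(default_factory=dict)    # name -> list of entries
    _production: dict = field(default_factory=dict)  # name -> version number

    def register(self, name, artifact_uri, metrics):
        versions = self._versions.setdefault(name, [])
        version = len(versions) + 1
        versions.append({"version": version, "uri": artifact_uri,
                         "metrics": metrics})
        return version

    def promote(self, name, version):
        """Mark a version as the one deployed in production."""
        self._production[name] = version

    def production_model(self, name):
        version = self._production[name]
        return self._versions[name][version - 1]


registry = ModelRegistry()
registry.register("churn", "s3://models/churn/v1", {"auc": 0.81})
v2 = registry.register("churn", "s3://models/churn/v2", {"auc": 0.84})
registry.promote("churn", v2)
```

Even this small amount of structure answers the question "which model is running in production right now?" without digging through deployment logs.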
Montonen points out that while tools and components do solve the problems they are designed for, in a typical enterprise the evolution of machine learning systems is often dominated by external factors, challenges that she said lie outside the realm of purely technical problems.
MLOps isn't about tools, Montonen argued; it's about culture. It's not just about adding a model registry or feature store to the stack, but about how the people building and maintaining the system interact with it, and about minimizing any points of friction, she explained.
This includes thinking about Git hygiene for ML code repositories, designing ways to test individual components of the pipeline, and maintaining feedback loops between data science experimentation and production environments; in short, maintaining high engineering standards throughout the system's code base.
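Testing individual pipeline components can start as ordinary unit tests. As a sketch (the imputation step and its name are invented for illustration), one small transformation and its test might look like:

```python
def fill_missing_with_median(values):
    """One small pipeline step: impute missing numeric values with
    the median of the observed ones (illustrative transformation)."""
    observed = sorted(v for v in values if v is not None)
    if not observed:
        raise ValueError("no observed values to impute from")
    mid = len(observed) // 2
    median = (observed[mid] if len(observed) % 2
              else (observed[mid - 1] + observed[mid]) / 2)
    return [median if v is None else v for v in values]


# A unit test for the step, runnable with pytest or plain assert:
def test_fill_missing_with_median():
    assert fill_missing_with_median([1.0, None, 3.0]) == [1.0, 2.0, 3.0]
```

Because the step is a pure function, the same test runs identically in the experimentation environment and in CI, which is one concrete way to keep the feedback loop between the two.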
We must strive to bridge the gap between data science practices, which prioritize rapid experimentation and iteration, and machine learning engineering practices, which prioritize robust, production-quality code deployed via version control, managed delivery, CI/CD pipelines, and automated testing, and written thoughtfully so it can be maintained over a longer period of time, Montonen said.
Rather than immediately deploying a slew of MLOps tools, which are likely to complicate rather than solve problems, Montonen suggested going back to basics.
Start by honestly diagnosing why your machine learning team is struggling.
Montonen concludes that the biggest gains in development speed and production reliability for data scientists come from surprisingly basic and easy investments in testing, CI/CD, and Git hygiene.
InfoQ interviewed Camilla Montonen about building machine learning systems.
InfoQ: To what extent do currently available MLOps tools and components solve problems faced by software engineers?
Montonen: Most major MLOps tool providers were born out of projects started by engineers training large-scale language models or computer vision models, and they are well-suited for these use cases. They fail to account for the fact that most small and medium-sized companies, which are not large technology companies, do not train SOTA computer vision models. Instead, they build models that predict customer churn and help users discover interesting items.
In these particular cases, off-the-shelf components are often not flexible enough to account for the many idiosyncrasies that accumulate in ML systems over time.
InfoQ: What advice would you give to companies struggling to implement machine learning systems?
Montonen: Find out what your machine learning team is struggling with before implementing a tool or solution.
Is your codebase so complex that data scientists are deploying ML pipeline code from their local machines to production, making it difficult to track which code changes are running in production and to identify the changes that cause bugs? Perhaps you need to refactor and invest in proper CI/CD processes and tools.
Is your new model performing poorly in online A/B tests compared to your production model, while you have no insight into why? Perhaps you should invest in a simple dashboard that tracks key metrics.
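Such a dashboard can begin as little more than a log of key metrics per model and a comparison between candidate and production. This sketch is an illustration of the idea, not anything Montonen prescribed; the `MetricTracker` class, model names, and metric values are assumptions.

```python
from collections import defaultdict


class MetricTracker:
    """Minimal metric log backing a 'simple dashboard'."""

    def __init__(self):
        # (model, metric) -> values over time
        self._series = defaultdict(list)

    def log(self, model, metric, value):
        self._series[(model, metric)].append(value)

    def latest_gap(self, metric, candidate, production):
        """Latest candidate-minus-production gap for a metric;
        a persistent negative gap is a cue to dig into the data."""
        return (self._series[(candidate, metric)][-1]
                - self._series[(production, metric)][-1])


tracker = MetricTracker()
tracker.log("prod", "ctr", 0.041)
tracker.log("candidate", "ctr", 0.035)
gap = tracker.latest_gap("ctr", "candidate", "prod")  # negative: behind
```

Seeing which metric regressed, and when, narrows the search before any heavier MLOps tooling is considered.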
Diagnosing the current problem will help you identify the tools that actually solve it and reason about the trade-offs. Most MLOps tools come with learning, maintenance, and integration costs, so it is worth confirming that the problem you are trying to solve with an MLOps tool justifies these trade-offs.
