Balancing AI performance and safety: Lessons from PyData Berlin

Machine Learning


Hi Nicolas! How was PyData in Berlin? Have you ever attended a conference like this before?

This was my first PyData event and it was a very positive experience. It was different from previous academic conferences where I presented a research paper on Amazon Alexa AI. PyData attracts a lot of practitioners and companies involved in data science and ML with a focus on engineering, development practices, and tools. It brings together a diverse audience from different organizations and open source contributors. I gave a 30-minute talk after submitting a CFP with the ML team.

Nicholas' talk at PyData Berlin 2024

I didn't learn much new about ML best practices, but I realized that the GitGuardian team already had some good practices in place, but I did gain some insight into non-ML development practices and open source contributions, and it was also helpful for networking and sharing information about the company.

Your talk was titled “Would You Trust ChatGPT to Dial 911?” Can you tell us what it was about?

Yes, I chose a catchy title, but the idea I presented was about the growing presence of Machine Learning (ML) in our lives. We often hear that in certain tasks, ML outperforms humans in 90% of cases, but in the remaining 10%, or even 1% or 0.1%, machines can fail. These cases can be critical, such as when calling emergency services. ML models cannot afford to fail even if they perform better than humans in 99.9% of cases. To show the urgency, I looked at statistics from the OECD, which show a 200% increase in AI-related incidents in the past year. These incidents range from minor annoyances to critical errors. For example, a recent case involved Google AI incorrectly summarizing search results. This was a funny case, but there are also serious cases, such as a supermarket's computer vision system falsely flagging a person as a thief. In a more extreme case, there is the infamous Tesla accident. In the talk, we discussed how to safely integrate ML into user-facing systems to minimize such issues.

Have you encountered similar issues with Alexa or calling 911?

In fact, the 911 example is exactly that: when someone asks Alexa to call 911, it goes through various ML models, but ultimately the call is made by a deterministic system without relying on more complex ML predictions to avoid errors.

When we talk about deterministic ML models, aren't ML models generally based on statistics and probability? What defines a deterministic model in ML?

That's a great question. The distinction between deterministic and probabilistic models can be complicated. Variation in ML models occurs in two places: during training or retraining, and in production inference. Although ML models are based on probability, it doesn't mean their behavior is unpredictable. A good ML model should always produce the same output for the same input. Differences in hardware and infrastructure can introduce slight variations, but they are usually negligible.

In training, variability can have a much larger impact. Every time you retrain a model, the inherent randomness in the training process can change performance with much more impact than a simple rule-based system. This unpredictability during retraining is what is often referred to when people say that ML models are not deterministic. ML has the following limitations:CACE Principles“When you change something, you change everything,” he said.

Many people believe that probabilistic models are inherently unpredictable, but that's not true. Deterministic models always produce the same results given the same inputs, whereas probabilistic models involve randomness during training but are typically deterministic in inference. Understanding that ML variation primarily affects the training of the model, not the final output, is key to dispelling these misconceptions.

What do you think are the most common misconceptions about ML today?

A common belief we hear is that machine learning models are unpredictable and uncontrollable. This is a misconception. With proper testing before deployment, we can know the predicted outcomes. Furthermore, there are many ways to control and limit the model's output. There are general techniques and techniques specific to generative AI like Chat GPT. For example, alignment techniques ensure that the model adheres to certain principles during training, or generate outputs limited to a certain vocabulary to ensure consistency.

Another misconception is that machine learning will replace everything and solve all problems. For many use cases, simpler models and regular expressions can be just as effective as ML, less costly, and more reliable. A combination of simple rules and machine learning often provides the best user experience and balance between cost and accuracy.

Is that what will guide ML development at GitGuardian?

That's right. For example, at GitGuardian, we don't replace our secrets detection engine with machine learning. Instead, we use machine learning to efficiently and reliably identify an initial pool of secrets and cost-effectively scan many commits. We then layer on machine learning to add additional context and validation. This hybrid approach leverages the strengths of both systems.

The model is Trust Scorer or recently announced FP Remover These are great examples of blending rule-based systems with machine learning. These models are very robust because they have been extensively and rigorously tested to ensure they meet all requirements and do not perform poorly in important use cases. We expose the model to a wide variety of data during training to address potential differences in real-world inputs. This is part of making the model more error-resistant.

ML-powered feature FP Remover reduces false positives by 50%

GitGuardian is taking the accuracy of its secrets detection engine to new heights. Using machine learning to improve detection, we've cut the number of false positives in half, allowing security and engineering teams to significantly reduce the time they spend reviewing and dismissing false positives.

Finally, we use advanced monitoring tools in our production environment to quickly identify and resolve issues so that we can address or roll back any issues immediately.

Recently, I Arnaud (Principal ML Engineer at GitGuardian) said something interesting: At GitGuardian, he said that rule-based models can already be considered AI systems. What do you think about this?

I agree with Arnaud. When people talk about AI, they think of the latest technology like ChatGPT and get fascinated by the hype. But AI doesn't necessarily mean the latest fancy technology. It includes any system that performs a task better or more efficiently than humans. By that definition, our secret detection engine fits perfectly as AI. Machine learning is just a small subset of AI.

The main difference between machine learning and other forms of AI is that machine learning models learn rules from data, while in other AI systems the rules may be explicitly coded. However, both fall under the umbrella of AI. For example, Netflix uses a complex rule-based system for recommendations, but no one would argue that it is not AI just because it relies on predefined rules. There are many other examples of such systems that are widely accepted as AI, even though they are not purely machine learning based.

Any comments specifically about generative AI? Does mixing deterministic and probabilistic models add complexity or is it a completely different challenge?

Generative AI makes it increasingly difficult to control and ensure the reliability of a model. Unlike traditional models that choose from a limited set of categories, these models can generate a wide range of outputs, such as phrases, images, etc., so the output space is vast and less constrained. They are also often multimodal, taking in multiple types of inputs, such as text, audio, and images. The wider the generation or prediction space, the higher the risks and the harder it is to control.

Given these complexities, new solutions are being developed to address the risks specific to generative AI. For example, guardrails for large language models (LLMs) aim to control what LLMs can produce and verify whether the output is valid. At AI companies, red teams specifically try to defeat LLMs, finding ways to produce undesirable outcomes.

On the other hand, you can add layers to make your model safer, but if you overdo it you might make it unusable. A good example is this model: Goody 2The developers brazenly and correctly call this language model “ridiculously safe” – a language model trained so rigorously to avoid offensive language and potential problems that it has become virtually useless. Ask it about the weather in Paris and it will respond with something like, “I can't answer this question because it might affect your decision to go out or not.”

There are an incredible number of errors occurring and even the biggest companies seem to struggle with this. I'm talking about Google, but it's the same for all companies.

This is quite worrying, but it's not good to be overly pessimistic and assume that everything is doom and gloom. This problem is exacerbated by generative AI systems, which are becoming more and more integrated into our daily lives and will soon be everywhere, including our iPhones. This ubiquity increases the risk, as these models touch many more aspects of our lives than, say, an isolated system used by a specific client. The chances of running into problems are exponentially higher.

Besides the risks, what are the main advantages compared to traditional techniques and the usual simpler model combinations?

This new era of generative models, while fundamentally the same as before, offers huge opportunities as development and resource allocation evolve. Previously, we had small expert systems trained to perform specific tasks very well. Now, these models can handle a wide range of tasks without explicit training, quickly achieving human-like performance. Moreover, these models are multimodal, handling text, image, and even mathematical tasks simultaneously. This is a fundamental leap. To me, this versatility starts to resemble “intelligence”, but frankly, these systems still often fall short when it comes to other important aspects of intelligence, such as reasoning.

Finally, do you have any tips or recommendations for people who want to integrate ML into their projects?

My first recommendation is a bit cliché, but still very relevant: “Don't do ML if you don't have to.” If you can solve your problem with a simpler rules-based system that is cheaper and more efficient, then do it.

The second recommendation is to have a deep understanding of user experience. Often times, people working in ML and data science focus so much on improving model performance on a dataset that they forget about what really matters to the user. This ties into my talk about certain data points being more important than others. For example, having Alexa accurately respond to the question “play Spotify” is much more important than telling jokes. Therefore, it is important for anyone considering incorporating ML into their project to have a product manager who is focused on UX and product needs, and align metrics and training as much as possible with user satisfaction.

For example, at GitGuardian, our strategy is to optimize models that directly impact user experience and product performance – we don't waste resources on improving detection features that no one uses or issues that don't impact user satisfaction. Direct input from product managers helps us keep our models relevant.

Finally, in addition to developing good models, it is also important to have robust testing and monitoring in place. ML engineers need to understand that they are engineers too. Good testing and monitoring practices are just as important as developing the model itself. These principles are the same as for any engineering task. Make sure your testing and monitoring tools and guidelines are best in class to maintain model performance and reliability.

Thank you Nicholas, that was very helpful!

thank you!

***This is a Security Bloggers Network syndicated blog from the GitGuardian Blog – Code Security for the DevOps generation by Thomas Segura. Read the original post here: https://blog.gitguardian.com/balancing-ai-performance-and-safety-lessons-from-pydata-berlin/



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *