Unlearning to build better AI apps



Product strategies from traditional ML to adapt (or abandon) for the world of generative AI

Sam Stone
Towards Data Science

The first piece of advice my boss at Opendoor gave me a few years ago was simple: “Invest in backtesting. AI product teams succeed or fail based on the quality of their backtesting.” It was proven advice, learned the hard way by teams working on search, recommendations, life sciences, finance, and other high-stakes applications, and I stuck with it for nearly a decade.

But I've come to believe that this advice, like much other traditional ML wisdom, is no longer self-evident for teams building generative AI products. A year ago, I switched from traditional ML products (which produce simple outputs: numbers, categories, ordered lists) to generative AI products. In the process, I realized that many principles of traditional ML no longer served me or my team.

In my work at Tome, where I serve as Head of Product, and through conversations with leaders at generative AI startups, I've recognized three behaviors that distinguish the teams shipping the most powerful and useful generative AI capabilities. These teams:

  1. Work backwards (from the user problem) and forwards (from the technology opportunity) at the same time
  2. Design low-friction feedback loops from the start
  3. Rethink the R&D toolkit inherited from traditional ML

These behaviors require “unlearning” many things that remain best practices for traditional ML. Some may seem counterintuitive at first, yet they apply broadly to generative AI applications, from horizontal to vertical software, and from startups to incumbents. Let's take a closer look.

(You might be wondering why automated backtesting is no longer a core principle for generative AI application teams, and what replaces it. See principle #3.)

(More interested in how the UI/UX of a generative AI app should differ from traditional ML products than in the development process behind it? Check out this blog post.)

Work backwards and forwards at the same time

“Working backwards” from the user problem is a mantra in many product and design circles, made famous by Amazon: study your users, hypothesize their pain points, write UX requirements that alleviate the biggest pain points, identify the best technology to implement them, and iterate. In other words, figure out “this is the most important nail to drive, now which hammer should we use?”

This approach makes less sense when the enabling technology is evolving rapidly. ChatGPT was not built by working backwards from a user pain point. It took off because it exposed a powerful new enabling technology through a simple, open-ended UI. In other words, “we've invented a new hammer, let's see which nails users drive with it.”

The best generative AI application teams work backwards and forwards simultaneously. They do user research to understand the breadth and depth of pain points, but they don't simply work their way down a ranked list. Everyone on the team, including PMs and designers, stays deeply engaged with the latest AI advances. The team connects these technical opportunities, as they unfold, to user pain points, often in ways more complex than a one-to-one mapping. For example, the team may realize that pain points #2, #3, and #6 can all be mitigated by model breakthrough X; for the next project, it may then make more sense to “work forwards” by incorporating breakthrough X than to “work backwards” from pain point #1.

Getting deep into recent AI advances means going beyond reading research papers to understanding how they apply to real applications, and that requires prototyping. Until you try a new technology in your application environment, estimates of user benefit are just guesses. The growing importance of prototyping inverts the traditional spec → prototype → build process into prototype → spec → build. More prototypes will be discarded, but that's the only way to consistently spec features that match useful new technology to broad and deep user needs.

Feedback to improve the system

Traditional ML products produce relatively simple output types, such as numbers, categories, or ordered lists, and users tend to accept or reject these outputs – clicking a link on a Google search results page or marking an email as spam. The data provided from each user interaction is fed back directly into retraining the model, so the connection between actual usage and model improvement is strong (and mechanical).

Unfortunately, most generative AI products do not generate new ground-truth training data with every user interaction. This challenge stems from what makes generative models so powerful: their ability to produce complex artifacts that combine text, images, video, audio, code, etc. With a complex artifact, users rarely make a “take it or leave it” decision. Instead, they refine the output with more (or different) AI, or by hand. For example, a user may copy ChatGPT output into Word, edit it, and send it to a colleague, so the application (ChatGPT) never “knows” the final, desired form of the artifact.

One implication is to let users iterate on the output within your application. But that alone doesn't solve the problem: if a user doesn't iterate on an output, does that mean it was great or disappointing? You could attach sentiment indicators (e.g., thumbs up/down) to each AI response, but response rates to interaction-level feedback tend to be very low, and the responses that do come in skew toward the extremes. In most cases, sentiment collection doesn't help users get a better result right away, so they perceive it as unnecessary friction.

A better strategy is to identify the step in the user's workflow that signals “this output is good enough,” build that step into your app, and log what the output looked like at that point. For Tome, which uses AI to help users create presentations, a key signal is sharing the presentation with someone else. To capture it, we've invested heavily in sharing features, and we then evaluate which AI outputs were shareable as-is and which required significant manual editing before sharing.
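To make this concrete, here is a minimal sketch of what logging that “good enough” moment might look like. The event fields and helper names are illustrative assumptions, not Tome's actual implementation.

```python
# Sketch: log implicit feedback when the user reaches the "good enough" step
# (here, sharing). Event fields and storage are hypothetical.
from dataclasses import dataclass, asdict
import json
import time


@dataclass
class ShareFeedbackEvent:
    doc_id: str
    prompt: str            # the prompt that produced the original AI draft
    ai_output_chars: int   # size of the AI-generated draft
    edited_chars: int      # rough measure of how much the user changed it
    shared_at: float       # unix timestamp of the share action


def log_share_event(doc_id: str, prompt: str, ai_draft: str, final_doc: str) -> None:
    """Record what the artifact looked like when the user deemed it good enough to share."""
    edit_proxy = abs(len(final_doc) - len(ai_draft))  # crude proxy; a real diff would be better
    event = ShareFeedbackEvent(
        doc_id=doc_id,
        prompt=prompt,
        ai_output_chars=len(ai_draft),
        edited_chars=edit_proxy,
        shared_at=time.time(),
    )
    # Append to a local log file; in production this would go to an analytics pipeline.
    with open("share_feedback.jsonl", "a") as f:
        f.write(json.dumps(asdict(event)) + "\n")
```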

Feedback to assist users

Free text has emerged as users' preferred way to interact with generative AI applications. But free text is a Pandora's box: give users a free-text input to your AI and they will ask for all sorts of things your product can't do. Free text is a very hard mechanism for communicating a product's constraints; by contrast, old-school web forms make it very clear what information can and must be submitted, and in exactly what format.

Users don't want to go back to forms for creative or complex work, though; they want free text, plus guidance on how to craft a good prompt for the task at hand. Helpful strategies include example prompts or templates, guidance on optimal prompt length and format (should the user include a short example?), and human-readable error messages (e.g., “This prompt was in language X, but only languages Y and Z are supported.”).
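In code, much of this guidance can live in a validation pass that runs before the prompt ever reaches the model. The sketch below is only an illustration: the supported languages, length limits, and detect_language stub are all assumptions.

```python
# Sketch: prompt validation with human-readable guidance. The language list,
# length limits, and detect_language stub are illustrative placeholders.
SUPPORTED_LANGUAGES = {"en", "es"}
MAX_PROMPT_CHARS = 2000


def detect_language(prompt: str) -> str:
    """Placeholder: in practice, call a language-detection library or model."""
    return "en"


def validate_prompt(prompt: str) -> list[str]:
    """Return user-facing guidance messages; an empty list means the prompt looks fine."""
    messages = []
    if len(prompt.strip()) < 10:
        messages.append(
            "Your prompt is very short. Try describing the audience and goal, e.g. "
            "'A 5-slide pitch for a seed investor about a home-renovation marketplace.'"
        )
    if len(prompt) > MAX_PROMPT_CHARS:
        messages.append(
            f"Your prompt is over {MAX_PROMPT_CHARS} characters. A shorter prompt "
            "with one clear ask tends to work better."
        )
    language = detect_language(prompt)
    if language not in SUPPORTED_LANGUAGES:
        messages.append(
            f"This prompt appears to be in '{language}', but only English and "
            "Spanish are currently supported."
        )
    return messages
```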

One advantage of free-text input is that unsupported requests can be a great source of inspiration for what to build next. The trick is being able to identify and categorize what users are trying to do in their free text. More on this in the next section.

Build, keep, and discard

Build: Natural Language Analysis

Many generative AI applications let users pursue very different workflows from the same entry point: an open-ended, free-text interface. Users aren't selecting “I'm brainstorming” or “I want to solve a math problem” from a dropdown; the workflow they want is implicit in their text input. So to understand what workflows users want, we need to segment that free-text input. Some segmentation dimensions are durable: at Tome, we're always interested in the intended language and audience type. Others are ad hoc, answering specific questions on our product roadmap, for example: how many prompts request visual elements like images, videos, tables, or charts, and therefore which visual elements should we invest in?
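One lightweight way to do this segmentation is to ask an LLM to classify each prompt along the attributes you care about. The sketch below assumes a generic call_llm helper and an illustrative label set; it is not any particular vendor's API.

```python
# Sketch: ad-hoc prompt segmentation via an LLM classifier. `call_llm` is a
# stand-in for whatever completion API you already use; the labels are examples.
import json

SEGMENTATION_PROMPT = """Classify the user prompt below. Return JSON with keys:
  "language": ISO code of the prompt's language,
  "audience": one of ["executives", "customers", "students", "other"],
  "visuals_requested": a list drawn from ["image", "video", "table", "chart", "none"].
User prompt: {user_prompt}"""


def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client (OpenAI, Anthropic, a local model, etc.)."""
    raise NotImplementedError


def segment_prompt(user_prompt: str) -> dict:
    raw = call_llm(SEGMENTATION_PROMPT.format(user_prompt=user_prompt))
    return json.loads(raw)  # in practice, validate or repair the JSON before trusting it

# Running segment_prompt over a sample of prompts produces rows you can aggregate,
# e.g. counting how often charts are requested.
```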

Natural language analysis should complement, not replace, traditional analytics; it's especially powerful when combined with structured data. A lot of the data that matters isn't free text: when did the user sign up, and what are their attributes (organization, job function, geography, etc.)? At Tome, we often look at language clusters by job function, geography, and free/paid status, all of which requires traditional SQL.
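As a sketch of what that combination looks like, the segment labels produced above can be joined against user attributes with plain SQL. The database, table, and column names below are hypothetical.

```python
# Sketch: joining prompt segments with structured user attributes via SQL.
# The database, tables, and columns are hypothetical.
import sqlite3

conn = sqlite3.connect("analytics.db")
query = """
SELECT u.job_function,
       u.plan,                         -- free vs. paid
       s.language,
       COUNT(*) AS prompt_count
FROM prompt_segments AS s
JOIN users AS u ON u.user_id = s.user_id
WHERE s.created_at >= DATE('now', '-30 days')
GROUP BY u.job_function, u.plan, s.language
ORDER BY prompt_count DESC;
"""
for row in conn.execute(query):
    print(row)
```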

Quantitative insights are useless without qualitative ones. Watching users interact with the product live can sometimes yield 10x more insight than a user interview (where users describe their experience after the fact), and we've found scenarios where one good user interview yields 10x more insight than quantitative analysis.

Keep: A low-code prototyping tool

Two types of tools, prototyping tools and output quality assessment tools, enable fast and high-quality generative AI app development.

There are many ways to improve an ML application, but one fast and accessible strategy is prompt engineering: fast because it requires no model retraining, and accessible because it works in natural language rather than code. Giving non-engineers a way to experiment with prompt engineering (in a development or local environment) can greatly improve speed and quality. Often this can be done in a notebook; the notebook may contain a lot of code, but a non-engineer can make significant progress by iterating on the natural-language prompts without touching any of it.
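A minimal version of this setup keeps the prompt as a plain string at the top of the notebook, so a PM or designer can edit it and re-run the cells below without touching the plumbing. The helper names here are assumptions, not a specific framework.

```python
# Sketch of a notebook layout that lets non-engineers iterate on prompts.
# `call_llm` is a placeholder for the project's existing LLM client.

# --- Cell 1: the only thing a non-engineer edits ---------------------------
PROMPT_TEMPLATE = """You are helping create a presentation.
Audience: {audience}
Topic: {topic}
Write an outline with 5 slide titles and a one-sentence description for each."""

# --- Cell 2: fixed plumbing, left alone during prompt iteration ------------
def call_llm(prompt: str) -> str:
    """Placeholder for the project's existing LLM client call."""
    raise NotImplementedError


def generate_outline(topic: str, audience: str) -> str:
    prompt = PROMPT_TEMPLATE.format(topic=topic, audience=audience)
    return call_llm(prompt)

# Example usage:
# print(generate_outline("Q3 product roadmap", "executives"))
```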

Evaluating a prototype's output quality is often very hard, especially for an entirely new feature. Rather than investing in automated quality measurement, we've found it much faster and more useful to enlist colleagues and users in a “beta tester program” and collect 10-100 structured evaluations (a score plus notes). The tooling behind this “voting” approach is lightweight: a notebook that generates input/output examples at moderate scale and pipes them into a Google Sheet. That lets us parallelize manual evaluation, so a handful of people can typically assess about 100 examples in under a day. The evaluators' notes are an added bonus, surfacing patterns of failure and excellence, and they tend to be more useful than the numerical scores for deciding what to fix or build next.
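The tooling really can be that small. Here is a sketch of generating an evaluation batch as a CSV that gets imported into a shared Google Sheet; generate_output stands in for whatever pipeline is being evaluated.

```python
# Sketch: build a manual-evaluation batch. `generate_output` is a placeholder
# for the candidate model/prompt pipeline under evaluation.
import csv


def generate_output(prompt: str) -> str:
    """Placeholder for the candidate pipeline being evaluated."""
    raise NotImplementedError


def build_eval_sheet(prompts: list[str], path: str = "eval_batch.csv") -> None:
    """Write prompt/output pairs plus empty score and notes columns for evaluators."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt", "output", "score (1-5)", "notes"])
        for p in prompts:
            writer.writerow([p, generate_output(p), "", ""])
    # Import the CSV into a shared Google Sheet and assign evaluators row ranges;
    # a handful of people can usually cover ~100 rows in well under a day.
```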

Discard: Automated, back-tested quality measures

A traditional ML engineering tenet is to invest in robust backtesting. Teams retrain traditional models frequently (weekly or daily) and with good backtesting, only good new candidates are released into production. This makes sense for models that output numbers or categories. These models can be easily scored against a ground truth set.

But with complex (potentially multi-modal) output, scoring accuracy becomes hard. You might have a piece of text you consider great and be tempted to call it “ground truth,” but if the model's output is one word off from it, is that meaningful? One sentence off? What if the facts are all the same but the structure is different? What if there's text and imagery together?

However, all is not lost. Humans tend to find it easy to judge whether generative AI output meets their quality bar. That doesn't mean it's easy to turn bad output into good output, just that users can usually tell within seconds whether text, an image, audio, etc. is good or bad. Furthermore, most application-layer generative AI systems are not retrained daily, weekly, or even monthly, because of compute costs and the long time it takes to accumulate enough user signal to justify retraining, so they don't need a quality-evaluation process that runs daily (unless you're Google, Meta, or OpenAI).

Given how easily humans can evaluate generative AI output, and how infrequently retraining happens, it often makes sense to evaluate new model candidates with internal manual testing (like the voting approach described above) rather than automated backtesting.
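To make the comparison concrete, scores from those manual-evaluation sheets can be aggregated with a few lines of code. The file and column names below follow the CSV sketch above, and the decision rule is only illustrative.

```python
# Sketch: compare two candidates using scores from manual-evaluation sheets.
# File and column names follow the earlier CSV sketch; thresholds are illustrative.
import csv
from statistics import mean


def load_scores(path: str) -> list[int]:
    with open(path, newline="") as f:
        return [int(row["score (1-5)"]) for row in csv.DictReader(f) if row["score (1-5)"]]


baseline = load_scores("eval_baseline.csv")
candidate = load_scores("eval_candidate.csv")
print(f"baseline mean: {mean(baseline):.2f}, candidate mean: {mean(candidate):.2f}")
# Ship the candidate only if it clearly wins and the evaluator notes don't surface
# new failure patterns; in practice the notes matter more than the mean score.
```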


