Apache Cassandra: the data foundation for real-time AI

The growing attention on machine learning and artificial intelligence has made it more urgent for businesses to figure out how best to leverage AI for impact. One of the most powerful shifts we see is the move to AI-powered applications that use real-time data to act on events in the moment. This spans everything from consumers actively engaging with a business to supply chain operations that must constantly adapt to changing variables.

Extracting intelligence from data in real time, feeding it into applications, informing decisions, and driving actions while customers are actively engaged creates new experiences and enhances every customer interaction. But it also brings new challenges and places great demands on the underlying infrastructure that must support this intelligent real-time model.

When it comes to real-time AI, you can find inspiration in the consistent patterns, blueprints, and best practices of pioneering organizations that have invested time and resources to build their own AI-powered real-time solutions. Without exception, major organizations such as Netflix, Apple, Uber, and FedEx that are using real-time AI today have chosen to build their solutions on Apache Cassandra.

The reasons for building real-time AI on top of Cassandra range from world-class latency and speed to scalability, availability, and improved prediction accuracy. Here we'll explore how Cassandra provides the foundation for two of the most important data management categories in real-time AI, features and events, enabling highly accurate insights based on the right data at the right time, for maximum business impact.

Features

In January, ChatGPT reached 100 million users faster than any service before it. Since then, there has been an explosion of AI literacy well beyond technical circles. One of the most prominent pieces of AI jargon to go mainstream is the feature.

Tempting as it is, I won't ask ChatGPT to write a paragraph explaining the term. A dictionary works just as well:

feature (noun): a characteristic attribute or aspect of something

In the context of machine learning, we have data that represent specific values of those attributes or aspects. These features are used to train machine learning models to recognize patterns in data. Features are also used during inference, to provide the current context on which the model should base its predictions.
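As a concrete (and entirely hypothetical) illustration of the idea, here is what a feature set for a single entity might look like at inference time; the feature names and values are invented for the sketch:

```python
# Hypothetical feature vector for one entity (a user), as it might be
# assembled at inference time. Names and values are illustrative only.
features = {
    "user_id": "a1b2c3",          # entity key
    "views_last_hour": 12,        # fresh, stream-computed feature
    "avg_session_minutes": 7.5,   # slower-moving aggregate
    "preferred_genre": "sci-fi",  # categorical attribute
}

def to_model_input(feats: dict) -> list:
    """Flatten the numeric features into the ordered vector a model expects."""
    return [feats["views_last_hour"], feats["avg_session_minutes"]]

print(to_model_input(features))  # [12, 7.5]
```

The same feature definitions would be used both when training the model and when serving it, which is exactly why keeping them fresh and queryable matters.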

If our goal is real-time AI, making inferences based on the latest and most relevant information available, then we have to consider how to keep the features for every entity "fresh" as events (more on those later) flow continuously through the system.

It is very important that features that need to stay fresh live in a database that can support very high insert/update rates while still serving low-latency queries, even during write peaks.

To achieve this, we typically use a stream processor such as Apache Flink or Spark Streaming to continuously process events and keep features up to date. We're also excited to open source, and soon release, the stream processing technology from Kaskada, recently acquired by DataStax. Kaskada's technology offers capabilities similar to Flink and Spark, along with a rich feature set for feature engineering.
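To make the "continuously process events to keep features fresh" idea concrete, here is a minimal toy sketch of what a stream processor does: fold each incoming event into a running feature value. In a real pipeline this state lives inside Flink, Spark Streaming, or Kaskada and is written through to the feature table; event shapes here are invented:

```python
from collections import defaultdict

# Toy stand-in for a stream processor's state: a fresh feature value
# keyed by entity, updated as each event arrives.
views_last_hour = defaultdict(int)

def on_event(event: dict) -> None:
    # In a real pipeline, this update would also be written through to the
    # feature table in Cassandra so inference always sees a fresh value.
    if event["action"] == "page_view":
        views_last_hour[event["entity_id"]] += 1

for ev in [{"entity_id": "a1b2c3", "action": "page_view"},
           {"entity_id": "a1b2c3", "action": "page_view"},
           {"entity_id": "z9y8x7", "action": "page_view"}]:
    on_event(ev)

print(views_last_hour["a1b2c3"])  # 2
```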

Whether or not the events themselves are stored in the database is a separate question (more on that below); the critical requirement is that fresh features live in a store that sustains very high insert/update rates without sacrificing query latency, even during write peaks.

To make an inference for an entity, we need to query that entity's features and complete the inference in under 200 ms for many real-time use cases. Cassandra can fetch multiple features in parallel at once. In a recent test I ran on Astra DB for an integration with Feast, an open source feature store, a query across three tables achieved a p99 latency of 23 ms, leaving most of the 200 ms budget for other processing.
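The parallel-fetch pattern can be sketched as follows. Against a real cluster you would issue `session.execute_async()` calls with the DataStax Python driver and collect the results; here the three "table reads" are stand-in functions with invented names, and a thread pool simulates the concurrency:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for reads against three hypothetical feature tables; with the
# DataStax Python driver, each would be a session.execute_async() call.
def fetch_user_features(entity_id):
    return {"views_last_hour": 12}

def fetch_item_features(entity_id):
    return {"item_popularity": 0.87}

def fetch_context_features(entity_id):
    return {"hour_of_day": 14}

def fetch_all(entity_id):
    # Issue the three reads in parallel, so total latency is roughly the
    # slowest single query rather than the sum of all three.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(f, entity_id) for f in
                   (fetch_user_features, fetch_item_features, fetch_context_features)]
        merged = {}
        for fut in futures:
            merged.update(fut.result())
        return merged

print(fetch_all("a1b2c3"))
```

Keeping the feature reads parallel is what preserves most of the 200 ms budget for the model itself.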

Events

To understand how best to design real-time AI, we must first clarify the role of events. If your technology career hasn't intersected with "complex event processing," you may not be aware that an "event" is a data record with a specific structure.

An event typically captures a unique identifier for the "entity" (most likely an email address or a randomly assigned alphanumeric identifier), a timestamp, and a set of key data values associated with that time. Events therefore capture the state of an entity at a particular point in time.

This is important because there are practical realities in continuously processing data. Time ticks by. Therefore, any real-time calculations you make on your data will always be in the context of a specific timeframe.

To place the data in the proper timeframe, records must carry timestamps. Because records move between systems so frequently these days, the timestamp should be part of the data itself; you can't rely on, for example, the timestamp of a database INSERT.
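One minimal way to sketch this, with an invented record shape: stamp the event time at the source, when the record is created, so the timestamp travels with the data wherever it goes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Event:
    entity_id: str
    values: dict
    # The event time is stamped when the event is created at the source,
    # NOT when some downstream database happens to INSERT it.
    event_time: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

e = Event("a1b2c3", {"action": "page_view"})
assert e.event_time.tzinfo is not None  # timestamp travels with the record
```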

Now, it's not always necessary to store events in the database. Sometimes all you need is to run events through a stream processor, often to compute summary data in real time, and archive the records to object storage. When you do store them, Cassandra works particularly well with both Apache Kafka and Apache Pulsar, as both a sink and a source for records.

A sink handles writing stream records into the database. Via change data capture (CDC), Cassandra can also serve as a stream source, producing a stream from database records. The flexibility of combining database and streaming in this way is a powerful tool for processing data in a real-time context without excess architectural complexity, which would eventually show up as latency.

Although most event data can safely be processed in streams and then archived to files, there are times when your application needs to store events in an online database. Events are not just an audit trail of what happened in your app; they are often also used to represent the current state of a user's activity.

In this case, Cassandra partitions map very cleanly to the event data model, making Cassandra a particularly efficient solution. The partition key stores the event’s entity key, and the timestamp (or “timeuuid”) is stored as the clustering column.


As a result, Cassandra stores each entity's events in sorted order within its partition, and static columns provide a space-efficient mechanism for storing data common to every row in a partition. Later, you can efficiently retrieve each entity's records in chronological order.
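The schema pattern and its read behavior can be sketched as follows. The CQL in the comment is illustrative (table and column names are invented), and the Python below it is a toy in-memory model of the same idea: one list per partition key, kept in clustering order.

```python
import bisect
from collections import defaultdict

# Illustrative CQL for the pattern described above (names are hypothetical):
#   CREATE TABLE events (
#       entity_id  text,
#       event_time timeuuid,
#       payload    text,
#       plan       text STATIC,   -- shared by every row in the partition
#       PRIMARY KEY ((entity_id), event_time)
#   ) WITH CLUSTERING ORDER BY (event_time ASC);
#
# Toy in-memory model: one sorted list per partition key.
partitions = defaultdict(list)

def insert(entity_id, ts, payload):
    bisect.insort(partitions[entity_id], (ts, payload))  # keep clustering order

def recent(entity_id, limit=10):
    """Chronological read of one entity's events: a single-partition query."""
    return partitions[entity_id][-limit:]

insert("a1b2c3", 3, "checkout")
insert("a1b2c3", 1, "page_view")
insert("a1b2c3", 2, "add_to_cart")
print(recent("a1b2c3"))  # events come back in time order
```

Because the entity key is the partition key, a per-entity chronological read touches a single partition, which is what makes this access pattern so efficient in Cassandra.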

From a hygiene standpoint, if you later build machine learning models from the event data, make sure each timestamp reflects when the data values were actually captured.

Be very careful when updating or correcting these records. If the data stored for an event reflects a value that only became true after the record's timestamp, data "from the future" leaks into the model, potentially degrading its performance.
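A common guard against this kind of leakage is a point-in-time filter when assembling training examples: only feature values timestamped at or before the example's label time are allowed in. A minimal sketch, with invented row shapes:

```python
def point_in_time_features(feature_rows, as_of):
    """Keep only feature values whose timestamp is at or before `as_of`,
    so no data 'from the future' leaks into a training example.
    Each row is a (timestamp, value) pair; names are illustrative."""
    return [(ts, v) for ts, v in feature_rows if ts <= as_of]

rows = [(100, "v1"), (150, "v2-corrected-later"), (90, "v0")]
# Training example labeled at time 120: the later correction at t=150
# must be excluded, even though it now sits in the same table.
print(point_in_time_features(rows, as_of=120))  # [(100, 'v1'), (90, 'v0')]
```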

Data architecture for speed and scale

Machine learning works best with lots of high signal data. Building a real-time infrastructure on Cassandra gives you the freedom to capture user activity signals very fast and query new features with high throughput and low latency.

This allows you to learn which signals produce the best models and drop the others to optimize storage costs. Cassandra's linear scalability lets machine learning engineers easily support ingestion of any number of events and continuously deliver fresh features fast enough to support real-time interactions.

There’s a reason Netflix and Uber chose Cassandra as the foundation of the data architecture that powers their AI systems.

Build AI-powered real-time apps on the right platform and you can deliver highly personalized viewing recommendations in real time, adjust routes instantly to get drivers to their destinations most efficiently, and anticipate and eliminate disruptions in manufacturing and supply chains.

All of these opportunities are about more than just making predictions. Unlike traditional approaches that rely on batch processing and costly, time-consuming transformations to get data into ML systems, these real-time AI systems drive near-instantaneous action. The only way to achieve this is with a foundational data architecture built for speed and scale, and Cassandra is the perfect choice for achieving it.

Learn more about Cassandra's unique speed, scale, and availability advantages, and check out this new guide, Building Real-Time AI Applications with Cassandra.
