Artificial intelligence does not exist without data. And when your data is spread all over the place, you spend a lot of time managing infrastructure instead of focusing on what matters most: building your application. The world’s most prominent applications already run on Apache Cassandra, and AI is all about scale. By bringing Vector Search, a key component of working with AI models, to Cassandra, organizations can reduce costs, streamline data management, and extract value from every last drop of data.
This state-of-the-art feature, recently outlined in the Cassandra Enhancement Proposal (CEP-30), is further evidence of the Cassandra community’s commitment to building reliable features rapidly. This is also a testament to how Cassandra, which provides tools to create sophisticated data-driven applications, is becoming increasingly attractive to AI developers and organizations working with massive datasets.
What is Vector Search?
Text search is an established concept that has existed for a long time: searching for specific keywords within a document. But important data isn’t found only in text. Audio, images, and video (or a combination thereof) also contain relevant information that requires its own search methods. That’s where vector search comes in. Vector search has been around for quite some time and has proven valuable in a variety of applications, especially in AI and machine learning.
Also known as vector similarity search, this technique requires two parts to perform advanced searches. First, the raw data is indexed into a vector representation (an array of numbers) that serves as a mathematical description. Second, the vector data is stored in a way that allows developers to ask, “Given one point, what other points are similar?” This is simple and powerful for developers, but difficult to implement at scale on the server side. This is where Cassandra really shines: it consistently delivers data at any scale, anywhere in the world, with resilience that gives you peace of mind.
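The two parts can be sketched in plain Python. The toy `embed` function below is a hypothetical stand-in for a real embedding model, and a brute-force cosine-similarity scan stands in for the ANN index a database would use at scale:

```python
from collections import Counter
import math

def embed(text):
    # Part one: index raw data into a vector representation.
    # Toy bag-of-words "embedding"; a real application would use
    # a trained model such as word2vec or a transformer instead.
    return Counter(text.lower().split())

def cosine(a, b):
    # Similarity between two sparse vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Index a few product descriptions as vectors.
index = {desc: embed(desc) for desc in [
    "red running shoes",
    "blue denim jacket",
    "red trail running sneakers",
]}

def most_similar(query):
    # Part two: given one point, find the closest stored point.
    # A real database replaces this linear scan with an ANN index.
    qv = embed(query)
    return max(index, key=lambda d: cosine(qv, index[d]))

print(most_similar("running shoes in red"))
```

The hard part isn’t this math, which fits on a napkin; it’s doing it over billions of vectors, replicated worldwide, which is the server-side problem Cassandra is built for.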
This isn’t meant to be a full deep dive into vector search, but bringing it to Cassandra adds a whole new dimension: it reduces code complexity and lets users adopt the functionality and get to production quickly, expanding what your applications can do with the data they create.
Some practical examples of vector searches include:
- Content-based image search. Visually similar images are identified based on feature vectors. Libraries such as img2vec can convert image files into 512-dimensional vectors that can be used for similarity searches.
- Recommender systems. Products or content are recommended to consumers based on their similarity to items they have previously interacted with.
- Natural language processing applications. Vector search identifies semantic similarities between text content and can be leveraged for tasks such as sentiment analysis, document clustering, and topic modeling. This is typically done using tools like word2vec and may require the scale that Cassandra provides.
- Do you want to build something like ChatGPT? Vector search is important for large language model (LLM) use cases because it enables efficient storage and retrieval of the vector embeddings that represent knowledge extracted during the LLM training process. By performing a similarity search, vector search can quickly identify the embeddings most relevant to a user’s prompt. This helps the LLM generate more accurate and contextually appropriate responses, while providing the model with a form of long-term memory. In essence, vector search is an important bridge between LLMs and the vast knowledge bases on which they are trained.
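The long-term-memory idea in the last example can be illustrated with a short, hypothetical sketch: store “memories” as embeddings, then retrieve the top-k most similar to a prompt before handing them to the model. A toy bag-of-words embedding stands in for a real one, and a linear scan stands in for an ANN index:

```python
from collections import Counter
import heapq
import math

def embed(text):
    # Toy embedding; a production system would use the LLM's
    # own embedding model instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Long-term memory": facts stored as embeddings.
memories = [
    "the user likes the cassandra database",
    "the user lives in portland",
    "cassandra 5.0 will add vector search",
]

def recall(prompt, k=2):
    # Retrieve the k memories most similar to the prompt; these
    # would be prepended to the LLM's context window.
    pv = embed(prompt)
    return heapq.nlargest(k, memories, key=lambda m: cosine(pv, embed(m)))

print(recall("which database does the user like", k=1))
```

This retrieve-then-generate loop is exactly the storage and lookup workload that the new Cassandra features are designed to serve at scale.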
What’s happening in Cassandra
The Cassandra project is on a never-ending quest to make Cassandra the ultimate powerhouse of the database world. As mentioned above, once data has been converted into vector embeddings, we need a place to store and use it. These capabilities are being added to Cassandra and exposed in a simple, powerful way.
Vector data type
We introduce a new data type, VECTOR<type, dimension>, to support storing high-dimensional vectors. It allows processing and storage of the Float32 embeddings commonly used in AI applications, and has already led to discussions about adding Cassandra to AI libraries such as LangChain. For this example, imagine creating a vector from each product description to enable semantic similarity searches.
CREATE TABLE products (
    id UUID PRIMARY KEY,
    name varchar,
    description varchar,
    item_vector VECTOR<float, 3>
);
ANN search index
We add a new Storage-Attached Index (SAI) called “VectorMemtableIndex” that supports approximate nearest neighbor (ANN) search. This index works in conjunction with the new data type and Apache Lucene’s Hierarchical Navigable Small World (HNSW) library to enable efficient vector search capabilities within Cassandra.
CREATE CUSTOM INDEX item_ann_index ON products(item_vector) USING 'VectorMemtableIndex';
ANN operator in CQL
We introduce a new Cassandra Query Language (CQL) operator, ANN OF, to make it easier for users to perform ANN searches on their data using a simple and familiar query syntax. Continuing the example, a developer can ask the database for the products that most resemble a vector created from a description.
SELECT * FROM products
WHERE item_vector ANN OF [3.4, 7.8, 9.1]
LIMIT 1;
Emphasizing Cassandra’s scalability
When Cassandra 4.0 was released, one of the often overlooked highlights was the concept of improved pluggability. Cassandra’s new vector search functionality is built as an extension of the existing SAI framework, avoiding rewriting the core indexing engine. It uses the well-known and widely used HNSW functionality in Lucene and provides a fast and efficient solution for finding approximate nearest neighbors in high-dimensional space.
These new additions highlight Cassandra 4’s incredible modularity and extensibility. The integration of Lucene’s HNSW and the extension of the SAI framework give developers faster access to a wide range of production-ready features. Developers can choose from a large number of vector databases, many of which built a vector indexing engine first and added storage later. Cassandra has successfully tackled the difficult problem of large-scale data storage for over a decade, and we believe that incorporating vector search will provide even better production-ready capabilities.
New use cases
Cassandra is no stranger to machine learning and AI workloads. Longtime users have employed Cassandra as a fast and efficient feature store, and there are even rumors that OpenAI uses Cassandra heavily in building its LLMs. All of these use cases rely on existing features of Cassandra. There are many ways to use the new vector search, and it will be exciting to see what our community comes up with, but they will probably fall into two categories.
Enhance existing use cases with ANN search
If you already have an application built on Cassandra, you can enhance it by incorporating ANN searches. For example, if you have a content recommendation system, you can use ANN search to find similar items and make your recommendations more relevant. For product catalogs, features can be denormalized into embedding vectors stored in the same record. Fraud detection can be further enhanced by mapping behaviors to features. Whatever your use case, ANN search is probably relevant.
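The recommender enhancement can be sketched under stated assumptions: the item names and embedding vectors below are hypothetical (in practice they would be produced offline by a model and stored alongside each product row), and a plain linear scan plays the role the ANN index would play at scale. The idea is to average the embeddings of items a user has interacted with, then return the nearest unseen item:

```python
import math

# Hypothetical item embeddings; in a real system these would be
# computed by a model and stored in the item_vector column.
item_vectors = {
    "trail shoes":  [0.9, 0.1, 0.0],
    "road shoes":   [0.8, 0.2, 0.1],
    "rain jacket":  [0.1, 0.9, 0.3],
    "water bottle": [0.2, 0.3, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recommend(history):
    # Profile vector = mean of the vectors of items the user
    # interacted with; recommend the nearest item not yet seen.
    # An ANN index replaces this linear scan at scale.
    profile = [sum(col) / len(history)
               for col in zip(*(item_vectors[i] for i in history))]
    candidates = (i for i in item_vectors if i not in history)
    return max(candidates, key=lambda i: cosine(profile, item_vectors[i]))

print(recommend(["trail shoes"]))
```

With ANN search in the database, the nearest-neighbor step becomes a single CQL query against the same table that holds the catalog.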
Building something new that requires vector search
If you’re starting a new project that requires fast similarity search capabilities, Cassandra’s new vector search capabilities are an excellent choice for data storage and retrieval. Knowing that you can scale from gigabytes to petabytes on the same system allows you to focus on building applications without worrying about trade-offs. In addition to storing vector embeddings, you can throw in all the functionality of CQL and the tabular storage of a full-featured database.
All of these options are available no matter how you consume Cassandra. Whether you deploy on your own using open source Cassandra, on Kubernetes with K8ssandra, or in the cloud with services like DataStax Astra DB, you get the same great system. The freedom you get with open source is the freedom to choose how you build your application.
Built by developers for developers
As we continue to innovate and expand Cassandra’s capabilities, we are committed to staying at the forefront of your data management needs. The introduction of vector search makes data-driven applications even more powerful and versatile. This, along with other cutting-edge features such as large-scale distributed ACID transactions, will make Cassandra 5.0 the most significant upgrade yet. And we won’t stop there: the companies and developers behind Cassandra are eager to find more ways to consolidate data, simplify management, and save costs.
We believe this addition will be useful not only for AI developers, but also for organizations managing large data sets that benefit from fast similarity searches. So keep an eye out for the alpha release of Cassandra with vector search capabilities, planned for Q3. We look forward to seeing the amazing applications you build with this new feature. We would also appreciate it if you could share your use cases with the Planet Cassandra community.
