Data has history. We’ve been using “records” to aggregate information since biblical times, but databases as we know them today emerged only about 60 years ago. Some say the database was born in the 1960s, alongside application logic and graphical user interfaces, while others put it a decade later. As databases have evolved over that half-century, they have gradually expanded and enhanced their ability to ingest, capture, snapshot, query, analyze, and manage information over time.
We now stand at something of an inflection point with these tools. As I’ve stressed before, the software developers we love for their ability to deliver shiny new “apps” to smartphones and other devices aren’t necessarily trained as data scientists. Automation can alleviate some of the pressure these developers feel, but broader development is also progressing rapidly, and much of it is being driven by generative artificial intelligence (AI).
enter the vector database
The elements of data science may be beyond the reach of most businesspeople at this point (at least until we democratize them with no-code drag-and-drop tools), but understanding why vector database capabilities are useful matters to all of them. In short, it affects how the apps in our pockets work, driving new use cases in every commercial environment. So, in a nutshell, what is a vector database?
Vector databases work hand in hand with generative AI to perform analytics related to similarity searching and anomaly detection. They often make use of temporal data, that is, time-stamped data that tells us not just “what” happened, but when it happened in relation to all other events within a particular IT system. Vectors are data “objects”: lists of numbers that represent space, location, time, and various other categorical characteristics, allowing us to give detailed value and meaning to data.
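A minimal sketch of that idea, using illustrative names and values of my own choosing (nothing here comes from any particular product): a time-stamped event reduced to a single list of numbers, with the “when” encoded as one of the dimensions.

```python
# Representing a time-stamped event as a vector of numbers so it can be
# compared numerically with other events. All names here are illustrative.
from datetime import datetime, timezone

def event_to_vector(temperature_c, latitude, longitude, timestamp):
    """Encode an event's 'what', 'where', and 'when' as one list of floats."""
    return [
        temperature_c,
        latitude,
        longitude,
        timestamp.timestamp(),  # seconds since epoch captures the "when"
    ]

reading = event_to_vector(21.5, 51.5, -0.12, datetime(2023, 7, 1, tzinfo=timezone.utc))
print(reading)
```

Because every dimension is numeric, distance between two such vectors becomes a meaningful measure of how similar two events are, which is the foundation of similarity search.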
Due to their inherent temporal (time-aware) capabilities, vectors are very useful for tracking how the world of sensors behaves in the Internet of Things (IoT). Vector databases are not only very fast in computational terms (they can ingest data and perform actions such as replication and sharding for data partitioning very quickly), they can also hold a wide variety of data types. You can build a data store that “understands” the values it holds and converts between data formats in a more sophisticated way. Traditionally, it has been difficult for a document (for example) to know much about an audio file, image, or video, but vector “embeddings” offer powerful new approaches to storage, indexing, and query processing.
As a result of building the web and the cloud, we have many data sources in many different formats and locations. Finding what you need in that boggy stream of information can be difficult, so companies like Google developed powerful similarity search capabilities. Together with sister technologies such as anomaly detection, these algorithmic styles are fundamental to how we navigate information today. Vector databases employ all of these features, but are designed to work closely with how software application development logic is written. This means there is room for new ways of navigating huge datasets.
Simple vector example
Most of us know that information is tagged with metadata to indicate what a particular data source relates to. A music file is packed with data that delivers the sound to an audio player application, but it also contains metadata recording who the artist is, how long the song is, and, of course, the name of the song. This is why metadata is often described as “information about information”: it lets you tell what is what.
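As a concrete (and entirely made-up) illustration, the music-file example above boils down to a payload plus a descriptive layer sitting alongside it:

```python
# Illustrative only: a music file's payload next to its descriptive metadata,
# the "information about information".
track = {
    "audio_bytes": b"...",        # the sound data the player application consumes
    "metadata": {
        "artist": "Example Artist",
        "title": "Example Song",
        "duration_seconds": 215,
    },
}

# A player reads audio_bytes; a search or library feature reads the metadata.
print(track["metadata"]["title"])
```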
To go further with vectors: if an AI process needs to determine whether a given image contains a dog or a cat, looking at basic data from that image alone may not be enough. After all, both animals have four legs, fur, and sharp teeth, and both are kept as domestic pets. When we use vectors to record broader “attributes” of an image, we can apply the power of large language models (LLMs) with generative AI to explore other features around the image. If the image’s caption states that the animal was taken “for a walk,” it is more likely to be a dog, because an LLM looks for the word sequences most likely to follow each other. Perhaps one cat in a thousand is taken for a walk on a leash, but that anomaly also forms part of the vector logic and stops the machine from judging every such picture to be a dog.
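The comparison underneath that judgment is typically a similarity measure between embedding vectors. Here is a toy sketch under invented assumptions: the three dimensions and all the numbers are made up purely to show how cosine similarity would rank a caption-informed image embedding against “dog” and “cat” prototypes.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings (invented for illustration); the dimensions
# might loosely mean [furriness, leash-likelihood, independence].
dog_prototype = [0.9, 0.9, 0.2]
cat_prototype = [0.9, 0.1, 0.9]

# An image whose caption mentions "for a walk" pushes the
# leash-likelihood dimension up in its embedding.
image_with_walk_caption = [0.85, 0.8, 0.3]

print(cosine(image_with_walk_caption, dog_prototype))
print(cosine(image_with_walk_caption, cat_prototype))
```

With these made-up numbers, the image embedding scores closer to the dog prototype; a real system would use learned embeddings with hundreds or thousands of dimensions, but the ranking mechanism is the same.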
Going back to our music example, we might want to use vectors to recommend other songs based on a person’s current playlist and tastes. Using traditional database techniques, you would have to examine every user’s songs in a table through a “looping” process that visits every data record. Vectors allow you to assign attributes to each track (beats per minute, genre, year of recording, artist, inappropriate language, and so on) and create a temporally related list. This improves speed and performance and enables more detailed, accurate data analysis.
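A sketch of that recommendation idea, with hypothetical attribute vectors I have invented for illustration (a real vector database would answer this with an index rather than a scan, but the nearest-neighbour logic is the same):

```python
import math

def distance(a, b):
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical per-track attributes, scaled to comparable ranges:
# [beats per minute / 200, year of recording / 2030, explicit-language flag]
library = {
    "Track A": [0.55, 0.990, 0.0],
    "Track B": [0.62, 0.996, 0.0],
    "Track C": [0.30, 0.970, 1.0],
}

# A taste vector derived from the listener's current playlist (also invented).
current_taste = [0.61, 0.995, 0.0]

recommendation = min(library, key=lambda t: distance(library[t], current_taste))
print(recommendation)  # the track whose attributes sit closest to the taste vector
```

The point is that nothing here loops over raw song records; the comparison happens purely in attribute space, which is what lets vector indexes short-circuit the scan entirely.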
“A vector is like a map, and (from a data point of view) any object can be represented as a list or table based on time-series information,” said Ashok Reddy, CEO of KX, the company known for its kdb+ time-series database and real-time analytics engine. “We live in a world where most of the information that exists is unstructured: text documents, social media feeds, chat streams, image and video files, and so on. Once we can vectorize it, we will be able to manage information in a new format, in a way that spans all industries. With vectors, we can go as deep as we want on a particular subject and list hundreds of attributes of selected data objects. A vector database simplifies the encoding of many dimensions, and temporal time-series information will always be important among those dimensional attributes. If you think about weather, health, military defense, transportation, or business, all decisions are about timing.”
Generative AI + Vector Database
To bridge the world of generative AI and vector databases into a new unified technical proposition, the KX team this year took its core kdb+ database to a new level, creating the logically named KDB.AI, a vector database explicitly designed for cloud-native vector data management, vector embedding, and GPT-style natural language processing (NLP) query exploration.
As CEO Reddy puts it, “We are now using human language to ‘ask’ a machine a query, with language itself acting as a programming language. We bring generative AI to the data because OpenAI itself may ‘know’ about data, but it doesn’t support data at a native level in the context of what we’re doing. When a user asks OpenAI about data, it returns results based on what people have said about the dataset in the information stream, not on the data itself and its properties, value, or core meaning within the business domain.”
Reddy reminds us that even when generative AI is fused with the data layer, as his company’s platform does, there are still inherent limits to how far human language can express queries and intentions. “If the machine determines that some process can be optimized more than originally expected, we may accept that development if it falls within agreed-upon levels of explainability and governance. We always place these around the AI as perimeter guardrails,” he added.
The company’s move forward with the wider KX technology stack positions it as a way to streamline “traditional” data science workflows, many of which relied on the looping processes found in the aforementioned relational database management systems. KX has worked to integrate and “bundle” organizational tools with vector encoding, built-in algorithms, data connectivity, and support for popular and relevant data science languages such as Python, Java, and SQL.
crush the stack
Reddy calls this the process of “crushing the stack.” It means fewer application data dependencies, faster processing, and simplified data exploration. The company claims that KDB.AI streamlines the “data pipeline” of related data tools and technologies by consolidating various functions into a single engine. This mechanism effectively revolutionizes traditional techniques for structured and unstructured data processing by making it easier to encode data as vectors.
As for the working process (when a data engineer or developer puts the key in the ignition and starts the virtual engine), KDB.AI first ingests data from an external database, ETL process, or streaming data source. This is done via the platform’s ‘native’ connectors. In a subsequent step, the data is aggregated, summarized, and ‘cleansed’ (of duplicates or corruption) in preparation for storage. The data is then saved, before a third stage in which KDB’s built-in algorithms encode vector embeddings into the database.
The fourth stage is the consumption phase. Users can ask interactive questions of the vector embeddings (through support for interfaces written in the programming languages mentioned above) and process complex queries from the command prompt. The fifth and final stage enables KDB.AI to work with various integration tools, so the resulting analysis can surface in many business intelligence (BI) and data management products such as Informatica, Dataiku, Matlab, Power BI, and Tableau.
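The five-stage flow described above can be sketched end to end in miniature. To be clear, none of the class or method names below come from KDB.AI itself; this is a toy stand-in (with a deliberately trivial character-frequency “embedding”) that only mirrors the shape of the pipeline: ingest, cleanse, embed and store, then query.

```python
# A hypothetical, self-contained sketch of the ingest -> cleanse -> embed ->
# query flow. Names and the "embedding" are invented for illustration only.
class ToyVectorStore:
    def __init__(self):
        self.vectors = {}

    def ingest(self, records):
        # Stage 1: pull raw records in; Stage 2: cleanse (drop empties, normalize).
        return [r.strip().lower() for r in records if r.strip()]

    def embed_and_store(self, docs):
        # Stage 3: encode each document as a vector and persist it.
        for doc in docs:
            # Trivial stand-in embedding: a 26-dim character-frequency vector.
            self.vectors[doc] = [doc.count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

    def query(self, text):
        # Stage 4: answer an interactive question by nearest-neighbour lookup.
        probe = [text.count(c) for c in "abcdefghijklmnopqrstuvwxyz"]
        def dist(vec):
            return sum((x - y) ** 2 for x, y in zip(vec, probe))
        return min(self.vectors, key=lambda d: dist(self.vectors[d]))

store = ToyVectorStore()
docs = store.ingest(["  Sensor offline  ", "payment received", ""])
store.embed_and_store(docs)
# Stage 5 would hand this result on to a BI or data management tool.
print(store.query("sensor failure"))
```

Even with an embedding this crude, the query for “sensor failure” lands on the nearest stored document rather than requiring a record-by-record text match, which is the essential promise of the pipeline.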
Signs of data disruption
KX is not the only time-series data organization, nor is it the only company looking to develop a vector database. Other vector database specialists include Milvus, Pinecone, Weaviate, Vald, Deephaven, and Qdrant. But there is a bigger question here: would a customer actually be happy to expose data to generative AI through a vector-based data platform such as KDB.AI?
“Generally speaking, we know that customers do not want to expose the full breadth of their private datasets to open AI engines or large language models (LLMs). Many have developed their own small language model (SLM) within their organization to maintain more control. This does not mean they will never go back to an open model and share; most large organizations still hold public-domain datasets that can be shared to help train open AI. It is simply the reality of sensible separation. But overall, from loops to vector power and performance, we’ve come a long way,” concludes KX’s Reddy.
2014 ACM Turing Award winner Michael Stonebraker once said that to be truly disruptive, a database must be 50 times faster than its predecessor. In response, proponents of vector databases suggest that the technology runs 100 times faster than previous traditional database approaches. Vectors are spreading across the fabric of search, data, and the web, and many enterprise applications may soon need to consider the V factor, if not the X factor.
