Artificial agents, applications, and bots are popping up in every enterprise environment, and data is flowing at an alarming rate.
There’s a lot of hype, excitement, and fear about the potential of AI, but little attention has been paid to the data infrastructure needed to make it all work, especially at the enterprise level. AI requires a strong, well-thought-out data foundation, whether extended from traditional infrastructure or part of next-generation data technologies.
“This trend will continue across almost every industry. As AI moves from experimentation to core business operations, the pressure on the data layer will also increase,” said Karthik Ranganathan, co-CEO and co-founder of Yugabyte. To explore what the data foundation will look like in the age of AI, we consulted experts and leaders across the industry to consider the thinking and efforts needed to create a data environment that works well for current and future enterprise AI initiatives. Remember, it’s the data that counts.
Are you ready?
There is general agreement among the data leaders and experts BDQ Today consulted: today’s data infrastructure is simply not ready for the demands of AI. “Enterprises can build impressive AI demos, but at the root of it all are still disorganized data, disorganized identities, and fragmented platforms,” said Mark Gowdy, chief partner technologist at Quest Software.
According to Vikram Venkat, principal at Cota Capital, most enterprise data infrastructures “were built and deployed in the analytics era. Most data was structured, with row- and column-optimized tables being the dominant format, and batch processing was sufficient for most large-scale requirements.”
What most companies lack is a “clear, integrated, well-defined data model,” said Cole Bowden, developer advocate at InfluxData. “When you use AI to leverage data, it has to explore tables, schemas, and column names, and infer how all the data is related and how plain-text concepts map to that data.”
Along these lines, “a column named ‘col1’ carries no functional meaning and is of no use to an AI agent,” Bowden explained. “AI has a really hard time when you have multiple columns with the same name that contain different data or different concepts. The data model is the documentation for the AI, and most data models aren’t clean enough to make the AI fully functional.”
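Bowden’s point about the data model serving as documentation can be made concrete with a small sketch. The tables, column names, and data below are entirely hypothetical; the idea is simply that an agent exploring a database through its catalog gets far more signal from a self-describing schema than from opaque names like `col1`:

```python
import sqlite3

# Hypothetical sketch: an in-memory SQLite database with two schemas, one
# opaque and one self-describing, to show what an AI agent actually "sees"
# when it inspects the catalog.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Opaque schema: nothing here tells an agent what the data means.
cur.execute("CREATE TABLE t1 (col1 TEXT, col2 REAL)")

# Self-describing schema: names (and units) carry meaning the agent can use.
cur.execute("""
    CREATE TABLE customer_orders (
        customer_email  TEXT,
        order_total_usd REAL,
        ordered_at      TEXT  -- ISO-8601 timestamp
    )
""")

# An agent exploring the database reads column names from the catalog.
cols = [row[1] for row in cur.execute("PRAGMA table_info(customer_orders)")]
print(cols)  # ['customer_email', 'order_total_usd', 'ordered_at']
```

The same catalog query against `t1` returns only `col1` and `col2`, which is exactly the dead end Bowden describes.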
The result, Gowdy noted, is “AI that cannot be fully trusted, tracked, or secured in production.” “A small number of companies are closing that gap, but only if they intentionally invest in finding, understanding, and managing their data while integrating into modern platforms centered around strong identities. Until companies fix their housekeeping, it’s like throwing a junk drawer of random parts and information into a scalable environment and calling it AI-enabled.”
Building AI-powered environments is getting easier every year, but the underlying data architecture can make or break an AI application. This is where things get complicated. “Building on top of an existing traditional database becomes very complex,” Ranganathan said. “Multiple databases are required, each supporting a different data model (typically SQL, vector, graph, search, and time series). Stitching all of these databases together creates a huge data-silo problem, which significantly slows development, reduces performance, and leaves gaps in system observability.”
Chetas Joshi, a software engineer at Robinhood, said he has seen firsthand that “most data infrastructure is not ready to support AI and machine learning use cases.” AI systems, especially applications that leverage large language models (LLMs), real-time decision-making systems, and co-pilots, “bring very different requirements,” he explained. “Use cases such as fraud detection, personalization, recommendations, and RAG [retrieval-augmented generation] rely on continuous streams of large amounts of often unstructured data. They need fresh signals, i.e., high-throughput, low-latency data processing, and the right context.”
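Joshi’s “fresh signals” requirement can be illustrated with a toy retrieval step. Everything here is a simplified, hypothetical sketch (real RAG systems use embedding similarity over a streaming store, not keyword overlap): the point is that ranking context by relevance decayed by age means a ten-minute-old record loses to a five-second-old one, which is why low-latency ingest matters:

```python
from dataclasses import dataclass

# Hypothetical sketch: score candidate context by keyword overlap with the
# query, multiplied by an exponential recency decay. Weights and the
# half-life are illustrative only.

@dataclass
class Event:
    text: str
    ts: float  # ingest timestamp, in seconds

def retrieve(events, query, now, half_life=60.0):
    """Return the event with the best relevance * recency score."""
    q = set(query.lower().split())
    def score(e):
        overlap = len(q & set(e.text.lower().split()))
        decay = 0.5 ** ((now - e.ts) / half_life)  # halves every minute
        return overlap * decay
    return max(events, key=score)

now = 1000.0
stream = [
    Event("card declined at merchant A", ts=now - 600),  # 10 minutes old
    Event("card declined at merchant B", ts=now - 5),    # 5 seconds old
]
best = retrieve(stream, "card declined", now)
print(best.text)  # the fresher event wins
```

Both events are equally relevant, so freshness decides; a batch pipeline that delivers signals hours late would make the decay term, and the decision quality, collapse.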
The gap between AI capabilities and enterprise readiness is “not a hardware issue, it’s a hygiene issue,” advised Anusha Kovi, business intelligence engineer at Amazon. “Most enterprise data environments are built for storage and reporting, not to feed models. This means data is siloed, inconsistently labeled, poorly documented, and not structured in a way AI can use without extensive cleanup work first. The infrastructure exists, but the underlying foundation was not designed with this use case in mind.”
Practical issues also need to be considered. “Managing costs as data volumes grow, keeping data fresh enough for real-time use cases, handling model drift, and maintaining governance across distributed systems,” Joshi added. “Supporting AI at scale requires high-throughput streaming ingest, an online serving layer backed by an intelligent caching layer, vector indexing, and reliable object storage all working together.”
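Two of the serving-side pieces Joshi names, a vector index and an intelligent caching layer in front of it, can be sketched in a few lines. This is a deliberately naive illustration (brute-force cosine similarity over hypothetical document vectors, with `functools.lru_cache` standing in for a real cache tier; production systems use approximate-nearest-neighbor indexes):

```python
import math
from functools import lru_cache

# Hypothetical 3-dimensional "embeddings" for three documents.
DOCS = {
    "doc1": (1.0, 0.0, 0.0),
    "doc2": (0.6, 0.8, 0.0),
    "doc3": (0.0, 0.0, 1.0),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

@lru_cache(maxsize=1024)  # caching layer: hot queries skip the index scan
def nearest(query):
    """Brute-force vector search: return the most similar document id."""
    return max(DOCS, key=lambda d: cosine(DOCS[d], query))

print(nearest((0.9, 0.1, 0.0)))  # doc1
print(nearest((0.9, 0.1, 0.0)))  # served from cache the second time
```

The cache matters because online serving layers see highly skewed query distributions; the hard part in practice is invalidating entries when the underlying index is updated by the streaming ingest path.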
“Enterprise data is spread across many systems with no unified means of access,” said Pascal Van Hentenryck, A. Russell Chandler III Chair and Professor at Georgia Tech and recently appointed director of the Gurobi AI Innovation Lab. “The need to map roles and users for security, privacy, and access control purposes adds a further layer of complexity. So the challenge is to move from data to workflow, from data infrastructure to workflow infrastructure. That is a paradigm shift.”
Applying AI
One thing is certain: AI-based applications require large amounts of data. “From prediction to optimization to generative and agentic solutions, AI is data-driven and, by definition, data-intensive,” Van Hentenryck said. For now, AI is showing up in “real-time decision-making, fraud detection, personalization, and operational analytics,” Gowdy said. “LLM-based co-pilots and assistants sit on top of warehouses and lakehouses answering questions about documents, logs, and reports, as well as advanced regulatory and risk analysis that combines structured and unstructured data.”
LLMs are “being used to power better decision-making through more informed analysis, while agentic AI enables end-to-end process automation, increasing speed and efficiency,” said David Thoumas, co-founder and CTO of Huwise (formerly Opendatasoft). “Both of these rely on access to reliable data that is easy to understand and provides context and consistency.”
AI is being focused on the richest and most established datasets. According to Graham McMillan, CTO at Redgate, these consist of “data warehouses and back-office systems, such as CRM, that have grown with the organization itself. These contain some of the thorniest but most valuable data that companies have, which is why people want to deploy AI against them.”
AI is being applied to data sets to “identify trends in customer behavior, identify churn before it happens, and find cross-sell opportunities hidden in plain sight,” McMillan said. Additional use cases include “real-time fraud prevention, personalized digital banking, communications network optimization, logistics coordination, e-commerce recommendation engines, and intelligent retail pricing,” Ranganathan said. For example: “AI-driven shopping concierges on e-commerce sites can rely on past transaction data, website behavior, locale information, and current conversation context to make accurate decisions instantly, often across globally distributed environments.”
