At QCon London 2026, Lan Chu, AI Technology Lead at Rabobank, shared lessons learned from running a production AI search system used internally by more than 300 users to search 10,000 documents. In her experience, most failures in retrieval-augmented generation (RAG) systems stem from indexing and retrieval rather than from the language model itself.

The system allows users to search through thousands of internal documents, including PDFs and PowerPoint files, to quickly extract insights for tasks such as preparing for a meeting with a client.
Its architecture follows a typical RAG pipeline:
1- Ingestion: Parse, chunk, and embed documents before indexing them in a vector database.
2- Retrieval and generation: Retrieve the relevant chunks and send them to the LLM to generate the answer.
3- Observability: Tracing, retrieval performance monitoring, and evaluation metrics.
Although the architecture appears simple, Chu explained that challenges quickly arise with document quality, search relevance, and evaluation in a production system.

Chu emphasized that accurate document parsing is critical for AI search systems. Corporate documents often contain complex layouts such as tables and infographics, and simply converting them to plain text strips out important structure, which can lead to misread numbers and misinterpreted tables. To address this, she built a pipeline that combines traditional text extraction with a vision language model that understands layout.
Even with modern language models, content must be chunked to avoid overloading the model's context window and driving up cost. Chu tested different methods and found that splitting documents by section worked best for her dataset and achieved high accuracy. However, she emphasized that the appropriate strategy depends on the specific data.
Standard search systems rely on vector similarity, which can miss important context such as how recent a document is. Her system added temporal scoring to prioritize newer documents and a routing layer that decides whether to retrieve documents or call an external API. The model may struggle to fill in a tool's parameters correctly, so in some cases the user is asked to confirm the input.
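One way to sketch scoring that favors newer documents is to blend the vector similarity with an exponential recency decay. The article does not describe the actual formula, so the blend, the `half_life_days` and `recency_weight` parameters, and the hit dictionary layout below are all assumptions for illustration.

```python
from datetime import date, timedelta


def recency_score(published: date, today: date, half_life_days: float = 180.0) -> float:
    """Return 1.0 for a document published today, halving every half_life_days."""
    age_days = max((today - published).days, 0)
    return 0.5 ** (age_days / half_life_days)


def blended_score(
    similarity: float,
    published: date,
    today: date,
    recency_weight: float = 0.3,
    half_life_days: float = 180.0,
) -> float:
    """Weighted mix of vector similarity and recency (both assumed in [0, 1])."""
    recency = recency_score(published, today, half_life_days)
    return (1 - recency_weight) * similarity + recency_weight * recency


def rerank(hits: list[dict], today: date, recency_weight: float = 0.3) -> list[dict]:
    """Sort retrieved hits by blended score, highest first."""
    return sorted(
        hits,
        key=lambda h: blended_score(h["similarity"], h["published"], today, recency_weight),
        reverse=True,
    )


# Hypothetical hits: the older report is slightly more similar to the query,
# but the recent one ranks first once recency is taken into account.
today = date(2026, 3, 1)
hits = [
    {"id": "old-report", "similarity": 0.82, "published": date(2023, 3, 1)},
    {"id": "new-report", "similarity": 0.78, "published": today - timedelta(days=28)},
]
ranked = rerank(hits, today)
```

The weight and half-life are tuning knobs: a high `recency_weight` can bury an older document that is still the authoritative answer, which is one reason to track temporal errors in evaluation.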
While evaluation is often neglected, Chu recommends building datasets from real user queries, tracking failure modes such as routing and temporal errors, and using statistical techniques to validate improvements. Real queries often provide more value than synthetic datasets.
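The article does not say which statistical techniques were used. One common option is a paired bootstrap over per-query results, sketched below under that assumption: each query is scored 1 (correct) or 0 (incorrect) for both the baseline and the candidate system, and the function name `paired_bootstrap` is hypothetical.

```python
import random


def paired_bootstrap(
    baseline: list[float],
    candidate: list[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean improvement, lower, upper) for an approximate 95% interval.

    baseline[i] and candidate[i] must be scores for the same query i.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")

    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(baseline)
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = sum(diffs) / n

    # Resample queries with replacement and record the mean difference each time.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return observed, lower, upper
```

If the interval excludes zero, the improvement is unlikely to be noise from the particular set of queries. With a small evaluation set the interval will be wide, which is itself useful to know before shipping a change.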
The key lessons are that search quality depends on accurate document parsing and indexing, that chunking strategies need to be tested and validated on real data, and that retrieval needs to consider signals beyond textual similarity, such as temporal relevance. Chu also noted that while agent-based architectures can add functionality, they introduce additional complexity, and that a robust, structured evaluation framework is essential for reliable performance in production AI systems.
