If you have been even remotely active in the booming area of AI-powered software, you have heard the term RAG or one of its newer, more nuanced variants. In this short article we try to answer some of the most frequent questions we get from our clients: Can RAG be trusted as part of an AI-powered app architecture? And why do many RAG-based tools fall short of delivering on their promises, while others readily become indispensable?
What is RAG?
RAG, short for retrieval-augmented generation, is an architectural pattern that combines a series of tools, principles, and reusable patterns for designing AI-powered software that can use an existing knowledge base to answer user questions. As the name suggests, a major component of such a design is retrieval. When a user asks the model a question, a reworded or reconstructed version of that question (a query) is used to search troves of textual (and sometimes image) data, and the most relevant pieces of information are returned from this search. Finally, the returned pieces of information are fed to an LLM (large language model) to generate a final response. The retrieved information enables what is generally referred to as the “grounding in truth” part of generation.
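The query-retrieve-generate loop can be sketched in a few lines of Python. This is only an illustration: the `embed` function here is a deliberately naive bag-of-words stand-in for a real embedding model, the documents are made up, and the final LLM call is indicated only in a comment.

```python
from collections import Counter
import math

# Toy knowledge base; in practice these would be chunks of real documents.
documents = [
    "California has a Mediterranean climate with dry summers.",
    "The company expense policy requires receipts for all purchases.",
    "Transformers use attention heads to weigh parts of the input.",
]

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words vector.
    return Counter(text.lower().replace(".", "").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    # Rank every document by similarity to the query and return the top k.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

question = "What is the climate like in California?"
context = retrieve(question)[0]
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# In a real system, `prompt` would now be sent to an LLM to generate the answer.
```

The retrieval step is what grounds the eventual generation: the LLM is handed the most relevant chunk rather than being asked to answer from memory.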
There are also newer variants of the architecture, such as self-augmenting retrieval (SARAG), memory-augmented RAG, and knowledge-enhanced RAG (KE-RAG), and much research has been done on the individual RAG components as well (e.g. the HyDE technique). All these variants and alternative component choices adhere very closely to the same original idea.
RAG is a great replacement for traditional indexed search, which often relies on a similar query-and-retrieve principle but uses different indexing algorithms. While traditional indexing retrieves data based on keyword search (or some variant of it), in RAG much attention is paid to how the textual data is divided up into chunks (chunking or partitioning), how those chunks are stored as numerical representations (embedding), and what type of metric is used to compare the similarity of these embeddings (e.g. cosine similarity).
The advantage of RAG over traditional indexing is that it enables searching based on the semantics of the query rather than on exact word choice. In a traditional keyword-based search, asking “Describe the general weather in California” cannot retrieve information that mentions only the keyword “climate”, unless a good tagging practice has been put in place, which requires additional machine-learning models or extensive human input.
Technically, embedding a phrase asking about the weather in California will not, by default, match words in semantic proximity to weather, such as climate, either. But if the chunking step is done right, the numerical representation of the sentence as a whole ends up at coordinates close to the available information about climate. A naive explanation is that the similar sentences may share other words in common, but it works rather more subtly than that, and a full explanation requires some fairly advanced probability arguments.
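The similarity comparison itself is straightforward. In the sketch below the vectors are made up for illustration: real embedding models produce vectors with hundreds or thousands of dimensions, and the numbers here are simply chosen so that the weather query lands near the climate chunk and far from an unrelated one.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative 4-dimensional "embeddings" (values are made up).
weather_query   = np.array([0.9, 0.8, 0.1, 0.0])
climate_chunk   = np.array([0.8, 0.9, 0.2, 0.1])
hr_policy_chunk = np.array([0.0, 0.1, 0.9, 0.9])

print(cosine_similarity(weather_query, climate_chunk))    # high, ~0.99
print(cosine_similarity(weather_query, hr_policy_chunk))  # low, ~0.11
```

With a real embedding model, a question about "weather in California" and a passage about "California's climate" end up close in exactly this sense, even with no keywords in common.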
We have just described the retrieval part of the architecture. Attach an LLM to a well-designed retrieval pipeline and we have ourselves a RAG. Much of the software architecture, and much of what separates a good RAG from a not-so-good one, lies in using suitable embedding algorithms, good storage services, the right similarity search calculation, and prompting the LLM correctly. The type of LLM used is another very important factor in the design. Beyond that, the rest is really a game of software and data engineering that makes a RAG work at scale and economically.
Why does RAG sometimes not work?
A successful implementation of RAG requires a good understanding of the data science behind its components, but making it work at scale also requires a great deal of data and software engineering craftsmanship. Currently, many of the product offerings that use this architecture come from teams that are great at one of these areas but not necessarily a mix of both.
A lack of data science background results in products that may use state-of-the-art DevOps and GitOps tools and practices, with highly available and scalable cloud computing resources, but miss the core components: the chunking, prompting, embedding, and similarity search algorithms are often overlooked in this scenario. And if there is ever a need for a properly fine-tuned model in part of the retrieval pipeline, creating such a model is well beyond the expertise of these teams.
One place in particular where we have seen major flaws is the chunking and partitioning step. The newer LLM providers often boast of models with rapidly growing context windows, which encourages teams to spend less time on chunking. We have seen designs where information is fed to the models 10 to even 50 standard pages at a time. While models can technically take in this much information before the generation step, the fundamental science behind the transformer architecture, and the way its “attention heads” work, has remained the same, which limits the scope of what the model can actually pay attention to. The model ends up unable to attend to the entirety of the supplied data, and critical pieces are left out of the generated response; the omitted information may be exactly the key piece the user was looking for.
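A simple fixed-size chunker with overlap is one of many possible strategies; the chunk size and overlap below are arbitrary, and real pipelines tune them per corpus (often splitting on sentence or section boundaries instead of raw word counts).

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into fixed-size word chunks that overlap, so a sentence
    cut at one chunk's boundary still appears whole in the next chunk."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 500-word document becomes 3 overlapping chunks of at most 200 words.
doc = " ".join(f"word{i}" for i in range(500))
chunks = chunk_words(doc)
print(len(chunks))  # 3
```

Each chunk, not the whole document, is what gets embedded and later retrieved, which keeps the context handed to the model small enough for its attention to cover.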
On the other hand, you have data-science-heavy teams that can deliver well-optimized embedding and retrieval, with a deep understanding of what factors affect proximity in the embedding space. They might even be able to fine-tune (or perhaps build from scratch) language models that are better at reconstructing prompts deep in the retrieval pipeline than off-the-shelf LLMs. Thanks to AI-enhanced software development tools, these teams can now deliver a prototype or even an MVP to market quickly and serve it from a cloud provider. But as soon as real-world clients and users start interacting with the software, cracks start to appear, and it is a matter of days before it gives up completely. Or, in another very common scenario, a complete oversight in FinOps ends up hitting the clients with massive cloud compute, storage, and API usage bills, or the startup goes bankrupt in a matter of weeks, unable to serve its clients.
What does a successful implementation of RAG look like?
As mentioned, RAG is a great substitute for traditional indexed search. Teams that use RAG-based software enjoy reliable information retrieval together with the semantic understanding and human-like response generation that LLMs offer. A well-designed RAG tool gives instant access to troves of internal documents, client data, and reliable information on the web. This also requires limiting the model's access to reliable information only, and filtering out obsolete or unreliable sources (remember that much of the data baked into LLMs' neural networks comes from Reddit and from blog posts and comments).
Besides instant access to massive amounts of reliable information, and a very easy, human-friendly way to search and retrieve it, the RAG architecture offers another big advantage (or rather addresses one of the big disadvantages of using naked LLMs): it mitigates hallucinations, which are often rooted in a model's lack of contextual information about a given topic, or in the model trying to fill a void in the knowledge baked into its neural network.
What else to watch out for when utilizing RAG-based software?
As we mentioned, at the heart of any RAG-architecture tool an LLM still sits, playing the pivotal role of generating the response. Although a good implementation of the RAG architecture can greatly reduce the chances of model hallucination, without additional safeguards (e.g. proper prompting), hallucination can still happen.
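One such safeguard is a prompt that explicitly confines the model to the retrieved context and gives it permission to refuse. The template below is only an illustration of the idea, not a canonical form; the exact wording and refusal string are assumptions you would tune for your own model.

```python
def build_grounded_prompt(question, retrieved_chunks):
    # Number the chunks so the model can cite them, and instruct it to
    # refuse rather than improvise when the context is insufficient.
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    return (
        "Answer the question using ONLY the numbered context below. "
        "Cite the chunk numbers you used. If the context does not contain "
        "the answer, reply exactly: \"I don't know based on the provided documents.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_grounded_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days of purchase.",
     "Shipping is free on orders over $50."],
)
```

Telling the model what to do when the context is insufficient is as important as telling it to use the context: without an explicit escape hatch, many models will improvise an answer anyway.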
The vast majority of products implementing the RAG architecture (or LLMs in any other form) use managed LLM services from major providers like OpenAI, Anthropic, Google, etc. As such, attention needs to be paid to their terms of service where the sharing and use of client information is concerned, as well as to usage costs and any usage quotas or caps.
