When RAG Lacks Context
How to use Contextual Chunk Headers to get better results from RAG systems.
Overview
In this article, we’ll explore one of the common pitfalls of “naive” RAG (Retrieval-Augmented Generation) implementations: the problem that arises when we chop a document into chunks, embed them individually, and then realize those chunks often lack the crucial context a RAG system needs to work its magic.
Picture this: a financial report packed with numbers about earnings, profits, and losses. Sounds useful, right? But here’s the catch—depending on how the report is structured, a given section might not tell you which year, quarter, or department those figures apply to. A small but critical omission, and suddenly your perfectly good data becomes… well, not so useful.
This issue crops up frequently in medium to large documents. Fortunately, there’s a straightforward and effective solution: Contextual Chunk Headers. With this clever technique, you can dramatically boost the accuracy of RAG-generated responses in situations like these.
In the sections ahead, we’ll dive deeper into the problem, showcase how Contextual Chunk Headers save the day, weigh the tradeoffs of using this approach, and point you toward further resources for exploration. Ready to level up your RAG game? Let’s go!
In the following examples, we’ll use code from our RAG-Base project and the accompanying notebook, “contextual-chunk-headers-demo.ipynb”.
If you’d like to try it out yourself, feel free to clone the repository and run the notebook. Instructions are available in both the project’s README.md and our previous post about creating a chatbot API.
The Problem
Let’s kick things off with a concrete example. Imagine you’ve built a shiny new RAG application designed to sift through the New York City Planning Commission’s protocols¹. The goal? To chat with an AI model about all things city planning. Want to find new places to relax? Check. Curious about subway improvements? Double check. Anything remotely related to city development? This app’s got you covered.
Here’s how it works. Say you ask:
message = 'What are some new places I can use to relax in the city?'
async for chat_msg in llm_agent.astream_events(message, user_config):
    if chat_msg.type == LLMEventType.CHAT_CHUNK:
        print(chat_msg.content, end='')
And voilà! The system responds:
Based on the information from the sources, here are some new places in the city where you can relax:

1. Pershing Square Plaza: Located on the west side of Park Avenue between East 41st and East 42nd streets, this plaza includes seating, plantings, lighting, and a drinking fountain, providing a serene spot for relaxation.
2. Vanderbilt Plaza: A new plaza planned for the public space located on former Vanderbilt Avenue between East 42nd and East 43rd streets, offering a tranquil environment for unwinding.
3. East 53rd Street Corridor Improvements: Streetscape enhancements along five blocks of East 53rd Street between Second Avenue and Fifth Avenue, featuring improved circulation, seating, and greenery, providing a peaceful setting for relaxation.
4. Shared Streets: The concept of shared streets designed for slow travel speeds where pedestrians, cyclists, and motorists share the right of way, offering a calm and pedestrian-friendly atmosphere for relaxation and leisure.

These new places in the city aim to provide residents and visitors with peaceful and inviting environments to unwind and relax amidst the urban hustle and bustle.
Not bad, right? The system efficiently pulls relevant data from the embedded corpus and gives you actionable insights.
But what happens when you ask something trickier? Let’s say you’re interested in projects that required an environmental review. So, you ask:
Can you give a list of environmental reviews and the associated projects?
The bot responds (snippet shortened for brevity):
Here's a comprehensive list of environmental reviews and their associated projects:

1. Project: Unnamed (160254a)
   CEQR Number: 16DCP136M
   ...
2. Project: Unnamed (110285)
   CEQR Number: 10DCP004M
   ...
3. Project: C 100049 ZSM, N 100048 ZRM, C 100050 ZSM, and C 100237 PQM
   CEQR Number: 09DCP019M
   ...
4. Project: N 130247(A) ZRM, C 130248 ZMM, and N 130247 ZRM
   CEQR Number: 13DCP011M
   ...
Here’s where things get interesting—and by “interesting,” I mean a little frustrating. While the chatbot can sometimes scrape out the filing numbers for the reviews, it can’t consistently identify the actual projects associated with those reviews.
Lack of Context
So, why did that happen? To understand the issue better, let’s peek under the hood at the search the bot performs in the vector database. Thankfully, it’s straightforward to extract:
{'sender': 'system', 'content': 'Searching Vector DB', 'payload': {'search_query': 'list of environmental reviews and associated projects'}}
And here’s a snippet of the search result:
… {'source_name': '200102.pdf', 'source_id': '200102', 'modified': '2024-11-19T13:14:24.508171', 'content': 'ENVIRONMENTAL REVIEW \nThis application (C 200102 ZMM), in conjunction with the applications for the related action (N \n200107 ZRM), was reviewed pursuant to the New York State Environmental Quality Review \nAct (SEQRA), and the SEQRA regulations set forth in Volume 6 of the New York Code of \nRules and Regulations, Section 617.00 et seq. and the City Environmental Quality Review Rules \n \n \n7 C 200102 ZMM \nof Procedure of 1991 and Executive Order No. 91 of 1977. The lead is the City Planning \nCommission. The designated CEQR number is 20DCP058M. \n \nAfter a study of the potential environmental impacts of the proposed actions, a Negative \nDeclaration was issued on October 28, 2019. Following certification, a Revised Environmental \nAssessment Statement (EAS) dated January 15, 2020 was issued that included edits to the \nHistoric and Cultural Resources narrative, figures and tables, for clarification purposes. The \nRevised Negative Declaration, issued on January 21, 2020, supersedes the Negative Declaration \nissued on October 28, 2019. The conclusions of the original Negative Declaration, which found'} …
At first glance, this result seems promising—it has an application ID and some environmental review details. But on closer inspection, there’s a glaring omission: there’s no mention of the actual project associated with this review.
This is a common issue with RAG systems: while the information is embedded in the document, it’s often scattered across sections. When we divide the document into chunks, the connections between the chunks—and the crucial context—are lost. So even though the vector database returns the “correct” blob of information, it doesn’t include enough data to establish what that information relates to.
And this lack of context can lead to issues such as:
- Incomplete results: The system finds part of the answer but leaves out key details.
- Erroneous responses: The AI may try to infer connections where none exist, leading to inaccuracies.
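You can reproduce the problem by splitting one of the protocol PDFs naively and inspecting the chunks. Here's a minimal sketch using LangChain's PyPDFLoader and RecursiveCharacterTextSplitter (the file path is illustrative):

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Naive pipeline: load and split, with no added context.
pages = PyPDFLoader('protocols/130247a.pdf').load()  # illustrative path
splitter = RecursiveCharacterTextSplitter(chunk_size=1_200, chunk_overlap=200)
chunks = splitter.split_documents(pages)

# The chunk describing the environmental review never names the project
# itself -- that information lives in other chunks, pages away.
for chunk in chunks:
    if 'ENVIRONMENTAL REVIEW' in chunk.page_content:
        print(chunk.page_content[:300])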
Contextual Chunk Headers to the Rescue
Fortunately, the solution to this problem is surprisingly simple: we just need to provide additional context to the chunks we insert into the vector database. This technique, known as Contextual Chunk Headers (CCH), ensures that each chunk includes enough information to retain its connection to the bigger picture. Let’s see how it works in practice.
Here’s how you can implement Contextual Chunk Headers for a single PDF file:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# pdf_path, ChatModel, SUMMARY_PROMPT, max_summary_tokens and vector_db
# are defined in the RAG-Base project and its notebook.

# Load our PDF file and split it into chunks
loader = PyPDFLoader(pdf_path)
pages = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1_200, chunk_overlap=200)
docs = splitter.split_documents(pages)

# Get a short summary of the document content
document_content = ''.join(d.page_content for d in docs)[:5_000]
llm = ChatModel(model_kwargs={'temperature': 0, 'max_tokens': max_summary_tokens})
res = llm.invoke(SUMMARY_PROMPT + document_content)
doc_summary = res.content

# Prepend the summary to each of the doc sections
for doc in docs:
    doc.page_content = f'Context: {doc_summary}\n\nContent: {doc.page_content}'

# Add the enhanced documents to the vector DB, as you normally would.
vector_db.add_documents(documents=docs)
This example is a simplified version to keep things focused. For a more robust implementation—featuring iterator support, flexible parameters, and additional enhancements—check out our repository.
Here are a few things to note about the code above:
- Most of the steps—loading the file, splitting it into chunks, and adding the chunks to the vector DB—are standard parts of the RAG pipeline. These would remain unchanged whether or not you use Contextual Chunk Headers. For more details, check out our previous post on RAG.
- The key difference is the addition of a document summary. We take the first 5,000 characters (or another predefined threshold) and ask an LLM to summarize the content into a concise overview.
- Once we have the summary, we prepend it to the content of each chunk. This ensures that every chunk in the database is context-rich, making it easier to retrieve relevant information and enabling our bot to connect the pieces to the bigger picture.
The prompt used for generating the summary is straightforward:
SUMMARY_PROMPT = """You are tasked with summarizing a document into 3 sentences max.
The summary should be concise and contain the main points of the document.
This is the document: """
By adding this simple step, you transform otherwise isolated chunks into context-rich units, improving the effectiveness of your RAG system significantly.
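Before comparing results, it's worth a quick sanity check: printing one enhanced chunk (continuing from the snippet above, where docs holds the summarized chunks) shows the header in place.

# Inspect one enhanced chunk: the summary header now precedes the
# original chunk text, so the chunk is self-describing.
print(docs[0].page_content[:400])

# Expected shape (actual text varies by document):
# Context: <3-sentence document summary>
#
# Content: <original chunk text>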
Comparing Results
Let’s ask our chatbot the same question as before and compare the results:
Can you give a list of environmental reviews and the associated projects?
Here's a list of environmental reviews and their associated projects based on the information provided:

1. East Midtown Rezoning (CEQR number: 13DCP011M)
   Project: Zoning text amendment and map amendment for the East Midtown business district in Manhattan, including establishing a new East Midtown Subdistrict.
2. One Vanderbilt Development (CEQR number: 14DCP188M)
   Project: Special permit application for the development of a commercial building in Manhattan, including public realm improvements and transfer of development rights.
3. Theater Subdistrict Zoning Amendment (CEQR number: 16DCP136M)
   Project: Amendment to the Zoning Resolution for the Special Midtown District to modify regulations on the transfer of development rights from listed theaters in Manhattan.
4. M1-6D Zoning District Creation (CEQR number: 10DCP004M)
   Project: Creation of a new zoning district (M1-6D) to facilitate high-density, mixed-use areas with affordable housing opportunities while preserving existing commercial and industrial spaces.
As you can see, the updated responses now include the names of the associated projects. This is a significant improvement over the earlier results, where the chatbot could only sometimes provide filing numbers without project details.
Why the Results Improved
The difference lies in the additional context we provided using Contextual Chunk Headers. By appending a summary to each chunk, we ensured the LLM had the necessary background to associate content from the vector database with the correct projects. For instance, let’s take the first bullet point and see what the vector database returned for it:
{
'source_name': '130247a.pdf',
'source_id': '130247a',
'modified': '2024-11-13T10:01:14.090973',
'content': 'Context: The City Planning Commission is considering a zoning text amendment and map amendment for the East Midtown business district in Manhattan. The proposed changes aim to protect and strengthen the area, including establishing a new East Midtown Subdistrict. The project area includes major office buildings, Grand Central Terminal, and significant transit infrastructure.\n\nContent: ENVIRONMENTAL REVIEW \n \nThis application (N 130247(A) ZRM), in conjunction with the applications for the related actions \n(C 130248 ZMM and N 130247 ZRM) was reviewed pursuant to the New York State \nEnvironmental Quality Review Act (SEQRA), and the SEQRA regulations set forth in Volume 6 \nof the New York Code of Rules and Regulations, Section 617.00 et seq. and the New York City \nEnvironmental Quality Review (CEQR) Rules of Procedure of 1991 and Executive Order No. 91 \nof 1977. The designated CEQR number is 13DCP011M. The lead is the City Planning \nCommission. \n \nIt was determined that the Department’s proposal may have a significant effect on the \nenvironment. A Positive Declaration was issued on August 27, 2012, and distributed, published \nand filed. Together with the Positive Declaration, a Draft Scope of Work for the Draft \nSupplemental Environmental Impact Statement (DEIS) was issued on August 27, 2012. A public'
}
Without the added summary (the Context field), the LLM wouldn’t have been able to link the environmental review details to the specific project it pertains to. The summary provides the critical connective tissue, allowing the LLM to see the bigger picture.
By embedding this additional context into each chunk, we bridge the gap between isolated pieces of information and the overarching document, enabling the chatbot to deliver more complete and accurate responses.
Drawbacks
As with most things in life, adding additional context comes with some tradeoffs. While these are usually minor for most applications, it’s worth discussing them to give you a complete picture:
Slower and Costlier Insertion Process: Moving the original data through an LLM for summarization slows down the insertion process and can incur additional costs. However, for many applications, insertion is a one-time or infrequent operation compared to inference, so the added cost is generally manageable.
That said, there are ways to mitigate this:
- Use a simpler, faster, and cheaper model for summarization (see the sketch after this list).
- Summarize only the documents/sections that will benefit most from the additional context (e.g., only long documents, financial documents, etc.).
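For example, here's a minimal sketch of routing summarization to a lighter model while leaving the chat model untouched. Note that the model_name parameter and identifier below are hypothetical; check how the ChatModel wrapper in your setup actually selects its underlying model:

# Use a lighter model just for summarization; the chat model is unchanged.
summary_llm = ChatModel(
    model_name='your-small-cheap-model',  # hypothetical identifier
    model_kwargs={'temperature': 0, 'max_tokens': max_summary_tokens},
)
res = summary_llm.invoke(SUMMARY_PROMPT + document_content)
doc_summary = res.content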
Larger Entries in the Vector Database: Each entry in the vector database becomes larger due to the added summary. This can make retrieval less concise, as the model receives more data when searching for answers. As a result:
- The model may lose focus on the most relevant content.
- The additional tokens sent to the model could lead to higher inference costs (a quick way to estimate the overhead is sketched after this list).
- The context history may be truncated sooner than it otherwise would be.
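To get a feel for that overhead, you can measure the header directly. This sketch assumes tiktoken with an OpenAI-style tokenizer, and doc_summary from the earlier snippet:

import tiktoken

# Count the tokens the contextual header adds to every chunk.
enc = tiktoken.get_encoding('cl100k_base')
header = f'Context: {doc_summary}\n\nContent: '
print(f'Each chunk grows by roughly {len(enc.encode(header))} tokens.')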
Potential Bias Toward Larger Documents: When a long document has many chunks containing specific keywords, it might dominate the vector search results. This is particularly likely if the document summary (the context) includes those relevant keywords. Even if many chunks within the long document are not directly relevant to the search query, the shared context may cause them to be retrieved anyway. This can lead to redundancy in the results and overshadow other, more relevant chunks from shorter documents.
The impact of these drawbacks can often be minimized by tailoring the approach to your specific use case and requirements. For example:
- Adjust chunk sizes and overlap to strike a balance between context and token efficiency.
- Prioritize which documents or sections need summarization.
- Experiment with vector search configurations, such as ranking models or filtering strategies, to reduce bias toward larger documents (an example follows below).
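For instance, if your vector store is accessed through LangChain, Maximal Marginal Relevance (MMR) search is one readily available option: it penalizes near-duplicate results, so a long document's many similar chunks are less likely to crowd out other sources. A sketch, assuming vector_db is a LangChain-compatible store:

# Retrieve a diverse result set: fetch 20 candidates, then return the 4
# that best balance query relevance against similarity to one another.
retriever = vector_db.as_retriever(
    search_type='mmr',
    search_kwargs={'k': 4, 'fetch_k': 20},
)
results = retriever.invoke('list of environmental reviews and associated projects')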
Ultimately, while these tradeoffs exist, they are generally outweighed by the improved retrieval accuracy and contextual richness provided by Contextual Chunk Headers.
Final Thoughts
As we’ve seen, adding a bit of context to each chunk can dramatically enhance the performance of a RAG system. This technique is versatile and applicable to a wide range of documents, from financial reports and contracts to forum posts and replies. For the latter, context can often be extracted directly without needing an LLM, making this approach even more efficient.
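For instance, a forum post's header can often be assembled straight from its metadata. The sketch below is illustrative, with hypothetical field names you'd adapt to your own schema:

# Build a contextual header from structured metadata -- no LLM required.
# The field names in 'post' are hypothetical; adapt them to your schema.
def make_forum_header(post: dict) -> str:
    return (
        f"Context: Reply in thread '{post['thread_title']}' "
        f"by {post['author']} on {post['date']}.\n\nContent: "
    )

post = {  # illustrative record
    'thread_title': 'Quiet public plazas in Midtown?',
    'author': 'jdoe',
    'date': '2024-11-19',
    'body': 'Pershing Square Plaza just added seating and plantings...',
}
chunk_text = make_forum_header(post) + post['body']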
While this method offers significant benefits, it’s important to weigh the additional processing costs against the potential improvements. If cost is a concern, there are usually ways to strike a balance—whether by optimizing your LLM usage, summarizing only critical sections, or exploring alternative summarization models.
Finally, for those who want to try it out, a Jupyter notebook demonstrating this approach is available here as part of our RAG-Base project. The notebook includes all the code you’ll need to implement and maintain this technique in your own system.
¹ To get access to these protocols, you can go to NYC CPC’s site. Or use this link to download the subset of the protocol files that were used in our example.