Open-Sourcing Deterministic Rule-based Retrieval for RAG

Chia Jeng Yang
WhyHow.AI
Mar 19, 2024

Hello world! WhyHow.AI builds developer tools to bring more accuracy and determinism to AI solutions, including graph technologies. Today, we’re open-sourcing our rule-based vector retrieval package, a simple tool that helps users build LLM retrieval workflows with more accuracy and precision. This tool does not require graphs to work, but as we describe later, in our opinion it is most powerful when augmented with graph structures.

Why we built this

Developers frequently tell us that they know exactly where to find the answer to a question within their raw data, but for some reason, their RAG solution is not pulling in the right chunks. This is a frustrating problem that is especially challenging to fix given the black-box nature of retrieval and LLM response generation.

Although vector similarity searches return highly relevant data, developers must still contend with LLM hallucinations and their RAG systems sometimes failing to return the data that is most relevant to the problem they are trying to solve. Perhaps a user’s query is poorly phrased and it yields bad results from a vector database, as the index may store a lot of semantically similar data. Or maybe you want to include response data that is semantically dissimilar to the query embedding, but is still contextually relevant to building a complete, well-rounded response to a particular user query. In these cases, it helps to have more determinism and control in the chunks of raw data that you are retrieving in your RAG pipeline.

Thus, we developed a rule-based retrieval solution whereby developers can define rules and map them to a set of chunks they care about, giving them more control in their retrieval workflow.

How does it work?

The rule-based retrieval SDK does a few things for the user:

  • Index & namespace creation — the SDK creates a Pinecone index and namespace on your behalf. This is where chunk embeddings will be stored.
  • Splitting, chunking, and embedding — when you upload PDF documents, the SDK automatically splits, chunks, and embeds each document before upserting it into the Pinecone index. We’re leveraging LangChain’s PyPDFLoader and RecursiveCharacterTextSplitter for PDF processing, metadata extraction, and chunking, and OpenAI’s text-embedding-3-small model for embedding (see the sketch after this list).
  • Auto-filtering — using a set of rules defined by the user, we automatically build a metadata filter to narrow the query being run against the Pinecone index. Pinecone’s single-stage filtering allows us to perform fast (<100ms) queries while producing accurate results per the rules specified by the user.
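
To make the splitting, chunking, and embedding step concrete, here is a minimal sketch of what that pipeline can look like using the libraries mentioned above. This is an illustration rather than the SDK’s internal code; the function name upsert_pdf, the chunk-size settings, and the metadata key names other than filename and page_number are assumptions.

import os

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

def upsert_pdf(index_name: str, namespace: str, path: str) -> None:
    """Split a PDF into chunks, embed each chunk, and upsert it with metadata."""
    pages = PyPDFLoader(path).load()  # one Document per page, with page metadata
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_documents(pages)

    # Embed every chunk in one batch with OpenAI's text-embedding-3-small model.
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[chunk.page_content for chunk in chunks],
    )

    vectors = [
        {
            "id": f"{path}-{i}",  # sequential chunk id
            "values": item.embedding,
            "metadata": {
                "text": chunk.page_content,            # key name is an assumption
                "page_number": chunk.metadata["page"],
                "chunk_id": i,                         # key name is an assumption
                "filename": path,
            },
        }
        for i, (chunk, item) in enumerate(zip(chunks, response.data))
    ]
    # Assumes the index and namespace already exist (the SDK creates them for you).
    pc.Index(index_name).upsert(vectors=vectors, namespace=namespace)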

To use the package, users just need to define their OpenAI and Pinecone API keys as environment variables, select a name for their Pinecone index and namespace, and specify the PDF files they wish to upload.

# OPENAI_API_KEY and PINECONE_API_KEY are read from environment variables
index_name = "whyhow-demo-index"
namespace = "legal-documents"
pdfs = ["LPA.pdf", "side_letter_investor_1.pdf"]

When you upload your documents, the PDFs are automatically split and chunked, and high-level document information is extracted and added to the chunk in the form of Pinecone metadata (chunk text, page number, sequential chunk ids, and document filename).
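
For illustration, a single stored chunk might look roughly like the record below. The filename and page_number keys match the metadata filters used later in this post; the exact names of the text and chunk-id fields are assumptions.

# Illustrative shape of one upserted vector; values are truncated and the
# "text" / "chunk_id" key names are assumptions.
{
    "id": "LPA.pdf-17",                # sequential chunk id
    "values": [0.012, -0.034, 0.027],  # embedding (1536 dimensions in full)
    "metadata": {
        "text": "…chunk text from page 41 of the LPA…",
        "page_number": 41,
        "chunk_id": 17,
        "filename": "LPA.pdf",
    },
}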

Now that this information is part of the Pinecone vector, we can use metadata filters to run pointed queries against these vectors. We define metadata filter rules by specifying the filename and the page numbers we want to include in a given query, and add them to a list to be passed into the client. This abstraction is meant to offer a simple, intuitive way of building and grouping rules so they can be managed and applied to different types of queries.

We allow you to add optional keywords which will automatically trigger the rule if any of the keywords are detected in the question, as long as keyword_trigger is set to true. In the code snippet below, if the question is "what are the rights of client 1?", both of the rules will be triggered and the filters will be applied. If keyword_trigger is set to false, then all of the rules specified will be applied by default (a sketch of this triggering logic follows the snippet). This is a very simple application of keyword detection and triggering, but you can easily extend this type of rule automation using the semantic reasoning capabilities provided by LLMs and supporting solutions like knowledge graphs.

rules = [
    Rule(
        filename="LPA.pdf",
        page_numbers=[41, 42, 43],
        keywords=["rights of client 1"]
    ),
    Rule(
        filename="side_letter_investor_1.pdf",
        page_numbers=[2],
        keywords=["rights", "client 1", "rights of client 1"]
    )
]
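
As a rough sketch of the keyword triggering described above (an illustration of the idea, not the package’s internal code), rule selection can be thought of as:

# Minimal sketch of keyword-based rule triggering; illustrative only.
def select_rules(question: str, rules: list, keyword_trigger: bool) -> list:
    """Return the rules whose filters should be applied to this question."""
    if not keyword_trigger:
        return rules  # apply every rule by default
    question_lower = question.lower()
    return [
        rule for rule in rules
        if any(keyword.lower() in question_lower for keyword in rule.keywords)
    ]

# "what are the rights of client 1?" contains keywords from both rules above,
# so both filters would be applied.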

When the query is run, the client will generate an embedding for the question text and run a vector similarity search using the embedding and the filters. Depending on whether or not you have enabled include_all_rules, the filters will be applied in one of two ways.

If include_all_rules is false, all the filters will be concatenated into a single filter which is applied to a single Pinecone vector search. In this case, you’re asking Pinecone, “find me the most semantically relevant chunks given that they meet one of the conditions defined in this filter.”

{'$or': [
    {'$and': [
        {'filename': {'$eq': 'LPA.pdf'}},
        {'page_number': {'$in': [41, 42, 43]}}
    ]},
    {'$and': [
        {'filename': {'$eq': 'side_letter_investor_1.pdf'}},
        {'page_number': {'$in': [2]}}
    ]}
]}
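
For illustration, a combined filter like the one above can be assembled from the rule definitions roughly as follows (a sketch of the idea, not the SDK’s exact code):

# Sketch of building the concatenated '$or' filter from a list of rules.
def combine_filters(rules: list) -> dict:
    per_rule_filters = [
        {
            "$and": [
                {"filename": {"$eq": rule.filename}},
                {"page_number": {"$in": rule.page_numbers}},
            ]
        }
        for rule in rules
    ]
    return {"$or": per_rule_filters}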

Now, although you may have asked Pinecone to retrieve information from a certain set of pages, depending on the number of rules you have, the results of the vector similarity search, and the size of your top_k, the search may not return results from some of the pages you’ve specified. But what if you want to ensure that data from every page is sent to the LLM?

If include_all_rules is set to true, a separate Pinecone query will be run for each filter. In this case, for each rule you’ve built you’re asking Pinecone, “find me the most semantically relevant chunks on these pages…done?…cool…now do it for the next set of pages,” and so on. Instead of concatenating all the rules into a single filter, we run multiple vector similarity searches (one per rule) and concatenate all the matches into a single output to be sent to the LLM. With this strategy, we can guarantee the results will include information from each set of pages in your rule set; however, the downside is that you may be pulling in less semantically relevant information than if you had queried the entire vector database using the concatenated filter.

if include_all_rules:
    texts = []

    for filter in filters:
        query_response = index.query(
            namespace=namespace,
            top_k=top_k,
            vector=question_embedding,
            filter=filter,
            include_metadata=True,
        )

        . . .

Depending on the number of rules you use in your query, you may return more chunks than your LLM’s context window can handle. Be mindful of your model’s token limits and adjust your top_k and rule count accordingly.
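
One simple guard is to trim the combined matches to a token budget before building the prompt. Here is a minimal sketch using tiktoken; the budget and the encoding choice are assumptions you would tune for your model.

# Keep adding retrieved chunks until a chosen token budget is reached.
import tiktoken

def trim_to_budget(chunk_texts: list[str], max_tokens: int = 12_000) -> list[str]:
    encoding = tiktoken.get_encoding("cl100k_base")  # approximation for GPT-4-class models
    kept, used = [], 0
    for text in chunk_texts:
        n_tokens = len(encoding.encode(text))
        if used + n_tokens > max_tokens:
            break
        kept.append(text)
        used += n_tokens
    return kept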

Example

Imagine you’re a legal professional at an investment fund, and you’re trying to understand the rights and obligations of your various investors. To do this, you want to perform RAG on two key documents: a limited partnership agreement (LPA), which details the general terms of the partnership between you and your investors, and a side letter, which details additional investor rights and privileges not mentioned in the LPA.

When you search "what are the rights of investor 1?", your RAG system returns some relevant results, but it also returns raw data from other investors’ side letters and from irrelevant pages of the LPA, causing the LLM to hallucinate and return an inaccurate response.

INFO:querying:Index whyhow exists
INFO:whyhow.rag:Raw rules: []
INFO:querying:Answer: The provided contexts do not contain specific
information about the rights of client 3.

To fix this, you can write rules that mimic the workflow domain experts typically follow for this use case. First, you check the specific rights and privileges that have been explicitly granted to the investor in the side letter, which are specifically mentioned on page 2. Then, you check the LPA pages that pertain to the specific investor rights you care about; in this case, page 41 of the LPA.

INFO:querying:Index whyhow exists

INFO:whyhow.rag:Raw rules: [Rule(filename='side_letter_client_3.pdf',
uuid=None, page_numbers=[1], keywords=[]),
Rule(filename='LPA.pdf', uuid=None, page_numbers=[41],
keywords=[])]

INFO:querying:Answer: Access to key financial and operational data
related to their investments, access to a list of the Partnership
Investments valued at fair value, extended reporting, advisory
board

When we add these rules, we perform queries that are scoped to a much narrower set of chunks, which increases the likelihood of retrieving relevant data and generating an accurate response. With some more tuning of the prompt and query, we can continue to improve the output generated by the SDK.

It may not be necessary to create a rule for every extraction. Rules can be implemented only for particularly tricky questions, or for questions that repeatedly suffer from failed retrievals. However, some of our design partners have implemented rule extraction for all questions, even simple ones, for the peace of mind that a deterministic system in production brings.

What about a multi-agent approach?

Another way of handling complex questions across multiple documents or sections is to use an LLM to break a question into multiple pieces, then use another agent to select the most appropriate index/namespace for the query, or to add filters to the vector search based on the question. Relying solely on multi-agent systems like this can take a lot of time to set up and adds another layer of probability to your workflow, making RAG pipelines even more difficult to troubleshoot and maintain. Building and managing retrieval rules with an abstraction like the one provided in this package, either on its own or alongside a multi-agent system, makes it much easier to understand and predict the behavior of your system.

The drawback of deterministic retrieval is the rigidity it introduces into your workflow. It is important to remain aware of this limitation and use deterministic retrieval sparingly, reserving it for use cases where you want a high degree of control.

Why start with pages?

This SDK is meant to demonstrate where we’re going, and pages are just the first step. In order to offer even greater precision to the developer, a natural evolution of this pattern is to deterministically retrieve vector embeddings within document sections or even by specifying individual chunk IDs. Over time, we want to build incremental tooling to empower developers with more control in their RAG pipelines, and this is step one.

At scale, we believe this type of feature is best delivered alongside a knowledge graph. For example, we can write rules to automatically retrieve chunks of raw text or particular vector embeddings when a graph node is accessed by a graph query. By linking graph nodes to collections of vector embeddings or raw text chunks, we can leverage the best of several solutions: semantic reasoning using a well-structured graph, highly predictable, deterministic retrieval of raw text, and well-scoped vector similarity search using a vector store. We’re strong proponents of a hybrid approach to RAG that leverages vector databases alongside KGs.
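
As a rough illustration of that hybrid pattern (a sketch of the idea, not a WhyHow API): a graph node can carry the IDs of the chunks it was derived from, so a graph hit can trigger a deterministic fetch of exactly those chunks from the vector store.

# Illustrative only: map graph nodes to Pinecone vector ids and fetch the
# linked chunks directly when a graph query returns those nodes. The node ids,
# chunk id format, and "text" metadata key are assumptions.
node_to_chunk_ids = {
    "investor_1_rights": ["side_letter_investor_1.pdf-3", "LPA.pdf-112"],
}

def chunks_for_nodes(index, namespace: str, node_ids: list[str]) -> list[str]:
    chunk_ids = [cid for node in node_ids for cid in node_to_chunk_ids.get(node, [])]
    fetched = index.fetch(ids=chunk_ids, namespace=namespace)
    return [vec.metadata["text"] for vec in fetched.vectors.values()]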

What’s next

Looking forward, there are several features we’d like to add to improve the effectiveness and usability of deterministic retrieval workflows:

  • Integration with knowledge graphs

WhyHow.AI is streamlining knowledge graph creation and integration to bring more determinism and accuracy to RAG. While you can use this SDK as a standalone solution for building, customizing, and managing rules for your unstructured data retrieval, we believe this feature is best implemented alongside a knowledge graph. By deterministically linking text chunks and vector embeddings to graph nodes, developers can benefit from the best of multiple solutions and reliably add trustworthy context to the semantic reasoning they perform in a knowledge graph.

  • Natural language rules

Developers should be able to leverage LLMs to define and enforce deterministic retrieval rules using the power of natural language. By building a text-to-rule engine, we can enable users to build and manage rules in plain English, and even dynamically create and enable rules at runtime based on the question they’re asking their RAG system.

  • Segmentation by document sections and chunks

While this SDK allows users to write rules to filter retrieval by page numbers, it becomes even more powerful when users can explore and specify individual chunks or document sections for filtering retrieval. This is an extraction and usability problem that we’re excited to solve.

WhyHow.AI is building tools to help developers bring more determinism and control to their RAG pipelines using graph structures. If you’re thinking about, in the process of, or have already incorporated knowledge graphs in RAG, we’d love to chat. On our roadmap, we will be linking deterministic rule-based access with knowledge graphs to ensure deterministic semantic retrieval. To get access to this as one of our design partners, ping us at team@whyhow.ai, or follow our newsletter at WhyHow.AI. Join our discussions about rules, determinism and knowledge graphs in RAG on our newly-created Discord.

Check out the code on our GitHub, and install the package using pip.
