Deterministic Document Structure based Retrieval

Chia Jeng Yang
Published in WhyHow.AI
3 min read · Mar 24, 2024

Last week, we open-sourced our rule-based retrieval Python package, which provides a rule-based abstraction layer for filtered vector similarity retrieval based on page numbers extracted from PDF documents. Page numbers are a good start, but developers need retrieval solutions that map to the structure of their documents. A natural evolution of this pattern is therefore to let developers deterministically retrieve vector embeddings based on document sections.
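
As a rough illustration of what filtered vector similarity retrieval means here, the sketch below first restricts the candidate chunks to a set of page numbers and only then ranks the survivors by cosine similarity. This is not the API of the rule-based retrieval package: the chunk records, the toy two-dimensional vectors, and the `retrieve` function are all hypothetical, and a real pipeline would delegate both steps to a vector database.

```python
import math

# Hypothetical chunk records: an ID, the PDF page it came from, and a toy embedding.
CHUNKS = [
    {"id": "c1", "page": 1, "vector": [1.0, 0.0]},
    {"id": "c2", "page": 2, "vector": [0.9, 0.1]},
    {"id": "c3", "page": 3, "vector": [0.0, 1.0]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector, pages, chunks=CHUNKS, top_k=2):
    """Keep only chunks on the allowed pages, then rank them by similarity."""
    candidates = [c for c in chunks if c["page"] in pages]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_vector, c["vector"]),
        reverse=True,
    )
    return [c["id"] for c in ranked[:top_k]]
```

Calling `retrieve([1.0, 0.0], pages={2, 3})` never returns `c1`, even though it is the closest match to the query, because the page filter is applied before the similarity ranking. That ordering is what makes the result deterministic with respect to the rule.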

The use case we are solving is helping developers navigate to, and extract chunks from, the right sections of the right documents. This is especially relevant for well-structured documents where sections and sub-sections are clearly defined, such as legal documents, academic papers, and financial documents.

We believe that representing this information as a graph is the most intuitive way to understand your documents. More importantly, a graph lets you programmatically traverse and retrieve document data according to your own understanding of how a document is structured. This is the concept we commonly refer to as ‘document hierarchies’. Here, we built on top of AWS, Pinecone, LangChain, and Neo4j.

In this example, we broke a SAFE agreement down into its sections and sub-sections, and the chunks associated with each sub-section. The top-level sections are:

  • Definitions
  • Events
  • Company Representations
  • Investor Representations
  • Miscellaneous
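
A document hierarchy like this one can be sketched as a small tree in which every node holds the chunk IDs attached to it. The snippet below is a minimal in-memory illustration, not the Neo4j graph used in the actual build. The `Node` class, the sub-section names under Events, and all chunk IDs are hypothetical placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A document, section, or sub-section in the hierarchy."""

    title: str
    chunk_ids: list[str] = field(default_factory=list)
    children: list[Node] = field(default_factory=list)

    def find(self, title: str) -> Node | None:
        """Depth-first search for the first node with this title."""
        if self.title == title:
            return self
        for child in self.children:
            match = child.find(title)
            if match is not None:
                return match
        return None

    def all_chunk_ids(self) -> list[str]:
        """Collect chunk IDs from this node and everything beneath it."""
        ids = list(self.chunk_ids)
        for child in self.children:
            ids.extend(child.all_chunk_ids())
        return ids


# Illustrative hierarchy; sub-sections and chunk IDs are made up for the example.
safe = Node("SAFE agreement", children=[
    Node("Definitions", chunk_ids=["chunk-001", "chunk-002"]),
    Node("Events", children=[
        Node("Equity Financing", chunk_ids=["chunk-003", "chunk-004"]),
        Node("Liquidity Event", chunk_ids=["chunk-005", "chunk-006"]),
        Node("Dissolution Event", chunk_ids=["chunk-007"]),
    ]),
    Node("Company Representations", chunk_ids=["chunk-008"]),
    Node("Investor Representations", chunk_ids=["chunk-009"]),
    Node("Miscellaneous", chunk_ids=["chunk-010"]),
])
```

With this structure, `safe.find("Liquidity Event").all_chunk_ids()` returns only the chunks for that sub-section, while `safe.find("Events").all_chunk_ids()` returns the chunks of every sub-section beneath Events. In a graph database the same traversal would be expressed as a query rather than a recursive method.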

With the document structure captured as a graph, and with chunk IDs tied to this semantic structure, we can define rules for the LLM to follow for specific questions, using the rule-based retrieval Python package we open-sourced. A rule may look like: ‘When answering a question about Liquidity Events, refer to the group of chunks in the Liquidity Event section of the SAFE agreement’.
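
To make the idea of a rule concrete, here is a minimal sketch of how a question might be mapped to the chunk IDs of a section before any similarity search runs. It does not use the open-sourced package's API. The `Rule` class, the keyword matching, the `allowed_chunk_ids` function, and the chunk IDs are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical mapping from section title to chunk IDs in the vector store.
SECTION_CHUNKS = {
    "Liquidity Event": ["chunk-005", "chunk-006"],
    "Dissolution Event": ["chunk-007"],
    "Definitions": ["chunk-001", "chunk-002"],
}


@dataclass(frozen=True)
class Rule:
    """If any keyword appears in the question, restrict retrieval to `section`."""

    keywords: tuple
    section: str


RULES = [
    Rule(keywords=("liquidity event",), section="Liquidity Event"),
    Rule(keywords=("dissolution", "wind up"), section="Dissolution Event"),
]


def allowed_chunk_ids(question, rules=RULES, section_chunks=SECTION_CHUNKS):
    """Return the chunk IDs that matching rules allow, in rule order, no duplicates."""
    text = question.lower()
    allowed = []
    for rule in rules:
        if any(keyword in text for keyword in rule.keywords):
            for chunk_id in section_chunks.get(rule.section, []):
                if chunk_id not in allowed:
                    allowed.append(chunk_id)
    return allowed
```

The returned IDs would then be passed as a filter to the vector similarity search, so the LLM only sees chunks from the section the rule names. A question that matches no rule yields an empty list here; whether that should fall back to an unfiltered search is a design choice left to the pipeline.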

If you are interested in Deterministic Document Structure based Retrieval in your RAG pipelines, ping us at team@whyhow.ai to be one of our early access partners for the tooling described in this article.

If you are interested in the open-source rule-based chunk retrieval package that we released earlier, check out the repo here.

WhyHow.AI is building tools to help developers bring more determinism and control to their RAG pipelines using graph structures. If you’re thinking about, in the process of, or have already incorporated knowledge graphs in RAG, we’d love to chat at team@whyhow.ai, or follow our newsletter at WhyHow.AI. Join our discussions about rules, determinism and knowledge graphs in RAG on our newly-created Discord.
