WhyHow.AI’s Knowledge Graph Creation SDK — “Understand”

Chia Jeng Yang · Published in WhyHow.AI · Apr 1, 2024 · 7 min read
Introducing WhyHow.AI’s Knowledge Graph Creation SDK, nicknamed the “Understand” SDK.

We first started augmenting RAG systems with knowledge graphs when we were trying to build more accurate chatbot solutions for legal professionals. We quickly realized that this was a very challenging process and that there was a gap in existing developer tooling for performing graph RAG. Even our CTO, someone with a decade of experience who focused on knowledge graphs in their PhD, felt the pain.

Our design partners agree. As they begin to implement graphs into their RAG workflows, some of the most common questions we hear are:

  • Am I supposed to capture every single possible relationship and entity?
  • Which ontology should I be using to create my graph?
  • How can I inject my context into the knowledge graph creation process?
  • How do I stream data into my knowledge graph?
  • How do I integrate knowledge graphs into my existing RAG workflows?

We set out to build the developer experience we wished existed for structuring knowledge and creating more deterministic information retrieval systems. The first product we are beta-releasing is our “Understand” SDK, a solution for automatically creating basic knowledge graphs.

Depending on your use case, there are many different approaches to creating a knowledge graph. For example, we recently shared our Deterministic Document Structure-based approach, which creates a knowledge graph from document structure and runs chunk retrieval rules on that basis. This type of graph is known as a document hierarchy, and the graph creation process there is quite different from the “Understand” process.

With the “Understand” SDK, we intend to provide a more generalized framework for graph creation. There are a few key aspects of our approach that make this process unique and efficient.

First, we believe that ontologies for graphs should be defined by you and your view of the world. While industry-specific ontologies absolutely have their place, it is difficult to accurately capture your unique perspectives and use cases with generic ontologies. This is especially important in graph RAG where LLMs need accurate, specific data to generate reliable responses.

Second, we believe that graphs should be small and well-scoped to your use cases. Traditionally, knowledge graphs have been known to be large, comprehensive representations of an expansive, complex domain. While these use cases are still relevant, graph RAG demands many small, well-scoped graphs that can be easily maintained and queried with precision to retrieve the most relevant data.

Third, we leverage LLMs and various other state-of-the-art NLP solutions to provide a user-friendly way of building and managing your own ontologies and graphs. Traditionally, ontology development has been an arduous, time-consuming process involving countless hours of conversations between domain experts, ontologists, and engineers to design an accurate overview of an organization’s knowledge. Our solution eliminates this back-and-forth and helps developers quickly build graphs with natural language in a matter of minutes rather than days or weeks.

For example, let us say you want to build a graph of the things that Harry Potter owns and interacts with throughout each book in the series. Our SDK enables users to upload a raw document along with a collection of seed concepts (in natural language) that tell the WhyHow “Understand” solution what you care about and what kind of information you would like to extract from the raw document. Although we have our own data extraction models, we are ultimately data extraction model agnostic, and want to be able to plug any of your proprietary models, or third-party tools like Unstructured, into our workflows. We are laser-focused on tooling to structure your extracted information as a graph.


# Import and initialize client
from whyhow import WhyHow

client = WhyHow()

# Define graph namespace, raw documents, and seed concepts
hp_namespace = "harry-potter"

hp_docs = [
    "./files/harry_potter_and_the_philosophers_stone.pdf",
    "./files/harry_potter_and_the_chamber_of_secrets.pdf",
]

hp_concepts = [
    "What does Harry wear?",
    "What does Harry own?",
    "What does Harry use?",
]

# Upload data and create graph
response = client.graph.create(
    namespace=hp_namespace,
    documents=hp_docs,
    concepts=hp_concepts,
)

In the code snippet above, we import and initialize the WhyHow client, then define a unique namespace for this use case, the raw documents we will use for graph creation, and the seed concepts that guide ontology creation.

When we create the graph, the raw documents are pre-processed (cleaned, split, chunked) and stored in a unique namespace for your use case. We then use the seed concepts to construct a simple ontology, which guides extraction of the relevant entities and relationships contained within the raw data. From there, we use these entities and relationships to create triples and build a graph. To do this, we’re leveraging a variety of technologies under the hood, such as vector databases like Pinecone, graph databases like Neo4j, and AI toolkits like LangChain.
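To make the chunk-to-triples step concrete, here is a minimal, illustrative sketch of that pipeline. This is not WhyHow’s actual implementation: the chunker is naive and the extraction step is a toy pattern matcher, standing in for the LLM/NER models that the seed concepts would guide in practice.

```python
# Illustrative sketch: chunk raw text, then extract (head, relationship,
# tail) triples guided by seed predicates. A toy regex stands in for the
# LLM-based extraction the SDK performs under the hood.
import re


def chunk(text: str, size: int = 200) -> list[str]:
    """Naively split raw text into fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def extract_triples(chunks: list[str], predicates: list[str]) -> set[tuple]:
    """Pull triples whose relationship matches one of the seed predicates."""
    triples = set()
    for c in chunks:
        for pred in predicates:
            for head, tail in re.findall(rf"(\w+) {pred} (?:a |an |the )?(\w+)", c):
                triples.add((head, pred.upper(), tail.capitalize()))
    return triples


chunks = chunk("Harry wears a cloak. Harry owns an owl. Ron wears a sweater.")
triples = extract_triples(chunks, ["wears", "owns"])
```

The resulting triples — for example `("Harry", "WEARS", "Cloak")` — are then what gets written into the graph store.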

The advantage of this approach is that the resultant graph is populated only with the specific nodes and edges that have been deemed relevant by the user, instead of every possible relationship and entity. By omitting irrelevant data, we can build a graph that is hyper-focused to your use case, revealing uniquely relevant relationships and reducing the risk of retrieving potentially misleading context.

This also makes it possible to ‘train’ a graph by running a set of prepared questions against the system. In many instances, developers come prepared with a list of questions they anticipate their RAG systems will need to answer. Continued discovery of user questions helps build up the graph iteratively over time.

This is meant to be a continuous, iterative process that seamlessly integrates into your existing data movement workflows and RAG pipelines. For example, let us say you want to upload the next book of the Harry Potter series, or you decide that you want to expand your graph to include emotions that Harry feels, or spells that Hermione uses. All you need to do is update your raw data or seed concepts, and new entities and relationships will be automatically extracted and appended to your graph.

# Add documents
new_book = ["./files/harry_potter_and_the_prisoner_of_azkaban.pdf"]
update_documents_response = client.graph.update_documents(
    namespace=hp_namespace,
    documents=new_book,
)

# Add concepts
new_concepts = ["Which spells does Hermione use?"]
update_concept_response = client.graph.update_concepts(
    namespace=hp_namespace,
    concepts=new_concepts,
)

When we extend these graphs, we’re running many of the same processes to build new mini graphs behind the scenes and then merging them with the original graph stored in your namespace. When these graphs are merged, it reveals previously unconnected relationships, enabling us to retrieve more detailed, contextually relevant information to pass into the LLM.
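Conceptually, merging a mini graph into an existing one can be as simple as taking the union of their triple sets and then identifying the nodes that appear in both — the “bridge” nodes where previously unconnected relationships surface. This is a simplified sketch, not the SDK’s actual merge logic:

```python
# Toy merge: union two triple sets, then report "bridge" nodes shared by
# both graphs -- these are where new cross-graph connections appear.
def nodes(graph: set[tuple]) -> set[str]:
    """Collect every head and tail entity in a set of triples."""
    return {n for (head, _, tail) in graph for n in (head, tail)}


def merge(graph_a: set[tuple], graph_b: set[tuple]) -> tuple[set, set]:
    """Return the merged graph and the nodes that bridge the two inputs."""
    merged = graph_a | graph_b
    bridges = nodes(graph_a) & nodes(graph_b)
    return merged, bridges


original = {("Harry", "WEARS", "Cloak"), ("Harry", "OWNS", "Owl")}
mini = {("Hermione", "USES", "Alohomora"), ("Harry", "USES", "Expelliarmus")}
merged, bridges = merge(original, mini)
```

Here “Harry” bridges the two graphs, connecting the new spell triples to his existing possessions — exactly the kind of newly revealed relationship that enriches retrieval.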

This process of ‘just-in-time graph creation’ is exactly how we enable automated Extract, Contextualize, and Load (ECL) workflows. Let’s say you upload a new book to a folder in Amazon S3 or Google Drive. Once the document has landed, you can automatically retrieve the raw file and extend the existing graph in the Harry Potter namespace using the new book and existing concepts. Or, let’s say you have an existing RAG system that allows users to ask questions about Harry Potter in natural language. You can add to the graph by using user questions as new concepts, and automatically extend the graph according to the things your users want to know.
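An automated ECL step like the S3/Drive example might look like the sketch below. The folder-scanning logic and `processed` bookkeeping are illustrative assumptions; the `update_documents` call mirrors the SDK usage shown earlier in this post.

```python
# Sketch of an automated ECL step: scan a synced folder (e.g. an S3
# mount or Drive sync directory) for newly landed PDFs and extend the
# graph with them. Folder-watching details are illustrative.
from pathlib import Path


def find_new_documents(folder: str, processed: set[str]) -> list[str]:
    """Return paths of PDFs in `folder` that haven't been ingested yet."""
    return sorted(
        str(p) for p in Path(folder).glob("*.pdf") if str(p) not in processed
    )


def extend_graph(client, namespace: str, folder: str, processed: set[str]) -> list[str]:
    """Push any newly landed documents into the existing graph."""
    new_docs = find_new_documents(folder, processed)
    if new_docs:
        client.graph.update_documents(namespace=namespace, documents=new_docs)
        processed.update(new_docs)
    return new_docs
```

Running `extend_graph` on a schedule (or from an S3 event notification) keeps the graph in the namespace current as new books land.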

Using strategies like these, your graphs and RAG systems can learn and grow over time, gradually capturing a more comprehensive understanding of your data and enabling you to generate more accurate, complete answers.

In addition to building and modifying graphs using raw documents and seed concepts, we also enable users to query the graph using natural language or Cypher. Responses contain a natural language answer as well as detailed data about graph nodes and edges that you can use to pass to your LLM, make subsequent queries or retrievals to graphs and other data stores, etc.

# Query graph
question = "What does Harry wear?"
query_response = client.graph.query(
    namespace=hp_namespace,
    query=question,
)

print(query_response)
# {
#     "answer": "Harry wears a cloak",
#     "query": "What does Harry wear?",
#     "namespace": "harry-potter",
#     "triples": [
#         {
#             "head": "Harry",
#             "relationship": "WEARS",
#             "tail": "Cloak"
#         }
#     ]
# }

In future releases, we will provide more functionality for fine-grained graph data maintenance, develop clean, developer-friendly interfaces for graph exploration and management, and improve querying capabilities.

We are making a free beta version of our SDK available to the first 100 users. We want to deeply understand our initial beta users, so to get an API key, please ping us at team@whyhow.ai with a description of your use case, or find some time with us here.

To summarize, here are some of the advantages of the WhyHow “Understand” SDK:

  • Ability to instantly create well-scoped graphs according to your world view using PDFs and seed concepts in natural language
  • Ability to query graphs with natural language
  • Ability to iteratively expand graphs over time with more documents and questions and build automated ECL workflows

WhyHow.AI is building tools to help developers bring more determinism and control to their RAG pipelines using graph structures. If you’re thinking about incorporating knowledge graphs into RAG, are in the process of doing so, or have already done it, we’d love to chat at team@whyhow.ai, or follow our newsletter at WhyHow.AI. Join our discussions about rules, determinism, and knowledge graphs in RAG on our newly created Discord.
