Introducing Schema Controlled Automated Knowledge Graphs

Chia Jeng Yang
WhyHow.AI
5 min read · Apr 21, 2024

We’re excited to announce the first major update to “Understand”, our Knowledge Graph SDK, which we launched into a closed beta a couple of weeks ago. We are building knowledge graph tooling that helps you structure your data around your own view of what matters. You describe in natural language what you want to capture, and it is extracted and neatly organized.

With this latest upgrade, the following features are unlocked:

  • Full retrieval of all related concepts and entities across all uploaded documents
  • The ability to define your own schema and upload it as a JSON file

This means that you can now define the specific entities and relationships that you care about across a set of documents and raw text, then extract and store them in a graph that you can query for more accurate RAG, as well as for memory or personalization.

WhyHow builds workflow tools for data orchestration and graph creation, and we work on top of any data extraction model you want to bring. In this case, we work on top of Unstructured, OpenAI, Neo4j, LangChain, and Pinecone, and we will be supporting the most popular data extraction models, LLMs, and graph and vector databases.

Use-Cases:

  • Medical: Let’s say that you want to extract all the blood types across a hundred patient records. You can now set the schema (“blood type”), describe what the schema should look like, and then immediately extract the blood types of all patients across all documents into a knowledge graph. You can then query the graph for any patient or group of patients to get their blood types and associated information.
  • Legal: Let’s say that you want to know the amount every investor has invested, but you only have a list of SAFE documents. You can now set the schema (“Invested amount”), describe what the schema should look like, and then immediately extract all the invested amounts into a knowledge graph. You can then query the graph for any investor or group of investors to get their invested amounts and associated information.
  • Finance: Let’s say that you want to know the GDP per capita of each country across a number of country economic reports. You can now set the schema (“GDP per capita”), describe what the schema should look like, and then immediately extract every country’s GDP per capita into a knowledge graph. You can then query the graph for any country or group of countries to get their GDP per capita and associated information.
  • Personalization: Let’s say that you want to personalize the interaction between your LLM and your user. You want the LLM to keep track of the user’s hobbies whenever they are mentioned, so that you can suggest related activities in the future. You can now set the schema (“hobbies”), describe what the schema should look like, and then extract all the hobbies that the user mentions over time, across all conversations, into a knowledge graph for future retrieval.
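As a rough illustration of the “set the schema, describe it” step, the sketch below builds a schema for the blood-type use case in the same entities / relations / patterns layout as the medication example later in this post, and serializes it to JSON. The entity and relation names (`patient`, `blood_type`, `has_blood_type`), the descriptions, and the file name are assumptions chosen for illustration, not names the SDK requires.

```python
import json

# Hypothetical schema for the blood-type use case. Every name and
# description here is an illustrative assumption.
blood_type_schema = {
    "entities": [
        {"name": "patient", "description": "A person named in a patient record."},
        {"name": "blood_type", "description": "An ABO/Rh blood group, e.g. A+, O-, AB+."},
    ],
    "relations": [
        {"name": "has_blood_type", "description": "Links a patient to their blood type."},
    ],
    "patterns": [
        {
            "head": "patient",
            "relation": "has_blood_type",
            "tail": "blood_type",
            "description": "A patient has a blood type, e.g. a patient has blood type O-.",
        },
    ],
}

# Serialize to the JSON text you would save as a schema file.
schema_json = json.dumps(blood_type_schema, indent=2)

# To write it to disk (the file name is an assumption):
# with open("blood_type_schema.json", "w") as f:
#     f.write(schema_json)
```

The same three-part shape should carry over to the other use cases by swapping in, say, an investor and an invested amount, or a country and a GDP per capita figure.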

Schema Definition

Here is the schema that we used in our demo video below to query different medication fact sheets.

The schema is straightforward to write, and can be written by a non-technical domain expert if needed. The descriptions are optional, but they give you more granular control over the LLM when its extractions are less precise than you want.

{
  "entities": [
    {
      "name": "medication",
      "description": "The brand name for the medication mentioned in the document. Examples of medication are Advil, Benadryl, etc."
    },
    {
      "name": "side_effect",
      "description": "Harmful bodily effects that can be caused by using the medication. Examples of side effects are nausea, swelling, asthma, shock, blisters, rash, liver damage"
    },
    {
      "name": "active_ingredient",
      "description": "Chemical compounds and ingredients used to make a medication. Examples of active ingredients are Acetaminophen, Ibuprofen, NSAID, Diphenhydramine HCl"
    },
    {
      "name": "symptom",
      "description": "An ailment or issue that is treated or relieved by using the medication. Examples of symptoms are headache, soreness, pain, toothache, backache, arthritis"
    }
  ],
  "relations": [
    {
      "name": "causes",
      "description": "Indicates that the medication has an undesirable bodily side effect."
    },
    {
      "name": "contains",
      "description": "Medications contain an active ingredient."
    },
    {
      "name": "treats",
      "description": "Medications are used to treat bodily ailments or illnesses."
    }
  ],
  "patterns": [
    {
      "head": "medication",
      "relation": "causes",
      "tail": "side_effect",
      "description": "A medication causes a side effect, e.g., Advil causes nausea, Benadryl causes drowsiness, Tylenol causes stomach bleeding"
    },
    {
      "head": "medication",
      "relation": "contains",
      "tail": "active_ingredient",
      "description": "A medication contains an active ingredient, e.g., Advil contains Ibuprofen, Tylenol contains Acetaminophen"
    },
    {
      "head": "medication",
      "relation": "treats",
      "tail": "symptom",
      "description": "A medication treats a symptom, e.g., Advil treats toothaches, Tylenol treats backache"
    }
  ]
}
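Because each pattern refers to entities and relations by name, a typo in a hand-written schema is easy to miss. The sketch below is one possible pre-upload sanity check, not part of the SDK: `validate_schema` is a hypothetical helper that confirms every pattern's `head`, `relation`, and `tail` is declared elsewhere in the schema. It assumes only the three top-level keys shown above.

```python
import json

def validate_schema(schema: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    entity_names = {e["name"] for e in schema.get("entities", [])}
    relation_names = {r["name"] for r in schema.get("relations", [])}
    for i, pattern in enumerate(schema.get("patterns", [])):
        # Each pattern must connect two declared entities via a declared relation.
        for key, allowed in (("head", entity_names),
                             ("relation", relation_names),
                             ("tail", entity_names)):
            value = pattern.get(key)
            if value not in allowed:
                problems.append(f"patterns[{i}].{key} = {value!r} is not defined")
    return problems

# A trimmed copy of the medication schema above (descriptions omitted).
schema = json.loads("""
{
  "entities": [{"name": "medication"}, {"name": "side_effect"},
               {"name": "active_ingredient"}, {"name": "symptom"}],
  "relations": [{"name": "causes"}, {"name": "contains"}, {"name": "treats"}],
  "patterns": [
    {"head": "medication", "relation": "causes", "tail": "side_effect"},
    {"head": "medication", "relation": "contains", "tail": "active_ingredient"},
    {"head": "medication", "relation": "treats", "tail": "symptom"}
  ]
}
""")

problems = validate_schema(schema)  # empty: every pattern uses declared names
```

Renaming `side_effect` to `side_effects` in one place only, for example, would surface as a reported problem instead of silently producing a pattern that never matches.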

Video Walkthrough of the SDK

This demo shows how you can set the schema for what you want to extract and organize from healthcare reports. In this case, we want to extract and store specific drug facts from US National Institutes of Health reports. These drug facts include the name of the drug, its side effects, the symptoms it treats, etc. By setting and defining the schema, you can now run our SDK across multiple documents, have specific graphs created, and easily query these graphs within your RAG pipelines.

As you can see, the Advil report lists six problems that the drug treats, and all six are captured in the graph, along with all five side effects listed in the document. Likewise, for the Tylenol report, the graph contains all eight of the listed uses and all four of the listed warnings.

What can you now do that you could not before?

A structured knowledge representation of your data increases accuracy, reduces hallucinations, and can act as contextual memory, because you can now define the things that you care about in your data and ensure they are captured, stored, and available to query.

You can also now make more precise and deterministic comparisons between and across different documents, especially if there is a checklist or schema of things that you want to compare against.
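To make the comparison idea concrete, here is a minimal sketch that treats the graph as a plain list of (head, relation, tail) triples and compares two medications along one relation using set operations. The triples are illustrative placeholders in the shape of the medication schema, not actual extraction output, and `compare` is a hypothetical helper rather than an SDK function.

```python
# Illustrative triples only; not real output from the fact sheets.
TRIPLES = [
    ("Advil", "treats", "toothache"),
    ("Advil", "treats", "backache"),
    ("Advil", "causes", "nausea"),
    ("Tylenol", "treats", "backache"),
    ("Tylenol", "treats", "headache"),
]

def compare(triples, a, b, relation):
    """Compare the tails that entities `a` and `b` reach via `relation`."""
    tails_a = {t for h, r, t in triples if h == a and r == relation}
    tails_b = {t for h, r, t in triples if h == b and r == relation}
    return {
        "shared": sorted(tails_a & tails_b),
        "only_a": sorted(tails_a - tails_b),
        "only_b": sorted(tails_b - tails_a),
    }

result = compare(TRIPLES, "Advil", "Tylenol", "treats")
# result["shared"] holds what both treat; "only_a" and "only_b" hold the rest.
```

Because the comparison runs over stored triples and not over free text, the same inputs always yield the same answer, which is the determinism the schema is meant to provide.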

In the healthcare example above, you can plug this graph into your RAG system, which can then retrieve the full context about Advil and Tylenol, structured and ready to be queried.
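One way to picture the “plug it into RAG” step is to turn the stored facts about an entity into text and prepend them to the question. The sketch below assumes the graph is available as (head, relation, tail) triples; `graph_context` and `build_prompt` are hypothetical helpers, the prompt wording is an assumption, and no LLM or graph database is actually called. The triples reuse the examples from the schema descriptions.

```python
# Example facts taken from the pattern descriptions in the schema.
TRIPLES = [
    ("Advil", "causes", "nausea"),
    ("Advil", "contains", "Ibuprofen"),
    ("Advil", "treats", "toothache"),
    ("Tylenol", "contains", "Acetaminophen"),
    ("Tylenol", "treats", "backache"),
]

def graph_context(triples, entity):
    """Render every triple that mentions `entity` as one fact per line."""
    facts = sorted(f"{h} {r} {t}" for h, r, t in triples if entity in (h, t))
    return "\n".join(facts)

def build_prompt(question, triples, entities):
    """Assemble a prompt whose context is limited to facts about `entities`."""
    context = "\n".join(graph_context(triples, e) for e in entities)
    return f"Answer using only these facts:\n{context}\n\nQuestion: {question}"

prompt = build_prompt(
    "Which active ingredients do Advil and Tylenol contain?",
    TRIPLES,
    ["Advil", "Tylenol"],
)
```

In a real pipeline the list comprehension would be replaced by a query against whichever graph store holds the data, and `prompt` would be passed to the LLM of your choice.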

You can run this schema against any other medication fact sheet, and even against unstructured text, to continue accumulating information in this graph.
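Accumulating into one graph across documents can be sketched as follows, under the assumption that an extraction step yields typed candidate triples. Only candidates matching one of the schema's three patterns are kept, and a set removes duplicates across documents. The tuple layout, the `accumulate` helper, and the sample candidates are all hypothetical; the SDK's actual behavior may differ.

```python
# The three (head type, relation, tail type) patterns from the medication schema.
ALLOWED_PATTERNS = {
    ("medication", "causes", "side_effect"),
    ("medication", "contains", "active_ingredient"),
    ("medication", "treats", "symptom"),
}

def accumulate(graph: set, candidates) -> int:
    """Add schema-conforming triples to `graph`; return how many were new."""
    added = 0
    for head, head_type, relation, tail, tail_type in candidates:
        if (head_type, relation, tail_type) not in ALLOWED_PATTERNS:
            continue  # off-schema: skip it
        triple = (head, relation, tail)
        if triple not in graph:
            graph.add(triple)
            added += 1
    return added

graph = set()
doc_1 = [
    ("Advil", "medication", "contains", "Ibuprofen", "active_ingredient"),
    ("Advil", "medication", "treats", "toothache", "symptom"),
]
doc_2 = [
    ("Advil", "medication", "treats", "toothache", "symptom"),  # duplicate of doc_1
    ("Tylenol", "medication", "contains", "Acetaminophen", "active_ingredient"),
    ("Ibuprofen", "active_ingredient", "treats", "pain", "symptom"),  # off-schema
]
new_from_doc_1 = accumulate(graph, doc_1)
new_from_doc_2 = accumulate(graph, doc_2)
```

Here the second document contributes only one new triple: its first candidate is already in the graph and its last does not match any pattern.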

We want to deeply understand our initial beta users, so to get an API key for the “Understand” Knowledge Graph SDK described above, please ping us at team@whyhow.ai with a description of your use case, or find some time with us here.

WhyHow.AI is building tools to help developers bring more determinism and control to their RAG pipelines using graph structures. If you’re thinking about, in the process of, or have already incorporated knowledge graphs in RAG for accuracy, memory, and determinism, we’d love to chat at team@whyhow.ai, or you can follow our newsletter at WhyHow.AI. Join our discussions about rules, determinism, and knowledge graphs in RAG on our newly created Discord.
