[Image: knowledge graph]

Unstructured text is everywhere – emails, PDF docs, webpages – and it’s full of insights that are inefficient to make sense of without structure. If you’re building an NLP application such as a RAG (retrieval-augmented generation) system, or any other LLM app, chances are you’ve run into the problem: how do you represent raw, free-form text with something simple and structured while preserving (most of) its semantics?

That’s where knowledge graphs come to the rescue.

A knowledge graph is a structured representation of knowledge that organizes information as a network of interconnected entities (nodes) and their relationships (edges). It visually and conceptually models real-world objects and concepts (like people, places, or things), their attributes, and the connections between them. Knowledge graphs are designed to make data more intuitive, comprehensible and accessible for both humans and machines.
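In its simplest form, such a graph boils down to a set of (subject, relation, object) triples. Here is a minimal, library-free sketch (the entities and relations below are made up for illustration):

```python
# A knowledge graph reduced to its essentials: entities (nodes) connected by
# typed relationships (edges), stored as (subject, relation, object) triples.
triples = [
    ("Alice", "works_at", "Acme Corp"),
    ("Acme Corp", "based_in", "Berlin"),
    ("Alice", "friend_of", "Bob"),
]

def neighbors(entity, triples):
    """Return all (relation, other_entity) pairs connected to an entity."""
    out = []
    for s, r, o in triples:
        if s == entity:
            out.append((r, o))
        elif o == entity:
            out.append((r, s))
    return out

print(neighbors("Acme Corp", triples))
# [('works_at', 'Alice'), ('based_in', 'Berlin')]
```

Even this toy version already supports the kind of multi-hop traversal ("who works at a company based in Berlin?") that standalone text chunks cannot.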

For text data, a knowledge graph shows how ideas, people, or events in chunks of text relate to each other across documents. This is in contrast to other traditional knowledge representations such as a vector DB, where chunks are often treated as standalone and unrelated.

In this post, I will show how to harness the power of LLMs to extract knowledge graphs from unstructured text data. This data structure can then be readily used for a variety of downstream tasks such as reasoning, question answering, and fraud or anomaly detection.

From unstructured text to graphs

So, how do we go from unstructured text to a knowledge graph? Until recently, this more or less required time-consuming manual work – kind of like detective work. Imagine an old-school detective scouring through a dossier of documents, extracting photos, clipping articles, pinning them on a board, and connecting them with red string. Sounds familiar?

[Image: detective’s evidence board connected with red string]

Except that we’re not doing that manually here. Rather, we’d leverage the power of modern LLMs to automate the extraction of entities and relationships!

Large language models (LLMs) offer unprecedented capabilities to understand and manipulate natural language. These capabilities are further bolstered by carefully crafting the input prompts (think prompt engineering) and by providing relevant examples (think few-shot learning or fine-tuning), so that the models can efficiently extract entities and relationships from a given text in the right format. To this end, I’m exploring two LLMs in this post: Phi-3-mini-128k-instruct-graph from Emergent Methods and GPT-4.1 mini from OpenAI.

Phi-3-mini-128k-instruct-graph

The folks at Emergent Methods took Phi-3-mini-128k-instruct from Microsoft as the base model and fine-tuned it to output a graph structure in JSON format for a given input text. Of course, a proper system prompt has to be set at the beginning; refer to the fine-tuned model’s card for the precise system prompt and JSON output format. What’s remarkable about the Phi-3 family is that Microsoft bills them as “small” language models (SLMs): lightweight (from 3.8B params), fast at inference, and outperforming models of the same size (as of this writing).

Emergent Methods also published a blog post on how they did this fine-tuning and benchmarked it against Anthropic’s Claude 3.5 Sonnet, achieving 20% better quality in knowledge graph extraction (at a much lower cost as well). The fine-tuning was done using PEFT with QLoRA (quantized low-rank adaptation). Specifically, they trained on 4,000 unique news article summaries, used GPT-4o to generate the desired knowledge graphs, and transferred that knowledge to the smaller, more efficient Phi-3-mini model – essentially classic knowledge distillation. Finally, as the name suggests, the fine-tuned model supports up to 128k combined input and output tokens.

I’ve done my own testing as well. For example, given the following text (taken from ICIJ’s Panama Papers):

The family of Azerbaijan President Ilham Aliyev leads a charmed, glamorous life, thanks in part to financial interests in almost every sector of the economy. His wife, Mehriban, comes from the privileged and powerful Pashayev family that owns banks, insurance and construction companies, a television station and a line of cosmetics. She has led the Heydar Aliyev Foundation, Azerbaijan’s pre-eminent charity behind the construction of schools, hospitals and the country’s major sports complex. Their eldest daughter, Leyla, editor of Baku magazine, and her sister, Arzu, have financial stakes in a firm that won rights to mine for gold in the western village of Chovdar and Azerfon, the country’s largest mobile phone business. Arzu is also a significant shareholder in SW Holding, which controls nearly every operation related to Azerbaijan Airlines (“Azal”), from meals to airport taxis. Both sisters and brother Heydar own property in Dubai valued at roughly $75 million in 2010; Heydar is the legal owner of nine luxury mansions in Dubai purchased for some $44 million.

The extracted knowledge graph (KG) is visualized below (plotted using NetworkX and Pyvis):

As can be seen, the KG captures the main entities mentioned in the text (persons, places, organizations, etc.) as well as their relationships (ownership, kinship, etc.). However, it also misses some important nodes and edges, leaving the extracted KG with a much more simplified structure than it should have. This exposes one weakness of the model: it struggles to recognize and extract all of the entities and relationships mentioned – that is, to be comprehensive. Additionally, each inference run yields a slightly different KG (in terms of the nodes and edges extracted), due to the inherent randomness of the model. Finally, it’s worth mentioning that inference is relatively fast, which is one of the main advantages of this lightweight model.
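For reference, turning the model’s JSON output into node and edge lists for plotting is straightforward. A minimal sketch (the field names below are an assumption for illustration, not the model’s exact schema – see the model card for that):

```python
import json

# Hypothetical model output in the general shape of a JSON graph:
# a list of nodes and a list of edges (field names are assumed here).
raw = """{
  "nodes": [
    {"id": "Ilham Aliyev", "type": "person"},
    {"id": "Azerbaijan", "type": "country"},
    {"id": "Mehriban", "type": "person"}
  ],
  "edges": [
    {"from": "Ilham Aliyev", "to": "Azerbaijan", "label": "president of"},
    {"from": "Mehriban", "to": "Ilham Aliyev", "label": "wife of"}
  ]
}"""

kg = json.loads(raw)
nodes = {n["id"]: n["type"] for n in kg["nodes"]}          # id -> entity type
edges = [(e["from"], e["to"], e["label"]) for e in kg["edges"]]

print(len(nodes), len(edges))  # 3 2
```

From here, loading `nodes` and `edges` into a `networkx.DiGraph` and handing it to Pyvis (`Network.from_nx`) yields the interactive visualization shown above.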

GPT-4.1 mini

The second model explored here is nothing new or exciting; what’s more interesting is how it is utilized. To this end, I’m using the LightRAG package to extract KGs from texts. As the name suggests, the package implements knowledge graphs for RAG applications. It is itself based on GraphRAG from Microsoft, but with significant improvements in cost (token usage) and efficiency (KG construction). Interested readers are strongly encouraged to read the original LightRAG and GraphRAG papers. In this experiment, we’re only interested in the KG aspects (not the RAG aspects) of the package.

Given an underlying LLM (and embedding model), LightRAG has a sophisticated built-in prompt template for KG extraction, including:

  • Role: “You are a Knowledge Graph Specialist responsible for extracting entities and relationships from the input text.”
  • Defined entity types: Organization, Person, Geo, Event, Category
  • Instructions to extract entity name, type and description
  • Constraints – including unidirectional relationships and structured delimiters (like <|> for tuples and ## for records)
  • Instructions to extract relationship description, strength and keywords
  • Few-shot examples – lots of texts from different domains (legal, news, literary, etc.) with clearly defined output structure.
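To make the delimiter scheme above concrete, here is a sketch of parsing a delimiter-structured extraction output. The record layout is loosely modeled on the `<|>` tuple and `##` record delimiters mentioned above; the exact field order LightRAG uses may differ, so treat this as an illustration rather than its actual parser:

```python
# Illustrative parsing of a delimiter-structured LLM output into entities
# and relationships (field layout is an assumption for this sketch).
TUPLE_DELIM, RECORD_DELIM = "<|>", "##"

raw_output = (
    "entity<|>Mehriban<|>Person<|>Wife of the president##"
    "entity<|>Azerfon<|>Organization<|>Largest mobile phone business##"
    "relationship<|>Arzu<|>Azerfon<|>financial stake<|>0.6"
)

entities, relationships = [], []
for record in raw_output.split(RECORD_DELIM):
    fields = record.split(TUPLE_DELIM)
    if fields[0] == "entity":
        entities.append({"name": fields[1], "type": fields[2], "desc": fields[3]})
    elif fields[0] == "relationship":
        relationships.append({"src": fields[1], "tgt": fields[2],
                              "desc": fields[3], "strength": float(fields[4])})

print(len(entities), len(relationships))  # 2 1
```

Plain-text delimiters like these are cheaper and more robust to parse from LLM output than strict JSON, which is presumably why the package uses them.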

In addition, since LightRAG is designed for RAG applications, it typically extracts KGs from overlapping chunks of text, so that KGs extracted from two chunks of a document (overlapping or not) can contain duplicate entities (nodes) and relationships (edges). The authors therefore also implemented basic entity resolution to deduplicate those nodes and edges (using simple exact string matching).

Theoretically, any reasonable LLM should be able to “plug” into the package and “play”, but in this post, I’m using the GPT-4.1 mini model as suggested by the authors in the paper. Given the same test example as above, LightRAG produces the following knowledge graph:

As can be seen, the extracted KG captures many more entities and relationships than the first one1, and can thus be seen as much more comprehensive (and accurate). This should come as no surprise: the model used here (GPT-4.1 mini) is considerably more capable than Phi-3-mini (OpenAI has not disclosed its parameter count), and the prompt used is far more sophisticated. The obvious tradeoff is that inference now takes much longer (at least 2×). Finally, this model suffers from the same output randomness as before, as is typical of most LLMs in general.

Text2Graph app

To tie it all together, I have created a HF Spaces app to demonstrate both models with proper UI for user interactions.

The user can input any arbitrary text (or use one of the provided samples in multiple languages), choose one of the two models, and then generate, inspect, and interact with the resulting knowledge graph, both as JSON and as an interactive network visualization.

The app is fully open-source and also embedded in this post:

Summary

The table below summarizes some of the key considerations as well as the pros and cons of the two models discussed in this post for KG extraction.

|                | Phi-3-mini-128k-instruct-graph | GPT-4.1 mini                                   |
| -------------- | ------------------------------ | ---------------------------------------------- |
| Training       | Instruction fine-tuned         | Prompt engineering (few-shot learning)         |
| # Params       | 3.8B                           | Undisclosed                                    |
| Context Window | 128k (input + output)          | 1,200 input tokens (LightRAG’s default chunk size) |
| Inference      | Faster                         | Slower                                         |
| KG Quality     | Lower                          | Higher                                         |
| Randomness     | Yes                            | Yes                                            |
| # Languages    | 6–8, with varying quality      | 27                                             |

In the table, “KG Quality” refers to the quality of the generated knowledge graphs in terms of comprehensiveness and accuracy, “Inference” to the inference time given the same underlying hardware, “Randomness” to the random nature of the extracted KG, and “# Languages” to the number of languages supported by the model. Phi-3-mini-128k-instruct was trained primarily on English, but also secondarily supports German, Spanish, French, Japanese, Russian, Chinese, and Vietnamese. GPT-4.1 mini, on the other hand, seems to support 27 languages according to this.

Not shown in the table: the token usage (and cost) of GPT-4.1 mini is higher than that of Phi-3-mini-128k-instruct-graph (because of LightRAG’s more sophisticated prompt design) – however, this is still not 100% clear at the time of writing and remains to be fully evaluated.

Finally, which of the two models shines really depends on the specific use case (speed vs. accuracy). But if I had to pick one, I would choose GPT-4.1 mini as driven by LightRAG, due to the much higher quality of the generated KGs (at the cost of much longer inference time). If I could further improve its KG extraction, I would want: (1) faster inference (while preserving the high-quality output) and (2) less output randomness. It’s also worth mentioning that I haven’t investigated hallucination in this post – that is, whether the LLM would “imagine” nodes and edges that don’t actually exist. This is an interesting avenue for future investigation as well.

A slightly edited version of this blog is posted on Medium.


  1. You can click on or hover over a node or an edge of the graph to see its detailed description.