Hybrid Knowledge Graph Bridge

Integration pattern for linking structured domain data to LLM-extracted lexical graphs

Problem

When building knowledge graphs from heterogeneous sources, two distinct graph types often need to coexist:

  1. Domain Graph — Structured, curated data from CSV/databases representing business entities and relationships
  2. Lexical Graph — Entities and relationships automatically extracted from unstructured documents via LLM

These graphs speak different languages: one is schema-driven and deterministic, the other is probabilistic and emergent. Without a deliberate bridge, they remain disconnected silos.

Solution

The solution establishes a reliable join key between both graphs through five steps.

Step 1: Specify the lexical graph schema

Before extraction, define the ontology that guides the LLM. This specification comprises three elements.

Node Types — The entities to extract. Some are simple labels, others are enriched with descriptions (to guide the LLM) and typed properties:

NODE_TYPES = [
    "Entity",           # Simple label
    "Concept",
    "Process",
    {                   # Enriched with description
        "label": "Outcome",
        "description": "A result, benefit, or consequence of a process or action."
    },
    {                   # With typed properties
        "label": "Reference",
        "description": "An external resource such as a document, article, or dataset.",
        "properties": [
            {"name": "name", "type": "STRING", "required": True},
            {"name": "type", "type": "STRING"}
        ]
    },
]

Relationship Types — The possible verbs between entities:

RELATIONSHIP_TYPES = [
    "RELATED_TO",
    "PART_OF",
    "USED_IN",
    "LEADS_TO",
    "REFERENCES"
]

Patterns — The valid combinations. The LLM can only extract conforming triplets:

PATTERNS = [
    ("Entity", "RELATED_TO", "Entity"),
    ("Concept", "RELATED_TO", "Entity"),
    ("Process", "PART_OF", "Entity"),
    ("Process", "LEADS_TO", "Outcome"),
    ("Reference", "REFERENCES", "Entity"),
]
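Because the LLM silently drops any triplet that does not match a pattern, a typo in a label or relationship name can erase whole categories of extractions. A small sanity check (not part of the neo4j-graphrag API; the helper names here are illustrative) catches this before extraction runs:

```python
def node_labels(node_types):
    """Collect labels from both plain-string and dict-style node type entries."""
    return {n if isinstance(n, str) else n["label"] for n in node_types}

def validate_patterns(node_types, relationship_types, patterns):
    """Raise if any pattern references an undeclared label or relationship."""
    labels = node_labels(node_types)
    rels = set(relationship_types)
    for source, rel, target in patterns:
        if source not in labels or target not in labels:
            raise ValueError(f"Unknown node label in pattern: {(source, rel, target)}")
        if rel not in rels:
            raise ValueError(f"Unknown relationship type: {rel}")

# validate_patterns(NODE_TYPES, RELATIONSHIP_TYPES, PATTERNS)  # raises on typos
```

Running this once at startup turns a silent extraction gap into an immediate, explicit failure.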

Step 2: Configure the extraction pipeline

The pipeline assembles the LLM, embedder, text splitter, and schema:

import os

from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(model="text-embedding-ada-002")
text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver,  # an open neo4j.Driver instance, created elsewhere
    neo4j_database=os.getenv("NEO4J_DATABASE"),
    embedder=embedder,
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
        "relationship_types": RELATIONSHIP_TYPES,
        "patterns": PATTERNS
    },
)

The pipeline performs: PDF → chunks → schema-guided LLM extraction → node/relationship creation → embeddings.

Step 3: Transform the structured source into a dictionary

Each row of the CSV (representing the domain graph) becomes a Python dictionary:

import csv
import os

with open(os.path.join(data_path, "metadata.csv"), encoding="utf8", newline="") as f:
    records = list(csv.DictReader(f))
# Produces: {"filename": "doc1.pdf", "category": "...", "author": "...", ...}

Step 4: Add the common key to the dictionary

The pipeline creates Document nodes with a path property. This property serves as the bridge between the two graphs. Enrich the dictionary with a key that matches exactly what the pipeline stores:

record["file_path"] = os.path.join(data_path, record["filename"])
# The same value passed to the pipeline becomes Document.path

The same value is then passed to the pipeline, which generates the lexical graph:

result = asyncio.run(
    kg_builder.run_async(file_path=record["file_path"])
)
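Steps 3 through 5 can be sketched as a single loop over the CSV rows. The helper names and injected callables below are illustrative, not part of neo4j-graphrag; in practice `run_pipeline` would wrap `kg_builder.run_async(...)` and `attach_to_domain` would run the Cypher join:

```python
import os

def build_join_key(data_path, filename):
    """Build the path used both as pipeline input and as Document.path.

    The value must be identical on both sides of the bridge, so it is
    computed once and reused.
    """
    return os.path.join(data_path, filename)

def process_records(records, data_path, run_pipeline, attach_to_domain):
    """Drive steps 3-5 for each CSV row (callables injected for clarity)."""
    for record in records:
        record["file_path"] = build_join_key(data_path, record["filename"])
        run_pipeline(record["file_path"])   # builds the lexical graph
        attach_to_domain(record)            # joins it to the domain graph
    return records
```

Injecting the two callables keeps the key-building logic in one place, which is the property the whole pattern depends on.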

Step 5: Join the two graphs via Cypher

A query uses the common key to attach the domain graph to the lexical graph:

MATCH (d:Document {path: $file_path})
MERGE (e:DomainEntity {id: $entity_id})
SET e.category = $category,
    e.author = $author
MERGE (d)-[:BELONGS_TO]->(e)

The enriched dictionary is passed as parameters:

neo4j_driver.execute_query(cypher, parameters_=record)

Consequences

The pattern works because the dictionary key and Document.path contain identical values. This implicit key connects the lexical graph (entities extracted according to the specified schema) to the domain graph (business structure from structured sources). If these values diverge, the bridge fails silently — orphaned nodes accumulate undetected.
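One common source of divergence is inconsistent path construction: duplicate separators, `./` prefixes, or mixed styles between the enrichment step and the pipeline call. Building the key through one normalizing function everywhere is a cheap guard; a minimal sketch (`canonical_key` is an illustrative name, not from any library):

```python
import os

def canonical_key(*parts):
    """Build the join key the same way everywhere it is needed.

    os.path.normpath collapses redundant separators and "." segments, so
    "data//doc1.pdf" and "./data/doc1.pdf" map to the same key.
    """
    return os.path.normpath(os.path.join(*parts))
```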

Verification

To ensure the bridge holds, verify that Document nodes are properly attached:

// Orphan documents (broken bridge)
MATCH (d:Document)
WHERE NOT EXISTS { (d)-[:BELONGS_TO]->(:DomainEntity) }
RETURN d.path AS orphan

// Domain entities without documents (bridge never built)
MATCH (e:DomainEntity)
WHERE NOT EXISTS { (:Document)-[:BELONGS_TO]->(e) }
RETURN e.id AS missing
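The orphan check can be wrapped in a helper and run after every load. `count_orphan_documents` is an illustrative name; the call relies on the neo4j Python driver's `execute_query`, which returns a `(records, summary, keys)` result:

```python
def count_orphan_documents(driver, database=None):
    """Return the number of Document nodes with no BELONGS_TO edge.

    A non-zero count means the bridge is broken for those files.
    """
    records, _, _ = driver.execute_query(
        """
        MATCH (d:Document)
        WHERE NOT EXISTS { (d)-[:BELONGS_TO]->(:DomainEntity) }
        RETURN d.path AS orphan
        """,
        database_=database,
    )
    return len(records)
```

Failing a load job when this count is non-zero turns the silent-divergence failure mode into a loud one.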

Complete Reference

For a complete implementation example, see references/full_example.py.
