Knowledge Graph Creation
This skill enables an AI agent to transform unstructured text into a structured knowledge graph. The agent extracts entities (people, organizations, technologies, concepts), identifies the relationships between them, generates formal graph triples (subject-predicate-object), and outputs the graph in both a queryable format (Cypher for Neo4j, JSON-LD) and a visual diagram (Mermaid). Knowledge graphs are valuable for understanding complex domains, powering semantic search, detecting implicit connections, and building recommendation systems.
Workflow
- Analyze the Source Material: Read the input text and determine its domain, scope, and complexity. Identify the types of entities likely present (people, organizations, locations, technical concepts, events, etc.) and the granularity appropriate for the graph. A technical architecture document requires fine-grained component-level entities, while a news article may need coarser actor-level entities.
- Extract Entities: Identify all named entities and significant concepts in the text. For each entity, record its canonical name, type (person, organization, technology, concept, event, location), and any notable attributes mentioned (e.g., founding date, version number, role). Deduplicate entities that appear under different names or abbreviations.
- Map Relationships: For every pair of entities that interact in the text, identify the relationship between them. Express each relationship as a directed triple: (Subject) -[PREDICATE]-> (Object). Choose predicates from a consistent vocabulary (e.g., WORKS_AT, DEPENDS_ON, CREATED_BY, PART_OF, COMPETES_WITH). Record the source sentence for traceability.
- Generate Graph Triples and Schema: Formalize the extracted data into a structured format. Output triples in one or more of: Cypher CREATE statements for Neo4j, JSON-LD for web interoperability, or a simple CSV of (subject, predicate, object) rows. Define a lightweight schema listing entity types and valid relationship types.
- Visualize the Graph: Produce a human-readable visualization of the graph. Use Mermaid syntax for embedding in Markdown, or describe the layout for tools like D3.js, Gephi, or Neo4j Browser. Highlight central nodes and key relationship clusters.
- Validate and Refine: Review the graph for completeness and accuracy. Check for orphan nodes (entities with no relationships), missing relationships implied by the text, and overly generic predicates that could be made more specific. Invite the user to confirm, correct, or request expansion of particular subgraphs.
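The intermediate data for this workflow fits in very small structures. The following is a minimal sketch rather than a prescribed implementation: `Entity`, `Triple`, and `find_orphans` are hypothetical names introduced here for illustration. It shows a triple that keeps its source sentence for traceability (Map Relationships) and the orphan-node check described under Validate and Refine.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A node: canonical name, type, and any attributes found in the text."""
    name: str
    type: str
    attributes: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Triple:
    """A directed edge (subject) -[PREDICATE]-> (object) with its source sentence."""
    subject: str
    predicate: str
    object: str
    source: str = ""


def find_orphans(entities, triples):
    """Return the names of entities that appear in no triple."""
    connected = {t.subject for t in triples} | {t.object for t in triples}
    return [e.name for e in entities if e.name not in connected]


# Illustrative data only: Stripe has been extracted but not yet linked.
entities = [
    Entity("Order Service", "technology"),
    Entity("Inventory Service", "technology"),
    Entity("Stripe", "organization"),
]
triples = [
    Triple(
        "Order Service",
        "DEPENDS_ON",
        "Inventory Service",
        source="The Order Service depends on the Inventory Service and the Payment Service.",
    ),
]

print(find_orphans(entities, triples))  # ['Stripe']
```

An orphan is not necessarily an error. It may simply mean a relationship in the text was missed, which is why the check feeds back into review rather than deleting the node.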
Usage
Provide the agent with a text passage, document, or set of documents. Optionally specify the desired output format (Cypher, JSON-LD, Mermaid), the entity types of interest, or the maximum graph size.
Build a knowledge graph from the following text about our microservices architecture.
Output Cypher queries for Neo4j and a Mermaid diagram.
Text: "The API Gateway routes requests to the Auth Service and the Order Service.
The Order Service depends on the Inventory Service and the Payment Service.
The Payment Service integrates with Stripe and PayPal. All services publish
events to the Event Bus, which the Analytics Service consumes."
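The request above asks for Cypher and Mermaid. If JSON-LD were requested instead, the output could be assembled roughly as follows. This is a sketch under stated assumptions: `https://example.org/kg/` is a placeholder vocabulary IRI, `slug` and `to_jsonld` are hypothetical helpers, and only the relationships stated in the first two sentences of the sample text are encoded.

```python
import json

BASE = "https://example.org/kg/"  # placeholder vocabulary IRI (assumption)


def slug(name):
    """Turn a display name into a simple identifier for use in an IRI."""
    return name.lower().replace(" ", "-")


def to_jsonld(triples):
    """Group (subject, predicate, object) triples into a JSON-LD document."""
    nodes = {}
    for subj, pred, obj in triples:
        # Make sure both endpoints exist as nodes in the graph.
        for name in (subj, obj):
            nodes.setdefault(name, {"@id": BASE + slug(name), "name": name})
        # Attach the edge to the subject as a reference to the object's @id.
        nodes[subj].setdefault(pred, []).append({"@id": BASE + slug(obj)})
    return {"@context": {"@vocab": BASE}, "@graph": list(nodes.values())}


triples = [
    ("API Gateway", "ROUTES_TO", "Auth Service"),
    ("API Gateway", "ROUTES_TO", "Order Service"),
    ("Order Service", "DEPENDS_ON", "Inventory Service"),
    ("Order Service", "DEPENDS_ON", "Payment Service"),
]

doc = to_jsonld(triples)
print(json.dumps(doc, indent=2))
```

With `@vocab` set, each predicate such as `DEPENDS_ON` expands to an IRI under the placeholder vocabulary. A real deployment would substitute its own vocabulary or an existing ontology.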
Examples
Example 1: Knowledge Graph from a Technical Document
User Request:
Build a knowledge graph from this text: "Kubernetes orchestrates containerized applications across clusters. A Pod is the smallest deployable unit in Kubernetes and contains one or more containers. Pods are managed by Deployments, which ensure the desired number of replicas. Services expose Pods to network traffic. Ingress routes external HTTP traffic to Services. etcd stores all cluster state data and is accessed by the API Server. kubectl is the CLI tool that communicates with the API Server."
Extracted Entities:
| Entity | Type | Attributes |
|---|---|---|
| Kubernetes | Technology | Container orchestration platform |
| Pod | Concept | Smallest deployable unit |
| Container | Concept | Runtime unit inside a Pod |
| Deployment | Concept | Manages Pod replicas |
| Service | Concept | Network exposure layer |
| Ingress | Concept | External HTTP routing |
| etcd | Technology | Distributed key-value store |
| API Server | Component | Central management interface |
| kubectl | Tool | CLI client |
Relationships and Triples:
| Subject | Predicate | Object |
|---|---|---|
| Kubernetes | ORCHESTRATES | Pod |
| Pod | CONTAINS | Container |
| Deployment | MANAGES | Pod |
| Service | EXPOSES | Pod |
| Ingress | ROUTES_TO | Service |
| etcd | STORES_STATE_FOR | Kubernetes |
| API Server | READS_FROM | etcd |
| kubectl | COMMUNICATES_WITH | API Server |
Cypher Queries:
CREATE (k:Technology {name: 'Kubernetes', description: 'Container orchestration platform'})
CREATE (pod:Concept {name: 'Pod', description: 'Smallest deployable unit'})
CREATE (container:Concept {name: 'Container', description: 'Runtime unit inside a Pod'})
CREATE (deploy:Concept {name: 'Deployment', description: 'Manages Pod replicas'})
CREATE (svc:Concept {name: 'Service', description: 'Network exposure layer'})
CREATE (ing:Concept {name: 'Ingress', description: 'External HTTP routing'})
CREATE (etcd:Technology {name: 'etcd', description: 'Distributed key-value store'})
CREATE (api:Component {name: 'API Server', description: 'Central management interface'})
CREATE (ctl:Tool {name: 'kubectl', description: 'CLI client'})
CREATE (k)-[:ORCHESTRATES]->(pod)
CREATE (pod)-[:CONTAINS]->(container)
CREATE (deploy)-[:MANAGES]->(pod)
CREATE (svc)-[:EXPOSES]->(pod)
CREATE (ing)-[:ROUTES_TO]->(svc)
CREATE (etcd)-[:STORES_STATE_FOR]->(k)
CREATE (api)-[:READS_FROM]->(etcd)
CREATE (ctl)-[:COMMUNICATES_WITH]->(api)
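Statements like those above can be generated from the entity and triple tables instead of being written by hand. The sketch below is one possible approach, not the skill's own implementation: `cypher_string` and `to_cypher` are hypothetical helpers, node variables are auto-numbered (`n0`, `n1`, ...) rather than hand-picked, and the escaping is deliberately naive. Labels and predicates are inserted as-is, so they are assumed to be valid Cypher identifiers.

```python
def cypher_string(value):
    """Quote a Python string as a Cypher single-quoted literal (naive escaping)."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_cypher(entities, triples):
    """Build CREATE statements for nodes first, then for relationships.

    entities: list of (name, label, description)
    triples:  list of (subject, PREDICATE, object), using entity names
    """
    # One variable per entity so relationships can refer back to the nodes.
    var = {name: f"n{i}" for i, (name, _, _) in enumerate(entities)}
    lines = [
        f"CREATE ({var[name]}:{label} {{name: {cypher_string(name)}, "
        f"description: {cypher_string(desc)}}})"
        for name, label, desc in entities
    ]
    lines += [f"CREATE ({var[s]})-[:{p}]->({var[o]})" for s, p, o in triples]
    return "\n".join(lines)


entities = [
    ("etcd", "Technology", "Distributed key-value store"),
    ("API Server", "Component", "Central management interface"),
]
triples = [("API Server", "READS_FROM", "etcd")]

print(to_cypher(entities, triples))
```

For untrusted input, passing values as query parameters through a database driver is generally safer than building string literals. String building is used here only to keep the sketch self-contained.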
Example 2: Mermaid Visualization of a Knowledge Graph
User Request:
Create a knowledge graph with a Mermaid diagram from this text: "Alice leads the Backend Team. Bob is on the Backend Team and maintains the Payment Service. The Payment Service depends on the PostgreSQL Database and the Redis Cache. Carol leads the Frontend Team and built the Dashboard App, which calls the Payment Service API."
Extracted Entities and Relationships:
| Subject | Predicate | Object |
|---|---|---|
| Alice | LEADS | Backend Team |
| Bob | MEMBER_OF | Backend Team |
| Bob | MAINTAINS | Payment Service |
| Payment Service | DEPENDS_ON | PostgreSQL Database |
| Payment Service | DEPENDS_ON | Redis Cache |
| Carol | LEADS | Frontend Team |
| Carol | BUILT | Dashboard App |
| Dashboard App | CALLS | Payment Service |
Mermaid Diagram:
graph LR
Alice["Alice (Person)"] -->|LEADS| BT["Backend Team"]
Bob["Bob (Person)"] -->|MEMBER_OF| BT
Bob -->|MAINTAINS| PS["Payment Service"]
PS -->|DEPENDS_ON| PG["PostgreSQL Database"]
PS -->|DEPENDS_ON| RC["Redis Cache"]
Carol["Carol (Person)"] -->|LEADS| FT["Frontend Team"]
Carol -->|BUILT| DA["Dashboard App"]
DA -->|CALLS| PS
This visualization shows at a glance that the Payment Service is the most connected node in the graph, with four edges: it is maintained by Bob, called by the Dashboard App, and depends on two data stores. A failure in it would affect both the Backend and Frontend teams, which makes it a key risk area.
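Both the diagram and the centrality observation can be derived from the triples table. The sketch below assumes hypothetical helpers `to_mermaid` and `degree`. Node IDs are auto-generated (`N0`, `N1`, ...) instead of the abbreviations used above, and degree (incoming plus outgoing edges) serves as a simple proxy for how central a node is.

```python
from collections import Counter

triples = [
    ("Alice", "LEADS", "Backend Team"),
    ("Bob", "MEMBER_OF", "Backend Team"),
    ("Bob", "MAINTAINS", "Payment Service"),
    ("Payment Service", "DEPENDS_ON", "PostgreSQL Database"),
    ("Payment Service", "DEPENDS_ON", "Redis Cache"),
    ("Carol", "LEADS", "Frontend Team"),
    ("Carol", "BUILT", "Dashboard App"),
    ("Dashboard App", "CALLS", "Payment Service"),
]


def to_mermaid(triples):
    """Render triples as a left-to-right Mermaid flowchart.

    Assumes entity names contain no double quotes.
    """
    ids = {}
    lines = ["graph LR"]
    for s, p, o in triples:
        for name in (s, o):
            ids.setdefault(name, f"N{len(ids)}")  # stable ID per entity
        lines.append(f'    {ids[s]}["{s}"] -->|{p}| {ids[o]}["{o}"]')
    return "\n".join(lines)


def degree(triples):
    """Count incoming plus outgoing edges for every node."""
    counts = Counter()
    for s, _, o in triples:
        counts[s] += 1
        counts[o] += 1
    return counts


print(to_mermaid(triples))
print(degree(triples).most_common(1))  # [('Payment Service', 4)]
```

Degree is the crudest centrality measure. For larger graphs a graph library's betweenness or PageRank scores may highlight key nodes more reliably.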
Best Practices
- Use a consistent predicate vocabulary. Define a controlled set of relationship types (DEPENDS_ON, CREATED_BY, PART_OF, etc.) before building the graph. This enables meaningful queries and prevents synonym fragmentation.
- Normalize entity names. Resolve aliases, abbreviations, and co-references to a single canonical name. "JS," "JavaScript," and "ECMAScript" should map to one node unless the distinction matters.
- Include entity attributes. Bare nodes with only a name are less useful than nodes with type, description, and metadata attributes. Richer nodes enable more powerful queries.
- Prioritize relationship directionality. Always model relationships as directed edges with a clear subject and object. Bidirectional relationships should be represented as two directed edges if the semantics differ in each direction.
- Keep the graph focused. Not every noun needs to be an entity. Focus on entities that are relevant to the user's purpose and exclude generic terms that add noise without insight.
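The first two practices can be enforced mechanically before any output is generated. The following sketch assumes a hand-maintained alias table and predicate set. `ALIASES`, `ALLOWED_PREDICATES`, `canonical`, and `normalize` are hypothetical names, and the sample triples are illustrative.

```python
# Alias table: lower-cased surface form -> canonical entity name (assumption).
ALIASES = {
    "js": "JavaScript",
    "javascript": "JavaScript",
    "ecmascript": "JavaScript",
}
# Controlled predicate vocabulary, agreed on before building the graph.
ALLOWED_PREDICATES = {"DEPENDS_ON", "CREATED_BY", "PART_OF"}


def canonical(name):
    """Map an alias to its canonical name; unknown names pass through unchanged."""
    cleaned = name.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


def normalize(triples):
    """Canonicalize entity names, reject unknown predicates, drop duplicates."""
    seen, result = set(), []
    for s, p, o in triples:
        if p not in ALLOWED_PREDICATES:
            raise ValueError(f"predicate {p!r} is not in the controlled vocabulary")
        triple = (canonical(s), p, canonical(o))
        if triple not in seen:
            seen.add(triple)
            result.append(triple)
    return result


raw = [
    ("JS", "CREATED_BY", "Brendan Eich"),
    ("JavaScript", "CREATED_BY", "Brendan Eich"),
    ("Dashboard App", "DEPENDS_ON", "ECMAScript"),
]
print(normalize(raw))
```

Here the first two rows collapse into a single edge once "JS" resolves to "JavaScript". Where the distinction between JavaScript and ECMAScript matters, the alias table would simply omit that mapping.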
Edge Cases
- Ambiguous entity references: When the text contains pronouns or vague references ("it," "the system"), resolve them to specific entities based on context. If resolution is uncertain, note the ambiguity and ask the user to clarify.
- Implicit relationships: Some relationships are implied but not explicitly stated (e.g., "Alice and Bob work at Acme Corp" implies both WORKS_AT relationships). Extract these, but flag them as inferred rather than directly stated.
- Very large source texts: For documents exceeding a few thousand words, process the text in chunks and merge the per-chunk entity graphs, deduplicating as you go. Warn the user if the resulting graph exceeds a practical visualization size (more than about 50 nodes).
- Contradictory information: If the source text contains conflicting statements about relationships (e.g., "Service A depends on Service B" in one paragraph and "Service A has no external dependencies" in another), include both and flag the contradiction.
- Domain-specific terminology: In specialized domains (medical, legal, financial), entity types and relationship predicates should reflect domain ontologies (e.g., SNOMED CT for medical, FIBO for financial) when the user requires interoperability with existing knowledge bases.
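Two of these cases, chunked processing and inferred relationships, combine naturally in a merge step. The sketch below is one way to do it and rests on assumptions: each triple carries a boolean `inferred` flag, `merge_chunks` is a hypothetical helper, and the 50-node threshold is the rough limit mentioned above rather than a hard rule.

```python
import warnings

MAX_NODES = 50  # rough visualization limit; adjust to the target tool


def merge_chunks(chunk_graphs, max_nodes=MAX_NODES):
    """Merge per-chunk triple lists into one graph, deduplicating as we go.

    Each triple is (subject, predicate, object, inferred). If the same edge is
    seen both as stated and as inferred, the stated version wins.
    """
    merged = {}
    for triples in chunk_graphs:
        for s, p, o, inferred in triples:
            key = (s, p, o)
            # Stays True only while every sighting of this edge was inferred.
            merged[key] = merged.get(key, True) and inferred
    nodes = {n for s, _, o in merged for n in (s, o)}
    if len(nodes) > max_nodes:
        warnings.warn(
            f"graph has {len(nodes)} nodes; consider filtering before visualizing"
        )
    return [(s, p, o, inferred) for (s, p, o), inferred in merged.items()]


chunk_1 = [
    ("Alice", "WORKS_AT", "Acme Corp", True),
    ("Bob", "WORKS_AT", "Acme Corp", True),
]
chunk_2 = [("Alice", "WORKS_AT", "Acme Corp", False)]  # stated outright later

print(merge_chunks([chunk_1, chunk_2]))
```

Keeping the flag on the edge lets the final output mark inferred relationships, for example with a property in Cypher or a dashed line in a diagram, so the user can review them.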