Neo4j Graph Data Science (GDS)

Plugin: graph-data-science (self-managed) or built into Aura Pro Python client: graphdatascience (PyPI) — mirrors the Cypher procedure API in Python Docs: https://neo4j.com/docs/graph-data-science/current/ Python client docs: https://neo4j.com/docs/graph-data-science-client/current/ Current client: v1.21 — pip install graphdatascience

When to Use

Projecting an in-memory named graph for algorithm execution
Running GDS algorithms: centrality, community detection, similarity, path finding, node embeddings
Chaining algorithms using mutate mode without round-tripping through the database
Computing node embeddings (FastRP, Node2Vec, GraphSAGE, HashGNN) as ML features
Building recommendation systems using KNN + FastRP embeddings
Using the GDS Python client (graphdatascience) for data science workflows
Estimating memory requirements before running on large graphs

When NOT to Use

Writing or optimizing Cypher queries → use neo4j-cypher-skill
Driver/application connection setup → use neo4j-driver-python-skill (or other driver skill)
GraphRAG retrieval pipelines → use neo4j-graphrag-skill
Aura Graph Analytics (serverless, no Neo4j DB required) → use neo4j-aura-graph-analytics-skill
Snowflake Graph Analytics → use neo4j-snowflake-graph-analytics-skill
GDS on Aura Free, BC, or VDC — GDS plugin is unavailable; users on Free must upgrade to Pro; users on BC/VDC should use Aura Graph Analytics instead → neo4j-aura-graph-analytics-skill

GDS Availability

Deployment	GDS Available
Aura Free	❌ No — upgrade to Aura Pro
Aura Pro	✅ Yes
Aura Business Critical (BC)	❌ No — use Aura Graph Analytics
Aura Virtual Dedicated Cloud (VDC)	❌ No — use Aura Graph Analytics
Self-managed (Community)	✅ With GDS plugin installed
Self-managed (Enterprise)	✅ With GDS plugin installed

Pre-flight check — run this before any GDS operation:

RETURN gds.version() AS gds_version

If this fails with Unknown function 'gds.version', GDS is not installed or not available on this tier. Stop and inform the user.

Installation & Setup

GDS Python Client

pip install graphdatascience                   # core
pip install graphdatascience[rust_ext]         # 3–10× faster serialization
pip install graphdatascience[networkx]         # NetworkX integration
pip install graphdatascience[ogb]              # OGB dataset loading

Compatibility (client v1.21): GDS >= 2.6, Python >= 3.10, Neo4j Driver >= 4.4.12

Connection

from graphdatascience import GraphDataScience

# Local / self-managed
gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"))

# Aura DS (AuraDS instance)
gds = GraphDataScience(
    "neo4j+s://mydbid.databases.neo4j.io:7687",
    auth=("neo4j", "my-password"),
    aura_ds=True
)

print(gds.server_version())   # verify connection

Graph Projection

GDS algorithms operate on named in-memory graphs projected from the Neo4j database. The graph catalog persists only for the lifetime of the Neo4j instance — restart wipes it.

Native Projection

Cypher:

CALL gds.graph.project(
  'myGraph',               -- graph name
  ['Person', 'City'],      -- node labels (or '*' for all)
  {
    KNOWS: { orientation: 'UNDIRECTED' },
    LIVES_IN: {}
  }
)
YIELD graphName, nodeCount, relationshipCount

Python client:

# Simple: single label + relationship type
G, result = gds.graph.project("myGraph", "Person", "KNOWS")

# Multi-label, multi-relationship, with properties
G, result = gds.graph.project(
    "myGraph",
    {"Person": {"properties": ["age", "score"]},
     "City":   {}},
    {"KNOWS":   {"orientation": "UNDIRECTED"},
     "LIVES_IN": {"properties": ["since"]}}
)

print(f"Projected {G.node_count()} nodes, {G.relationship_count()} relationships")

Cypher Projection

Use when native projection can't express the filtering or transformation you need:

G, result = gds.graph.cypher.project(
    """
    MATCH (source:Person)-[r:KNOWS]->(target:Person)
    WHERE source.active = true AND target.active = true
    RETURN gds.graph.project($graph_name, source, target, {
        sourceNodeProperties: source { .score },
        targetNodeProperties: target { .score },
        relationshipType: 'KNOWS'
    })
    """,
    database="neo4j",
    graph_name="activeGraph"
)

Graph Object API

G.name()                        # "myGraph"
G.node_count()                  # 12_043
G.relationship_count()          # 87_211
G.node_labels()                 # ["Person", "City"]
G.relationship_types()          # ["KNOWS", "LIVES_IN"]
G.node_properties("Person")     # ["age", "score"]  — lists mutated/projected properties
G.exists()                      # True
G.memory_usage()                # "45 MiB"
G.density()                     # 0.0032
G.drop()                        # remove from catalog

# Re-attach to an existing projected graph by name
G = gds.graph.get("myGraph")

# Context manager — auto-drops on exit
with gds.graph.project("tmpGraph", "Person", "KNOWS")[0] as G:
    results = gds.pageRank.stream(G)
# G is dropped here automatically

Memory Estimation

Always estimate before projecting or running algorithms on large graphs:

CALL gds.graph.project.estimate(['Person'], 'KNOWS')
YIELD requiredMemory, bytesMin, bytesMax, nodeCount, relationshipCount

est = gds.graph.project.estimate("Person", "KNOWS")
print(est["requiredMemory"])   # e.g. "1234 MiB"

Execution Modes

Every algorithm supports four modes — choose deliberately:

Mode	Side effect	Returns	When to use
`stream`	None	One row per node/pair with result	Inspect results; top-N queries
`stats`	None	Single row with aggregate metrics	Summary statistics, convergence check
`mutate`	Adds property to in-memory graph only	Stats row	Chain algorithms without writing to DB
`write`	Persists property to Neo4j database	Stats row	Final step — make results queryable

Pattern: stream first to verify → mutate to chain → write to persist.

The mutateProperty must not already exist in the in-memory graph. After write, a new projection is needed to use written properties in subsequent GDS algorithms (the in-memory graph does not see DB writes).

Algorithm Reference

Centrality

PageRank

Measures node influence via incoming relationships and their sources' influence.

-- Stream
CALL gds.pageRank.stream('myGraph', {
  dampingFactor: 0.85,     -- probability of following a link (default 0.85)
  maxIterations: 20,
  tolerance: 0.0000001
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC LIMIT 10

-- Write
CALL gds.pageRank.write('myGraph', {
  writeProperty: 'pagerank',
  dampingFactor: 0.85
})
YIELD nodePropertiesWritten, ranIterations, didConverge

# Python client
pr_df = gds.pageRank.stream(G, dampingFactor=0.85, maxIterations=20)
gds.pageRank.write(G, writeProperty="pagerank", dampingFactor=0.85)
gds.pageRank.mutate(G, mutateProperty="pagerank", dampingFactor=0.85)

Gotchas: Spider traps (closed groups with no outlinks) inflate scores — increase dampingFactor. Negative relationship weights are silently ignored.

Other Centrality Algorithms

Algorithm	Procedure	Best for
Betweenness Centrality	`gds.betweenness`	Bottleneck/bridge nodes
Degree Centrality	`gds.degree`	Most-connected nodes (fast)
Article Rank	`gds.articleRank`	PageRank variant dampening high-degree nodes
Eigenvector	`gds.eigenvector`	Influence via well-connected neighbors
Closeness	`gds.closeness`	Average distance to all other nodes
HITS	`gds.hits`	Authority/hub scores (web-like graphs)

Community Detection

Louvain

Maximizes modularity by hierarchically merging communities. Best general-purpose choice for large graphs.

CALL gds.louvain.stream('myGraph', {
  relationshipWeightProperty: 'weight',   -- optional
  includeIntermediateCommunities: false
})
YIELD nodeId, communityId
RETURN gds.util.asNode(nodeId).name AS name, communityId

CALL gds.louvain.write('myGraph', { writeProperty: 'community' })
YIELD communityCount, modularity

louvain_df = gds.louvain.stream(G)
gds.louvain.write(G, writeProperty="community")
gds.louvain.mutate(G, mutateProperty="community")

Louvain vs Leiden: Leiden is a refinement of Louvain that avoids poorly connected communities; prefer Leiden when community quality matters more than raw speed.

Weakly Connected Components (WCC)

Identifies disconnected subgraphs (ignoring relationship direction). Run this early to understand graph structure.

CALL gds.wcc.stream('myGraph', {
  threshold: 0.5,          -- optional: only traverse rels with weight above threshold
  minComponentSize: 10     -- optional: only return nodes in components >= 10 nodes
})
YIELD nodeId, componentId

CALL gds.wcc.write('myGraph', { writeProperty: 'componentId' })
YIELD nodePropertiesWritten, componentCount

wcc_df = gds.wcc.stream(G)
gds.wcc.write(G, writeProperty="componentId")

When to use WCC first: Before running expensive algorithms, partition the graph by component and run per-component to avoid wasting computation on disconnected subgraphs.

Other Community Algorithms

Algorithm	Procedure	Notes
Leiden	`gds.leiden`	Higher quality than Louvain; slower
Label Propagation	`gds.labelPropagation`	Fast, good for large graphs; non-deterministic
K-Means	`gds.kmeans`	Requires node embedding properties as input
HDBSCAN	`gds.hdbscan`	Density-based; finds variable-density communities
K-Core Decomposition	`gds.kcore`	Finds dense subgraphs by degree threshold
Triangle Count	`gds.triangleCount`	Counts triangles per node; use before LCC
Local Clustering Coefficient	`gds.localClusteringCoefficient`	Ratio of closed triangles
Strongly Connected Components	`gds.scc`	Directed graphs only

Similarity

K-Nearest Neighbors (KNN)

Finds the k most similar nodes to each node based on node properties (typically embeddings).

CALL gds.knn.stream('myGraph', {
  nodeProperties: ['embedding'],   -- Float[] property (e.g. from FastRP)
  topK: 10,
  sampleRate: 0.5,                 -- accuracy vs speed trade-off (default 0.5)
  similarityCutoff: 0.7            -- only return pairs above this threshold
})
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).name, gds.util.asNode(node2).name, similarity
ORDER BY similarity DESC

CALL gds.knn.write('myGraph', {
  nodeProperties: ['embedding'],
  topK: 10,
  writeRelationshipType: 'SIMILAR',
  writeProperty: 'score'
})
YIELD relationshipsWritten

knn_df = gds.knn.stream(G, nodeProperties=["embedding"], topK=10)
gds.knn.write(G, nodeProperties=["embedding"], topK=10,
              writeRelationshipType="SIMILAR", writeProperty="score")

Similarity metrics (auto-selected by property type):

Float[] → cosine, Euclidean, or Pearson
Integer[] → Jaccard or Overlap
Scalar → inverse distance

Classic pattern: FastRP mutate → KNN write → query SIMILAR relationships for recommendations.

Node Similarity

Computes Jaccard similarity based on common neighbors (no property needed):

gds.nodeSimilarity.stream(G, similarityCutoff=0.1, topK=10)
gds.nodeSimilarity.write(G, writeRelationshipType="SIMILAR", writeProperty="score")

Path Finding

Algorithm	Procedure	Use case
Dijkstra (single source)	`gds.shortestPath.dijkstra`	Shortest path between two nodes
Dijkstra (all sources)	`gds.allShortestPaths.dijkstra`	All shortest paths from one source
A*	`gds.shortestPath.astar`	Spatial graphs with lat/lon heuristic
Yen's k-Shortest Paths	`gds.shortestPath.yens`	k alternative shortest paths
Bellman-Ford	`gds.bellmanFord`	Graphs with negative weights
Random Walk	`gds.randomWalk`	Sampling graph neighborhoods
BFS / DFS	`gds.bfs` / `gds.dfs`	Traversal order, reachability

-- Dijkstra: shortest path between two nodes
MATCH (source:Location {name: 'A'}), (target:Location {name: 'B'})
CALL gds.shortestPath.dijkstra.stream('myGraph', {
  sourceNode: source,
  targetNode: target,
  relationshipWeightProperty: 'distance'
})
YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, path
RETURN totalCost, [nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS nodes

Node Embeddings

Compute low-dimensional vector representations of nodes for use in ML pipelines.

Algorithm	Tier	Inductive?	Best for
FastRP	Production	Yes (with `propertyRatio=1.0` + `randomSeed`)	Fast, scalable, production ML pipelines
GraphSAGE	Beta	Yes	Feature-rich nodes; generalizes to unseen nodes
Node2Vec	Beta	No (transductive)	Structural similarity; same graph train+predict
HashGNN	Beta	Yes (with `featureProperties` + `randomSeed`)	Fast, GNN-style with limited compute

FastRP

CALL gds.fastRP.mutate('myGraph', {
  embeddingDimension: 256,        -- vector length; 128–512 typical
  iterationWeights: [0.0, 1.0, 1.0],  -- [self, 1-hop, 2-hop] neighborhood weights
  propertyRatio: 0.5,             -- fraction of dims for node properties (requires featureProperties)
  featureProperties: ['score'],   -- node properties to incorporate
  normalizationStrength: -0.5,    -- negative: downplay high-degree hubs
  randomSeed: 42,                 -- set for reproducibility
  mutateProperty: 'embedding'
})
YIELD nodePropertiesWritten

gds.fastRP.mutate(G,
    embeddingDimension=256,
    iterationWeights=[0.0, 1.0, 1.0],
    randomSeed=42,
    mutateProperty="embedding"
)
gds.fastRP.write(G, embeddingDimension=256, writeProperty="embedding", randomSeed=42)

FastRP → KNN pipeline (recommendation / similarity):

# 1. Project
G, _ = gds.graph.project("myGraph", "Product", {"BOUGHT_TOGETHER": {"orientation": "UNDIRECTED"}})

# 2. Embed
gds.fastRP.mutate(G, embeddingDimension=128, randomSeed=42, mutateProperty="emb")

# 3. Find similar nodes
gds.knn.write(G,
    nodeProperties=["emb"],
    topK=10,
    writeRelationshipType="SIMILAR",
    writeProperty="score"
)

# 4. Cleanup
G.drop()

ML Pipelines

GDS supports end-to-end ML pipelines for node classification and link prediction. These manage feature engineering, train/test splits, model training, and prediction in one workflow.

# Node classification pipeline (abbreviated)
pipe, _ = gds.nc_pipe("myPipeline")
pipe.addNodeProperty("fastRP", mutateProperty="emb", embeddingDimension=128, randomSeed=42)
pipe.selectFeatures("emb")
pipe.addLogisticRegression(maxEpochs=100)

model, train_result = pipe.train(G, targetProperty="label", metrics=["ACCURACY"])
print(train_result["modelInfo"]["metrics"])

predictions = model.predict_stream(G)

Algorithm Decision Tree

Centrality (who is important?)
  ├── Influence via network links    → PageRank / ArticleRank
  ├── Bottleneck / bridge nodes      → Betweenness Centrality
  └── Direct connections only        → Degree Centrality

Community Detection (who clusters together?)
  ├── General purpose, fast          → Louvain
  ├── Higher quality communities     → Leiden
  ├── Fast, non-deterministic        → Label Propagation
  └── Is the graph connected?        → WCC (run first to partition)

Similarity / Recommendations
  ├── Node properties / embeddings   → KNN
  └── Common neighbors               → Node Similarity

Path Finding
  ├── Shortest path (positive weights)  → Dijkstra / A*
  ├── k alternative paths              → Yen's
  └── Negative weights                 → Bellman-Ford

Node Embeddings (ML features)
  ├── Production, fast, scalable     → FastRP
  ├── Feature-rich nodes             → GraphSAGE
  ├── Same graph train+predict       → Node2Vec
  └── GNN-style, limited compute     → HashGNN

Common Patterns & Checklist

Full workflow

# 0. Verify GDS
print(gds.server_version())

# 1. Estimate memory
est = gds.graph.project.estimate("Person", "KNOWS")
print(est["requiredMemory"])

# 2. Project
G, _ = gds.graph.project("myGraph", "Person",
                          {"KNOWS": {"orientation": "UNDIRECTED"}})

# 3. Inspect graph
print(G.node_count(), G.relationship_count())

# 4. Stream first to verify algorithm output
df = gds.pageRank.stream(G)
print(df.sort_values("score", ascending=False).head(10))

# 5. Write to DB when satisfied
gds.pageRank.write(G, writeProperty="pagerank", dampingFactor=0.85)

# 6. Always drop to free memory
G.drop()

Built-in test datasets

G = gds.graph.load_cora()          # 2,708 Paper nodes, 5,429 CITES edges
G = gds.graph.load_karate_club()   # 34 Person nodes, 78 KNOWS edges
G = gds.graph.load_imdb()          # 12,772 nodes, heterogeneous
G = gds.graph.load_lastfm()        # 19,914 nodes, user-artist graph

Checklist

gds.version() returns a version (GDS available and licensed)
Memory estimated for large projections before running
Named graph dropped (G.drop()) after use — or context manager used
Algorithm mode chosen: stream (inspect) → mutate (chain) → write (persist)
writeProperty / mutateProperty checked for collision with existing properties
randomSeed set for reproducible embeddings
WCC run first on disconnected graphs to partition before expensive algorithms

MCP Tool Mapping

When the Neo4j MCP server is available:

Operation	MCP tool
`RETURN gds.version()`	`read-cypher`
`gds.pageRank.stream(...)`	`read-cypher`
`gds.pageRank.write(...)`	`write-cypher`
`gds.graph.drop(...)`	`write-cypher`
List available procedures: `CALL gds.list()`	`read-cypher`
List GDS procedures via MCP	`mcp__neo4j__list-gds-procedures` (if available)

Resources

GDS Library Manual
GDS Python Client Docs
Algorithm Reference
Python Client Tutorials
GraphAcademy: GDS Fundamentals — 3–4 hrs; projections, algorithm categories, execution modes
GDS GitHub
Supported Neo4j versions

neo4j-gds-skill