Data Catalog Patterns
Reference patterns for managing the Dataiku data catalog via the Python API.
Key Concepts
| Concept | What it is | Scope |
|---|---|---|
| Data Collection | A curated group of datasets, visible across projects | Instance-level (client) |
| Metadata | Label, description, tags, checklists, custom key-value pairs | Per dataset or project |
| Meaning | A semantic type for columns (e.g., "Email", "Country Code") | Instance-level (client) |
| Tags | Freeform labels on datasets or projects | Per dataset or project |
Data Collections
List Collections
collections = client.list_data_collections(as_type="dict")
for c in collections:
    print(f"{c['displayName']} ({c['id']}) - {c['itemCount']} items")
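When only a display name is known, the ID expected by `client.get_data_collection` can be looked up from the dict listing. The following is a minimal sketch, assuming the listing carries the `displayName` and `id` keys used above; `find_collection_ids` is a hypothetical helper, and the sample data is illustrative rather than real API output.

```python
def find_collection_ids(collections, display_name):
    """Return the IDs of all collections whose display name matches, ignoring case."""
    wanted = display_name.strip().lower()
    return [
        c["id"]
        for c in collections
        if c.get("displayName", "").strip().lower() == wanted
    ]

# Illustrative data shaped like the dict listing (assumption, not real output).
sample = [
    {"id": "abc123", "displayName": "Customer Data", "itemCount": 4},
    {"id": "def456", "displayName": "Finance", "itemCount": 2},
]
matching_ids = find_collection_ids(sample, "customer data")
print(matching_ids)
```

A list is returned rather than a single ID because nothing in this document says display names are unique.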
Create a Collection
dc = client.create_data_collection(
    displayName="Customer Data",
    description="All customer-related datasets",
    tags=["customers", "production"]
)
Add Datasets to a Collection
dc = client.get_data_collection("collection_id")
# Add by dataset handle
ds = project.get_dataset("MY_DATASET")
dc.add_object(ds)
# Add by dict (for cross-project datasets)
dc.add_object({"type": "DATASET", "projectKey": "PROJECT_A", "id": "DATASET_NAME"})
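When many cross-project datasets need to be added, the dict form above can be generated from a list of pairs. This is a sketch under the assumption that the three keys shown (`type`, `projectKey`, `id`) are all that is needed; `dataset_refs` is a hypothetical helper and the project and dataset names are placeholders.

```python
def dataset_refs(pairs):
    """Build add_object-style dicts from (project_key, dataset_name) pairs.

    Duplicate pairs are skipped so the same dataset is not submitted twice.
    """
    seen = set()
    refs = []
    for project_key, dataset_name in pairs:
        key = (project_key, dataset_name)
        if key in seen:
            continue
        seen.add(key)
        refs.append({"type": "DATASET", "projectKey": project_key, "id": dataset_name})
    return refs

refs = dataset_refs([
    ("PROJECT_A", "orders"),
    ("PROJECT_B", "customers"),
    ("PROJECT_A", "orders"),  # duplicate, ignored
])
# With a live collection handle, each entry could then be passed to dc.add_object(ref).
```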
List and Remove Objects
dc = client.get_data_collection("collection_id")
objects = dc.list_objects()
for obj in objects:
    raw = obj.get_raw()
    print(f" {raw['projectKey']}.{raw['id']}")
    # Get as a dataset handle
    ds = obj.get_as_dataset()
    # Remove from collection
    obj.remove()
Update Collection Settings
dc = client.get_data_collection("collection_id")
settings = dc.get_settings()
settings.display_name = "Renamed Collection"
settings.description = "Updated description"
settings.tags = ["new-tag", "production"]
settings.save()
Dataset Metadata
Get and Set Metadata
ds = project.get_dataset("MY_DATASET")
metadata = ds.get_metadata()
# Metadata structure:
# {
#     "label": "...",
#     "description": "...",
#     "tags": ["tag1", "tag2"],
#     "checklists": {"checklists": [...]},
#     "custom": {"kv": {"key1": "value1"}}
# }
metadata["tags"] = ["cleaned", "production"]
metadata["custom"]["kv"]["owner"] = "data-team"
ds.set_metadata(metadata)
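Assigning to `metadata["tags"]` replaces any tags already present, and the `custom`/`kv` lookup would raise `KeyError` if either level were missing. Whether those keys can actually be absent is an assumption here. The sketch below works on a plain dict shaped like the structure above; `add_tags` and `set_custom_value` are hypothetical helpers, not part of the Dataiku API.

```python
def add_tags(metadata, new_tags):
    """Append tags that are not already present, preserving existing order."""
    tags = metadata.setdefault("tags", [])
    for tag in new_tags:
        if tag not in tags:
            tags.append(tag)
    return metadata

def set_custom_value(metadata, key, value):
    """Set a custom key-value pair, creating the "custom" and "kv" levels if absent."""
    metadata.setdefault("custom", {}).setdefault("kv", {})[key] = value
    return metadata

# Illustrative metadata dict; in practice it would come from ds.get_metadata().
metadata = {"label": "Orders", "tags": ["cleaned"]}
add_tags(metadata, ["production", "cleaned"])
set_custom_value(metadata, "owner", "data-team")
# The result would then be written back with ds.set_metadata(metadata).
```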
AI-Generated Descriptions
# Generate descriptions for dataset and columns (requires AI Services enabled)
result = ds.generate_ai_description(language="english", save_description=True)
Rate limit: 1000 requests per day; beyond that, calls are throttled to roughly one every 60 seconds.
Meanings (Semantic Column Types)
List and Create Meanings
# List existing meanings
meanings = client.list_meanings()
# Create a values-list meaning
client.create_meaning(
    id="country_code",
    label="Country Code",
    type="VALUES_LIST",
    values=["US", "UK", "FR", "DE", "JP"],
    normalizationMode="EXACT",
    detectable=True
)
# Create a pattern-based meaning
client.create_meaning(
    id="email_address",
    label="Email Address",
    type="PATTERN",
    pattern=r"^[\w.-]+@[\w.-]+\.\w+$",
    detectable=True
)
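Before registering a pattern-based meaning, it can help to try the pattern on a few sample values locally. This sketch uses Python's `re` module with the email pattern above; the regex engine used by Dataiku may differ from Python's in edge cases, so treat this as a sanity check only. `matches_email` and the sample values are illustrative.

```python
import re

# Same pattern as in the create_meaning call above.
EMAIL_PATTERN = re.compile(r"^[\w.-]+@[\w.-]+\.\w+$")

def matches_email(value):
    """Return True if the whole value matches the email pattern."""
    return EMAIL_PATTERN.match(value) is not None

samples = ["alice@example.com", "bob@localhost", "not-an-email"]
results = {s: matches_email(s) for s in samples}
for value, ok in results.items():
    print(f"{value}: {'match' if ok else 'no match'}")
```

Note that the pattern requires a dot after the `@`, so a value such as `bob@localhost` is rejected.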
Update a Meaning
meaning = client.get_meaning("country_code")
definition = meaning.get_definition()
definition["entries"].append({"value": "CA"})
meaning.set_definition(definition)
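Appending directly can introduce duplicate values if the update is run more than once. A minimal sketch of a duplicate-safe update follows, assuming the definition holds an `entries` list of `{"value": ...}` dicts as in the snippet above; `add_meaning_values` is a hypothetical helper, and the sample definition is illustrative.

```python
def add_meaning_values(definition, values):
    """Append {"value": v} entries for values not already listed."""
    entries = definition.setdefault("entries", [])
    existing = {entry.get("value") for entry in entries}
    for value in values:
        if value not in existing:
            entries.append({"value": value})
            existing.add(value)
    return definition

# Illustrative definition; in practice it would come from meaning.get_definition().
definition = {"id": "country_code", "entries": [{"value": "US"}, {"value": "UK"}]}
add_meaning_values(definition, ["CA", "US", "CA"])
# The result would then be saved with meaning.set_definition(definition).
```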
Catalog Indexing
Trigger re-indexing of connections so new tables appear in the catalog:
# Index specific connections
client.catalog_index_connections(connection_names=["my_snowflake", "my_postgres"])
# Index all connections
client.catalog_index_connections(all_connections=True)
Detailed References
- references/data-collections.md — Permissions, completeness checks, cross-project patterns
- references/metadata-and-tags.md — Full metadata structure, project tags, custom metadata
- references/meanings.md — All meaning types, normalization modes, detectable meanings