data-catalog

SKILL.md

Data Catalog Patterns

Reference patterns for managing the Dataiku data catalog via the Python API.

Key Concepts

| Concept | What it is | Scope |
| --- | --- | --- |
| Data Collection | A curated group of datasets, visible across projects | Instance-level (client) |
| Metadata | Label, description, tags, checklists, custom key-value pairs | Per dataset or project |
| Meaning | A semantic type for columns (e.g., "Email", "Country Code") | Instance-level (client) |
| Tags | Freeform labels on datasets or projects | Per dataset or project |

Data Collections

List Collections

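import dataiku

# Inside DSS; from outside, build the client with dataikuapi.DSSClient(host, api_key)
client = dataiku.api_client()
collections = client.list_data_collections(as_type="dict")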
for c in collections:
    print(f"{c['displayName']} ({c['id']}) — {c['itemCount']} items")
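When a script needs the ID of one collection, the dict listing can be filtered by display name. The following is a minimal sketch, assuming each entry carries the `displayName` and `id` keys used above. `find_collections_by_name` is a hypothetical helper, not part of the Dataiku API, and the sample IDs are made up.

```python
def find_collections_by_name(collections, name):
    """Return the collection dicts whose displayName equals `name`, ignoring case.

    `collections` is assumed to be the list returned by
    client.list_data_collections(as_type="dict").
    """
    wanted = name.strip().lower()
    return [
        c for c in collections
        if str(c.get("displayName", "")).strip().lower() == wanted
    ]


# Stand-in data shaped like the listing above (IDs are invented for illustration)
sample = [
    {"id": "abc123", "displayName": "Customer Data", "itemCount": 4},
    {"id": "def456", "displayName": "Finance", "itemCount": 2},
]
matches = find_collections_by_name(sample, "customer data")
print([c["id"] for c in matches])
```

A list is returned rather than a single item because nothing in this document says display names are unique.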

Create a Collection

dc = client.create_data_collection(
    displayName="Customer Data",
    description="All customer-related datasets",
    tags=["customers", "production"]
)
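Re-running a setup script would otherwise create a second collection with the same display name. One way to avoid that is a get-or-create wrapper, sketched here using only the client calls shown in this document. `get_or_create_collection` is a hypothetical helper, and matching on `displayName` is an assumption about how you identify collections.

```python
def get_or_create_collection(client, display_name, **create_kwargs):
    """Return a handle to the collection named `display_name`, creating it if absent.

    `client` is a DSS client; extra keyword arguments (description, tags, ...)
    are passed to create_data_collection only when a new collection is made.
    """
    for c in client.list_data_collections(as_type="dict"):
        if c.get("displayName") == display_name:
            # Reuse the first existing collection with this display name
            return client.get_data_collection(c["id"])
    return client.create_data_collection(displayName=display_name, **create_kwargs)
```

Usage would look like `dc = get_or_create_collection(client, "Customer Data", tags=["customers"])`.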

Add Datasets to a Collection

dc = client.get_data_collection("collection_id")

# Add by dataset handle
project = client.get_project("PROJECT_KEY")
ds = project.get_dataset("MY_DATASET")
dc.add_object(ds)

# Add by dict (for cross-project datasets)
dc.add_object({"type": "DATASET", "projectKey": "PROJECT_A", "id": "DATASET_NAME"})
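For more than a handful of datasets, the dict form lends itself to a small batch helper. This is a sketch under the assumption that `add_object` raises an exception for a reference it cannot add. The exception type is not known here, so the code catches `Exception` broadly. `dataset_ref` and `add_datasets` are hypothetical names.

```python
def dataset_ref(project_key, dataset_name):
    """Build the dict form of a dataset reference, as passed to dc.add_object()."""
    return {"type": "DATASET", "projectKey": project_key, "id": dataset_name}


def add_datasets(dc, refs):
    """Add each (project_key, dataset_name) pair in `refs` to the collection `dc`.

    Returns a list of (project_key, dataset_name, exception) for the pairs that
    failed, so one bad reference does not abort the whole batch.
    """
    failures = []
    for project_key, dataset_name in refs:
        try:
            dc.add_object(dataset_ref(project_key, dataset_name))
        except Exception as exc:  # the API's exception types are not assumed here
            failures.append((project_key, dataset_name, exc))
    return failures
```

Calling `add_datasets(dc, [("PROJECT_A", "orders"), ("PROJECT_B", "customers")])` returns an empty list when every add succeeds.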

List and Remove Objects

dc = client.get_data_collection("collection_id")
objects = dc.list_objects()
for obj in objects:
    raw = obj.get_raw()
    print(f"  {raw['projectKey']}.{raw['id']}")

    # Get as a dataset handle
    ds = obj.get_as_dataset()

    # Remove from collection (as written, this loop removes every item)
    obj.remove()
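In practice, removal is usually conditional. The sketch below removes only the items that belong to one project, using the same `list_objects()`, `get_raw()` and `remove()` calls as above. `remove_project_items` is a hypothetical helper, and it assumes each raw item carries a `projectKey` field as in the print statement above.

```python
def remove_project_items(dc, project_key):
    """Remove from `dc` every item whose raw projectKey equals `project_key`.

    Returns the number of items removed.
    """
    removed = 0
    for obj in dc.list_objects():
        if obj.get_raw().get("projectKey") == project_key:
            obj.remove()
            removed += 1
    return removed
```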

Update Collection Settings

dc = client.get_data_collection("collection_id")
settings = dc.get_settings()
settings.display_name = "Renamed Collection"
settings.description = "Updated description"
settings.tags = ["new-tag", "production"]
settings.save()

Dataset Metadata

Get and Set Metadata

ds = project.get_dataset("MY_DATASET")
metadata = ds.get_metadata()

# Metadata structure:
# {
#   "label": "...",
#   "description": "...",
#   "tags": ["tag1", "tag2"],
#   "checklists": {"checklists": [...]},
#   "custom": {"kv": {"key1": "value1"}}
# }

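# Assigning a new list replaces any existing tags
metadata["tags"] = ["cleaned", "production"]
# setdefault guards against a missing "custom" or "kv" entry
metadata.setdefault("custom", {}).setdefault("kv", {})["owner"] = "data-team"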
ds.set_metadata(metadata)
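Because `set_metadata` writes back the whole dict, assigning `metadata["tags"]` directly discards tags that were already set. A merge step avoids that. This is a minimal sketch that works purely on the dict structure shown above. `merge_metadata` is a hypothetical helper, and the sample metadata is invented.

```python
import copy


def merge_metadata(metadata, add_tags=(), custom_kv=None):
    """Return a copy of `metadata` with tags appended and custom keys updated.

    Existing tags keep their order, duplicates are skipped, and the input
    dict is left untouched.
    """
    merged = copy.deepcopy(metadata)
    tags = list(merged.get("tags") or [])
    for tag in add_tags:
        if tag not in tags:
            tags.append(tag)
    merged["tags"] = tags

    # Tolerate a missing or empty "custom" / "kv" entry
    custom = merged.get("custom") or {}
    kv = custom.get("kv") or {}
    kv.update(custom_kv or {})
    custom["kv"] = kv
    merged["custom"] = custom
    return merged


# Stand-in for the result of ds.get_metadata()
current = {"label": "Orders", "tags": ["raw"], "custom": {"kv": {}}}
updated = merge_metadata(current, add_tags=["cleaned", "raw"], custom_kv={"owner": "data-team"})
print(updated["tags"])
```

The result can then be passed to `ds.set_metadata(updated)`.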

AI-Generated Descriptions

# Generate descriptions for dataset and columns (requires AI Services enabled)
result = ds.generate_ai_description(language="english", save_description=True)

Calls are rate-limited: up to 1000 requests per day, after which each call is throttled to about 60 seconds.
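When describing many datasets in one run, it can help to stop before the daily quota is exhausted rather than fall into throttled calls. The sketch below caps the number of calls per run. `describe_within_budget` is a hypothetical helper, and the counter is local to a single run, so it does not know about calls made elsewhere against the same quota.

```python
def describe_within_budget(datasets, budget=1000, language="english"):
    """Call generate_ai_description on at most `budget` dataset handles.

    Returns (results, skipped): the results for the datasets processed, and
    the handles left over for a later run.
    """
    todo = list(datasets)
    results = []
    for ds in todo[:budget]:
        results.append(
            ds.generate_ai_description(language=language, save_description=True)
        )
    return results, todo[budget:]
```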

Meanings (Semantic Column Types)

List and Create Meanings

# List existing meanings
meanings = client.list_meanings()

# Create a values-list meaning
client.create_meaning(
    id="country_code",
    label="Country Code",
    type="VALUES_LIST",
    values=["US", "UK", "FR", "DE", "JP"],
    normalizationMode="EXACT",
    detectable=True
)

# Create a pattern-based meaning
client.create_meaning(
    id="email_address",
    label="Email Address",
    type="PATTERN",
    pattern=r"^[\w.-]+@[\w.-]+\.\w+$",
    detectable=True
)
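Before registering a pattern-based meaning, it is worth checking the regular expression against a few known-good and known-bad values. The sketch below does this locally with Python's `re` module. It is only a sanity check: the regex engine DSS uses to evaluate meanings may differ from Python's in edge cases. `check_pattern` is a hypothetical helper.

```python
import re


def check_pattern(pattern, should_match, should_not_match):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    compiled = re.compile(pattern)
    problems = []
    for value in should_match:
        if not compiled.search(value):
            problems.append(f"expected match: {value!r}")
    for value in should_not_match:
        if compiled.search(value):
            problems.append(f"unexpected match: {value!r}")
    return problems


EMAIL_PATTERN = r"^[\w.-]+@[\w.-]+\.\w+$"
problems = check_pattern(
    EMAIL_PATTERN,
    should_match=["ana@example.com", "a.b-c@mail.example.org"],
    should_not_match=["not-an-email", "missing@tld", "two@@example.com"],
)
print(problems)
```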

Update a Meaning

meaning = client.get_meaning("country_code")
definition = meaning.get_definition()
definition["entries"].append({"value": "CA"})
meaning.set_definition(definition)
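Appending blindly adds a duplicate entry if the value is already in the list. A small guard makes the update idempotent. This sketch works on the definition dict only and assumes entries have the `{"value": ...}` shape used above. `add_values` is a hypothetical helper, and the sample definition is invented.

```python
def add_values(definition, new_values):
    """Append each value in `new_values` that is not already an entry.

    Mutates `definition` in place and returns the list of values actually added.
    """
    entries = definition.get("entries") or []
    existing = {e.get("value") for e in entries}
    added = []
    for value in new_values:
        if value not in existing:
            entries.append({"value": value})
            existing.add(value)
            added.append(value)
    definition["entries"] = entries
    return added


# Stand-in for the result of meaning.get_definition()
definition = {
    "id": "country_code",
    "type": "VALUES_LIST",
    "entries": [{"value": "US"}, {"value": "FR"}],
}
added = add_values(definition, ["CA", "US", "CA"])
print(added)
```

If `added` is non-empty, the updated dict can be saved with `meaning.set_definition(definition)`.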

Catalog Indexing

Trigger re-indexing of connections so new tables appear in the catalog:

# Index specific connections
client.catalog_index_connections(connection_names=["my_snowflake", "my_postgres"])

# Index all connections
client.catalog_index_connections(all_connections=True)
