Dataset Management Patterns

Reference patterns for creating and managing Dataiku datasets via the Python API.

Dataset Types

| Type | Use When | Creation Method |
| --- | --- | --- |
| Managed | Output of recipes, stored in a connection (SQL, HDFS, etc.) | project.new_managed_dataset(name) |
| Uploaded | Importing local files (CSV, Excel, etc.) | project.create_upload_dataset(name) or project.create_dataset(name, "UploadedFiles", ...) |
| SQL Table | Pointing to an existing database table | project.create_dataset(name, "Snowflake", ...) |

Create a Managed Dataset

builder = project.new_managed_dataset("MY_OUTPUT")
builder.with_store_into("connection_name")
ds = builder.create()

# Configure table location (SQL databases)
settings = ds.get_settings()
raw = settings.get_raw()
raw["params"]["schema"] = "MY_SCHEMA"
raw["params"]["table"] = "MY_OUTPUT"
settings.save()
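
Creating a dataset whose name is already taken is expected to fail, so a get-or-create wrapper can make the snippet above safe to re-run. The following is a minimal sketch, not part of this skill: `ensure_managed_dataset` is a hypothetical name, `project` is assumed to be a `dataikuapi` project handle, and the sketch relies on the builder's `already_exists()` method.

```python
def ensure_managed_dataset(project, name, connection, sql_schema=None):
    """Return the managed dataset `name`, creating it only if it is missing.

    Hypothetical helper; `project` is assumed to be a dataikuapi project handle.
    """
    builder = project.new_managed_dataset(name)
    if builder.already_exists():
        # Reuse the existing dataset instead of failing on a duplicate name
        return project.get_dataset(name)

    builder.with_store_into(connection)
    ds = builder.create()

    if sql_schema is not None:
        # Same table-location settings as in the snippet above
        settings = ds.get_settings()
        params = settings.get_raw()["params"]
        params["schema"] = sql_schema
        params["table"] = name
        settings.save()
    return ds
```

Called as `ensure_managed_dataset(project, "MY_OUTPUT", "connection_name", sql_schema="MY_SCHEMA")`, it leaves an existing dataset's settings untouched and only configures the table location on first creation.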

Upload a File

ds = project.create_dataset(
    "my_dataset", "UploadedFiles",
    params={"uploadConnection": "filesystem_managed"}
)

with open("path/to/data.csv", "rb") as f:
    ds.uploaded_add_file(f, "data.csv")

# Auto-detect schema from file contents
settings = ds.autodetect_settings(infer_storage_types=True)
settings.save()

Simpler alternative: use create_upload_dataset to skip the manual params configuration (schema auto-detection then works as shown above):

ds = project.create_upload_dataset("my_dataset")

with open("path/to/data.csv", "rb") as f:
    ds.uploaded_add_file(f, "data.csv")
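
The upload steps above can be bundled into one function so that creation, file transfer, and schema detection always happen together. This is a sketch under assumptions: `upload_file_as_dataset` is a hypothetical name, and `project` is assumed to be a `dataikuapi` project handle exposing the calls already used in this section.

```python
import os


def upload_file_as_dataset(project, dataset_name, path):
    """Create an upload dataset, push one local file, and detect its schema.

    Hypothetical helper built only from the calls shown above.
    """
    if not os.path.isfile(path):
        # Fail before creating an empty dataset in the project
        raise FileNotFoundError(path)

    ds = project.create_upload_dataset(dataset_name)
    with open(path, "rb") as f:
        # Keep the original file name inside the dataset
        ds.uploaded_add_file(f, os.path.basename(path))

    # Detect format and column types from the uploaded contents
    settings = ds.autodetect_settings(infer_storage_types=True)
    settings.save()
    return ds
```

Checking the path first avoids leaving an empty dataset behind when the file name is mistyped.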

Common Column Types

| Dataiku Type | Description |
| --- | --- |
| string | Text |
| int / bigint | Integer / Large integer |
| double / float | Decimal numbers |
| boolean | True/False |
| date | Date only |

See references/column-types.md for the full type table.

Core Schema Operations

Get Schema

ds = project.get_dataset("my_dataset")
schema = ds.get_schema()  # dict with a "columns" list
for col in schema["columns"]:
    print(f"{col['name']}: {col['type']}")

Set Schema

# set_schema() is called on the dataset handle and persists the schema directly
ds.set_schema({"columns": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
]})
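
Hand-written schema dicts are easy to get wrong (a misspelled type, a duplicated name), so it can help to build them through a small validating function. The sketch below is an illustration only: `build_schema` and `KNOWN_TYPES` are hypothetical names, and the type set is limited to the types listed under Common Column Types rather than the full table in references/column-types.md.

```python
# Assumption: only the types listed under "Common Column Types" are allowed here
KNOWN_TYPES = {"string", "int", "bigint", "double", "float", "boolean", "date"}


def build_schema(columns):
    """Turn (name, type) pairs into a {"columns": [...]} schema dict."""
    seen = set()
    out = []
    for name, col_type in columns:
        if col_type not in KNOWN_TYPES:
            raise ValueError(f"unknown column type {col_type!r} for {name!r}")
        if name in seen:
            raise ValueError(f"duplicate column name {name!r}")
        seen.add(name)
        out.append({"name": name, "type": col_type})
    return {"columns": out}


schema = build_schema([("id", "string"), ("amount", "double")])
```

The resulting dict has the same shape as the one passed to set_schema above.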

Auto-detect Schema

settings = ds.autodetect_settings()
settings.save()

Note: autodetect_settings() is a method on DSSDataset, not on DSSDatasetSettings. It returns a new settings object with the detected schema applied.

See references/schema-operations.md for join compatibility checks, helper functions, and advanced operations.
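
Before saving a detected schema over the current one, it can be useful to see what would change. The following is a minimal sketch that works on plain schema dicts of the `{"columns": [...]}` shape used in this section; `diff_schemas` is a hypothetical name and is not taken from references/schema-operations.md.

```python
def diff_schemas(expected, actual):
    """Compare two {"columns": [...]} dicts by column name and type."""
    exp = {c["name"]: c["type"] for c in expected["columns"]}
    act = {c["name"]: c["type"] for c in actual["columns"]}
    return {
        # Columns present in `expected` but absent from `actual`
        "missing": sorted(exp.keys() - act.keys()),
        # Columns present in `actual` but absent from `expected`
        "extra": sorted(act.keys() - exp.keys()),
        # Shared columns whose type differs: name -> (expected, actual)
        "type_changed": {
            name: (exp[name], act[name])
            for name in sorted(exp.keys() & act.keys())
            if exp[name] != act[name]
        },
    }
```

One possible use, assuming the raw settings expose the schema under a "schema" key as in the SQL normalization snippet in this document: compare the current schema with `ds.autodetect_settings().get_raw()["schema"]` and save only when the differences are expected.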

SQL Schema Rule

Output datasets for SQL-based recipes MUST have schemas set before building. Without this, Dataiku generates CREATE TABLE () ... which fails.

For SQL databases (Snowflake, BigQuery), use UPPERCASE column names. Lowercase names get quoted, causing "invalid identifier" errors.

# Normalize column names to uppercase for SQL
settings = ds.get_settings()
raw = settings.get_raw()
for col in raw.get("schema", {}).get("columns", []):
    col["name"] = col["name"].upper()
settings.save()
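
Both rules in this section can be checked on the schema dict before anything is saved. The sketch below is a hypothetical helper (`uppercase_columns` is not a Dataiku API): it rejects an empty schema, which would lead to the failing CREATE TABLE () statement, and it refuses to upper-case names that would collide.

```python
def uppercase_columns(schema):
    """Return a copy of `schema` with upper-cased column names.

    Raises ValueError if the schema has no columns or if two names
    become identical once upper-cased.
    """
    columns = schema.get("columns", [])
    if not columns:
        raise ValueError("schema has no columns; set it before building")

    seen = {}  # upper-cased name -> original name
    result = []
    for col in columns:
        upper = col["name"].upper()
        if upper in seen:
            raise ValueError(
                f"{col['name']!r} and {seen[upper]!r} both map to {upper!r}"
            )
        seen[upper] = col["name"]
        # Copy the column so the input schema is left unmodified
        result.append({**col, "name": upper})
    return {**schema, "columns": result}
```

Returning a copy rather than mutating in place keeps the original names available if the collision check fails partway through.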

List Datasets in Project

datasets = project.list_datasets()
for ds in datasets:
    print(f"- {ds['name']} ({ds.get('type', 'unknown')})")
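
For larger projects, a per-type summary is often easier to scan than the full listing. This is a small sketch that assumes each item behaves like a dict with an optional "type" key, as in the loop above; `count_by_type` is a hypothetical name and the sample data is made up for illustration.

```python
from collections import Counter


def count_by_type(datasets):
    """Count list_datasets()-style items by their "type" field."""
    return Counter(item.get("type", "unknown") for item in datasets)


# Illustrative data only; in practice pass project.list_datasets()
sample = [
    {"name": "orders", "type": "Snowflake"},
    {"name": "customers", "type": "Snowflake"},
    {"name": "raw_upload", "type": "UploadedFiles"},
    {"name": "legacy"},
]
for ds_type, count in count_by_type(sample).most_common():
    print(f"{ds_type}: {count}")
```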

Common Issues

| Issue | Cause | Solution |
| --- | --- | --- |
| Schema mismatch | Recipe output doesn't match the stored schema | Run autodetect_settings() |
| Join fails | Key type mismatch | Check types, cast if needed |
| Missing columns | Schema not updated | Rebuild dataset, update schema |
| Parse errors | Wrong type detection | Manually set schema |
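
The "Join fails" row can be checked ahead of time by comparing key types on both sides. The sketch below operates on plain schema dicts; `join_key_mismatches` is a hypothetical name, and it assumes, conservatively, that any difference in type name is worth flagging even though some engines join closely related types (such as int and bigint) without trouble.

```python
def join_key_mismatches(left_schema, right_schema, keys):
    """List join keys that are missing or whose types differ.

    `keys` is a list of (left_column, right_column) name pairs.
    """
    left = {c["name"]: c["type"] for c in left_schema["columns"]}
    right = {c["name"]: c["type"] for c in right_schema["columns"]}
    problems = []
    for lcol, rcol in keys:
        ltype, rtype = left.get(lcol), right.get(rcol)
        if ltype is None or rtype is None:
            problems.append((lcol, rcol, "missing column"))
        elif ltype != rtype:
            # Strict comparison: any differing type name is reported
            problems.append((lcol, rcol, f"{ltype} vs {rtype}"))
    return problems
```

An empty result means every key pair exists on both sides with identical type names; anything else points at the column to cast before joining.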
