# Dataset Management Patterns

Reference patterns for creating and managing Dataiku datasets via the Python API.
## Dataset Types

| Type | Use When | Creation Method |
|---|---|---|
| Managed | Output of recipes, stored in a connection (SQL, HDFS, etc.) | `project.new_managed_dataset(name)` |
| Uploaded | Importing local files (CSV, Excel, etc.) | `project.create_upload_dataset(name)` or `project.create_dataset(name, "UploadedFiles", ...)` |
| SQL table | Pointing to an existing database table | `project.create_dataset(name, "Snowflake", ...)` |
## Create a Managed Dataset

```python
builder = project.new_managed_dataset("MY_OUTPUT")
builder.with_store_into("connection_name")
ds = builder.create()

# Configure the table location (SQL databases)
settings = ds.get_settings()
raw = settings.get_raw()
raw["params"]["schema"] = "MY_SCHEMA"
raw["params"]["table"] = "MY_OUTPUT"
settings.save()
```
## Upload a File

```python
ds = project.create_dataset(
    "my_dataset", "UploadedFiles",
    params={"uploadConnection": "filesystem_managed"},
)

with open("path/to/data.csv", "rb") as f:
    ds.uploaded_add_file(f, "data.csv")

# Auto-detect the schema from the file contents
settings = ds.autodetect_settings(infer_storage_types=True)
settings.save()
```
Simpler alternative: use `create_upload_dataset` to skip the manual `params` configuration:

```python
ds = project.create_upload_dataset("my_dataset")
with open("path/to/data.csv", "rb") as f:
    ds.uploaded_add_file(f, "data.csv")
```
## Common Column Types

| Dataiku Type | Description |
|---|---|
| `string` | Text |
| `int` / `bigint` | Integer / Large integer |
| `double` / `float` | Decimal numbers |
| `boolean` | True/False |
| `date` | Date only |
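When building schema dictionaries programmatically, it can help to map Python types onto the Dataiku storage types above. The helper below is an illustrative sketch, not part of the Dataiku API; the `PY_TO_DATAIKU` mapping is an assumption derived from the table:

```python
# Illustrative mapping from Python types to Dataiku storage types,
# based on the table above. Adjust to your own conventions.
PY_TO_DATAIKU = {
    str: "string",
    int: "bigint",
    float: "double",
    bool: "boolean",
}

def infer_columns(sample_row):
    """Build a schema 'columns' list from one sample row (dict of name -> value)."""
    return [
        {"name": name, "type": PY_TO_DATAIKU.get(type(value), "string")}
        for name, value in sample_row.items()
    ]

columns = infer_columns({"id": "a1", "amount": 9.5, "active": True})
# columns == [{"name": "id", "type": "string"},
#             {"name": "amount", "type": "double"},
#             {"name": "active", "type": "boolean"}]
```

The resulting list drops straight into `settings.set_schema({"columns": columns})`.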
See `references/column-types.md` for the full type table.
## Core Schema Operations

### Get Schema

```python
ds = project.get_dataset("my_dataset")
schema = ds.get_settings().get_schema()
for col in schema["columns"]:
    print(f"{col['name']}: {col['type']}")
```

### Set Schema

```python
settings = ds.get_settings()
settings.set_schema({"columns": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
]})
settings.save()
```

### Auto-detect Schema

```python
settings = ds.autodetect_settings()
settings.save()
```
Note: `autodetect_settings()` is a method on `DSSDataset`, not on `DSSDatasetSettings`. It returns a new settings object with the detected schema applied.
See `references/schema-operations.md` for join compatibility checks, helper functions, and advanced operations.
## SQL Schema Rule

Output datasets for SQL-based recipes MUST have their schemas set before building. Without a schema, Dataiku generates `CREATE TABLE () ...`, which fails.

For SQL databases (Snowflake, BigQuery), use UPPERCASE column names. Lowercase names get quoted, causing "invalid identifier" errors.
```python
# Normalize column names to uppercase for SQL
raw = settings.get_raw()
for col in raw.get("schema", {}).get("columns", []):
    col["name"] = col["name"].upper()
settings.save()
```
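The two rules above (set the schema before building, uppercase the names) can be combined in a small helper. This `sql_schema` function and its argument format are an illustrative sketch, not part of the Dataiku API:

```python
def sql_schema(columns):
    """Build a schema dict with uppercase column names for SQL output datasets.

    `columns` is a list of (name, type) pairs, e.g. [("id", "string")].
    """
    return {"columns": [{"name": name.upper(), "type": typ} for name, typ in columns]}

schema = sql_schema([("id", "string"), ("amount", "double")])
# schema == {"columns": [{"name": "ID", "type": "string"},
#                        {"name": "AMOUNT", "type": "double"}]}

# Then, before building the dataset:
#   settings = ds.get_settings()
#   settings.set_schema(schema)
#   settings.save()
```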
## List Datasets in a Project

```python
datasets = project.list_datasets()
for ds in datasets:
    print(f"- {ds['name']} ({ds.get('type', 'unknown')})")
```
## Common Issues

| Issue | Cause | Solution |
|---|---|---|
| Schema mismatch | Recipe output doesn't match the dataset schema | Run `autodetect_settings()` |
| Join fails | Key type mismatch | Check types, cast if needed |
| Missing columns | Schema not updated | Rebuild the dataset, update the schema |
| Parse errors | Wrong type detection | Set the schema manually |
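A quick way to debug the schema-related issues above is to diff two schemas, e.g. the expected schema against what `get_schema()` returns. This `diff_schemas` helper is an illustrative sketch, not a Dataiku API call; it operates on plain `{"columns": [...]}` dicts:

```python
def diff_schemas(expected, actual):
    """Report missing columns and type mismatches between two schema dicts."""
    exp = {c["name"]: c["type"] for c in expected["columns"]}
    act = {c["name"]: c["type"] for c in actual["columns"]}
    problems = []
    for name, typ in exp.items():
        if name not in act:
            problems.append(f"missing column: {name}")
        elif act[name] != typ:
            problems.append(f"type mismatch on {name}: expected {typ}, got {act[name]}")
    return problems

issues = diff_schemas(
    {"columns": [{"name": "ID", "type": "string"}, {"name": "AMOUNT", "type": "double"}]},
    {"columns": [{"name": "ID", "type": "string"}, {"name": "AMOUNT", "type": "string"}]},
)
# issues == ["type mismatch on AMOUNT: expected double, got string"]
```

An empty result means the actual schema covers everything the expected one requires.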
## Detailed References

- `references/column-types.md` — Full column type table with Python equivalents
- `references/schema-operations.md` — All schema operations, join compatibility checks, helper functions