manage-data
Manage Data
Guide users through the full dataset lifecycle: create, modify, push, pull, version, and organize benchmark data.
When To Use
- Creating a new benchmark dataset from Python dicts, CSV, or existing data.
- Pushing data to the ZeroEval backend.
- Pulling an existing benchmark dataset for evaluation.
- Working with dataset versions and subsets.
- Setting up git-based data management for a benchmark.
- Adding multimodal data (images, audio, video) to datasets.
- Converting CSV/JSONL to Parquet for git workflows.
Execution Sequence
Follow these steps in order. Load reference files only when the current step requires detailed API usage.
Step 1: Install and Initialize
Check if the SDK is installed. If not, install it:
pip install zeroeval
Verify the CLI is available:
zeroeval --help
Set the API key (if not already configured):
export ZEROEVAL_API_KEY="sk_ze_..."
# or use the interactive setup:
zeroeval setup
Initialize in code:
import zeroeval as ze
ze.init() # reads ZEROEVAL_API_KEY from env
Full setup guide: https://docs.llm-stats.com/python-sdk/installation
Step 2: Determine the Data Workflow
Ask the user which scenario fits:
- Pull an existing benchmark — data already exists on LLM Stats. Go to Step 3a.
- Create from Python — data lives in code as dicts or a CSV file. Go to Step 3b.
- Git-based workflow — user wants to clone, add Parquet files, and push via git. Go to Step 3c.
- Large upload workflow — user has large parquet files and wants a resilient upload/session flow. Go to Step 3d.
Step 3a: Pull an Existing Dataset
dataset = ze.Dataset.pull("benchmark-name")
# Pin to a specific version
dataset = ze.Dataset.pull("benchmark-name", version_number=3)
# Pull a specific subset
dataset = ze.Dataset.pull("benchmark-name", subset="hard")
Rows are DotDict objects — access fields with dot notation (row.question) or dict syntax (row["question"]).
for row in dataset:
    print(row.question, row.answer)
# Slice
subset = dataset[0:10]
print(dataset.columns)
Full reference: https://docs.llm-stats.com/python-sdk/datasets/loading
Step 3b: Create from Python
# From a list of dicts
dataset = ze.Dataset("my-benchmark", data=[
    {"row_id": "q1", "question": "What is 2+2?", "answer": "4"},
    {"row_id": "q2", "question": "Capital of France?", "answer": "Paris"},
], description="Math and geography questions")
# From a CSV file
dataset = ze.Dataset("/path/to/questions.csv")
Include a stable row_id column — it makes resume, comparison, and version diffing reliable.
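When the source rows lack IDs, one option is to derive them deterministically from row content. A minimal sketch (the hashing scheme here is an illustration, not an SDK feature):
import hashlib

rows = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]
for row in rows:
    # Hash the question text so re-runs assign the same row_id
    # to the same content.
    row["row_id"] = hashlib.sha256(row["question"].encode()).hexdigest()[:12]

dataset = ze.Dataset("my-benchmark", data=rows)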
Modify data before pushing:
dataset.add_rows([{"row_id": "q3", "question": "...", "answer": "..."}])
dataset.update_row(0, {"row_id": "q1", "question": "Updated?", "answer": "Yes"})
dataset.delete_row(2)
Push to the backend:
dataset.push()
print(dataset.version_number) # auto-incremented
For multimodal data, read references/dataset-lifecycle.md for add_image(), add_audio(), add_video(), and add_media_url() usage.
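A heavily hedged sketch of what attaching media can look like (the call shapes below are hypothetical; confirm the real signatures in references/dataset-lifecycle.md before use):
# Hypothetical signatures -- check references/dataset-lifecycle.md
# for the actual parameter names and order.
dataset.add_image(0, "screenshot", "/path/to/screenshot.png")
dataset.add_media_url(0, "clip", "https://example.com/demo.mp4")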
Full reference: https://docs.llm-stats.com/python-sdk/datasets/creating
Step 3c: Git-Based Workflow
This is a good approach for smaller benchmark repos and metadata-heavy workflows. Read references/git-workflow.md for the complete guide.
Quick summary:
git clone https://<user>:<api-key>@git.llm-stats.com/<org>/<slug>.git
cd <slug>
# Add Parquet files (each file = one subset)
mkdir -p data
cp questions.parquet data/default.parquet
git add . && git commit -m "Add benchmark data" && git push
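If the source data is CSV or JSONL, convert it to Parquet first. A minimal sketch using pandas (assumes pandas and pyarrow are installed):
import pandas as pd

# Each Parquet file under data/ becomes one subset.
df = pd.read_csv("questions.csv")  # or: pd.read_json("questions.jsonl", lines=True)
df.to_parquet("data/default.parquet", index=False)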
Full reference: https://docs.llm-stats.com/python-sdk/datasets/uploading-data
Step 3d: Large Upload Workflow
Use this when the dataset is large enough that a single Git push is fragile or slow.
Quick summary:
- Create an upload session for the target benchmark.
- Upload Parquet files directly to storage using multipart upload URLs.
- Finalize the session.
- Wait for indexing to move through the received, uploading, indexing, and ready states.
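A minimal sketch of the flow over plain HTTP (the endpoint paths, field names, and base URL below are assumptions for illustration; use the actual upload-session API documented for the backend):
import time
import requests

API = "https://api.llm-stats.com"  # hypothetical base URL
HEADERS = {"Authorization": "Bearer sk_ze_..."}

# 1. Create an upload session for the target benchmark (hypothetical endpoint).
session = requests.post(
    f"{API}/upload-sessions",
    json={"benchmark": "my-benchmark"},
    headers=HEADERS,
).json()

# 2. Upload the Parquet file to the presigned upload URL
#    (large files may be split across multiple multipart URLs).
with open("data/default.parquet", "rb") as f:
    requests.put(session["upload_url"], data=f)

# 3. Finalize the session so indexing can start.
requests.post(f"{API}/upload-sessions/{session['id']}/finalize", headers=HEADERS)

# 4. Poll until the status reaches "ready"
#    (received -> uploading -> indexing -> ready).
while True:
    status = requests.get(
        f"{API}/upload-sessions/{session['id']}", headers=HEADERS
    ).json()["status"]
    if status == "ready":
        break
    time.sleep(5)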
Step 4: Work with Versions and Subsets
Every push() or git push creates a new version automatically.
# Pull a pinned version
v2 = ze.Dataset.pull("my-benchmark", version_number=2)
# Pull a specific subset
hard = ze.Dataset.pull("my-benchmark", subset="hard")
Default subset resolution order:
- Benchmark's configured default
- File named default.parquet
- Single subset (if only one exists)
- First alphabetically
Full reference: https://docs.llm-stats.com/python-sdk/datasets/versioning-and-subsets
Step 5: Verify with the CLI
Use the CLI to inspect datasets without writing scripts:
# List your datasets
zeroeval datasets list
# Inspect a specific dataset
zeroeval datasets get my-benchmark
# Check versions
zeroeval datasets versions my-benchmark
# Preview rows
zeroeval datasets rows my-benchmark --subset default --limit 5
# JSON output for scripting
zeroeval --output json datasets get my-benchmark
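The JSON output can be consumed from scripts. A small sketch (the exact field names depend on the CLI's JSON schema, so inspect the output before relying on specific keys):
import json
import subprocess

result = subprocess.run(
    ["zeroeval", "--output", "json", "datasets", "get", "my-benchmark"],
    capture_output=True, text=True, check=True,
)
info = json.loads(result.stdout)
print(info)  # inspect the schema before depending on specific fields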
Full CLI reference: https://docs.llm-stats.com/python-sdk/cli/using-the-cli
Step 6: Validate
- zeroeval datasets list shows the dataset
- zeroeval datasets rows <name> returns expected data
- Rows have a stable row_id or id column
- Version number incremented after push (zeroeval datasets versions <name>)
- Subsets are correctly detected (for git workflows)
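The same checks can be scripted. A minimal sketch (assumes the dataset was pushed as in Step 3b; len() support on pulled datasets is an assumption consistent with the slicing shown in Step 3a):
import zeroeval as ze

ze.init()
dataset = ze.Dataset.pull("my-benchmark")

# Rows need a stable identifier for resume, comparison, and diffing.
assert "row_id" in dataset.columns or "id" in dataset.columns
# The pull should return actual rows.
assert len(dataset) > 0
print(dataset.version_number)  # should have incremented after the push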
Key Principles
- Use the simplest transport that is reliable. Large Parquet payloads should use the upload/session flow. Git is still useful for smaller repos, scorer scripts, and benchmark metadata.
- Always include row_id. Stable row identity enables resume, comparison, and version diffing.
- Parquet only for subsets. Only .parquet files in data/ are auto-detected as subsets. CSV and JSONL are stored but not indexed.
- Version everything. Each push creates a new version. Pin versions in eval scripts for reproducibility.
- Start small, scale up. Test with a few rows before pushing large datasets, especially with multimodal assets.