Nebius Batch Inference — Synthetic Data Generation

Run large-scale asynchronous LLM jobs at 50% of the real-time price without consuming rate limits. Ideal for generating synthetic training datasets, annotation, evaluation sets, or any offline bulk inference.

Prerequisites

pip install openai
export NEBIUS_API_KEY="your-key"

API base: https://api.tokenfactory.nebius.com/v1/

Limits & pricing

Constraint              Value
Max requests per file   5,000,000
Max file size           10 GB
Completion window       24 hours
Cost vs real-time       50% cheaper
Rate limits             Not consumed
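
Before uploading, it can help to sanity-check a batch file against these limits. A minimal sketch (the helper name and messages are illustrative; the limits come from the table above):

```python
import os

MAX_REQUESTS = 5_000_000
MAX_BYTES = 10 * 1024**3  # 10 GB

def check_batch_file(path: str) -> None:
    """Raise ValueError if the JSONL file exceeds the batch limits."""
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(f"file is {size} bytes, over the 10 GB limit")
    with open(path) as f:
        n_requests = sum(1 for _ in f)  # one request per line
    if n_requests > MAX_REQUESTS:
        raise ValueError(f"file has {n_requests} requests, over the 5M limit")
    print(f"OK: {n_requests} requests, {size} bytes")
```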

Complete pipeline

1. Build JSONL batch file

Each line = one inference request. All requests must use the same model.

import json, uuid

prompts = [
    "Explain vector databases for beginners.",
    "What is the difference between RAG and fine-tuning?",
    # ... up to 5M prompts
]

with open("batch_requests.jsonl", "w") as f:
    for prompt in prompts:
        f.write(json.dumps({
            "custom_id": str(uuid.uuid4()),   # unique ID to match results
            "url": "/v1/chat/completions",
            "body": {
                "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
                "messages": [
                    {"role": "system", "content": "You are a helpful expert."},
                    {"role": "user",   "content": prompt},
                ],
                "max_tokens": 1024,
                "temperature": 0.7,
            },
        }) + "\n")

2. Upload + create batch job

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.tokenfactory.nebius.com/v1/",
    api_key=os.environ["NEBIUS_API_KEY"],
)

with open("batch_requests.jsonl", "rb") as f:
    file_obj = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=file_obj.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={"description": "synthetic-data-gen"},
)
print(f"Batch: {batch.id}  status={batch.status}")

3. Poll until complete

import time

while True:
    batch = client.batches.retrieve(batch.id)
    counts = batch.request_counts
    print(f"status={batch.status}  done={counts.completed}/{counts.total}")
    if batch.status in ("completed", "failed", "cancelled", "expired"):
        break
    time.sleep(30)

4. Download outputs

content = client.files.content(batch.output_file_id)  # set once status == "completed"
results = [json.loads(line) for line in content.text.strip().splitlines()]

Each result record:

{
  "custom_id": "...",
  "response": {
    "body": {
      "choices": [{"message": {"content": "The model's response..."}}]
    }
  }
}
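
Individual requests can fail even when the batch as a whole completes, so it is worth splitting successes from failures before post-processing. A sketch assuming the OpenAI-style batch output schema (each record may carry an `error` field and a `response.status_code`):

```python
import json

def split_results(lines):
    """Partition raw JSONL result lines into (successes, failures)."""
    ok, bad = [], []
    for line in lines:
        rec = json.loads(line)
        # A non-null error or a non-200 status marks a failed request.
        if rec.get("error") or rec.get("response", {}).get("status_code", 200) != 200:
            bad.append(rec)
        else:
            ok.append(rec)
    return ok, bad
```

Failed `custom_id`s can then be collected into a fresh JSONL file and resubmitted as a follow-up batch.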

5. Export as fine-tuning JSONL

# Build custom_id → original prompt lookup
id_to_prompt = {}
with open("batch_requests.jsonl") as f:
    for line in f:
        req = json.loads(line)
        user_msg = next(m["content"] for m in req["body"]["messages"] if m["role"] == "user")
        id_to_prompt[req["custom_id"]] = user_msg

with open("training.jsonl", "w") as out:
    for rec in results:
        reply  = rec["response"]["body"]["choices"][0]["message"]["content"].strip()
        prompt = id_to_prompt.get(rec["custom_id"], "")
        if len(reply) < 50:      # quality filter
            continue
        out.write(json.dumps({
            "messages": [
                {"role": "user",      "content": prompt},
                {"role": "assistant", "content": reply},
            ]
        }) + "\n")

Tips for synthetic data quality

  • Use a large teacher model (70B+) to generate, then fine-tune a smaller model — teacher distillation
  • Set temperature: 0.6–0.8 for diverse yet coherent outputs
  • Add a quality filter (min length, keyword checks) before using as training data
  • Deduplicate on content (prompt/response pairs) before uploading as a training file; custom_id values are UUIDs and therefore always unique
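
The deduplication tip can be sketched with a content hash over each prompt/response pair. This only catches exact duplicates; fuzzy methods such as MinHash would also catch near-duplicates (the function name is illustrative):

```python
import hashlib
import json

def dedupe_training_file(src: str, dst: str) -> int:
    """Drop exact-duplicate message lists; return the number of rows kept."""
    seen, kept = set(), 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            rec = json.loads(line)
            # Canonical JSON so key ordering does not affect the hash.
            key = hashlib.sha256(
                json.dumps(rec["messages"], sort_keys=True).encode()
            ).hexdigest()
            if key in seen:
                continue
            seen.add(key)
            fout.write(line)
            kept += 1
    return kept
```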

Clean up batch files

You can have up to 500 batch files. Delete old ones:

client.files.delete("file_123")

Bundled reference

Read references/batch-format.md when the user asks about JSONL structure, file limits, or output format.

Reference script

Full working script: scripts/05_batch_inference_synthetic.py

Docs: https://docs.tokenfactory.nebius.com/ai-models-inference/batch-inference
