# Together Batch Inference

## Overview
Use Together AI's Batch API for large offline workloads where latency is not the primary concern.
Typical fits:
- bulk classification
- synthetic data generation
- dataset transformations
- large summarization or enrichment jobs
- low-cost asynchronous inference
## When This Skill Wins
- The user has many independent requests to run
- A JSONL request file is acceptable
- Turnaround time can be minutes or hours instead of seconds
- Lower cost matters more than immediate interactivity
## Hand Off To Another Skill
- Use `together-chat-completions` for real-time requests or tool-calling apps
- Use `together-evaluations` for managed LLM-as-a-judge workflows
- Use `together-embeddings` for retrieval-specific vector generation
## Quick Routing
- End-to-end batch workflow: start with scripts/batch_workflow.py or scripts/batch_workflow.ts
- Request format, status model, and result downloads: see references/api-reference.md
- Operational guidance and batch sizing: see references/api-reference.md
## Workflow
- Build a JSONL file where each line contains `custom_id` and `body`.
- Upload the file with `purpose="batch-api"`.
- Create the batch with `input_file_id=...` and the target endpoint.
- Poll until the job is terminal.
- Download output and error files, then reconcile by `custom_id` (sketched below).
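A condensed Python sketch of these five steps, assuming the v2 SDK (`together>=2.0.0`). The model name, the `endpoint` argument, the status strings, and the `files.upload` / `files.retrieve_content` signatures are illustrative assumptions; treat scripts/batch_workflow.py and references/api-reference.md as authoritative.

```python
import json
import time

from together import Together  # requires together>=2.0.0

client = Together()  # reads TOGETHER_API_KEY from the environment

# 1. Build the JSONL request file: one independent request per line.
with open("requests.jsonl", "w") as f:
    for i, text in enumerate(["first document", "second document"]):
        line = {
            "custom_id": f"req-{i}",  # stable ID for reconciling results later
            "body": {
                "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative
                "messages": [{"role": "user", "content": f"Summarize: {text}"}],
            },
        }
        f.write(json.dumps(line) + "\n")

# 2. Upload with the batch purpose.
uploaded = client.files.upload(file="requests.jsonl", purpose="batch-api")

# 3. Create the batch. create() returns a wrapper; the job lives on .job.
response = client.batches.create(
    input_file_id=uploaded.id,
    endpoint="/v1/chat/completions",  # assumed name for the target-endpoint parameter
)
batch_id = response.job.id

# 4. Poll until terminal. retrieve() returns the batch object directly.
while True:
    batch = client.batches.retrieve(batch_id)
    if batch.status in ("COMPLETED", "FAILED", "EXPIRED", "CANCELLED"):  # assumed names
        break
    time.sleep(30)

# 5. Download output and error files, then reconcile by custom_id.
if getattr(batch, "output_file_id", None):
    client.files.retrieve_content(batch.output_file_id, output="results.jsonl")
if getattr(batch, "error_file_id", None):
    client.files.retrieve_content(batch.error_file_id, output="errors.jsonl")
```

Reconciliation then joins results.jsonl and errors.jsonl back to the input rows on `custom_id`.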
## High-Signal Rules
- Python scripts require the Together v2 SDK (`together>=2.0.0`). If the user is on an older version, they must upgrade first: `uv pip install --upgrade "together>=2.0.0"`.
- Use `input_file_id`, not legacy file parameters.
- Keep `custom_id` values stable and meaningful so result reconciliation is easy.
- Batch is for independent requests. If the workload depends on shared conversation state, it is probably the wrong tool.
- Always inspect the error file in addition to the success output.
- `client.batches.create()` returns a wrapper; access the batch object via `response.job` (e.g., `response.job.id`). `client.batches.retrieve()` returns the batch object directly.
- For classification or labeling workloads, set `max_tokens` low (e.g., 4), use `temperature: 0`, and constrain the system prompt to return only the label. This minimizes output tokens and cost (see the sketch after this list).
- Small batches (under 1K requests) typically complete in minutes. The 24-hour completion window is a maximum, not typical.
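To make the classification rule concrete, here is a minimal sketch of one JSONL request line. The model name and label set are hypothetical; only the `custom_id`/`body` layout and the `max_tokens`/`temperature` settings come from the rules above.

```python
import json

def classification_line(custom_id: str, text: str) -> str:
    """One JSONL request line for a labeling job (illustrative model and labels)."""
    return json.dumps({
        "custom_id": custom_id,  # stable, meaningful ID, e.g. "ticket-4821"
        "body": {
            "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
            "max_tokens": 4,   # the label is a few tokens at most
            "temperature": 0,  # deterministic labeling
            "messages": [
                {
                    "role": "system",
                    "content": "Classify the ticket as exactly one of: "
                               "billing, bug, feature. Return only the label.",
                },
                {"role": "user", "content": text},
            ],
        },
    })

with open("labels.jsonl", "w") as f:
    f.write(classification_line("ticket-0", "I was charged twice this month.") + "\n")
```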
## Resource Map
- API reference and operational guidance: references/api-reference.md
- Python workflow: scripts/batch_workflow.py
- TypeScript workflow: scripts/batch_workflow.ts
## Official Docs

- Together AI documentation: https://docs.together.ai