# Together Chat Completions

## Overview
Use Together AI's serverless chat/completions API for interactive inference workloads:
- basic text generation
- streaming responses
- multi-turn chat state
- tool and function calling
- structured outputs
- reasoning-capable models
Treat this skill as the default entry point for Together AI text generation unless the task is clearly offline batch processing, vector retrieval, model training, or infrastructure management.
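As orientation, here is a minimal sketch of a single-turn call and a multi-turn follow-up, assuming the v2 Python SDK (`together>=2.0.0`) and `TOGETHER_API_KEY` in the environment; the model slug is only an example (pick per references/models.md):

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment
MODEL = "meta-llama/Llama-3.3-70B-Instruct-Turbo"  # example slug; see references/models.md

# Single turn: full message history in, one assistant message out.
messages = [{"role": "user", "content": "Name one use for a chat completions API."}]
first = client.chat.completions.create(model=MODEL, messages=messages)
reply = first.choices[0].message.content
print(reply)

# Multi-turn: append the assistant reply, then the next user turn, and resend everything.
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Give a second, different use."})
second = client.chat.completions.create(model=MODEL, messages=messages)
print(second.choices[0].message.content)
```

The multi-turn half also illustrates the history rule below: the model sees only what is in `messages`, so every prior turn must be resent.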
## When This Skill Wins
- Build a chatbot, assistant, or text-generation endpoint on Together AI
- Add streaming output to a real-time user experience
- Implement tool calling or function-calling loops
- Constrain model output to JSON or a regex-defined shape
- Choose between standard chat models and reasoning models
- Debug request parameters, model behavior, or response shapes
## Hand Off To Another Skill

- Use `together-batch-inference` for large offline runs, backfills, or lower-cost asynchronous jobs
- Use `together-embeddings` for vector search, semantic retrieval, or reranking
- Use `together-fine-tuning` when the user wants to train or adapt a model
- Use `together-dedicated-endpoints` when the user needs always-on single-tenant hosting
- Use `together-dedicated-containers` or `together-gpu-clusters` for custom infrastructure
## Quick Routing

- Basic chat, streaming, or multi-turn state
  - Start with references/api-parameters.md
  - Use scripts/chat_basic.py or scripts/chat_basic.ts
- OpenAI SDK migration, rate limits, or debug headers
  - Read references/api-parameters.md
  - Use scripts/debug_headers.py or scripts/debug_headers.ts
- Parallel async requests
  - Use scripts/async_parallel.py
- Tool calling or function calling
  - Read references/function-calling-patterns.md
  - Use scripts/tool_call_loop.py or scripts/tool_call_loop.ts
- Structured outputs
  - Read references/structured-outputs.md
  - Use scripts/structured_outputs.py or scripts/structured_outputs.ts
- Reasoning models or thinking-mode toggles
  - Read references/reasoning-models.md
  - Start from scripts/reasoning_models.py or scripts/reasoning_models.ts
- Combining tools + structured output, or tools + streaming
  - Read the "Combining Tool Calls with Structured Output" section in references/function-calling-patterns.md
  - Read the "Streaming Structured Output" section in references/structured-outputs.md
- Model selection, context length, or pricing-aware choices
  - Read references/models.md
## Workflow
- Confirm that the workload is interactive serverless inference rather than batch, retrieval, or training.
- Pick the smallest model that satisfies latency, quality, and context requirements.
- Decide whether the job needs plain text, tools, structured output, or reasoning.
- Start from the matching script instead of re-deriving request shapes from scratch.
- Pull deeper details from the relevant reference file only when needed.
## High-Signal Rules

- Python scripts require the Together v2 SDK (`together>=2.0.0`). If the user is on an older version, they must upgrade first: `uv pip install --upgrade "together>=2.0.0"`.
- Use `client.chat.completions.create()` in both the Python and TypeScript SDKs.
- Preserve the full `messages` history for multi-turn conversations; do not rebuild context from final text only.
- For tools, implement the full loop: model tool call -> execute tool -> append tool result -> second model call (see the first sketch after this list).
- Prefer `json_schema` over looser JSON modes when the user needs stable machine-readable output.
- Use reasoning models only when the task benefits from deeper deliberation; otherwise prefer cheaper standard models.
- To combine tool calling with structured output, use a two-phase approach: Phase 1 sends `tools` (no `response_format`); Phase 2 sends `response_format` (no `tools`) after tool results are appended.
- Streaming works with `response_format`; accumulate the chunks and parse the final concatenated string as JSON (see the second sketch below).
- If the user needs many independent requests, combine this skill with scripts/async_parallel.py or hand off to batch inference.
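A minimal sketch of the full tool-call loop named above, assuming the v2 Python SDK; the `get_weather` function, its schema, and the model slug are illustrative, not part of the skill:

```python
import json
from together import Together

client = Together()
MODEL = "meta-llama/Llama-3.3-70B-Instruct-Turbo"  # example slug; see references/models.md

def get_weather(city: str) -> dict:
    # Stand-in tool; a real implementation would call a weather service.
    return {"city": city, "forecast": "sunny", "high_c": 24}

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Lisbon?"}]

# First call: the model decides whether to request a tool.
first = client.chat.completions.create(model=MODEL, messages=messages, tools=tools)
assistant_msg = first.choices[0].message
messages.append(assistant_msg.model_dump())  # keep the tool-call turn in history (assumes pydantic message objects)

# Execute each requested tool and append its result as a "tool" message.
for tc in assistant_msg.tool_calls or []:
    args = json.loads(tc.function.arguments)
    result = get_weather(**args)
    messages.append({
        "role": "tool",
        "tool_call_id": tc.id,
        "content": json.dumps(result),
    })

# Second call: the model answers using the tool results.
final = client.chat.completions.create(model=MODEL, messages=messages, tools=tools)
print(final.choices[0].message.content)
```

scripts/tool_call_loop.py and scripts/tool_call_loop.ts carry the authoritative version of this loop.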
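And a second sketch covering the `json_schema` and streaming-accumulation rules; the `response_format` shape shown is my reading of Together's json_schema mode and the schema itself is illustrative, so confirm details against references/structured-outputs.md:

```python
import json
from together import Together

client = Together()
MODEL = "meta-llama/Llama-3.3-70B-Instruct-Turbo"  # example slug

schema = {  # illustrative schema
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "tags"],
}

response_format = {"type": "json_schema", "schema": schema}

# Non-streaming: the message content is a JSON string matching the schema.
resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Suggest a blog post about GPUs."}],
    response_format=response_format,
)
print(json.loads(resp.choices[0].message.content))

# Streaming: accumulate chunks, then parse the concatenated string once at the end.
stream = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Suggest a blog post about TPUs."}],
    response_format=response_format,
    stream=True,
)
pieces = []
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta and chunk.choices[0].delta.content:
        pieces.append(chunk.choices[0].delta.content)
print(json.loads("".join(pieces)))
```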
## Resource Map
- Parameters and response fields: references/api-parameters.md
- OpenAI compatibility, rate-limit headers, and debug headers: references/api-parameters.md
- Function-calling patterns: references/function-calling-patterns.md
- Structured outputs: references/structured-outputs.md
- Reasoning models: references/reasoning-models.md
- Model catalog: references/models.md
## Scripts
- scripts/chat_basic.py and scripts/chat_basic.ts: basic chat, streaming, and multi-turn state
- scripts/debug_headers.py and scripts/debug_headers.ts: raw-response inspection for routing, latency, and rate-limit headers
- scripts/async_parallel.py: async Python fan-out for independent requests
- scripts/tool_call_loop.py and scripts/tool_call_loop.ts: full tool-call loop
- scripts/structured_outputs.py and scripts/structured_outputs.ts: schema-guided and regex outputs
- scripts/reasoning_models.py and scripts/reasoning_models.ts: reasoning fields, effort, and hybrid toggles
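For fan-out across independent requests, a sketch in the spirit of scripts/async_parallel.py, assuming the SDK's `AsyncTogether` client; the prompts and model slug are illustrative:

```python
import asyncio
from together import AsyncTogether

client = AsyncTogether()
MODEL = "meta-llama/Llama-3.3-70B-Instruct-Turbo"  # example slug

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = ["Define latency.", "Define throughput.", "Define tail latency."]
    # Fire all requests concurrently; results come back in prompt order.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for prompt, answer in zip(prompts, answers):
        print(f"{prompt} -> {answer}")

asyncio.run(main())
```

For very large fan-outs, rate limits make `together-batch-inference` the better hand-off, per the rules above.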
## Official Docs

- https://docs.together.ai