`together-chat-completions`

# Together Chat Completions

## Overview
Use Together AI's serverless chat/completions API for interactive inference workloads:
- basic text generation
- streaming responses
- multi-turn chat state
- tool and function calling
- structured outputs
- reasoning-capable models
Treat this skill as the default entry point for Together AI text generation unless the task is clearly offline batch processing, vector retrieval, model training, or infrastructure management.
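A minimal request sketch for this default path, assuming the Together v2 Python SDK and using an illustrative model name (check references/models.md for the real catalog):

```python
import os


def build_chat_request(model: str, messages: list[dict], **params) -> dict:
    """Assemble the keyword arguments for client.chat.completions.create()."""
    return {"model": model, "messages": messages, **params}


messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain serverless inference in one sentence."},
]
# Example model name for illustration only; see references/models.md.
request = build_chat_request(
    "meta-llama/Llama-3.3-70B-Instruct-Turbo", messages, max_tokens=128
)

if __name__ == "__main__" and os.environ.get("TOGETHER_API_KEY"):
    from together import Together  # requires together>=2.0.0

    client = Together()  # reads TOGETHER_API_KEY from the environment
    response = client.chat.completions.create(**request)
    print(response.choices[0].message.content)
```

Keeping the request construction separate from the API call makes it easy to reuse the same payload for streaming or multi-turn variants later.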
## When This Skill Wins
- Build a chatbot, assistant, or text-generation endpoint on Together AI
- Add streaming output to a real-time user experience
- Implement tool calling or function-calling loops
- Constrain model output to JSON or a regex-defined shape
- Choose between standard chat models and reasoning models
- Debug request parameters, model behavior, or response shapes
## Hand Off To Another Skill

- Use `together-batch-inference` for large offline runs, backfills, or lower-cost asynchronous jobs
- Use `together-embeddings` for vector search, semantic retrieval, or reranking
- Use `together-fine-tuning` when the user wants to train or adapt a model
- Use `together-dedicated-endpoints` when the user needs always-on single-tenant hosting
- Use `together-dedicated-containers` or `together-gpu-clusters` for custom infrastructure
## Quick Routing

- Basic chat, streaming, or multi-turn state
  - Start with references/api-parameters.md
  - Use scripts/chat_basic.py or scripts/chat_basic.ts
- OpenAI SDK migration, rate limits, or debug headers
  - Read references/api-parameters.md
  - Use scripts/debug_headers.py or scripts/debug_headers.ts
- Parallel async requests
  - Use scripts/async_parallel.py
- Tool calling or function calling
  - Read references/function-calling-patterns.md
  - Use scripts/tool_call_loop.py or scripts/tool_call_loop.ts
- Structured outputs
  - Read references/structured-outputs.md
  - Use scripts/structured_outputs.py or scripts/structured_outputs.ts
- Reasoning models or thinking-mode toggles
  - Read references/reasoning-models.md
  - Start from scripts/reasoning_models.py or scripts/reasoning_models.ts
- Combining tools + structured output, or tools + streaming
  - Read the "Combining Tool Calls with Structured Output" section in references/function-calling-patterns.md
  - Read the "Streaming Structured Output" section in references/structured-outputs.md
- Model selection, context length, or pricing-aware choices
  - Read references/models.md
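The streaming structured-output combination routed above reduces to one pattern: accumulate content deltas as they arrive, then parse the concatenated string once the stream ends. A sketch, assuming OpenAI-style stream chunks where each chunk carries `choices[0].delta.content`:

```python
import json


def collect_json_from_stream(stream) -> dict:
    """Accumulate streamed content deltas, then parse the full string as JSON.

    `stream` yields chat-completion chunks; delta.content may be None on
    some chunks (e.g. role-only or final chunks), so those are skipped.
    """
    pieces = []
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            pieces.append(delta.content)
    return json.loads("".join(pieces))
```

In practice the stream would come from `client.chat.completions.create(..., response_format={...}, stream=True)`; parsing must wait until the stream is exhausted, since partial concatenations are rarely valid JSON.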
## Workflow

1. Confirm that the workload is interactive serverless inference rather than batch, retrieval, or training.
2. Pick the smallest model that satisfies latency, quality, and context requirements.
3. Decide whether the job needs plain text, tools, structured output, or reasoning.
4. Start from the matching script instead of re-deriving request shapes from scratch.
5. Pull deeper details from the relevant reference file only when needed.
## High-Signal Rules

- Python scripts require the Together v2 SDK (`together>=2.0.0`). If the user is on an older version, they must upgrade first: `uv pip install --upgrade "together>=2.0.0"`.
- Use `client.chat.completions.create()` in both the Python and TypeScript SDKs.
- Preserve the full `messages` history for multi-turn conversations; do not rebuild context from final text only.
- For tools, implement the full loop: model tool call -> execute tool -> append tool result -> second model call.
- Prefer `json_schema` over looser JSON modes when the user needs stable machine-readable output.
- Use reasoning models only when the task benefits from deeper deliberation; otherwise prefer cheaper standard models.
- To combine tool calling with structured output, use a two-phase approach: Phase 1 sends `tools` (no `response_format`); Phase 2 sends `response_format` (no `tools`) after tool results are appended.
- Streaming works with `response_format`; accumulate chunks and parse the final concatenated string as JSON.
- If the user needs many independent requests, combine this skill with `async_parallel.py` or hand off to batch inference.
## Resource Map
- Parameters and response fields: references/api-parameters.md
- OpenAI compatibility, rate-limit headers, and debug headers: references/api-parameters.md
- Function-calling patterns: references/function-calling-patterns.md
- Structured outputs: references/structured-outputs.md
- Reasoning models: references/reasoning-models.md
- Model catalog: references/models.md
## Scripts
- scripts/chat_basic.py and scripts/chat_basic.ts: basic chat, streaming, and multi-turn state
- scripts/debug_headers.py and scripts/debug_headers.ts: raw-response inspection for routing, latency, and rate-limit headers
- scripts/async_parallel.py: async Python fan-out for independent requests
- scripts/tool_call_loop.py and scripts/tool_call_loop.ts: full tool-call loop
- scripts/structured_outputs.py and scripts/structured_outputs.ts: schema-guided and regex outputs
- scripts/reasoning_models.py and scripts/reasoning_models.ts: reasoning fields, effort, and hybrid toggles
## Official Docs