# Together Chat Completions

## Overview
Use Together AI's serverless chat/completions API for interactive inference workloads:
- basic text generation
- streaming responses
- multi-turn chat state
- tool and function calling
- structured outputs
- reasoning-capable models
Treat this skill as the default entry point for Together AI text generation unless the task is clearly offline batch processing, vector retrieval, model training, or infrastructure management.
## When This Skill Wins
- Build a chatbot, assistant, or text-generation endpoint on Together AI
- Add streaming output to a real-time user experience
- Implement tool calling or function-calling loops
- Constrain model output to JSON or a regex-defined shape
- Choose between standard chat models and reasoning models
- Debug request parameters, model behavior, or response shapes
## Hand Off To Another Skill

- Use `together-batch-inference` for large offline runs, backfills, or lower-cost asynchronous jobs
- Use `together-embeddings` for vector search, semantic retrieval, or reranking
- Use `together-fine-tuning` when the user wants to train or adapt a model
- Use `together-dedicated-endpoints` when the user needs always-on single-tenant hosting
- Use `together-dedicated-containers` or `together-gpu-clusters` for custom infrastructure
## Quick Routing

- Basic chat, streaming, or multi-turn state
  - Start with references/api-parameters.md
  - Use scripts/chat_basic.py or scripts/chat_basic.ts
- OpenAI SDK migration, rate limits, or debug headers
  - Read references/api-parameters.md
  - Use scripts/debug_headers.py or scripts/debug_headers.ts
- Parallel async requests
  - Use scripts/async_parallel.py
- Tool calling or function calling
  - Read references/function-calling-patterns.md
  - Use scripts/tool_call_loop.py or scripts/tool_call_loop.ts
- Structured outputs
  - Read references/structured-outputs.md
  - Use scripts/structured_outputs.py or scripts/structured_outputs.ts
- Reasoning models or thinking-mode toggles
  - Read references/reasoning-models.md
  - Start from scripts/reasoning_models.py or scripts/reasoning_models.ts
- Combining tools + structured output, or tools + streaming
  - Read the "Combining Tool Calls with Structured Output" section in references/function-calling-patterns.md
  - Read the "Streaming Structured Output" section in references/structured-outputs.md
- Model selection, context length, or pricing-aware choices
  - Read references/models.md
## Workflow

1. Confirm that the workload is interactive serverless inference rather than batch, retrieval, or training.
2. Pick the smallest model that satisfies latency, quality, and context requirements.
3. Decide whether the job needs plain text, tools, structured output, or reasoning.
4. Start from the matching script instead of re-deriving request shapes from scratch.
5. Pull deeper details from the relevant reference file only when needed.
## High-Signal Rules

- Python scripts require the Together v2 SDK (`together>=2.0.0`). If the user is on an older version, they must upgrade first: `uv pip install --upgrade "together>=2.0.0"`.
- Use `client.chat.completions.create()` in both the Python and TypeScript SDKs.
- Preserve full `messages` history for multi-turn conversations; do not rebuild context from final text only.
- For tools, implement the full loop: model tool call -> execute tool -> append tool result -> second model call.
- Prefer `json_schema` over looser JSON modes when the user needs stable machine-readable output.
- Use reasoning models only when the task benefits from deeper deliberation; otherwise prefer cheaper standard models.
- To combine tool calling with structured output, use a two-phase approach: Phase 1 sends `tools` (no `response_format`); Phase 2 sends `response_format` (no `tools`) after tool results are appended.
- Streaming works with `response_format`; accumulate chunks and parse the final concatenated string as JSON.
- If the user needs many independent requests, combine this skill with `scripts/async_parallel.py` or hand off to batch inference.
## Resource Map
- Parameters and response fields: references/api-parameters.md
- OpenAI compatibility, rate-limit headers, and debug headers: references/api-parameters.md
- Function-calling patterns: references/function-calling-patterns.md
- Structured outputs: references/structured-outputs.md
- Reasoning models: references/reasoning-models.md
- Model catalog: references/models.md
## Scripts
- scripts/chat_basic.py and scripts/chat_basic.ts: basic chat, streaming, and multi-turn state
- scripts/debug_headers.py and scripts/debug_headers.ts: raw-response inspection for routing, latency, and rate-limit headers
- scripts/async_parallel.py: async Python fan-out for independent requests
- scripts/tool_call_loop.py and scripts/tool_call_loop.ts: full tool-call loop
- scripts/structured_outputs.py and scripts/structured_outputs.ts: schema-guided and regex outputs
- scripts/reasoning_models.py and scripts/reasoning_models.ts: reasoning fields, effort, and hybrid toggles
## Official Docs