
Together Chat Completions

Overview

Use Together AI's serverless chat/completions API for interactive inference workloads:

  • basic text generation
  • streaming responses
  • multi-turn chat state
  • tool and function calling
  • structured outputs
  • reasoning-capable models

Treat this skill as the default entry point for Together AI text generation unless the task is clearly offline batch processing, vector retrieval, model training, or infrastructure management.

When This Skill Wins

  • Build a chatbot, assistant, or text-generation endpoint on Together AI
  • Add streaming output to a real-time user experience
  • Implement tool calling or function-calling loops
  • Constrain model output to JSON or a regex-defined shape
  • Choose between standard chat models and reasoning models
  • Debug request parameters, model behavior, or response shapes

Hand Off To Another Skill

  • Use together-batch-inference for large offline runs, backfills, or lower-cost asynchronous jobs
  • Use together-embeddings for vector search, semantic retrieval, or reranking
  • Use together-fine-tuning when the user wants to train or adapt a model
  • Use together-dedicated-endpoints when the user needs always-on single-tenant hosting
  • Use together-dedicated-containers or together-gpu-clusters for custom infrastructure

Workflow

  1. Confirm that the workload is interactive serverless inference rather than batch, retrieval, or training.
  2. Pick the smallest model that satisfies latency, quality, and context requirements.
  3. Decide whether the job needs plain text, tools, structured output, or reasoning.
  4. Start from the matching script instead of re-deriving request shapes from scratch.
  5. Pull deeper details from the relevant reference file only when needed.
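The workflow above can be sketched as one helper that keeps the full messages history across turns (rule: never rebuild context from final text only). The client construction shown in the comment and any model id you pass are assumptions; this is a sketch, not the skill's canonical script.

```python
# Hedged sketch: the SDK client is passed in. Construct it with
#   from together import Together; client = Together()   # reads TOGETHER_API_KEY
# and pass any Together chat model id; requires together>=2.0.0.

def ask(client, model: str, history: list[dict], user_text: str) -> str:
    """One interactive turn: append the user message, call the API, and
    append the assistant reply so later turns keep the full context."""
    history.append({"role": "user", "content": user_text})
    resp = client.chat.completions.create(model=model, messages=history)
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```

Because `history` is mutated in place, calling `ask` repeatedly with the same list gives the model every prior turn.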

High-Signal Rules

  • Python scripts require the Together v2 SDK (together>=2.0.0). If the user is on an older version, they must upgrade first: uv pip install --upgrade "together>=2.0.0".
  • Use client.chat.completions.create() in both the Python and TypeScript SDKs.
  • Preserve full messages history for multi-turn conversations; do not rebuild context from final text only.
  • For tools, implement the full loop: model tool call -> execute tool -> append tool result -> second model call.
  • Prefer json_schema over looser JSON modes when the user needs stable machine-readable output.
  • Use reasoning models only when the task benefits from deeper deliberation; otherwise prefer cheaper standard models.
  • To combine tool calling with structured output, use a two-phase approach: Phase 1 sends tools (no response_format), Phase 2 sends response_format (no tools) after tool results are appended.
  • Streaming works with response_format; accumulate chunks and parse the final concatenated string as JSON.
  • If the user needs many independent requests, combine this skill with async_parallel.py or hand off to batch inference.
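The full tool-calling loop from the rules above can be sketched as follows. The `get_weather` tool is a hypothetical local function invented for illustration; only the message shapes follow the pattern described.

```python
import json

# Hypothetical local tool; the name, signature, and return shape are illustrative.
def get_weather(city: str) -> str:
    return json.dumps({"city": city, "forecast": "sunny"})

REGISTRY = {"get_weather": get_weather}

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the forecast for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def run_tool_call(tool_call) -> dict:
    """Execute one model-requested tool call and shape the 'tool' result
    message that must be appended before the second model call."""
    args = json.loads(tool_call.function.arguments)
    result = REGISTRY[tool_call.function.name](**args)
    return {"role": "tool", "tool_call_id": tool_call.id, "content": result}

# Full loop (sketch):
#   1. resp = client.chat.completions.create(model=..., messages=history, tools=TOOLS)
#   2. history.append(resp.choices[0].message)        # assistant turn with tool_calls
#   3. for tc in resp.choices[0].message.tool_calls:
#          history.append(run_tool_call(tc))
#   4. final = client.chat.completions.create(model=..., messages=history)
```

For the two-phase combination with structured output, step 4 would send response_format and omit tools, per the rule above.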
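A json_schema request might look like the sketch below. The schema and field names are invented for illustration, and the envelope follows the OpenAI-style "json_schema" shape; check the structured-outputs reference for the exact form your SDK version expects.

```python
import json

# Illustrative schema; pass this dict as response_format= on the request.
RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "contact",
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string"},
            },
            "required": ["name", "email"],
        },
    },
}

def parse_structured(content: str) -> dict:
    """The model returns one JSON string; parse it and fail loudly on gaps."""
    data = json.loads(content)
    missing = [k for k in ("name", "email") if k not in data]
    if missing:
        raise ValueError(f"structured output missing keys: {missing}")
    return data
```

Validating the parsed object, not just parsing it, is what makes json_schema preferable to looser JSON modes for machine-readable pipelines.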
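The streaming-plus-response_format rule above reduces to a small accumulator: collect each chunk's content delta, then parse the concatenated string once the stream ends. The chunk attribute path shown is the standard chat-completions streaming shape; treat it as an assumption to verify against your SDK version.

```python
import json

def accumulate_stream(chunks) -> dict:
    """Collect streamed content deltas, then parse the final string as JSON."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (e.g. role or finish markers) carry no text
            parts.append(delta)
    return json.loads("".join(parts))

# Usage (sketch): pass the iterator returned by
#   client.chat.completions.create(..., stream=True, response_format=...)
```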
