# Together Dedicated Endpoints

## Overview
Use dedicated endpoints for managed, single-tenant model hosting with predictable performance, isolated from the shared serverless pool.
Typical fits:
- production inference with stable latency
- fine-tuned model hosting
- uploaded custom model hosting
- autoscaled model APIs
## When This Skill Wins
- The user needs always-on or single-tenant hosting
- The model is supported for dedicated deployment
- Fine-tuned or uploaded models must be served as endpoints
- Hardware, scaling, or idle-time settings need explicit control
## Hand Off To Another Skill

- Use together-chat-completions for serverless chat inference
- Use together-dedicated-containers for custom runtimes or nonstandard inference pipelines
- Use together-gpu-clusters for raw infrastructure or cluster orchestration
## Quick Routing

- Create and manage a standard endpoint
  - Start with scripts/manage_endpoint.py or scripts/manage_endpoint.ts
  - Read references/api-reference.md
- Lifecycle tuning or troubleshooting
  - Read references/api-reference.md
- Deploy a fine-tuned model
  - Start with scripts/deploy_finetuned.py
  - Read references/dedicated-models.md
- Upload and deploy a custom model
  - Start with scripts/upload_custom_model.py
  - Read references/dedicated-models.md
- Hardware and sizing choices
  - Read references/hardware-options.md
## Workflow

1. Confirm that the task needs dedicated hosting instead of serverless or containers.
2. Verify model eligibility and inspect available hardware.
3. Create the endpoint with explicit scaling and timeout settings (see the sketch after this list).
4. Wait for readiness before sending inference traffic.
5. Stop or delete the endpoint when the workload no longer needs to run.
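The sketch below condenses steps 3 to 5 against the REST API rather than the bundled scripts. The base URL, endpoint paths, request fields (`hardware`, `autoscaling`, `inactive_timeout`), status value, and the model and hardware IDs are assumptions to verify against references/api-reference.md and references/hardware-options.md.

```python
# Minimal dedicated-endpoint lifecycle sketch. Paths, field names, and IDs
# below are assumptions; confirm them in references/api-reference.md.
import os
import time

import requests

API_BASE = "https://api.together.xyz/v1"  # assumed base URL
HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

# Step 3: create the endpoint with explicit scaling and idle-timeout settings.
create_resp = requests.post(
    f"{API_BASE}/endpoints",
    headers=HEADERS,
    json={
        "model": "meta-llama/Llama-3.3-70B-Instruct-Reference",  # hypothetical model ID
        "hardware": "2x_nvidia_h100_80gb_sxm",                   # hypothetical hardware ID
        "autoscaling": {"min_replicas": 1, "max_replicas": 2},
        "inactive_timeout": 30,  # assumed idle minutes before auto-shutdown
    },
)
create_resp.raise_for_status()
endpoint_id = create_resp.json()["id"]

# Step 4: wait for readiness before sending inference traffic.
while True:
    state = requests.get(f"{API_BASE}/endpoints/{endpoint_id}", headers=HEADERS).json()
    if state.get("state") == "STARTED":  # assumed ready-state value
        break
    time.sleep(15)

# ... send inference traffic using the endpoint name as `model` (see High-Signal Rules) ...

# Step 5: delete the endpoint when the workload no longer needs to run.
requests.delete(f"{API_BASE}/endpoints/{endpoint_id}", headers=HEADERS)
```

The scripts in scripts/manage_endpoint.py and scripts/manage_endpoint.ts cover the same lifecycle with the SDK and remain the canonical starting point.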
## High-Signal Rules

- Python scripts require the Together v2 SDK (`together>=2.0.0`). If the user is on an older version, they must upgrade first: `uv pip install --upgrade "together>=2.0.0"`.
- Model eligibility and hardware availability are gating constraints; check them early.
- Endpoint management uses endpoint IDs, while inference usually uses the endpoint name as `model` (see the sketch after this list).
- Autoscaling, auto-shutdown, prompt caching, and speculative decoding materially affect operations and cost.
- For custom or fine-tuned models, do not skip the intermediate verification steps before deployment.
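For the ID-versus-name rule above, a minimal inference sketch, assuming the v2 Python SDK keeps the familiar `Together` client and `chat.completions.create` call; the endpoint name below is hypothetical.

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Management calls (create, update, delete) take the endpoint ID, but inference
# addresses the deployment by its endpoint name passed as `model`.
response = client.chat.completions.create(
    model="my-org/Llama-3.3-70B-Instruct-Reference-abc123",  # hypothetical endpoint name
    messages=[{"role": "user", "content": "Say hello from a dedicated endpoint."}],
)
print(response.choices[0].message.content)
```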
## Resource Map
- API reference: references/api-reference.md
- Operational controls and troubleshooting: references/api-reference.md
- Dedicated model guide: references/dedicated-models.md
- Hardware guide: references/hardware-options.md
- Python endpoint lifecycle: scripts/manage_endpoint.py
- TypeScript endpoint lifecycle: scripts/manage_endpoint.ts
- Fine-tuned deployment: scripts/deploy_finetuned.py
- Custom model upload and deployment: scripts/upload_custom_model.py
## Official Docs