Together AI Evaluations
Overview
Use Together AI evaluations when the user wants a managed LLM-as-a-judge workflow rather than an ad hoc prompt loop.
Core evaluation types:
- Classify: assign outputs to labels
- Score: grade outputs on a numeric scale
- Compare: compare two candidate outputs with bias controls
This skill also covers external providers used as judges or targets when the workflow still runs through Together AI's evaluation system.
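To make the three types concrete, the sketch below shows the kind of per-row verdict each type produces. The field names are illustrative assumptions, not the real schema; the actual result shapes are defined in references/api-reference.md.

```python
# Illustrative per-row judge verdicts for each evaluation type.
# Field names are assumptions for illustration only; see
# references/api-reference.md for the real result schema.
classify_row = {"label": "compliant"}           # Classify: one label from a fixed set
score_row = {"score": 7, "min": 1, "max": 10}   # Score: a grade on a numeric scale
compare_row = {"preferred": "model_a"}          # Compare: which of two candidates won
```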
When This Skill Wins
- Benchmark prompt variants, models, or product responses
- Grade quality, safety, policy compliance, or task success
- Run A/B comparisons between model outputs
- Build repeatable evaluation jobs with uploaded datasets
- Pull results programmatically after asynchronous execution
Hand Off To Another Skill
- Use `together-chat-completions` for one-off inference or manual judge prompts
- Use `together-batch-inference` for bulk offline generation rather than evaluation
- Use `together-fine-tuning` when the user wants to improve the model instead of just measure it
- Use `together-dedicated-endpoints` only if the evaluation target itself is a dedicated endpoint
Quick Routing
- Classify / Score / Compare job setup
  - Start with scripts/run_evaluation.py or scripts/run_evaluation.ts
  - Read references/api-reference.md for exact request shapes
- Dataset formatting
  - Read the dataset sections in references/api-reference.md
- Dataset columns, Jinja2 templates, or pre-generated responses
  - Read the dataset and template sections in references/api-reference.md
  - Use `--eval-column`, `--model-a-column`, or `--model-b-column` in the scripts (see the dataset sketch after this list)
- External providers as judge or target
  - Read the model-source and provider sections in references/api-reference.md
  - Use the scripts with `--judge-model-source external`, `--eval-model-source external`, or compare-side source flags
- Polling, listing, or downloading results
  - Use the retrieval endpoints documented in references/api-reference.md
  - Use `--download-results` in the scripts when you want the per-row JSONL locally
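As a concrete starting point, here is a minimal sketch of a compare-ready dataset where both candidate responses are pre-generated. The column names (`prompt`, `response_a`, `response_b`) are illustrative; whatever names you choose must match the `--eval-column`, `--model-a-column`, and `--model-b-column` flags you pass to the scripts.

```python
import json

# Compare-ready dataset: both candidate responses live alongside the
# prompt. Column names are illustrative -- they only need to match the
# --model-a-column / --model-b-column flags given to run_evaluation.py.
rows = [
    {
        "prompt": "Summarize the refund policy in one sentence.",
        "response_a": "Refunds are issued within 30 days of purchase.",
        "response_b": "You can maybe get money back sometimes.",
    },
]

with open("compare_dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```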
Workflow
- Identify whether the user needs classify, score, or compare.
- Define the dataset schema before writing code.
- Upload the dataset as an eval file and keep the returned file ID.
- Configure judge and target models explicitly, especially when mixing providers.
- Poll status until completion, then download the result file for analysis.
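The steps above map onto the Python SDK roughly as follows. This is a minimal sketch, not a definitive implementation: the `evaluation` resource, its method names, and the field names are assumptions drawn from this workflow, so confirm the exact request shapes against references/api-reference.md.

```python
import time

from together import Together  # requires together>=2.0.0

client = Together()

# 1. Upload the dataset as an eval file and keep the returned file ID.
#    check=False skips local validation, which can misclassify eval
#    datasets. The purpose value is an assumption; see the API reference.
eval_file = client.files.upload(
    file="compare_dataset.jsonl", purpose="eval", check=False
)

# 2. Create the evaluation job with explicit judge and target models.
#    Method and field names below are assumptions for illustration.
job = client.evaluation.create(
    type="compare",
    input_data_file_path=eval_file.id,
    # ...explicit judge and per-side model configuration goes here...
)

# 3. Poll until the job reaches a terminal state.
while True:
    status = client.evaluation.status(job.workflow_id)
    if status.status in ("completed", "error"):
        break
    time.sleep(30)

# 4. Download the per-row results file for analysis; the retrieval
#    endpoints are documented in references/api-reference.md.
```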
High-Signal Rules
- Python scripts require the Together v2 SDK (`together>=2.0.0`). If the user is on an older version, they must upgrade first: `uv pip install --upgrade "together>=2.0.0"`.
- The current SDK examples in this repo use `check=False` for eval uploads because local file validation can misclassify eval datasets.
- Treat dataset schema as part of the product contract; inconsistent fields cause downstream confusion.
- Compare evaluations are best when both candidate responses are already present in the dataset.
- Keep judge configuration explicit. Hidden defaults make benchmark interpretation harder.
- Use Together AI's managed evaluation job instead of rebuilding a manual judge loop when repeatability matters.
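As an example of keeping the judge explicit, a classify job might spell out the judge model, its source, and the label set rather than leaning on hidden defaults. Every field name below is an illustrative assumption; the exact request schema lives in references/api-reference.md.

```python
# Illustrative explicit judge configuration for a classify job.
# Field names are assumptions; see references/api-reference.md for the
# actual request schema.
judge_config = {
    "model": "deepseek-ai/DeepSeek-V3",  # assumed judge model name
    "model_source": "serverless",        # or "external" for another provider
    "system_template": "Label the response as 'compliant' or 'violation'.",
}
labels = ["compliant", "violation"]
```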
Resource Map
- Full API reference: references/api-reference.md
- Dataset formats, Jinja2 templates, and provider shortcuts: references/api-reference.md
- Python workflow: scripts/run_evaluation.py
- TypeScript workflow: scripts/run_evaluation.ts
Official Docs