together-evaluations

Together AI Evaluations

Overview

Use Together AI evaluations when the user wants a managed LLM-as-a-judge workflow rather than an ad hoc prompt loop.

Core evaluation types:

  • Classify: assign outputs to labels
  • Score: grade outputs on a numeric scale
  • Compare: compare two candidate outputs with bias controls

This skill also covers external providers used as judges or targets when the workflow still runs through Together AI's evaluation system.
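
To make the dataset shapes concrete, here is one hypothetical JSONL row per mode; the field names (prompt, completion, response_a, response_b) are placeholders rather than a required schema — the actual schema is whatever you define in the workflow below:

```jsonl
{"prompt": "Summarize the support ticket.", "completion": "The user cannot reset their password after the latest update."}
{"prompt": "Summarize the support ticket.", "response_a": "Password reset fails after the latest update.", "response_b": "The login page loads slowly."}
```

The first row suits classify or score, where the judge grades a single output; the second suits compare, where both candidate responses already ship in the dataset.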

When This Skill Wins

  • Benchmark prompt variants, models, or product responses
  • Grade quality, safety, policy compliance, or task success
  • Run A/B comparisons between model outputs
  • Build repeatable evaluation jobs with uploaded datasets
  • Pull results programmatically after asynchronous execution

Hand Off To Another Skill

  • Use together-chat-completions for one-off inference or manual judge prompts
  • Use together-batch-inference for bulk offline generation rather than evaluation
  • Use together-fine-tuning when the user wants to improve the model instead of just measure it
  • Use together-dedicated-endpoints only if the evaluation target itself is a dedicated endpoint

Workflow

  1. Identify whether the user needs classify, score, or compare.
  2. Define the dataset schema before writing code.
  3. Upload the dataset as an eval file and keep the returned file ID.
  4. Configure judge and target models explicitly, especially when mixing providers.
  5. Poll status until completion, then download the result file for analysis (see the sketch after this list).
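
A minimal end-to-end sketch of steps 3–5, assuming the v2 Python SDK. The file upload and check=False follow the rules below; the evaluation-job calls (client.evaluation.create / retrieve), the purpose string, and the response field names are assumptions about the SDK surface, so confirm them against the official Evaluations reference before relying on this:

```python
import time

from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Step 3: upload the dataset as an eval file and keep the returned file ID.
# check=False skips local validation, which can misclassify eval datasets.
eval_file = client.files.upload(
    file="compare_eval.jsonl",
    purpose="eval",   # assumption: confirm the exact purpose value in the docs
    check=False,
)

# Step 4: configure judge and target models explicitly, especially when
# mixing providers. Method and parameter names below are assumptions.
job = client.evaluation.create(
    type="compare",                       # classify | score | compare
    input_data_file_path=eval_file.id,
    judge_model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # example judge
)

# Step 5: poll status until completion, then download the result file.
while True:
    job = client.evaluation.retrieve(job.workflow_id)  # assumption: retrieve by job ID
    if job.status in ("completed", "error"):
        break
    time.sleep(30)

if job.status == "completed":
    # Result-file field name is an assumption; adjust to the real response shape.
    client.files.retrieve_content(job.results.result_file_id, output="results.jsonl")
```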

High-Signal Rules

  • Python scripts require the Together v2 SDK (together>=2.0.0). If the user is on an older version, they must upgrade first: uv pip install --upgrade "together>=2.0.0". A runtime version check is sketched after this list.
  • The current SDK examples in this repo use check=False for eval uploads because local file validation can misclassify eval datasets.
  • Treat the dataset schema as part of the product contract; inconsistent field names make results harder to interpret downstream.
  • Compare evaluations work best when both candidate responses are already present in the dataset.
  • Keep judge configuration explicit. Hidden defaults make benchmark interpretation harder.
  • Use Together AI's managed evaluation job instead of rebuilding a manual judge loop when repeatability matters.
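
For the first rule, a small runtime guard using only the standard library; it checks the installed major version and repeats the upgrade command from the rule above:

```python
from importlib.metadata import PackageNotFoundError, version

UPGRADE_CMD = 'uv pip install --upgrade "together>=2.0.0"'

try:
    installed = version("together")
except PackageNotFoundError:
    raise SystemExit(f"Together SDK not installed; run: {UPGRADE_CMD}")

# The evaluations examples assume the v2 SDK, so refuse to continue on 1.x.
if int(installed.split(".")[0]) < 2:
    raise SystemExit(f"together=={installed} is too old; run: {UPGRADE_CMD}")

print(f"together=={installed} satisfies the v2 requirement")
```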

