together-embeddings
Together Embeddings & Reranking
Overview
Use this skill for semantic retrieval components:
- create embeddings
- batch embeddings
- build retrieval or RAG pipelines
- rerank retrieved candidates
This skill is for retrieval plumbing, not for the final language-model response itself.
When This Skill Wins
- Build vector search or semantic similarity features
- Add embedding generation to a data pipeline
- Improve retrieval quality with reranking
- Assemble a retrieval stage before calling a chat model
Hand Off To Another Skill
- Use `together-chat-completions` for the final answer-generation step
- Use `together-batch-inference` for very large offline embedding backfills
- Use `together-dedicated-endpoints` when reranking requires a dedicated deployment
Quick Routing
- Embeddings API usage
  - Read references/api-reference.md
  - Start with scripts/embed_and_rerank.py or scripts/embed_and_rerank.ts (a minimal call sketch follows this list)
- Semantic search (embed, store, query)
  - Start with scripts/semantic_search.py, which includes an in-memory vector store, cosine-similarity retrieval, and optional rerank
- RAG pipeline composition
  - Start with scripts/rag_pipeline.py
- Model selection and rerank constraints
  - Read references/models.md
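For orientation, a minimal embeddings call with the v2 Python SDK looks like the sketch below. The model name is an illustrative choice, not a recommendation; check references/models.md for the supported list, and assume `TOGETHER_API_KEY` is set in the environment.

```python
# Minimal embeddings call with the Together v2 SDK (together>=2.0.0).
# The model name below is an example; verify against references/models.md.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

response = client.embeddings.create(
    model="BAAI/bge-base-en-v1.5",  # example embedding model
    input=["How do I rerank retrieved candidates?"],
)

vector = response.data[0].embedding  # list[float]
print(f"dimensions: {len(vector)}")
```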
Workflow
- Confirm that the user needs vectors or retrieval, not direct generation.
- Choose the embedding model and batch shape.
- Generate embeddings for corpus and query paths consistently.
- Retrieve candidates. An in-memory cosine-similarity store works for prototyping and small corpora (see `semantic_search.py`). Use a dedicated vector database for production scale.
- Rerank only when the extra latency and endpoint requirement are justified. When no dedicated rerank endpoint is available, cosine-similarity ranking is a reasonable fallback (a retrieval sketch follows this list).
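As a concrete sketch of the retrieve-candidates step (what scripts/semantic_search.py does more fully), the in-memory store can be a normalized matrix and retrieval a dot product. The model name and corpus here are illustrative, and this is prototype-scale code, not production retrieval.

```python
# In-memory cosine-similarity retrieval: a prototyping sketch, not the
# repo's semantic_search.py. Model name and corpus are examples.
import numpy as np
from together import Together

client = Together()
MODEL = "BAAI/bge-base-en-v1.5"  # use the same model for corpus and queries

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=MODEL, input=texts)
    return np.array([item.embedding for item in resp.data])

corpus = [
    "Together serves embedding models.",
    "Rerankers reorder retrieved candidates.",
    "Cats sleep most of the day.",
]
corpus_vecs = embed(corpus)
# Normalize once so cosine similarity reduces to a dot product.
corpus_vecs /= np.linalg.norm(corpus_vecs, axis=1, keepdims=True)

query_vec = embed(["How do I reorder search results?"])[0]
query_vec /= np.linalg.norm(query_vec)

scores = corpus_vecs @ query_vec
for idx in np.argsort(scores)[::-1][:2]:  # top-2 candidates
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```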
High-Signal Rules
- Python scripts require the Together v2 SDK (`together>=2.0.0`). If the user is on an older version, they must upgrade first: `uv pip install --upgrade "together>=2.0.0"`.
- Keep embeddings and reranking conceptually separate; rerank is a second-stage precision step.
- Reranking in this repo assumes a dedicated endpoint. Do not promise serverless rerank unless the product changes. When no endpoint is available, fall back to cosine-similarity ranking.
- The embedding model has a 514-token context limit. Chunk longer documents before embedding (a chunking sketch follows this list).
- The `rag_pipeline.py` example demonstrates retrieval plus generation; treat generation as a hand-off to chat completions.
- Preserve model consistency across indexing and querying.
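For the 514-token limit, a conservative word-count chunker is often enough at prototype stage. This is a sketch under the assumption that whitespace-split words undercount tokens, so the word budget stays well below the limit; swap in the model's actual tokenizer for exact control. The file path is hypothetical.

```python
# Conservative word-based chunker: a sketch, not the repo's chunking logic.
# Words only approximate tokens, so the budget stays well under the
# 514-token limit; use the model's tokenizer for exact counts.
def chunk_words(text: str, max_words: int = 350, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks = []
    step = max_words - overlap  # stride between chunk starts
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # last chunk reached the end of the document
    return chunks

# Hypothetical input file; embed each chunk with the same model used at query time.
chunks = chunk_words(open("long_document.txt").read())
```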
Resource Map
- API details: references/api-reference.md
- Model guide: references/models.md
- Python embeddings example: scripts/embed_and_rerank.py
- TypeScript embeddings example: scripts/embed_and_rerank.ts
- Python semantic search: scripts/semantic_search.py
- Python RAG pipeline: scripts/rag_pipeline.py
Official Docs