LangSmith

When to use this skill

  • The user already chose LangSmith and needs the smallest useful packet for tracing, evals, review, prompt ownership, or trace propagation.
  • A request mentions LangChain tracing, @traceable / traceable, wrap_openai / wrapOpenAI, annotation queues, run trees, trace IDs, datasets, or experiment comparison.
  • The team needs to decide whether a change needs debugging traces, an offline eval gate, an online review queue, Prompt Hub / prompt-registry work, or multi-service lineage.
  • The user has some LangSmith setup already, but confidence is still low and they need an audit / gap review.

Do not use this skill as the main workflow when:

  • The real job is generic dashboards / alerts / SLOs / telemetry ownership → monitoring-observability
  • The real job is existing-log root-cause triage → log-analysis
  • The real job is code-level reproduction / bug isolation → debugging
  • The real job is deployment / rollout choreography → deployment-automation
  • The real job is generic exported-metrics interpretation or KPI synthesis → data-analysis
  • The real job is runtime guardrails / approvals / policy middleware → the relevant security or policy skill

Core idea

langsmith should behave like a workflow packet router, not a giant SDK encyclopedia.

  1. Label the request before naming APIs.
  2. Choose one primary packet.
  3. Pick the smallest instrumentation / eval / review surface that answers the question.
  4. Return a compact operator brief.
  5. Route adjacent observability, debugging, deployment, and policy work outward fast.

Read this skill's support docs before choosing the packet.

Instructions

Step 1: Normalize the request

Convert the prompt into this intake packet first:

```yaml
langsmith_packet:
  primary_packet: trace-debug | eval | review | prompt-registry | propagation | audit
  app_shape: chat | rag | tool-calling | agent | batch-job | multi-service | mixed | unknown
  runtime: python | typescript | mixed | cli-only | unknown
  request_type: setup | debug | pre-ship gate | production review | prompt decision | cross-service tracing | audit
  current_evidence: traces | datasets | evaluators | feedback queues | prompt versions | exported runs | none
  confidence_problem: missing traces | noisy traces | no regression gate | weak review rubric | prompt drift | broken lineage | mixed | unknown
  route_after: stay-here | monitoring-observability | log-analysis | debugging | deployment-automation | data-analysis
```

Choose one primary packet for the run. If two seem plausible, pick the one that reduces uncertainty fastest.

Step 2: Choose the packet

| Packet | Use when | Main output | Typical signals |
|---|---|---|---|
| trace-debug | The real question is “what happened?” | trace-debug packet | missing visibility, wrappers/decorators, trace coverage, metadata discipline |
| eval | A change needs pre-ship or regression confidence | eval packet | dataset creation, evaluators, experiments, baseline/candidate comparison |
| review | Traces exist but trust is weak in production | review packet | annotation queues, feedback criteria, human scoring, inline review |
| prompt-registry | Prompt/version ownership is the bottleneck | prompt packet | Prompt Hub, prompt versions, Git sync, compare prompt variants |
| propagation | One request spans services, jobs, tools, or agents | propagation packet | parent/child trace IDs, cross-service headers, multi-hop lineage |
| audit | LangSmith is present but still not useful | audit packet | noisy traces, missing IDs, absent evals, unclear prompt ownership |

Packet rules:

  • Prefer trace-debug when the team cannot explain the run tree.
  • Prefer eval when the next risk is shipping without a regression gate.
  • Prefer review when real-user output needs human or structured quality feedback.
  • Prefer prompt-registry only when prompt/version choice is the actual bottleneck.
  • Prefer propagation when one request fans out beyond a single service boundary.
  • Prefer audit when the system already “has LangSmith” but confidence is still low.

Step 3: Pick the smallest useful surface

| Surface | Use when | Strengths | Risks |
|---|---|---|---|
| Provider wrappers (`wrap_openai`, `wrap_anthropic`, `wrapOpenAI`) | Need quick model-call coverage | Fastest path to visible traces | Misses higher-level workflow context if used alone |
| Decorators / `traceable()` | Need named spans around app functions | Clear span boundaries, reusable | Still needs deliberate placement |
| Manual trace blocks / tracing context | Need scoped or temporary visibility | Good for audits and suspicious blocks | Easy to under-instrument |
| Annotation queues + feedback criteria | Need structured human review | Preserves rubric and reviewer flow | Review ownership still has to be designed |
| Dataset + experiment workflow | Need benchmark or regression checks | Gives pre-ship comparison | Can drift into generic eval theory if not bounded |
| CLI export/list/get flows | Need audit/export operations more than instrumentation | Good for scripts and inspection | Not a substitute for instrumentation design |
| Cross-service propagation headers / metadata | Need end-to-end lineage | Preserves parent/child context | Easy to forget at service boundaries |

Rules:

  • Prefer the smallest surface that answers the present question.
  • Always name project/workspace/environment grouping.
  • Preserve run_id, trace_id, and metadata if humans or downstream tools will review runs later.
  • Include shutdown flush / wait behavior for CLIs, workers, and short-lived jobs.
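
As a concrete starting point, here is a minimal Python sketch of the wrapper-plus-decorator surface with the explicit flush the rules above call for. It assumes a recent `langsmith` SDK with `LANGSMITH_TRACING=true`, `LANGSMITH_API_KEY`, and `LANGSMITH_PROJECT` set in the environment; the span name, tags, and model are illustrative placeholders:

```python
from openai import OpenAI
from langsmith import Client, traceable
from langsmith.wrappers import wrap_openai

# Provider wrapper: every completion through this client becomes a traced child run.
openai_client = wrap_openai(OpenAI())

@traceable(name="answer_question", tags=["rag", "retriever-v2"])  # named app-level span
def answer_question(question: str) -> str:
    resp = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(answer_question("Why did answer quality regress?"))
    # Short-lived scripts, workers, and CLIs should flush pending traces before exit.
    Client().flush()
```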

Step 4: Choose the eval or review layer only when needed

| Layer | Use when | Notes |
|---|---|---|
| Deterministic / code evaluator | Output can be checked mechanically | Best first choice for structure/factual rules |
| LLM-as-judge (openevals) | Quality is subjective or reference-light | Useful, but not ground truth |
| Human review | Stakes are high or nuance dominates | Needs rubric, reviewer, and cadence |
| Pairwise comparison | Two prompts/models/app versions are being compared | Good for baseline vs candidate decisions |
| Online evaluator | Production traffic needs ongoing scoring | Pair with explicit alert/review semantics |

Do not pretend one layer replaces the others.
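
For the deterministic layer, a minimal `evaluate` sketch, assuming a recent `langsmith` SDK; the dataset name, experiment prefix, and stand-in app are hypothetical:

```python
from langsmith import evaluate

def my_app(question: str) -> str:
    """Stand-in for the application or prompt variant under test."""
    return "42"

def exact_match(run, example):
    # Deterministic code evaluator: a mechanical check, no LLM judge involved.
    return {
        "key": "exact_match",
        "score": int(run.outputs["answer"] == example.outputs["answer"]),
    }

results = evaluate(
    lambda inputs: {"answer": my_app(inputs["question"])},
    data="checkout-faq-regressions",      # hypothetical dataset name
    evaluators=[exact_match],
    experiment_prefix="prompt-v2-candidate",
    max_concurrency=4,                    # keep cost and rate limits bounded
)
```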

Step 5: Return one compact LangSmith brief

Default response shape:

```markdown
## LangSmith brief
- Packet: trace-debug | eval | review | prompt-registry | propagation | audit
- App shape: ...
- Runtime: ...
- Current evidence: ...

## Chosen surface
- wrappers / decorators / manual trace / dataset+experiment / review queue / prompt registry / propagation headers

## Required config
- `LANGSMITH_API_KEY`
- project/workspace/environment
- endpoint/workspace overrides if self-hosted

## Required IDs / metadata
- `run_id`, `trace_id`, tags, metadata fields, user/session identifiers as needed

## Verification step
- one trace query, one experiment comparison, one queue review check, or one prompt-version sanity check

## Route-out
- next neighboring skill if the bottleneck moves
```

Keep it compact. The point is to leave the user with the single smallest next artifact, not a wall of SDK notes.

Step 6: Packet-specific heuristics

For trace-debug

  • Start from the missing visibility surface: app span, tool calls, retrieval, provider call, or worker boundary.
  • Prefer wrappers plus a few high-value decorators over tracing everything blindly.
  • Return the first trace query/filter the team should run once instrumentation lands.
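
That first query might look like this with the Python client; the project and span names are placeholders, and the filter string uses LangSmith's run-filter syntax:

```python
from langsmith import Client

client = Client()
# After instrumentation lands: list recent errored runs for the new span.
runs = client.list_runs(
    project_name="rag-service-prod",        # placeholder project
    filter='eq(name, "answer_question")',   # placeholder span name
    error=True,
)
for run in runs:
    print(run.id, run.name, run.error)
```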

For eval

  • Define the smallest dataset that reflects the risky change.
  • Separate deterministic checks from LLM-as-judge or human review.
  • Name baseline/candidate comparison, experiment naming, concurrency/cost limits, and the ship/no-ship decision point.

For review

  • Define feedback keys, scale semantics, reviewer ownership, and queue cadence.
  • Preserve run_id / trace_id so feedback attaches to the right artifact later.
  • Route broader support-ops workflows out if the issue is bigger than output review.
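
Attaching a reviewer's score to a preserved run might look like this sketch; the feedback key and the 0/1 scale are team-defined assumptions:

```python
from langsmith import Client

client = Client()

def record_review(run_id: str, helpful: bool, note: str) -> None:
    # Feedback attaches to the preserved run_id; define key and scale semantics up front.
    client.create_feedback(
        run_id=run_id,
        key="helpfulness",    # hypothetical feedback key
        score=int(helpful),   # 0/1 scale; document what each value means
        comment=note,
    )
```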

For prompt-registry

  • Explicitly decide whether Prompt Hub / LangSmith should own the prompt or whether Git/app config should stay canonical.
  • Connect prompt versions to evals or experiments; do not treat registry as a separate island.
  • Be explicit when public hub browsing is reference-only rather than the team's source of truth.
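
If the platform is made canonical, pinning a tagged prompt version at runtime might look like this sketch; the prompt name and tag are hypothetical, and `pull_prompt` (available in recent SDK versions) returns a LangChain prompt object:

```python
from langsmith import Client

client = Client()
# Pull a specific tagged version rather than "latest" so deploys stay pinned.
prompt = client.pull_prompt("support-triage:prod")
```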

For propagation

  • Name parent/child headers, metadata handoff rules, and which services must preserve trace context.
  • Keep distributed-systems reliability and alerting concerns routed to generic observability where appropriate.
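
A Python sketch of carrying trace context across one HTTP boundary, assuming a recent `langsmith` SDK; the service names and URL are placeholders:

```python
import requests
from langsmith import traceable
from langsmith.run_helpers import get_current_run_tree, tracing_context

@traceable(name="gateway_handler")
def call_worker(payload: dict) -> dict:
    # Caller side: serialize the current trace context into HTTP headers.
    headers = get_current_run_tree().to_headers()
    resp = requests.post("http://worker.internal/run",  # placeholder URL
                         json=payload, headers=headers)
    return resp.json()

@traceable(name="worker_job")
def handle(payload: dict) -> dict:
    return {"ok": True}

def worker_endpoint(incoming_headers: dict, payload: dict) -> dict:
    # Worker side: resume the same trace as a child of the caller's span.
    with tracing_context(parent=incoming_headers):
        return handle(payload)
```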

For audit

  • Rank gaps by severity: missing coverage, bad metadata, absent evals, weak review flow, prompt/version ambiguity, broken flush or lineage.
  • Recommend the first ratchet, not a giant reinstall plan.

Step 7: Route out aggressively

Switch as soon as LangSmith is no longer the main bottleneck:

  • Generic telemetry / dashboards / alerts / SLOs → monitoring-observability
  • Existing-log root-cause triage → log-analysis
  • Reproduction and bug isolation → debugging
  • Rollout / deploy / environment promotion → deployment-automation
  • Exported metrics, experiments, or KPI interpretation → data-analysis
  • Policy / approval / runtime guardrails → the relevant security or policy skill

Examples

Example 1: Trace-debug packet

Prompt:

Add LangSmith to this FastAPI RAG service so we can debug why answer quality regressed after a retriever change.

Good response shape:

  • choose trace-debug
  • recommend wrappers plus targeted higher-level spans if retrieval/tools must be visible
  • define project/tags/metadata and a flush step for short-lived runs
  • route generic service-health observability outward

Example 2: Eval packet

Prompt:

Before we ship this prompt rewrite, build a LangSmith eval workflow with a baseline, a candidate, and one judge for subjective quality.

Good response shape:

  • choose eval
  • define dataset, evaluator mix, experiment naming, and verification step
  • separate deterministic checks from LLM-as-judge and human review
  • keep rollout orchestration out of scope

Example 3: Review packet

Prompt:

We already have LangSmith traces, but support leads still do not trust the AI assistant. Set up a review workflow with human scoring.

Good response shape:

  • choose review
  • define feedback keys, rubric, queue ownership, and run-ID preservation
  • keep the answer grounded in review workflow rather than generic prompt management

Example 4: Prompt-registry packet

Prompt:

Should these prompts live in LangSmith Prompt Hub or stay in Git?

Good response shape:

  • choose prompt-registry
  • compare platform-owned vs Git-owned prompt source of truth
  • connect the choice to experiments and release confidence
  • stay honest when Prompt Hub is optional

Example 5: Propagation packet

Prompt:

One chat request fans out to workers and tools. How do we keep one trace through LangSmith?

Good response shape:

  • choose propagation
  • define parent/child IDs, metadata propagation, and service boundaries
  • route generic distributed-systems observability outward when needed

Best practices

  1. Start with the packet, not the SDK call.
  2. Treat LangSmith as a trace/eval/review substrate, not the entire production control plane.
  3. Preserve run_id, trace_id, tags, and metadata deliberately.
  4. Prefer the smallest useful surface that answers the present question.
  5. Keep deterministic checks, LLM judges, and human review distinct.
  6. Be explicit when Prompt Hub is optional or reference-only.
  7. Include flush/wait behavior for scripts, workers, and cron-style jobs.
  8. End with one verification step and one route-out.
