Langfuse Observability

Overview

Langfuse is the open-source LLM observability platform that OrchestKit uses for tracing, monitoring, evaluation, and prompt management. Unlike LangSmith, which OrchestKit no longer uses, Langfuse can be self-hosted at no license cost and is designed for production LLM applications.

When to use this skill:

Setting up LLM observability from scratch
Debugging slow or incorrect LLM responses
Tracking token usage and costs
Managing prompts in production
Evaluating LLM output quality
Migrating from LangSmith to Langfuse

OrchestKit Integration:

Status: Migrated from LangSmith (Dec 2025)
Location: backend/app/shared/services/langfuse/
MCP Server: orchestkit-langfuse (optional)

Quick Start

Setup

# backend/app/shared/services/langfuse/client.py
from langfuse import Langfuse
from app.core.config import settings

langfuse_client = Langfuse(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY,
    host=settings.LANGFUSE_HOST  # Self-hosted or cloud
)

Basic Tracing with @observe

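from langfuse.decorators import observe, langfuse_context

@observe()  # Automatic tracing of this function call
async def analyze_content(content: str):
    # Attach custom metadata to the observation created by @observe
    langfuse_context.update_current_observation(
        metadata={"content_length": len(content)}
    )
    # `llm` is the application's LLM client, defined elsewhere
    return await llm.generate(content)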

Session & User Tracking

# Uses the langfuse_client instance created in Setup
trace = langfuse_client.trace(
    name="analysis",
    user_id="user_123",
    session_id="session_abc",
    metadata={"content_type": "article", "agent_count": 8},
    tags=["production", "orchestkit"],
)

Core Features Summary

Feature	Description	Reference
Distributed Tracing	Track LLM calls with parent-child spans	`references/tracing-setup.md`
Cost Tracking	Automatic token & cost calculation	`references/cost-tracking.md`
Prompt Management	Version control for prompts	`references/prompt-management.md`
LLM Evaluation	Custom scoring with G-Eval	`references/evaluation-scores.md`
Session Tracking	Group related traces	`references/session-tracking.md`
Experiments API	A/B testing & benchmarks	`references/experiments-api.md`
Multi-Judge Eval	Ensemble LLM evaluation	`references/multi-judge-evaluation.md`

References

Tracing Setup

See: references/tracing-setup.md

Key topics covered:

Initializing Langfuse client with @observe decorator
Creating nested traces and spans
Tracking LLM generations with metadata
LangChain/LangGraph CallbackHandler integration
Workflow integration patterns
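
To make the parent-child span idea concrete, here is a minimal standard-library sketch of how a tracing decorator can link nested calls. It is not Langfuse's implementation, and `traced`, `SPANS`, and the span fields are hypothetical names chosen for illustration; in OrchestKit the real work is done by `@observe`.

```python
import contextvars
import functools
import time
import uuid

# Holds the span that is currently active in this execution context.
_current_span = contextvars.ContextVar("current_span", default=None)

# Stand-in for an exporter: finished spans are simply appended here.
SPANS = []


def traced(func):
    """Record a span for each call and link it to the enclosing span."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        parent = _current_span.get()
        span = {
            "id": uuid.uuid4().hex,
            "name": func.__name__,
            "parent_id": parent["id"] if parent else None,
        }
        token = _current_span.set(span)  # this span becomes the parent of nested calls
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            span["duration_s"] = time.perf_counter() - start
            _current_span.reset(token)  # restore the previous parent
            SPANS.append(span)

    return wrapper
```

Decorating an outer function and an inner function it calls yields two spans, with the inner span's `parent_id` equal to the outer span's `id`. Inner spans finish first, so they appear earlier in `SPANS`.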

Cost Tracking

See: references/cost-tracking.md

Key topics covered:

Automatic cost calculation from token usage
Custom model pricing configuration
Monitoring dashboard SQL queries
Cost tracking per analysis/user
Daily cost trend analysis
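
As a rough illustration of cost calculation from token usage with custom pricing, the sketch below uses plain Python. The model names, prices, and record fields (`model`, `input_tokens`, `output_tokens`, `user_id`) are hypothetical; Langfuse performs this calculation itself, and this only shows the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """USD price per 1,000,000 tokens (hypothetical values)."""

    input_per_million: float
    output_per_million: float


# Hypothetical model names and prices, for illustration only.
PRICES = {
    "example-small": ModelPrice(input_per_million=1.0, output_per_million=2.0),
    "example-large": ModelPrice(input_per_million=10.0, output_per_million=30.0),
}


def generation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single generation."""
    price = PRICES[model]
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


def cost_per_user(generations: list[dict]) -> dict[str, float]:
    """Sum generation costs by user_id."""
    totals: dict[str, float] = {}
    for g in generations:
        cost = generation_cost(g["model"], g["input_tokens"], g["output_tokens"])
        totals[g["user_id"]] = totals.get(g["user_id"], 0.0) + cost
    return totals
```

Keeping input and output prices separate matters because most providers charge different rates for each.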

Prompt Management

See: references/prompt-management.md

Key topics covered:

Prompt versioning and labels (production/staging/draft)
Template variables with double-brace `{{variable}}` syntax (Jinja2-style)
A/B testing prompt versions
OrchestKit 4-level caching architecture (L1-L4)
Linking prompts to generation spans
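
The sketch below shows what filling template variables amounts to, assuming double-brace placeholders such as `{{content_type}}`. `compile_template` is a hypothetical helper written with the standard library only; in practice the prompt object returned by the Langfuse client handles compilation.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compile_template(template: str, **variables: str) -> str:
    """Replace each {{name}} placeholder with the matching keyword argument."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            # Fail loudly rather than sending a half-filled prompt to the model.
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])

    return _PLACEHOLDER.sub(substitute, template)
```

Raising on a missing variable is a deliberate choice here: a prompt with an unfilled placeholder tends to produce confusing model output that is hard to trace back to its cause.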

LLM Evaluation

See: references/evaluation-scores.md

Key topics covered:

Custom scoring with numeric/categorical values
G-Eval automated quality assessment
Score trends and comparisons
Filtering traces by score thresholds
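
Filtering traces by a score threshold can be sketched as follows, assuming scores have already been exported as plain dictionaries with hypothetical `trace_id`, `name`, and `value` fields. The function name and record shape are illustrative, not part of the Langfuse API.

```python
from collections import defaultdict
from statistics import mean


def low_scoring_traces(scores: list[dict], name: str, threshold: float) -> list[str]:
    """Return ids of traces whose mean score for `name` is below `threshold`."""
    by_trace: dict[str, list[float]] = defaultdict(list)
    for score in scores:
        if score["name"] == name:
            by_trace[score["trace_id"]].append(score["value"])
    return sorted(
        trace_id
        for trace_id, values in by_trace.items()
        if mean(values) < threshold
    )
```

Averaging per trace before comparing handles the case where one trace was scored several times under the same metric name.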

Session Tracking

See: references/session-tracking.md

Key topics covered:

Grouping traces by session_id
Multi-turn conversation tracking
User and metadata analytics
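
Grouping by `session_id` is what turns individual traces into a multi-turn conversation view. A minimal sketch, assuming traces are available as dictionaries with hypothetical `session_id` and `timestamp` fields:

```python
from collections import defaultdict


def group_by_session(traces: list[dict]) -> dict[str, list[dict]]:
    """Group traces by session_id, ordering each session's traces by timestamp."""
    sessions: dict[str, list[dict]] = defaultdict(list)
    for trace in traces:
        sessions[trace["session_id"]].append(trace)
    for items in sessions.values():
        items.sort(key=lambda t: t["timestamp"])
    return dict(sessions)
```

Sorting within each session restores turn order even when traces arrive out of sequence.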

Experiments API

See: references/experiments-api.md

Key topics covered:

Creating test datasets in Langfuse
Running automated evaluations
Regression testing for LLMs
Benchmarking prompt versions
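
A regression test over a dataset reduces to three steps: run the task on each item, score the output, and compare the aggregate against a baseline. The sketch below shows that loop in plain Python. The item fields (`input`, `expected_output`), the `exact_match` scorer, and the tolerance value are assumptions for illustration and do not reflect the Experiments API itself.

```python
from typing import Callable


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when output and expected match, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(
    dataset: list[dict],
    task: Callable[[str], str],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Run `task` on every dataset item and return the mean score."""
    if not dataset:
        raise ValueError("dataset must contain at least one item")
    scores = [scorer(task(item["input"]), item["expected_output"]) for item in dataset]
    return sum(scores) / len(scores)


def has_regressed(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Flag a regression when the score drops more than `tolerance` below baseline."""
    return current < baseline - tolerance
```

The tolerance absorbs small run-to-run variation so that a regression is only reported for a meaningful drop.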

Multi-Judge Evaluation

See: references/multi-judge-evaluation.md

Key topics covered:

Multiple LLM judges for quality assessment
Weighted scoring across judges
OrchestKit langfuse_evaluators.py integration
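
Weighted scoring across judges is a weighted mean. The following sketch assumes each judge returns a score on a shared scale. The judge names and weights are hypothetical and are not taken from `langfuse_evaluators.py`.

```python
def weighted_score(judge_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-judge scores into one value using a weighted mean.

    Only judges present in `judge_scores` contribute, so the result stays
    normalized even if one judge failed to return a score.
    """
    total_weight = sum(weights[judge] for judge in judge_scores)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    weighted_sum = sum(score * weights[judge] for judge, score in judge_scores.items())
    return weighted_sum / total_weight
```

For example, scores of 0.8, 0.6, and 1.0 with weights 2, 1, and 1 combine to (1.6 + 0.6 + 1.0) / 4 = 0.8.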

Best Practices

Always use @observe decorator for automatic tracing
Set user_id and session_id for better analytics
Add meaningful metadata (content_type, analysis_id, etc.)
Score all production traces for quality monitoring
Use prompt management instead of hardcoded prompts
Monitor costs daily to catch spikes early
Create datasets for regression testing
Tag production vs staging traces
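
One simple way to act on daily cost monitoring is to flag any day whose cost exceeds a multiple of the trailing average. This is a sketch under assumed defaults (a 7-day window and a factor of 2.0), not a recommendation from Langfuse or OrchestKit.

```python
from statistics import mean


def cost_spikes(daily_costs: list[float], window: int = 7, factor: float = 2.0) -> list[int]:
    """Return indices of days whose cost exceeds `factor` times the trailing mean."""
    spikes = []
    for i in range(window, len(daily_costs)):
        baseline = mean(daily_costs[i - window:i])  # the `window` days before day i
        if baseline > 0 and daily_costs[i] > factor * baseline:
            spikes.append(i)
    return spikes
```

The first `window` days are never flagged because there is not yet enough history to form a baseline.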

LangSmith Migration Notes

Key Differences:

Aspect	Langfuse	LangSmith
Hosting	Open-source, self-hostable (cloud also available)	Proprietary, primarily cloud-hosted
Cost	Free when self-hosted	Paid plans
Prompts	Built-in management	External storage needed
Decorator	`@observe`	`@traceable`

External References

Related Skills

observability-monitoring - General observability patterns for metrics, logging, and alerting
llm-evaluation - Evaluation patterns that integrate with Langfuse scoring
llm-streaming - Streaming response patterns with trace instrumentation
prompt-caching - Caching strategies that reduce costs tracked by Langfuse

Key Decisions

Decision	Choice	Rationale
Observability platform	Langfuse (not LangSmith)	Open-source, self-hosted, free, built-in prompt management
Tracing approach	@observe decorator	Automatic, low-overhead instrumentation
Cost tracking	Automatic token counting	Built-in model pricing with custom overrides
Prompt management	Langfuse native	Version control, A/B testing, labels in one place

Capability Details

distributed-tracing

Keywords: trace, tracing, observability, span, nested, parent-child, observe
Solves:

How do I trace LLM calls across my application?
How to debug slow LLM responses?
Track execution flow in multi-agent workflows
Create nested trace spans

cost-tracking

Keywords: cost, token usage, pricing, budget, spend, expense
Solves:

How do I track LLM costs?
Calculate token usage and pricing
Monitor AI budget and spending
Track cost per user or session

prompt-management

Keywords: prompt version, prompt template, prompt control, prompt registry
Solves:

How do I version control prompts?
Manage prompts in production
A/B test different prompt versions
Link prompts to traces

llm-evaluation

Keywords: score, quality, evaluation, rating, assessment, g-eval
Solves:

How do I evaluate LLM output quality?
Score responses with custom metrics
Track quality trends over time
Compare prompt versions by quality

session-tracking

Keywords: session, user tracking, conversation, group traces
Solves:

How do I group related traces?
Track multi-turn conversations
Monitor per-user performance
Organize traces by session

langchain-integration

Keywords: langchain, callback, handler, langgraph integration
Solves:

How do I integrate Langfuse with LangChain?
Use CallbackHandler for tracing
Automatic LangGraph workflow tracing
LangChain observability setup

datasets-evaluation

Keywords: dataset, test set, evaluation dataset, benchmark
Solves:

How do I create test datasets in Langfuse?
Run automated evaluations
Regression testing for LLMs
Benchmark prompt versions

ab-testing

Keywords: a/b test, experiment, compare prompts, variant testing
Solves:

How do I A/B test prompts?
Compare two prompt versions
Experimental prompt evaluation
Statistical prompt testing
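
For comparing two prompt versions, each user should see the same variant on every request. A minimal sketch of deterministic assignment by hashing follows. `assign_variant` and the experiment naming are hypothetical and independent of how Langfuse labels prompt versions.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variants: tuple[str, ...] = ("A", "B")) -> str:
    """Deterministically map a user to one variant of an experiment."""
    # Including the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Hashing avoids storing assignments and, unlike random selection per request, keeps a user's experience stable for the duration of the experiment.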

monitoring-dashboard

Keywords: dashboard, analytics, metrics, monitoring, queries
Solves:

What are the most expensive traces?
Average cost by agent type
Quality score trends
Custom monitoring queries

orchestkit-integration

Keywords: orchestkit, migration, setup, workflow integration
Solves:

How does OrchestKit use Langfuse?
Migrate from LangSmith to Langfuse
OrchestKit workflow tracing patterns
Cost tracking per analysis

multi-judge-evaluation

Keywords: multi judge, g-eval, multiple evaluators, ensemble evaluation, weighted scoring
Solves:

How do I use multiple LLM judges to evaluate quality?
Set up G-Eval criteria evaluation
Configure weighted scoring across judges
Wire OrchestKit's existing langfuse_evaluators.py

experiments-api

Keywords: experiment, dataset, benchmark, regression test, prompt testing
Solves:

How do I run experiments across datasets?
A/B test models and prompts systematically
Track quality regression over time
Compare experiment results
