phoenix-observability
Phoenix - AI Observability Platform
Collaborating skills
- AI Engineering: skill:ai-engineering for building the LLM applications that Phoenix observes
Open-source AI observability and evaluation platform for LLM applications with tracing, evaluation, datasets, experiments, and real-time monitoring.
When to Use Phoenix
- Debugging LLM applications with detailed traces and span analysis
- Running systematic evaluations on datasets with LLM-as-judge
- Monitoring production LLM systems with real-time insights
- Building experiment pipelines for prompt/model comparison
- Self-hosted observability without vendor lock-in
Key Features
- Tracing: OpenTelemetry-based trace collection for any LLM framework
- Evaluation: LLM-as-judge evaluators for quality assessment
- Datasets: Versioned test sets for regression testing
- Experiments: Compare prompts, models, and configurations
- Open-source: Self-hosted with PostgreSQL or SQLite
Quick Start
Installation
pip install arize-phoenix
# Optional extras and companion packages
pip install 'arize-phoenix[embeddings]'  # Embedding analysis
pip install arize-phoenix-otel           # OpenTelemetry config
pip install arize-phoenix-evals          # Evaluation framework
Launch Phoenix Server
import phoenix as px
# Launch in notebook
session = px.launch_app()
# View UI
session.view() # Embedded iframe
print(session.url) # http://localhost:6006
Command-line Server
# Start Phoenix server
phoenix serve
# With PostgreSQL backend
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host/db"
phoenix serve --port 6006
Basic Tracing
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
# Configure OpenTelemetry with Phoenix
tracer_provider = register(
project_name="my-llm-app",
endpoint="http://localhost:6006/v1/traces"
)
# Instrument OpenAI SDK
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
# All OpenAI calls are now traced
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello!"}]
)
Custom Agents with Decorators
For framework-agnostic agentic systems, use @tracer.agent, @tracer.chain, and @tracer.tool decorators:
from phoenix.otel import register
# register() returns a tracer provider; tracers obtained from it
# expose the agent/chain/tool decorators
tracer_provider = register(project_name="custom-agent")
tracer = tracer_provider.get_tracer(__name__)
@tracer.agent
def my_agent(query: str) -> str:
    context = search_tool(query)
    return synthesize_tool(context, query)
@tracer.tool
def search_tool(query: str) -> list:
    return vector_store.search(query)
@tracer.tool
def synthesize_tool(context: list, query: str) -> str:
    return llm.generate(query, context)
For detailed tracing patterns, see tracing-setup.md.
Storage Backends
Phoenix supports both SQLite and PostgreSQL for persistent storage:
- SQLite: Simple, file-based storage (default, ideal for development)
- PostgreSQL: Production-ready database for scalability and concurrent access
For detailed configuration examples, see storage-backends.md.
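As a quick sketch before the full guide, both backends are selected through the PHOENIX_SQL_DATABASE_URL environment variable (the file path and credentials here are illustrative):

```python
import os

# Persist to a SQLite file instead of the default temporary database
os.environ["PHOENIX_SQL_DATABASE_URL"] = "sqlite:///phoenix.db"

# For PostgreSQL, the same variable takes a standard connection URL:
# os.environ["PHOENIX_SQL_DATABASE_URL"] = "postgresql://user:pass@host/db"

import phoenix as px

session = px.launch_app()  # picks up the database URL from the environment
```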
Docker Deployment
For containerized deployment, see docker-deployment.md for:
- Docker compose files for both SQLite and PostgreSQL
- Production-ready configuration
- Multi-container setup
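Before diving into the full guide, a minimal single-container sketch using the image from the Docker Hub link below (tag, credentials, and database URL are illustrative; 6006 serves the UI and HTTP collector, 4317 serves OTLP gRPC):

```bash
# Run Phoenix with the UI and OTLP gRPC ports exposed
docker run -d --name phoenix \
  -p 6006:6006 \
  -p 4317:4317 \
  -e PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host/db" \
  arizephoenix/phoenix:latest
```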
Tracing Setup
For comprehensive tracing setup with OpenTelemetry, see tracing-setup.md:
- Framework-agnostic decorators: @tracer.agent, @tracer.chain, @tracer.tool for custom agents
- Manual instrumentation with the OpenTelemetry API
- Automatic instrumentation for LLM frameworks
- Distributed tracing for multi-service applications (see the sketch after this list)
- Custom span attributes and context propagation
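For the distributed case, a minimal sketch using the standard OpenTelemetry propagation API (the two-service split and the incoming_headers dict are illustrative):

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

# Service A: inject the current trace context into outgoing request headers
headers: dict = {}
inject(headers)
# requests.post("http://service-b/handle", json=payload, headers=headers)

# Service B: continue the same trace from the incoming headers
def handle_request(incoming_headers: dict, payload: dict) -> None:
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("service-b-handler", context=ctx):
        ...  # spans created here join the trace started in service A
```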
Framework Integrations
Phoenix provides auto-instrumentation for many LLM frameworks. For detailed integration guides, see:
- framework-integrations.md: Complete list of supported frameworks
- DSPy, LangChain, LlamaIndex, Agno, AutoGen, CrewAI, and more
- Provider-specific integrations (OpenAI, Anthropic, Bedrock, etc.)
- Platform integrations (Dify, Flowise, LangFlow)
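Auto-instrumentation follows the same pattern as the OpenAI example above. As an illustrative sketch with LangChain (assumes the openinference-instrumentation-langchain package is installed):

```python
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register(project_name="my-llm-app")
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
# All LangChain runs (chains, agents, retrievers) are now traced in Phoenix
```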
Core Concepts
Traces and Spans
A trace represents a complete execution flow, while spans are individual operations within that trace.
from phoenix.otel import register
from opentelemetry import trace
# Setup tracing
tracer_provider = register(project_name="my-app")
tracer = trace.get_tracer(__name__)
# Create custom spans
with tracer.start_as_current_span("process_query") as span:
    span.set_attribute("input.value", query)
    # Child spans are automatically nested
    with tracer.start_as_current_span("retrieve_context"):
        context = retriever.search(query)
    with tracer.start_as_current_span("generate_response"):
        response = llm.generate(query, context)
    span.set_attribute("output.value", response)
Projects
Projects organize related traces:
import os
os.environ["PHOENIX_PROJECT_NAME"] = "production-chatbot"
# Or per-trace
from phoenix.otel import register
tracer_provider = register(project_name="experiment-v2")
Evaluation Framework
Built-in Evaluators
from phoenix.evals import (
OpenAIModel,
HallucinationEvaluator,
RelevanceEvaluator,
ToxicityEvaluator,
)
# Setup model for evaluation
eval_model = OpenAIModel(model="gpt-4o")
# Evaluate hallucination
hallucination_eval = HallucinationEvaluator(eval_model)
results = hallucination_eval.evaluate(
input="What is the capital of France?",
output="The capital of France is Paris.",
reference="Paris is the capital of France."
)
Run Evaluations on Dataset
from phoenix import Client
from phoenix.evals import HallucinationEvaluator, RelevanceEvaluator, run_evals
client = Client()
# Get spans to evaluate
spans_df = client.get_spans_dataframe(
project_name="my-app",
filter_condition="span_kind == 'LLM'"
)
# Run evaluations
eval_results = run_evals(
dataframe=spans_df,
evaluators=[
HallucinationEvaluator(eval_model),
RelevanceEvaluator(eval_model)
],
provide_explanation=True
)
# Log results back to Phoenix
client.log_evaluations(eval_results)
Client API
Query Traces and Spans
from phoenix import Client
client = Client(endpoint="http://localhost:6006")
# Get spans as DataFrame
spans_df = client.get_spans_dataframe(
project_name="my-app",
filter_condition="span_kind == 'LLM'",
limit=1000
)
# Get specific span
span = client.get_span(span_id="abc123")
# Get trace
trace = client.get_trace(trace_id="xyz789")
Log Feedback
from phoenix import Client
client = Client()
# Log user feedback
client.log_annotation(
span_id="abc123",
name="user_rating",
annotator_kind="HUMAN",
score=0.8,
label="helpful",
metadata={"comment": "Good response"}
)
Environment Variables
| Variable | Description | Default |
|---|---|---|
| PHOENIX_PORT | HTTP server port | 6006 |
| PHOENIX_HOST | Server bind address | 127.0.0.1 |
| PHOENIX_GRPC_PORT | gRPC/OTLP port | 4317 |
| PHOENIX_SQL_DATABASE_URL | Database connection | SQLite temp |
| PHOENIX_WORKING_DIR | Data storage directory | OS temp |
| PHOENIX_ENABLE_AUTH | Enable authentication | false |
| PHOENIX_SECRET | JWT signing secret | Required if auth enabled |
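Putting these together, a production-style launch might look like the following sketch (all values are illustrative):

```bash
export PHOENIX_HOST=0.0.0.0                       # listen on all interfaces
export PHOENIX_PORT=6006
export PHOENIX_ENABLE_AUTH=true
export PHOENIX_SECRET="a-long-random-secret"      # required once auth is enabled
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host/db"
phoenix serve
```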
Best Practices
- Use projects: Separate traces by environment (dev/staging/prod)
- Add metadata: Include user IDs and session IDs for debugging (see the sketch after this list)
- Evaluate regularly: Run automated evaluations in CI/CD
- Version datasets: Track test set changes over time
- Monitor costs: Track token usage via Phoenix dashboards
- Self-host: Use PostgreSQL for production deployments
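For the metadata practice above, a minimal sketch using the using_attributes context manager from the openinference-instrumentation package (the IDs are illustrative):

```python
from openinference.instrumentation import using_attributes

# Every span created inside this block carries the session and user IDs,
# which makes per-user debugging in Phoenix straightforward
with using_attributes(session_id="session-123", user_id="user-456"):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
```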
Common Issues
Traces Not Appearing
from phoenix.otel import register
# Verify endpoint
tracer_provider = register(
project_name="my-app",
endpoint="http://localhost:6006/v1/traces" # Correct endpoint
)
# Force flush
from opentelemetry import trace
trace.get_tracer_provider().force_flush()
Database Connection Issues
# Verify PostgreSQL connection
psql "$PHOENIX_SQL_DATABASE_URL" -c "SELECT 1"
# Check Phoenix logs
phoenix serve --log-level debug
Resources
- Documentation: https://docs.arize.com/phoenix
- Repository: https://github.com/Arize-ai/phoenix
- Docker Hub: https://hub.docker.com/r/arizephoenix/phoenix
- Version: 12.0.0+
- License: Apache 2.0