app-observability
Grafana Cloud Application Observability Skill
Overview
Grafana Cloud provides three tightly related application monitoring products:
- Application Observability (APM) - RED metrics from OTel traces, service inventory, service maps
- Frontend Observability - RUM/Faro SDK for browser apps, session replay, web vitals
- AI Observability - LLM/model monitoring via OpenLIT + OTel, token/cost/latency metrics
All three integrate with Grafana Tempo (traces), Loki (logs), and Pyroscope (profiles) for full-stack correlation.
Application Observability (APM)
What It Is
Application Observability is a pre-built APM experience in Grafana Cloud built on top of OpenTelemetry. It generates RED (Rate, Error, Duration) metrics from distributed traces via span metrics, then surfaces them in:
- Service Inventory - table of all services with RED metrics at a glance
- Service Overview - per-service RED metrics, top operations, error breakdown
- Service Map - node graph of service dependencies with flow visualization
- Operations view - per-endpoint RED metrics with p50/p95/p99 latency
How Metrics Are Generated
Application Observability does NOT rely on traditional Prometheus scraping. Metrics come from span metrics - aggregations computed from OTel trace data:
- Source: OTel traces sent to Grafana Tempo or Grafana Alloy
- Generation method: Tempo's metrics-generator OR the spanmetrics connector in Alloy/OTel Collector
- Result: Prometheus-compatible metrics stored in Grafana Mimir
Key generated metric names:
- Via Tempo metrics-generator: traces_spanmetrics_calls_total, traces_spanmetrics_duration_seconds
- Via OTel Collector spanmetrics connector: traces_span_metrics_calls_total, traces_span_metrics_duration_seconds
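A quick way to confirm span metrics are flowing is to run RED-style PromQL against them. The sketch below queries the Grafana Cloud Prometheus (Mimir) HTTP API with Python requests; the endpoint URL, credentials, service name, and the status_code label are placeholders/assumptions, and the metric names follow the Tempo metrics-generator naming above:
import requests

# Placeholders: your Grafana Cloud Prometheus (Mimir) query endpoint and credentials.
PROM_URL = "https://prometheus-prod-<region>.grafana.net/api/prom/api/v1/query"
AUTH = ("<prometheus-instance-id>", "<api-token>")

# RED queries built from the span metrics described above.
queries = {
    # Rate: requests per second for one service
    "rate": 'sum(rate(traces_spanmetrics_calls_total{service_name="my-api"}[5m]))',
    # Errors: fraction of calls whose span status is error (label name is an assumption)
    "error_ratio": (
        'sum(rate(traces_spanmetrics_calls_total{service_name="my-api",status_code="STATUS_CODE_ERROR"}[5m]))'
        ' / sum(rate(traces_spanmetrics_calls_total{service_name="my-api"}[5m]))'
    ),
    # Duration: p95 latency from the duration histogram buckets
    "p95_seconds": (
        'histogram_quantile(0.95, sum by (le) '
        '(rate(traces_spanmetrics_duration_seconds_bucket{service_name="my-api"}[5m])))'
    ),
}

for name, promql in queries.items():
    resp = requests.get(PROM_URL, params={"query": promql}, auth=AUTH, timeout=10)
    resp.raise_for_status()
    print(name, resp.json()["data"]["result"])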
Required OTel Resource Attributes
These attributes MUST be present on all spans for Application Observability to work:
| Attribute | Grafana Label | Purpose |
|---|---|---|
| service.name | service_name / part of job | Identifies the service |
| service.namespace | part of job label | Groups services; job = namespace/service.name |
| deployment.environment | deployment_environment | Env filter (prod/dev/staging) |
The job label is constructed as:
- service.namespace/service.name when a namespace is set
- service.name alone when no namespace is set
Additional recommended attributes:
- service.version - shown in the service overview
- k8s.cluster.name - for Kubernetes environments
- k8s.namespace.name - Kubernetes namespace
- cloud.region - for multi-region setups
Setting Environment Variables for OTel SDK
export OTEL_SERVICE_NAME="my-api"
export OTEL_RESOURCE_ATTRIBUTES="service.namespace=myteam,deployment.environment=production,service.version=1.2.3"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
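The same attributes can also be set in code instead of environment variables. A minimal sketch with the OpenTelemetry Python SDK (requires the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc packages; the service names and the local Alloy endpoint are illustrative):
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Resource attributes required/recommended by Application Observability.
resource = Resource.create({
    "service.name": "my-api",
    "service.namespace": "myteam",
    "deployment.environment": "production",
    "service.version": "1.2.3",
})

provider = TracerProvider(resource=resource)
# Send spans to a local Alloy/Collector listening on the default OTLP gRPC port.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-api")
with tracer.start_as_current_span("startup-check"):
    pass  # spans emitted here carry the resource attributes above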
Grafana Alloy Configuration (River syntax)
Alloy acts as a local OTel Collector and forwards data to Grafana Cloud:
// Receive traces, metrics, logs from instrumented apps
otelcol.receiver.otlp "default" {
grpc {
endpoint = "0.0.0.0:4317"
}
http {
endpoint = "0.0.0.0:4318"
}
output {
metrics = [otelcol.processor.resourcedetection.default.input]
logs = [otelcol.processor.resourcedetection.default.input]
traces = [otelcol.processor.resourcedetection.default.input]
}
}
// Auto-detect host/cloud metadata
otelcol.processor.resourcedetection "default" {
detectors = ["env", "system", "gcp", "ec2", "azure"]
output {
metrics = [otelcol.processor.batch.default.input]
logs = [otelcol.processor.batch.default.input]
traces = [otelcol.processor.batch.default.input]
}
}
// Batch for efficiency
otelcol.processor.batch "default" {
output {
metrics = [otelcol.exporter.otlphttp.grafana_cloud.input]
logs = [otelcol.exporter.otlphttp.grafana_cloud.input]
traces = [otelcol.exporter.otlphttp.grafana_cloud.input]
}
}
// Auth
otelcol.auth.basic "grafana_cloud" {
username = env("GRAFANA_CLOUD_INSTANCE_ID")
password = env("GRAFANA_CLOUD_API_KEY")
}
// Export to Grafana Cloud OTLP endpoint
otelcol.exporter.otlphttp "grafana_cloud" {
client {
endpoint = env("GRAFANA_CLOUD_OTLP_ENDPOINT")
auth = otelcol.auth.basic.grafana_cloud.handler
}
}
Required environment variables for Alloy:
GRAFANA_CLOUD_OTLP_ENDPOINT=https://otlp-gateway-<region>.grafana.net/otlp
GRAFANA_CLOUD_INSTANCE_ID=<your-instance-id>
GRAFANA_CLOUD_API_KEY=<your-api-key>
Service Map
The Service Map uses Tempo's metrics-generator to produce service graph metrics:
- Node graph shows services as nodes, HTTP/gRPC calls as edges
- Edge thickness indicates request rate; color indicates error rate
- Clicking a node navigates to Service Overview
- Requires span.kind (CLIENT/SERVER) on spans for directional edges (a manual sketch follows the list below)
Enable in Tempo (managed by Grafana Cloud automatically):
- The service-graphs metrics generator is enabled by default in Grafana Cloud Tempo
- Uses the traces_service_graph_request_total and traces_service_graph_request_failed_total metrics
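Auto-instrumentation for HTTP/gRPC clients and servers sets span.kind for you; when creating spans by hand, set it explicitly so the service graph gets directional edges. A minimal sketch with the OpenTelemetry Python API (the placeholder bodies stand in for real request code, and spans are only exported if an SDK tracer provider is configured as shown earlier):
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("checkout-service")

def call_inventory_service():
    return {"status": "ok"}  # placeholder for the real outbound HTTP call

# Calling side: a CLIENT span wraps the outgoing request.
with tracer.start_as_current_span("GET /inventory", kind=SpanKind.CLIENT):
    call_inventory_service()

# Receiving side (in the other service): a SERVER span wraps request handling.
with tracer.start_as_current_span("GET /inventory", kind=SpanKind.SERVER):
    pass  # placeholder for the real handler logic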
Integration with Traces, Logs, Profiles
Application Observability provides one-click correlation:
- Traces: Click any metric spike to open exemplar traces in Grafana Tempo
- Logs: Service logs shown in Service Overview; correlated via the service.name label
- Profiles: "Go to profiles" button in Service Overview when Pyroscope is configured
- Frontend: Link from Application Observability to Frontend Observability for the same service
Frontend Observability (Faro)
What It Is
Grafana Faro is an open-source JavaScript/TypeScript SDK for Real User Monitoring (RUM). It instruments browser applications to capture:
- Web vitals: Core Web Vitals (LCP, CLS, INP) and additional performance metrics
- Errors: Unhandled exceptions, rejected promises with stack traces
- Sessions: User journeys, page views, navigation timing
- Logs: Custom log messages from frontend code
- Traces: Distributed traces via OpenTelemetry-JS (correlates with backend spans)
- Session replay: rrweb-based DOM recording for reproducing user issues
Data flows: Faro SDK -> Grafana Alloy (faro receiver) OR the Grafana Cloud Faro collector endpoint -> Loki (logs) + Tempo (traces) + Mimir (metrics)
Faro SDK Packages
@grafana/faro-core # Core SDK - signals, transports, API
@grafana/faro-web-sdk # Web instrumentations + transports
@grafana/faro-web-tracing # OpenTelemetry-JS distributed tracing
@grafana/faro-react # React-specific integrations (error boundary, router)
Basic JavaScript Setup (npm)
npm install @grafana/faro-web-sdk
# or
yarn add @grafana/faro-web-sdk
import {
initializeFaro,
getWebInstrumentations,
} from '@grafana/faro-web-sdk';
const faro = initializeFaro({
url: 'https://faro-collector-prod-<region>.grafana.net/collect/<app-key>',
app: {
name: 'my-frontend-app',
version: '1.0.0',
environment: 'production',
},
instrumentations: [
...getWebInstrumentations({
captureConsole: true,
}),
],
});
// Manual API usage
faro.api.pushLog(['User clicked checkout button']);
faro.api.pushError(new Error('Payment failed'));
faro.api.pushEvent('button_click', { button: 'checkout' });
CDN Setup (no bundler)
<script src="https://unpkg.com/@grafana/faro-web-sdk@latest/dist/library/faro-web-sdk.iife.js"></script>
<script>
const { initializeFaro, getWebInstrumentations } = GrafanaFaroWebSdk;
initializeFaro({
url: 'https://faro-collector-prod-<region>.grafana.net/collect/<app-key>',
app: { name: 'my-app', version: '1.0.0' },
instrumentations: [...getWebInstrumentations()],
});
</script>
React Setup with Tracing
npm install @grafana/faro-react @grafana/faro-web-tracing
import { initializeFaro, getWebInstrumentations } from '@grafana/faro-web-sdk';
import { TracingInstrumentation } from '@grafana/faro-web-tracing';
import {
createReactRouterV6DataOptions,
ReactIntegration,
withFaroRouterInstrumentation,
} from '@grafana/faro-react';
import { createBrowserRouter, RouterProvider, matchRoutes } from 'react-router-dom';
const faro = initializeFaro({
url: 'https://faro-collector-prod-<region>.grafana.net/collect/<app-key>',
app: {
name: 'my-react-app',
version: '1.0.0',
environment: 'production',
},
instrumentations: [
...getWebInstrumentations({ captureConsole: true }),
new TracingInstrumentation(),
new ReactIntegration({
router: createReactRouterV6DataOptions({ matchRoutes }),
}),
],
});
const router = withFaroRouterInstrumentation(
createBrowserRouter([
{ path: '/', element: <Home /> },
{ path: '/about', element: <About /> },
])
);
function App() {
return <RouterProvider router={router} />;
}
Session Configuration
initializeFaro({
url: '...',
app: { name: 'my-app' },
sessionTracking: {
enabled: true,
persistent: true,
maxSessionPersistenceTime: 4 * 60 * 60 * 1000, // 4 hours in ms
samplingRate: 1, // 1 = 100%, 0.5 = 50% of sessions
onSessionChange: (oldSession, newSession) => {
console.log('Session changed', newSession.id);
},
},
instrumentations: [...getWebInstrumentations()],
});
Getting the Collector URL
- In Grafana Cloud, go to Connections (left menu) > search "Frontend Observability"
- Click the Frontend Observability card
- Navigate to Web SDK Configuration tab
- Copy the url value - this is your unique collector endpoint
- Paste it into your initializeFaro({ url: '...' }) call
What Faro Captures Automatically
When using getWebInstrumentations():
- Page views and navigation timing
- Core Web Vitals (LCP, CLS, and INP, which replaces FID as of Faro v2)
- JavaScript errors and unhandled rejections
- Console errors/warnings (when captureConsole: true)
- Resource loading performance
- User interactions (clicks, form events)
- Fetch/XHR request timing
Correlation with Backend Traces
When TracingInstrumentation is included, Faro:
- Injects traceparent/tracestate headers into outgoing fetch/XHR requests (a backend-side sketch follows this list)
- Creates spans for each HTTP call
- Links browser session to backend traces in Tempo
- Enables "Frontend to Backend" trace waterfall in Grafana
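On the backend, OTel auto-instrumentation for most web frameworks consumes these headers automatically. The sketch below shows the underlying mechanism explicitly with the OpenTelemetry Python propagation API (the handler, span name, and header value are illustrative):
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("my-api")

def handle_checkout(headers: dict) -> str:
    # Extract the W3C trace context that Faro injected into the fetch/XHR request.
    ctx = extract(headers)
    # A span started in this context becomes a child of the browser span, so the
    # frontend and backend show up in one trace waterfall in Tempo.
    with tracer.start_as_current_span("POST /checkout", context=ctx, kind=trace.SpanKind.SERVER):
        return "ok"

# Illustrative header as sent by a Faro-instrumented fetch call.
handle_checkout({"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"})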
AI Observability
What It Is
AI Observability monitors generative AI and LLM applications in production. Built on OTel GenAI semantic conventions and the OpenLIT instrumentation library.
Monitors:
- LLM API calls (OpenAI, Anthropic, Cohere, Google, etc.)
- Vector databases (Pinecone, Weaviate, Chroma, etc.)
- AI frameworks (LangChain, CrewAI, LlamaIndex)
- Model Context Protocol (MCP) servers
- GPU utilization
- AI evaluation quality (hallucination, toxicity, bias)
Key Metrics (OTel GenAI Semantic Conventions)
| Metric | Description |
|---|---|
| gen_ai_usage_input_tokens_total | Total input/prompt tokens consumed |
| gen_ai_usage_output_tokens_total | Total output/completion tokens consumed |
| gen_ai_usage_cost_USD_sum | Total cost in USD |
| gen_ai_client_operation_duration | Latency per LLM call (histogram) |
| gen_ai_client_token_usage | Token usage histogram |
Trace spans capture:
- Model name (gen_ai.request.model)
- Temperature and top_p parameters
- Full prompts and completions (configurable)
- Provider (gen_ai.system: openai, anthropic, etc.)
- Time to first token (TTFT)
Python Setup with OpenLIT
pip install openlit openai anthropic cohere
import openlit
import openai
# One-line initialization - auto-instruments all supported LLM libraries
openlit.init()
# Or initialize with optional parameters instead
openlit.init(
application_name="my-ai-app",
environment="production",
)
# Your existing code works unchanged - OpenLIT intercepts all LLM calls
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Hello!"}]
)
OTel Environment Variables
export OTEL_SERVICE_NAME="my-ai-app"
export OTEL_DEPLOYMENT_ENVIRONMENT="production"
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-<region>.grafana.net/otlp"
# Base64 encode "instanceID:apiToken"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <base64-encoded-instanceid:apitoken>"
To get the credentials:
- In Grafana Cloud, go to My Account > Stack > OpenTelemetry
- Generate a token and copy the OTLP endpoint
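The header value is just the Base64 encoding of <instance-id>:<token>. A small sketch of building and exporting it (the instance ID and token values are placeholders):
import base64
import os

instance_id = "123456"       # placeholder: your Grafana Cloud instance ID
api_token = "glc_xxxxxxxx"   # placeholder: the token generated in the UI

encoded = base64.b64encode(f"{instance_id}:{api_token}".encode()).decode()
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {encoded}"
print(os.environ["OTEL_EXPORTER_OTLP_HEADERS"])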
AI Evaluations and Guards
import os
import openlit
# Hallucination detection
evals = openlit.evals.Hallucination(
provider="openai",
api_key=os.getenv("OPENAI_API_KEY")
)
result = evals.measure(
prompt=user_message,
contexts=["Your knowledge base content here"],
text=llm_answer
)
# Content safety guard
guard = openlit.guard.All(
provider="openai",
api_key=os.getenv("OPENAI_API_KEY")
)
guard.detect(text=user_message)
Prebuilt Dashboards
Once metrics arrive, Grafana Cloud auto-populates five dashboards:
- GenAI Observability - request rates, latency percentiles, costs
- GenAI Evaluations - hallucination, bias, toxicity scores
- Vector Database Observability - query latency, index ops
- MCP Observability - tool call rates, errors
- GPU Monitoring - utilization, memory, temperature
Setup Path
- In Grafana Cloud: Connections > search "AI Observability" > click the card
- Follow the UI wizard to get your OTLP endpoint and API key
- Set the environment variables
- pip install openlit and call openlit.init() at app startup
- Deploy - dashboards populate automatically within minutes
Full-Stack Correlation Summary
| Signal | Product | Storage | Query Language |
|---|---|---|---|
| Metrics (RED) | App Observability | Mimir | PromQL |
| Traces | Tempo | Tempo | TraceQL |
| Logs | Loki | Loki | LogQL |
| Profiles | Pyroscope | Pyroscope | - |
| Browser RUM | Faro/Frontend Obs | Loki + Tempo | - |
| LLM metrics | AI Observability | Mimir | PromQL |
Correlation keys:
- service.name / service_name links all signals for a service
- Trace exemplars embed trace IDs in metric data points (RED metrics -> traces)
- traceID in logs enables log-to-trace correlation (see the logging sketch below)
- profileID / time range enables trace-to-profile correlation
- Faro injects traceparent headers to link browser sessions to backend traces
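For the log-to-trace hop, the backend must attach the active trace ID to each log line. A minimal sketch with the OpenTelemetry Python API and the standard logging module, assuming a tracer provider is already configured as in the earlier SDK sketch (the trace_id field name is a common convention, not a requirement):
import logging
from opentelemetry import trace

logging.basicConfig(format="%(message)s trace_id=%(trace_id)s", level=logging.INFO)
logger = logging.getLogger("my-api")
tracer = trace.get_tracer("my-api")

with tracer.start_as_current_span("checkout"):
    span_ctx = trace.get_current_span().get_span_context()
    # Format the 128-bit trace ID the way Tempo expects it (32 hex characters).
    trace_id = format(span_ctx.trace_id, "032x")
    logger.info("payment accepted", extra={"trace_id": trace_id})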
Common Tasks
Find Why a Service Has High Latency
- App Observability > Service Inventory > click service
- In Service Overview: check p95/p99 latency trend in Operations panel
- Click a high-latency operation > "View traces" to open exemplar traces in Tempo
- In Tempo trace: use "Go to profiles" to see CPU profile at that time
- Check correlated logs in the Logs panel of Service Overview
Debug a Frontend Error
- Frontend Observability > Errors panel > click error
- View stack trace, browser, OS, session info
- Click "View session replay" to see what the user did
- Check the correlated backend trace if TracingInstrumentation is configured
Monitor LLM Cost Drift
- AI Observability dashboard > GenAI Observability
- Use the gen_ai_usage_cost_USD_sum metric to see cost by model/provider (see the sketch after this list)
- Set an alert on a cost threshold or token usage spike
- Drill into traces to see which prompts are consuming the most tokens
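As a sketch of the underlying check, the query below sums spend over the last 24 hours and compares it to a budget; the Prometheus endpoint, credentials, and budget value are placeholders, and any by (...) grouping should use whatever model/provider labels your gen_ai_usage_cost_USD_sum series actually carry:
import requests

PROM_URL = "https://prometheus-prod-<region>.grafana.net/api/prom/api/v1/query"  # placeholder
AUTH = ("<prometheus-instance-id>", "<api-token>")  # placeholder
DAILY_BUDGET_USD = 50.0  # placeholder budget

# Total LLM spend over the last 24 hours, from the cost metric documented above.
resp = requests.get(
    PROM_URL,
    params={"query": "sum(increase(gen_ai_usage_cost_USD_sum[24h]))"},
    auth=AUTH,
    timeout=10,
)
resp.raise_for_status()
result = resp.json()["data"]["result"]
total = float(result[0]["value"][1]) if result else 0.0

if total > DAILY_BUDGET_USD:
    print(f"ALERT: LLM spend {total:.2f} USD in the last 24h exceeds budget {DAILY_BUDGET_USD} USD")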
References
- App Observability docs: https://grafana.com/docs/grafana-cloud/monitor-applications/application-observability/
- Frontend Observability docs: https://grafana.com/docs/grafana-cloud/monitor-applications/frontend-observability/
- Faro Web SDK GitHub: https://github.com/grafana/faro-web-sdk
- AI Observability docs: https://grafana.com/docs/grafana-cloud/monitor-applications/ai-observability/
- Alloy for App Observability: https://grafana.com/docs/opentelemetry/collector/grafana-alloy/
- OpenLIT: https://openlit.io/