dt-obs-frontends

Frontend Observability Skill

Monitor web and mobile frontends using Real User Monitoring (RUM) with DQL queries. This skill targets the new RUM experience only; do not use classic RUM data.

Overview

This skill helps you:

  • Monitor Core Web Vitals and frontend performance
  • Track user sessions, engagement, and behavior
  • Analyze errors and correlate with backend traces
  • Optimize mobile app startup and stability
  • Diagnose performance issues with detailed timing analysis

Data Sources:

  • Metrics: timeseries with dt.frontend.* (trends, alerting)
  • Events: fetch user.events (individual page views, requests, clicks, errors)
  • Sessions: fetch user.sessions (session-level aggregates: duration, bounce, counts)

Quick Reference

Common Metrics

  • dt.frontend.user_action.count - User action volume
  • dt.frontend.user_action.duration - User action duration
  • dt.frontend.request.count - Request volume
  • dt.frontend.request.duration - Request latency (ms)
  • dt.frontend.error.count - Error counts
  • dt.frontend.session.active.estimated_count - Active sessions
  • dt.frontend.user.active.estimated_count - Unique users
  • dt.frontend.web.page.cumulative_layout_shift - CLS metric
  • dt.frontend.web.navigation.dom_interactive - DOM interactive time
  • dt.frontend.web.page.first_input_delay - FID metric (legacy; prefer INP)
  • dt.frontend.web.page.largest_contentful_paint - LCP metric
  • dt.frontend.web.page.interaction_to_next_paint - INP metric
  • dt.frontend.web.navigation.load_event_end - Load event end
  • dt.frontend.web.navigation.time_to_first_byte - Time to first byte
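
As a rough sketch of how these metrics might be charted, the query below plots three Core Web Vitals per frontend. The avg aggregation, the 24h window, and the 1h interval are illustrative assumptions rather than recommendations from this skill; adjust them to the metric and time range at hand.

```dql
// Sketch: Core Web Vitals trend per frontend.
// Aggregation (avg), window (24h), and interval (1h) are assumptions.
timeseries {
    lcp = avg(dt.frontend.web.page.largest_contentful_paint),
    inp = avg(dt.frontend.web.page.interaction_to_next_paint),
    cls = avg(dt.frontend.web.page.cumulative_layout_shift)
  },
  by: {frontend.name},
  from: now() - 24h,
  interval: 1h
```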

Common Filters

  • frontend.name - Filter by frontend name (e.g. my-frontend)
  • dt.rum.user_type - Exclude synthetic monitoring
  • geo.country.iso_code - Geographic filtering
  • device.type - Mobile, desktop, tablet
  • browser.name - Browser filtering

Common Timeseries Dimensions

Use these for dt.frontend.* timeseries splits and breakdowns:

  • frontend.name - Frontend name
  • geo.country.iso_code
  • device.type
  • browser.name
  • os.name
  • user_type - real_user, synthetic, robot

Example: page views by frontend:

fetch user.events, from: now() - 2h
| filter characteristics.has_page_summary == true
| summarize page_views = count(), by: {frontend.name}
| sort page_views desc

Event Characteristics

  • characteristics.has_page_summary - Page views (web)
  • characteristics.has_view_summary - Views (mobile)
  • characteristics.has_navigation - Navigation events
  • characteristics.has_user_interaction - Clicks, forms, etc.
  • characteristics.has_request - Network request events
  • characteristics.has_error - Error events
  • characteristics.has_crash - Mobile crashes
  • characteristics.has_long_task - Long JavaScript tasks
  • characteristics.has_csp_violation - CSP violations

Full event model: https://docs.dynatrace.com/docs/semantic-dictionary/model/rum/user-events

Session Data (user.sessions)

user.sessions contains session-level aggregates produced by the session aggregation service from user.events. Field names differ from user.events: session aggregate counts use underscores (for example navigation_count) where a dot might be expected, while error fields keep dot naming.

Session identity and context:

  • dt.rum.session.id — Session ID (NOT dt.rum.session_id)
  • dt.rum.instance.id — Instance ID
  • frontend.name - array of frontends involved in session
  • dt.rum.application.type - web or mobile
  • dt.rum.user_type - real_user, synthetic, or robot

Session aggregates (underscore naming — NOT dot):

| Field | Description | ⚠️ NOT this |
|---|---|---|
| navigation_count | Number of navigations | navigation.count |
| user_interaction_count | Clicks, form submissions | user_interaction.count |
| user_action_count | User actions | user_action.count |
| request_count | XHR/fetch requests | request.count |
| event_count | Total events in session | event.count |
| page_summary_count | Page views (web) | page_summary.count |
| view_summary_count | Views (mobile/SPA) | view_summary.count |

Error fields (dot naming — same as events):

  • error.count, error.exception_count, error.http_4xx_count, error.http_5xx_count
  • error.anr_count, error.csp_violation_count, error.has_crash
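
A minimal sketch of how the dot-named error fields might be used at session level, following the same pattern as the bounce-rate example in this document. The toLong() casts mirror that example and are an assumption about the stored field types; the output field names are illustrative.

```dql
// Sketch: share of real-user sessions that saw at least one error.
// toLong() casts are an assumption, mirroring the bounce-rate example.
fetch user.sessions, from: now() - 24h
| filter dt.rum.user_type == "real_user"
| summarize
    sessions = count(),
    sessions_with_errors = countIf(toLong(error.count) > 0),
    http_5xx_total = sum(toLong(error.http_5xx_count))
| fieldsAdd error_session_pct = round((sessions_with_errors * 100.0) / sessions, decimals: 1)
```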

Session lifecycle:

  • start_time, end_time, duration (nanoseconds)
  • end_reason - timeout, synthetic_execution_finished, etc.
  • characteristics.is_bounce — Boolean bounce flag
  • characteristics.has_replay — Session replay available

User identity:

  • dt.rum.user_tag — User identifier (typically email, username or customerId), set via dtrum.identifyUser() API call in the instrumented frontend. Not always populated — only present when the frontend explicitly calls identifyUser().
  • When dt.rum.user_tag is empty, dt.rum.instance.id is often the only user differentiator. The value is a random ID assigned by the RUM agent on the client side, so it is not personally identifiable but can be used to distinguish unique users when user_tag is not set. On web this is based on a persistent cookie, so it can be deleted by the user.
  • The user tag is a session-level field — query it from user.sessions, not user.events (where it may be empty even if the session has one).
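
The fallback described above could be sketched as follows: use dt.rum.user_tag when it is set, otherwise fall back to dt.rum.instance.id. This assumes an unset user tag is null rather than an empty string; user_key is a hypothetical field name introduced only for this example.

```dql
// Sketch: approximate unique users, preferring the user tag over the instance ID.
// Assumes a missing user tag is null (coalesce skips nulls, not empty strings).
fetch user.sessions, from: now() - 24h
| filter dt.rum.user_type == "real_user"
| fieldsAdd user_key = coalesce(dt.rum.user_tag, dt.rum.instance.id)
| summarize
    sessions = count(),
    unique_users = countDistinct(user_key)
```

Because the instance ID is cookie-based on web, the count is an upper-bound estimate: one person who clears cookies appears as several users.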

Client/device context:

  • browser.name, browser.version, device.type, os.name
  • geo.country.iso_code, client.ip, client.isp

Synthetic-only fields:

  • dt.entity.synthetic_test, dt.entity.synthetic_location, dt.entity.synthetic_test_step

Time window behavior:

  • fetch user.sessions, from: X, to: Y only returns sessions that started in [X, Y] — NOT sessions that were merely active during that window.
  • Sessions can last 8h+ (the aggregation service waits 30+ minutes of inactivity before closing a session).
  • To find all sessions active during a time window, extend the lookback by at least 8 hours: e.g., to cover events from the last 24h, query fetch user.sessions, from: now() - 32h.
  • This matters for correlation queries (e.g., matching user.events to user.sessions by session ID) — a narrow user.sessions window will miss long-running sessions and produce false "orphans."

Session creation delay:

  • The session aggregation service waits for ~30+ minutes of inactivity before closing a session and writing the user.sessions record.
  • This means recent events (last ~1 hour) will not yet have a matching user.sessions entry — this is normal, not a data gap.
  • When correlating user.events with user.sessions, exclude recent data (e.g., use to: now() - 1h) to avoid counting in-progress sessions as orphans.
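
Both timing rules (extended session lookback, excluding the most recent hour) can be combined in one correlation query. The following is a sketch under stated assumptions: a 24h event window ending one hour ago, a session lookback extended by a further 8h, and the default lookup. prefix for looked-up fields. Large environments may need a narrower window to stay within lookup limits.

```dql
// Sketch: count event-side session IDs that have no user.sessions record.
// Events: 24h window ending 1h ago, so in-progress sessions are excluded.
fetch user.events, from: now() - 25h, to: now() - 1h
| filter isNotNull(dt.rum.session.id)
| summarize events = count(), by: {dt.rum.session.id}
// Sessions: start 8h earlier than the event window to catch long-running sessions.
| lookup [
    fetch user.sessions, from: now() - 33h
    | fields dt.rum.session.id
  ], sourceField: dt.rum.session.id, lookupField: dt.rum.session.id
| summarize
    event_sessions = count(),
    unmatched = countIf(isNull(lookup.dt.rum.session.id))
```

Unmatched IDs from this query still include zombie sessions, which never receive a user.sessions record.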

Zombie sessions (events without a user.sessions record):

  • Not every dt.rum.session.id in user.events will have a corresponding user.sessions record. The session aggregation service intentionally skips zombie sessions — sessions with no real user activity (zero navigations and zero user interactions).
  • Zombie sessions contain only background, machine-driven activity (e.g., automatic XHR requests, heartbeats) with no page views or clicks. Serializing them would add no value to users.
  • When correlating user.events with user.sessions, expect a large number of unmatched session IDs. This is by design, not a data gap. Filter to sessions with activity before diagnosing orphans:
    fetch user.events, from: now() - 2h, to: now() - 1h
    | filter isNotNull(dt.rum.session.id)
    | summarize navs = countIf(characteristics.has_navigation == true),
        interactions = countIf(characteristics.has_user_interaction == true),
        by: {dt.rum.session.id}
    | filter navs > 0 or interactions > 0
    

Example — bounce rate and session quality:

fetch user.sessions, from: now() - 24h
| filter dt.rum.user_type == "real_user"
| summarize
    total_sessions = count(),
    bounces = countIf(characteristics.is_bounce == true),
    zero_activity = countIf(toLong(navigation_count) == 0 and toLong(user_interaction_count) == 0),
    avg_duration_s = avg(toLong(duration)) / 1000000000
| fieldsAdd bounce_rate_pct = round((bounces * 100.0) / total_sessions, decimals: 1)

Performance Thresholds

  • LCP: Good <2.5s | Poor >4.0s
  • INP: Good <200ms | Poor >500ms
  • CLS: Good <0.1 | Poor >0.25
  • Cold Start: Good <3s | Poor >5s
  • Long Tasks: >50ms problematic, >250ms severe
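
To show how a threshold pair translates into a rating, here is a sketch for CLS, which is unitless and so avoids any assumption about metric units. Values between the Good and Poor bounds are labelled "needs improvement". Core Web Vitals are conventionally assessed at the 75th percentile; the avg aggregation here is a simplifying assumption.

```dql
// Sketch: rate each frontend's CLS against the thresholds above.
// avg is a simplification; Core Web Vitals are normally judged at p75.
timeseries cls = avg(dt.frontend.web.page.cumulative_layout_shift),
  by: {frontend.name},
  from: now() - 24h,
  interval: 1h
| fieldsAdd avg_cls = arrayAvg(cls)
| fieldsAdd rating = if(avg_cls < 0.1, "good",
    else: if(avg_cls > 0.25, "poor", else: "needs improvement"))
| sort avg_cls desc
```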

Core Workflows

1. Web Performance Monitoring

Track Core Web Vitals, page performance, and request latency for SEO and UX optimization.

Primary Files:

  • references/WebVitals.md - Core Web Vitals (LCP, INP, CLS)
  • references/performance-analysis.md - Request and page performance

Common Queries:

  • All Core Web Vitals summary
  • Web Vitals by page/device
  • Request duration SLA monitoring
  • Page load performance trends

2. User Session & Behavior Analysis

Understand user engagement, navigation patterns, and session characteristics. Analyze button clicks, form interactions, and user journeys.

Data source choice:

  • Use fetch user.sessions for session-level analysis (bounce rate, session duration, session counts)
  • Use fetch user.events for event-level detail (individual clicks, navigation timing, specific pages)

Primary Files:

  • references/user-sessions.md - Session tracking and user analytics
  • references/performance-analysis.md - Navigation and engagement patterns

Common Queries:

  • Active sessions by frontend
  • Sessions by custom property
  • Bounce rate analysis (use user.sessions with characteristics.is_bounce)
  • Session quality (zero-activity sessions via navigation_count, user_interaction_count)
  • Click analysis on UI elements (use user.events with characteristics.has_user_interaction)
  • External referrers (traffic sources)

3. Error Tracking & Debugging

Monitor error rates, analyze exceptions, and correlate frontend issues with backend.

Primary Files:

  • references/error-tracking.md - Error analysis and debugging
  • references/performance-analysis.md - Trace correlation

Common Queries:

  • Error rate monitoring
  • JavaScript exceptions by type
  • Failed requests with backend traces
  • Request timing breakdown
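
A minimal starting point for error analysis might look like the query below, which ranks pages by error event count. It assumes page.url.path is populated on error events, as it is on the request and long-task events queried elsewhere in this document.

```dql
// Sketch: pages with the most error events in the last 2 hours.
// Assumes page.url.path is present on error events.
fetch user.events, from: now() - 2h
| filter characteristics.has_error == true
| summarize errors = count(), by: {frontend.name, page.url.path}
| sort errors desc
| limit 20
```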

4. Mobile Frontend Monitoring

Track mobile app performance, startup times, and crash analytics for iOS and Android. Analyze app version performance and device-specific issues.

Primary Files:

  • references/mobile-monitoring.md - App starts, crashes, and mobile-specific metrics

Common Queries:

  • Cold start performance by app version (iOS, Android)
  • Warm start and hot start metrics
  • Crash rate by device model and OS version
  • ANR events (Android)
  • Native crash signals
  • App version comparison
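
As one possible sketch of a crash overview, the query below uses only session-level fields listed earlier in this document (dt.rum.application.type, error.has_crash, os.name). It assumes error.has_crash is a boolean and that mobile sessions carry the application type value "mobile".

```dql
// Sketch: share of mobile sessions that ended in a crash, per OS.
// Assumes error.has_crash is boolean and application type is "mobile".
fetch user.sessions, from: now() - 24h
| filter dt.rum.application.type == "mobile"
| summarize
    sessions = count(),
    crashed_sessions = countIf(error.has_crash == true),
    by: {os.name}
| fieldsAdd crash_session_pct = round((crashed_sessions * 100.0) / sessions, decimals: 1)
| sort crash_session_pct desc
```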

5. Advanced Performance Optimization

Deep performance diagnostics including JavaScript profiling, main thread blocking, UI jank analysis, and geographic performance.

Primary Files:

  • references/performance-analysis.md - Advanced diagnostics and long tasks

Common Queries:

  • Long JavaScript tasks blocking main thread
  • UI jank and rendering delays
  • Tasks >50ms impacting responsiveness
  • Third-party long tasks (iframes)
  • Single-page app performance issues
  • Geographic performance distribution
  • Performance degradation detection

Best Practices

  1. Use metrics for trends, events for debugging

    • Metrics: Timeseries dashboards, alerting, capacity planning
    • Events: Root cause analysis, detailed diagnostics
  2. Filter by frontend in multi-app environments

    • Always use frontend.name for clarity
  3. Match interval to time range

    • 5m intervals for hours, 1h for days, 1d for weeks
  4. Exclude synthetic traffic when analyzing real users

    • Filter dt.rum.user_type to focus on genuine behavior
  5. Combine metrics with events for complete insights

    • Start with metric trends, drill into events for details
  6. Extend user.sessions time window for correlation queries

    • user.sessions only returns sessions that started in the query window
    • Sessions can last 8h+, so extend lookback by at least 8h when joining with user.events

Slow Page Load Playbook

Start by segmenting the problem by page, browser, geolocation, and dt.rum.user_type.

Heuristics:

  • High TTFB -> slow backend
  • High LCP with normal TTFB -> render bottleneck
  • High CLS -> layout shifts (late-loading content, ads, fonts)
  • Long tasks dominate -> JavaScript execution bottlenecks (heavy frameworks, large bundles)

Backend latency (high TTFB)

fetch user.events
| filter frontend.name == "my-frontend" and characteristics.has_request == true
| filter page.url.path == "/checkout"
| summarize avg_ttfb = avg(request.time_to_first_byte), avg_duration = avg(duration)

If TTFB is high, analyze backend spans by correlating frontend events with backend traces using dt.rum.trace_id.
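
One way to pick candidates for that correlation is to list the slowest requests together with their trace IDs. This sketch reuses the fields from the query above; the frontend name and the limit of 20 are placeholders.

```dql
// Sketch: slowest requests by TTFB, with the trace ID needed for backend lookup.
fetch user.events, from: now() - 2h
| filter frontend.name == "my-frontend" and characteristics.has_request == true
| filter isNotNull(dt.rum.trace_id)
| fields page.url.path, request.time_to_first_byte, duration, dt.rum.trace_id
| sort request.time_to_first_byte desc
| limit 20
```

The returned dt.rum.trace_id values can then be used to find the matching backend traces.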

Heavy JavaScript execution (long tasks)

Long tasks by page:

fetch user.events, from: now() - 2h
| filter characteristics.has_long_task == true
| summarize
   long_task_count = count(),
   total_blocking_time = sum(duration),
   by: {frontend.name, page.url.path}
| sort total_blocking_time desc
| limit 20

Long tasks by script source:

fetch user.events, from: now() - 2h
| filter frontend.name == "my-frontend"
| filter characteristics.has_long_task == true
| summarize
   long_task_count = count(),
   total_blocking_time = sum(duration),
   by: {long_task.attribution.container_src}
| sort total_blocking_time desc
| limit 20

Large JavaScript bundles

fetch user.events
| filter frontend.name == "my-frontend"
| filter characteristics.has_request == true
| filter endsWith(url.full, ".js")
| summarize dls = max(performance.decoded_body_size), by: {url.full}
| sort dls desc
| limit 20

Large resources

fetch user.events
| filter frontend.name == "my-frontend"
| filter characteristics.has_request == true
| summarize dls = max(performance.decoded_body_size), by: {url.full}
| sort dls desc
| limit 20

Cache effectiveness

fetch user.events, from: now() - 2h
| filter frontend.name == "my-frontend"
| filter characteristics.has_request == true
| fieldsAdd cache_status = if(
   performance.incomplete_reason == "local_cache" or
   (performance.transfer_size == 0 and
    (performance.encoded_body_size > 0 or performance.decoded_body_size > 0)),
   "cached",
   else: if(performance.transfer_size > 0, "network", else: "uncached")
  )
| summarize
   request_count = count(),
   avg_duration = avg(duration),
   by: {url.domain, cache_status}

Compression waste

fetch user.events, from: now() - 2h
| filter characteristics.has_request == true
| filter isNotNull(performance.encoded_body_size) and isNotNull(performance.decoded_body_size)
| filter performance.encoded_body_size > 0
| fieldsAdd
   expansion_ratio = toDouble(performance.decoded_body_size) / performance.encoded_body_size,
   wasted_bytes = performance.decoded_body_size - performance.encoded_body_size
| summarize
   requests = count(),
   avg_expansion_ratio = avg(expansion_ratio),
   total_wasted_bytes = sum(wasted_bytes),
   by: {request.url.host, request.url.path}
| sort total_wasted_bytes desc
| limit 50

Network issues

Compare by location and domain when TTFB is high but backend performance is good:

fetch user.events, from: now() - 2h
| filter characteristics.has_request == true
| summarize
   request_count = count(),
   avg_duration = avg(duration),
   p75_duration = percentile(duration, 75),
   p95_duration = percentile(duration, 95),
   by: {geo.country.iso_code, request.url.domain}
| sort p95_duration desc
| limit 50

Analyze DNS time:

fetch user.events, from: now() - 2h
| filter characteristics.has_request == true
| filter isNotNull(performance.domain_lookup_start) and isNotNull(performance.domain_lookup_end)
| fieldsAdd dns_ms = performance.domain_lookup_end - performance.domain_lookup_start
| summarize
   request_count = count(),
   avg_dns_ms = avg(dns_ms),
   p75_dns_ms = percentile(dns_ms, 75),
   p95_dns_ms = percentile(dns_ms, 95),
   by: {request.url.domain}
| sort p95_dns_ms desc
| limit 50

Analyze by protocol (http/1.1, h2, h3):

fetch user.events
| filter characteristics.has_request == true
| summarize cnt = count(), by: {url.domain, performance.next_hop_protocol}
| sort cnt desc
| limit 50

Third-party dependencies

Analyze request performance by domain:

fetch user.events, from: now() - 2h
| filter characteristics.has_request == true
| summarize
   request_count = count(),
   avg_duration = avg(duration),
   p75_duration = percentile(duration, 75),
   p95_duration = percentile(duration, 95),
   by: {request.url.domain}
| sort p95_duration desc
| limit 50

Troubleshooting

Handling Zero Results

When queries return no data, follow this diagnostic workflow:

  1. Validate Timeframe

    • Check if timeframe is appropriate for the data type
    • RUM data may have delay (1-2 minutes for recent events)
    • Verify timeframe syntax: from: now() - 1h, to: now() or similar
    • Try expanding timeframe: from: now() - 24h for initial exploration
  2. Verify Frontend Configuration

    • Confirm frontend is instrumented and sending RUM data
    • Check frontend.name filter is correct
    • Test without frontend filter to see if any RUM data exists
    • Verify frontend name matches the environment
  3. Check Data Availability

    • Run basic query: fetch user.events | limit 1
    • If no events exist, RUM may not be configured
    • Check if timeframe predates frontend deployment
    • Verify user has access to the environment
  4. Review Query Syntax

    • Validate filters aren't too restrictive
    • Check for typos in field names or metric names
    • Test query incrementally: start simple, add filters gradually
    • Verify characteristics filters match event types
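
The "start broad" steps above can be sketched as a single query that shows whether any RUM events exist and under which frontend names. The 24h window is an assumption for initial exploration.

```dql
// Sketch: which frontends have sent any events in the last 24 hours?
// An empty result suggests RUM is not configured or access is missing.
fetch user.events, from: now() - 24h
| summarize events = count(), by: {frontend.name}
| sort events desc
```

If the expected frontend appears in the output, add the original filters back one at a time to find the one that removes all rows.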

When to Ask User for Clarification:

  • No RUM data exists in environment → "Is RUM configured for this frontend?"
  • Timeframe unclear → "What time period should I analyze?"
  • Expected data missing → "Has this frontend sent data recently?"

Handling Anomalous Results

When query results seem unexpected or suspicious:

Unexpected High Values:

  • Metric spikes: Verify interval aggregation (avg vs. max vs. sum)
  • Session counts: Check for bot traffic or synthetic monitoring
  • Error rates: Confirm error definition matches expectations
  • Performance degradation: Look for deployment or infrastructure changes

Unexpected Low Values:

  • Missing sessions: Verify dt.rum.user_type filter isn't excluding real users
  • Low request counts: Check if frontend filter is too narrow
  • Few errors: Confirm error characteristics filter is correct
  • Missing mobile data: Verify platform-specific fields exist

Inconsistent Data:

  • Metrics vs. Events mismatch: Some difference is expected because metrics and events are aggregated differently
  • Geographic anomalies: Check timezone assumptions
  • Device distribution skew: May reflect actual user base
  • Version mismatches: Verify app version filtering logic

Decision Tree: Ask vs. Investigate

Query returns unexpected results
├─ Is this a zero-result scenario?
│  ├─ YES → Follow "Handling Zero Results" workflow
│  └─ NO → Continue
├─ Can I validate the result independently?
│  ├─ YES → Run validation query
│  │        ├─ Validation confirms result → Report findings
│  │        └─ Validation contradicts → Investigate further
│  └─ NO → Continue
├─ Is the anomaly clearly explained by data?
│  ├─ YES → Report with explanation
│  └─ NO → Continue
├─ Do I need domain knowledge to interpret?
│  ├─ YES → Ask user for context
│  │        Example: "The error rate is 15%. Is this expected for your frontend?"
│  └─ NO → Continue
└─ Is the issue ambiguous, or does it require clarification?
   ├─ YES → Ask specific question with data context
   │        Example: "I see two frontends named 'web-app'. Which frontend name should I use?"
   └─ NO → Investigate and report findings with caveats

Common Investigation Steps

For Performance Issues:

  1. Compare to baseline: Query same metric for previous week
  2. Segment by dimension: Break down by device, browser, geography
  3. Check for outliers: Use percentiles (p50, p95, p99) vs. averages
  4. Correlate with deployments: Filter by app version or time windows

For Data Availability Issues:

  1. Start broad: Query all RUM data without filters
  2. Add filters incrementally: Isolate which filter eliminates data
  3. Check related metrics: If events missing, try timeseries
  4. Validate entity relationships: Confirm frontend-to-service links

For Unexpected Patterns:

  1. Expand timeframe: Look for historical context
  2. Cross-reference data sources: Compare events and metrics
  3. Check sampling: Verify no sampling is affecting results
  4. Consider external factors: Holidays, outages, traffic changes

Red Flags: When to Stop and Ask

Always ask the user when:

  • ❌ No RUM data exists anywhere in the environment
  • ❌ Multiple frontends match the user's description
  • ❌ Results explicitly contradict the user's stated expectations
  • ❌ Data suggests monitoring is misconfigured
  • ❌ Query requires business context (e.g., "acceptable error rate")
  • ❌ Timeframe is ambiguous and affects interpretation significantly

Example clarifying questions:

  • "I found two frontends matching 'checkout'. Which one: checkout-web or checkout-mobile?"
  • "The query returns 0 results for the past hour. Should I expand the timeframe, or do you expect real-time data?"
  • "The average LCP is 8 seconds, which exceeds the 4-second threshold. Is this frontend known to have performance issues?"
  • "I see only synthetic traffic. Should I filter on dt.rum.user_type == "real_user" to focus on real users?"

When to Use This Skill

Use this skill when:

  • Monitoring web or mobile frontend performance
  • Analyzing Core Web Vitals for SEO
  • Tracking user sessions, engagement, or behavior
  • Analyzing click events and button interactions
  • Debugging frontend errors or slow requests
  • Correlating frontend issues with backend traces
  • Optimizing mobile app startup or crash rates (iOS, Android)
  • Analyzing app version performance
  • Diagnosing UI jank and main thread blocking
  • Analyzing security compliance (CSP violations)
  • Profiling JavaScript performance (long tasks)

Do NOT use for:

  • Backend service monitoring (use services skill)
  • Infrastructure metrics (use infrastructure skill)
  • Log analysis (use logs skill)
  • Business process monitoring (use business-events skill)

Progressive Disclosure

Always Available

  • FrontendBasics.md - RUM fundamentals and quick reference

Loaded by Workflow

  • Web Performance: WebVitals.md, performance-analysis.md
  • User Behavior: user-sessions.md, performance-analysis.md
  • Error Analysis: error-tracking.md, performance-analysis.md
  • Mobile Apps: mobile-monitoring.md

Load on Explicit Request

  • Advanced diagnostics (long tasks, user actions)
  • Security compliance (CSP violations, visibility tracking)
  • Specialized mobile features (platform-specific phases)

Reference Files

Core Reference Documents

  • references/WebVitals.md - Core Web Vitals monitoring
  • references/user-sessions.md - Session and user analytics
  • references/error-tracking.md - Error analysis and debugging
  • references/mobile-monitoring.md - Mobile app performance and crashes
  • references/performance-analysis.md - Advanced performance diagnostics