monitor-resources
Monitor Resources
Monitor system-level resources including AMP processors, CPU, memory, I/O, and physical capacity using real-time MCP resources to ensure optimal system performance and identify capacity constraints.
π Enhanced Capabilities
This skill now leverages real-time resource monitoring!
With tdwm-mcp v1.5.0, this skill provides:
- β REAL-TIME RESOURCE METRICS - Instant CPU/memory/I/O data without overhead
- β AMP LOAD MONITORING - Track processor utilization and skew
- β CAPACITY ANALYSIS - Understand headroom and growth runway
- β TOP CONSUMER IDENTIFICATION - Find users/queries consuming most resources
- β INTEGRATED VIEW - Resources + sessions + queries in one analysis
Instructions
When to Use This Skill
- User asks about system performance or capacity
- Need to check if system is overloaded or underutilized
- Investigating resource bottlenecks or performance degradation
- Capacity planning or health check requests
- Correlating resource usage with query performance
Available MCP Tools
Resource Monitoring:
monitor_amp_load- Track AMP processor utilization and skewmonitor_awt- View AWT task resource usageshow_top_users- Identify users consuming most resourcesmonitor_config- View system configuration parameters
Related Monitoring:
list_sessions- See sessions consuming resourcesshow_tdwm_summary- Workload distribution contextlist_resources- List available system resources
Available MCP Resources (NEW β¨)
Real-Time Resource Data:
tdwm://system/physical-resources- CPU, memory, I/O utilizationtdwm://system/amp-load- AMP processor load and skew metricstdwm://system/sessions- Session resource consumptiontdwm://system/summary- Workload resource distribution
Reference:
tdwm://reference/resource-thresholds- Healthy vs warning vs critical levelstdwm://reference/amp-skew-causes- Common causes of AMP skew
Step-by-Step Workflow
Phase 1: Quick Assessment (Use Resources First)
-
Get Real-Time Resource Overview
- Read resource:
tdwm://system/physical-resources - Instant snapshot of CPU, memory, I/O utilization
- Identify if system is under stress
- Provides capacity headroom visibility
- Read resource:
-
Check AMP Load Balance
- Read resource:
tdwm://system/amp-load - Shows AMP processor utilization
- Identifies load skew (imbalance)
- Provides per-AMP breakdown
- Read resource:
Phase 2: Detailed Analysis (Use Tools)
-
Assess Overall System Load
- Use
monitor_amp_loadfor detailed AMP processor data - Identify if system is CPU-bound or has capacity
- Look for imbalanced load across AMPs (skew factor)
- Track peak vs average utilization
- Use
-
Check Task-Level Resources
- Use
monitor_awtto view AWT task allocation - Identify if task slots are exhausted
- Check for queries waiting on resources
- Understand task concurrency limits
- Use
-
Identify Top Consumers
- Use
show_top_usersto find resource-heavy users/queries - Determine if resource usage is legitimate or problematic
- Prioritize optimization efforts
- Correlate with session and query data
- Use
-
Review System Configuration
- Use
monitor_configto verify system settings - Check resource allocation parameters
- Validate configuration against best practices
- Use
Phase 3: Correlation and Reporting
-
Correlate with Workload Activity
- Read resource:
tdwm://system/summary - Correlate resource usage with workload distribution
- Identify which workloads are consuming resources
- Understand if usage patterns are expected
- Read resource:
-
Analyze Session Resource Consumption
- Read resource:
tdwm://system/sessions - Identify individual sessions consuming excessive resources
- Correlate high-resource sessions with users/queries
- Determine if termination needed
- Read resource:
-
Report Findings
- Summarize system health: green (healthy), amber (warning), red (critical)
- Identify bottlenecks or constraints
- Provide utilization percentages and trends
- Recommend actions if issues found
Examples
Example 1: Quick System Health Check (Fast)
Scenario: "Is the system running okay right now?"
Action (Resource-First Approach):
1. Read real-time physical resources:
tdwm://system/physical-resources
β CPU: 45% utilized (healthy)
β Memory: 62% utilized (healthy)
β I/O: 38% utilized (healthy)
β Disk: 71% capacity used
2. Check AMP load balance:
tdwm://system/amp-load
β Average AMP CPU: 43%
β Max AMP CPU: 51%
β Min AMP CPU: 38%
β Skew factor: 1.13 (healthy, < 1.5)
3. Review workload distribution:
tdwm://system/summary
β 47 active queries distributed across 3 workloads
β Resources well-balanced
4. Report:
"β
System Health: HEALTHY
Resource Utilization:
- CPU: 45% (55% headroom available)
- Memory: 62% (38% headroom available)
- I/O: 38% (62% headroom available)
- AMP Load: Balanced (skew factor 1.13)
Current Activity:
- 47 active queries, well-distributed
- No resource bottlenecks detected
- System has capacity for additional workload"
Result: Complete health check in seconds, clear green status
Example 2: Investigate Performance Degradation (Critical)
Scenario: "Why are queries running slow? System feels sluggish."
Action (Resource Analysis):
1. Check physical resources:
tdwm://system/physical-resources
β CPU: 94% utilized (β οΈ CRITICAL)
β Memory: 89% utilized (β οΈ WARNING)
β I/O: 87% utilized (β οΈ WARNING)
β All metrics in red zone
2. Check AMP load:
tdwm://system/amp-load
β Average AMP CPU: 92%
β Max AMP CPU: 98%
β Min AMP CPU: 76%
β Skew factor: 1.29 (moderate skew)
3. Identify top consumers:
show_top_users()
β User: etl_batch - 42% of CPU
β User: adhoc_user01 - 23% of CPU
β User: report_user - 15% of CPU
β Top 3 users = 80% of total CPU
4. Check sessions for top consumer:
tdwm://system/sessions
β Filter for etl_batch
β 8 active sessions, all long-running (>30 min)
β All from ETL workload
5. Review workload distribution:
tdwm://system/summary
β ETL workload: 67% of resources
β Interactive workload: 23% of resources (suffering)
β Ad-hoc workload: 10% of resources
6. Check for throttle delays:
tdwm://system/delayed-queries
β 15 queries delayed
β 12 from Interactive workload (waiting for resources)
β 3 from Ad-hoc workload
7. Report:
"π΄ System Health: CRITICAL - Resource Exhaustion
Root Cause:
- System at 94% CPU capacity
- ETL workload consuming 67% of resources
- User 'etl_batch' alone using 42% of system CPU
Impact:
- Interactive queries delayed (12 queries waiting)
- Average query time increased
- System approaching capacity limit
Immediate Actions Required:
1. URGENT: Review etl_batch sessions (8 long-running queries)
β Determine if queries are stuck or legitimately long
β Consider aborting if runaway queries detected
2. SHORT-TERM: Increase throttle on ETL workload
β Limit ETL concurrent queries to protect Interactive
β Use optimize-throttles skill
3. MEDIUM-TERM: Schedule ETL jobs during off-peak hours
β Current ETL running during business hours
β Move to nighttime window if possible"
Result: Clear root cause identified, specific actions recommended with priority
Example 3: Diagnose AMP Skew Issue (Performance)
Scenario: "Queries are slow even though CPU shows only 60% utilized"
Action (AMP Skew Analysis):
1. Check overall CPU:
tdwm://system/physical-resources
β CPU: 62% average (seems okay?)
2. Check AMP load detail:
tdwm://system/amp-load
β Average AMP CPU: 62%
β Max AMP CPU: 95% (β οΈ AMP 7 maxed out!)
β Min AMP CPU: 38%
β Skew factor: 2.5 (π΄ SEVERE SKEW)
3. Analyze per-AMP breakdown:
monitor_amp_load()
β AMP 7: 95% (bottleneck!)
β AMP 3: 89%
β AMP 12: 87%
β AMPs 1,2,4-6,8-11: 45-55% (normal)
β AMPs 13-16: 38-42% (underutilized)
4. Identify queries hitting skewed AMPs:
show_top_users()
β User: report_generator - heavy queries
β Queries accessing specific tables
5. Check session SQL:
list_sessions()
β Filter for report_generator
β All queries accessing SALES_FACT table
β Large table with potential distribution issue
6. Review reference:
tdwm://reference/amp-skew-causes
β Cause #1: Poor primary index design
β Cause #2: NULL values in PI column
β Cause #3: Hot values in PI column
7. Report:
"β οΈ System Health: DEGRADED - AMP Skew Issue
Problem:
- System shows 62% average CPU (seems healthy)
- BUT AMP 7 is at 95% (bottleneck!)
- Skew factor 2.5 = SEVERE SKEW
Impact:
- Queries wait for overloaded AMP 7
- 60% of AMPs underutilized
- System effectively at 95% capacity (limited by one AMP)
Root Cause Analysis:
- Report queries all hitting SALES_FACT table
- Table data concentrated on specific AMPs
- Indicates poor primary index or data distribution
Recommended Actions:
1. IMMEDIATE: Review SALES_FACT table design
β Check primary index column
β Analyze data distribution (HELP STATISTICS)
β Look for NULL values or hot keys
2. SHORT-TERM: Consider table redistribution
β Redesign primary index for better distribution
β Test with EXPLAIN plans
3. LONG-TERM: Implement data distribution monitoring
β Regular skew checks
β Alert on skew factor > 2.0"
Result: Identified hidden bottleneck (AMP skew), specific table and recommendations
Example 4: Capacity Planning Analysis (Proactive)
Scenario: "Do we have capacity for 25% more users next quarter?"
Action (Capacity Planning):
1. Get current resource baseline:
tdwm://system/physical-resources
β CPU: 67% average (peak hours)
β Memory: 74% average
β I/O: 58% average
β Disk: 68% capacity used
2. Analyze peak utilization (tool for historical data):
monitor_amp_load()
β Peak CPU (last 30 days): 85%
β Average CPU (last 30 days): 52%
β Peak typically 10:00 AM - 3:00 PM
3. Calculate current headroom:
- CPU: 33% headroom at average, 15% at peak
- Memory: 26% headroom
- I/O: 42% headroom
- Comfortable threshold: Keep peak < 80%
4. Project 25% user increase:
- Assume linear scaling (conservative)
- Current peak: 85%
- With 25% more: 85% Γ 1.25 = 106% (π΄ OVER CAPACITY)
5. Analyze workload composition:
tdwm://system/summary
β Interactive: 45% of resources
β ETL: 35% of resources
β Ad-hoc: 20% of resources
β User increase likely impacts Interactive + Ad-hoc
6. Recalculate with workload-specific growth:
- Interactive/Ad-hoc: 65% of total, growing 25%
- ETL: 35% of total, no growth
- New peak: (65% Γ 1.25) + 35% = 81% + 35% = 116% (still over!)
7. Determine capacity gap:
- Need: 116% capacity
- Have: 100% capacity
- Gap: 16% additional capacity needed
- OR reduce peak utilization by optimizing queries
8. Report:
"π Capacity Planning Analysis: 25% User Growth
Current State:
- Average: 67% CPU, 74% Memory
- Peak: 85% CPU (10 AM - 3 PM)
- Headroom at peak: 15%
With 25% User Growth:
- Projected peak: 106-116% CPU
- Result: π΄ INSUFFICIENT CAPACITY
Capacity Gap:
- Need: 16% additional capacity
- Options to address:
Option 1: Hardware Expansion
- Add 2-3 nodes to cluster (20% capacity increase)
- Cost: ~$X (hardware + licensing)
- Timeline: 8-12 weeks
Option 2: Query Optimization + Throttling
- Optimize top 20 slow queries (target 15% improvement)
- Adjust throttle limits to smooth peak load
- Cost: Development time only
- Timeline: 4-6 weeks
- May achieve 10-15% efficiency gain
Option 3: Hybrid Approach
- Optimize queries (10% gain)
- Add 1 node (10% capacity)
- Adjust workload distribution
- Cost: Partial hardware investment
- Timeline: 6-8 weeks
RECOMMENDATION:
Start Option 2 immediately (query optimization), then reassess.
If growth is certain, plan Option 1 in parallel (long lead time)."
Result: Detailed capacity analysis with multiple options and costs
Example 5: Real-Time Resource Correlation (Advanced)
Scenario: "Why is I/O spiking even though CPU is only 50%?"
Action (Multi-Dimensional Analysis):
1. Confirm I/O spike:
tdwm://system/physical-resources
β CPU: 48% (normal)
β Memory: 65% (normal)
β I/O: 91% (π΄ CRITICAL spike!)
2. Check if AMP-related:
tdwm://system/amp-load
β AMP CPU: 45-52% across all AMPs (balanced)
β No CPU bottleneck
3. Identify I/O-heavy sessions:
list_sessions()
β Sort by I/O consumption
β Top session: analytics_user, session 5001
β 67% of total I/O from this one session!
4. Get SQL for high-I/O session:
show_session_sql_text(session_id=5001)
β Query: Full table scan on 50TB fact table
β No WHERE clause (scanning entire table!)
β No indexes being used
5. Check query band:
monitor_session_query_band()
β Session 5001: APP=AD_HOC_ANALYTICS
β Classified in Ad-hoc workload
6. Identify pattern:
list_sessions()
β Filter for AD_HOC_ANALYTICS application
β 5 similar sessions, all doing large scans
β All from same user: analytics_user
7. Calculate impact:
- 5 sessions Γ 13-15% I/O each = 65-75% total I/O
- Remainder (25-35%) from normal queries
- Ad-hoc queries blocking I/O for other workloads
8. Report:
"π΄ Resource Issue: I/O Bottleneck
Problem:
- I/O at 91% despite normal CPU (48%)
- Type: I/O-bound, not CPU-bound
Root Cause:
- User 'analytics_user' running 5 concurrent full table scans
- Each query scanning 50TB fact table
- No WHERE clauses, no indexes
- Consuming 65-75% of total system I/O
Impact:
- Other queries experiencing I/O delays
- Reports taking 3-5x longer than normal
- System I/O-bound, not CPU-bound
Immediate Actions:
1. URGENT: Abort analytics_user sessions
β Use control-sessions skill
β Free up I/O capacity immediately
2. SHORT-TERM: Contact analytics_user
β Queries need WHERE clauses
β Should use summary tables, not raw fact table
β Educate on query best practices
3. MEDIUM-TERM: Create throttle for full table scans
β Use optimize-throttles skill
β Add FTSCAN sub-criteria to Ad-hoc throttle
β Limit full scans to 2 concurrent
4. LONG-TERM: Implement query governor
β Reject queries scanning >X TB without approval
β Enforce WHERE clause requirements"
Result: Identified I/O bottleneck root cause, multi-level action plan
Best Practices
Resource-First Approach (NEW β¨)
- START with resources for instant overview (
tdwm://system/physical-resources) - Resources are faster and don't add load to system
- Use tools for detailed analysis and historical data
- Combine resources + tools + session data for complete picture
Multi-Dimensional Analysis
- Check multiple resource types - bottleneck may not be where expected
- CPU, memory, I/O, disk, AMP load, task slots - all matter
- Don't assume high CPU = performance issue (could be I/O, AMP skew, etc.)
- Correlate resources with workload and session activity
AMP Skew Understanding
- AMP skew (imbalanced load) indicates data distribution issues
- Skew factor > 1.5 = warning, > 2.0 = critical
- Average CPU can look healthy while one AMP is maxed out
- Skew caused by: poor PI design, NULL values, hot keys, small tables
Capacity Planning
- Establish baseline metrics for comparison
- Track peak vs average utilization (peak drives capacity needs)
- Monitor trends over time, not just point-in-time snapshots
- Consider both average and peak when planning capacity
- Factor in growth projections and seasonal patterns
- Keep peak utilization < 80% for headroom
Health Status Thresholds (Reference: tdwm://reference/resource-thresholds)
- Green (Healthy): < 70% average, < 80% peak
- Amber (Warning): 70-85% average, 80-90% peak
- Red (Critical): > 85% average, > 90% peak
Correlation Best Practices
- Always correlate resources with workload distribution
- Identify which workloads/users are consuming resources
- Check if usage patterns are expected for time of day
- Look for anomalies: unexpected consumers, unusual patterns
Related Skills
- Use monitor-sessions skill for session-level resource analysis
- Use monitor-queries skill to correlate queries with resource usage
- Use control-sessions skill to abort high-resource sessions
- Use optimize-throttles skill to manage resource consumption
- Use emergency-response skill for critical resource exhaustion