dt-obs-hosts
Infrastructure Hosts Skill
Monitor and manage host and process infrastructure including CPU, memory, disk, network, and technology inventory.
What This Skill Does
- Discover and inventory hosts across cloud and on-premise environments
- Monitor host resource utilization (CPU, memory, disk, network)
- Track process resource consumption and lifecycle
- Analyze container and Kubernetes infrastructure
- Discover services via listening ports
- Manage technology stack versions and compliance
- Attribute infrastructure costs by cost center and product
- Validate data quality and metadata completeness
- Plan capacity and detect resource saturation
- Correlate infrastructure health across layers
When to Use This Skill
Use this skill when the user needs to:
- Inventory: "Show me all Linux hosts in AWS us-east-1"
- Monitor: "What hosts have high CPU usage?"
- Troubleshoot: "Which processes are consuming the most memory?"
- Discover: "What databases are running in production?"
- Plan: "Track Kubernetes version distribution for upgrade planning"
- Cost: "Calculate infrastructure costs by cost center"
- Security: "Find all processes listening on port 22"
- Compliance: "Identify hosts running EOL Java versions"
- Quality: "Check data completeness for AWS hosts"
- Optimize: "Find rightsizing candidates based on utilization"
Core Concepts
Entities
- HOST - Physical or virtual machines (cloud or on-premise)
- PROCESS - Running processes and process groups
- CONTAINER - Kubernetes containers
- NETWORK_INTERFACE - Host network interfaces
- DISK - Host disk volumes
Metrics Categories
- Host Metrics - dt.host.cpu.*, dt.host.memory.*, dt.host.disk.*, dt.host.net.*
- Process Metrics - dt.process.cpu.*, dt.process.memory.*, dt.process.io.*, dt.process.network.*
- Inventory - OS type, cloud provider, technology stack, versions
- Cost - dt.cost.costcenter, dt.cost.product
- Quality - Metadata completeness, version compliance
Alert Thresholds
- CPU/Memory/Disk: 80% warning, 90% critical
- Network: >70% high, >85% saturated
- Disk Latency: >20ms bottleneck
- Network Errors: Drop rate >1%, error rate >0.1%
- Swap: >30% warning, >50% critical
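The CPU thresholds above can be applied directly in a query. The following is a minimal sketch, assuming the dt.host.cpu.usage metric used elsewhere in this skill and DQL's if() function with its else: parameter; the field names cpu_avg and severity are illustrative only.

```dql
// Classify hosts by average CPU usage against the 80% / 90% thresholds
timeseries cpu = avg(dt.host.cpu.usage), by: {dt.smartscape.host}
| fieldsAdd cpu_avg = arrayAvg(cpu)
| fieldsAdd severity = if(cpu_avg >= 90, "critical",
    else: if(cpu_avg >= 80, "warning", else: "ok"))
// Keep only hosts that breach a threshold
| filter severity != "ok"
| sort cpu_avg desc
```

The same shape should carry over to memory or disk by swapping the metric; the network and swap thresholds would need their own cut-off values.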
Key Workflows
1. Host Discovery and Classification
Discover hosts, classify by OS/cloud, inventory resources.
smartscapeNodes "HOST"
| fieldsAdd os.type, cloud.provider, host.logical.cpu.cores, host.physical.memory
| summarize host_count = count(), by: {os.type, cloud.provider}
| sort host_count desc
OS Types: LINUX, WINDOWS, AIX, SOLARIS, ZOS
→ For cloud-specific attributes, see references/inventory-discovery.md
2. Resource Utilization Monitoring
Monitor CPU, memory, disk, network across hosts.
timeseries {
  cpu = avg(dt.host.cpu.usage),
  memory = avg(dt.host.memory.usage),
  disk = avg(dt.host.disk.used.percent)
}, by: {dt.smartscape.host}
| fieldsAdd host_name = getNodeName(dt.smartscape.host)
| filter arrayAvg(cpu) > 80 or arrayAvg(memory) > 80
| sort arrayAvg(cpu) desc
High utilization threshold: 80% warning, 90% critical
→ For detailed CPU analysis and memory breakdown, see references/host-metrics.md
3. Process Resource Analysis
Identify top resource consumers at process level.
timeseries {
  cpu = avg(dt.process.cpu.usage),
  memory = avg(dt.process.memory.usage)
}, by: {dt.smartscape.process}
| fieldsAdd process_name = getNodeName(dt.smartscape.process)
| filter arrayAvg(cpu) > 50
| sort arrayAvg(cpu) desc
| limit 20
→ For process I/O analysis and process network metrics, see references/process-monitoring.md
4. Technology Stack Inventory
Discover and track software technologies and versions.
smartscapeNodes "PROCESS"
| fieldsAdd process.software_technologies
| expand tech = process.software_technologies
| fieldsAdd tech_type = tech[type], tech_version = tech[version]
| summarize process_count = count(), by: {tech_type, tech_version}
| sort process_count desc
Common Technologies: Java, Node.js, Python, .NET, databases, web servers, messaging systems
→ For version compliance checks, see references/inventory-discovery.md
5. Service Discovery via Ports
Map listening ports to services for security and inventory.
smartscapeNodes "PROCESS"
| fieldsAdd process.listen_ports, dt.process_group.detected_name
| filter isNotNull(process.listen_ports) and arraySize(process.listen_ports) > 0
| expand port = process.listen_ports
| summarize process_count = count(), by: {port, dt.process_group.detected_name}
| sort toLong(port) asc
| limit 50
Well-known ports: 80 (HTTP), 443 (HTTPS), 22 (SSH), 3306 (MySQL), 5432 (PostgreSQL)
→ For comprehensive port mapping, see references/inventory-discovery.md
6. Container and Kubernetes Monitoring
Track container distribution and K8s workload types.
smartscapeNodes "CONTAINER"
| fieldsAdd k8s.cluster.name, k8s.namespace.name, k8s.workload.kind
| summarize container_count = count(), by: {k8s.cluster.name, k8s.workload.kind}
| sort k8s.cluster.name, container_count desc
Workload Types: deployment, daemonset, statefulset, job, cronjob
Note: Container image names/versions NOT available in smartscape.
→ For K8s version tracking and container lifecycle, see references/container-monitoring.md
7. Cost Attribution and Chargeback
Calculate infrastructure costs by cost center.
smartscapeNodes "HOST"
| fieldsAdd dt.cost.costcenter, host.logical.cpu.cores, host.physical.memory
| filter isNotNull(dt.cost.costcenter)
| fieldsAdd memory_gb = toDouble(host.physical.memory) / 1024 / 1024 / 1024
| summarize
    host_count = count(),
    total_cores = sum(toLong(host.logical.cpu.cores)),
    total_memory_gb = sum(memory_gb),
    by: {dt.cost.costcenter}
| sort total_cores desc
→ For product-level cost tracking, see references/inventory-discovery.md
8. Infrastructure Health Correlation
Correlate host and process metrics for cross-layer analysis.
timeseries {
  host_cpu = avg(dt.host.cpu.usage),
  host_memory = avg(dt.host.memory.usage),
  process_cpu = avg(dt.process.cpu.usage)
}, by: {dt.smartscape.host, dt.smartscape.process}
| fieldsAdd
    host_name = getNodeName(dt.smartscape.host),
    process_name = getNodeName(dt.smartscape.process)
| filter arrayAvg(host_cpu) > 70
| sort arrayAvg(host_cpu) desc
Health scoring: Critical if any resource >90%, warning if >80%
→ For multi-resource saturation detection, see references/host-metrics.md
Common Query Patterns
Pattern 1: Smartscape Discovery
Use smartscapeNodes to discover and classify entities.
smartscapeNodes "HOST"
| fieldsAdd <attributes>
| filter <conditions>
| summarize <aggregations>
Pattern 2: Timeseries Performance
Use timeseries to analyze metrics over time.
timeseries metric = avg(dt.host.<metric>), by: {dt.smartscape.host}
| fieldsAdd <calculations>
| filter <thresholds>
Pattern 3: Cross-Layer Correlation
Correlate host and process metrics.
timeseries {
  host_cpu = avg(dt.host.cpu.usage),
  process_cpu = avg(dt.process.cpu.usage)
}, by: {dt.smartscape.host, dt.smartscape.process}
Pattern 4: Entity Enrichment with Lookup
Enrich data with entity attributes. After the lookup, reference the joined fields with the lookup. prefix.
timeseries cpu = avg(dt.host.cpu.usage), by: {dt.smartscape.host}
| lookup [
    smartscapeNodes "HOST"
    | fields id, cpuCores, memoryTotal
  ], sourceField:dt.smartscape.host, lookupField:id
| fieldsAdd cores = lookup.cpuCores, mem_gb = lookup.memoryTotal / 1024 / 1024 / 1024
Tags and Metadata
Important Notes
- Generic tags field is NOT populated in smartscape queries
- Use specific tag fields: tags:azure[*], tags:environment
- Use custom metadata: host.custom.metadata[*]
Available Tags
- Azure Tags: tags:azure[dt_owner_team], tags:azure[dt_cloudcost_capability]
- Environment: tags:environment
- Custom Metadata: host.custom.metadata[OperatorVersion], host.custom.metadata[Cluster]
- Cost: dt.cost.costcenter, dt.cost.product
→ For complete tag reference, see references/inventory-discovery.md
Cloud-Specific Attributes
AWS
- cloud.provider == "aws"
- aws.region, aws.availability_zone, aws.account.id
- aws.resource.id, aws.resource.name
- aws.state (running, stopped, terminated)
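As an illustration, the inventory request "Show me all Linux hosts in AWS us-east-1" could be sketched with these attributes as follows. This assumes the attribute names listed here are populated on HOST nodes in your environment and that the region value matches the literal "us-east-1".

```dql
// Linux hosts in AWS us-east-1, grouped by availability zone and state
smartscapeNodes "HOST"
| fieldsAdd os.type, cloud.provider, aws.region, aws.availability_zone, aws.state
| filter cloud.provider == "aws"
    and aws.region == "us-east-1"
    and os.type == "LINUX"
| summarize host_count = count(), by: {aws.availability_zone, aws.state}
| sort host_count desc
```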
Azure
- cloud.provider == "azure"
- azure.location, azure.subscription, azure.resource.group
- azure.status, azure.provisioning_state
- azure.resource.sku.name (VM size)
Kubernetes
- k8s.cluster.name, k8s.cluster.uid
- k8s.namespace.name, k8s.node.name, k8s.pod.name
- k8s.workload.name, k8s.workload.kind
→ For multi-cloud analysis, see references/inventory-discovery.md
Best Practices
Alerting
- Use percentiles (p95, p99) for latency metrics
- Use max() for resource limits
- Use avg() for utilization trends
- Set multi-level thresholds (warning at 80%, critical at 90%)
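The difference between max() and avg() is easiest to see when both are queried together. A minimal sketch, assuming the dt.host.cpu.usage metric and the arrayAvg() and arrayMax() array functions; the field names trend and peak are illustrative:

```dql
// Compare sustained load (avg) with the worst observed value (max) per host
timeseries {
  cpu_avg = avg(dt.host.cpu.usage),
  cpu_max = max(dt.host.cpu.usage)
}, by: {dt.smartscape.host}
| fieldsAdd trend = arrayAvg(cpu_avg), peak = arrayMax(cpu_max)
// Warning on sustained utilization, critical on peaks
| filter trend > 80 or peak > 90
| sort peak desc
```

A host with a low trend but a high peak is likely bursting rather than saturated, which usually calls for a different response than sustained high utilization.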
Time Windows
- Real-time: 5-15 minute windows
- Trends: 24 hours to 7 days
- Capacity planning: 30-90 days
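These windows map onto the timeframe and resolution of a timeseries query. The sketch below assumes the from: and interval: parameters of the timeseries command; the interval values are illustrative choices rather than recommendations from this skill. The two queries are meant to be run separately.

```dql
// Real-time view: last 15 minutes at 1-minute resolution
timeseries cpu = avg(dt.host.cpu.usage), by: {dt.smartscape.host},
  from: now()-15m, interval: 1m

// Capacity planning: last 30 days at 1-day resolution
timeseries cpu = avg(dt.host.cpu.usage), by: {dt.smartscape.host},
  from: now()-30d, interval: 1d
```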
Query Optimization
- Use filters early in the pipeline
- Limit results with | limit N
- Use specific entity types in smartscapeNodes
- Aggregate before enrichment (lookup)
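A short sketch of how these points combine in a single pipeline, using only attributes that appear elsewhere in this skill (the limit value is arbitrary):

```dql
// Specific entity type, early filter, aggregate, then cap the output
smartscapeNodes "HOST"
| fieldsAdd os.type, cloud.provider
// Filter before summarizing so later steps process fewer records
| filter os.type == "LINUX"
| summarize host_count = count(), by: {cloud.provider}
| sort host_count desc
| limit 10
```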
Data Quality
- Validate metadata completeness (target >90%)
- Check for duplicate host names
- Ensure cost tag coverage
- Monitor data freshness (lifetime.end)
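Cost tag coverage can be checked against the >90% target with a single aggregation. This is a hedged sketch, assuming the countIf() aggregation function and the dt.cost.costcenter attribute; the field names total, tagged, and coverage_pct are illustrative. If no hosts are returned, the division has no meaningful result.

```dql
// Share of hosts that carry a cost center
smartscapeNodes "HOST"
| fieldsAdd dt.cost.costcenter
| summarize
    total = count(),
    tagged = countIf(isNotNull(dt.cost.costcenter))
| fieldsAdd coverage_pct = round(100.0 * tagged / total, decimals: 1)
```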
Limitations and Notes
Smartscape Limitations
- Container image names/versions NOT available in smartscape
- Generic tags field NOT populated (use specific tag namespaces)
- Process metadata varies by process type
Platform-Specific
- dt.host.cpu.iowait available on Linux only
- AIX has specific CPU metrics (entitlement, physc)
- Inode metrics available on Linux only
Best Practices
- Use getNodeName() to get human-readable names
- Convert bytes to GB for readability: / 1024 / 1024 / 1024
- Round aggregated values: round(value, decimals: 1)
- Use isNotNull() checks before array operations
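Several of these conventions can be applied together. A minimal sketch, assuming the host.physical.memory attribute is reported in bytes, as the cost attribution workflow implies; memory_gb is an illustrative field name:

```dql
// Largest hosts by physical memory, in GB rounded to one decimal
smartscapeNodes "HOST"
| fieldsAdd host.physical.memory
// Guard against hosts that do not report memory
| filter isNotNull(host.physical.memory)
| fieldsAdd memory_gb = round(toDouble(host.physical.memory) / 1024 / 1024 / 1024, decimals: 1)
| sort memory_gb desc
| limit 20
```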
When to Load References
This skill uses progressive disclosure. Start here for 80% of use cases. Load reference files for detailed specifications when needed.
Load host-metrics.md when:
- Analyzing CPU component breakdown (user, system, iowait, steal)
- Investigating memory pressure and swap usage
- Troubleshooting disk I/O latency
- Diagnosing network packet drops or errors
Load process-monitoring.md when:
- Analyzing process-level I/O patterns
- Investigating TCP connection quality
- Detecting resource exhaustion (file descriptors, threads)
- Tracking GC suspension time
Load container-monitoring.md when:
- Analyzing container lifecycle and churn
- Tracking Kubernetes version distribution
- Managing OneAgent operator versions
- Planning K8s cluster upgrades
Load inventory-discovery.md when:
- Performing security audits via port discovery
- Implementing cost attribution and chargeback
- Validating data quality and metadata completeness
- Managing multi-cloud infrastructure
References
- host-metrics.md - Detailed host CPU, memory, disk, and network monitoring
- process-monitoring.md - Process-level CPU, memory, I/O, and network analysis
- container-monitoring.md - Container inventory, Kubernetes versions, and operator management
- inventory-discovery.md - Host/process discovery, technology inventory, cost attribution, and data quality