Proxmox VE Operations Expertise

AI Agent Skill: Operational knowledge for managing Proxmox Virtual Environment infrastructure

Overview

This skill provides AI agents with operational expertise for Proxmox VE, covering:

VM and LXC lifecycle management - From creation to decommissioning
Storage operations - Configuration, content management, backup strategies
High Availability - HA groups, resource management, failover
Cluster operations - Multi-node management, migration, replication
Certificate management - Installation, renewal, ACME integration
ACME configuration - Provider setup, certificate ordering, automation
Notifications - Target configuration, delivery verification, alerting
Troubleshooting - Common issues, API quirks, resolution patterns
Security - Permission models, API token best practices
Performance - Monitoring, resource optimization

Target audience: AI agents performing day-to-day Proxmox operations, infrastructure automation, or incident response.

Architecture Overview

Proxmox VE Cluster Concepts

Node: Physical server running Proxmox VE

Hosts VMs and LXC containers
Provides local storage
Participates in cluster quorum

Storage: Shared or local storage backends

Types: Directory, LVM, ZFS, Ceph, NFS, iSCSI
Content types: Images, ISOs, backups, templates
Can be node-local or cluster-shared

Networking: Virtual networking infrastructure

Linux bridges for VM/LXC connectivity
VLANs for network segmentation
SDN (Software Defined Networking) for advanced scenarios

Cluster: Group of nodes working together

Shared configuration via pmxcfs
HA for automatic failover
Live migration between nodes

Operations Playbook

1. VM Lifecycle Management

Create → Configure → Monitor → Backup → Delete

Creation

1. Get next available VMID
2. Create VM with basic config (CPU, memory, OS type)
3. Add disk(s) from storage
4. Configure network interface(s)
5. Set boot order
6. Start VM

Key considerations:

Choose appropriate storage for disk (performance vs capacity)
Use virtio drivers for best performance (requires guest support)
Configure QEMU guest agent for better management
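
The creation steps above map onto a small number of Proxmox VE API calls: the next free VMID comes from GET /cluster/nextid, and the VM itself is created with POST /nodes/{node}/qemu. The following is a minimal sketch that only assembles the creation request and does not send it. The helper name build_vm_create_request and the default values (storage local-lvm, bridge vmbr0, sizes) are illustrative assumptions, not part of this skill.

```python
def build_vm_create_request(node, vmid, name, cores=2, memory_mb=2048,
                            storage="local-lvm", disk_gb=32, bridge="vmbr0"):
    """Return (method, path, params) for creating a VM.

    Hypothetical helper: it builds the request only; sending it is left
    to whatever HTTP client the agent already uses.
    """
    params = {
        "vmid": vmid,
        "name": name,
        "cores": cores,
        "memory": memory_mb,                # MiB
        "ostype": "l26",                    # modern Linux guest
        "scsihw": "virtio-scsi-pci",        # virtio SCSI controller
        "scsi0": f"{storage}:{disk_gb}",    # allocate a new disk, size in GiB
        "net0": f"virtio,bridge={bridge}",  # virtio NIC on the given bridge
        "boot": "order=scsi0",              # boot from the new disk
        "agent": 1,                         # enable the QEMU guest agent option
    }
    return "POST", f"/nodes/{node}/qemu", params


method, path, params = build_vm_create_request("pve1", 100, "web01")
print(method, path, params["scsi0"], params["net0"])
```

Starting the VM is a separate call (POST /nodes/{node}/qemu/{vmid}/status/start) once the creation task has finished.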

Configuration

1. Review current config
2. Resize resources (CPU, memory, disk) as needed
3. Add/remove network interfaces
4. Configure firewall rules
5. Set up snapshots for rollback capability

Best practices:

Snapshot before major changes
Use cloud-init for automated provisioning
Enable QEMU guest agent for graceful operations
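
"Snapshot before major changes" can be made routine by generating the snapshot request automatically. A sketch under stated assumptions: the helper name build_snapshot_request and the "pre-" timestamp naming scheme are hypothetical conventions; the endpoint is the VM snapshot endpoint, POST /nodes/{node}/qemu/{vmid}/snapshot.

```python
from datetime import datetime, timezone


def build_snapshot_request(node, vmid, reason, now=None):
    """Return (method, path, params) for a pre-change snapshot.

    The snapshot name starts with a letter and uses only letters,
    digits and hyphens, which keeps it a valid snapshot identifier.
    """
    now = now or datetime.now(timezone.utc)
    snapname = "pre-" + now.strftime("%Y%m%d-%H%M%S")
    params = {"snapname": snapname, "description": reason}
    return "POST", f"/nodes/{node}/qemu/{vmid}/snapshot", params


fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(build_snapshot_request("pve1", 100, "kernel upgrade", now=fixed))
```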

Monitoring

1. Check VM status (running, stopped, paused)
2. Monitor resource usage (CPU, memory, disk I/O)
3. Review task history for recent operations
4. Check logs for errors or warnings

Metrics to watch:

CPU usage and steal time
Memory pressure and swap usage
Disk I/O wait times
Network throughput
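
A simple way to act on these metrics is to compare a VM status record against thresholds. The sketch below assumes a record shaped like the response of GET /nodes/{node}/qemu/{vmid}/status/current, where cpu is a fraction between 0 and 1 and mem/maxmem are byte counts. The function name check_vm_status and the 90% thresholds are illustrative assumptions.

```python
def check_vm_status(status, cpu_warn=0.90, mem_warn=0.90):
    """Return a list of warning strings for one VM status record."""
    warnings = []
    if status.get("status") != "running":
        warnings.append(f"VM is {status.get('status', 'unknown')}")
        return warnings
    if status.get("cpu", 0.0) >= cpu_warn:
        warnings.append(f"CPU usage {status['cpu']:.0%}")
    maxmem = status.get("maxmem", 0)
    if maxmem and status.get("mem", 0) / maxmem >= mem_warn:
        warnings.append(f"memory usage {status['mem'] / maxmem:.0%}")
    return warnings


sample = {"status": "running", "cpu": 0.95,
          "mem": 3 * 1024**3, "maxmem": 4 * 1024**3}
print(check_vm_status(sample))
```

Steal time, swap usage and I/O wait are guest-side or node-side metrics and are not part of this record; they need a separate data source.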

Backup

1. Create snapshot for quick rollback
2. Schedule backup job (vzdump)
3. Verify backup completed successfully
4. Test restore periodically
5. Prune old backups to manage space

Backup strategies:

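Snapshot mode: Minimal downtime; VMs are backed up live by QEMU, while containers need snapshot-capable storage
Suspend mode: Suspends the guest during the backup, causing downtime
Stop mode: Shuts the guest down for the most consistent backup, at the cost of downtime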
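
The backup mode is a parameter of the vzdump call. A minimal sketch that validates the mode and builds a request for POST /nodes/{node}/vzdump; the helper name build_vzdump_request and the zstd default are assumptions chosen for illustration.

```python
VALID_MODES = ("snapshot", "suspend", "stop")


def build_vzdump_request(node, vmid, storage, mode="snapshot", compress="zstd"):
    """Return (method, path, params) for a one-off backup of one guest."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    params = {
        "vmid": str(vmid),     # vzdump takes the guest ID(s) as a string
        "storage": storage,    # target storage that allows backup content
        "mode": mode,
        "compress": compress,
    }
    return "POST", f"/nodes/{node}/vzdump", params


print(build_vzdump_request("pve1", 100, "backup-nfs", mode="stop"))
```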

Decommissioning

1. Create final backup
2. Stop VM gracefully
3. Remove from HA if configured
4. Delete VM and associated disks
5. Clean up firewall rules
6. Update documentation
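
Because the order of these steps matters, it can help to express decommissioning as an explicit plan of API requests. The sketch below follows the order given above; the function name decommission_plan and the 120-second shutdown timeout are assumptions. Each request starts an asynchronous task, so a real caller would wait for one task to finish before issuing the next.

```python
def decommission_plan(node, vmid, backup_storage, in_ha=False):
    """Return the ordered list of (method, path, params) steps."""
    plan = [
        # 1. Final backup
        ("POST", f"/nodes/{node}/vzdump",
         {"vmid": str(vmid), "storage": backup_storage, "mode": "snapshot"}),
        # 2. Graceful shutdown
        ("POST", f"/nodes/{node}/qemu/{vmid}/status/shutdown", {"timeout": 120}),
    ]
    if in_ha:
        # 3. Remove the HA resource, if one exists
        plan.append(("DELETE", f"/cluster/ha/resources/vm:{vmid}", {}))
    # 4. Delete the VM; purge also removes it from backup and replication jobs
    plan.append(("DELETE", f"/nodes/{node}/qemu/{vmid}", {"purge": 1}))
    return plan


for step in decommission_plan("pve1", 100, "backup-nfs", in_ha=True):
    print(step)
```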

2. LXC Container Management

Containers vs VMs:

Lighter weight (shared kernel)
Faster startup times
Lower overhead
Less isolation than VMs

Container Operations

1. Create from template
2. Configure resources (CPU, memory, swap)
3. Add mount points for storage
4. Configure network
5. Start container
6. Access via console or SSH

Key differences from VMs:

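Use rootfs plus mp0, mp1, ... for mount points (VMs use disk keys such as scsi0 or virtio0)
No BIOS/UEFI configuration
Privileged containers map container root to host root; unprivileged containers map it to an unprivileged host user and are the safer choice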
Faster snapshot/restore operations
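
To illustrate the container-specific keys, here is a sketch of a creation request for POST /nodes/{node}/lxc. The helper name build_lxc_create_request, the storage and bridge names, and the /srv/data mount path are hypothetical; the template volume ID must be supplied by the caller.

```python
def build_lxc_create_request(node, vmid, hostname, ostemplate,
                             storage="local-lvm", rootfs_gb=8,
                             data_gb=None, bridge="vmbr0"):
    """Return (method, path, params) for creating an unprivileged container."""
    params = {
        "vmid": vmid,
        "hostname": hostname,
        "ostemplate": ostemplate,              # volume ID of a container template
        "cores": 1,
        "memory": 512,                         # MiB
        "swap": 512,                           # MiB
        "rootfs": f"{storage}:{rootfs_gb}",    # allocate root volume, size in GiB
        "net0": f"name=eth0,bridge={bridge},ip=dhcp",
        "unprivileged": 1,
    }
    if data_gb is not None:
        # Extra mount point: new volume mounted at /srv/data inside the container
        params["mp0"] = f"{storage}:{data_gb},mp=/srv/data"
    return "POST", f"/nodes/{node}/lxc", params


print(build_lxc_create_request("pve1", 200, "ct01",
                               "local:vztmpl/example-template.tar.zst",
                               data_gb=16))
```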

3. Storage Management

Storage Configuration

1. List available storage
2. Add new storage backend (NFS, Ceph, etc.)
3. Configure content types (images, backups, ISOs)
4. Set storage as default for specific content
5. Monitor storage usage
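
Monitoring storage usage usually comes down to comparing used and total bytes per storage. The sketch below assumes records shaped like the entries returned by GET /nodes/{node}/storage (fields storage, total, used, active); the function name and the 80% threshold are illustrative assumptions.

```python
def storages_over_threshold(storages, threshold=0.80):
    """Return [(storage_id, used_fraction)] for active storages at or above threshold."""
    flagged = []
    for s in storages:
        total = s.get("total", 0)
        if not s.get("active") or total <= 0:
            continue  # skip inactive storages and ones without size data
        fraction = s.get("used", 0) / total
        if fraction >= threshold:
            flagged.append((s["storage"], round(fraction, 2)))
    # Fullest storage first
    return sorted(flagged, key=lambda item: item[1], reverse=True)


sample = [
    {"storage": "local", "total": 100, "used": 85, "active": 1},
    {"storage": "nfs-backup", "total": 200, "used": 50, "active": 1},
    {"storage": "offline", "total": 100, "used": 99, "active": 0},
]
print(storages_over_threshold(sample))
```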

Content Management

1. Upload ISOs/templates to storage
2. Download from URL to storage
3. List storage content
4. Delete unused content
5. Restore files from backups

Backup Management

1. Create backup jobs (manual or scheduled)
2. Configure retention policy
3. Prune old backups automatically
4. Restore from backup
5. Verify backup integrity

Backup best practices:

Use compression to save space
Store backups on separate storage
Test restore procedures regularly
Document backup schedules
Monitor backup job success/failure
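
Retention policies are expressed through the prune-backups option, a comma-separated list of keep-* settings. A small sketch that builds such a string from keyword arguments; the helper name prune_backups_option is hypothetical and the counts in the example are arbitrary.

```python
KEEP_KEYS = ("keep-last", "keep-hourly", "keep-daily",
             "keep-weekly", "keep-monthly", "keep-yearly")


def prune_backups_option(**keep):
    """Build a prune-backups value, e.g. keep_last=3 -> 'keep-last=3'."""
    allowed = {key.replace("-", "_") for key in KEEP_KEYS}
    unknown = set(keep) - allowed
    if unknown:
        raise ValueError(f"unknown retention keys: {sorted(unknown)}")
    parts = []
    for key in KEEP_KEYS:  # fixed order keeps the output stable
        count = keep.get(key.replace("-", "_"))
        if count:
            parts.append(f"{key}={count}")
    if not parts:
        raise ValueError("at least one keep-* value is required")
    return ",".join(parts)


print(prune_backups_option(keep_daily=7, keep_last=3, keep_weekly=4))
```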

4. High Availability (HA)

HA Configuration

1. Create HA group (define node priorities)
2. Add VM/LXC to HA management
3. Configure HA settings (max relocate, max restart)
4. Monitor HA status
5. Test failover scenarios
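
Adding a guest to HA management is a single request against /cluster/ha/resources. The following sketch builds that request; the helper name build_ha_resource_request is an assumption, and group assignment is left out because how groups are configured depends on the Proxmox VE version in use.

```python
def build_ha_resource_request(vmid, kind="vm", max_restart=1,
                              max_relocate=1, state="started"):
    """Return (method, path, params) for adding a guest to HA management."""
    if kind not in ("vm", "ct"):
        raise ValueError("kind must be 'vm' or 'ct'")
    if state not in ("started", "stopped", "disabled", "ignored"):
        raise ValueError(f"unsupported requested state: {state!r}")
    params = {
        "sid": f"{kind}:{vmid}",       # HA service ID, e.g. vm:100 or ct:200
        "state": state,                # requested state
        "max_restart": max_restart,    # restart attempts on the same node
        "max_relocate": max_relocate,  # relocation attempts to other nodes
    }
    return "POST", "/cluster/ha/resources", params


print(build_ha_resource_request(100, max_restart=2))
```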

HA States:

started: Resource running on assigned node
stopped: Resource intentionally stopped
fence: Waiting for the failed node to be fenced before the resource is recovered elsewhere
error: HA manager encountered an error

When to use HA:

Critical services requiring high uptime
Automatic failover needed
Cluster has 3+ nodes (for quorum)
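
The 3-node minimum follows from majority quorum arithmetic. A short sketch, assuming one vote per node and no external quorum device:

```python
def quorum_votes(total_nodes):
    """Votes needed for a majority when every node has one vote."""
    return total_nodes // 2 + 1


def tolerated_failures(total_nodes):
    """How many nodes can fail while the rest still hold quorum."""
    return total_nodes - quorum_votes(total_nodes)


for n in (2, 3, 5):
    print(n, "nodes: quorum", quorum_votes(n),
          "- tolerates", tolerated_failures(n), "failure(s)")
```

A 2-node cluster needs both votes, so losing either node loses quorum; 3 nodes is the smallest size that survives a single failure.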

5. Migration

Live Migration (Online)

1. Verify target node has resources
2. Check shared storage access
3. Initiate migration
4. Monitor migration progress
5. Verify VM running on new node

Requirements:

Shared storage for VM disks (local disks can also be migrated live, but they are copied over the network, which takes longer)
Network connectivity between nodes
Compatible CPU types (or CPU flags masked)
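
A migration request for a VM can be sketched as follows. The helper name build_migrate_request is hypothetical; the parameters target, online and with-local-disks belong to POST /nodes/{node}/qemu/{vmid}/migrate.

```python
def build_migrate_request(source_node, vmid, target_node,
                          online=True, local_disks=False):
    """Return (method, path, params) for migrating a VM to another node."""
    if source_node == target_node:
        raise ValueError("target node must differ from source node")
    params = {"target": target_node, "online": int(online)}
    if local_disks:
        # Needed when the VM has disks on node-local storage
        params["with-local-disks"] = 1
    return "POST", f"/nodes/{source_node}/qemu/{vmid}/migrate", params


print(build_migrate_request("pve1", 100, "pve2"))
```

Setting online=False on a stopped VM yields the offline variant of the same call.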

Offline Migration

1. Stop VM/LXC
2. Migrate to target node
3. Start on new node

Use cases:

No shared storage available
Maintenance on source node
CPU incompatibility

Troubleshooting Guide

Common Issues

1. VM Won't Start

Symptoms: Start operation fails or VM immediately stops

Causes:

Insufficient resources on node
Storage unavailable
VM locked by an in-progress or interrupted operation (lock entry in the VM config)
Configuration error

Resolution:

1. Check node resources (memory, CPU)
2. Verify storage is mounted and accessible
3. Clear the lock if it is stale (qm unlock <vmid>), after confirming no task is still running
4. Review VM config for errors
5. Check logs: /var/log/pve/tasks/
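
A lock shows up as a lock: line in the VM configuration (for example in /etc/pve/qemu-server/<vmid>.conf). The sketch below detects it in config text; the function name find_lock is an assumption, and it deliberately stops at the first section header because snapshot sections follow the current configuration.

```python
def find_lock(config_text):
    """Return the lock type from a VM config, or None if the VM is not locked."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            break  # snapshot sections start here; only the current config matters
        if line.startswith("lock:"):
            return line.split(":", 1)[1].strip()
    return None


locked = "cores: 2\nlock: backup\nmemory: 2048\n"
print(find_lock(locked))
```

A detected lock is only stale if no backup, migration or snapshot task is still running for that VM, so the task list should be checked before unlocking.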

2. Migration Fails

Symptoms: Migration operation errors or times out

Causes:

Network connectivity issues
Storage not shared
CPU incompatibility
Insufficient resources on target

Resolution:

1. Verify network between nodes
2. Check storage is accessible from both nodes
3. Review CPU flags compatibility
4. Ensure target node has capacity
5. Try offline migration if live fails

3. Backup Job Fails

Symptoms: Backup task shows error status

Causes:

Insufficient storage space
VM locked by another operation
Snapshot creation failed
Network timeout (for remote storage)

Resolution:

1. Check storage space availability
2. Verify no other operations running on VM
3. Try manual backup to isolate issue
4. Review backup job logs
5. Prune old backups to free space

4. HA Failover Not Working

Symptoms: VM doesn't restart on another node after failure

Causes:

Cluster quorum lost
HA service not running
Fencing not configured
All nodes in HA group unavailable

Resolution:

1. Check cluster quorum status
2. Verify HA service running on all nodes
3. Review HA group configuration
4. Check fencing configuration
5. Manually start VM if needed

5. Storage Performance Issues

Symptoms: Slow VM performance, high I/O wait

Causes:

Storage backend overloaded
Network bottleneck (for remote storage)
Disk cache settings suboptimal
Too many VMs on same storage

Resolution:

1. Monitor storage backend performance
2. Check network throughput to storage
3. Adjust VM disk cache settings
4. Distribute VMs across multiple storage
5. Consider faster storage tier

More troubleshooting: See proxmox-troubleshooting.md

Security Best Practices

API Token Management

Token creation:

Create dedicated user for automation
Assign minimal required permissions
Generate API token (not password)
Store token securely (environment variables, secrets manager)
Rotate tokens periodically
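
API tokens are sent in the Authorization header in the form PVEAPIToken=USER@REALM!TOKENID=SECRET. A minimal sketch that reads the token from environment variables; the variable names PROXMOX_TOKEN_ID and PROXMOX_TOKEN_SECRET and the example token are assumptions made for this illustration.

```python
import os


def api_token_header(env=os.environ):
    """Build the Authorization header from environment variables."""
    token_id = env["PROXMOX_TOKEN_ID"]        # e.g. automation@pve!ops
    secret = env["PROXMOX_TOKEN_SECRET"]      # the token's secret value
    if "@" not in token_id or "!" not in token_id:
        raise ValueError("token ID must look like USER@REALM!TOKENID")
    return {"Authorization": f"PVEAPIToken={token_id}={secret}"}


demo_env = {"PROXMOX_TOKEN_ID": "automation@pve!ops",
            "PROXMOX_TOKEN_SECRET": "00000000-0000-0000-0000-000000000000"}
print(api_token_header(demo_env)["Authorization"].split("=")[0])
```

Keeping the secret out of source code and logs is the point of this pattern; the example prints only the scheme name for that reason.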

Permission model:

Use roles to group permissions
Assign roles to users/tokens
Follow principle of least privilege
Audit permission usage regularly

Access Control

User management:

Use realms (PAM, LDAP, AD) for authentication
Create groups for role-based access
Assign users to groups
Review access periodically

Network security:

Restrict API access by IP (firewall rules)
Use SSL/TLS for API connections
Enable two-factor authentication for users
Monitor authentication logs

Performance Optimization

Resource Allocation

CPU:

Don't overcommit CPU cores excessively
Use CPU limits for non-critical VMs
Pin CPUs for latency-sensitive workloads
Monitor CPU steal time

Memory:

Enable ballooning for dynamic allocation
Set appropriate memory limits
Monitor swap usage (should be minimal)
Use hugepages for large memory VMs

Disk I/O:

Use virtio-scsi for best performance
Enable discard/TRIM for SSDs
Configure appropriate I/O scheduler
Monitor disk latency and throughput
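
Disk settings such as discard and cache mode are options appended to the disk's volume in the VM config (for example on scsi0). A sketch that composes such an option string; the helper name disk_option_string and the example volume ID are hypothetical.

```python
CACHE_MODES = ("none", "writethrough", "writeback", "directsync", "unsafe")


def disk_option_string(volume, discard=True, ssd=False, cache=None):
    """Return a disk value such as 'local-lvm:vm-100-disk-0,discard=on'."""
    opts = [volume]
    if discard:
        opts.append("discard=on")   # pass TRIM/discard requests to the storage
    if ssd:
        opts.append("ssd=1")        # present the disk to the guest as an SSD
    if cache is not None:
        if cache not in CACHE_MODES:
            raise ValueError(f"cache must be one of {CACHE_MODES}")
        opts.append(f"cache={cache}")
    return ",".join(opts)


print(disk_option_string("local-lvm:vm-100-disk-0", ssd=True, cache="writeback"))
```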

Monitoring Strategy

Key metrics:

Node CPU, memory, disk usage
VM resource consumption
Storage performance (IOPS, latency)
Network throughput
Task completion times

Monitoring tools:

Built-in Proxmox metrics (RRD data)
External monitoring (Prometheus, Grafana)
Log aggregation (syslog, ELK stack)
Alerting for critical thresholds

Operational Workflows

For detailed step-by-step workflows, see:

proxmox-workflows.md - Common operational patterns

For troubleshooting details, see:

proxmox-troubleshooting.md - API quirks and solutions

Quick Reference

VM States

running: VM is powered on
stopped: VM is powered off
paused: VM execution suspended
suspended: VM state saved to disk

Storage Types

dir: Directory-based storage
lvm: LVM volume groups
zfspool: Local ZFS pools
rbd: Ceph RBD
nfs: NFS shares
iscsi: iSCSI targets

Backup Modes

snapshot: Minimal downtime; live backup for VMs, snapshot-capable storage needed for containers
suspend: Suspends the guest during backup
stop: Shuts the guest down for backup

HA States

started: Running on assigned node
stopped: Intentionally stopped
fence: Waiting for node fencing before recovery elsewhere
error: HA manager error

License

MIT License - Part of @bldg-7/proxmox-mcp package
