proxmox-admin
Proxmox VE Operations Expertise
AI Agent Skill: Operational knowledge for managing Proxmox Virtual Environment infrastructure
Overview
This skill provides AI agents with operational expertise for Proxmox VE, covering:
- VM and LXC lifecycle management - From creation to decommissioning
- Storage operations - Configuration, content management, backup strategies
- High Availability - HA groups, resource management, failover
- Cluster operations - Multi-node management, migration, replication
- Certificate management - Installation, renewal, ACME integration
- ACME configuration - Provider setup, certificate ordering, automation
- Notifications - Target configuration, delivery verification, alerting
- Troubleshooting - Common issues, API quirks, resolution patterns
- Security - Permission models, API token best practices
- Performance - Monitoring, resource optimization
Target audience: AI agents performing day-to-day Proxmox operations, infrastructure automation, or incident response.
Architecture Overview
Proxmox VE Cluster Concepts
Node: Physical server running Proxmox VE
- Hosts VMs and LXC containers
- Provides local storage
- Participates in cluster quorum
Storage: Shared or local storage backends
- Types: Directory, LVM, ZFS, Ceph, NFS, iSCSI
- Content types: Images, ISOs, backups, templates
- Can be node-local or cluster-shared
Networking: Virtual networking infrastructure
- Linux bridges for VM/LXC connectivity
- VLANs for network segmentation
- SDN (Software Defined Networking) for advanced scenarios
Cluster: Group of nodes working together
- Shared configuration via pmxcfs
- HA for automatic failover
- Live migration between nodes
Operations Playbook
1. VM Lifecycle Management
Create → Configure → Monitor → Backup → Delete
Creation
1. Get next available VMID
2. Create VM with basic config (CPU, memory, OS type)
3. Add disk(s) from storage
4. Configure network interface(s)
5. Set boot order
6. Start VM
Key considerations:
- Choose appropriate storage for disk (performance vs capacity)
- Use virtio drivers for best performance (requires guest support)
- Configure QEMU guest agent for better management
Configuration
1. Review current config
2. Resize resources (CPU, memory, disk) as needed
3. Add/remove network interfaces
4. Configure firewall rules
5. Set up snapshots for rollback capability
Best practices:
- Snapshot before major changes
- Use cloud-init for automated provisioning
- Enable QEMU guest agent for graceful operations
Monitoring
1. Check VM status (running, stopped, paused)
2. Monitor resource usage (CPU, memory, disk I/O)
3. Review task history for recent operations
4. Check logs for errors or warnings
Metrics to watch:
- CPU usage and steal time
- Memory pressure and swap usage
- Disk I/O wait times
- Network throughput
Backup
1. Create snapshot for quick rollback
2. Schedule backup job (vzdump)
3. Verify backup completed successfully
4. Test restore periodically
5. Prune old backups to manage space
Backup strategies:
- Snapshot mode: Fast, requires storage support
- Suspend mode: Pauses VM during backup
- Stop mode: Stops VM for consistent backup
Decommissioning
1. Create final backup
2. Stop VM gracefully
3. Remove from HA if configured
4. Delete VM and associated disks
5. Clean up firewall rules
6. Update documentation
2. LXC Container Management
Containers vs VMs:
- Lighter weight (shared kernel)
- Faster startup times
- Lower overhead
- Less isolation than VMs
Container Operations
1. Create from template
2. Configure resources (CPU, memory, swap)
3. Add mount points for storage
4. Configure network
5. Start container
6. Access via console or SSH
Key differences from VMs:
- Use
mp0,mp1for mount points (not disk0, disk1) - No BIOS/UEFI configuration
- Direct kernel access (privileged) or restricted (unprivileged)
- Faster snapshot/restore operations
3. Storage Management
Storage Configuration
1. List available storage
2. Add new storage backend (NFS, Ceph, etc.)
3. Configure content types (images, backups, ISOs)
4. Set storage as default for specific content
5. Monitor storage usage
Content Management
1. Upload ISOs/templates to storage
2. Download from URL to storage
3. List storage content
4. Delete unused content
5. Restore files from backups
Backup Management
1. Create backup jobs (manual or scheduled)
2. Configure retention policy
3. Prune old backups automatically
4. Restore from backup
5. Verify backup integrity
Backup best practices:
- Use compression to save space
- Store backups on separate storage
- Test restore procedures regularly
- Document backup schedules
- Monitor backup job success/failure
4. High Availability (HA)
HA Configuration
1. Create HA group (define node priorities)
2. Add VM/LXC to HA management
3. Configure HA settings (max relocate, max restart)
4. Monitor HA status
5. Test failover scenarios
HA States:
- started: Resource running on assigned node
- stopped: Resource intentionally stopped
- fence: Node fenced, resource will be restarted elsewhere
- error: HA manager encountered an error
When to use HA:
- Critical services requiring high uptime
- Automatic failover needed
- Cluster has 3+ nodes (for quorum)
5. Migration
Live Migration (Online)
1. Verify target node has resources
2. Check shared storage access
3. Initiate migration
4. Monitor migration progress
5. Verify VM running on new node
Requirements:
- Shared storage for VM disks
- Network connectivity between nodes
- Compatible CPU types (or CPU flags masked)
Offline Migration
1. Stop VM/LXC
2. Migrate to target node
3. Start on new node
Use cases:
- No shared storage available
- Maintenance on source node
- CPU incompatibility
Troubleshooting Guide
Common Issues
1. VM Won't Start
Symptoms: Start operation fails or VM immediately stops
Causes:
- Insufficient resources on node
- Storage unavailable
- Lock file present
- Configuration error
Resolution:
1. Check node resources (memory, CPU)
2. Verify storage is mounted and accessible
3. Remove lock file if stale
4. Review VM config for errors
5. Check logs: /var/log/pve/tasks/
2. Migration Fails
Symptoms: Migration operation errors or times out
Causes:
- Network connectivity issues
- Storage not shared
- CPU incompatibility
- Insufficient resources on target
Resolution:
1. Verify network between nodes
2. Check storage is accessible from both nodes
3. Review CPU flags compatibility
4. Ensure target node has capacity
5. Try offline migration if live fails
3. Backup Job Fails
Symptoms: Backup task shows error status
Causes:
- Insufficient storage space
- VM locked by another operation
- Snapshot creation failed
- Network timeout (for remote storage)
Resolution:
1. Check storage space availability
2. Verify no other operations running on VM
3. Try manual backup to isolate issue
4. Review backup job logs
5. Prune old backups to free space
4. HA Failover Not Working
Symptoms: VM doesn't restart on another node after failure
Causes:
- Cluster quorum lost
- HA service not running
- Fencing not configured
- All nodes in HA group unavailable
Resolution:
1. Check cluster quorum status
2. Verify HA service running on all nodes
3. Review HA group configuration
4. Check fencing configuration
5. Manually start VM if needed
5. Storage Performance Issues
Symptoms: Slow VM performance, high I/O wait
Causes:
- Storage backend overloaded
- Network bottleneck (for remote storage)
- Disk cache settings suboptimal
- Too many VMs on same storage
Resolution:
1. Monitor storage backend performance
2. Check network throughput to storage
3. Adjust VM disk cache settings
4. Distribute VMs across multiple storage
5. Consider faster storage tier
More troubleshooting: See proxmox-troubleshooting.md
Security Best Practices
API Token Management
Token creation:
- Create dedicated user for automation
- Assign minimal required permissions
- Generate API token (not password)
- Store token securely (environment variables, secrets manager)
- Rotate tokens periodically
Permission model:
- Use roles to group permissions
- Assign roles to users/tokens
- Follow principle of least privilege
- Audit permission usage regularly
Access Control
User management:
- Use realms (PAM, LDAP, AD) for authentication
- Create groups for role-based access
- Assign users to groups
- Review access periodically
Network security:
- Restrict API access by IP (firewall rules)
- Use SSL/TLS for API connections
- Enable two-factor authentication for users
- Monitor authentication logs
Performance Optimization
Resource Allocation
CPU:
- Don't overcommit CPU cores excessively
- Use CPU limits for non-critical VMs
- Pin CPUs for latency-sensitive workloads
- Monitor CPU steal time
Memory:
- Enable ballooning for dynamic allocation
- Set appropriate memory limits
- Monitor swap usage (should be minimal)
- Use hugepages for large memory VMs
Disk I/O:
- Use virtio-scsi for best performance
- Enable discard/TRIM for SSDs
- Configure appropriate I/O scheduler
- Monitor disk latency and throughput
Monitoring Strategy
Key metrics:
- Node CPU, memory, disk usage
- VM resource consumption
- Storage performance (IOPS, latency)
- Network throughput
- Task completion times
Monitoring tools:
- Built-in Proxmox metrics (RRD data)
- External monitoring (Prometheus, Grafana)
- Log aggregation (syslog, ELK stack)
- Alerting for critical thresholds
Operational Workflows
For detailed step-by-step workflows, see:
- proxmox-workflows.md - Common operational patterns
For troubleshooting details, see:
- proxmox-troubleshooting.md - API quirks and solutions
Quick Reference
VM States
- running: VM is powered on
- stopped: VM is powered off
- paused: VM execution suspended
- suspended: VM state saved to disk
Storage Types
- dir: Directory-based storage
- lvm: LVM volume groups
- zfs: ZFS pools
- ceph: Ceph RBD
- nfs: NFS shares
- iscsi: iSCSI targets
Backup Modes
- snapshot: Fast, requires storage support
- suspend: Pauses VM during backup
- stop: Stops VM for backup
HA States
- started: Running on assigned node
- stopped: Intentionally stopped
- fence: Node fenced, restarting elsewhere
- error: HA manager error
License
MIT License - Part of @bldg-7/proxmox-mcp package