linux-administration


Linux Administration Skill

Comprehensive Linux system administration and automation for Debian/Ubuntu/Mint environments


Service Management (systemd)

Essential Commands

# Status and logs
systemctl status service-name
journalctl -u service-name -n 100 --no-pager
journalctl -u service-name --since "1 hour ago"

# Control (choose one action; changes require root)
sudo systemctl start|stop|restart|reload service-name
sudo systemctl enable|disable service-name
sudo systemctl daemon-reload  # After editing unit files

Debugging Failed Services

systemctl status service-name --no-pager -l
journalctl -u service-name -p err --no-pager
systemctl list-dependencies service-name
systemd-analyze verify /etc/systemd/system/service-name.service

Custom Service Template

# /etc/systemd/system/myservice.service
[Unit]
Description=My Service Description
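# network-online.target waits for a configured network; plain network.target does not
Wants=network-online.target
After=network-online.target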

[Service]
Type=simple
User=user
WorkingDirectory=/path/to/scripts
ExecStart=/path/to/venv/bin/python /path/to/scripts/service.py
Restart=on-failure
RestartSec=10s
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Automation Patterns

Core Principles

  1. Error Handling: set -euo pipefail (bash), try/except (Python)
  2. Logging: Always log to file AND stdout
  3. Idempotency: Scripts safe to run multiple times
  4. Configuration: Use config files, not hardcoded values
  5. Notifications: Alert on failures, not just successes
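Idempotency is the principle most often skipped. Here is a minimal Python sketch of it, assuming the script's job is to maintain a single config file. The `write_if_changed` helper and its 0o644 default mode are illustrative, not a library API. It leaves the file alone when it already has the desired content. Otherwise it writes a temp file and atomically renames it into place.

```python
import os
import tempfile
from pathlib import Path


def write_if_changed(path: Path, content: str, mode: int = 0o644) -> bool:
    """Write content to path only if it differs; return True if the file changed."""
    if path.exists() and path.read_text() == content:
        return False  # already in the desired state, so a rerun is a no-op
    # Write a temp file in the same directory, then rename it over the target:
    # os.replace is atomic on POSIX, so readers never see a half-written file.
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.chmod(tmp, mode)  # mkstemp creates files as 0600
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # never leave a stray temp file behind
        raise
    return True
```

A second call with the same content returns `False` and touches nothing. That is exactly the "run twice, same result" property.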

Bash Script Template

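#!/bin/bash
set -euo pipefail

SCRIPT_NAME="$(basename "$0")"
LOG_FILE="/var/log/${SCRIPT_NAME%.sh}.log"   # writing here usually requires root
LOCK_FILE="/tmp/${SCRIPT_NAME%.sh}.lock"

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG_FILE"; }
trap 'log "ERROR: failed at line $LINENO"' ERR

# Single-instance guard: flock holds the lock on fd 9 until the script exits,
# even if it is killed, so there is no stale lock to clean up and no race
# between checking for the lock and creating it.
exec 9>"$LOCK_FILE"
if ! flock -n 9; then log "ERROR: Already running"; exit 1; fi

log "Starting $SCRIPT_NAME"
# Main logic here
log "Completed $SCRIPT_NAME"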
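Python scripts can use a comparable single-instance guard based on `fcntl.flock`, which takes the same kind of kernel lock as the `flock` command. The following is a sketch rather than a drop-in module. `acquire_lock` and the lock path in the usage note are hypothetical names.

```python
import errno
import fcntl
import os


def acquire_lock(lock_path: str):
    """Return an open file holding an exclusive lock, or None if another process holds it."""
    f = open(lock_path, "a+")  # "a+" creates the file without truncating a holder's PID
    try:
        # LOCK_NB fails immediately instead of waiting for the current holder.
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        f.close()
        if e.errno in (errno.EAGAIN, errno.EACCES):
            return None
        raise
    f.seek(0)
    f.truncate()
    f.write(str(os.getpid()))
    f.flush()
    # Keep the returned object alive: the kernel releases the lock when it is
    # closed or the process exits (even on SIGKILL), so a crash never blocks the next run.
    return f
```

Typical use: call `lock = acquire_lock("/tmp/myscript.lock")`. If it returns `None`, log "Already running" and call `sys.exit(1)`.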

Python Script Template

#!/path/to/venv/bin/python
import logging
import sys

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[logging.FileHandler('/var/log/script.log'), logging.StreamHandler(sys.stdout)]
)
logger = logging.getLogger(__name__)

def main():
    logger.info("Starting")
    try:
        pass  # Main logic
    except Exception:
        logger.exception("Failed")  # logs at ERROR level with the full traceback
        sys.exit(1)
    logger.info("Completed")

if __name__ == '__main__':
    main()

Cron vs Systemd Timers

| Feature | Cron | Systemd timer |
|---|---|---|
| Logging | Manual | Automatic (journalctl) |
| Missed runs | Lost | Caught up with Persistent=true |
| Dependencies | None | Can require other services |

Recommendation: Use systemd timers for production.
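With `Persistent=true`, systemd stores on disk when the timer last fired. When the timer is next activated (typically at boot), it triggers the service immediately if a scheduled time passed in the meantime. The check below is a simplified model of that decision for `OnCalendar=daily` (midnight). It is for illustration only and is not how systemd is implemented. `missed_daily_run` is a hypothetical name.

```python
from datetime import datetime, time, timedelta


def missed_daily_run(last_run: datetime, now: datetime) -> bool:
    """Simplified model: was a midnight (OnCalendar=daily) run due since last_run?"""
    # The next scheduled run is the first midnight after the last one that fired.
    next_due = datetime.combine(last_run.date() + timedelta(days=1), time(0, 0))
    return now >= next_due
```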

Systemd Timer Setup

# /etc/systemd/system/task.timer
[Unit]
Description=Task Timer

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/task.service (started by task.timer, which shares its name)
[Unit]
Description=Task

[Service]
Type=oneshot
User=user
ExecStart=/path/to/script.sh
StandardOutput=journal
StandardError=journal

# Enable the timer, not the service, then confirm the next scheduled run
sudo systemctl daemon-reload && sudo systemctl enable --now task.timer
systemctl list-timers --all

Cron Syntax

# minute hour day-of-month month day-of-week command
0 2 * * * /path/to/scripts/backup.sh >> /var/log/backup.log 2>&1
*/15 * * * * /path/to/scripts/health_check.sh
0 18 * * 1-5 /path/to/scripts/report.sh
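The following minimal matcher shows how the five fields combine. It supports `*`, single values, ranges, comma lists and `/` steps, but not month or weekday names or `@daily`-style macros. It follows the classic Vixie cron rule: if both day-of-month and day-of-week are restricted, a match on either one is enough. `cron_matches` and `field_matches` are illustrative names.

```python
from datetime import datetime

# Allowed (min, max) for minute, hour, day-of-month, month, day-of-week.
RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]


def field_matches(field: str, value: int, lo: int, hi: int) -> bool:
    """True if one cron field such as '*', '*/15', '1-5' or '0,30' matches value."""
    for part in field.split(","):
        base, _, step = part.partition("/")
        if base == "*":
            start, end = lo, hi
        elif "-" in base:
            start, end = (int(x) for x in base.split("-"))
        else:
            start = end = int(base)
        if start <= value <= end and (value - start) % int(step or 1) == 0:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """True if the schedule part of a crontab line fires at the minute `when`."""
    minute, hour, dom, month, dow = expr.split()[:5]
    dow_value = (when.weekday() + 1) % 7  # cron counts Sunday as 0 (7 is also Sunday)
    if not (field_matches(minute, when.minute, *RANGES[0])
            and field_matches(hour, when.hour, *RANGES[1])
            and field_matches(month, when.month, *RANGES[3])):
        return False
    dom_ok = field_matches(dom, when.day, *RANGES[2])
    dow_ok = field_matches(dow, dow_value, *RANGES[4]) or (
        dow_value == 0 and field_matches(dow, 7, *RANGES[4]))
    # If either day field starts with '*', both must match;
    # if both are restricted, matching either one is enough.
    if dom.startswith("*") or dow.startswith("*"):
        return dom_ok and dow_ok
    return dom_ok or dow_ok
```

For example, `cron_matches("0 18 * * 1-5", datetime(2025, 10, 6, 18, 0))` returns `True`, because 6 October 2025 is a Monday.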

Log Analysis

journalctl Patterns

journalctl -n 100 -o short-precise
journalctl -f
journalctl --since "2025-10-02 14:00" --until "2025-10-02 15:00"
journalctl -p err
journalctl -g "error|fail|exception" -n 500

Application Log Patterns

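# 5xx responses (field 9 is the status in nginx's default "combined" log format)
awk '$9 ~ /^5[0-9][0-9]$/' /var/log/nginx/access.log | tail -20
# Slow requests (only works if log_format appends $request_time as the last field)
awk '$NF > 1.0' /var/log/nginx/access.log | tail -20
# Top 10 client IPs
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10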
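When the one-liners get unwieldy, Python can answer the same questions. This sketch assumes nginx's default `combined` log_format; a custom `log_format` needs its own regex. The status is matched right after the quoted request line. This prevents a response size such as 512 from being miscounted as a 5xx, which a plain `grep " 5[0-9][0-9] "` can do. `summarize` is a hypothetical name.

```python
import re
from collections import Counter

# Matches the start of nginx's default "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent ...
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)


def summarize(lines):
    """Return (list of 5xx lines, Counter of client IPs) for combined-format lines."""
    errors, ips = [], Counter()
    for line in lines:
        m = COMBINED.match(line)
        if not m:
            continue  # skip lines in an unexpected format instead of crashing
        ips[m["ip"]] += 1
        if m["status"].startswith("5"):
            errors.append(line)
    return errors, ips
```

Inside a `with open("/var/log/nginx/access.log") as f:` block, `errors, ips = summarize(f)` gives the data behind `errors[-20:]` and `ips.most_common(10)`.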

Network Troubleshooting

Diagnostics

ping -c 4 google.com
nslookup hostname.domain
nc -zv host.domain 443
sudo ss -tulpn   # -l already limits output to listening sockets; grep LISTEN would hide UDP
ip route show
traceroute host.domain
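A script can run the `nc -zv` reachability test without external tools. Here is a minimal sketch; `port_open` and the 3-second default timeout are illustrative choices.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds, like `nc -zv host port`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out (socket.timeout) or DNS failure (socket.gaierror)
        return False
```

`port_open("host.domain", 443)` returns `False` for a refused connection, a timeout and a DNS failure alike, so log the host and port alongside the result.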

Firewall (ufw)

sudo ufw status verbose
sudo ufw allow 443/tcp comment 'HTTPS'
sudo ufw allow from 192.0.2.0/24 to any port 22
sudo ufw status numbered   # find the rule's number first
sudo ufw delete 5          # 5 is an example; numbers shift after each delete

Static IP (netplan)

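# /etc/netplan/01-netcfg.yaml
network:
  version: 2
  ethernets:
    ens33:                        # interface name as shown by `ip link`
      addresses: [192.0.2.100/24]
      routes:                     # replaces the deprecated gateway4 key
        - to: default             # older netplan releases need 0.0.0.0/0 here
          via: 192.0.2.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]
# Apply: sudo netplan apply (over SSH, sudo netplan try rolls back unless you confirm)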

Container Management

For Docker/Podman container operations, see the container-testing skill.


Performance Monitoring

htop
ps aux --sort=-%cpu | head -10
ps aux --sort=-%mem | head -10
iostat -x 1 5    # provided by the sysstat package
df -h
sudo du -sh /var/log/* | sort -rh | head -10
find /home -type f -size +100M -exec ls -lh {} +

Package Management (apt)

sudo apt update && sudo apt upgrade -y
sudo apt install package-name -y
sudo apt remove package-name
sudo apt --fix-broken install
sudo dpkg --configure -a

User and Permission Management

sudo adduser username
sudo usermod -aG groupname username
sudo chown -R user:group directory/
chmod 755 script.sh
namei -l /path/to/file
getfacl file && setfacl -m u:username:rwx file
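The three octal digits in `chmod 755` stand for owner, group and others. Each digit is a sum of r=4, w=2 and x=1, so 7 = rwx and 5 = r-x. The standard library's `stat.filemode` renders a mode the way `ls -l` does. The two helper names below are illustrative.

```python
import stat


def describe_mode(octal_mode: int) -> str:
    """Render a permission number like 0o755 the way `ls -l` shows a regular file."""
    return stat.filemode(stat.S_IFREG | octal_mode)


def mode_from_symbolic(perms: str) -> int:
    """Convert a 9-character string like 'rwxr-xr-x' back to an octal mode."""
    value = 0
    for ch in perms:
        value = (value << 1) | (ch != "-")  # each letter is one set bit
    return value
```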

Troubleshooting Workflows

Service Won't Start

systemctl status service-name
journalctl -u service-name -n 50 --no-pager
systemctl cat service-name
namei -l /etc/service-name/config.conf

High CPU

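ps aux --sort=-%cpu | head -5
sudo strace -p PID    # watch system calls live; Ctrl-C detaches
sudo lsof -p PID      # open files and sockets
kill PID              # SIGTERM first; use kill -9 PID only if it is ignored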

Disk Full

df -h
sudo du -xsh /* 2>/dev/null | sort -rh | head -10   # -x: don't descend into other filesystems
sudo journalctl --vacuum-time=7d
find /tmp -type f -atime +7 -delete
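A scheduled health check can catch a filling disk before it is full. The following minimal sketch assumes a 90% threshold, which is an arbitrary example value. `disk_used_percent` and `over_threshold` are hypothetical helper names.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Used space on the filesystem holding `path`, close to df's Use% column."""
    u = shutil.disk_usage(path)
    # shutil's `free` is the space available to unprivileged users, so
    # used / (used + free) leaves out root-reserved blocks, as df does.
    denominator = u.used + u.free
    return 100.0 * u.used / denominator if denominator else 0.0


def over_threshold(paths, threshold: float = 90.0):
    """Return the paths whose filesystem is more than `threshold` percent full."""
    return [p for p in paths if disk_used_percent(p) > threshold]
```

A health-check script could log the returned paths and call `sys.exit(1)` when the list is non-empty, so a timer-driven run shows up as a failed unit.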

Network Issues

ip addr show && ip route show
ping -c 4 "$(ip route show default | awk '{print $3; exit}')"   # default gateway
cat /etc/resolv.conf && nslookup google.com
sudo ufw status
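The gateway one-liner assumes that the third field of the default route is the gateway. That fails for a route without `via` (for example `default dev wg0`), where it would try to ping an interface name. Here is a slightly more defensive parse, sketched in Python with illustrative function names:

```python
import subprocess


def default_gateway(ip_route_output: str):
    """Return the 'via' address of the first default route in `ip route` output, or None."""
    for line in ip_route_output.splitlines():
        fields = line.split()
        if fields[:1] == ["default"] and "via" in fields:
            i = fields.index("via")
            if i + 1 < len(fields):
                return fields[i + 1]
    return None  # no default route, or one that goes straight out a device


def current_gateway():
    """Query the live routing table (needs iproute2's `ip` on PATH)."""
    out = subprocess.run(["ip", "route", "show", "default"],
                         capture_output=True, text=True, check=True).stdout
    return default_gateway(out)
```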

Automation Validation Checklist

Before deploying:

  • Script runs successfully manually
  • Error handling tested (force failures)
  • Logs written and readable
  • Idempotent (run twice, same result)
  • Timer/cron syntax verified
  • Permissions correct
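The "run twice" and "force failures" checks can themselves be scripted. The following is a minimal sketch. `run_twice` is a hypothetical helper, and `snapshot` is any function you supply that captures the state the script changes, such as a file's contents or a directory listing.

```python
import subprocess


def run_twice(cmd, snapshot) -> bool:
    """Run cmd twice; True if both runs succeed and the second leaves snapshot() unchanged."""
    subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
    after_first = snapshot()
    subprocess.run(cmd, check=True)
    return snapshot() == after_first
```

`check=True` raises `subprocess.CalledProcessError` on a non-zero exit. Running the helper against a deliberately broken config therefore also confirms that the script reports failure through its exit status.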

Pattern Recognition: The same 20 commands solve 80% of problems. Good automation disappears. Bad automation wakes you at 2 AM.
