February 4, 2026 · Operations · 6 min read

LLM Maintenance Playbook: Keep Your Self-Hosted AI Running Smoothly

Self-hosting an LLM isn't a set-it-and-forget-it operation. Models need updating, GPUs need monitoring, logs need rotating, and performance needs benchmarking. Without a maintenance routine, small issues compound into 3 AM outages.

This is the maintenance playbook we follow for every production deployment. Steal it, adapt it, automate it.

Daily Tasks (Automated)

#!/bin/bash
# daily-llm-maintenance.sh — run via cron at 6:00 AM

# 1. Health check — with stream:false, curl's time_total is the full-response
#    latency for a short prompt, which we use as a rough TTFT proxy
TTFT=$(curl -s -w "%{time_total}" -o /dev/null \
  http://localhost:11434/api/generate \
  -d '{"model":"llama3:8b","prompt":"test","stream":false}')

if (( $(echo "$TTFT > 10.0" | bc -l) )); then
  echo "ALERT: TTFT is ${TTFT}s" | slack-notify "#ops"
fi

# 2. GPU memory check (first GPU only; loop over indices on multi-GPU boxes)
GPU_MEM=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -1)
GPU_TOTAL=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -1)
USAGE=$((GPU_MEM * 100 / GPU_TOTAL))
if [ "$USAGE" -gt 95 ]; then
  echo "ALERT: GPU memory at ${USAGE}%" | slack-notify "#ops"
  # Restart Ollama to clear leaked memory
  systemctl restart ollama
fi

# 3. Rotate logs older than 30 days
find /var/log/llm-audit/ -name "*.jsonl" -mtime +30 -exec gzip {} \;

# 4. Disk space check
DISK_USAGE=$(df -h /root/.ollama | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 85 ]; then
  echo "ALERT: Model disk at ${DISK_USAGE}%" | slack-notify "#ops"
fi
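The header comment assumes cron; an entry like the following runs the script at 6:00 AM and captures its output (the script path and log location are assumptions):

```shell
# crontab -e — run the daily maintenance script at 6:00 AM, logging stdout/stderr
0 6 * * * /opt/scripts/daily-llm-maintenance.sh >> /var/log/llm-maintenance.log 2>&1
```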

Weekly Tasks

  • Review error logs — check for repeated failures, timeout patterns, or unusual request volumes
  • Run benchmark suite — send a fixed set of prompts and compare response quality and speed to your baseline
  • Check for model updates — new quantizations or model versions may offer better performance
  • Review GPU temperatures — sustained temps above 85°C indicate cooling issues
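The temperature check can be automated with the same nvidia-smi pattern as the daily script. The 85°C threshold below follows the guidance above; the alert wiring is a sketch:

```shell
#!/bin/bash
# Alert if any GPU reports a temperature above the threshold.
# Reads one temperature per line, as nvidia-smi emits for multi-GPU hosts.
check_gpu_temp() {
  local threshold=85
  while read -r temp; do
    if [ "$temp" -gt "$threshold" ]; then
      echo "ALERT: GPU at ${temp}C (threshold ${threshold}C)"
    fi
  done
}

# Production usage:
# nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | check_gpu_temp
```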
#!/bin/bash
# Weekly benchmark script
PROMPTS=("Summarize this contract clause: ..." \
         "Classify this support ticket: ..." \
         "Extract the dates from this email: ...")

for prompt in "${PROMPTS[@]}"; do
  START=$(date +%s%N)
  RESPONSE=$(curl -s http://localhost:11434/api/generate \
    -d "{\"model\":\"llama3:8b\",\"prompt\":\"$prompt\",\"stream\":false}")
  END=$(date +%s%N)
  ELAPSED=$(( (END - START) / 1000000 ))
  echo "$(date) | ${prompt} | ${ELAPSED}ms" >> /var/log/llm-benchmark.log
done
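To close the loop on the "compare to your baseline" step, a simple threshold check works; the 50% regression margin and the example baseline value are assumptions:

```shell
#!/bin/bash
# Return success (exit 0) when the latest latency exceeds the baseline by more than 50%.
is_regression() {
  local latest_ms="$1" baseline_ms="$2"
  [ "$latest_ms" -gt $(( baseline_ms * 3 / 2 )) ]
}

# Example: compare the last benchmark entry against a stored baseline of 1200ms
# LATEST=$(tail -1 /var/log/llm-benchmark.log | grep -o '[0-9]\+ms' | tr -d 'ms')
# is_regression "$LATEST" 1200 && echo "Benchmark regression detected" | slack-notify "#ops"
```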

Monthly Tasks

  • Full backup of model weights, configuration, and prompt templates
  • Security audit — verify no unauthorized API access, check firewall rules
  • Cost review — tally electricity and hosting costs, then compare against cloud API pricing
  • Evaluate new models — run your benchmark suite against newer open-source models to see if switching would improve quality or reduce resource usage
  • Driver updates — keep NVIDIA drivers current, but test on staging first

Emergency Runbook

Symptom              | Likely Cause         | Fix
TTFT > 10s           | Memory leak          | Restart serving process
GPU at 100%, low TPS | Context too long     | Reduce max_tokens or add summarization
OOM errors           | KV cache overflow    | Reduce concurrent requests or context window
Gibberish output     | Corrupted model file | Re-download model, verify checksum
Connection refused   | Process crashed      | Check logs, restart, increase restart limit
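For the "verify checksum" fix, a checksum manifest recorded at download time makes verification a one-liner later. The file names here are placeholders:

```shell
#!/bin/bash
# Verify files against a manifest of known-good SHA-256 hashes.
# Manifest format is standard sha256sum output: "<hash>  <filename>".
verify_models() {
  sha256sum --check --quiet "$1"
}

# Example workflow (paths are placeholders):
#   sha256sum model.gguf > models.sha256     # record once, after a verified download
#   verify_models models.sha256 || echo "checksum mismatch — re-download the model"
```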

FAQ

How often should I restart the LLM serving process?

Once daily during off-peak hours is a good default. Some serving frameworks (especially older versions of vLLM) have memory leaks that only clear on restart. Ollama is generally more stable, but daily restarts are cheap insurance.
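If the serving process runs under systemd, the daily restart can itself be a cron entry; the 4:30 AM slot is an assumption, chosen to land before the 6:00 AM health check:

```shell
# crontab entry: restart Ollama nightly during off-peak hours
30 4 * * * systemctl restart ollama
```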

Should I auto-update models in production?

Never. Always pin a specific model version. Test new versions in staging first with your benchmark suite. A model update that looks minor can change output formatting, tone, or accuracy in ways that break downstream systems.

How do I handle model rollbacks?

Keep the previous model version downloaded locally. Use a symlink or environment variable to control which version is active. Rolling back should take under 60 seconds.
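The symlink approach can look like this sketch; the directory layout and MODEL_DIR are assumptions for illustration, not an Ollama convention:

```shell
#!/bin/bash
# Point a "current" symlink at the active model version; rollback is one swap.
# Assumed layout: $MODEL_DIR/llama3-8b-v1, $MODEL_DIR/llama3-8b-v2, ...
activate_model() {
  ln -sfn "$MODEL_DIR/$1" "$MODEL_DIR/current"   # -n replaces the symlink itself
}

# Rollback example:
#   activate_model llama3-8b-v1 && systemctl restart ollama
```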

Don't want to maintain AI infrastructure yourself?

We offer fully managed AI operations — deployment, monitoring, maintenance, and support.
