
LLM Maintenance Playbook: Keep Your Self-Hosted AI Running Smoothly
Self-hosting an LLM isn't a set-it-and-forget-it operation. Models need updating, GPUs need monitoring, logs need rotating, and performance needs benchmarking. Without a maintenance routine, small issues compound into 3 AM outages.
This is the maintenance playbook we follow for every production deployment. Steal it, adapt it, automate it.
Daily Tasks (Automated)
```bash
#!/bin/bash
# daily-llm-maintenance.sh — run via cron at 6:00 AM

# 1. Health check — with "stream":false, %{time_total} measures the full
# response time for a short prompt, not true time-to-first-token
LATENCY=$(curl -s -w "%{time_total}" -o /dev/null \
  http://localhost:11434/api/generate \
  -d '{"model":"llama3:8b","prompt":"test","stream":false}')
if (( $(echo "$LATENCY > 10.0" | bc -l) )); then
  echo "ALERT: health-check latency is ${LATENCY}s" | slack-notify "#ops"
fi

# 2. GPU memory check (first GPU; nvidia-smi prints one line per GPU)
GPU_MEM=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n1)
GPU_TOTAL=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)
USAGE=$((GPU_MEM * 100 / GPU_TOTAL))
if [ "$USAGE" -gt 95 ]; then
  echo "ALERT: GPU memory at ${USAGE}%" | slack-notify "#ops"
  # Restart Ollama to clear leaked memory
  systemctl restart ollama
fi

# 3. Compress audit logs older than 30 days
find /var/log/llm-audit/ -name "*.jsonl" -mtime +30 -exec gzip {} \;

# 4. Disk space check on the model store
DISK_USAGE=$(df -h /root/.ollama | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 85 ]; then
  echo "ALERT: Model disk at ${DISK_USAGE}%" | slack-notify "#ops"
fi
```
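The script's header comment assumes a cron entry. One way to wire it up is a system crontab file; the script path here is a placeholder for wherever you install it:

```
# /etc/cron.d/llm-maintenance
0 6 * * * root /opt/llm/daily-llm-maintenance.sh >> /var/log/llm-maintenance.log 2>&1
```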
Weekly Tasks
- Review error logs — check for repeated failures, timeout patterns, or unusual request volumes
- Run benchmark suite — send a fixed set of prompts and compare response quality and speed to your baseline
- Check for model updates — new quantizations or model versions may offer better performance
- Review GPU temperatures — sustained temps above 85°C indicate cooling issues
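The temperature review can be scripted too. A minimal sketch: the function name and the 85°C threshold are our choices, and the commented invocation assumes NVIDIA GPUs with `nvidia-smi` available.

```shell
# check_gpu_temps — warn when any GPU temperature exceeds 85C.
# Takes one reading (degrees C) per argument.
check_gpu_temps() {
    local t
    for t in "$@"; do
        if [ "$t" -gt 85 ]; then
            echo "WARNING: GPU at ${t}C, check cooling"
        fi
    done
}

# In production, feed it real readings:
# check_gpu_temps $(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits)
```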
```bash
#!/bin/bash
# Weekly benchmark script — prompts must be JSON-safe (no unescaped quotes)
PROMPTS=("Summarize this contract clause: ..." \
         "Classify this support ticket: ..." \
         "Extract the dates from this email: ...")
for prompt in "${PROMPTS[@]}"; do
  START=$(date +%s%N)
  RESPONSE=$(curl -s http://localhost:11434/api/generate \
    -d "{\"model\":\"llama3:8b\",\"prompt\":\"$prompt\",\"stream\":false}")
  END=$(date +%s%N)
  ELAPSED=$(( (END - START) / 1000000 ))   # nanoseconds -> milliseconds
  echo "$(date): ${ELAPSED}ms" >> /var/log/llm-benchmark.log
done
```
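Logging timings is only half the job; you also need to compare them against your baseline. A sketch under assumptions: log lines end in `<N>ms` (as the benchmark script above writes them), the baseline file holds a single number of milliseconds, and the 50% regression threshold is our choice.

```shell
# compare_benchmark — flag latency regressions against a stored baseline.
# Args: benchmark log file, baseline file.
compare_benchmark() {
    local log=$1 baseline_file=$2
    local avg baseline
    # Average the last 10 runs, stripping the trailing "ms"
    avg=$(tail -10 "$log" | awk '{gsub(/ms$/,"",$NF); sum+=$NF; n++} END {print int(sum/n)}')
    baseline=$(cat "$baseline_file")
    # Alert when the recent average exceeds baseline by more than 50%
    if [ "$avg" -gt $((baseline * 3 / 2)) ]; then
        echo "ALERT: benchmark avg ${avg}ms vs baseline ${baseline}ms"
    else
        echo "OK: avg ${avg}ms (baseline ${baseline}ms)"
    fi
}
```

Run it right after the benchmark loop and pipe any `ALERT` line to the same notifier the daily script uses.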
Monthly Tasks
- Full backup of model weights, configuration, and prompt templates
- Security audit — verify no unauthorized API access, check firewall rules
- Cost review — tally electricity and hosting costs, then compare the total to equivalent cloud API pricing
- Evaluate new models — run your benchmark suite against newer open-source models to see if switching would improve quality or reduce resource usage
- Driver updates — keep NVIDIA drivers current, but test on staging first
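The monthly backup can be a single tar-plus-checksum step. A sketch: the function name is ours, and every path in the example is a placeholder for your layout.

```shell
# backup_llm — archive the given paths (weights, config, prompt templates)
# into a dated tarball and record a checksum for later verification.
backup_llm() {
    local dest_dir=$1; shift     # remaining args: paths to back up
    local dest="$dest_dir/llm-$(date +%Y-%m-%d).tar.gz"
    tar -czf "$dest" "$@" && sha256sum "$dest" > "$dest.sha256"
    echo "$dest"
}

# Example (paths hypothetical):
# backup_llm /backups /root/.ollama/models /etc/ollama /opt/llm/prompts
```

Verify restores periodically with `sha256sum -c` and a test extraction; an unverified backup is a hope, not a backup.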
Emergency Runbook
| Symptom | Likely Cause | Fix |
|---|---|---|
| TTFT > 10s | Memory leak | Restart serving process |
| GPU at 100%, low TPS | Context too long | Reduce max_tokens or add summarization |
| OOM errors | KV cache overflow | Reduce concurrent requests or context window |
| Gibberish output | Corrupted model file | Re-download model, verify checksum |
| Connection refused | Process crashed | Check logs, restart, increase restart limit |
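For the "connection refused" row: if Ollama runs under systemd (as the daily script assumes), a drop-in unit can raise the restart limit. The values below are illustrative starting points, not recommendations:

```
# /etc/systemd/system/ollama.service.d/restart.conf
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=5
```

Apply with `systemctl daemon-reload && systemctl restart ollama`.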
FAQ
How often should I restart the LLM serving process?
Once daily during off-peak hours is a good default. Some serving frameworks (especially older versions of vLLM) have memory leaks that only clear on restart. Ollama is generally more stable, but daily restarts are cheap insurance.
Should I auto-update models in production?
Never. Always pin a specific model version. Test new versions in staging first with your benchmark suite. A model update that looks minor can change output formatting, tone, or accuracy in ways that break downstream systems.
How do I handle model rollbacks?
Keep the previous model version downloaded locally. Use a symlink or environment variable to control which version is active. Rolling back should take under 60 seconds.
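The symlink approach can be sketched in a few lines; the link path and model directory names here are hypothetical:

```shell
# switch_model — point a stable symlink at the active model directory so a
# rollback is one command plus a service restart.
switch_model() {
    local target=$1 link=${2:-/opt/llm/current-model}
    ln -sfn "$target" "$link"    # -n replaces the symlink itself, not its target
    echo "active model -> $(readlink "$link")"
}

# Rollback: switch_model /models/llama3-8b-v1 && systemctl restart ollama
```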
Don't want to maintain AI infrastructure yourself?
We offer fully managed AI operations — deployment, monitoring, maintenance, and support.
Book a Free SaaS Waste Audit