February 8, 2026 Operations 5 min read

Monitoring Self-Hosted LLMs: Metrics That Actually Matter

Getting a model running locally is the easy part. Keeping it running reliably in production without degraded performance, runaway memory usage, or silent quality drift — that's where most teams stumble.

Having managed self-hosted LLM infrastructure for enterprise clients, I've narrowed down the metrics that actually matter from the ones that look impressive on dashboards but don't help you fix problems.

The Four Metrics That Matter

1. Time to First Token (TTFT)

How long until the user sees the first word of a response. For interactive applications, keep this under 500ms. If TTFT starts creeping up, you're either overloading the GPU or letting the request queue back up.

# Measure TTFT with curl (head -1 exits after the first streamed chunk)
time curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3:8b","prompt":"Hello","stream":true}' \
  | head -1
# Or have curl report time-to-first-byte directly:
curl -s -o /dev/null -w "TTFT: %{time_starttransfer}s\n" \
  http://localhost:11434/api/generate \
  -d '{"model":"llama3:8b","prompt":"Hello","stream":true}'

2. Tokens Per Second (TPS)

Throughput matters for user experience. Llama 3 8B should produce 30–60 tokens/second on a modern GPU. If you're seeing less than 15, something is wrong — check GPU utilization and thermal throttling.

3. GPU Memory Utilization

Monitor this continuously. LLMs consume memory for model weights AND for the KV cache (which grows with context length). A model that runs fine with short prompts might OOM with long conversations.

# Watch GPU metrics in real-time
watch -n 1 nvidia-smi

# Or get structured output for logging
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu \
  --format=csv -l 5
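For a sense of scale, here's a back-of-envelope KV cache estimate for Llama 3 8B (32 layers, 8 KV heads via GQA, head dimension 128, fp16 cache). Treat it as a rough sketch — actual usage depends on your serving framework and quantization:

```shell
# K and V each store layers * kv_heads * head_dim values per token,
# at 2 bytes each in fp16.
PER_TOKEN=$((2 * 32 * 8 * 128 * 2))                      # bytes per token
echo "Per token: $((PER_TOKEN / 1024)) KiB"               # 128 KiB
echo "At 8192-token context: $((PER_TOKEN * 8192 / 1024 / 1024)) MiB"  # 1024 MiB
```

So a single 8K-token conversation adds roughly a gigabyte on top of the model weights — which is exactly why short-prompt testing won't catch these OOMs.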

4. Output Quality Score

The hardest to measure, but the most important. Set up periodic automated evaluations that send known prompts and check answers. Use an LLM-as-judge approach or simple string matching for factual queries. Quality can drift if you quietly update a model version or change quantization.
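A minimal string-matching check along those lines, assuming the Ollama endpoint from earlier and `jq` for JSON parsing — the prompt and expected substring are illustrative, so swap in factual queries from your own domain:

```shell
# Send a known factual prompt and check the answer for an expected substring.
ANSWER=$(curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3:8b","prompt":"What is the capital of France? Answer in one word.","stream":false}' \
  | jq -r '.response')
if echo "$ANSWER" | grep -qi 'paris'; then
  echo "PASS"
else
  echo "FAIL: got '$ANSWER'"
fi
```

Run a handful of these on a schedule and alert on the pass rate, not on individual failures — a single flaky answer is noise, a trend is drift.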

Monitoring Stack Recommendation

  • Langfuse (open source) — traces every LLM call with latency, tokens, cost, and user feedback. Self-hostable.
  • Prometheus + Grafana — for GPU metrics, request queues, and system health. Industry standard.
  • Custom health check endpoint — a simple API that sends a test prompt every 60 seconds and logs the response time.

#!/bin/bash
# Simple health check: time one test prompt, alert if slow (run from cron every minute)
RESPONSE=$(curl -s -m 30 -w "%{time_total}" -o /dev/null \
  http://localhost:11434/api/generate \
  -d '{"model":"llama3:8b","prompt":"Respond with OK","stream":false}') || RESPONSE=999

if (( $(echo "$RESPONSE > 5.0" | bc -l) )); then
  echo "WARNING: LLM response time ${RESPONSE}s" | \
    mail -s "LLM Health Alert" ops@yourcompany.com
fi

Common Problems and Fixes

  • Slow TTFT after hours of uptime — usually a memory leak in the serving framework. Restart the service daily via cron until the leak is fixed upstream.
  • GPU at 100% but low TPS — context length is too long. Implement a sliding window or summarize older turns.
  • Inconsistent quality — temperature set too high, or the model version changed underneath you. Pin model versions.
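The daily-restart workaround fits in a single crontab entry. The `ollama` unit name here is an assumption — substitute whatever your serving process is called:

```shell
# crontab entry: restart the serving process nightly at 04:00,
# before the leak from the previous day's traffic becomes a problem
0 4 * * * /usr/bin/systemctl restart ollama
```

Schedule it for your lowest-traffic hour, and make sure your health check (above) would catch a restart that fails to come back up.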

FAQ

How often should I monitor LLM health?

Continuously for GPU metrics (every 10–30 seconds). Every 1–5 minutes for inference health checks. Daily for output quality evaluations.

What's a reasonable uptime target for self-hosted LLMs?

99.5% is realistic for a single-GPU setup with proper monitoring and auto-restart. For 99.9% and above, you need redundant instances with load balancing.

Need production-grade LLM infrastructure?

We deploy and monitor self-hosted AI systems that run reliably 24/7.

Book a Free SaaS Waste Audit