
# GPU Buyer's Guide for Self-Hosted AI: What You Actually Need
The most common question we hear from businesses exploring self-hosted AI: "What hardware do I need?" The answer depends entirely on what you're running, how many concurrent users you have, and whether speed or cost is your priority.
After deploying local LLMs across dozens of different hardware configurations, here's a no-nonsense breakdown.
## The Golden Rule: VRAM Is Everything
For GPU inference, the bottleneck is almost always VRAM (Video RAM). The model's weights need to fit entirely in VRAM for optimal performance. If they spill over into system RAM, inference slows to a crawl.
| Model Size | Quantization | VRAM Needed | Recommended GPU |
|---|---|---|---|
| 7–8B | Q4 | 5–6 GB | RTX 3060 12GB / M1 Pro 16GB |
| 13B | Q4 | 9–10 GB | RTX 4070 12GB / M2 Pro 16GB |
| 34B | Q4 | 20–22 GB | RTX 4090 24GB / M3 Max 36GB |
| 70B | Q4 | 38–42 GB | 2× RTX 4090 / M3 Ultra 128GB |
| 70B | FP16 | 140+ GB | 2× A100 80GB / H100 |
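The figures in the table can be sanity-checked with a back-of-the-envelope formula: parameter count times bytes per weight, plus overhead for the KV cache and activations. The 1.2 multiplier below is an assumed ballpark, not a measured figure:

```shell
# Rough VRAM estimate in GB: params (billions) x bytes per weight x ~1.2
# overhead for KV cache and activations (assumed ballpark multiplier).
# Q4 quantization is roughly 0.5 bytes/weight; FP16 is 2 bytes/weight.
estimate_vram_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b * 1.2 }'
}

estimate_vram_gb 70 0.5   # 70B at Q4  -> ~42 GB
estimate_vram_gb 70 2.0   # 70B at FP16 -> ~168 GB
```

The overhead multiplier grows with context length, so budget more headroom if you plan to run long-context workloads.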
## Platform Breakdown
### Apple Silicon (M1/M2/M3/M4)
Surprisingly strong for local AI work. Apple's unified memory architecture means the GPU can access all system RAM, so a Mac Studio with 128 GB unified memory can run 70B models in Q4 quantization. Throughput is lower than dedicated NVIDIA GPUs, but for single-user workloads, it's excellent — and completely silent.
```shell
# Check your Mac's memory
system_profiler SPHardwareDataType | grep Memory

# Run a 70B model on a high-memory Mac
ollama run llama3:70b-instruct-q4_K_M
```
### NVIDIA (Consumer: RTX series)
The RTX 4090 with 24 GB VRAM is the workhorse of local AI. For most teams, one or two 4090s cover 90% of use cases. Multi-GPU setups are straightforward with vLLM's tensor parallelism.
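A two-GPU launch might look like the following sketch, assuming vLLM is installed and both GPUs are visible; the model ID and context length are illustrative:

```shell
# Serve a 70B model split across two GPUs with tensor parallelism.
# Model ID and --max-model-len are illustrative assumptions.
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192
```

Tensor parallelism shards each layer's weights across the GPUs, so both cards' VRAM pools effectively combine for a single model.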
### NVIDIA (Enterprise: A100 / H100)
For production deployments with multiple concurrent users, enterprise GPUs are worth the investment. Each A100 80GB can serve a 70B model in Q4 with room for KV cache. H100s are another step up in throughput but priced accordingly.
### AMD (MI series)
AMD's ROCm stack has improved significantly. The MI300X, with 192 GB of HBM3 memory, is compelling for large models. The software ecosystem is still catching up to CUDA, but for teams willing to invest in optimization, the price per gigabyte of VRAM is hard to beat.
## Cost Comparison: Buy vs. Cloud
The breakeven point depends on usage. If you're running inference 8+ hours per day, owning hardware pays for itself within 3–6 months. For bursty, low-volume workloads, cloud GPU instances (RunPod, Lambda, or AWS G5) are more cost-effective.
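A back-of-the-envelope version of that breakeven math, with assumed (not quoted) prices:

```shell
# Breakeven months = hardware cost / monthly cloud spend.
# All figures are illustrative assumptions: a $4,000 dual-GPU box vs.
# renting two cloud GPUs at $2.00/hr each, 8 hours/day, 30 days/month.
breakeven_months() {
  awk -v hw="$1" -v rate="$2" -v gpus="$3" -v hrs="$4" \
    'BEGIN { printf "%.1f\n", hw / (rate * gpus * hrs * 30) }'
}

breakeven_months 4000 2.00 2 8   # -> ~4.2 months
```

Plug in your own hardware quote and the hourly rate of the cloud instance you would otherwise rent; the conclusion is sensitive to daily utilization above all else.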
## FAQ
### Can I run LLMs on a CPU only?
Yes. Ollama with GGUF models runs entirely on CPU. A modern 16-core processor with 32+ GB RAM can run 8B models at usable speeds (5–10 tokens/second). Not fast enough for production serving, but fine for development and testing.
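To test CPU-only behavior on a machine that does have a GPU, you can hide the GPUs from Ollama — a sketch assuming an NVIDIA machine, with an illustrative model tag:

```shell
# Start the Ollama server with no CUDA devices visible, forcing CPU inference.
CUDA_VISIBLE_DEVICES="" ollama serve &

# --verbose prints timing stats, including tokens/second, after each response.
ollama run llama3:8b --verbose
```

On a machine with no GPU at all, Ollama falls back to CPU automatically and no environment variable is needed.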
### Should I buy used enterprise GPUs?
Used A100 40GB cards can be found for $5,000–$8,000 and are an excellent value. Just make sure they weren't used for mining and come with a warranty. The RTX A6000 is another strong option at 48 GB of VRAM.
### How much power does a GPU server draw?
A single RTX 4090 draws about 450W under load. A dual-4090 server with CPU and cooling will pull 800–1,200W. Factor in roughly $100–$200/month in electricity costs depending on your region.
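That range follows from simple arithmetic, assuming an average 1,000 W draw running around the clock; the $/kWh rates below are assumptions, so substitute your local rate:

```shell
# Monthly cost = (watts / 1000) * 24 hours * 30 days * $/kWh.
monthly_power_cost() {
  awk -v w="$1" -v rate="$2" \
    'BEGIN { printf "$%.0f\n", (w / 1000) * 24 * 30 * rate }'
}

monthly_power_cost 1000 0.15   # -> $108 at $0.15/kWh
monthly_power_cost 1000 0.25   # -> $180 at $0.25/kWh
```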
### Not sure what hardware to get?
We design and deploy AI infrastructure tailored to your workload, from single-GPU setups to multi-node clusters.
Book a Free SaaS Waste Audit