# LLM Quantization Explained: Run 70B Models on Consumer Hardware
A 70-billion-parameter model in full precision (FP16) weighs about 140 GB (two bytes per parameter). That's more RAM than most workstations have. Quantization compresses those weights, often by 75% or more, with surprisingly little impact on output quality. It's the single most important technique for running serious models on real-world hardware.
## What Quantization Actually Does
In simple terms: it reduces the precision of each number in the model's weight matrix. Instead of storing each weight as a 16-bit floating-point number (FP16), you store it as an 8-bit integer (INT8) or even 4-bit (INT4). Fewer bits per weight means less memory, faster loading, and faster inference.
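In code, the FP16-to-INT8 step is just rescale-and-round. A minimal numpy sketch, assuming one scale for the whole tensor (real quantizers use per-block or per-channel scales):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map FP weights to INT8 plus a single scale factor."""
    scale = np.abs(w).max() / 127.0          # largest weight maps to +/-127
    q = np.round(w / scale).astype(np.int8)  # 1 byte per weight instead of 2
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP weights at inference time."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4096,)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print(q.nbytes, w.astype(np.float16).nbytes)  # 4096 vs 8192 bytes: half the memory
print(float(np.abs(w - w_hat).max()))         # worst-case error is about scale/2
```

The round trip never loses more than half a quantization step per weight, which is why the output quality holds up so well.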
| Precision | Bits/Weight | 70B Model Size | RAM Needed |
|---|---|---|---|
| FP16 | 16 | ~140 GB | 140+ GB |
| INT8 (Q8) | 8 | ~70 GB | 72+ GB |
| INT4 (Q4_K_M) | 4–5 | ~40 GB | 42+ GB |
| INT3 (Q3_K_S) | 3–4 | ~30 GB | 32+ GB |
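The table's sizes follow from a simple calculation: parameters times bits per weight, divided by 8. A quick sketch (the bits-per-weight figures for the K-quants are approximations):

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone. Ignores KV cache and
    runtime overhead, which is why the RAM column adds a few GB."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB, matching the table's rough figures

for label, bits in [("FP16", 16), ("Q8", 8), ("Q4_K_M", 4.5), ("Q3_K_S", 3.5)]:
    print(f"{label:7s} ~{model_size_gb(70, bits):.0f} GB")
```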
## The Three Quantization Formats You Should Know
### GGUF (for Ollama and llama.cpp)
This is what Ollama uses under the hood. GGUF is a file format that packages quantized weights in a way that's optimized for CPU inference. If you're running models locally on a Mac or Linux machine without a dedicated GPU, this is your format.
```shell
# Download a Q4 quantized model via Ollama
ollama pull llama3:70b-instruct-q4_K_M

# Or convert your own with llama.cpp
./quantize model-f16.gguf model-q4_K_M.gguf Q4_K_M
```
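What makes Q4_K_M average closer to 4.5 bits per weight than 4 is block-wise scaling: weights are stored in small blocks, each with its own scale factor. A simplified sketch of that idea, assuming symmetric rounding and 32-weight blocks (the actual GGUF K-quant layouts are more involved):

```python
import numpy as np

def quantize_blocks_q4(w: np.ndarray, block: int = 32):
    """Quantize to 4-bit integers with one FP16 scale per block of 32 weights.
    Overhead: 16 extra bits per 32 weights = 0.5 bits, so ~4.5 bits/weight."""
    w = w.reshape(-1, block)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0  # 4-bit signed range -8..7
    scales[scales == 0] = 1.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_blocks(q, scales):
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

w = np.random.default_rng(1).normal(size=(1024,)).astype(np.float32)
q, s = quantize_blocks_q4(w)
err = float(np.abs(w - dequantize_blocks(q, s)).max())
```

Per-block scales keep one outlier weight from wrecking the precision of the entire tensor, at the cost of half a bit of overhead per weight.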
### GPTQ (for GPU inference)
GPTQ is designed for GPU inference, primarily on NVIDIA hardware. It quantizes weights using a calibration process that minimizes each layer's output error instead of rounding every weight independently. Be realistic about memory, though: a 24 GB card like an RTX 4090 comfortably fits a 4-bit model of roughly 30B parameters, while a 4-bit 70B model still needs around 40 GB, so it takes two such GPUs or partial CPU offloading.
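The calibration process can be sketched as a toy version of GPTQ's core loop: quantize one weight-matrix column at a time, then use the inverse Hessian's Cholesky factor to push that column's rounding error onto columns not yet quantized. Real GPTQ adds blocking, grouping, and much engineering; this numpy sketch shows only the idea:

```python
import numpy as np

def rtn(W, qmax=7):
    """Plain round-to-nearest 4-bit baseline, one scale per output row."""
    s = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / s), -qmax - 1, qmax) * s

def gptq_toy(W, X, qmax=7, damp=0.01):
    """Toy GPTQ: column-by-column quantization with error compensation."""
    W = W.astype(np.float64).copy()
    H = X.T @ X                                  # proxy Hessian of the layer loss
    H += damp * np.diag(H).mean() * np.eye(H.shape[0])
    U = np.linalg.cholesky(np.linalg.inv(H)).T   # upper-triangular factor
    s = np.abs(W).max(axis=1, keepdims=True) / qmax
    for j in range(W.shape[1]):
        q = np.clip(np.round(W[:, j] / s[:, 0]), -qmax - 1, qmax) * s[:, 0]
        err = (W[:, j] - q) / U[j, j]
        W[:, j:] -= np.outer(err, U[j, j:])      # compensate in later columns
        W[:, j] = q
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 64)) @ rng.normal(size=(64, 64))  # correlated inputs
W = rng.normal(size=(16, 64))
err_gptq = np.linalg.norm(X @ gptq_toy(W, X).T - X @ W.T)
err_rtn = np.linalg.norm(X @ rtn(W).T - X @ W.T)
```

On correlated calibration inputs the compensated version reproduces the layer's outputs noticeably better than naive rounding at the same bit width.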
### AWQ (Activation-Aware Weight Quantization)
AWQ is a more recent approach. It identifies the small fraction of weight channels that see the largest activations and protects them from aggressive quantization while compressing the rest. In published benchmarks, AWQ often matches or outperforms GPTQ at the same bit width, especially on reasoning tasks.
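The "protect important weights" idea can be sketched in the spirit of AWQ: scale up input channels with large activations before rounding, fold the inverse scale back into the weights, and grid-search the scaling exponent on calibration data. The exponent grid and per-row quantizer here are illustrative simplifications, not the paper's exact recipe:

```python
import numpy as np

def quant_rows(W, qmax=7):
    """4-bit round-to-nearest with one scale per output row."""
    s = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / s), -qmax - 1, qmax) * s

def awq_toy(W, X, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Toy AWQ: scale salient input channels before quantization and
    grid-search the exponent; alpha=0 falls back to plain rounding."""
    act = np.abs(X).mean(axis=0)      # per-channel activation magnitude
    ref = X @ W.T
    best = None
    for a in alphas:
        s = act ** a
        s = s / s.mean()              # keep scales centered around 1
        Wq = quant_rows(W * s) / s    # fold 1/s back into the stored weights
        err = np.linalg.norm(X @ Wq.T - ref)
        if best is None or err < best[0]:
            best = (err, Wq, a)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))
X[:, :4] *= 10.0                      # a few channels dominate the activations
W = rng.normal(size=(8, 32))
best_err, Wq, best_alpha = awq_toy(W, X)
rtn_err = np.linalg.norm(X @ quant_rows(W).T - X @ W.T)
```

Because the scales are folded into the weights (and, in a real model, into the preceding layer), the protection costs no extra bits at inference time.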
## Practical Recommendations
- Mac / CPU only: Use GGUF via Ollama. Q4_K_M is the sweet spot for quality vs. size.
- NVIDIA GPU: Use AWQ if your framework supports it, otherwise GPTQ.
- Production serving: Use vLLM with AWQ for high-throughput deployments.
## FAQ
### Does quantization degrade output quality?
At Q4 (4-bit), most users cannot tell the difference in real tasks. Academic benchmarks show 1–3% accuracy loss compared to FP16. At Q3 and below, quality starts to degrade noticeably for complex reasoning.
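You can see the falloff directly on random weights: the rounding error roughly doubles with each bit removed. This measures weight-space error, not task accuracy, but the two track each other:

```python
import numpy as np

def roundtrip_rmse(w, bits):
    """RMS error after symmetric quantization to the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(w).max() / qmax
    q = np.clip(np.round(w / s), -qmax - 1, qmax)
    return float(np.sqrt(np.mean((w - q * s) ** 2)))

w = np.random.default_rng(0).normal(size=100_000)
for bits in (8, 4, 3):
    print(bits, roundtrip_rmse(w, bits))  # error grows as bits shrink
```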
### Can I quantize any model?
Mostly. Tools like llama.cpp (for GGUF) and AutoGPTQ work with most Hugging Face models, as long as the model uses a supported architecture (LLaMA, Mistral, Qwen, etc.).
### What's the best quantization for Apple Silicon?
Q4_K_M via Ollama. Apple's unified memory architecture means your M1/M2/M3 can use all available RAM for model weights, making it one of the best consumer platforms for local LLMs.
Running into hardware limits?
We optimize model deployment for your specific infrastructure — from single-GPU setups to multi-node clusters.