Interactive Tool · MLX-first

Local AI Capacity
Planner

Plan Qwen3.5 and Gemma-4 deployments on Apple Silicon (M1–M5 Pro/Max/Ultra) and NVIDIA consumer GPUs. Calculate VRAM footprint, KV cache, concurrent agent capacity, and MLX token throughput — before you commit.

🤖 Model

🖥️ Hardware

🍎 Apple Silicon · Unified Memory

🟢 NVIDIA Consumer GPUs (≥12 GB)

⚖️ Quantization

Runtime / Framework

🔀 Concurrency & Context

Concurrent Agents 2

1481216

Context Window 32k tokens

4k8k16k32k64k128k

Model Weights

14.0 GB

Q4_K_M · 27B params

KV Cache Total

1.8 GB

2 agents × 16k ctx

Total Memory

15.8 GB

weights + KV overhead

MLX Speed

29 t/s

tokens/sec per agent

Breakdown

Agent Slots

💡 Tips

Quick Reference — Q4_K_M Fit & MLX Speed

Hardware	RAM	BW	Qwen3.5-27B	Qwen3.5-35B	Gemma-4-26B	Gemma-4-31B
M4 Pro · 24 GB	24 GB	273 GB/s	~20 t/s ✅	~16 t/s ✅	~21 t/s ✅	~18 t/s ✅
M4 Pro · 48 GB	48 GB	273 GB/s	~20 t/s ✅	~16 t/s ✅	~21 t/s ✅	~18 t/s ✅
M3 Max · 36 GB	36 GB	400 GB/s	~30 t/s ✅	~23 t/s ✅	~31 t/s ✅	~26 t/s ✅
M4 Max · 48 GB	48 GB	546 GB/s	~42 t/s ✅	~32 t/s ✅	~43 t/s ✅	~36 t/s ✅
M2/M1 Max · 32 GB	32 GB	400 GB/s	~30 t/s ✅	~23 t/s ✅	~31 t/s ✅	~26 t/s ✅
2× RTX 3060 · 24 GB	24 GB	720 GB/s*	~27 t/s ✅	~21 t/s ✅	~28 t/s ✅	~23 t/s ✅
RTX 3080 Ti · 12 GB	12 GB	912 GB/s	❌ OOM	❌ OOM	❌ OOM	❌ OOM
RTX 3090 · 24 GB	24 GB	936 GB/s	~35 t/s ✅	~27 t/s ✅	~36 t/s ✅	~30 t/s ✅
RTX 4090 · 24 GB	24 GB	1008 GB/s	~37 t/s ✅	~29 t/s ✅	~39 t/s ✅	~32 t/s ✅
2× RTX 3090 · 48 GB	48 GB	1872 GB/s	~69 t/s ✅	~54 t/s ✅	~72 t/s ✅	~60 t/s ✅

* 2× RTX 3060 bandwidth is estimated for PCIe-shared split inference (llama.cpp tensor parallel). Not NVLink — effective BW is lower than aggregate peak.
MLX speeds are for Apple Silicon. NVIDIA speeds use vLLM/llama.cpp estimates at Q4_K_M. Actual results depend on batch size and system load.

Need help designing a production local-AI stack for your team?

Talk to Dataxad

Local AI Capacity Planner