Local inference using llama.cpp on Apple Silicon, tested on an M2 MacBook Air with 24 GB of unified memory. Models evaluated:
| Model | Parameters | Relative speed | Notes |
|---|---|---|---|
| Qwen3 8B | 8B params | Fast | Good for coding tasks |
| DeepSeek-R1-Distill-Qwen-7B | 7B params | Fast | Reasoning-focused |
| Mistral Nemo 12B | 12B params | Medium | Balanced |
| Mistral Small 3.1 24B | 24B params | Slow | Best quality, memory-limited |
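The "memory-limited" note on the 24B model can be sanity-checked with a rough sizing rule: a GGUF model quantized at *b* bits per weight needs roughly `params × b / 8` bytes for the weights alone, before KV cache and OS overhead. A minimal sketch (the bits-per-weight figures are approximations for common quantization types, not exact values):

```python
# Approximate effective bits per weight for common GGUF quant types.
# These are rough figures for sizing estimates, not exact on-disk values.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def model_size_gib(params_billion: float, quant: str) -> float:
    """Approximate in-memory size of the weights in GiB."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 2**30

# A 24B model at Q4_K_M needs roughly 13 GiB for weights alone; add
# KV cache and OS overhead and a 24 GB machine is tight at long contexts,
# while an 8B model at the same quant fits comfortably (~4.5 GiB).
```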
Inference runs through llama.cpp with Metal acceleration. Models are downloaded in GGUF format from HuggingFace. Running llama.cpp in server mode exposes an OpenAI-compatible API on localhost.
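A minimal client sketch against the OpenAI-compatible endpoint. Assumptions: port 8080 is llama-server's default (adjust if launched with `--port`), and the `model` field can be any string since a single-model server serves whichever GGUF it loaded:

```python
import json
import urllib.request

# Default llama-server address; change if you pass --port at launch.
BASE_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(prompt: str, temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": "local",  # ignored by a single-model llama-server
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the server speaks the OpenAI wire format, existing OpenAI client libraries also work by pointing their base URL at localhost.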