For CPU or small GPUs: use Ollama with quantized GGUF models (Llama 3.1 8B runs on 8GB RAM). For production serving: vLLM on a dedicated GPU (RTX 4090, A100, or rented). Expose via OpenAI-compatible API behind Caddy/Traefik with HTTPS.
1. **Install Ollama:** `curl -fsSL https://ollama.com/install.sh | sh`, then `ollama pull llama3.1:8b`. Ollama exposes an OpenAI-compatible API on `:11434`.
2. **Or run vLLM:** `docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Llama-3.1-8B-Instruct`. Far higher throughput than Ollama.
3. **Put Caddy in front:** `your-domain.com { reverse_proxy localhost:8000 }` gets you automatic Let's Encrypt certs.
4. **Add auth:** require an `Authorization: Bearer <key>` header at the proxy.
5. **Point clients at it:** `new OpenAI({ baseURL: 'https://your-domain.com/v1', apiKey: 'your-key' })`. This works because Ollama and vLLM both speak the OpenAI format.
6. **Monitor:** GPU utilization (`nvidia-smi`), tokens/sec, and queue depth — Prometheus + Grafana on the VPS.

| Tool | Best For | Price |
|---|---|---|
| Ollama | Easiest setup | Free |
| vLLM | High throughput | Free |
| Llama.cpp | CPU / edge | Free |
| Caddy | HTTPS proxy | Free |
| Hetzner GPU | Cheap GPU VPS | $70-500/mo |
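
The HTTPS proxy and bearer-token steps above can be combined in a single Caddyfile. A minimal sketch — `your-domain.com` and the token value are placeholders, and the single-key header check is an assumption about your auth setup, not the only way to do it:

```
your-domain.com {
    # Reject requests that don't carry the expected bearer token
    @noauth not header Authorization "Bearer your-key"
    respond @noauth 401

    # Forward everything else to the local vLLM/Ollama server
    reverse_proxy localhost:8000
}
```

For more than one key, or per-user keys, a small auth service or Caddy's `forward_auth` directive is the usual next step.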
Q: Ollama vs vLLM? Ollama: simple, slow. vLLM: complex, 10x faster at scale.
Q: Which GPU for production? RTX 4090 (24GB) for indie. A100 (80GB) for scale. H100 for frontier.
Q: Can I run on CPU only? Yes — 8B quantized model on 16GB RAM. ~5-10 tok/sec. Fine for batch jobs.
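
A quick sanity check for the CPU-only path: model memory is roughly parameters × bits-per-weight ÷ 8, plus some runtime overhead. A sketch of that heuristic — the 4.5 bits/weight figure for Q4 GGUF quantization and the 1.5 GB overhead are rough assumptions, not measured values:

```python
def quantized_model_ram_gb(params_billion: float, bits_per_weight: float,
                           overhead_gb: float = 1.5) -> float:
    """Rough RAM needed to load a quantized model: weights plus runtime overhead."""
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return weights_gb + overhead_gb

# Llama 3.1 8B at Q4 (~4.5 bits/weight incl. quantization scales)
print(round(quantized_model_ram_gb(8, 4.5), 1))  # → 6.0 (GB) — fits in 8 GB RAM
```

The same formula explains why the unquantized FP16 version (16 bits/weight) needs ~17.5 GB and won't fit on a 16 GB machine.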
Q: Is this cheaper than OpenAI API? Above ~5M tokens/mo, yes. Below that, APIs are cheaper.
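
The break-even point is just fixed monthly server cost divided by the API's per-token price. A sketch — the $70/mo VPS and $14-per-million-token blended API rate below are hypothetical numbers for illustration:

```python
def breakeven_tokens_per_month(server_usd_per_month: float,
                               api_usd_per_mtok: float) -> float:
    """Token volume at which a fixed-cost server matches per-token API pricing."""
    return server_usd_per_month / api_usd_per_mtok * 1_000_000

# Hypothetical: $70/mo GPU VPS vs a $14 per-million-token API rate
print(f"{breakeven_tokens_per_month(70, 14):,.0f} tokens/mo")  # → 5,000,000 tokens/mo
```

Plug in your actual server cost and the blended input/output price of the API model you'd otherwise use; cheaper API tiers push the crossover much higher.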
Q: Data privacy vs hosted API? 100% on-prem. Data never leaves your VPS.
Q: Can I serve multiple models? Yes — run one vLLM instance per model and put a router (e.g. LiteLLM or your reverse proxy) in front; a single vLLM server serves one model.
Self-hosting LLMs in 2026 is easier than ever. Start with Ollama on a cheap GPU VPS, graduate to vLLM when throughput matters. Full control, full privacy, predictable cost.