For CPU or small GPUs: use Ollama with quantized GGUF models (Llama 3.1 8B runs on 8GB RAM). For production serving: vLLM on a dedicated GPU (RTX 4090, A100, or rented). Expose via OpenAI-compatible API behind Caddy/Traefik with HTTPS.
1. **Install Ollama:** `curl -fsSL https://ollama.com/install.sh | sh`, then `ollama pull llama3.1:8b`. Ollama exposes an OpenAI-compatible API on `:11434`.
2. **Or run vLLM:** `docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Llama-3.1-8B-Instruct`. Far higher throughput than Ollama.
3. **Put Caddy in front:** `your-domain.com { reverse_proxy localhost:8000 }` gets you automatic Let's Encrypt certs.
4. **Add auth:** require an `Authorization: Bearer <key>` header at the proxy.
5. **Point clients at it:** `new OpenAI({ baseURL: 'https://your-domain.com/v1', apiKey: 'your-key' })`. This works because Ollama and vLLM both speak the OpenAI format.
6. **Monitor:** GPU utilization (`nvidia-smi`), tokens/sec, and queue depth — Prometheus + Grafana on the VPS.

| Tool | Best For | Price |
|---|---|---|
| Ollama | Easiest setup | Free |
| vLLM | High throughput | Free |
| Llama.cpp | CPU / edge | Free |
| Caddy | HTTPS proxy | Free |
| Hetzner GPU | Cheap GPU VPS | $70-500/mo |
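
The HTTPS proxy and bearer-token steps above can be combined in a single Caddyfile. A minimal sketch — `your-domain.com` and the token value are placeholders, and the single-key header check is an assumption about your auth setup, not the only way to do it:

```
your-domain.com {
    # Reject requests that don't carry the expected bearer token
    @noauth not header Authorization "Bearer your-key"
    respond @noauth 401

    # Forward everything else to the local vLLM/Ollama server
    reverse_proxy localhost:8000
}
```

For more than one key, or per-user keys, a small auth service or Caddy's `forward_auth` directive is the usual next step.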
Q: Ollama vs vLLM? Ollama: simple, slow. vLLM: complex, 10x faster at scale.
Q: Which GPU for production? RTX 4090 (24GB) for indie. A100 (80GB) for scale. H100 for frontier.
Q: Can I run on CPU only? Yes — 8B quantized model on 16GB RAM. ~5-10 tok/sec. Fine for batch jobs.
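
A quick sanity check for the CPU-only path: model memory is roughly parameters × bits-per-weight ÷ 8, plus some runtime overhead. A sketch of that heuristic — the 4.5 bits/weight figure for Q4 GGUF quantization and the 1.5 GB overhead are rough assumptions, not measured values:

```python
def quantized_model_ram_gb(params_billion: float, bits_per_weight: float,
                           overhead_gb: float = 1.5) -> float:
    """Rough RAM needed to load a quantized model: weights plus runtime overhead."""
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return weights_gb + overhead_gb

# Llama 3.1 8B at Q4 (~4.5 bits/weight incl. quantization scales)
print(round(quantized_model_ram_gb(8, 4.5), 1))  # → 6.0 (GB) — fits in 8 GB RAM
```

The same formula explains why the unquantized FP16 version (16 bits/weight) needs ~17.5 GB and won't fit on a 16 GB machine.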
Q: Is this cheaper than OpenAI API? Above ~5M tokens/mo, yes. Below that, APIs are cheaper.
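
The break-even point is just fixed monthly server cost divided by the API's per-token price. A sketch — the $70/mo VPS and $14-per-million-token blended API rate below are hypothetical numbers for illustration:

```python
def breakeven_tokens_per_month(server_usd_per_month: float,
                               api_usd_per_mtok: float) -> float:
    """Token volume at which a fixed-cost server matches per-token API pricing."""
    return server_usd_per_month / api_usd_per_mtok * 1_000_000

# Hypothetical: $70/mo GPU VPS vs a $14 per-million-token API rate
print(f"{breakeven_tokens_per_month(70, 14):,.0f} tokens/mo")  # → 5,000,000 tokens/mo
```

Plug in your actual server cost and the blended input/output price of the API model you'd otherwise use; cheaper API tiers push the crossover much higher.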
Q: Data privacy vs hosted API? 100% on-prem. Data never leaves your VPS.
Q: Can I serve multiple models? Yes — run one vLLM instance per model and put a router (e.g. LiteLLM or your reverse proxy) in front; a single vLLM server serves one model.
Self-hosting LLMs in 2026 is easier than ever. Start with Ollama on a cheap GPU VPS, graduate to vLLM when throughput matters. Full control, full privacy, predictable cost.