Self-Hosting AI Models
When cloud APIs don't fit — privacy requirements, air-gapped networks, or pure economics — self-hosting is the answer.
Why Self-Host?
Four compelling reasons:
1. Data Privacy
- Sensitive data never leaves your network
- No third-party logs of your prompts or responses
- Required for classified environments, healthcare (HIPAA), finance
2. Cost at Scale
- Zero per-token cost after hardware investment
- Break-even at ~10,000 requests/day for medium models
- Typically 50-80% cheaper than cloud APIs at sustained high volume
3. Air-Gapped Operations
- Military, government, critical infrastructure
- No internet dependency
- Full operational control
4. Customization
- Fine-tune models on your proprietary data
- Custom tokenizers for domain-specific vocabulary
- Full control over model versions and updates
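The break-even claim in reason 2 can be sanity-checked with rough arithmetic. A minimal sketch, where the GPU cost and per-request cloud price are illustrative assumptions, not quotes:

```python
# Rough self-hosting break-even estimate. All prices are illustrative
# assumptions -- adjust for your actual cloud rates and hardware costs.

def breakeven_requests_per_day(monthly_gpu_cost, cloud_cost_per_request):
    """Daily volume at which a fixed GPU bill matches per-request cloud spend."""
    return monthly_gpu_cost / (cloud_cost_per_request * 30)

# Example: $1,500/month of GPU capacity vs. ~$0.005 per cloud request
# (a medium model, a few thousand tokens per request).
print(round(breakeven_requests_per_day(1500, 0.005)))  # 10000
```

Below that volume the idle GPU is dead weight; above it, every additional request is effectively free.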
Ollama — The Easiest Path
Ollama makes running local models as simple as Docker:
Install:
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Or with Homebrew
brew install ollama
Run a model:
ollama run llama3.3 # Chat with Llama 3.3 70B
ollama run qwen3:30b # Qwen 3 30B
ollama run phi4 # Microsoft Phi-4 14B
ollama run codellama:34b # Code-specialized Llama
Use as an API (OpenAI-compatible):
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.3",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Connect to your tools:
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="not-needed"
# Now every OpenAI-compatible tool uses your local model
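The same call works from any language via plain HTTP. A stdlib-only Python sketch; `build_request` and `chat` are illustrative helpers (not part of Ollama), and the commented call assumes Ollama is listening on its default port:

```python
import json
import urllib.request

def build_request(base_url, model, messages):
    """Build an OpenAI-style chat completion request for any compatible server."""
    data = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )

def chat(base_url, model, messages):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(base_url, model, messages)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires a running Ollama server:
# chat("http://localhost:11434/v1", "llama3.3",
#      [{"role": "user", "content": "Hello!"}])
```

Because the request shape is identical to OpenAI's, swapping `base_url` is the only change needed to move between local and cloud backends.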
vLLM — Production-Grade Serving
For production deployments that need performance:
pip install vllm
# Serve a model
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 32768
Why vLLM over Ollama for production:
- Continuous batching — handles many concurrent requests efficiently
- PagedAttention — uses GPU memory more efficiently
- Higher throughput — 2-4x faster than naive serving
- OpenAI-compatible API built in
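To see why continuous batching matters, here is a toy scheduler simulation (the scheduling idea only, not vLLM code): static batching holds every slot until the longest request in the batch finishes, while continuous batching refills a slot the moment its request completes.

```python
import heapq

# Toy model: each request needs `length` decode steps; the server has
# `slots` parallel slots. Both functions return total steps to drain the
# queue. Illustrative only -- vLLM's real scheduler also does token-level
# admission, PagedAttention memory management, etc.

def static_batching_steps(lengths, slots):
    """Admit a full batch, then wait for its longest request before refilling."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching_steps(lengths, slots):
    """Refill each slot as soon as its request finishes."""
    finish = [0] * min(slots, len(lengths))  # per-slot busy-until times
    heapq.heapify(finish)
    for length in lengths:
        start = heapq.heappop(finish)
        heapq.heappush(finish, start + length)
    return max(finish)

reqs = [100, 10, 10, 10, 100, 10, 10, 10]  # mixed long and short requests
print(static_batching_steps(reqs, 4))      # 200
print(continuous_batching_steps(reqs, 4))  # 110
```

Short requests stop waiting behind long ones, which is where the real-world throughput gains come from.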
Ollama vs. vLLM:
| Feature | Ollama | vLLM |
|---|---|---|
| Setup | 1 minute | 30 minutes |
| Best for | Development, small teams | Production, high volume |
| Throughput | Good | Excellent |
| Concurrent users | 5-10 | 100+ |
| GPU utilization | Good | Optimized |
| Model format | GGUF (quantized by default) | Safetensors (full weights; quantization optional) |
GPU Requirements
Choosing the right hardware:
| Model Size | Min VRAM (4-bit quant) | Recommended GPU | Approx. GPU Cost |
|---|---|---|---|
| 7B params | 8 GB | RTX 4070 | $500 |
| 14B params | 12 GB | RTX 4080 | $1,000 |
| 30B params | 24 GB | RTX 4090 | $1,600 |
| 70B params | 48 GB | 2x RTX 4090 or A6000 | $3,200+ |
| 100B+ params | 80 GB | A100 or H100 | $10,000+ |
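The table's VRAM column follows from a back-of-envelope rule of thumb (a heuristic, not a guarantee): weights take roughly parameters × bytes per parameter, plus ~20% overhead for the KV cache and activations.

```python
def estimate_vram_gb(params_billions, bytes_per_param=0.5, overhead=1.2):
    """Rough VRAM estimate: weights (params x bytes/param) plus ~20% overhead.

    bytes_per_param: 2.0 for fp16, 1.0 for 8-bit, 0.5 for 4-bit quantization.
    Back-of-envelope heuristic only; long contexts inflate the KV cache further.
    """
    return params_billions * bytes_per_param * overhead

print(round(estimate_vram_gb(7), 1))        # 4.2   -> fits in an 8 GB card
print(round(estimate_vram_gb(70), 1))       # 42.0  -> needs ~48 GB
print(round(estimate_vram_gb(70, 2.0), 1))  # 168.0 -> fp16 forces multi-GPU
```

This is why 70B models at 4-bit fit on 2x RTX 4090 (48 GB combined) but need several GPUs at full fp16 precision.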
Cloud GPU rental (when you don't want to buy):
| Provider | A100 (80 GB) | H100 | Approx. Monthly (24/7) |
|---|---|---|---|
| Lambda Labs | $1.10/hr | $2.49/hr | $800-1,800 |
| RunPod | $1.64/hr | $3.89/hr | $1,200-2,800 |
| AWS (p5) | $8.00/hr | $12.00/hr | $5,760-8,640 |
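The monthly column is just the hourly rate times hours of round-the-clock usage (~730 hours/month). A quick check, using the table's illustrative rates:

```python
HOURS_PER_MONTH = 730  # 24/7 usage; scale down if GPUs are shut off overnight

def monthly_cost(hourly_rate, gpus=1, hours=HOURS_PER_MONTH):
    """Monthly cloud-GPU bill for continuous rental."""
    return hourly_rate * gpus * hours

print(round(monthly_cost(1.10)))          # 803  -> one A100 at $1.10/hr
print(round(monthly_cost(1.10, gpus=2)))  # 1606 -> a 2x A100 setup
```

Rates change frequently, so rerun this with current pricing before committing to a provider.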
Cost comparison: Running Llama 3.3 70B on 2x A100 costs ~$1,600/month in cloud GPUs. At 20,000 requests/day, this replaces ~$6,000/month in Claude Sonnet API costs.
The Hybrid Architecture
Most organizations benefit from combining cloud and self-hosted:
┌─────────────────────────────────────────┐
│ Model Prism Gateway │
├────────────┬────────────┬───────────────┤
│ │ │ │
▼ ▼ ▼ ▼
Ollama OpenAI Anthropic AWS Bedrock
(local) (cloud) (cloud) (managed)
│ │ │ │
Simple Creative Complex Compliant
tasks tasks reasoning workloads
Free/token $$ $$$ $$
Routing rules:
- Simple tasks (classification, extraction) → Ollama (free)
- Standard tasks (code, docs) → Cloud API (balanced)
- Complex tasks (architecture, analysis) → Frontier cloud model
- Regulated data → Bedrock or Ollama (compliance)
Failover: If Ollama is overloaded, fall back to a cloud API. If the cloud is down, fall back to Ollama for critical operations.
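The routing rules plus failover can be sketched as a small dispatcher. Backend names, task categories, and the health set below are illustrative assumptions, not Model Prism's actual API:

```python
# Illustrative routing sketch -- backend names and task categories are
# hypothetical, not any gateway's real configuration.

ROUTES = {
    "simple":    "ollama",    # classification, extraction -> free local model
    "standard":  "cloud",     # code, docs -> balanced cloud API
    "complex":   "frontier",  # architecture, analysis -> frontier cloud model
    "regulated": "bedrock",   # compliance-bound data -> managed backend
}

FALLBACKS = {
    "ollama":   "cloud",   # local overloaded -> spill to cloud
    "cloud":    "ollama",  # cloud outage -> degrade to local
    "frontier": "cloud",
    "bedrock":  None,      # regulated data must not leave compliant backends
}

def route(task_type, healthy):
    """Pick a backend for task_type, falling back once if it is unhealthy."""
    primary = ROUTES[task_type]
    if primary in healthy:
        return primary
    fallback = FALLBACKS[primary]
    if fallback in healthy:
        return fallback
    raise RuntimeError(f"no healthy backend for {task_type!r} task")

print(route("simple", healthy={"ollama", "cloud"}))  # ollama
print(route("simple", healthy={"cloud"}))            # cloud (Ollama down)
```

Note the deliberate `None` fallback for regulated traffic: failing closed is safer than routing sensitive data to a non-compliant backend.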
Self-Hosting Checklist
Before going self-hosted:
- Calculate break-even: How many requests/day? At what model tier?
- Choose hardware: Buy GPUs, rent cloud GPUs, or use existing servers?
- Pick your stack: Ollama (simple) or vLLM (production)?
- Select models: Which open-source models match your quality needs?
- Plan for updates: How will you update to newer model versions?
- Set up monitoring: GPU utilization, request latency, queue depth
- Configure backup: What happens when the GPU server goes down?
- Security: API authentication, network isolation, access logging
- Integration: Connect to Model Prism for unified routing and tracking
Start small: Run Ollama on a developer's workstation for a week. Measure quality and speed. Then decide whether to invest in dedicated GPU infrastructure.
---quiz question: At approximately what request volume does self-hosting become cheaper than cloud APIs? options:
- { text: "At any volume — self-hosting is always cheaper", correct: false }
- { text: "Around 10,000+ requests per day for medium-sized models", correct: true }
- { text: "Only at 1 million+ requests per day", correct: false } feedback: Self-hosting has a fixed GPU cost but zero per-token cost. The break-even point depends on model size and cloud pricing, but typically falls around 10,000 daily requests for medium-sized models.
---quiz question: What is the key advantage of vLLM over Ollama for production deployments? options:
- { text: "vLLM is easier to install", correct: false }
- { text: "vLLM handles many concurrent requests efficiently with continuous batching", correct: true }
- { text: "vLLM supports more model formats", correct: false } feedback: vLLM uses continuous batching and PagedAttention to serve 100+ concurrent users efficiently, achieving 2-4x higher throughput than simpler serving solutions. Ollama is easier but designed for development and small teams.
---quiz question: What is the "hybrid architecture" for AI model hosting? options:
- { text: "Running two copies of the same model for redundancy", correct: false }
- { text: "Combining self-hosted models for simple/private tasks with cloud APIs for complex tasks", correct: true }
- { text: "Using both GPUs and CPUs on the same server", correct: false } feedback: The hybrid architecture routes simple and privacy-sensitive tasks to local models (free, private) while sending complex tasks to cloud APIs (higher quality). A gateway like Model Prism manages the routing automatically.