Self-Hosting AI Models
When cloud APIs don't fit — privacy requirements, air-gapped networks, or pure economics — self-hosting is the answer.
Why Self-Host?
Four compelling reasons:
1. Data Privacy
- Sensitive data never leaves your network
- No third-party logs of your prompts or responses
- Required for classified environments, healthcare (HIPAA), finance
2. Cost at Scale
- Zero per-token cost after hardware investment
- Break-even at ~10,000 requests/day for medium models
- Typically 50-80% cheaper than cloud APIs at sustained high volume
3. Air-Gapped Operations
- Military, government, critical infrastructure
- No internet dependency
- Full operational control
4. Customization
- Fine-tune models on your proprietary data
- Custom tokenizers for domain-specific vocabulary
- Full control over model versions and updates
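The break-even claim in reason 2 can be sanity-checked with rough arithmetic. A minimal sketch, where the GPU cost and per-request cloud price are illustrative assumptions, not quotes:

```python
# Rough self-hosting break-even estimate. All prices are illustrative
# assumptions -- adjust for your actual cloud rates and hardware costs.

def breakeven_requests_per_day(monthly_gpu_cost, cloud_cost_per_request):
    """Daily volume at which a fixed GPU bill matches per-request cloud spend."""
    return monthly_gpu_cost / (cloud_cost_per_request * 30)

# Example: $1,500/month of GPU capacity vs. ~$0.005 per cloud request
# (a medium model, a few thousand tokens per request).
print(round(breakeven_requests_per_day(1500, 0.005)))  # 10000
```

Below that volume the idle GPU is dead weight; above it, every additional request is effectively free.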
Ollama — The Easiest Path
Ollama makes running local models as simple as Docker:
Install:
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Or with Homebrew
brew install ollama
Run a model:
ollama run llama3.3 # Chat with Llama 3.3 70B
ollama run qwen3:30b # Qwen 3 30B
ollama run phi4 # Microsoft Phi-4 14B
ollama run codellama:34b # Code-specialized Llama
Use as an API (OpenAI-compatible):
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.3",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Connect to your tools:
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="not-needed"
# Now every OpenAI-compatible tool uses your local model
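The same call works from any language via plain HTTP. A stdlib-only Python sketch; `build_request` and `chat` are illustrative helpers (not part of Ollama), and the commented call assumes Ollama is listening on its default port:

```python
import json
import urllib.request

def build_request(base_url, model, messages):
    """Build an OpenAI-style chat completion request for any compatible server."""
    data = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )

def chat(base_url, model, messages):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(base_url, model, messages)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires a running Ollama server:
# chat("http://localhost:11434/v1", "llama3.3",
#      [{"role": "user", "content": "Hello!"}])
```

Because the request shape is identical to OpenAI's, swapping `base_url` is the only change needed to move between local and cloud backends.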
vLLM — Production-Grade Serving
For production deployments that need performance:
pip install vllm
# Serve a model
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 32768
Why vLLM over Ollama for production:
- Continuous batching — handles many concurrent requests efficiently
- PagedAttention — uses GPU memory more efficiently
- Higher throughput — 2-4x faster than naive serving
- OpenAI-compatible API built in
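To see why continuous batching matters, here is a toy scheduler simulation (the scheduling idea only, not vLLM code): static batching holds every slot until the longest request in the batch finishes, while continuous batching refills a slot the moment its request completes.

```python
import heapq

# Toy model: each request needs `length` decode steps; the server has
# `slots` parallel slots. Both functions return total steps to drain the
# queue. Illustrative only -- vLLM's real scheduler also does token-level
# admission, PagedAttention memory management, etc.

def static_batching_steps(lengths, slots):
    """Admit a full batch, then wait for its longest request before refilling."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching_steps(lengths, slots):
    """Refill each slot as soon as its request finishes."""
    finish = [0] * min(slots, len(lengths))  # per-slot busy-until times
    heapq.heapify(finish)
    for length in lengths:
        start = heapq.heappop(finish)
        heapq.heappush(finish, start + length)
    return max(finish)

reqs = [100, 10, 10, 10, 100, 10, 10, 10]  # mixed long and short requests
print(static_batching_steps(reqs, 4))      # 200
print(continuous_batching_steps(reqs, 4))  # 110
```

Short requests stop waiting behind long ones, which is where the real-world throughput gains come from.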
Ollama vs. vLLM:
| Feature | Ollama | vLLM |
|---|---|---|
| Setup | 1 minute | 30 minutes |
| Best for | Development, small teams | Production, high volume |
| Throughput | Good | Excellent |
| Concurrent users | 5-10 | 100+ |
| GPU utilization | Good | Optimized |
| Model format | GGUF (quantized by default) | Safetensors (full weights; quantization optional) |
GPU Requirements
Choosing the right hardware:
| Model Size | Min VRAM (4-bit quant) | Recommended GPU | Approx. GPU Cost |
|---|---|---|---|
| 7B params | 8 GB | RTX 4070 | $500 |
| 14B params | 12 GB | RTX 4080 | $1,000 |
| 30B params | 24 GB | RTX 4090 | $1,600 |
| 70B params | 48 GB | 2x RTX 4090 or A6000 | $3,200+ |
| 100B+ params | 80 GB | A100 or H100 | $10,000+ |
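The table's VRAM column follows from a back-of-envelope rule of thumb (a heuristic, not a guarantee): weights take roughly parameters × bytes per parameter, plus ~20% overhead for the KV cache and activations.

```python
def estimate_vram_gb(params_billions, bytes_per_param=0.5, overhead=1.2):
    """Rough VRAM estimate: weights (params x bytes/param) plus ~20% overhead.

    bytes_per_param: 2.0 for fp16, 1.0 for 8-bit, 0.5 for 4-bit quantization.
    Back-of-envelope heuristic only; long contexts inflate the KV cache further.
    """
    return params_billions * bytes_per_param * overhead

print(round(estimate_vram_gb(7), 1))        # 4.2   -> fits in an 8 GB card
print(round(estimate_vram_gb(70), 1))       # 42.0  -> needs ~48 GB
print(round(estimate_vram_gb(70, 2.0), 1))  # 168.0 -> fp16 forces multi-GPU
```

This is why 70B models at 4-bit fit on 2x RTX 4090 (48 GB combined) but need several GPUs at full fp16 precision.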
Cloud GPU rental (when you don't want to buy):
| Provider | A100 (80 GB) | H100 | Approx. Monthly (24/7) |
|---|---|---|---|
| Lambda Labs | $1.10/hr | $2.49/hr | $800-1,800 |
| RunPod | $1.64/hr | $3.89/hr | $1,200-2,800 |
| AWS (p5) | $8.00/hr | $12.00/hr | $5,760-8,640 |
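The monthly column is just the hourly rate times hours of round-the-clock usage (~730 hours/month). A quick check, using the table's illustrative rates:

```python
HOURS_PER_MONTH = 730  # 24/7 usage; scale down if GPUs are shut off overnight

def monthly_cost(hourly_rate, gpus=1, hours=HOURS_PER_MONTH):
    """Monthly cloud-GPU bill for continuous rental."""
    return hourly_rate * gpus * hours

print(round(monthly_cost(1.10)))          # 803  -> one A100 at $1.10/hr
print(round(monthly_cost(1.10, gpus=2)))  # 1606 -> a 2x A100 setup
```

Rates change frequently, so rerun this with current pricing before committing to a provider.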
Cost comparison: Running Llama 3.3 70B on 2x A100 costs ~$1,600/month in cloud GPUs. At 20,000 requests/day, this replaces ~$6,000/month in Claude Sonnet API costs.
The Hybrid Architecture
Most organizations benefit from combining cloud and self-hosted:
┌─────────────────────────────────────────┐
│ Model Prism Gateway │
├────────────┬────────────┬───────────────┤
│ │ │ │
▼ ▼ ▼ ▼
Ollama OpenAI Anthropic AWS Bedrock
(local) (cloud) (cloud) (managed)
│ │ │ │
Simple Creative Complex Compliant
tasks tasks reasoning workloads
Free/token $$ $$$ $$
Routing rules:
- Simple tasks (classification, extraction) → Ollama (free)
- Standard tasks (code, docs) → Cloud API (balanced)
- Complex tasks (architecture, analysis) → Frontier cloud model
- Regulated data → Bedrock or Ollama (compliance)
Failover: If Ollama is overloaded, fall back to a cloud API. If the cloud is down, fall back to Ollama for critical operations.
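The routing rules plus failover can be sketched as a small dispatcher. Backend names, task categories, and the health set below are illustrative assumptions, not Model Prism's actual API:

```python
# Illustrative routing sketch -- backend names and task categories are
# hypothetical, not any gateway's real configuration.

ROUTES = {
    "simple":    "ollama",    # classification, extraction -> free local model
    "standard":  "cloud",     # code, docs -> balanced cloud API
    "complex":   "frontier",  # architecture, analysis -> frontier cloud model
    "regulated": "bedrock",   # compliance-bound data -> managed backend
}

FALLBACKS = {
    "ollama":   "cloud",   # local overloaded -> spill to cloud
    "cloud":    "ollama",  # cloud outage -> degrade to local
    "frontier": "cloud",
    "bedrock":  None,      # regulated data must not leave compliant backends
}

def route(task_type, healthy):
    """Pick a backend for task_type, falling back once if it is unhealthy."""
    primary = ROUTES[task_type]
    if primary in healthy:
        return primary
    fallback = FALLBACKS[primary]
    if fallback in healthy:
        return fallback
    raise RuntimeError(f"no healthy backend for {task_type!r} task")

print(route("simple", healthy={"ollama", "cloud"}))  # ollama
print(route("simple", healthy={"cloud"}))            # cloud (Ollama down)
```

Note the deliberate `None` fallback for regulated traffic: failing closed is safer than routing sensitive data to a non-compliant backend.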
Self-Hosting Checklist
Before going self-hosted:
- Calculate break-even: How many requests/day? At what model tier?
- Choose hardware: Buy GPUs, rent cloud GPUs, or use existing servers?
- Pick your stack: Ollama (simple) or vLLM (production)?
- Select models: Which open-source models match your quality needs?
- Plan for updates: How will you update to newer model versions?
- Set up monitoring: GPU utilization, request latency, queue depth
- Configure backup: What happens when the GPU server goes down?
- Security: API authentication, network isolation, access logging
- Integration: Connect to Model Prism for unified routing and tracking
Start small: Run Ollama on a developer's workstation for a week. Measure quality and speed. Then decide whether to invest in dedicated GPU infrastructure.
---quiz question: At approximately what request volume does self-hosting become cheaper than cloud APIs? options:
- { text: "At any volume — self-hosting is always cheaper", correct: false }
- { text: "Around 10,000+ requests per day for medium-sized models", correct: true }
- { text: "Only at 1 million+ requests per day", correct: false } feedback: Self-hosting has a fixed GPU cost but zero per-token cost. The break-even point depends on model size and cloud pricing, but typically falls around 10,000 daily requests for medium-sized models.
---quiz question: What is the key advantage of vLLM over Ollama for production deployments? options:
- { text: "vLLM is easier to install", correct: false }
- { text: "vLLM handles many concurrent requests efficiently with continuous batching", correct: true }
- { text: "vLLM supports more model formats", correct: false } feedback: vLLM uses continuous batching and PagedAttention to serve 100+ concurrent users efficiently, achieving 2-4x higher throughput than simpler serving solutions. Ollama is easier but designed for development and small teams.
---quiz question: What is the "hybrid architecture" for AI model hosting? options:
- { text: "Running two copies of the same model for redundancy", correct: false }
- { text: "Combining self-hosted models for simple/private tasks with cloud APIs for complex tasks", correct: true }
- { text: "Using both GPUs and CPUs on the same server", correct: false } feedback: The hybrid architecture routes simple and privacy-sensitive tasks to local models (free, private) while sending complex tasks to cloud APIs (higher quality). A gateway like Model Prism manages the routing automatically.