Ollama
A local LLM runtime that packages model weights, tokenizer, and runtime configuration into a single manageable unit, making it trivial to run open-weight models on developer machines, CI pipelines, and air-gapped environments.
Why Ollama Matters
Ollama is not trying to be a high-throughput production serving engine. Its value proposition is developer experience: `ollama run llama3` and you have a working LLM in seconds. No Python environment, no CUDA toolkit configuration, no HuggingFace token dance. This makes it the fastest path from “I want to try a model” to “I have a model running locally.”
For enterprise teams, Ollama fills three specific gaps: local development parity with production LLM APIs, CI/CD integration testing without cloud API costs, and air-gapped deployment where no external network access is permitted.
Architecture
Ollama is a single Go binary that bundles llama.cpp (for inference) with a model management layer and an HTTP API server.
```
+-----------------------------------------------------------+
|                      Ollama Process                       |
|  +----------------+  +---------------+  +--------------+  |
|  | Model Registry |  |   llama.cpp   |  |   HTTP API   |  |
|  | (pull, list,   |  |    Engine     |  |    Server    |  |
|  |  create, push) |  | (GGUF format) |  |    :11434    |  |
|  +-------+--------+  +-------+-------+  +------+-------+  |
|          |                   |                 |          |
|  +-------v-------------------v-----------------v-------+  |
|  |          Model Storage (~/.ollama/models)           |  |
|  |     Quantized GGUF files, Modelfiles, manifests     |  |
|  +-----------------------------------------------------+  |
+-----------------------------------------------------------+
           |                                     |
     +-----v-----+                        +------v------+
     |  CPU/GPU  |                        | System RAM  |
     | Inference |                        |   / VRAM    |
     +-----------+                        +-------------+
```
Model Format: GGUF
Ollama uses the GGUF format (from the llama.cpp project), which packages model weights in quantized form. Quantization levels range from Q2_K (smallest, lowest quality) to F16 (full half-precision, highest quality). The sweet spot for most use cases is Q4_K_M: roughly 4 bits per weight, which gives ~70-80% of full-precision quality at about 25% of the memory footprint.
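The arithmetic behind that footprint claim is simple. A back-of-envelope sketch (the ~4.5 effective bits/weight for Q4_K_M is an approximation, and real memory use adds KV cache and runtime overhead on top of the weights):

```python
def approx_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate storage for model weights alone, in gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Llama 3 8B: Q4_K_M (~4.5 effective bits) vs. F16 (16 bits)
q4 = approx_weights_gb(8, 4.5)    # ~4.5 GB
f16 = approx_weights_gb(8, 16)    # ~16 GB
print(f"Q4_K_M: ~{q4:.1f} GB, F16: ~{f16:.1f} GB")
```

This is why an 8B model that would need a 24 GB GPU at full precision runs comfortably on a 16 GB laptop once quantized.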
Modelfile (Dockerfile for Models)
```
FROM llama3
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM """You are a helpful code review assistant. Focus on security issues,
performance problems, and maintainability concerns. Be concise."""
```
Build and run: `ollama create code-reviewer -f Modelfile && ollama run code-reviewer`
This is powerful for standardizing model configurations across a team: version-control the Modelfile and share it via a registry.
API
Ollama exposes a REST API on port 11434 that is straightforward to integrate with.
Generate (Completion)
```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain Kubernetes pods in one paragraph",
  "stream": false
}'
```
Chat (Multi-turn)
```shell
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {"role": "system", "content": "You are a K8s expert."},
    {"role": "user", "content": "What is a DaemonSet?"}
  ],
  "stream": false
}'
```
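With `"stream": true` (the API's default), Ollama returns newline-delimited JSON, one chunk per line, with a final `"done": true` object. A minimal stdlib-only Python client is sketched below; the model name and local endpoint are assumptions, and the server must already be running:

```python
import json
import urllib.request

def collect_stream(lines) -> str:
    """Join message content from Ollama's newline-delimited JSON chunks."""
    chunks = []
    for line in lines:
        if not line:
            continue
        part = json.loads(line)
        if part.get("done"):
            break
        chunks.append(part.get("message", {}).get("content", ""))
    return "".join(chunks)

def stream_chat(prompt: str, model: str = "llama3",
                base_url: str = "http://localhost:11434") -> str:
    """POST to /api/chat with streaming enabled and assemble the reply."""
    body = {"model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True}
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return collect_stream(resp)
```

Streaming matters for interactive UIs: with a local 8B model, time-to-first-token is typically far shorter than time-to-full-response.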
OpenAI Compatibility
Since early 2024, Ollama has exposed OpenAI-compatible endpoints under /v1, including /v1/chat/completions. This means any tool or SDK using the OpenAI client library can point to Ollama:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK but not validated by Ollama
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello"}],
)
```
Model Management
```shell
ollama pull llama3          # Download a model
ollama list                 # List installed models
ollama show llama3          # Show model details (params, license, size)
ollama cp llama3 my-llama3  # Copy/alias a model
ollama rm old-model         # Delete a model
ollama ps                   # Show running models and memory usage
```
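Each CLI command maps onto a REST endpoint (`ollama list` corresponds to `GET /api/tags`, `ollama ps` to `GET /api/ps`), so the same operations can be scripted. A small stdlib-only sketch, assuming a local server:

```python
import json
import urllib.request

def summarize_tags(data: dict) -> list:
    """Reduce an /api/tags response to (name, size-in-GB) pairs."""
    return [(m["name"], round(m["size"] / 1e9, 1))
            for m in data.get("models", [])]

def list_models(base_url: str = "http://localhost:11434") -> list:
    """Equivalent of `ollama list` over the REST API."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return summarize_tags(json.load(resp))
```

This is handy in CI, e.g. to assert the expected model is present before running integration tests.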
Supported Models
Ollama’s model library includes most popular open-weight models, pre-quantized and ready to pull:
| Model | Sizes | Use Case |
|---|---|---|
| Llama 3 / 3.1 | 8B, 70B | General purpose, strong reasoning |
| Mistral / Mixtral | 7B, 8x7B | Fast inference, good code |
| Qwen 2.5 | 0.5B-72B | Multilingual, strong on benchmarks |
| Phi-3 / Phi-4 | 3.8B, 14B | Small but capable, good for edge |
| Gemma 2 | 2B, 9B, 27B | Google’s open model, efficient |
| DeepSeek Coder | 1.3B-33B | Code generation and completion |
| CodeLlama | 7B, 13B, 34B | Code-specialized Llama |
| Llava | 7B, 13B | Multi-modal (image + text) |
| Command-R | 35B, 104B | RAG-optimized |
Custom models from HuggingFace can be imported if they are in GGUF format or convertible to it.
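For example, a downloaded GGUF file can be wrapped in a Modelfile whose `FROM` points at the local file (the filename below is illustrative):

```
FROM ./mistral-7b-instruct-v0.2.Q4_K_M.gguf
PARAMETER temperature 0.2
```

followed by `ollama create my-mistral -f Modelfile`.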
Use Cases
1. Local Development and Testing
Run the same model your production system uses (or a smaller variant) locally. Test prompt engineering, tool calling, and agent logic without API costs or rate limits.
```shell
# Dev machine: run a small model for fast iteration
ollama run phi3

# CI pipeline: run integration tests against a local model
docker run -d --name ollama -p 11434:11434 ollama/ollama
docker exec ollama ollama pull llama3:8b
pytest tests/integration/ --llm-base-url=http://localhost:11434/v1
```
2. Air-Gapped / Regulated Environments
In environments where data cannot leave the network (healthcare, defense, financial services), Ollama provides LLM capabilities without any external API calls. Pre-pull models on an internet-connected machine, then transfer the ~/.ollama/models directory to the air-gapped system.
3. Privacy-Sensitive Applications
All inference happens locally. No data is sent to any external service. Useful for processing sensitive documents, PII-containing data, or proprietary code.
4. Edge and Embedded
Ollama runs on macOS, Linux, and Windows. With small models (Phi-3 3.8B, Gemma 2B), it runs reasonably well on laptops and even some single-board computers with sufficient RAM.
Limitations for Production
Ollama is explicitly not designed for production serving at scale. Understanding these limitations is critical before choosing it over vLLM or TGI.
| Limitation | Detail | Production Alternative |
|---|---|---|
| Single-request inference | No continuous batching; processes one request at a time per model | vLLM, TGI |
| No tensor parallelism | Cannot split a model across multiple GPUs | vLLM, TensorRT-LLM |
| No autoscaling | Single process, no built-in horizontal scaling | vLLM + K8s HPA, KServe |
| Limited observability | Basic logging only, no Prometheus metrics, no distributed tracing | vLLM (Prometheus), TGI |
| No auth/RBAC | API is open by default, no token validation | API gateway in front |
| GGUF only | Cannot serve models in safetensors/PyTorch format directly | vLLM (native HF support) |
| Quantized only (practical) | Full-precision models use too much CPU memory to be useful | vLLM on GPUs |
When Ollama in Kubernetes Makes Sense
Despite the limitations, there are valid K8s deployments:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          limits:
            memory: "16Gi"
            cpu: "8"
        volumeMounts:
        - name: models
          mountPath: /root/.ollama
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models
```
This works for internal tools, dev environments, and low-traffic internal APIs where the simplicity of Ollama outweighs the performance limitations. Do not use this pattern for customer-facing production traffic.
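To make the deployment reachable inside the cluster, a ClusterIP Service can sit in front of it (a sketch; it assumes the pods carry an `app: ollama` label):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
```

Remember that Ollama has no authentication of its own, so keep the Service internal or put an authenticating gateway in front of it.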
Ollama vs vLLM: Decision Matrix
| Factor | Ollama | vLLM |
|---|---|---|
| Setup time | Minutes | Hours (CUDA, drivers, model conversion) |
| Throughput | Low (single request) | High (continuous batching, PagedAttention) |
| GPU utilization | Basic | Optimized (95%+ memory efficiency) |
| Model format | GGUF (quantized) | Native HF (safetensors, FP16/BF16) |
| Multi-GPU | No | Yes (tensor + pipeline parallelism) |
| API compatibility | OpenAI-compatible | OpenAI-compatible |
| Best for | Dev, testing, air-gapped, privacy | Production serving at scale |
When to Use Ollama
Use Ollama when:
- You need a fast local LLM for development, prompt engineering, or testing
- You are in an air-gapped or high-security environment
- You want to run integration tests in CI without cloud API costs
- You need a simple, low-maintenance LLM for internal tools with low traffic
- You are evaluating models and want to quickly try different options
Skip Ollama when:
- You need to serve multiple concurrent users in production (use vLLM)
- You need maximum inference performance on GPUs (use vLLM or TensorRT-LLM)
- You need autoscaling based on demand (use vLLM + K8s HPA or KServe)
- You are serving full-precision models and need every bit of quality (use vLLM with FP16)
References
- Ollama official site: https://ollama.ai
- Ollama GitHub: https://github.com/ollama/ollama
- Ollama API documentation: https://github.com/ollama/ollama/blob/main/docs/api.md
- Ollama model library: https://ollama.ai/library
- GGUF format specification: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
- Ollama Docker image: https://hub.docker.com/r/ollama/ollama