
Ollama

A local LLM runtime that packages model weights, tokenizer, and runtime configuration into a single manageable unit, making it trivial to run open-weight models on developer machines, CI pipelines, and air-gapped environments.


Why Ollama Matters

Ollama is not trying to be a high-throughput production serving engine. Its value proposition is developer experience: ollama run llama3 and you have a working LLM in seconds. No Python environment, no CUDA toolkit configuration, no HuggingFace token dance. This makes it the fastest path from “I want to try a model” to “I have a model running locally.”

For enterprise teams, Ollama fills three specific gaps: local development parity with production LLM APIs, CI/CD integration testing without cloud API costs, and air-gapped deployment where no external network access is permitted.


Architecture

Ollama is a single Go binary that bundles llama.cpp (for inference) with a model management layer and an HTTP API server.

+-----------------------------------------------------------+
|  Ollama Process                                           |
|  +-----------------+  +---------------+  +--------------+ |
|  | Model Registry  |  | llama.cpp     |  | HTTP API     | |
|  | (pull, list,    |  | Engine        |  | Server       | |
|  |  create, push)  |  | (GGUF format) |  | :11434       | |
|  +--------+--------+  +-------+-------+  +------+-------+ |
|           |                   |                 |         |
|  +--------v-------------------v-----------------v-------+ |
|  | Model Storage (~/.ollama/models)                     | |
|  | Quantized GGUF files, Modelfiles, manifests          | |
|  +------------------------------------------------------+ |
+-----------------------------------------------------------+
         |                   |
    +----v------+      +-----v------+
    | CPU/GPU   |      | System RAM |
    | Inference |      | / VRAM     |
    +-----------+      +------------+

Model Format: GGUF

Ollama uses the GGUF format (from the llama.cpp project), which packages model weights in quantized form. Quantization levels range from Q2_K (smallest, lowest quality) to F16 (full half-precision, highest quality). The sweet spot for most use cases is Q4_K_M: roughly 4–5 bits per weight, which preserves most of the full-precision quality at under a third of the F16 memory footprint.
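
The arithmetic behind these trade-offs is easy to sketch. A back-of-the-envelope estimator (the bits-per-weight figures below are approximate averages I am assuming; real GGUF files mix tensor precisions and carry metadata, so actual file sizes differ):

```python
# Rough memory-footprint estimate for model weights at different GGUF
# quantization levels. Values are approximate effective bits per weight.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.8,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def estimate_gib(n_params: float, quant: str) -> float:
    """Approximate weight storage in GiB for a model with n_params parameters."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 2**30

for quant in BITS_PER_WEIGHT:
    print(f"8B model at {quant:7s}: ~{estimate_gib(8e9, quant):.1f} GiB")
```

For an 8B model this puts Q4_K_M at roughly 4.5 GiB against roughly 15 GiB for F16, which is why a quantized 8B model fits comfortably on a 16 GB laptop while the full-precision version does not.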

Modelfile (Dockerfile for Models)

FROM llama3

PARAMETER temperature 0.7
PARAMETER num_ctx 8192

SYSTEM """You are a helpful code review assistant. Focus on security issues,
performance problems, and maintainability concerns. Be concise."""

Build and run: ollama create code-reviewer -f Modelfile && ollama run code-reviewer

This is powerful for standardizing model configurations across a team – version-control the Modelfile, share via a registry.
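
Because a Modelfile is plain text, it can also be generated from a script and registered with ollama create. A sketch (the helper name build_modelfile is mine, and the create step is guarded since it assumes the ollama CLI is installed):

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def build_modelfile(base: str, system: str, **params: object) -> str:
    """Render Modelfile text from a base model, parameters, and a system prompt."""
    lines = [f"FROM {base}", ""]
    lines += [f"PARAMETER {k} {v}" for k, v in params.items()]
    lines += ["", f'SYSTEM """{system}"""', ""]
    return "\n".join(lines)

text = build_modelfile(
    "llama3",
    "You are a helpful code review assistant. Be concise.",
    temperature=0.7,
    num_ctx=8192,
)

# Only invoke `ollama create` if the CLI is actually on PATH.
if shutil.which("ollama"):
    with tempfile.TemporaryDirectory() as d:
        mf = Path(d) / "Modelfile"
        mf.write_text(text)
        subprocess.run(
            ["ollama", "create", "code-reviewer", "-f", str(mf)], check=True
        )
```

This makes per-team model presets a build artifact: generate the Modelfile in CI, create the model, and every developer pulls the same configuration.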


API

Ollama exposes a REST API on port 11434 that is straightforward to integrate with.

Generate (Completion)

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain Kubernetes pods in one paragraph",
  "stream": false
}'

Chat (Multi-turn)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {"role": "system", "content": "You are a K8s expert."},
    {"role": "user", "content": "What is a DaemonSet?"}
  ],
  "stream": false
}'

OpenAI Compatibility

Since early 2024, Ollama supports OpenAI-compatible endpoints at /v1/chat/completions. This means any tool or SDK using the OpenAI client library can point to Ollama:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by SDK but not validated
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello"}]
)

Model Management

ollama pull llama3              # Download a model
ollama list                     # List installed models
ollama show llama3              # Show model details (params, license, size)
ollama cp llama3 my-llama3      # Copy/alias a model
ollama rm old-model             # Delete a model
ollama ps                       # Show running models and memory usage
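
These CLI commands sit on top of the same HTTP API; ollama list, for instance, corresponds to GET /api/tags, which returns a JSON object with a "models" array. A stdlib-only sketch of reproducing it over HTTP (field names follow the API docs; the output formatting is my own):

```python
import json
import urllib.request

def list_models(base_url="http://localhost:11434"):
    """Fetch installed models via GET /api/tags (the API behind `ollama list`)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        data = json.load(resp)
    return [(m["name"], m.get("size", 0)) for m in data.get("models", [])]

def format_models(models):
    """Render (name, size-in-bytes) pairs as a small table."""
    return "\n".join(
        f"{name:30s} {size / 2**30:6.1f} GiB" for name, size in models
    )

# With a local server running:
#   print(format_models(list_models()))
```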

Supported Models

Ollama’s model library includes most popular open-weight models, pre-quantized and ready to pull:

| Model             | Sizes        | Use Case                           |
|-------------------|--------------|------------------------------------|
| Llama 3 / 3.1     | 8B, 70B      | General purpose, strong reasoning  |
| Mistral / Mixtral | 7B, 8x7B     | Fast inference, good code          |
| Qwen 2.5          | 0.5B-72B     | Multilingual, strong on benchmarks |
| Phi-3 / Phi-4     | 3.8B, 14B    | Small but capable, good for edge   |
| Gemma 2           | 2B, 9B, 27B  | Google’s open model, efficient     |
| DeepSeek Coder    | 1.3B-33B     | Code generation and completion     |
| CodeLlama         | 7B, 13B, 34B | Code-specialized Llama             |
| Llava             | 7B, 13B      | Multi-modal (image + text)         |
| Command-R         | 35B, 104B    | RAG-optimized                      |

Custom models from HuggingFace can be imported if they are in GGUF format or convertible to it.
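
If the weights are already a local GGUF file, importing them is a short Modelfile whose FROM points at the file (the filename here is illustrative):

```
FROM ./mistral-7b-instruct.Q4_K_M.gguf
PARAMETER num_ctx 4096
```

ollama create my-mistral -f Modelfile then makes it available like any pulled model.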


Use Cases

1. Local Development and Testing

Run the same model your production system uses (or a smaller variant) locally. Test prompt engineering, tool calling, and agent logic without API costs or rate limits.

# Dev machine: run a small model for fast iteration
ollama run phi3

# CI pipeline: run integration tests against a local model
docker run -d --name ollama ollama/ollama
docker exec ollama ollama pull llama3:8b
pytest tests/integration/ --llm-base-url=http://localhost:11434/v1
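
In CI it pays to fail fast (or skip) when the Ollama container has not finished starting. A small reachability probe, sketched with only the standard library (a running Ollama instance answers a plain GET / with "Ollama is running"):

```python
import urllib.error
import urllib.request

def server_available(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers on base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False

# In pytest, gate the whole integration module on the probe:
#   import pytest
#   pytestmark = pytest.mark.skipif(
#       not server_available("http://localhost:11434"),
#       reason="no local Ollama server",
#   )
```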

2. Air-Gapped / Regulated Environments

In environments where data cannot leave the network (healthcare, defense, financial services), Ollama provides LLM capabilities without any external API calls. Pre-pull models on an internet-connected machine, then transfer the ~/.ollama/models directory to the air-gapped system.

3. Privacy-Sensitive Applications

All inference happens locally. No data is sent to any external service. Useful for processing sensitive documents, PII-containing data, or proprietary code.

4. Edge and Embedded

Ollama runs on macOS, Linux, and Windows. With small models (Phi-3 3.8B, Gemma 2B), it runs reasonably well on laptops and even some single-board computers with sufficient RAM.


Limitations for Production

Ollama is explicitly not designed for production serving at scale. Understanding these limitations is critical before choosing it over vLLM or TGI.

| Limitation                 | Detail                                                            | Production Alternative   |
|----------------------------|-------------------------------------------------------------------|--------------------------|
| Single-request inference   | No continuous batching; processes one request at a time per model | vLLM, TGI                |
| No tensor parallelism      | Cannot split a model across multiple GPUs                         | vLLM, TensorRT-LLM       |
| No autoscaling             | Single process, no built-in horizontal scaling                    | vLLM + K8s HPA, KServe   |
| Limited observability      | Basic logging only, no Prometheus metrics, no distributed tracing | vLLM (Prometheus), TGI   |
| No auth/RBAC               | API is open by default, no token validation                       | API gateway in front     |
| GGUF only                  | Cannot serve models in safetensors/PyTorch format directly        | vLLM (native HF support) |
| Quantized only (practical) | Full-precision models use too much CPU memory to be useful        | vLLM on GPUs             |

When Ollama in Kubernetes Makes Sense

Despite the limitations, there are valid K8s deployments:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          limits:
            memory: "16Gi"
            cpu: "8"
        volumeMounts:
        - name: models
          mountPath: /root/.ollama
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models

This works for internal tools, dev environments, and low-traffic internal APIs where the simplicity of Ollama outweighs the performance limitations. Do not use this pattern for customer-facing production traffic.


Ollama vs vLLM: Decision Matrix

| Factor            | Ollama                             | vLLM                                       |
|-------------------|------------------------------------|--------------------------------------------|
| Setup time        | Minutes                            | Hours (CUDA, drivers, model conversion)    |
| Throughput        | Low (single request)               | High (continuous batching, PagedAttention) |
| GPU utilization   | Basic                              | Optimized (95%+ memory efficiency)         |
| Model format      | GGUF (quantized)                   | Native HF (safetensors, FP16/BF16)         |
| Multi-GPU         | No                                 | Yes (tensor + pipeline parallelism)        |
| API compatibility | OpenAI-compatible                  | OpenAI-compatible                          |
| Best for          | Dev, testing, air-gapped, privacy  | Production serving at scale                |

When to Use Ollama

Use Ollama when:

  • You need a fast local LLM for development, prompt engineering, or testing
  • You are in an air-gapped or high-security environment
  • You want to run integration tests in CI without cloud API costs
  • You need a simple, low-maintenance LLM for internal tools with low traffic
  • You are evaluating models and want to quickly try different options

Skip Ollama when:

  • You need to serve multiple concurrent users in production (use vLLM)
  • You need maximum inference performance on GPUs (use vLLM or TensorRT-LLM)
  • You need autoscaling based on demand (use vLLM + K8s HPA or KServe)
  • You are serving full-precision models and need every bit of quality (use vLLM with FP16)

References

  • Ollama official site: https://ollama.ai
  • Ollama GitHub: https://github.com/ollama/ollama
  • Ollama API documentation: https://github.com/ollama/ollama/blob/main/docs/api.md
  • Ollama model library: https://ollama.ai/library
  • GGUF format specification: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
  • Ollama Docker image: https://hub.docker.com/r/ollama/ollama

This post is licensed under CC BY 4.0 by the author.