
Ollama

A local LLM runtime that packages model weights, tokenizer, and runtime configuration into a single manageable unit, making it trivial to run open-weight models on developer machines, CI pipelines, and air-gapped environments.


Why Ollama Matters

Ollama is not trying to be a high-throughput production serving engine. Its value proposition is developer experience: ollama run llama3 and you have a working LLM in seconds. No Python environment, no CUDA toolkit configuration, no HuggingFace token dance. This makes it the fastest path from “I want to try a model” to “I have a model running locally.”

For enterprise teams, Ollama fills three specific gaps: local development parity with production LLM APIs, CI/CD integration testing without cloud API costs, and air-gapped deployment where no external network access is permitted.


Architecture

Ollama is a single Go binary that bundles llama.cpp (for inference) with a model management layer and an HTTP API server.

+-----------------------------------------------------------+
|  Ollama Process                                           |
|  +-----------------+  +---------------+  +--------------+ |
|  | Model Registry  |  | llama.cpp     |  | HTTP API     | |
|  | (pull, list,    |  | Engine        |  | Server       | |
|  |  create, push)  |  | (GGUF format) |  | :11434       | |
|  +--------+--------+  +-------+-------+  +------+-------+ |
|           |                   |                 |         |
|  +--------v-------------------v-----------------v-------+ |
|  | Model Storage (~/.ollama/models)                     | |
|  | Quantized GGUF files, Modelfiles, manifests          | |
|  +------------------------------------------------------+ |
+-----------------------------------------------------------+
         |                   |
    +----v------+      +-----v------+
    | CPU/GPU   |      | System RAM |
    | Inference |      | / VRAM     |
    +-----------+      +------------+

Model Format: GGUF

Ollama uses the GGUF format (from the llama.cpp project), which packages model weights in quantized form. Quantization levels range from Q2_K (smallest, lowest quality) to F16 (full half-precision, highest quality). The sweet spot for most use cases is Q4_K_M: roughly 4–5 bits per weight, which preserves most of the full-precision quality at under a third of the F16 memory footprint.
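
The arithmetic behind these trade-offs is easy to sketch. A back-of-the-envelope estimator (the bits-per-weight figures below are approximate averages I am assuming; real GGUF files mix tensor precisions and carry metadata, so actual file sizes differ):

```python
# Rough memory-footprint estimate for model weights at different GGUF
# quantization levels. Values are approximate effective bits per weight.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.8,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def estimate_gib(n_params: float, quant: str) -> float:
    """Approximate weight storage in GiB for a model with n_params parameters."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 2**30

for quant in BITS_PER_WEIGHT:
    print(f"8B model at {quant:7s}: ~{estimate_gib(8e9, quant):.1f} GiB")
```

For an 8B model this puts Q4_K_M at roughly 4.5 GiB against roughly 15 GiB for F16, which is why a quantized 8B model fits comfortably on a 16 GB laptop while the full-precision version does not.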

Modelfile (Dockerfile for Models)

FROM llama3

PARAMETER temperature 0.7
PARAMETER num_ctx 8192

SYSTEM """You are a helpful code review assistant. Focus on security issues,
performance problems, and maintainability concerns. Be concise."""

Build and run: ollama create code-reviewer -f Modelfile && ollama run code-reviewer

This is powerful for standardizing model configurations across a team – version-control the Modelfile, share via a registry.
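
Because a Modelfile is plain text, it can also be generated from a script and registered with ollama create. A sketch (the helper name build_modelfile is mine, and the create step is guarded since it assumes the ollama CLI is installed):

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def build_modelfile(base: str, system: str, **params: object) -> str:
    """Render Modelfile text from a base model, parameters, and a system prompt."""
    lines = [f"FROM {base}", ""]
    lines += [f"PARAMETER {k} {v}" for k, v in params.items()]
    lines += ["", f'SYSTEM """{system}"""', ""]
    return "\n".join(lines)

text = build_modelfile(
    "llama3",
    "You are a helpful code review assistant. Be concise.",
    temperature=0.7,
    num_ctx=8192,
)

# Only invoke `ollama create` if the CLI is actually on PATH.
if shutil.which("ollama"):
    with tempfile.TemporaryDirectory() as d:
        mf = Path(d) / "Modelfile"
        mf.write_text(text)
        subprocess.run(
            ["ollama", "create", "code-reviewer", "-f", str(mf)], check=True
        )
```

This makes per-team model presets a build artifact: generate the Modelfile in CI, create the model, and every developer pulls the same configuration.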


API

Ollama exposes a REST API on port 11434 that is straightforward to integrate with.

Generate (Completion)

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain Kubernetes pods in one paragraph",
  "stream": false
}'

Chat (Multi-turn)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {"role": "system", "content": "You are a K8s expert."},
    {"role": "user", "content": "What is a DaemonSet?"}
  ],
  "stream": false
}'

OpenAI Compatibility

Since early 2024, Ollama supports OpenAI-compatible endpoints at /v1/chat/completions. This means any tool or SDK using the OpenAI client library can point to Ollama:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by SDK but not validated
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello"}]
)

Model Management

ollama pull llama3              # Download a model
ollama list                     # List installed models
ollama show llama3              # Show model details (params, license, size)
ollama cp llama3 my-llama3      # Copy/alias a model
ollama rm old-model             # Delete a model
ollama ps                       # Show running models and memory usage
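
These CLI commands sit on top of the same HTTP API; ollama list, for instance, corresponds to GET /api/tags, which returns a JSON object with a "models" array. A stdlib-only sketch of reproducing it over HTTP (field names follow the API docs; the output formatting is my own):

```python
import json
import urllib.request

def list_models(base_url="http://localhost:11434"):
    """Fetch installed models via GET /api/tags (the API behind `ollama list`)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        data = json.load(resp)
    return [(m["name"], m.get("size", 0)) for m in data.get("models", [])]

def format_models(models):
    """Render (name, size-in-bytes) pairs as a small table."""
    return "\n".join(
        f"{name:30s} {size / 2**30:6.1f} GiB" for name, size in models
    )

# With a local server running:
#   print(format_models(list_models()))
```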

Supported Models

Ollama’s model library includes most popular open-weight models, pre-quantized and ready to pull:

| Model             | Sizes        | Use Case                           |
|-------------------|--------------|------------------------------------|
| Llama 3 / 3.1     | 8B, 70B      | General purpose, strong reasoning  |
| Mistral / Mixtral | 7B, 8x7B     | Fast inference, good code          |
| Qwen 2.5          | 0.5B-72B     | Multilingual, strong on benchmarks |
| Phi-3 / Phi-4     | 3.8B, 14B    | Small but capable, good for edge   |
| Gemma 2           | 2B, 9B, 27B  | Google’s open model, efficient     |
| DeepSeek Coder    | 1.3B-33B     | Code generation and completion     |
| CodeLlama         | 7B, 13B, 34B | Code-specialized Llama             |
| Llava             | 7B, 13B      | Multi-modal (image + text)         |
| Command-R         | 35B, 104B    | RAG-optimized                      |

Custom models from HuggingFace can be imported if they are in GGUF format or convertible to it.
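
If the weights are already a local GGUF file, importing them is a short Modelfile whose FROM points at the file (the filename here is illustrative):

```
FROM ./mistral-7b-instruct.Q4_K_M.gguf
PARAMETER num_ctx 4096
```

ollama create my-mistral -f Modelfile then makes it available like any pulled model.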


Use Cases

1. Local Development and Testing

Run the same model your production system uses (or a smaller variant) locally. Test prompt engineering, tool calling, and agent logic without API costs or rate limits.

# Dev machine: run a small model for fast iteration
ollama run phi3

# CI pipeline: run integration tests against a local model
docker run -d --name ollama ollama/ollama
docker exec ollama ollama pull llama3:8b
pytest tests/integration/ --llm-base-url=http://localhost:11434/v1
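
In CI it pays to fail fast (or skip) when the Ollama container has not finished starting. A small reachability probe, sketched with only the standard library (a running Ollama instance answers a plain GET / with "Ollama is running"):

```python
import urllib.error
import urllib.request

def server_available(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers on base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False

# In pytest, gate the whole integration module on the probe:
#   import pytest
#   pytestmark = pytest.mark.skipif(
#       not server_available("http://localhost:11434"),
#       reason="no local Ollama server",
#   )
```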

2. Air-Gapped / Regulated Environments

In environments where data cannot leave the network (healthcare, defense, financial services), Ollama provides LLM capabilities without any external API calls. Pre-pull models on an internet-connected machine, then transfer the ~/.ollama/models directory to the air-gapped system.

3. Privacy-Sensitive Applications

All inference happens locally. No data is sent to any external service. Useful for processing sensitive documents, PII-containing data, or proprietary code.

4. Edge and Embedded

Ollama runs on macOS, Linux, and Windows. With small models (Phi-3 3.8B, Gemma 2B), it runs reasonably well on laptops and even some single-board computers with sufficient RAM.


Limitations for Production

Ollama is explicitly not designed for production serving at scale. Understanding these limitations is critical before choosing it over vLLM or TGI.

| Limitation                 | Detail                                                            | Production Alternative   |
|----------------------------|-------------------------------------------------------------------|--------------------------|
| Single-request inference   | No continuous batching; processes one request at a time per model | vLLM, TGI                |
| No tensor parallelism      | Cannot split a model across multiple GPUs                         | vLLM, TensorRT-LLM       |
| No autoscaling             | Single process, no built-in horizontal scaling                    | vLLM + K8s HPA, KServe   |
| Limited observability      | Basic logging only, no Prometheus metrics, no distributed tracing | vLLM (Prometheus), TGI   |
| No auth/RBAC               | API is open by default, no token validation                       | API gateway in front     |
| GGUF only                  | Cannot serve models in safetensors/PyTorch format directly        | vLLM (native HF support) |
| Quantized only (practical) | Full-precision models use too much CPU memory to be useful        | vLLM on GPUs             |

When Ollama in Kubernetes Makes Sense

Despite the limitations, there are valid K8s deployments:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          limits:
            memory: "16Gi"
            cpu: "8"
        volumeMounts:
        - name: models
          mountPath: /root/.ollama
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models

This works for internal tools, dev environments, and low-traffic internal APIs where the simplicity of Ollama outweighs the performance limitations. Do not use this pattern for customer-facing production traffic.


Ollama vs vLLM: Decision Matrix

| Factor            | Ollama                             | vLLM                                       |
|-------------------|------------------------------------|--------------------------------------------|
| Setup time        | Minutes                            | Hours (CUDA, drivers, model conversion)    |
| Throughput        | Low (single request)               | High (continuous batching, PagedAttention) |
| GPU utilization   | Basic                              | Optimized (95%+ memory efficiency)         |
| Model format      | GGUF (quantized)                   | Native HF (safetensors, FP16/BF16)         |
| Multi-GPU         | No                                 | Yes (tensor + pipeline parallelism)        |
| API compatibility | OpenAI-compatible                  | OpenAI-compatible                          |
| Best for          | Dev, testing, air-gapped, privacy  | Production serving at scale                |

When to Use Ollama

Use Ollama when:

  • You need a fast local LLM for development, prompt engineering, or testing
  • You are in an air-gapped or high-security environment
  • You want to run integration tests in CI without cloud API costs
  • You need a simple, low-maintenance LLM for internal tools with low traffic
  • You are evaluating models and want to quickly try different options

Skip Ollama when:

  • You need to serve multiple concurrent users in production (use vLLM)
  • You need maximum inference performance on GPUs (use vLLM or TensorRT-LLM)
  • You need autoscaling based on demand (use vLLM + K8s HPA or KServe)
  • You are serving full-precision models and need every bit of quality (use vLLM with FP16)

References

  • Ollama official site: https://ollama.ai
  • Ollama GitHub: https://github.com/ollama/ollama
  • Ollama API documentation: https://github.com/ollama/ollama/blob/main/docs/api.md
  • Ollama model library: https://ollama.ai/library
  • GGUF format specification: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
  • Ollama Docker image: https://hub.docker.com/r/ollama/ollama

This post is licensed under CC BY 4.0 by the author.