
Agent Memory & State

An agent without memory is just an expensive function call -- memory is what turns a stateless LLM into something that learns, adapts, and maintains continuity across interactions.


Why Memory Matters for Agents

LLMs are stateless. Every API call starts from zero. Memory systems bolt on persistence so agents can reference prior conversations, learn from mistakes, accumulate knowledge, and maintain context across sessions. Without memory, every interaction is a cold start.

For enterprise AI platforms: memory enables personalization, context-aware escalation, and audit trails. For async side ventures: memory lets agents pick up where they left off after running overnight.


Memory Taxonomy

1. Short-Term Memory (Conversation Buffer)

The simplest form: pass the conversation history as part of the prompt. Every message (user and assistant) appends to a list, which gets sent with each new request.

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's our Q1 revenue?"},
    {"role": "assistant", "content": "Q1 revenue was EUR 4.2M..."},
    {"role": "user", "content": "How does that compare to Q4?"},  # needs prior context
]

Limits: The context window is finite. At ~200K tokens (Claude), a single session can run long, but a week of agent activity will overflow it. You need a strategy before that happens.

Truncation strategies:

  • Sliding window: Keep last N messages. Simple but loses early context.
  • Summarization: Periodically summarize older messages into a condensed block. LangChain’s ConversationSummaryBufferMemory does this.
  • Token-based trim: Keep as many recent messages as fit in a token budget.
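The token-based trim is easy to get right and worth sketching. Here `count_tokens` is a rough stand-in (assume ~4 characters per token); swap in a real tokenizer such as tiktoken for production use.

```python
def count_tokens(message: dict) -> int:
    # Rough estimate: ~4 characters per token. Replace with a real
    # tokenizer (e.g. tiktoken) before relying on this in production.
    return len(message["content"]) // 4 + 1

def trim_to_budget(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system prompt plus as many recent messages as fit."""
    system, rest = messages[0], messages[1:]
    kept, used = [], count_tokens(system)
    for msg in reversed(rest):  # walk newest -> oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))
```

Call `trim_to_budget(messages, budget)` just before each API request; the system prompt always survives, and the oldest messages drop first.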

2. Long-Term Memory (Persistent Knowledge)

Facts, preferences, and learned information that persist across sessions. Typically stored in a vector database and retrieved via semantic search (RAG pattern).

User says "I prefer Python over TypeScript"
    --> Extract fact: {user_preference: "Python over TypeScript"}
    --> Store in vector DB with embedding
    --> Retrieve in future sessions when language choice is relevant

Storage backends: Pinecone, Weaviate, Qdrant, ChromaDB, pgvector. For enterprise, pgvector in your existing Postgres is often the pragmatic choice.
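The store/retrieve loop above can be sketched end-to-end. The `embed` function here is a toy bag-of-words stand-in for a real embedding model, and the class mirrors what pgvector or Chroma would handle for you.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" -- replace with a real embedding
    # model (e.g. text-embedding-3-small) in practice.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class LongTermMemory:
    def __init__(self):
        self.facts: list[tuple[str, Counter]] = []

    def store(self, fact: str):
        self.facts.append((fact, embed(fact)))

    def retrieve(self, query: str, top_k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.facts, key=lambda f: cosine(q, f[1]), reverse=True)
        return [fact for fact, _ in ranked[:top_k]]
```

The vector database replaces the list and the ranking; the extract-then-store discipline stays the same regardless of backend.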

3. Episodic Memory

Memory of specific past interactions – what happened, when, what worked, what failed. This is how agents learn from experience.

{
  "episode_id": "ep_2024_0315_001",
  "task": "Debug production auth failure",
  "actions_taken": ["checked logs", "identified token expiry", "updated refresh logic"],
  "outcome": "success",
  "duration_minutes": 12,
  "lessons": "Token refresh failures often caused by clock skew between services"
}

Use case: An agent that debugs production issues gets better over time by recalling similar past incidents and what resolved them.
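A minimal recall step over episode records like the one above might look like this. Similarity here is naive word overlap; a real system would embed the task descriptions instead.

```python
def recall_similar(episodes: list[dict], task: str, top_k: int = 2) -> list[dict]:
    """Rank past episodes by word overlap with the new task description."""
    task_words = set(task.lower().split())

    def overlap(ep: dict) -> int:
        return len(task_words & set(ep["task"].lower().split()))

    return sorted(episodes, key=overlap, reverse=True)[:top_k]

episodes = [
    {"task": "Debug production auth failure", "lessons": "Check clock skew"},
    {"task": "Generate quarterly report", "lessons": "Cache the raw data"},
]
similar = recall_similar(episodes, "Debug auth timeout in production", top_k=1)
# The agent prepends similar[0]["lessons"] to its context before acting.
```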

4. Semantic Memory

General knowledge and facts about the domain, organized as concepts and relationships. Think of it as the agent’s “textbook knowledge” – not tied to specific episodes.

Company: MediaMarktSaturn
  - Industry: Consumer Electronics Retail
  - Cloud: GCP
  - AI Platform: Custom, GCP-based
  - Key constraint: GDPR compliance required

Implementation: Often a knowledge graph (Neo4j) or structured store. Can also be embedded documents in a vector DB.
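Stripped down, a semantic store is a set of (subject, relation, object) triples. A plain-dict sketch of the idea, standing in for what Neo4j would give you at scale:

```python
from collections import defaultdict

class SemanticMemory:
    """Triple store: (subject, relation) -> set of objects."""
    def __init__(self):
        self.triples: dict[tuple[str, str], set[str]] = defaultdict(set)

    def add(self, subject: str, relation: str, obj: str):
        self.triples[(subject, relation)].add(obj)

    def query(self, subject: str, relation: str) -> set[str]:
        return self.triples[(subject, relation)]

kg = SemanticMemory()
kg.add("MediaMarktSaturn", "cloud", "GCP")
kg.add("MediaMarktSaturn", "constraint", "GDPR compliance")
```

The structured form is what matters: the agent can answer "what cloud does the company use?" with a lookup instead of a context-window search.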

5. Working Memory

Scratchpad for the current task. Intermediate results, partial computations, task state. This is not conversation history – it’s the agent’s “notepad” while solving a multi-step problem.

working_memory = {
    "current_task": "Analyze sales data",
    "steps_completed": ["loaded CSV", "cleaned nulls"],
    "steps_remaining": ["run regression", "generate report"],
    "intermediate_results": {"row_count": 14523, "null_pct": 0.03}
}

Implementation: Usually a structured dict/JSON passed in the system prompt or as a tool result. LangGraph’s State object serves this role.


State Management Across Turns

LangGraph Checkpointing

LangGraph treats agent execution as a graph with typed state. A checkpointer – MemorySaver (in-memory), or SqliteSaver / PostgresSaver for durable storage – persists state at every node transition.

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph

checkpointer = MemorySaver()  # swap for PostgresSaver in production

graph = StateGraph(AgentState)  # AgentState: a TypedDict defined elsewhere
graph.add_node("research", research_node)
graph.add_node("write", write_node)
graph.set_entry_point("research")  # the graph needs an entry point to compile
graph.add_edge("research", "write")

app = graph.compile(checkpointer=checkpointer)

# Resume from checkpoint
config = {"configurable": {"thread_id": "task-42"}}
result = app.invoke({"query": "continue analysis"}, config)

Key benefit: If an agent crashes mid-execution, you resume from the last checkpoint, not from scratch. Essential for long-running async agents.

Thread vs. memory separation: thread_id scopes conversation state. A separate memory namespace stores cross-thread facts (user preferences, learned knowledge).
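The two scopes can be kept apart simply by keying them differently. A plain-dict sketch of the separation, independent of LangGraph's actual store API:

```python
class AgentStore:
    """Conversation state is keyed per thread; facts are keyed per user,
    so they survive across threads."""
    def __init__(self):
        self.threads: dict[str, list] = {}   # thread_id -> messages
        self.memories: dict[str, list] = {}  # user_id   -> facts

    def append_message(self, thread_id: str, message: dict):
        self.threads.setdefault(thread_id, []).append(message)

    def remember(self, user_id: str, fact: str):
        self.memories.setdefault(user_id, []).append(fact)

    def context_for(self, thread_id: str, user_id: str) -> dict:
        # Each new turn sees its own thread plus all cross-thread facts.
        return {
            "history": self.threads.get(thread_id, []),
            "facts": self.memories.get(user_id, []),
        }
```

A fresh thread starts with empty history but still sees every fact learned about the user in earlier threads.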

Anthropic Claude: Conversation History as State

Claude’s API uses explicit message arrays. State management is your responsibility. For multi-turn agents:

# Store conversation in your DB, keyed by session_id
conversations_table.upsert(
    session_id="sess_abc",
    messages=messages,
    metadata={"last_active": now(), "agent": "support_v2"}
)
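A concrete version of that upsert using sqlite3 from the standard library; the table and column names are illustrative, and in production you'd point at a file or swap in Postgres.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path or Postgres in production
conn.execute("""
    CREATE TABLE IF NOT EXISTS conversations (
        session_id TEXT PRIMARY KEY,
        messages   TEXT NOT NULL,
        metadata   TEXT NOT NULL
    )
""")

def save_session(session_id: str, messages: list, metadata: dict):
    # Upsert: insert a new session or overwrite an existing one.
    conn.execute(
        "INSERT INTO conversations VALUES (?, ?, ?) "
        "ON CONFLICT(session_id) DO UPDATE SET "
        "messages = excluded.messages, metadata = excluded.metadata",
        (session_id, json.dumps(messages), json.dumps(metadata)),
    )
    conn.commit()

def load_session(session_id: str) -> list:
    row = conn.execute(
        "SELECT messages FROM conversations WHERE session_id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else []
```

On each new turn, `load_session`, append the user message, call the API, append the reply, and `save_session` again.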

OpenAI Agents SDK: Built-in Thread State

The OpenAI Agents SDK manages conversation threads internally. State persists within a Runner.run() loop but must be explicitly saved for cross-session persistence.


Mem0: Managed Memory Layer

Mem0 is a standalone memory layer: an API that sits between your agent and a vector/graph store and handles extraction, deduplication, and retrieval.

from mem0 import Memory

memory = Memory()

# Add memories from conversation
memory.add("User prefers concise responses", user_id="imrul")
memory.add("Working on MMS AI platform on GCP", user_id="imrul")

# Retrieve relevant memories for new context
relevant = memory.search("What cloud does the user work with?", user_id="imrul")
# Returns: [{"memory": "Working on MMS AI platform on GCP", "score": 0.94}]

Architecture: Mem0 uses an LLM to extract facts from conversations, embeds them, stores in a vector DB, and retrieves via semantic search. It handles conflict resolution (updating facts that changed) and temporal awareness.

When to use: When you want memory without building the extraction/retrieval pipeline yourself. Good for side ventures where you want fast iteration. For enterprise, evaluate whether you need the control of a custom pipeline.


Custom Memory Architecture for Enterprise

For an enterprise AI platform, you likely need:

                    ┌─────────────────┐
                    │  Agent Runtime  │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
     ┌────────▼───┐  ┌──────▼─────┐  ┌─────▼──────┐
     │ Short-Term │  │ Long-Term  │  │  Working   │
     │  (Redis)   │  │ (pgvector) │  │  (State)   │
     └────────────┘  └────────────┘  └────────────┘
         TTL: 24h      Persistent     Per-task JSON

Short-term in Redis: Fast, auto-expires, handles conversation buffers for active sessions. TTL prevents unbounded growth.

Long-term in pgvector: User preferences, learned facts, domain knowledge. Queryable with SQL for compliance/audit. Embeddings for semantic search.

Working state in task DB: Per-task JSON blob tracking progress, intermediate results, retry counts. Enables resume-from-checkpoint.
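The short-term tier's expiry behavior is the key design point. Sketched in pure Python instead of Redis so the TTL logic is visible (Redis would do this with EXPIRE):

```python
import time

class ShortTermBuffer:
    """Conversation buffer with per-session TTL, mimicking Redis EXPIRE."""
    def __init__(self, ttl_seconds: float = 24 * 3600):
        self.ttl = ttl_seconds
        self.sessions: dict[str, tuple[float, list]] = {}

    def append(self, session_id: str, message: dict):
        _, buf = self.sessions.get(session_id, (0.0, []))
        buf.append(message)
        self.sessions[session_id] = (time.time(), buf)  # writes reset the TTL

    def get(self, session_id: str) -> list:
        entry = self.sessions.get(session_id)
        if entry is None:
            return []
        written_at, buf = entry
        if time.time() - written_at > self.ttl:
            del self.sessions[session_id]  # expired: drop the whole buffer
            return []
        return buf
```

Anything worth keeping past the TTL must be promoted to the long-term tier before it expires; expiry is the feature, not a bug.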


Memory Patterns for Async Agents

When agents run overnight or across days, memory design changes:

  1. Checkpoint aggressively. Save state after every meaningful step, not just at completion. Assume crashes.
  2. Structured handoff memory. When Agent A finishes and Agent B picks up, the handoff object must contain everything B needs – don’t assume B can re-derive context.
  3. Append-only logs. For audit and debugging, never mutate memory – append new entries. Store correction events (“previous analysis was wrong because…”).
  4. Time-aware retrieval. Facts change. “Our primary DB is MySQL” might have been true 6 months ago but not today. Timestamp all memory entries and prefer recent ones.

Tradeoffs

Approach                   Latency              Cost             Complexity   Best For
Full history in context    Low                  High (tokens)    Low          Short conversations
Summarized history         Low                  Medium           Medium       Multi-session chatbots
Vector store RAG           Medium (retrieval)   Low (storage)    High         Knowledge-heavy agents
Mem0 managed               Medium               Medium           Low          MVPs, side ventures
LangGraph checkpointing    Low                  Low              Medium       Multi-step workflows
Custom hybrid              Varies               Varies           High         Enterprise platforms

Anti-Patterns

  • Stuffing everything into context. You’ll hit token limits and degrade quality. LLMs perform worse with irrelevant context than with less context.
  • No memory expiration. Stale facts poison agent decisions. Build TTLs or versioning.
  • Treating conversation history as the only memory. Conversation is a poor format for storing structured facts. Extract and store separately.
  • Sharing all memory across agents. In multi-agent systems, agents should only see memory relevant to their role. A billing agent doesn’t need the debugging agent’s episodic memory.

This post is licensed under CC BY 4.0 by the author.