[EN] Acervo v0.1: I built my own AI memory because the standard solution doesn't work
By Sandy Veliz (@sandy_veliz)
This is the first post in a series where I document the development of Acervo — an open source library that tackles the memory problem in AI agents. I'm writing it as I develop, version by version, mistakes included.
Your AI forgets everything. Always.
If you use AI tools in your daily work — Claude, ChatGPT, Cursor, or any agent — you've already experienced it.
Monday: "I work at a software studio, we have three web projects, one with React, another with Next.js, the third with Angular..."
Tuesday: "I work at a software studio, we have three web projects..."
Every session starts from scratch. And within the same session, the problem is worse than it seems: the LLM receives the entire previous conversation on every turn. Turn 1: 200 tokens. Turn 50: 9,000 tokens. Turn 100: it hits the context limit and starts losing information — ironically, the first thing you told it.
It's like your coworker reading every meeting from the last month every time you ask them a question. And when there are too many meetings, they start forgetting the earliest ones.
The problem is worse than it seems
The industry tried to improve this, but with more layers of the same thing. Now you can add files like CLAUDE.md or Custom Rules in Copilot to give the model persistent context. You can write Skills.md to add summarized knowledge about your stack. AGENTS.md to give it personality or specify how you want it to behave. All of this helps — but it's more tokens. More static context sent with every request.
And that's just the beginning. Tool definitions add JSON schemas. Coordinator agents inject their own rules. MCP servers add more instructions. Each new layer burns context tokens that could be used for the actual conversation. Models now support 128K, 200K, even 1M tokens — but having more space doesn't solve the underlying problem. You're stuffing everything into an ever-bigger bag instead of organizing what you need. With long sessions, the model loses information anyway.
And there's a second problem nobody talks about: sessions. Every conversation is independent. If today you spend 2 hours with your agent on a complex bug, tomorrow that work doesn't exist. Talking to an agent should be like talking to a coworker: regardless of the session, regardless of when the last conversation was, the relevant context is always available.
The standard solution: RAG
The industry's answer is called RAG (Retrieval-Augmented Generation). The idea is intuitive: you save conversations as documents, convert them to vectors, and when a new question arrives, you search for the most similar fragments and inject them into the context.
It works. Up to a point. But it has three fundamental problems:
RAG retrieves text, not knowledge. If in session 3 you said "I'm a River Plate fan" and in session 47 you ask about River's coach, the system has to find that specific fragment among thousands of chunks. Sometimes it does. Sometimes it doesn't. There's no "River Plate" node with everything accumulated — there are scattered pieces of text.
RAG sends noise. A typical retrieval pulls 5 chunks of ~500 tokens each = 2,500 tokens. Most of it is irrelevant context around the phrase that actually mattered. To answer "what framework are we using?", the LLM receives 2,500 tokens of text when the answer fits in 5.
RAG grows but doesn't improve. Each conversation adds more chunks. More chunks = more noise in results = more tokens spent without adding value.
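To make the retrieval step concrete, here is a self-contained toy sketch of classic RAG retrieval. It uses bag-of-words counts in place of a real embedding model (a deliberate simplification; the chunk texts and the `embed`/`retrieve` names are illustrative, not from any particular library):

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Classic RAG retrieval: rank stored chunks by similarity, inject the top-k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The framework we are using for the dashboard is React.",
    "Lunch options near the office are limited.",
    "The third project uses Angular with strict mode.",
]
print(retrieve("what framework are we using?", chunks, k=1))
```

Note what comes back: the whole chunk, not the answer. Even when retrieval succeeds, the model receives the surrounding text, which is exactly the noise problem above.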
The origin of the idea
I've been using coding agents like Claude Code for over a year. I always felt something was missing: a coordinator "agent" that manages everything that needs to be done, that can work across multiple projects, that maintains the thread between sessions.
Claude Code improved with parallel agents and sub-agent teams, but the main thread still suffers from the same context problem. I wanted an agent I could use for diverse tasks without losing context every time I switch topics or close a session.
For a moment I thought OpenClaw would let me do it — and while it's cool, the problem turned out to be more fundamental: the models' context limits and session boundaries. It didn't matter which orchestrator I used; they all hit the same wall. The history grows, the context fills up, information is lost.
And that's when it hit me: instead of always using the full chat as context, why don't we bring in the necessary information little by little, on demand?
Thinking about this, an analogy came to mind: how we archived information before computers. Filing cabinets with indexes, folders organized by topic. When you needed something, you didn't read the entire archive — you looked in the index, opened the right folder, and brought back only what was necessary. If you needed more, you dug deeper. But you were always requesting information little by little, on demand.
That's where the name came from: acervo. In library science, an acervo is the complete collection of a library — every book, document, and record, organized so anything can be found when needed.
The core idea: stop sending the full chat
The idea wasn't to build another RAG. It was something conceptually simpler.
What if instead of sending the entire previous conversation to the model, we bring only the knowledge it needs for this turn?
Not the full history. Not text chunks. Entities with their attributes and relationships — what the model actually needs to know, compressed into an organized structure.
The fundamental shift isn't "use fewer tokens" — it's stop using the chat as the context source. In the traditional approach, the chat IS the context: everything the model knows comes from rereading previous messages. If they're lost (due to context limit or a new session), the knowledge is lost.
With Acervo, knowledge lives in the graph. The chat is transient — just a medium through which information arrives. Once extracted and persisted in the graph, the original chat can disappear without losing anything.
What this changes in practice
Think of it this way. In a typical work session with a coding agent, your context today looks like:
System prompt + CLAUDE.md / rules         ~2,000 tokens
Skills / AGENTS.md / personality          ~1,500 tokens
Tool definitions (MCP, function calling)  ~3,000 tokens
Conversation history (turn 50)            ~9,000 tokens
──────────────────────────────────────────────────────
Total                                    ~15,500 tokens
The history grows with each turn. By turn 100, that's 20,000+ tokens of chat alone. And when you switch sessions, that history disappears and you start over.
With Acervo, the chat history is replaced by graph nodes:
System prompt + rules                     ~2,000 tokens
Skills / personality                      ~1,500 tokens
Tool definitions                          ~3,000 tokens
Graph context (relevant nodes)              ~400 tokens ← constant
Last 2 messages                             ~200 tokens
──────────────────────────────────────────────────────
Total                                     ~7,100 tokens
The key difference: those ~400 graph tokens don't grow. Turn 1, turn 50, turn 500 — always ~400 tokens because the graph only brings the nodes relevant to the current topic, not everything that was said.
And when you close the session and open a new one the next day, you don't start from zero. The graph is still there. Context is reconstructed on the first turn, with the same ~400 tokens. The session stops being a limit.
Even in an extreme scenario — many skills, many MCPs, document chunks — the total stays well below the context limit. And if tomorrow you open a new session, it drops back to the baseline. You can keep talking as if you never closed the conversation.
The graph as a semantic compression layer between conversations and the LLM. That was Acervo's foundational concept.
The initial architecture
From the start, I thought of Acervo as an independent library — something that could be invoked as an API, as an MCP server, or integrated in any way into any agent system. The idea was for it to be framework-agnostic: if you use LangChain, CrewAI, or your own system, Acervo connects as just another piece.
The knowledge graph
The heart of the system is a graph persisted to disk. Each node represents a real-world entity — a person, a place, an organization, a technology, a project. Each edge represents a relationship between two entities.
If in a conversation I say "I'm a River Plate fan and I live in Cipolletti", the extractor creates:
Nodes:
River Plate → organization
Cipolletti → place
Facts:
River Plate: "The user is a River Plate fan" (source: user)
Cipolletti: "The user lives in Cipolletti" (source: user)
A principle from day one: a node is never duplicated, it's enriched. If "River Plate" appears in 30 conversations over 3 months, there's a single node. Each conversation adds facts to the existing node. The graph is a repository of accumulated knowledge, not a conversation history.
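The enrich-don't-duplicate rule can be sketched in a few lines. This is an illustrative shape, not Acervo's actual API — the `Node`/`Graph` names and the lowercase-name key are assumptions for the example:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                       # e.g. "organization", "place"
    facts: list[str] = field(default_factory=list)

class Graph:
    def __init__(self):
        self.nodes: dict[str, Node] = {}

    def upsert(self, name: str, kind: str, fact: str) -> Node:
        # Core rule: a node is never duplicated, it's enriched.
        node = self.nodes.setdefault(name.lower(), Node(name, kind))
        if fact not in node.facts:  # idempotent: re-stating a fact adds nothing
            node.facts.append(fact)
        return node

g = Graph()
g.upsert("River Plate", "organization", "The user is a River Plate fan")
g.upsert("River Plate", "organization", "The user asked about River's coach")
print(len(g.nodes), len(g.nodes["river plate"].facts))  # → 1 2
```

Thirty conversations mentioning "River Plate" still produce a single node; only the fact list grows.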
Two analogies to understand the layers
The operating system analogy: an OS doesn't give every program access to all memory. It gives it what it needs, when it needs it, and frees it when it's no longer used. Acervo does the same: the LLM (program) doesn't receive the full history (memory). It receives a small, relevant subset (RAM). The graph (disk) stores everything else. When the user mentions something old (page fault), the graph brings it back.
The human analogy — which for me is the most intuitive: the hot/warm/cold layers are a metaphor for how we think when we talk. When you're in a conversation, you keep focus on one or a few things at a time — that's the hot layer. Things you mentioned a while ago are "there" but not in the foreground — warm. And everything else you know but aren't thinking about — cold. You're not reading your entire life every time you answer a question. You bring to the foreground only what's relevant.
The per-turn pipeline
User message
↓
Topic detector (did the topic change?)
↓
Activation of relevant graph nodes
↓
Query planner (what information does the LLM need?)
↓
Context build (assembles hot + warm layers)
↓
LLM responds to the user (streaming)
↓
Extractor (pulls knowledge from the conversation)
↓
Persists to graph → available for the next turn
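The flow above can be sketched as code. Every stage here is a deliberately trivial stub (first-word topic detection, substring-based node activation, an extractor that returns nothing) — the point is the ordering: respond first, extract and persist afterwards:

```python
class TurnPipeline:
    """Toy sketch of the per-turn flow; every stage is a stub."""

    def __init__(self, graph: dict[str, list[str]]):
        self.graph = graph          # entity name -> accumulated facts
        self.topic = None

    def detect_topic(self, message: str) -> bool:
        # Stand-in for the cascading detector: first word as the "topic".
        new_topic = message.split()[0].lower()
        changed = new_topic != self.topic
        self.topic = new_topic
        return changed

    def activate(self, message: str) -> dict[str, list[str]]:
        # Activate only nodes whose name appears in the message.
        text = message.lower()
        return {name: facts for name, facts in self.graph.items() if name in text}

    def extract(self, message: str, reply: str) -> dict[str, list[str]]:
        # Stand-in extractor: a real one would call the small model here.
        return {}

    def run_turn(self, message: str, respond) -> str:
        self.detect_topic(message)
        context = self.activate(message)                 # hot + warm layers
        reply = respond(context, message)                # the LLM answers first
        self.graph.update(self.extract(message, reply))  # persist afterwards
        return reply

pipeline = TurnPipeline({"react": ["The dashboard project uses React"]})
reply = pipeline.run_turn(
    "react: which version are we on?",
    lambda ctx, msg: f"answered with {len(ctx)} node(s) in context",
)
print(reply)
```

Because extraction runs after the response, it never delays the user — a property v0.1 didn't fully have, as the "what broke" section below explains.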
The cascading topic detector
Before each turn, the system detects whether you changed topics. This determines which nodes are activated.
Level 1 — Keywords (free, instant): a regex over explicit markers like "changing topics" or "on another note". If it matches → resolved.
Level 2 — Embeddings (fast, no LLM): cosine similarity between the message and the current topic. High → same topic. Low → new topic. Medium → ambiguous, escalates to level 3.
Level 3 — LLM (only if L2 can't decide): asks the model "same topic, subtopic, or new topic?".
Most turns are resolved at L1 or L2, without LLM calls.
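The cascade can be expressed as a single function. The thresholds (0.75 / 0.35) and marker phrases are illustrative values for the sketch, not the ones Acervo uses; the similarity score is assumed to be precomputed from embeddings:

```python
import re

TOPIC_MARKERS = re.compile(r"\b(changing topics|on another note)\b", re.IGNORECASE)

def topic_changed(message: str, similarity: float, ask_llm=None,
                  high: float = 0.75, low: float = 0.35) -> bool:
    # Level 1 — keywords: free and instant.
    if TOPIC_MARKERS.search(message):
        return True
    # Level 2 — embedding similarity against the current topic.
    if similarity >= high:
        return False            # clearly the same topic
    if similarity <= low:
        return True             # clearly a new topic
    # Level 3 — only the ambiguous middle band pays for an LLM call.
    return ask_llm(message) if ask_llm else False

print(topic_changed("on another note, what's for lunch?", similarity=0.9))  # → True
```

The LLM call sits behind two cheap gates, so in the common case the decision costs nothing.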
The conservative extractor
The extractor has a strict philosophy: it only saves what was explicitly stated. Never inferences, never general knowledge. Why? Because hallucinations in the graph propagate forever. If an incorrect fact enters the graph, the LLM will confidently repeat it in every future conversation. An empty node is better than a node with incorrect data.
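One cheap way to enforce that philosophy is a literal-mention guard: discard any candidate fact whose entity the user never actually typed. This is a hypothetical illustration of the principle, not Acervo's actual filter:

```python
def keep_explicit(facts: list[dict], message: str) -> list[dict]:
    # Conservative guard: drop any candidate fact whose entity was not
    # literally mentioned by the user. An empty node beats a wrong one,
    # so anything the model may have inferred is discarded.
    text = message.lower()
    return [f for f in facts if f["entity"].lower() in text]

candidates = [
    {"entity": "River Plate", "fact": "The user is a River Plate fan"},
    {"entity": "Boca Juniors", "fact": "The user dislikes Boca"},  # inferred, never stated
]
kept = keep_explicit(candidates, "I'm a River Plate fan and I live in Cipolletti")
print(kept)
```

The second candidate is plausible but was never said, so it never reaches the graph.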
Local setup
Everything runs locally. Two models in LM Studio:
- Qwen 3.5 9B — conversation. Smart, fluent.
- Qwen 2.5 3B — extraction and classification. Fast, clean JSON, no "thinking" mode.
Why two models? The 9B had a reasoning mode that generated <think> blocks of 4,000+ tokens of internal thought. Useful for conversation, a disaster for extraction: it consumed context and sometimes truncated the JSON output.
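When a reasoning model must produce JSON, one common mitigation is to strip the reasoning block before parsing. A minimal sketch (the `<think>` tag is taken from the post; it only helps when the block completes — it can't rescue output that was truncated mid-JSON, which is why extraction moved to the 3B):

```python
import json
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def parse_extraction(raw: str) -> dict:
    # Drop any <think>…</think> reasoning block, then parse the JSON payload.
    cleaned = THINK_RE.sub("", raw).strip()
    return json.loads(cleaned)

raw = ('<think>the user mentioned a football team...</think>\n'
       '{"entity": "River Plate", "type": "organization"}')
print(parse_extraction(raw))  # → {'entity': 'River Plate', 'type': 'organization'}
```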
No cloud APIs. No cost per token. No data leaving your machine.
What worked
By the end of v0.1, the system captured real knowledge and used it in future conversations. If on Monday you talked about your project and on Wednesday you asked "what framework are we using?", the LLM already knew.
The context remained constant regardless of how many conversations had passed. The core idea of semantic compression worked.
What broke
Too many LLM calls. 3-4 per turn, several synchronous. The planner ran before the model started responding. The user waited.
The topic detector was hyperactive. A conversation about Harry Potter jumped between "Literature" → "Pop culture" → "Fantasy" on consecutive turns. Follow-up questions failed because they didn't explicitly mention Harry Potter.
The graph was flat. Loose nodes without hierarchical structure. Batman and Gotham existed but nothing connected them to the DC universe.
The 9B thought too much. <think> blocks of 4,000+ tokens before producing a response. That's why it needed the separate 3B — but maintaining two models was fragile.
What we learned for v0.2
v0.1 validated the hypothesis: a knowledge graph can replace raw history and keep context constant. Semantic compression works.
But we still needed to prove it works with long conversations — short chats worked fine, but we hadn't tested it with 100+ turn sessions with multiple topic changes.
We also realized the query planner was unnecessary, that we needed a different way to detect topics, and that extraction had to improve drastically. The conclusion was that we needed a fine-tuned model for this.
And most importantly: Acervo had to be installable as a tool — like Git, like Claude Code — in any project, so it could be included within any agent system. Not an app. A piece of infrastructure.
The question that led me to v0.2 was: what if a single model could do everything? That meant fine-tuning. I'd never done it. I had to learn.
In the next post: fine-tuning a Qwen 3.5 9B to do chat AND extraction with a single prompt, simplifying the pipeline by half, and the concept of stateless Acervo.
Acervo is open source under Apache 2.0. Repo: github.com/sandyeveliz/acervo · Model: huggingface.co/SandyVeliz/acervo-extractor-qwen3.5-9b