[EN] Acervo v0.2: I fine-tuned my own model and the pipeline was cut in half
By Sandy Veliz (@sandy_veliz)
Second installment in the Acervo series. If you haven't read the previous post, start there — it explains the problem and why RAG isn't enough.
What was left pending from v0.1
v0.1 proved that the core idea works: a knowledge graph can compress conversations and keep context constant. In short chats it worked well — the LLM received ~400 tokens of graph instead of thousands of raw history.
But there were things we hadn't tested. The most important: does it work with long conversations? Chats of 100+ turns where you change topics five times, return to earlier ones, and correct information along the way. That was the real test, and we still didn't have it.
We also realized the pipeline was too complex for what it did. Two different models, four LLM calls per turn, three synchronous. The query planner — which decided before the chat what information the model needed — was unnecessary. And extraction with the 3B model produced inconsistent results.
The conclusion was twofold: we needed to drastically simplify the pipeline, and we needed a model that could do structured extraction reliably.
Step 1: redesign the pipeline
Before touching models, the first thing was to sit down and think about which steps we actually needed and which were unnecessary. I grabbed a piece of paper and drew the ideal flow.
The result was this simplified pipeline:
- S1 Unified — topic classification + extraction in a single call. Replaces three separate components (L3 classifier, query planner, entity extractor) with a single call that returns everything together.
- S2 Gather — searches the graph for relevant nodes. No LLM, pure logic.
- S3 Agent — the LLM responds to the user with enriched context.
- S1.5 Graph Update — graph curation in the background, after the response. Doesn't block the user.
From 4 synchronous LLM calls to 2 synchronous + 1 asynchronous. The query planner disappeared. The user no longer waits for pre-processing before receiving their response.
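In sketch form, the simplified flow looks like this. The step functions below are hypothetical placeholders, not Acervo's real API; the point is which calls block the user and which don't:

```python
import threading

# Hypothetical placeholders -- not Acervo's actual API.
def s1_unified(message):
    # One LLM call: topic classification + extraction together.
    return {"topic": "deploy", "entities": ["kubernetes"]}

def s2_gather(extraction):
    # Pure logic, no LLM: fetch the relevant graph nodes.
    return {"nodes": extraction["entities"]}

def s3_agent(message, context):
    # One LLM call: answer the user with enriched context.
    return f"answer using {context['nodes']}"

def s1_5_graph_update(message, response):
    # Background curation; runs after the response is sent.
    pass

def handle_turn(message):
    extraction = s1_unified(message)       # sync LLM call 1
    context = s2_gather(extraction)        # no LLM
    response = s3_agent(message, context)  # sync LLM call 2
    threading.Thread(                      # async LLM call 3
        target=s1_5_graph_update, args=(message, response),
        daemon=True,
    ).start()
    return response  # the user never waits on the update thread
```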
But this diagram had an implicit problem: step S1 needed a model that could do topic classification AND entity extraction in a single call, producing structured JSON reliably. The Qwen 2.5 3B wasn't good enough for that, and the Qwen 3.5 9B had the <think> block problem that broke extraction.
We needed a model that could do both things well. And that led us to fine-tuning.
Step 2: fine-tuning my own model
I'd never done fine-tuning. I knew it existed, understood the theory, but had never trained a model. I decided to learn by doing.
The idea was conceptually simple: take Qwen 3.5 9B as the base and train it with extraction examples. Have the model learn that when it receives an extraction prompt it returns clean JSON, and when it receives a chat prompt it converses normally. One model. Two behaviors. Determined solely by the system prompt.
Building the dataset
The hardest part wasn't the training — it was creating the training data. I needed hundreds of conversation → correct extraction examples, covering all edge cases:
| Example type | % | Why it matters |
|---|---|---|
| Facts about existing entities | 30% | "Our project has 50K users" → don't create a new entity, add fact to existing node |
| New entity extraction | 20% | The basic task: detect people, projects, technologies |
| Empty output | 15% | "Thanks, that's all" → the model must NOT invent entities |
| Topic changes | 10% | "Changing topics..." and also implicit changes without announcement |
| Subtopics | 10% | Going deeper into one aspect without changing the general topic |
| Events with participants | 5% | Meetings, releases, incidents — who, when, where |
| Corrections | 5% | "We migrated from React to Vue" → it's a correction, not a new entity |
| Deduplication | 5% | "Our project" → map to existing node, don't create duplicate |
The empty output examples were the most important in the dataset. Without them, the model hallucinates entities in casual chat — it sees "thanks" and extracts a "Gratitude (concept)" node. I had to explicitly teach it when to do nothing.
I generated 612 examples across 5 domains: software (35%), business (20%), literature (15%), personal (15%), academic (15%). Bilingual Spanish/English because that's how I talk to my tools.
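Conceptually, each training example is a prompt/completion pair. This sketch shows what an empty-output example and a fact example might look like; the field names are illustrative, not the actual dataset schema:

```python
import json

# Illustrative schema -- the real dataset format may differ.
empty_output_example = {
    "system": "Extract structured knowledge. Return only valid JSON.",
    "user": "Thanks, that's all",
    # The correct completion is empty arrays: nothing to extract.
    "assistant": json.dumps(
        {"entities": [], "relations": [], "facts": []}
    ),
}

# A positive example for contrast: a fact about an existing entity.
fact_example = {
    "system": "Extract structured knowledge. Return only valid JSON.",
    "user": "Our project has 50K users now",
    "assistant": json.dumps({
        "entities": [],  # "our project" maps to an existing node
        "relations": [],
        "facts": [{"entity": "project", "text": "Has 50K users"}],
    }),
}
```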
The training
- Base: Qwen 3.5 9B
- Method: LoRA via Unsloth
- Hardware: my RTX 5070 Ti, 16 GB VRAM
- Time: ~1 hour 15 minutes
- Cost: $0 (all local)
LoRA is what makes this viable: instead of retraining all 9 billion parameters, it injects small matrices into specific layers. It only trains ~0.5% of the model. The base model stays intact, the new skill is added on top.
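The ~0.5% figure comes out of simple arithmetic. The rank, hidden size, and layer counts below are illustrative assumptions, not the actual training config:

```python
# Illustrative LoRA parameter count: for each adapted weight matrix
# of shape (d_out, d_in), LoRA adds two low-rank matrices A (r x d_in)
# and B (d_out x r), i.e. r * (d_in + d_out) trainable parameters.
base_params = 9_000_000_000   # 9B base model
rank = 16                     # assumed LoRA rank
hidden = 4096                 # assumed hidden size
adapted_matrices = 7 * 40     # assumed: 7 projections x 40 layers

lora_params = adapted_matrices * rank * (hidden + hidden)
fraction = lora_params / base_params
print(f"{lora_params:,} trainable params ({fraction:.2%} of base)")
```

Under these assumptions, roughly 37M trainable parameters out of 9 billion, which is why it fits on a 16 GB consumer GPU.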
Results
20 stress tests designed to break the model:
- "Thanks, that's all" → does it return empty or invent?
- "Our project" with the project already in the graph → does it reference the existing one or duplicate?
- "We migrated from React to Vue" → does it understand it's a correction?
- "I saw Angular 19 came out, interesting" → does it create a false relationship with the project?
- Full paragraph from Dune with 9 characters → does it produce valid JSON?
Result: 100% JSON parse rate, 85% accuracy.
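Both metrics are mechanical to compute. A minimal sketch of the scoring loop, with mock outputs rather than the real test suite:

```python
import json

def score(outputs, expected):
    # JSON parse rate: does the raw output parse at all?
    # Accuracy: does the parsed output match the expected extraction?
    parsed, correct = 0, 0
    for raw, want in zip(outputs, expected):
        try:
            got = json.loads(raw)
        except json.JSONDecodeError:
            continue
        parsed += 1
        if got == want:
            correct += 1
    return parsed / len(outputs), correct / len(outputs)

# Mock run: both outputs parse, one matches the expected extraction.
outputs = ['{"entities": []}', '{"entities": [{"id": "dune"}]}']
expected = [{"entities": []}, {"entities": [{"id": "paul"}]}]
parse_rate, accuracy = score(outputs, expected)
```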
The model is published on Hugging Face: SandyVeliz/acervo-extractor-qwen3.5-9b. Open source, Apache 2.0.
The discovery: a single model is enough
The most surprising result of the fine-tuning wasn't the accuracy — it was discovering that I didn't need two models.
Fine-tuning doesn't replace the base model. It adds a skill. Qwen 3.5 9B still knows how to converse exactly the same. But now, when it receives an extraction system prompt, it produces structured JSON instead of prose with <think> blocks.
# Same model, same endpoint, different system prompt:
# Chat prompt → converses normally
"You are a work assistant..."
→ "The project is going well, we deployed the auth module yesterday."
# Extraction prompt → clean JSON
"Extract structured knowledge. Return only valid JSON."
→ {"entities": [...], "relations": [...], "facts": [...]}
From two loaded models (~14GB VRAM) to just one (~6GB). GPU went from 97% to 42%.
The new pipeline: from 4 calls to 2+1
User message
↓
S1 Unified (sync, fine-tuned model)
→ detects the topic AND extracts knowledge in ONE single call
→ runs ALWAYS, every turn
↓
Context enriched with graph nodes
↓
Agent (sync, same model, different prompt)
→ responds to the user with streaming
↓
S1.5 Graph Update (async, does NOT block the response)
→ graph curation + extraction from the assistant's response
S1 Unified is the biggest change. Before, classifying the topic and extracting entities were two separate calls to two different models. Now it's a single call that returns everything:
{
"topic": {"action": "subtopic", "label": "Production deploy"},
"entities": [
{"id": "kubernetes", "label": "Kubernetes", "type": "technology",
"layer": "UNIVERSAL"}
],
"relations": [
{"source": "orbit", "target": "kubernetes", "relation": "uses_technology"}
],
"facts": [
{"entity": "orbit", "text": "Considering migration to Kubernetes",
"speaker": "user"}
]
}
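Since the whole pipeline depends on this payload being well-formed, it pays to validate it cheaply before touching the graph. A minimal structural check following the key names in the example above (this is not Acervo's actual validator):

```python
import json

def validate_s1_output(raw: str) -> dict:
    # Minimal structural check for the unified S1 payload.
    data = json.loads(raw)
    assert isinstance(data.get("topic"), dict), "missing topic object"
    for key in ("entities", "relations", "facts"):
        assert isinstance(data.get(key), list), f"missing list: {key}"
    for rel in data["relations"]:
        assert {"source", "target", "relation"} <= rel.keys()
    return data

payload = """{
  "topic": {"action": "subtopic", "label": "Production deploy"},
  "entities": [{"id": "kubernetes", "label": "Kubernetes",
                "type": "technology", "layer": "UNIVERSAL"}],
  "relations": [{"source": "orbit", "target": "kubernetes",
                 "relation": "uses_technology"}],
  "facts": [{"entity": "orbit",
             "text": "Considering migration to Kubernetes",
             "speaker": "user"}]
}"""
data = validate_s1_output(payload)
```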
Why does S1.5 exist?
S1.5 solves a subtle but important problem. When the user sends their message, S1 extracts knowledge from that message. But the LLM's response also contains knowledge — data, relationships, facts the assistant mentioned. If we don't extract them, we lose information.
S1.5 runs after the LLM responds, in the background, without the user waiting. It passes the assistant's response through the same model (hence "1.5") and extracts the knowledge the assistant generated. It also does curation: merging duplicates, type correction, new relationships between nodes.
But S1.5 has a deeper purpose than extraction. By processing the response asynchronously, we're making Acervo stateless — it doesn't need session state, doesn't need to know what happened before, doesn't need to be running continuously. The graph gets updated and is ready for the next message, whenever it comes.
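The "doesn't block the user" property is just fire-and-forget scheduling. A sketch with asyncio; the function names are illustrative:

```python
import asyncio

done = []

async def s1_5_graph_update(assistant_reply: str):
    # Runs in the background after the reply was already produced:
    # extract knowledge from the reply, then curate the graph.
    await asyncio.sleep(0.01)  # stand-in for the extraction LLM call
    done.append(assistant_reply)

async def answer_turn(message: str) -> str:
    reply = f"reply to: {message}"                 # sync path: S1 + agent
    asyncio.create_task(s1_5_graph_update(reply))  # fire-and-forget
    return reply                                   # user gets this now

async def main():
    reply = await answer_turn("hello")
    assert done == []          # S1.5 hasn't run yet: it didn't block
    await asyncio.sleep(0.05)  # later, the graph has been updated
    return reply

reply = asyncio.run(main())
```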
Stateless Acervo: the underlying idea
We make Acervo stateless the same way an LLM is stateless. An LLM doesn't "remember" — it receives all its context on every call and responds as if it were the first time.
Acervo does the same: on each turn, it reads the graph, builds the context, and gives it to the LLM. It doesn't matter if 5 seconds or 5 days have passed since the last turn. It doesn't matter if the process restarted. The graph is the state.
Imagine Acervo achieves 100% of what we want. It's a system where the LLM always responds as if it were the first time but with all the context needed for complex tasks. There's no session. There's no growing history. There's a compressed knowledge graph that tells the model exactly what it needs to know, in ~400 tokens, always.
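Mechanically, "the graph is the state" means every turn re-reads the persisted graph and rebuilds the context from scratch. A sketch with an illustrative file layout and node shape:

```python
import json
import pathlib
import tempfile

def build_context(graph_dir: pathlib.Path) -> str:
    # No session, no history: read the persisted graph and rebuild
    # the context from scratch, identically on every turn.
    nodes = json.loads((graph_dir / "nodes.json").read_text())
    lines = [f"- {n['label']}: {'; '.join(n['facts'])}" for n in nodes]
    return "KNOWN CONTEXT:\n" + "\n".join(lines)

# Demo with a throwaway graph on disk.
graph_dir = pathlib.Path(tempfile.mkdtemp())
(graph_dir / "nodes.json").write_text(json.dumps([
    {"label": "Orbit", "facts": ["Has 50K users"]},
]))
ctx = build_context(graph_dir)
# Identical output whether the last turn was 5 seconds or 5 days ago,
# or whether the process restarted in between.
assert ctx == build_context(graph_dir)
```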
From library to proxy
In v0.1, Acervo was a Python library that you imported and used with prepare() / process(). It worked, but it required changing your application code to integrate it.
In v0.2 we found the best approach was to turn it into a transparent proxy. It sits between your app and the LLM. From your code's perspective, you make a single normal HTTP call. The proxy intercepts, enriches the context, and forwards.
Your app → POST /v1/chat/completions → Acervo proxy (:9470)
↓
S1: topic + extraction
context build (graph → tokens)
↓
POST → LLM (:1234)
↓
stream response ←
↓
S1.5 async: graph curation
↓
Your app ← stream response ←──────────────────
Zero code changes. You just redirect the base_url of your application to the proxy. Compatible with OpenAI and Anthropic API format.
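In practice, "redirect the base_url" is a one-line change. This sketch shows the idea as plain data; the ports come from the diagram above, and the model name is a placeholder:

```python
# Before: your app talks to the LLM server directly.
llm_base_url = "http://localhost:1234/v1"

# After: point the same client at the Acervo proxy instead.
# Everything else -- endpoint path, payload, streaming -- is unchanged.
acervo_base_url = "http://localhost:9470/v1"

def chat_request(base_url: str, messages: list) -> dict:
    # Same OpenAI-style request either way; only the host changes.
    return {
        "url": f"{base_url}/chat/completions",
        "json": {"model": "local", "messages": messages, "stream": True},
    }

req = chat_request(acervo_base_url, [{"role": "user", "content": "hi"}])
```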
But Acervo is still also an installable library — the prepare() / process() API is still available for deeper integrations. The idea is for it to be like Git or Claude Code: a tool that installs in any project and adapts to the workflow you already have.
Data lives in .acervo/ in your project directory, following the .git/ pattern:
my-project/
├── .acervo/
│ ├── graph/
│ │ ├── nodes.json
│ │ └── edges.json
│ ├── vectordb/
│ └── config.toml
├── src/
└── ...
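Scaffolding that layout takes a few lines. A sketch only; the actual `acervo` tooling may initialize it differently:

```python
import json
import pathlib
import tempfile

def init_acervo(project_dir: pathlib.Path) -> pathlib.Path:
    # Scaffold the .acervo/ layout inside the project, like .git/.
    root = project_dir / ".acervo"
    (root / "graph").mkdir(parents=True, exist_ok=True)
    (root / "vectordb").mkdir(exist_ok=True)
    (root / "graph" / "nodes.json").write_text(json.dumps([]))
    (root / "graph" / "edges.json").write_text(json.dumps([]))
    (root / "config.toml").write_text("# acervo config\n")
    return root

root = init_acervo(pathlib.Path(tempfile.mkdtemp()))
```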
What I learned from fine-tuning
Three lessons I couldn't find in any tutorial:
1. The "do nothing" data is the most important. 15% of the dataset consists of examples where the correct output is empty arrays. Without them, the model hallucinates entities on every turn. Teaching it when NOT to extract is as hard as teaching it when to.
2. The prompt format is a contract. If you train with the template "EXISTING NODES: [...]\nTOPIC HINT: ...\nUSER: ...", the model in production needs to receive exactly that format. Changing one word can degrade accuracy by 30%. It's an implicit contract between training and inference that nobody tells you about, but you find out when something doesn't work.
3. Fine-tuning is absurdly accessible. A consumer GPU. One hour. $0. A dataset of 612 examples. And the result is a model that responds better for your specific task than models 10x larger. Not because it's smarter — because it knows exactly what format you want.
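Lesson 2 can be enforced in code: build the inference prompt through the exact same function used to generate the training data, so the contract can't drift. The template mirrors the one quoted above; the helper itself is hypothetical:

```python
# One template shared by dataset generation and inference, so training
# and production can never disagree on the exact format.
TEMPLATE = "EXISTING NODES: {nodes}\nTOPIC HINT: {hint}\nUSER: {user}"

def build_extraction_prompt(nodes, hint, user):
    return TEMPLATE.format(nodes=nodes, hint=hint, user=user)

prompt = build_extraction_prompt(["orbit"], "deploy", "We use k8s")
```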
Comparison: v0.1 → v0.2
| | v0.1 | v0.2 |
|---|---|---|
| Models | 2 (9B chat + 3B extraction) | 1 (fine-tuned, does both) |
| VRAM | ~14 GB | ~6 GB |
| LLM calls per turn | 3-4 (synchronous) | 2 sync + 1 async |
| Extraction accuracy | Not measured | 85% |
| JSON parse rate | Variable | 100% |
| Integration | Python library only | Transparent proxy + library |
| State | Session-dependent | Stateless (graph is the state) |
What still doesn't work 100%
We still haven't tested really long conversations. v0.2 improved the architecture, but we still haven't run the 100+ turn benchmark with multiple topic changes. That's the definitive test, and it's the first thing coming in v0.3.
Name duplicates. "Orbit" and "Orbit App" end up as two different nodes. S1.5 should merge them but doesn't always succeed.
Facts sometimes come back empty. The model extracts entities and relations well, but where it should extract a specific fact ("has 50K users"), it sometimes returns facts: []. Those cases were under-represented in the training data.
The graph is still flat. No hierarchy: Batman and Superman are loose nodes with no connection to the DC universe.
What's coming in v0.3
Reproducible benchmarks. A script that runs 100 turns comparing full history vs sliding window vs RAG vs Acervo. The real test we owe ourselves.
Second round of fine-tuning with real usage data. The failures from the first weeks will feed the next training.
Easy installation. Docker Compose with a single command. So someone can try Acervo in 2 minutes.
And the feature I'm most excited about: chunk_refs — having graph nodes point to vector DB chunks. Today the graph has the compressed summary. With chunk_refs, if the summary isn't enough, the system will fetch the full text — but only the chunks that the node references, not the thousands in the entire database. The graph as an index for RAG. It doesn't replace it — it makes it 20x more efficient.
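The chunk_refs idea in miniature: graph nodes carry pointers into the vector store, so a fallback fetch touches only the referenced chunks. Field names here are speculative, since this feature isn't built yet:

```python
# Speculative sketch of chunk_refs -- a planned v0.3 feature.
vector_db = {  # stand-in for the thousands of stored chunks
    "c1": "Full meeting notes about the Kubernetes migration...",
    "c2": "Unrelated chunk about the billing module...",
    "c7": "Benchmark results for the auth service...",
}

node = {
    "id": "orbit",
    "summary": "Considering migration to Kubernetes",  # compressed
    "chunk_refs": ["c1", "c7"],  # pointers into the vector DB
}

def expand(node, db):
    # If the summary isn't enough, fetch ONLY the referenced chunks,
    # not the whole database: the graph acts as an index for RAG.
    return [db[ref] for ref in node["chunk_refs"] if ref in db]

chunks = expand(node, vector_db)
```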
Current status
v0.2 is published on PyPI (pip install acervo). The fine-tuned model is on Hugging Face under Apache 2.0. Everything runs 100% locally with 6GB of VRAM.
- Acervo (library): github.com/sandyeveliz/acervo
- Acervo Models (training): github.com/sandyeveliz/acervo-models
- Model (HuggingFace): SandyVeliz/acervo-extractor-qwen3.5-9b
- Acervo Studio (GitHub): github.com/sandyeveliz/acervo-studio
If you're interested in AI memory or context compression, follow along — I'll keep documenting the entire process, including what doesn't work.