Acervo v0.3 — Proof it works
By Sandy Veliz (@sandy_veliz)
v0.2 proved the architecture worked. v0.3's goal was simpler and harder: prove it to anyone who installs it in 5 minutes.
Everything in this release answers the same question: can someone who isn't me try it, understand it, and verify it's better than what they already have?
Six milestones shipped. Here's what they are.
The bug that had to die first
Before anything else could matter, one thing had to be fixed.
When the graph had no relevant context for a turn — early in a conversation, or during unrelated small talk — the system was falling back to the full conversation history. Every single message, sent to the LLM on every turn. This completely bypassed the sliding window and made token counts grow linearly — the exact problem Acervo exists to prevent.
The fix: enforce the history window always, with or without graph context. If the graph is empty, send the last 2 turn pairs. That's it. (The Anthropic API path had no windowing at all — that got fixed too.)
This single change moved average token savings from ~55% to ~76% across all scenarios.
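The fix can be sketched in a few lines. This is an illustrative sketch, not Acervo's actual code — the function and constant names are assumptions; the point is that the window slice applies on every turn, graph context or not.

```python
# Hypothetical sketch of the fix: the history window is enforced on every
# turn, whether or not the graph produced context. Names are illustrative.
WINDOW_TURN_PAIRS = 2  # last N user/assistant pairs when the graph is empty

def build_history(messages, graph_context):
    """Return the message slice sent to the LLM for this turn."""
    # A turn pair is one user message plus one assistant reply.
    window = messages[-(WINDOW_TURN_PAIRS * 2):]
    if graph_context:
        # Graph context rides alongside the same small window --
        # it never re-opens the full history.
        return graph_context + window
    return window

history = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
    {"role": "user", "content": "tell me about Beacon"},
    {"role": "assistant", "content": "Beacon is..."},
    {"role": "user", "content": "and its auth?"},
]
# With an empty graph, only the last 2 turn pairs (4 messages) are sent.
assert len(build_history(history, graph_context=[])) == 4
```

The old behavior was equivalent to returning `messages` unchanged whenever `graph_context` was empty — which is why token counts grew linearly.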
acervo up
Running Acervo used to require 4 terminals: Ollama, LM Studio, backend, frontend. For a new user that's a setup failure waiting to happen.
```shell
acervo up       # proxy + health check banner
acervo up --dev # everything: proxy + studio + web + ollama, multiplexed logs
acervo status   # health check all dependencies
```
acervo up --dev starts all services in a single terminal with tagged log output ([proxy], [studio], [web], [ollama]). Ctrl+C stops everything. No Docker, no docker-compose, no PID files — just a process manager in stdlib Python.
Service detection is automatic: Ollama binary via shutil.which, LM Studio via HTTP health check on port 1234, Acervo Studio via config path → importlib.find_spec → sibling directory scan.
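The first two detection steps are plain stdlib calls. A minimal sketch, assuming the endpoint path and function names (Acervo's actual probe may differ):

```python
# Illustrative sketch of the detection steps described above.
import shutil
import urllib.request

def detect_ollama() -> bool:
    """Ollama counts as available if its binary is on PATH."""
    return shutil.which("ollama") is not None

def detect_lm_studio(port: int = 1234, timeout: float = 0.5) -> bool:
    """LM Studio is probed with a plain HTTP request on its default port.
    The /v1/models path is an assumption (LM Studio's OpenAI-style API)."""
    try:
        urllib.request.urlopen(f"http://localhost:{port}/v1/models", timeout=timeout)
        return True
    except OSError:
        return False
```

Both checks fail soft: a missing binary or a refused connection just means the service shows as down in the `acervo status` banner.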
Graph inspection
If the extractor creates a wrong node — wrong type, or a merge you didn't ask for — there was previously no way to fix it without wiping the entire graph.
```shell
acervo graph show                 # table: ID, label, type, kind, layer, facts, edges
acervo graph show <id>            # full detail: facts with source/date, all edges
acervo graph search "beacon auth" # search labels and fact content
acervo graph delete <id>          # delete node + edges (with confirmation)
acervo graph merge <id1> <id2>    # merge two nodes (preview + confirmation)
acervo graph repair               # fix missing fields, orphan edges, duplicates
```
All operations are also available as REST endpoints for use from Acervo Studio. Every destructive command requires confirmation or --yes to bypass.
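To make the `repair` semantics concrete, here is a toy version of two of its checks — orphan edges and duplicates. The data shapes are assumptions for illustration, not Acervo's internal representation:

```python
# Minimal sketch of the kind of checks `acervo graph repair` performs.
# The node/edge shapes here are illustrative, not Acervo's internals.
def repair_edges(nodes: dict, edges: list) -> list:
    """Drop edges whose endpoints no longer exist, and deduplicate."""
    seen, kept = set(), []
    for edge in edges:
        key = (edge["src"], edge["dst"], edge.get("type"))
        if edge["src"] in nodes and edge["dst"] in nodes and key not in seen:
            seen.add(key)
            kept.append(edge)
    return kept

nodes = {"a": {}, "b": {}}
edges = [
    {"src": "a", "dst": "b", "type": "uses"},
    {"src": "a", "dst": "b", "type": "uses"},    # duplicate
    {"src": "a", "dst": "ghost", "type": "ref"}, # orphan endpoint
]
assert repair_edges(nodes, edges) == [{"src": "a", "dst": "b", "type": "uses"}]
```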
Document ingestion — chunks linked to the graph
This is the first real step toward "knowledge beyond conversation."
The problem with standard RAG: chunks live in a vector store, isolated from the knowledge graph. When a graph node is activated, there's no connection to the document chunks that support it.
What Acervo does instead: when you index a .md file, the chunks are embedded into ChromaDB and linked to the corresponding graph node via a chunk_ids field. When that node gets activated during a conversation turn, retrieval is scoped to its own chunks — not the global vector store.
```shell
acervo index --path ./docs/architecture.md
```
Or via the REST API (for Studio):
```
POST   /acervo/documents      → upload + index
GET    /acervo/documents      → list indexed documents
GET    /acervo/documents/{id} → detail with chunk_ids
DELETE /acervo/documents/{id} → remove doc + chunks + graph nodes
```
The result: a question like "What's the auth middleware?" retrieves ~200 tokens of scoped context instead of ~2,500 tokens of global RAG.
A bug was fixed in the process. The old _store_embeddings() called index_file_chunks() once per chunk — but each call wiped all existing chunks for that file first. Only the last chunk survived. Replaced with _store_and_link_chunks() that stores everything in one call.
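The bug is the classic "wipe-then-insert inside a loop" mistake. A sketch of the corrected pattern, with hypothetical class and method names standing in for the real ones:

```python
# Hypothetical sketch of the corrected pattern: collect all chunks first,
# then store them in ONE call, so earlier chunks survive the wipe.
class ChunkStore:
    def __init__(self):
        self._by_file = {}

    def index_file_chunks(self, path, chunks):
        # Wipes any previous chunks for this file. Calling it once PER
        # chunk (the old bug) would keep only the last chunk.
        self._by_file[path] = list(chunks)
        return [f"{path}#{i}" for i in range(len(chunks))]

def store_and_link_chunks(store, path, chunks, node):
    """Store every chunk in one call and link the ids to the graph node."""
    node["chunk_ids"] = store.index_file_chunks(path, chunks)
    return node

store = ChunkStore()
node = store_and_link_chunks(store, "docs/arch.md", ["c1", "c2", "c3"], {})
assert node["chunk_ids"] == ["docs/arch.md#0", "docs/arch.md#1", "docs/arch.md#2"]
```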
Chunk-aware retrieval
Not every question needs chunks. "What does Beacon do?" is conceptual — the graph node summary is enough (~80 tokens). "What's the exact JWT expiry configuration?" is specific — it needs the relevant chunk.
A new specificity classifier (acervo/specificity.py) makes this decision before touching the vector store:
- Specific patterns (15): code snippets, numbers, dates, "show me", error messages, config keys
- Conceptual patterns (8): "explain", "why", "overview", "compare", "what's your opinion"
Conceptual queries skip chunk retrieval entirely. Specific queries fetch top 3 chunks from the activated node. The classifier's decision is logged per turn as query_specificity in the trace.
31 tests cover the classifier (15 specific, 12 conceptual, 4 edge cases).
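The decision logic reduces to pattern matching before any vector-store call. A toy version — these patterns are illustrative, not the 15/8 lists from `acervo/specificity.py`:

```python
# Toy specificity classifier. Patterns are illustrative stand-ins for
# the real lists in acervo/specificity.py.
import re

SPECIFIC = [r"\bshow me\b", r"\d", r"`[^`]+`", r"\berror\b", r"\bexact\b"]
CONCEPTUAL = [r"\bexplain\b", r"\bwhy\b", r"\boverview\b", r"\bcompare\b"]

def classify(query: str) -> str:
    q = query.lower()
    if any(re.search(p, q) for p in SPECIFIC):
        return "specific"      # fetch top chunks from the activated node
    if any(re.search(p, q) for p in CONCEPTUAL):
        return "conceptual"    # node summary is enough; skip chunks
    return "conceptual"        # default: skip retrieval when unsure

assert classify("What's the exact JWT expiry configuration?") == "specific"
assert classify("Explain what Beacon does") == "conceptual"
```

Defaulting to "conceptual" is the cheap failure mode: a skipped chunk fetch costs a little precision, while an unnecessary fetch costs tokens on every ambiguous turn.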
Structured trace per turn
Every turn is now fully auditable. A JSONL file at .acervo/traces/{session_id}.jsonl records:
```json
{
  "turn": 14,
  "context": {
    "tokens_total_to_llm": 475,
    "tokens_without_acervo": 8900,
    "compression_ratio": 0.053,
    "graph_nodes_in_context": 6,
    "topic": "beacon-auth-bug",
    "topic_action": "same"
  },
  "extraction": {
    "entities_extracted": 2,
    "relations_created": 3,
    "facts_added": 1
  },
  "performance": {
    "prepare_ms": 120,
    "process_ms": 350
  }
}
```
CLI and REST:
```
acervo trace show                # table: tokens, compression ratio, timing per turn
GET /acervo/traces/{id}          # all turns as JSON
GET /acervo/traces/{id}/summary  # aggregated: avg tokens, compression, timing
```
This is also what powers the benchmark reports.
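Because the trace is plain JSONL, aggregating it is a few lines of stdlib. A sketch using the field names from the example turn above (the real summary endpoint may compute more):

```python
# Sketch: aggregate a JSONL trace into a per-session summary.
# Field names come from the example trace turn shown above.
import json
from io import StringIO

def summarize(lines):
    turns = [json.loads(line) for line in lines if line.strip()]
    ratios = [t["context"]["compression_ratio"] for t in turns]
    tokens = [t["context"]["tokens_total_to_llm"] for t in turns]
    return {
        "turns": len(turns),
        "avg_compression_ratio": sum(ratios) / len(ratios),
        "avg_tokens_to_llm": sum(tokens) / len(tokens),
    }

trace = StringIO(
    '{"turn": 1, "context": {"tokens_total_to_llm": 400, "compression_ratio": 0.05}}\n'
    '{"turn": 2, "context": {"tokens_total_to_llm": 600, "compression_ratio": 0.07}}\n'
)
summary = summarize(trace)
assert summary["turns"] == 2
assert summary["avg_tokens_to_llm"] == 500.0
```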
Configurable logs
```shell
acervo up --log-level info   # one line per turn: topic, tokens, latency
acervo up -v                 # debug: entities extracted, topic decisions
acervo up --log-level trace  # full prompts + raw responses
```
INFO output looks like: `prepare done: topic=beacon-auth tokens=412 (warm=280 hot=132) 118ms`
Third-party loggers (httpcore, httpx, uvicorn, chromadb) are silenced unless you're at TRACE level.
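One way to get that behavior with stdlib `logging` — the logger names come from the text; the custom TRACE level is an assumption about how the CLI registers it:

```python
# Sketch: silence third-party loggers unless running at TRACE.
# Logger names from the text; TRACE as a custom level is an assumption.
import logging

NOISY = ["httpcore", "httpx", "uvicorn", "chromadb"]
TRACE = 5  # below logging.DEBUG (10)

def configure_logging(level: int) -> None:
    logging.basicConfig(level=level)
    if level > TRACE:
        # Anything above TRACE demotes third-party chatter to warnings.
        for name in NOISY:
            logging.getLogger(name).setLevel(logging.WARNING)

configure_logging(logging.INFO)
assert logging.getLogger("httpx").level == logging.WARNING
```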
Error handling
- LLM down → graceful degradation: extractors return empty results, proxy returns 502 with a clear error. No crash.
- Invalid extraction JSON → retry once with `temperature=0.0` before falling back to an empty result.
- Corrupt graph → `acervo graph repair` fixes missing node fields, orphan edges, and duplicate edges.
- Timeouts → configurable per phase in `config.toml`: `llm_chat`, `embedding`, `s1_unified`, `vector_search`.
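The timeout section might look like this — the four key names come from the list above, but the section name and the values here are illustrative, not Acervo's shipped defaults:

```toml
# Illustrative config.toml fragment -- section name and values are made up.
[timeouts]
llm_chat      = 60  # seconds for the main chat completion
embedding     = 10
s1_unified    = 30  # the S1 extraction pass
vector_search = 5
```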
Benchmark results: 360 turns across 6 scenarios
With the foundation solid, the test framework ran 360 turns across 6 scenarios — synthetic and real-world, technical and non-technical.
| # | Scenario | Turns | Savings | Hit Rate | Entity Recall |
|---|---|---|---|---|---|
| 1 | Developer Workflow | 50 | 68.2% | 94% | 81% |
| 2 | Literature & Comics | 50 | 78.2% | 96% | 50% |
| 3 | Academic Research | 50 | 67.3% | 98% | 46.7% |
| 4 | Mixed Domains | 50 | 79.6% | 84% | 79.2% |
| 5 | SaaS Founder | 101 | 82.5% | 96% | 46.7% |
| 6 | Product Manager | 59 | 81.0% | 94.9% | 50% |
Average: 76% token savings. 0 phantom entities across all 360 turns.
The scissors effect in the SaaS Founder scenario (101 turns, Carlos building Menuboard):
| Turn | Baseline | Acervo | Savings |
|---|---|---|---|
| 1 | 17 tk | 6 tk | 65% |
| 30 | ~500 tk | ~180 tk | 64% |
| 60 | ~1,800 tk | ~380 tk | 79% |
| 100 | 5,157 tk | 490 tk | 90.5% |
When you add realistic agent overhead (system prompt + tools + project context, typically 2,000–8,000 tokens per turn), effective savings shift to 85–93%.
On entity recall variance (46–81%): this is intentional. The fine-tuned model only extracts explicit statements, never inferences. Technical conversations name things plainly ("I'm using FastAPI", "the project is Menuboard") — recall is high. Academic papers and literature reference things obliquely — the model correctly skips them. A hallucinated node in the graph propagates forever. A missed extraction is recoverable.
Prompt variant test: before running the full suite, 4 S1 prompt variants were compared:
| Variant | Entity Recall | Phantom Entities |
|---|---|---|
| Default | 65% | 2 |
| Strict | 78% | 0 |
| Structured | 72% | 0 |
| Verbose | 58% | 0 |
Strict won. It's now the default S1 prompt.
What's next
v0.4 has one goal: document indexation that works for real — .txt, .pdf, .docx support, semantic chunking (boundary detection via cosine similarity instead of fixed size), and benchmark reports for 4 document domains: code repositories, literature, academic papers, and multi-project indexation.
After those benchmarks identify extraction failure modes by domain, fine-tune v2 targets 85% → 92%+ entity recall.
Everything is open source (Apache 2.0).
- Acervo core: https://github.com/sandyeveliz/acervo
- Extractor model (Qwen3.5-9B): https://huggingface.co/SandyVeliz/acervo-extractor-qwen3.5-9b
- Fine-tuning repo: https://github.com/sandyeveliz/acervo-models
- Chat client (Studio WIP): https://github.com/sandyeveliz/acervo-studio