Acervo v0.3 — Proof it works
By Sandy Veliz (@sandy_veliz)
v0.2 proved the architecture worked. v0.3's goal was simpler and harder: prove it to anyone who installs it in 5 minutes.
Everything in this release answers the same question: can someone who isn't me try it, understand it, and verify it's better than what they already have?
Six milestones shipped. Here's what they are.
The bug that had to die first
Before anything else could matter, one thing had to be fixed.
When the graph had no relevant context for a turn — early in a conversation, or during unrelated small talk — the system was falling back to the full conversation history. Every single message, sent to the LLM on every turn. This completely bypassed the sliding window and made token counts grow linearly — the exact problem Acervo exists to prevent.
The fix: enforce the history window always, with or without graph context. If the graph is empty, send the last 2 turn pairs. That's it. (The Anthropic API path had no windowing at all — that got fixed too.)
This single change moved average token savings from ~55% to ~76% across all scenarios.
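The fix can be sketched in a few lines. This is an illustrative sketch, not Acervo's actual code — the function and constant names are assumptions; the point is that the window slice applies on every turn, graph context or not.

```python
# Hypothetical sketch of the fix: the history window is enforced on every
# turn, whether or not the graph produced context. Names are illustrative.
WINDOW_TURN_PAIRS = 2  # last N user/assistant pairs when the graph is empty

def build_history(messages, graph_context):
    """Return the message slice sent to the LLM for this turn."""
    # A turn pair is one user message plus one assistant reply.
    window = messages[-(WINDOW_TURN_PAIRS * 2):]
    if graph_context:
        # Graph context rides alongside the same small window --
        # it never re-opens the full history.
        return graph_context + window
    return window

history = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
    {"role": "user", "content": "tell me about Beacon"},
    {"role": "assistant", "content": "Beacon is..."},
    {"role": "user", "content": "and its auth?"},
]
# With an empty graph, only the last 2 turn pairs (4 messages) are sent.
assert len(build_history(history, graph_context=[])) == 4
```

The old behavior was equivalent to returning `messages` unchanged whenever `graph_context` was empty — which is why token counts grew linearly.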
acervo up
Running Acervo used to require 4 terminals: Ollama, LM Studio, backend, frontend. For a new user that's a setup failure waiting to happen.
```shell
acervo up       # proxy + health check banner
acervo up --dev # everything: proxy + studio + web + ollama, multiplexed logs
acervo status   # health check all dependencies
```
acervo up --dev starts all services in a single terminal with tagged log output ([proxy], [studio], [web], [ollama]). Ctrl+C stops everything. No Docker, no docker-compose, no PID files — just a process manager in stdlib Python.
Service detection is automatic: Ollama binary via shutil.which, LM Studio via HTTP health check on port 1234, Acervo Studio via config path → importlib.find_spec → sibling directory scan.
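The first two detection steps are plain stdlib calls. A minimal sketch, assuming the endpoint path and function names (Acervo's actual probe may differ):

```python
# Illustrative sketch of the detection steps described above.
import shutil
import urllib.request

def detect_ollama() -> bool:
    """Ollama counts as available if its binary is on PATH."""
    return shutil.which("ollama") is not None

def detect_lm_studio(port: int = 1234, timeout: float = 0.5) -> bool:
    """LM Studio is probed with a plain HTTP request on its default port.
    The /v1/models path is an assumption (LM Studio's OpenAI-style API)."""
    try:
        urllib.request.urlopen(f"http://localhost:{port}/v1/models", timeout=timeout)
        return True
    except OSError:
        return False
```

Both checks fail soft: a missing binary or a refused connection just means the service shows as down in the `acervo status` banner.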
Graph inspection
If the extractor creates a wrong node — wrong type, or a merge you didn't ask for — there was previously no way to fix it without wiping the entire graph.
```shell
acervo graph show                 # table: ID, label, type, kind, layer, facts, edges
acervo graph show <id>            # full detail: facts with source/date, all edges
acervo graph search "beacon auth" # search labels and fact content
acervo graph delete <id>          # delete node + edges (with confirmation)
acervo graph merge <id1> <id2>    # merge two nodes (preview + confirmation)
acervo graph repair               # fix missing fields, orphan edges, duplicates
```
All operations are also available as REST endpoints for use from Acervo Studio. Every destructive command requires confirmation or --yes to bypass.
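To make the `repair` semantics concrete, here is a toy version of two of its checks — orphan edges and duplicates. The data shapes are assumptions for illustration, not Acervo's internal representation:

```python
# Minimal sketch of the kind of checks `acervo graph repair` performs.
# The node/edge shapes here are illustrative, not Acervo's internals.
def repair_edges(nodes: dict, edges: list) -> list:
    """Drop edges whose endpoints no longer exist, and deduplicate."""
    seen, kept = set(), []
    for edge in edges:
        key = (edge["src"], edge["dst"], edge.get("type"))
        if edge["src"] in nodes and edge["dst"] in nodes and key not in seen:
            seen.add(key)
            kept.append(edge)
    return kept

nodes = {"a": {}, "b": {}}
edges = [
    {"src": "a", "dst": "b", "type": "uses"},
    {"src": "a", "dst": "b", "type": "uses"},    # duplicate
    {"src": "a", "dst": "ghost", "type": "ref"}, # orphan endpoint
]
assert repair_edges(nodes, edges) == [{"src": "a", "dst": "b", "type": "uses"}]
```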
Document ingestion — chunks linked to the graph
This is the first real step toward "knowledge beyond conversation."
The problem with standard RAG: chunks live in a vector store, isolated from the knowledge graph. When a graph node is activated, there's no connection to the document chunks that support it.
What Acervo does instead: when you index a .md file, the chunks are embedded into ChromaDB and linked to the corresponding graph node via a chunk_ids field. When that node gets activated during a conversation turn, retrieval is scoped to its own chunks — not the global vector store.
```shell
acervo index --path ./docs/architecture.md
```
Or via the REST API (for Studio):
```
POST   /acervo/documents      → upload + index
GET    /acervo/documents      → list indexed documents
GET    /acervo/documents/{id} → detail with chunk_ids
DELETE /acervo/documents/{id} → remove doc + chunks + graph nodes
```
The result: a question like "What's the auth middleware?" retrieves ~200 tokens of scoped context instead of ~2,500 tokens of global RAG.
A bug was fixed in the process. The old _store_embeddings() called index_file_chunks() once per chunk — but each call wiped all existing chunks for that file first. Only the last chunk survived. Replaced with _store_and_link_chunks() that stores everything in one call.
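The bug is the classic "wipe-then-insert inside a loop" mistake. A sketch of the corrected pattern, with hypothetical class and method names standing in for the real ones:

```python
# Hypothetical sketch of the corrected pattern: collect all chunks first,
# then store them in ONE call, so earlier chunks survive the wipe.
class ChunkStore:
    def __init__(self):
        self._by_file = {}

    def index_file_chunks(self, path, chunks):
        # Wipes any previous chunks for this file. Calling it once PER
        # chunk (the old bug) would keep only the last chunk.
        self._by_file[path] = list(chunks)
        return [f"{path}#{i}" for i in range(len(chunks))]

def store_and_link_chunks(store, path, chunks, node):
    """Store every chunk in one call and link the ids to the graph node."""
    node["chunk_ids"] = store.index_file_chunks(path, chunks)
    return node

store = ChunkStore()
node = store_and_link_chunks(store, "docs/arch.md", ["c1", "c2", "c3"], {})
assert node["chunk_ids"] == ["docs/arch.md#0", "docs/arch.md#1", "docs/arch.md#2"]
```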
Chunk-aware retrieval
Not every question needs chunks. "What does Beacon do?" is conceptual — the graph node summary is enough (~80 tokens). "What's the exact JWT expiry configuration?" is specific — it needs the relevant chunk.
A new specificity classifier (acervo/specificity.py) makes this decision before touching the vector store:
- Specific patterns (15): code snippets, numbers, dates, "show me", error messages, config keys
- Conceptual patterns (8): "explain", "why", "overview", "compare", "what's your opinion"
Conceptual queries skip chunk retrieval entirely. Specific queries fetch top 3 chunks from the activated node. The classifier's decision is logged per turn as query_specificity in the trace.
31 tests cover the classifier (15 specific, 12 conceptual, 4 edge cases).
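The decision logic reduces to pattern matching before any vector-store call. A toy version — these patterns are illustrative, not the 15/8 lists from `acervo/specificity.py`:

```python
# Toy specificity classifier. Patterns are illustrative stand-ins for
# the real lists in acervo/specificity.py.
import re

SPECIFIC = [r"\bshow me\b", r"\d", r"`[^`]+`", r"\berror\b", r"\bexact\b"]
CONCEPTUAL = [r"\bexplain\b", r"\bwhy\b", r"\boverview\b", r"\bcompare\b"]

def classify(query: str) -> str:
    q = query.lower()
    if any(re.search(p, q) for p in SPECIFIC):
        return "specific"      # fetch top chunks from the activated node
    if any(re.search(p, q) for p in CONCEPTUAL):
        return "conceptual"    # node summary is enough; skip chunks
    return "conceptual"        # default: skip retrieval when unsure

assert classify("What's the exact JWT expiry configuration?") == "specific"
assert classify("Explain what Beacon does") == "conceptual"
```

Defaulting to "conceptual" is the cheap failure mode: a skipped chunk fetch costs a little precision, while an unnecessary fetch costs tokens on every ambiguous turn.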
Structured trace per turn
Every turn is now fully auditable. A JSONL file at .acervo/traces/{session_id}.jsonl records:
```json
{
  "turn": 14,
  "context": {
    "tokens_total_to_llm": 475,
    "tokens_without_acervo": 8900,
    "compression_ratio": 0.053,
    "graph_nodes_in_context": 6,
    "topic": "beacon-auth-bug",
    "topic_action": "same"
  },
  "extraction": {
    "entities_extracted": 2,
    "relations_created": 3,
    "facts_added": 1
  },
  "performance": {
    "prepare_ms": 120,
    "process_ms": 350
  }
}
```
CLI and REST:
```
acervo trace show                # table: tokens, compression ratio, timing per turn
GET /acervo/traces/{id}          # all turns as JSON
GET /acervo/traces/{id}/summary  # aggregated: avg tokens, compression, timing
```
This is also what powers the benchmark reports.
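Because the trace is plain JSONL, aggregating it is a few lines of stdlib. A sketch using the field names from the example turn above (the real summary endpoint may compute more):

```python
# Sketch: aggregate a JSONL trace into a per-session summary.
# Field names come from the example trace turn shown above.
import json
from io import StringIO

def summarize(lines):
    turns = [json.loads(line) for line in lines if line.strip()]
    ratios = [t["context"]["compression_ratio"] for t in turns]
    tokens = [t["context"]["tokens_total_to_llm"] for t in turns]
    return {
        "turns": len(turns),
        "avg_compression_ratio": sum(ratios) / len(ratios),
        "avg_tokens_to_llm": sum(tokens) / len(tokens),
    }

trace = StringIO(
    '{"turn": 1, "context": {"tokens_total_to_llm": 400, "compression_ratio": 0.05}}\n'
    '{"turn": 2, "context": {"tokens_total_to_llm": 600, "compression_ratio": 0.07}}\n'
)
summary = summarize(trace)
assert summary["turns"] == 2
assert summary["avg_tokens_to_llm"] == 500.0
```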
Configurable logs
```shell
acervo up --log-level info   # one line per turn: topic, tokens, latency
acervo up -v                 # debug: entities extracted, topic decisions
acervo up --log-level trace  # full prompts + raw responses
```
INFO output looks like: `prepare done: topic=beacon-auth tokens=412 (warm=280 hot=132) 118ms`
Third-party loggers (httpcore, httpx, uvicorn, chromadb) are silenced unless you're at TRACE level.
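One way to get that behavior with stdlib `logging` — the logger names come from the text; the custom TRACE level is an assumption about how the CLI registers it:

```python
# Sketch: silence third-party loggers unless running at TRACE.
# Logger names from the text; TRACE as a custom level is an assumption.
import logging

NOISY = ["httpcore", "httpx", "uvicorn", "chromadb"]
TRACE = 5  # below logging.DEBUG (10)

def configure_logging(level: int) -> None:
    logging.basicConfig(level=level)
    if level > TRACE:
        # Anything above TRACE demotes third-party chatter to warnings.
        for name in NOISY:
            logging.getLogger(name).setLevel(logging.WARNING)

configure_logging(logging.INFO)
assert logging.getLogger("httpx").level == logging.WARNING
```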
Error handling
- LLM down → graceful degradation: extractors return empty results, proxy returns 502 with a clear error. No crash.
- Invalid extraction JSON → retry once with `temperature=0.0` before falling back to an empty result.
- Corrupt graph → `acervo graph repair` fixes missing node fields, orphan edges, and duplicate edges.
- Timeouts → configurable per phase in `config.toml`: `llm_chat`, `embedding`, `s1_unified`, `vector_search`.
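The timeout section might look like this — the four key names come from the list above, but the section name and the values here are illustrative, not Acervo's shipped defaults:

```toml
# Illustrative config.toml fragment -- section name and values are made up.
[timeouts]
llm_chat      = 60  # seconds for the main chat completion
embedding     = 10
s1_unified    = 30  # the S1 extraction pass
vector_search = 5
```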
Benchmark results: 360 turns across 6 scenarios
With the foundation solid, the test framework ran 360 turns across 6 scenarios — synthetic and real-world, technical and non-technical.
| # | Scenario | Turns | Savings | Hit Rate | Entity Recall |
|---|---|---|---|---|---|
| 1 | Developer Workflow | 50 | 68.2% | 94% | 81% |
| 2 | Literature & Comics | 50 | 78.2% | 96% | 50% |
| 3 | Academic Research | 50 | 67.3% | 98% | 46.7% |
| 4 | Mixed Domains | 50 | 79.6% | 84% | 79.2% |
| 5 | SaaS Founder | 101 | 82.5% | 96% | 46.7% |
| 6 | Product Manager | 59 | 81.0% | 94.9% | 50% |
Average: 76% token savings. 0 phantom entities across all 360 turns.
The scissors effect in the SaaS Founder scenario (101 turns, Carlos building Menuboard):
| Turn | Baseline | Acervo | Savings |
|---|---|---|---|
| 1 | 17 tk | 6 tk | 65% |
| 30 | ~500 tk | ~180 tk | 64% |
| 60 | ~1,800 tk | ~380 tk | 79% |
| 100 | 5,157 tk | 490 tk | 90.5% |
When you add realistic agent overhead (system prompt + tools + project context, typically 2,000–8,000 tokens per turn), effective savings shift to 85–93%.
On entity recall variance (46–81%): this is intentional. The fine-tuned model only extracts explicit statements, never inferences. Technical conversations name things plainly ("I'm using FastAPI", "the project is Menuboard") — recall is high. Academic papers and literature reference things obliquely — the model correctly skips them. A hallucinated node in the graph propagates forever. A missed extraction is recoverable.
Prompt variant test: before running the full suite, 4 S1 prompt variants were compared:
| Variant | Entity Recall | Phantom Entities |
|---|---|---|
| Default | 65% | 2 |
| Strict | 78% | 0 |
| Structured | 72% | 0 |
| Verbose | 58% | 0 |
Strict won. It's now the default S1 prompt.
What's next
v0.4 has one goal: document indexation that works for real — .txt, .pdf, .docx support, semantic chunking (boundary detection via cosine similarity instead of fixed size), and benchmark reports for 4 document domains: code repositories, literature, academic papers, and multi-project indexation.
After those benchmarks identify extraction failure modes by domain, fine-tune v2 targets 85% → 92%+ entity recall.
Everything is open source (Apache 2.0).
- Acervo core: https://github.com/sandyeveliz/acervo
- Extractor model (Qwen3.5-9B): https://huggingface.co/SandyVeliz/acervo-extractor-qwen3.5-9b
- Fine-tuning repo: https://github.com/sandyeveliz/acervo-models
- Chat client (Studio WIP): https://github.com/sandyeveliz/acervo-studio