Alago R&D · 14 Apr 2026
Method № 023 · Benchmark

Graph-RAG beats vector retrieval on technical specs.

On 41,000 DIN-cited paragraphs from a Projektsteuerer's (project controller's) Pre-Con archive, structured retrieval beats embedding similarity by up to 31 F1 points on multi-hop queries — and the gap widens with question depth.

Author: Vinzenz Trimborn
Published: 14 Apr 2026
Read: 9 min
Topic: Retrieval · Eval

For two years we have shipped a vector-store-backed retriever for technical questions over Pre-Construction documents. It is fast, simple, and — on the kinds of questions our users actually ask — frustratingly mediocre. This note describes the experiment that made us stop arguing about it: a head-to-head between embedding similarity and Graph-RAG on a corpus of forty-one thousand paragraphs that cite at least one DIN norm, evaluated on a hand-built set of 312 questions stratified by reasoning depth.1

The result is not subtle. Vector retrieval wins on shallow factual lookups — what we'll call "1-hop" questions — and loses, sometimes catastrophically, on anything that requires connecting two or more entities. By the time you reach 3-hop questions ("which Mängel (defects) raised by which Bauleiter (site manager) on Project X exceeded their Frist (deadline)?") the embedding retriever is essentially noise.

The setup

The corpus is a curated slice of our internal Projektsteuerer-protocol archive: 41,213 paragraphs from Jour-fixe-Protokolle (recurring-meeting minutes), Vergabevermerke (award memos), and Bauprotokolle (site protocols), each tagged with at least one DIN reference (DIN 276, DIN 18960, DIN EN 1990, etc.). Paragraphs were chosen for retrieval density — short prose with at least three named entities and at least one obligation. The graph version of the same corpus is the v0.4 schema described in entry № 024, with one node per entity mention and typed edges resolved by our entity linker.
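
For readers who think in code, the shape (not the content) of that graph is roughly the following. Every name here is invented for illustration; the real v0.4 schema is the subject of entry № 024.

from dataclasses import dataclass, field

# Hypothetical sketch of the graph corpus: one node per entity mention,
# typed edges resolved by the entity linker, each edge keeping provenance
# back to the paragraph it was extracted from.

@dataclass(frozen=True)
class Node:
    id: str      # e.g. "project:x", "din:18960", "gewerk:rohbau"
    type: str    # e.g. "Project", "DINNorm", "Mangel", "Gewerk"
    label: str   # surface form of the mention

@dataclass(frozen=True)
class Edge:
    source: str        # Node.id
    relation: str      # typed relation, e.g. "manages", "cites", "assigned_to"
    target: str        # Node.id
    paragraph_id: str  # provenance: the paragraph this edge came from

@dataclass
class Graph:
    nodes: dict[str, Node] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

    def out(self, node_id: str, relation: str) -> set[str]:
        """All node ids reachable from node_id via one edge of this type."""
        return {e.target for e in self.edges
                if e.source == node_id and e.relation == relation}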

The query set was built by two engineers and one Projektsteuerer over a week. Each query is labelled with its hop depth:

  1. 1-hop: answerable from a single paragraph; a shallow factual lookup.
  2. 2-hop: requires connecting two entities across paragraphs.
  3. 3-hop: requires traversing a chain of three or more typed relations, like the Mängel example above.

Both retrievers fed the same downstream LLM (Claude 3.5 Sonnet, temperature 0) and the same answer-evaluation pipeline: exact match for factual answers, LLM-as-judge with a rubric for analytical answers, with graders blinded to retriever identity.
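
A minimal sketch of that split, with helper names that are ours rather than the harness's; llm_judge stands in for whatever rubric-scored LLM call is available.

from typing import Callable

def make_evaluator(llm_judge: Callable[[str, str], float]):
    """Build the per-answer scorer. llm_judge(answer, rubric) returns a
    score in [0, 1]; graders never see which retriever produced the answer."""
    def evaluate(answer: str, gold: dict) -> float:
        if gold["kind"] == "factual":
            # Factual answers: exact match after light normalisation.
            return float(answer.strip().lower() == gold["answer"].strip().lower())
        # Analytical answers: LLM-as-judge against a written rubric.
        return llm_judge(answer, gold["rubric"])
    return evaluate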

FIG. F1 BY HOP DEPTH · N=312 QUERIES · HIGHER IS BETTER

           VECTOR (e5-mistral, top-12)   GRAPH-RAG (3-hop, β=0.6)
  1-HOP    .78                           .81
  2-HOP    .42                           .66
  3-HOP    .18                           .49
FIG. 01 Retrieval F1 by query hop-depth, 312 queries, three retrievers (only two shown). Vector and graph are within noise on 1-hop. The gap on 3-hop is the entire reason we are writing this note.

Why the gap exists

Two failure modes account for almost all of the vector retriever's losses on multi-hop queries.

The first is "near-miss anchoring". Embeddings are extremely good at finding paragraphs that look like the query — same DIN reference, same trade vocabulary, same numerical range — and extremely bad at finding paragraphs that share an entity with the query. If you ask "which defects on Project X cite DIN 18960?", a vector retriever will happily return ten paragraphs about DIN 18960 from any project, because they read like the query. A graph retriever can constrain by the project node and return only the relevant subgraph.2

The second is "broken chains". 3-hop questions implicitly require traversing typed relationships: Projektsteuerer → manages → Project → contains → Prüfgegenstand → assigned_to → Gewerk. Vector retrieval has no notion of typed relations; it returns paragraphs and hopes the LLM can reconstruct the chain. The LLM, in our experience, does not.

"Embedding similarity is a tool for finding paragraphs that read like a query. A query is rarely a paragraph. The mismatch is why we keep needing graphs."

Where vector wins

It is fashionable, in 2026, to write graph-RAG triumph posts. We are trying to write a slightly more honest one. Vector retrieval still beats the graph in three places:

  1. Cold-start corpora. The graph version of the corpus required four engineer-weeks of entity linking and edge typing. Vector indexing took an afternoon.
  2. Out-of-schema questions. If a user asks something the schema does not know how to type — "find me poetic descriptions of bad weather in the daily protocols" — the graph retriever returns nothing. The vector retriever returns plausible candidates, even if they are wrong.
  3. Latency, in the small. For 1-hop queries our vector retriever responds in 80ms p50; the graph retriever in 240ms p50. For interactive use this matters.

Our production system is now hybrid. The router classifies a query by estimated hop-depth (a small fine-tuned classifier, F1 0.91 on a held-out set), and dispatches to vector for 1-hop and graph for 2+. Below is the routing rule, which is shorter than the discussion that produced it:

def route(query: str) -> Retriever:
    depth = depth_classifier.predict(query)
    # Low-confidence depth estimates fall back to an ensemble of both
    # retrievers; this check must come before the depth branches, or the
    # early returns make it unreachable.
    if confidence(depth) < 0.7:
        return ensemble(vector_retriever, graph_retriever)
    if depth == 1:
        return vector_retriever
    return graph_retriever  # depth >= 2

A second figure: where the gap comes from

One way to understand the difference is to plot recall as a function of the number of distinct entities in the gold answer. The vector retriever's recall collapses linearly. The graph retriever's recall holds, then drops sharply at the schema boundary (entities the linker failed on).

[FIG. RECALL VS. ENTITY COUNT · N=312 · k=10 · x-axis: distinct entities in gold answer (1, 2, 3, 4, 5+) · series: VECTOR, GRAPH]
FIG. 02 Recall@10 as a function of how many distinct entities the gold answer contains. The vector retriever's failure is mechanical: each new entity adds an independent chance of missing.
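
The caption's "mechanical" claim has a one-line toy model: if each required entity is surfaced independently with probability p, recall of a k-entity answer decays as p to the power k. The numbers below are illustrative, not fitted to the measured curve.

# Toy independence model. p is set near the vector retriever's 1-hop
# score purely for illustration; nothing here is fitted to FIG. 02.
p = 0.78
for k in range(1, 6):
    print(k, round(p ** k, 2))
# 1 0.78, 2 0.61, 3 0.47, 4 0.37, 5 0.29: a geometric decay that looks
# near-linear over the plotted range, while a graph traversal either
# resolves the chain or fails outright at the schema boundary.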

What we are doing next

Three open threads:

The benchmark and the eval harness will be released alongside entry № 022 on schema drift. We will hold off on the dataset until our customers consent to releasing the redacted Bauprotokolle that underpin it.

— V. T., Munich, 14 April 2026.

Notes

  1. The "Graph-RAG" name is Microsoft's; the technique we use is closer to LangChain's GraphCypherQAChain in spirit, but the retriever is our own and is not based on Cypher.
  2. This is also the reason vector retrievers benchmark so well on academic QA datasets — those datasets are dominated by 1-hop questions, where embeddings shine. Real users almost never ask 1-hop questions; they ask 2-hop questions in 1-hop language.
Author · Vinzenz Trimborn

Co-founder of Alago. Writes about ontologies, retrieval, and the messy business of modelling Pre-Construction work.
