For two years we have shipped a vector-store-backed retriever for technical questions over Pre-Construction documents. It is fast, simple, and — on the kinds of questions our users actually ask — frustratingly mediocre. This note describes the experiment that made us stop arguing about it: a head-to-head between embedding similarity and Graph-RAG on a corpus of forty-one thousand paragraphs that cite at least one DIN norm, evaluated on a hand-built set of 312 questions stratified by reasoning depth.
The result is not subtle. Vector retrieval wins on shallow factual lookups — what we'll call "1-hop" questions — and loses, sometimes catastrophically, on anything that requires connecting two or more entities. By the time you reach 3-hop questions ("which Mängel raised by which Bauleiter on Project X exceeded their Frist?") the embedding retriever is essentially noise.
The setup
The corpus is a curated slice of our internal Projektsteuerer-protocol archive: 41,213 paragraphs from Jour-fixe-Protokollen, Vergabevermerken, and Bauprotokollen, each tagged with at least one DIN reference (DIN 276, DIN 18960, DIN EN 1990, etc.). Paragraphs were chosen for retrieval-density — short prose with at least three named entities and at least one obligation. The graph version of the same corpus is the v0.4 schema described in entry № 024, with one node per entity mention and typed edges resolved by our entity linker.
The query set was built by two engineers and one Projektsteuerer over a week. Each query is labelled with its hop depth:
- 1-hop (96 queries): "What does DIN 18960 §4.2 require for accessibility in stairways?"
- 2-hop (124 queries): "Which defects on Bauvorhaben Riemerschmidt cite DIN 18960?"
- 3-hop (92 queries): "For each Projektsteuerer on projects in Bavaria, which trade has the most overdue Aufgaben?"
Both retrievers fed the same downstream LLM (Claude 3.5 Sonnet, temperature 0) and the same answer-evaluation pipeline: exact match for factual answers, LLM-as-judge with a rubric for analytical answers, with graders blinded to retriever identity.
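For concreteness, the scoring dispatch looks roughly like the sketch below. The `EvalItem` shape and the `judge` callable are illustrative stand-ins, not our actual harness code:

from dataclasses import dataclass

@dataclass
class EvalItem:
    question: str
    answer_type: str  # "factual" or "analytical"
    rubric: str = ""

def normalize(s: str) -> str:
    return " ".join(s.lower().split())

def evaluate(item: EvalItem, predicted: str, gold: str, judge) -> float:
    # factual answers: exact match after whitespace/case normalization
    if item.answer_type == "factual":
        return 1.0 if normalize(predicted) == normalize(gold) else 0.0
    # analytical answers: an LLM judge scores against the question's rubric,
    # blinded to which retriever produced the answer
    return judge(predicted, gold, item.rubric)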
Why the gap exists
Two failure modes account for almost all of the vector retriever's losses on multi-hop queries.
The first is "near-miss anchoring". Embeddings are extremely good at finding paragraphs that look like the query — same DIN reference, same trade vocabulary, same numerical range — and extremely bad at finding paragraphs that share an entity with the query. If you ask "which defects on Project X cite DIN 18960?", a vector retriever will happily return ten paragraphs about DIN 18960 from any project, because they read like the query. A graph retriever can constrain by the project node and return only the relevant subgraph.2
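The fix is structural, not a better embedding. A minimal sketch of the contrast, with an invented `Paragraph` shape (unit-normalized embedding plus linked entity IDs):

from dataclasses import dataclass, field
import numpy as np

@dataclass
class Paragraph:
    text: str
    embedding: np.ndarray  # unit-normalized
    entities: set = field(default_factory=set)  # linked entity IDs

def vector_topk(query_emb: np.ndarray, paragraphs, k: int = 10):
    # pure similarity: returns DIN 18960 paragraphs from *any* project
    return sorted(paragraphs, key=lambda p: -float(query_emb @ p.embedding))[:k]

def graph_constrained_topk(query_emb: np.ndarray, paragraphs, project_id: str, k: int = 10):
    # restrict to the project's subgraph first, then rank by similarity
    candidates = [p for p in paragraphs if project_id in p.entities]
    return sorted(candidates, key=lambda p: -float(query_emb @ p.embedding))[:k]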
The second is "broken chains". 3-hop questions implicitly require traversing typed relationships: Projektsteuerer → manages → Project → contains → Prüfgegenstand → assigned_to → Gewerk. Vector retrieval has no notion of typed relations; it returns paragraphs and hopes the LLM can reconstruct the chain. The LLM, in our experience, does not.
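What the graph retriever does instead is follow that chain explicitly. A toy traversal over invented triples; the real graph is the v0.4 schema, but the mechanics are the same:

from collections import defaultdict

# (source, relation, target) triples; IDs are invented for illustration
edges = [
    ("ps:huber", "manages", "proj:riemerschmidt"),
    ("proj:riemerschmidt", "contains", "pruef:0042"),
    ("pruef:0042", "assigned_to", "gewerk:rohbau"),
]

adj = defaultdict(list)
for src, rel, dst in edges:
    adj[src].append((rel, dst))

def traverse(start: str, relations: list[str]) -> list[str]:
    # follow a chain of typed relations, one hop per relation
    frontier = [start]
    for rel in relations:
        frontier = [dst for node in frontier for (r, dst) in adj[node] if r == rel]
    return frontier

print(traverse("ps:huber", ["manages", "contains", "assigned_to"]))
# -> ['gewerk:rohbau']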
"Embedding similarity is a tool for finding paragraphs that read like a query. A query is rarely a paragraph. The mismatch is why we keep needing graphs."
Where vector wins
It is fashionable, in 2026, to write graph-RAG triumph posts. We are trying to write a slightly more honest one. Vector retrieval still beats the graph in three places:
- Cold-start corpora. The graph version of the corpus required four engineer-weeks of entity linking and edge typing. Vector indexing took an afternoon.
- Out-of-schema questions. If a user asks something the schema does not know how to type — "find me poetic descriptions of bad weather in the daily protocols" — the graph retriever returns nothing. The vector retriever returns plausible candidates, even if they are wrong.
- Latency, in the small. For 1-hop queries our vector retriever responds in 80ms p50; the graph retriever in 240ms p50. For interactive use this matters.
Our production system is now hybrid. The router classifies a query by estimated hop-depth (a small fine-tuned classifier, F1 0.91 on a held-out set) and dispatches to vector for 1-hop and graph for 2+, falling back to an ensemble of both when the classifier is unsure. Below is the routing rule, which is shorter than the discussion that produced it:
def route(query: str) -> Retriever:
    depth = depth_classifier.predict(query)
    # low-confidence predictions hedge with both retrievers
    if confidence(depth) < 0.7:
        return ensemble(vector_retriever, graph_retriever)
    if depth == 1:
        return vector_retriever
    return graph_retriever  # depth >= 2
A second figure: where the gap comes from
One way to understand the difference is to plot recall as a function of the number of distinct entities in the gold answer. The vector retriever's recall falls off roughly linearly as entity count grows. The graph retriever's recall holds flat, then drops sharply at the schema boundary (entities the linker failed to resolve).
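The bucketing behind that plot is simple enough to state inline. A sketch, assuming each eval record carries the gold entity count plus retrieved and gold paragraph IDs (the record shape is illustrative):

from collections import defaultdict

def recall_by_entity_count(records):
    # records: (n_gold_entities, retrieved_ids, gold_ids) tuples
    buckets = defaultdict(list)
    for n_entities, retrieved, gold in records:
        recall = len(set(retrieved) & set(gold)) / max(len(gold), 1)
        buckets[n_entities].append(recall)
    # mean recall per entity-count bucket, sorted by x-axis position
    return {n: sum(v) / len(v) for n, v in sorted(buckets.items())}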
What we are doing next
Three open threads:
- Schema-aware embeddings. Train a sentence encoder that knows about our entity types, so that "DIN 18960" and "DIN 18960 §4.2 Treppen" embed close together but distinctly. Early results suggest this closes about a third of the 2-hop gap.
- Hybrid scoring. Instead of a hard router, score every retrieval candidate with both retrievers and combine; see the sketch after this list. Slower but more robust on edge queries.
- Better depth classifier. The current 0.91 F1 is misleading; the classifier fails predictably on questions phrased as 1-hop but requiring 2-hop traversal in the gold answer.
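For the hybrid-scoring thread, the combination rule is still open; reciprocal rank fusion is the obvious baseline, and the sketch below implements it (the constant k = 60 is the conventional default, not something we have tuned):

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # reciprocal rank fusion: each list contributes 1 / (k + rank) per candidate
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([["p7", "p2", "p9"], ["p2", "p4", "p7"]])
# -> ['p2', 'p7', 'p4', 'p9']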
The benchmark and the eval harness will be released alongside entry № 022 on schema drift. We will hold off on the dataset until our customers consent to releasing the redacted Bauprotokolle that underpin it.
— V. T., Munich, 14 April 2026.