
Search Quality Benchmarks

EchOS uses a multi-stage hybrid search pipeline. This page documents how each stage performs and why the full pipeline outperforms simpler single-mode approaches.

The Pipeline

Each search request passes through up to four stages:
| Stage | Default | Description |
| --- | --- | --- |
| Hybrid (FTS + semantic) | on | Reciprocal rank fusion of BM25 and vector search |
| Temporal decay | on | Recency boost: recent notes rank slightly higher |
| Hotness boost | on | Frequently-retrieved notes get a small popularity boost |
| Cross-encoder reranking | off (opt-in) | AI re-scores top candidates for highest precision |
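
The fusion in stage one can be sketched as follows. This is an illustrative TypeScript implementation of reciprocal rank fusion, not EchOS's internal code; the constant `k = 60` comes from the original RRF formulation and is an assumption here.

```typescript
// Sketch of reciprocal rank fusion (RRF): a document's fused score is the
// sum of 1 / (k + rank) over every ranked list it appears in, so documents
// ranked well by BOTH FTS and vector search rise to the top.
// k = 60 is the conventional constant; EchOS's actual value is not documented here.
function reciprocalRankFusion(lists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, rank) => {
      // rank is 0-based, so the contribution is 1 / (k + rank + 1)
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}
```

For example, fusing an FTS ranking `["a", "b", "c"]` with a semantic ranking `["c", "a", "d"]` puts `a` first, because it appears near the top of both lists.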

Methodology

Benchmarks use a synthetic corpus of notes spanning multiple content types (article, note, highlight, conversation) with controlled topic overlap to stress-test disambiguation. Three corpus sizes are tested:
| Scale | Notes | Description |
| --- | --- | --- |
| Small | 100 | Typical personal knowledge base, early stage |
| Medium | 1 000 | Active knowledge base, 1–2 years of use |
| Large | 10 000 | Heavy use or bulk import from Obsidian / Notion |
Query types tested (50+ queries total):
  • Keyword — exact term match, tests FTS precision
  • Semantic — paraphrased queries where the query words don’t appear in the note
  • Multi-hop — queries that require combining information from multiple related notes
  • Temporal — queries where recency matters (“what did I capture last week about X”)
  • Needle-in-haystack — single highly-specific note in a large corpus
Metrics:
  • Precision@5 — fraction of the top 5 results that are relevant
  • Recall@10 — fraction of all relevant notes appearing in the top 10
  • MRR — Mean Reciprocal Rank (position of the first correct result)
  • Median latency — wall-clock time at the medium corpus size
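
These metrics can be sketched from a ranked result list plus a set of known-relevant note IDs. This is an illustrative computation, not the benchmark's actual harness:

```typescript
// Precision@k: fraction of the top k results that are relevant.
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return ranked.slice(0, k).filter((id) => relevant.has(id)).length / k;
}

// Recall@k: fraction of ALL relevant notes that appear in the top k.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// Reciprocal rank of the first relevant result; MRR is the mean of this
// value over all queries in the suite.
function reciprocalRank(ranked: string[], relevant: Set<string>): number {
  const i = ranked.findIndex((id) => relevant.has(id));
  return i === -1 ? 0 : 1 / (i + 1);
}
```

For instance, if the relevant notes are `{b, e, z}` and search returns `[a, b, c, d, e]`, then Precision@5 is 2/5, Recall@5 is 2/3, and the reciprocal rank is 1/2 (first hit at position 2).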

Results

Medium corpus (1 000 notes) — overall

| Configuration | Precision@5 | Recall@10 | MRR | Latency |
| --- | --- | --- | --- | --- |
| Keyword only | 0.52 | 0.41 | 0.58 | 12 ms |
| Semantic only | 0.61 | 0.54 | 0.67 | 28 ms |
| Hybrid (RRF) | 0.74 | 0.69 | 0.79 | 32 ms |
| Hybrid + decay | 0.76 | 0.70 | 0.81 | 33 ms |
| Hybrid + decay + hotness | 0.78 | 0.71 | 0.82 | 34 ms |
| Hybrid + decay + hotness + rerank | 0.87 | 0.78 | 0.91 | 1 400 ms |

By query type — medium corpus, hybrid+decay+hotness

| Query type | Precision@5 | Recall@10 | MRR |
| --- | --- | --- | --- |
| Keyword | 0.91 | 0.85 | 0.94 |
| Semantic | 0.81 | 0.74 | 0.86 |
| Multi-hop | 0.64 | 0.61 | 0.71 |
| Temporal | 0.84 | 0.79 | 0.88 |
| Needle-in-haystack | 0.52 | 0.48 | 0.55 |

By corpus size — hybrid+decay+hotness

| Scale | Precision@5 | Recall@10 | MRR | Latency |
| --- | --- | --- | --- | --- |
| Small (100) | 0.82 | 0.76 | 0.86 | 14 ms |
| Medium (1 000) | 0.78 | 0.71 | 0.82 | 34 ms |
| Large (10 000) | 0.71 | 0.63 | 0.76 | 89 ms |

Key findings

  • Hybrid always beats single-mode. Reciprocal rank fusion consistently outperforms either FTS or vector search alone (by 13–22 Precision@5 points on the medium corpus). FTS wins on exact-match keyword queries; semantic search wins on paraphrased or concept-driven queries. Neither alone covers both.
  • Temporal decay gives the biggest quality-per-cost improvement: a two-point Precision@5 gain at zero additional latency. It is particularly effective for temporal queries, since notes from the past week rank above equivalent older notes, which matches user intent.
  • Hotness boost is subtle but real: a further 2 Precision@5 points on average, driven by notes the user has previously found useful surfacing sooner. The sigmoid saturation means no single note can dominate.
  • Reranking is the highest-quality option, but costs an API call: 9 Precision@5 points over the full non-rerank pipeline (hybrid + decay + hotness), and 5 points better on multi-hop queries specifically. Latency jumps from ~35 ms to ~1.4 s, so it is worth enabling for deliberate research queries and leaving off for quick lookups.
  • Needle-in-haystack is the hardest case across all configurations. A single highly specific note in 10 000 is difficult to retrieve reliably without exact keywords. Reranking helps most here (0.52 to 0.69 P@5 on the medium corpus), but the problem is not fully solved.
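
The saturating behaviour of the hotness boost can be sketched as a bounded sigmoid over a note's retrieval count. The shape follows the description above, but `maxBoost` and `midpoint` are illustrative values, not EchOS's actual parameters:

```typescript
// Sketch of a saturating hotness boost: the retrieval count passes through
// a logistic sigmoid, so the boost grows with use but never exceeds
// maxBoost. The midpoint is the count at which half the boost is earned.
// Both parameters are assumptions for illustration.
function hotnessBoost(retrievals: number, maxBoost = 0.1, midpoint = 5): number {
  return maxBoost / (1 + Math.exp(-(retrievals - midpoint)));
}
```

Because the curve flattens out, a note retrieved 100 times gets essentially the same boost as one retrieved 20 times, which is what prevents any single popular note from dominating results.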

Limitations

The benchmark uses a synthetic corpus. Real knowledge bases differ in several ways that can affect results:
  • Personal writing style — real notes from a single author are stylistically more consistent than a synthetic corpus, so semantic search typically performs better in practice than these numbers suggest.
  • Query distribution — your actual queries may be more keyword-heavy or more semantic. Check which query types match your usage pattern.
  • Corpus structure — if you import heavily from a single source (e.g. all your highlights from one book), note similarity is higher and disambiguation becomes harder.
  • Hotness requires warm data — the hotness boost is zero on fresh imports. It only improves as you actually use search over time.

How to reproduce

The benchmark requires a local environment with the full development stack running.
# Install dependencies
pnpm install

# Run the full benchmark suite (all scales × configurations)
pnpm bench:search

# Results are written to:
#   benchmarks/search/results/<timestamp>.json  — raw metrics
#   benchmarks/search/RESULTS.md               — human-readable comparison tables
Options:
# Run a single corpus scale
BENCH_SCALE=small pnpm bench:search

# Skip the reranking configuration (avoids API cost during CI)
BENCH_SKIP_RERANK=true pnpm bench:search

# Use a different decay half-life for the decay stage
BENCH_HALF_LIFE_DAYS=30 pnpm bench:search
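Assuming the decay stage applies simple exponential decay, the half-life setting maps to a multiplicative score factor like this (a sketch; `decayFactor` is a hypothetical name, not the benchmark's API):

```typescript
// Sketch of half-life recency decay: a note's relevance score is
// multiplied by 0.5^(ageDays / halfLifeDays). With a 30-day half-life,
// a 30-day-old note scores half of an otherwise identical fresh note.
function decayFactor(ageDays: number, halfLifeDays = 30): number {
  return Math.pow(0.5, ageDays / halfLifeDays);
}
```

A shorter half-life favours recent notes more aggressively; a longer one makes the decay stage closer to a no-op.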
The corpus is generated deterministically — the same seed always produces the same notes and queries, so results are reproducible across runs.
Reranking benchmarks use Claude Haiku to score candidates. Running the full suite including rerank makes ~50 API calls and costs approximately $0.02–$0.05 at current Haiku pricing.