Embedding Pipeline v2 improves retrieval quality
Our new embedding pipeline improves retrieval quality by 18–24% (MRR@10) and cuts false-positive matches on near-duplicate pages by 31% across heterogeneous enterprise datasets.
We rebuilt our embedding stack to make retrieval meaningfully better for real teams - not just synthetic benchmarks.
Why we did this
Teams store mixed content (docs, tickets, chats, PDFs, images). Traditional embeddings flatten nuance:
- "Reset device" might mean a firmware command in Support, but a full wipe in IT.
- “Spec” could be a hardware drawing or a product brief, depending on team. Our goal: consistent relevance across domains without hand-tuning.
What changed in v2
- Domain-adaptive training: We pretrain on a diverse corpus, then adapt on domain slices (support, eng, legal, finance). This reduces term hijacking (e.g., “issue” = GitHub vs. invoice).
- Robust normalization: Lightweight text normalization handles boilerplate, signatures, tables, and OCR noise. We treat headings and bullets as structure, not clutter (first sketch below).
- Context windows: We embed both passages and their parents (section/page). Queries can match a chunk directly or inherit signal from the parent, so meaning isn't cut off at chunk boundaries (second sketch below).
- Hard negative mining: We deliberately train on “looks similar but wrong” pairs (model cards vs. policy cards; version docs vs. release notes) to reduce near-misses (third sketch below).
- Cross-asset alignment: Captions from images/diagrams and alt-text from PDFs land in the same space as text, so references like “the blue wiring diagram” become searchable.
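To make the normalization step concrete, here is a minimal Python sketch of the kind of cleanup described above. The patterns and the normalize helper are illustrative assumptions, not our production rules:

```python
import re

# Illustrative patterns only; the production pipeline uses a broader rule set.
SIG_RE = re.compile(r"\n--\s*\n.*\Z", re.DOTALL)                       # trailing email signature
LEGAL_RE = re.compile(r"(?i)confidentiality notice:.*\Z", re.DOTALL)   # legal boilerplate

def normalize(text: str) -> str:
    text = SIG_RE.sub("", text)                    # drop signatures
    text = LEGAL_RE.sub("", text)                  # drop boilerplate
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # mend OCR hyphenation across line breaks
    # Keep headings and bullets: structure is signal, not clutter.
    return "\n".join(ln.rstrip() for ln in text.splitlines() if ln.strip())
```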
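The parent-inheritance idea reduces to a score blend at query time. A minimal sketch, where embedding vectors are stood in by plain NumPy arrays and alpha is an assumed mixing weight:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def blended_score(query_vec, chunk_vec, parent_vec, alpha: float = 0.7) -> float:
    # A chunk cut mid-thought can still match through its parent section/page.
    return alpha * cosine(query_vec, chunk_vec) + (1 - alpha) * cosine(query_vec, parent_vec)
```

In practice you would precompute each section's parent vector once and reuse it across all of its chunks.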
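And hard negative mining, in training terms, means making the true match beat its look-alikes under a contrastive loss. A sketch assuming unit-normalized vectors and an arbitrary temperature:

```python
import numpy as np

def contrastive_loss(query, positive, hard_negatives, temperature=0.05):
    # query, positive: shape (d,); hard_negatives: shape (k, d) holding
    # "looks similar but wrong" examples, e.g. a policy card mined
    # against a model card.
    sims = np.concatenate(([query @ positive], hard_negatives @ query)) / temperature
    sims -= sims.max()                        # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])                  # loss falls as the positive ranks first
```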
Results
Across our internal evals:
- +18–24% MRR@10 on heterogeneous corpora (docs + chats + PDFs)
- −31% false-positive matches on near-duplicate pages
- +22% first-click success in human trials (search → open → useful)
Upgrading
- No schema changes required.
- Re-embed only the collections you search most; mixed indexes support v1 + v2 during transition.
- API: set model: "embed-v2". Clients using nearest_k and hybrid: true work unchanged (example below).
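For illustration, an upgraded request might look like the sketch below. The endpoint URL and auth header are placeholders; only the model, nearest_k, and hybrid parameters come from the notes above:

```python
import requests

# Hypothetical endpoint and token; substitute your workspace's values.
resp = requests.post(
    "https://api.example.com/v1/search",
    headers={"Authorization": "Bearer <TOKEN>"},
    json={
        "model": "embed-v2",  # the only change needed to opt in
        "query": "blue wiring diagram",
        "nearest_k": 50,      # existing clients keep this as-is
        "hybrid": True,       # lexical + vector, unchanged
    },
)
print(resp.json())
```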
Good defaults
- Chunking: 500–900 tokens with 120-token overlap for manuals; 300–500 for chats (first sketch below).
- Hybrid search: lexical (BM25) + vector with reciprocal rank fusion (second sketch below).
- Rerank: lightweight cross-encoder on the top 50 results; latency < 50 ms on the standard tier.
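The chunking default is a plain sliding window. A minimal sketch, where tokens is any pre-tokenized list and the numbers echo the defaults above:

```python
def chunk(tokens, size=700, overlap=120):
    # Manuals: 500-900 tokens with 120-token overlap; drop size toward
    # 300-500 for chats.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```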
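Reciprocal rank fusion is small enough to show whole. A sketch assuming each input is a ranked list of doc IDs and the conventional k = 60 constant:

```python
from collections import defaultdict

def rrf(rankings, k=60):
    # Each list contributes 1/(k + rank) per document, so items ranked
    # well by either BM25 or the vector index float to the top.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([["d3", "d1", "d7"], ["d1", "d9", "d3"]])  # -> ["d1", "d3", "d9", "d7"]
```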
What’s next
- Incremental re-embedding when large files change
- Live schema hints from usage analytics to improve ranking
- Guardrail features that down-weight stale policy docs automatically
v2 is live for all new workspaces. If you’ve got a gnarly corpus, we’d love to see it - these edge cases are how we make the model better for everyone.