Embedding Pipeline v2 improves retrieval quality
Our new embedding pipeline improves retrieval quality by 18–24% (MRR@10) and cuts false-positive matches on near-duplicate pages by 31% across heterogeneous enterprise datasets.
We rebuilt our embedding stack to make retrieval meaningfully better for real teams - not just synthetic benchmarks.
Why we did this
Teams store mixed content (docs, tickets, chats, PDFs, images). Traditional embeddings flatten nuance:
- "Reset device" might mean a firmware command in Support, but a full wipe in IT.
- “Spec” could be a hardware drawing or a product brief, depending on team. Our goal: consistent relevance across domains without hand-tuning.
What changed in v2
- Domain-adaptive training: We pretrain on a diverse corpus, then adapt on domain slices (support, eng, legal, finance). This reduces term hijacking (e.g., “issue” = GitHub vs. invoice).
- Robust normalization: Lightweight text normalization handles boilerplate, signatures, tables, and OCR noise. We treat headings and bullets as structure, not clutter (first sketch below).
- Context windows: We embed both passages and their parents (section/page). Queries can match a chunk directly or inherit signal from the parent, so meaning isn't cut off at chunk boundaries (second sketch below).
- Hard negative mining: We deliberately train on “looks similar but wrong” pairs (model cards vs. policy cards; version docs vs. release notes) to reduce near-misses (third sketch below).
- Cross-asset alignment: Captions from images/diagrams and alt-text from PDFs land in the same space as text, so references like “the blue wiring diagram” become searchable.
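To make the normalization step concrete, here is a minimal Python sketch of the kind of cleanup described above. The patterns and the normalize helper are illustrative assumptions, not our production rules:

```python
import re

# Illustrative patterns only; the production pipeline uses a broader rule set.
SIG_RE = re.compile(r"\n--\s*\n.*\Z", re.DOTALL)                       # trailing email signature
LEGAL_RE = re.compile(r"(?i)confidentiality notice:.*\Z", re.DOTALL)   # legal boilerplate

def normalize(text: str) -> str:
    text = SIG_RE.sub("", text)                    # drop signatures
    text = LEGAL_RE.sub("", text)                  # drop boilerplate
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # mend OCR hyphenation across line breaks
    # Keep headings and bullets: structure is signal, not clutter.
    return "\n".join(ln.rstrip() for ln in text.splitlines() if ln.strip())
```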
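The parent-inheritance idea reduces to a score blend at query time. A minimal sketch, where embedding vectors are stood in by plain NumPy arrays and alpha is an assumed mixing weight:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def blended_score(query_vec, chunk_vec, parent_vec, alpha: float = 0.7) -> float:
    # A chunk cut mid-thought can still match through its parent section/page.
    return alpha * cosine(query_vec, chunk_vec) + (1 - alpha) * cosine(query_vec, parent_vec)
```

In practice you would precompute each section's parent vector once and reuse it across all of its chunks.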
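And hard negative mining, in training terms, means making the true match beat its look-alikes under a contrastive loss. A sketch assuming unit-normalized vectors and an arbitrary temperature:

```python
import numpy as np

def contrastive_loss(query, positive, hard_negatives, temperature=0.05):
    # query, positive: shape (d,); hard_negatives: shape (k, d) holding
    # "looks similar but wrong" examples, e.g. a policy card mined
    # against a model card.
    sims = np.concatenate(([query @ positive], hard_negatives @ query)) / temperature
    sims -= sims.max()                        # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])                  # loss falls as the positive ranks first
```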
Results
Across our internal evals:
- +18–24% MRR@10 on heterogeneous corpora (docs + chats + PDFs)
- −31% false-positive matches on near-duplicate pages
- +22% first-click success in human trials (search → open → useful)
Upgrading
- No schema changes required.
- Re-embed only the collections you search most; mixed indexes support v1 + v2 during transition.
- API: set model: "embed-v2". Clients using nearest_k and hybrid: true work unchanged (example below).
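For illustration, an upgraded request might look like the sketch below. The endpoint URL and auth header are placeholders; only the model, nearest_k, and hybrid parameters come from the notes above:

```python
import requests

# Hypothetical endpoint and token; substitute your workspace's values.
resp = requests.post(
    "https://api.example.com/v1/search",
    headers={"Authorization": "Bearer <TOKEN>"},
    json={
        "model": "embed-v2",  # the only change needed to opt in
        "query": "blue wiring diagram",
        "nearest_k": 50,      # existing clients keep this as-is
        "hybrid": True,       # lexical + vector, unchanged
    },
)
print(resp.json())
```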
Good defaults
- Chunking: 500–900 tokens with 120-token overlap for manuals; 300–500 for chats (first sketch below).
- Hybrid search: lexical (BM25) + vector with reciprocal rank fusion (second sketch below).
- Rerank: lightweight cross-encoder on the top 50 results; latency < 50 ms on the standard tier.
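The chunking default is a plain sliding window. A minimal sketch, where tokens is any pre-tokenized list and the numbers echo the defaults above:

```python
def chunk(tokens, size=700, overlap=120):
    # Manuals: 500-900 tokens with 120-token overlap; drop size toward
    # 300-500 for chats.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```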
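Reciprocal rank fusion is small enough to show whole. A sketch assuming each input is a ranked list of doc IDs and the conventional k = 60 constant:

```python
from collections import defaultdict

def rrf(rankings, k=60):
    # Each list contributes 1/(k + rank) per document, so items ranked
    # well by either BM25 or the vector index float to the top.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([["d3", "d1", "d7"], ["d1", "d9", "d3"]])  # -> ["d1", "d3", "d9", "d7"]
```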
What’s next
- Incremental re-embedding when large files change
- Live schema hints from usage analytics to improve ranking
- Guardrail features that down-weight stale policy docs automatically
v2 is live for all new workspaces. If you’ve got a gnarly corpus, we’d love to see it - these edge cases are how we make the model better for everyone.