October 27, 2025 · Research · 6 min read

Embedding Pipeline v2 improves retrieval quality

Our new embedding pipeline improves retrieval quality by 18–24% and reduces near-duplicate matches across heterogeneous enterprise datasets.


We rebuilt our embedding stack to make retrieval meaningfully better for real teams, not just on synthetic benchmarks.

Why we did this

Teams store mixed content (docs, tickets, chats, PDFs, images). Traditional embeddings flatten nuance:

  • "Reset device" might mean a firmware command in Support, but a full wipe in IT.
  • “Spec” could be a hardware drawing or a product brief, depending on team. Our goal: consistent relevance across domains without hand-tuning.

What changed in v2

  1. Domain-adaptive training. We pretrain on a diverse corpus, then adapt on domain slices (support, eng, legal, finance). This reduces term hijacking, where one sense of a word crowds out another (e.g., “issue” as a GitHub ticket vs an invoice problem). A minimal adaptation sketch follows this list.

  2. Robust normalization. Lightweight text normalization handles boilerplate, signatures, tables, and OCR noise. We treat headings and bullets as structure, not clutter (sketched after this list).

  3. Context windows. We embed both passages and their parents (section or page), so a query can match a chunk directly or inherit signal from the parent instead of losing meaning at chunk boundaries (sketched after this list).

  4. Hard negative mining. We purposefully train on “looks similar but wrong” pairs (model cards vs policy cards; version docs vs release notes) to reduce near-misses (a mining sketch follows this list).

  5. Cross-asset alignment. Captions from images and diagrams and alt-text from PDFs land in the same space as text, so references like “the blue wiring diagram” become searchable.
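To illustrate item 1: a minimal two-stage sketch of domain adaptation, using the open-source sentence-transformers library purely as a stand-in for our training stack. The model name, pair data, and hyperparameters are all assumptions for illustration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Stand-in base model; v2's actual pretrained checkpoint is not public.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical (query, relevant passage) pairs from one domain slice.
support_pairs = [
    ("reset device", "Send the firmware reset command via the support console."),
    ("spec for housing", "Hardware drawing: enclosure dimensions and tolerances."),
]

examples = [InputExample(texts=[q, p]) for q, p in support_pairs]
loader = DataLoader(examples, shuffle=True, batch_size=2)

# In-batch contrastive loss: other passages in the batch act as negatives.
loss = losses.MultipleNegativesRankingLoss(model)

# Short adaptation pass on the domain slice, starting from the broad base.
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```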
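For item 2, a minimal sketch of the kind of lightweight normalization we mean; the regexes here are illustrative, and the production rules are more extensive.

```python
import re

# Illustrative noise patterns: a trailing email signature delimiter and a
# common legal-footer fragment. The real v2 rule set is not public.
SIGNATURE_RE = re.compile(r"(?m)^--\s*$.*", re.S)
BOILERPLATE_RE = re.compile(r"(?i)confidential.*?notice\.?")

def normalize(text: str) -> str:
    """Strip noise while preserving headings and bullets as structure."""
    text = SIGNATURE_RE.sub("", text)
    text = BOILERPLATE_RE.sub("", text)
    # Smooth OCR artifacts: form feeds, stray carriage returns, run-on spaces.
    text = re.sub(r"[\f\r]+", "\n", text)
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Keep heading/bullet lines intact; only collapse long blank runs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```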
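For item 3, a sketch of parent-aware scoring, assuming unit-normalized embedding vectors; the 0.3 parent weight is an illustrative default, not the production value.

```python
import numpy as np

def score(query_vec: np.ndarray, chunk_vec: np.ndarray,
          parent_vec: np.ndarray, parent_weight: float = 0.3) -> float:
    """Blend a chunk's own similarity with signal inherited from its parent.

    All vectors are assumed unit-normalized, so dot product = cosine sim.
    """
    chunk_sim = float(query_vec @ chunk_vec)
    parent_sim = float(query_vec @ parent_vec)
    # The chunk dominates, but a truncated chunk can still surface via
    # its parent section or page.
    return (1 - parent_weight) * chunk_sim + parent_weight * parent_sim
```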
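And for item 4, a sketch of mining “looks similar but wrong” negatives with the current model in the loop; the retrieve() callable and the shape of its candidate objects are assumptions.

```python
def mine_hard_negatives(pairs, retrieve, k=50):
    """Build (query, positive, hard_negative) triplets.

    pairs: iterable of (query, positive) items, each with a .doc_id field.
    retrieve: callable returning the top-k candidates under the current
        model — assumed here, since the real retrieval API is internal.
    """
    triplets = []
    for query, positive in pairs:
        candidates = retrieve(query, k=k)
        for cand in candidates:
            # The highest-ranked non-relevant hit is the hardest negative:
            # it looks similar to the positive but is the wrong document.
            if cand.doc_id != positive.doc_id:
                triplets.append((query, positive, cand))
                break
    return triplets
```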

Results

Across our internal evals:

  • +18–24% MRR@10 on heterogeneous corpora (docs + chats + PDFs)
  • −31% false-positive matches on near-duplicate pages
  • +22% first-click success in human trials (search → open → useful)

Upgrading

  • No schema changes required.
  • Re-embed only the collections you search most; mixed indexes support v1 + v2 during the transition.
  • API: set model: "embed-v2". Clients using nearest_k and hybrid: true work unchanged (a hypothetical payload is sketched below).
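For concreteness, here is what a query payload might look like. Only model, nearest_k, and hybrid come from this post; the surrounding field names and values are illustrative.

```python
# Hypothetical query payload; only model, nearest_k, and hybrid are
# documented fields, the rest is illustrative.
query = {
    "model": "embed-v2",      # opt in to the new pipeline
    "nearest_k": 10,          # existing clients keep this unchanged
    "hybrid": True,           # lexical + vector fusion stays on
    "text": "reset device firmware command",
}
```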

Good defaults

  • Chunking: 500–900 tokens with 120-token overlap for manuals; 300–500 for chats (see the first sketch below).
  • Hybrid search: lexical (BM25) + vector, fused with reciprocal rank fusion (see the second sketch below).
  • Rerank: lightweight cross-encoder over the top 50 results; latency < 50 ms on the standard tier.
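To make the chunking default concrete, a minimal token-window sketch; the 700/120 values are one point in the manual range above, and the tokenizer is assumed to have run already.

```python
def chunk(tokens, size=700, overlap=120):
    """Fixed-size token chunking with overlap.

    tokens: any pre-tokenized sequence; size/overlap follow the defaults
    above for manuals (use ~300-500 with smaller overlap for chats).
    """
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```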
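And a minimal reciprocal rank fusion sketch for the hybrid default. The k=60 constant is the value from the original RRF paper, used here for illustration rather than as our shipped setting.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked result lists (e.g., BM25 and vector) into one ranking.

    Each document scores the sum of 1 / (k + rank) over every list it
    appears in, so agreement between lexical and vector search wins.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```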

What’s next

  • Incremental re-embedding when large files change
  • Live schema hints from usage analytics to improve ranking
  • Guardrail features that down-weight stale policy docs automatically

v2 is live for all new workspaces. If you’ve got a gnarly corpus, we’d love to see it; edge cases like these are how we make the model better for everyone.
