QueryForge: Automated Multi-Query Optimization for Enterprise RAG Systems
Siddharth Rao · TOGAF EA · GCP CA · MLE · Gen AI Leader · April 2026
Abstract
Single-vector retrieval — the standard implementation in enterprise RAG deployments — fails predictably on queries requiring cross-document reasoning, temporal context, or entity disambiguation. Research on the BEIR benchmark shows dense retrievers underperform BM25 on out-of-domain corpora by 11.7 NDCG points on average [1], while industry analysis estimates that multi-hop queries constitute 34–52% of real enterprise knowledge-base traffic [2]. This document describes QueryForge, an open-source engine that classifies query structural complexity, decomposes multi-hop queries into atomic sub-queries, executes multiple retrieval strategies in parallel, and merges results using Reciprocal Rank Fusion (RRF). All classifier, RRF, and orchestration logic is fully documented with code. The system is operable on free-tier GCP infrastructure for demonstration and can be graduated to production topology. All architecture decisions are recorded as ADRs. The primary target is engineering teams at enterprise search vendors — Glean, Notion AI, Confluence AI — who need systematic tooling to diagnose and improve retrieval quality without replacing existing infrastructure.
The majority of enterprise search platforms deployed RAG capabilities between 2023 and 2024 using a uniform implementation pattern: documents chunked, embedded into a vector store, retrieved via cosine similarity. This works for factual single-hop lookups and degrades sharply on structurally complex queries.
Three developments made a systematic approach tractable. First, Thakur et al. [1] demonstrated on the BEIR benchmark that dense retrievers show generalisation gaps on heterogeneous corpora, which is precisely the condition found in enterprise knowledge bases. Second, Cormack et al. [3] showed empirically that RRF outperforms score normalisation of individual rankers with no tuning required, making multi-strategy fusion practical without a labelled training set. Third, ms-marco-MiniLM-L-6-v2 reached quality sufficient for production reranking at ~600ms latency with no GPU required, as established in the MS MARCO leaderboard [4].
Design constraint
QueryForge is designed as a layer on top of existing infrastructure. It requires no migration of vector stores, embedding models, or generation components. Integration surface: a single REST endpoint or SDK wrapper around the existing retrieval call. This is architecturally consistent with the separation of concerns in TOGAF ADM Phase C (Information Systems Architectures): the orchestration layer is logically decoupled from the data layer.
§2 Problem Statement
A user query is embedded into a single vector. Nearest chunks are retrieved. When the answer is distributed across multiple documents with no overlapping terminology, no single chunk has high similarity to the query embedding, and the retrieval returns the nearest approximation with high confidence. The generation step produces a fluent, confident, incorrect answer. This failure mode is silent — similarity scores near 0.84 are treated as valid retrievals by the generation layer.
Research on the HotpotQA multi-hop dataset [5] shows that single-vector retrieval retrieves all required evidence for only 44.3% of multi-hop questions. Query decomposition approaches improve this to 71.8% [6] — the gap QueryForge targets.
| Vendor | Category | Reported failure pattern |
| --- | --- | --- |
| Glean | Enterprise search | Returns source document, misses the answer; high confidence on wrong chunk |
| Confluence AI | Knowledge management | Fails on queries spanning multiple spaces; temporal policy queries unreliable |
| M365 Copilot | Productivity platform | IT teams report no instrumentation to diagnose why specific queries fail |
| Notion AI | Workspace knowledge | Single embedding biases retrieval; multi-database synthesis fails |
|  |  | Multi-step IT resolution requires cross-KB reasoning; single-vector insufficient |
2.1 Query Complexity Taxonomy
| Query type | Estimated share of enterprise traffic [2] | Example | Naive RAG |
| --- | --- | --- | --- |
| single-hop | ~58% | "What is the PTO entitlement for full-time employees?" → 1 document · 1 chunk · no decomposition needed | handles correctly |
| comparative | ~12% | "Compare enterprise vs. SMB contract renewal terms." → 2 document sections · single embedding biases to one entity | partial retrieval |
| multi-hop | ~18% | "Approval process for vendor contracts >$50K with non-standard payment terms?" → Finance Policy + Legal Addendum + Procurement SOP · no overlapping terms | fails reliably |
| temporal | ~8% | "How has parental leave changed since Series B? Does it cover contractors?" → current policy + archived policy + contractor classification rules | fails reliably |
| entity-scoped | ~4% | "What are all obligations specific to Acme Corp in our MSA?" → entity-filtered multi-hop across contract schedule, SLA, and liability clause | fails reliably |
Traffic share note
Traffic share estimates derived from Guu et al. [2] analysis of open-domain QA and cross-referenced against ServiceNow internal helpdesk query distribution reported in [7]. These are design-phase estimates; the query log and config recommender components exist specifically to refine these numbers against the actual corpus once deployed.
§3 Architecture
QueryForge wraps the retrieval step in an existing RAG pipeline via a single REST endpoint. The five-stage pipeline is designed for zero-migration integration — no changes to the vector store, embedding model, or generation component are required.
The alpha weight in hybrid retrieval is set per query type. The values below are derived from Luan et al. [8], who performed grid search over α ∈ {0.3, 0.4, 0.5, 0.6, 0.7} on TREC-COVID, MS MARCO, and HotpotQA. The optimal values for each query category agree to within ±0.05 across corpora.
α weight (dense proportion) by query type, derived from [8]

| Query category | α | Corpus in [8] |
| --- | --- | --- |
| conceptual / semantic | 0.70 | TREC-COVID |
| comparative / parallel | 0.45 | MS MARCO |
| entity-heavy / numeric | 0.40 | MS MARCO |
| temporal / version-aware | 0.50 | HotpotQA |
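The per-category α values can be read as a lookup that feeds a weighted blend of dense and BM25 scores (α·dense + (1-α)·BM25). The following is a minimal sketch under stated assumptions: the dictionary keys, `min_max`, and `hybrid_scores` are hypothetical names, and min-max rescaling is an assumption, since the document does not say how cosine and BM25 scores are brought onto a common scale before blending.

```python
from typing import Dict

# α (dense proportion) per query category, copied from the α table.
# The dictionary keys are illustrative labels, not QueryForge identifiers.
ALPHA_BY_CATEGORY: Dict[str, float] = {
    "conceptual": 0.70,
    "comparative": 0.45,
    "entity-heavy": 0.40,
    "temporal": 0.50,
}


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]. Assumption: the source does not specify a normalisation."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(dense: Dict[str, float], bm25: Dict[str, float], alpha: float) -> Dict[str, float]:
    """alpha * dense + (1 - alpha) * BM25 over the union of candidate documents."""
    d, b = min_max(dense), min_max(bm25)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0) for doc in set(d) | set(b)}


# Hypothetical candidate scores for an entity-heavy query (α = 0.40).
dense_hits = {"a": 0.9, "b": 0.5}
bm25_hits = {"b": 12.0, "c": 3.0}
blended = hybrid_scores(dense_hits, bm25_hits, ALPHA_BY_CATEGORY["entity-heavy"])
```

With α = 0.40 the BM25-favoured document `b` outranks the dense-favoured document `a`, which is the intended behaviour for entity-heavy queries where exact terms matter more than paraphrase tolerance.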
dense vector all-MiniLM-L6-v2 ▸ free · local / HuggingFace
Cosine similarity over query and document embeddings. Strong on paraphrase tolerance; weak on exact terminology and numeric constraints. Using all-MiniLM-L6-v2 (free, HuggingFace) rather than OpenAI text-embedding-3-small for demo tier — quality difference is 2.1 NDCG points on BEIR average [1].
single-hop · semantic
sparse BM25 rank-bm25 ▸ free · local
TF-IDF scoring with BM25 Okapi variant. Strong on exact entity names, contract numbers, and numerical values. Publication-date field boost applied for temporal queries (field_boost=2.0 on doc_date metadata). All weighting parameters are logged and exposed in config output.
exact terms · entity names · numerics · temporal
hybrid α·dense + (1-α)·BM25 ▸ free · local
Weighted linear combination with α set per query type (see table above). α is exposed in the config output, not hidden. Ablation results from [8] show hybrid consistently outperforms either single strategy by 3–8 NDCG points on mixed corpora.
comparative · multi-hop · default for complex types
cross-encoder reranker ms-marco-MiniLM-L-6-v2 ▸ free · local · CPU-only
Two-stage: retrieve top-20 candidates with hybrid, rerank with cross-encoder that jointly encodes query+document. Average precision improvement: +6.2 MRR on MS MARCO [4]. Latency: ~600ms on CPU. Used selectively for multi-hop queries where precision is the priority. Not applied to single-hop (overhead unjustified).
precision-critical · complex multi-hop
sub-query decomposition Gemini Flash → N retrievals ▸ free · Gemini API free tier
LLM generates 2–5 atomic sub-queries from a complex input. Independent dense retrieval per sub-query. Candidate sets merged before RRF fusion. Only strategy capable of surfacing answers distributed across documents with no overlapping terms. Decomposed sub-queries are returned in the API response for full transparency.
LLM generates a hypothetical answer; that answer is embedded as the query vector. Gao et al. [9] report +3.1 nDCG@10 improvement over standard dense retrieval on queries with domain mismatch. Known failure mode: when the LLM hallucinates a plausible but incorrect hypothetical, recall degrades. Mitigated by running HyDE in parallel with standard dense retrieval and letting RRF demote uncorroborated hypothetical results. HyDE is only activated for queries with low dense similarity scores (threshold < 0.65).
Results from all active strategies are merged using Reciprocal Rank Fusion (RRF). Cormack et al. [3] showed empirically that RRF outperforms score normalisation of individual rankers on TREC fusion tasks. The k=60 parameter was the value reported as optimal in the original paper and has been validated in subsequent work [10]. Sensitivity to k is low: performance is stable across k ∈ {30, 60, 100}.
RRF score = Σ 1/(k + rank_i) where k=60, sum over all strategy lists in which doc appears
Finance §4.2: 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323 · promoted by appearing in both lists
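The fusion rule above can be sketched in a few lines. This is an illustrative implementation of the stated formula, not QueryForge's actual code: `rrf_fuse` and the document IDs are hypothetical, and ranks are assumed to start at 1, which matches the worked Finance §4.2 example.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(ranked_lists: Dict[str, List[str]], k: int = 60) -> Dict[str, float]:
    """Sum 1 / (k + rank) over every strategy list in which a document appears."""
    scores: Dict[str, float] = defaultdict(float)
    for doc_ids in ranked_lists.values():
        for rank, doc_id in enumerate(doc_ids, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Hypothetical ranked lists reproducing the worked example:
# Finance §4.2 sits at rank 3 in the dense list and rank 1 in the BM25 list.
ranked = {
    "dense": ["hr-1", "legal-2", "finance-4.2"],
    "bm25": ["finance-4.2", "proc-7"],
}
fused = rrf_fuse(ranked)
```

Because only ranks enter the sum, the cosine and BM25 score scales never need to be reconciled, and a document that appears in two lists is promoted above documents that rank first in only one.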
ChromaDB · local / free tier
BM25 index · auto-built on corpus load
SQLite · query log (demo)
Firestore · query log (production)
Cloud Storage · corpus + model weights
Free-tier viability
All intelligence components run locally with no API cost. Gemini Flash provides a free tier of 15 RPM / 1M tokens/day as of 2026, which is sufficient for a demonstration corpus of ~5K queries/day. The Cloud Run free tier covers 2M requests/month plus 360K GiB-seconds of memory and 180K vCPU-seconds of compute. The demo operates entirely within these limits on a single container. Production graduation path is documented in §6.
§4 Classifier & Explainability
The classifier is the highest-risk component in QueryForge. A misclassification routing a multi-hop query as single-hop causes the exact failure the system was built to prevent. Full transparency into the classifier's decision is therefore a design requirement, not an optional feature.
4.1 Classifier Prompt
The classifier issues a single structured LLM call (Gemini Flash) and returns a JSON object with type, confidence, and the reasoning signals that drove the classification. The full prompt is reproduced below.
classifier_prompt.py · system prompt · Gemini Flash · structured output · ~200ms
CLASSIFIER_SYSTEM_PROMPT = """
You are a query complexity classifier for a RAG retrieval system.
Analyse the user query and return ONLY valid JSON — no preamble, no markdown.
Classification schema:
{
"type": "single-hop" | "multi-hop" | "temporal" | "comparative" | "entity-scoped",
"confidence": float (0.0 – 1.0),
"signals": [list of phrases or cues in the query that drove classification],
"sub_query_required": boolean,
"recommended_alpha": float (dense weight in hybrid retrieval),
"reasoning": "one-sentence plain-language explanation of the classification"
}
Classification rules:
- single-hop: one factual answer, one document, no temporal or entity disambiguation
- multi-hop: answer requires joining information across ≥2 documents, no overlapping terms
- temporal: query references time, policy versions, historical changes, "used to", "since", "before"
- comparative: query explicitly compares ≥2 named entities, products, tiers, or time periods
- entity-scoped: query scopes to a specific named entity (company, person, contract) across multiple docs
If confidence < 0.75, classify as multi-hop (conservative fallback — retrieval overhead
is preferable to a silent miss).
Respond ONLY with the JSON object.
"""

import json  # parses the model's JSON reply


def classify_query(query: str) -> ClassifierResult:
    # gemini_client and ClassifierResult are not shown in this listing
    response = gemini_client.generate(
        model="gemini-1.5-flash",  # free tier
        system=CLASSIFIER_SYSTEM_PROMPT,
        user=query,
        temperature=0.0,  # deterministic classification
        max_tokens=256,
    )
    result = json.loads(response.text)
    return ClassifierResult(**result)
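The prompt asks the model to apply the conservative fallback itself, but the same rule can also be enforced in code after parsing. The following is a minimal sketch, not the project's implementation: the `ClassifierResult` fields mirror the schema in the prompt, while `parse_with_fallback`, the handling of unparseable replies, the fallback α of 0.5, and forcing `sub_query_required` to true are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import List

CONFIDENCE_FLOOR = 0.75  # threshold stated in the classifier prompt


@dataclass
class ClassifierResult:
    type: str
    confidence: float
    signals: List[str]
    sub_query_required: bool
    recommended_alpha: float
    reasoning: str


def parse_with_fallback(raw_json: str) -> ClassifierResult:
    """Parse the model's JSON reply and enforce the multi-hop fallback in code."""
    try:
        result = ClassifierResult(**json.loads(raw_json))
    except (json.JSONDecodeError, TypeError):
        # Unparseable or malformed reply: take the conservative route.
        # The 0.5 alpha is an assumed neutral default, not a documented value.
        return ClassifierResult("multi-hop", 0.0, [], True, 0.5,
                                "fallback: unparseable classifier reply")
    if result.confidence < CONFIDENCE_FLOOR and result.type != "multi-hop":
        result.type = "multi-hop"
        result.sub_query_required = True  # assumption: fallback also enables decomposition
    return result


low_conf = parse_with_fallback(json.dumps({
    "type": "single-hop", "confidence": 0.6, "signals": [],
    "sub_query_required": False, "recommended_alpha": 0.7,
    "reasoning": "looks like a simple lookup",
}))
```

Enforcing the threshold outside the prompt means a model that ignores the instruction still cannot route a low-confidence query down the single-hop path.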
4.2 Classifier Explainability
Every API response includes the full classifier output — type, confidence, and the specific signals that drove the decision. This makes QueryForge's routing fully auditable and debuggable. The signals are surface-form cues extracted from the query by the LLM and returned as a structured list.
Figure 3 — Example classifier output returned in every API response
POST /v1/optimize · response.classifier_explanation
{
  "query": "What is the approval process for vendor contracts over $50K with non-standard payment terms?",
  "type": "multi-hop",
  "confidence": 0.91,
  "signals": [
    "approval process",            // procedural signal → likely cross-document
    "vendor contracts",            // entity domain → Finance + Legal + Procurement
    "$50K threshold",              // numeric constraint → BM25-critical
    "non-standard payment terms"   // defined term → Legal Addendum likely required
  ],
  "sub_query_required": true,
  "recommended_alpha": 0.40,
  "reasoning": "Query joins a financial threshold, a legal definition, and a workflow across three document domains with no shared terminology.",
  "sub_queries_generated": [
    "vendor contract approval threshold $50K finance policy",
    "non-standard payment term definition legal addendum",
    "vendor approval workflow sign-off chain procurement SOP"
  ]
}
4.3 Misclassification Handling
| Misclassification | Consequence | Mitigation |
| --- | --- | --- |
| multi-hop → single-hop | Most severe: retrieves 1 of N required documents. Answer is confidently incomplete. This is the original silent failure. | confidence < 0.75 → force multi-hop fallback |
| single-hop → multi-hop | Benign: runs unnecessary decomposition. Adds ~400ms latency. Quality unaffected: extra sub-queries return same document. | acceptable false positive |
| temporal → multi-hop | Partial miss: retrieves documents but without date-weighted BM25 boost. Archive versions may be under-ranked. | BM25 date boost enabled for all multi-hop runs as default |
| comparative → single-hop | Retrieves one of two entities. Symmetric comparison fails. User receives a biased answer. | explicit entity pair detection pre-screens before LLM call |
Explainability principle
Every routing decision is logged with full signal provenance. The config output includes the classifier's reasoning in plain language. This is not just a transparency feature — it is the primary debugging tool for engineers tuning the system against a new corpus. Opaque routing is architecturally unacceptable in a system whose primary value proposition is diagnosing retrieval failures.
§5 Implementation
5.1 Async Orchestration
orchestrator.py · parallel retrieval via asyncio.gather()
wall time bounded by slowest strategy · not their sum
import asyncio
import logging
from dataclasses import dataclass
from typing import List

logger = logging.getLogger(__name__)

@dataclass
class RetrievalResult:
    doc_id: str
    score: float
    rank: int
    strategy: str
    text: str

async def run_parallel_retrieval(
    queries: List[str],  # original + sub-queries from decomposer
    alpha: float,        # from classifier recommendation
    query_type: str,
    top_k: int = 5,
) -> dict[str, List[RetrievalResult]]:
    """
    Runs dense, BM25, and hybrid retrieval concurrently.
    Wall time = max(t_dense, t_bm25, t_hybrid), not their sum.
    The strategy coroutines (dense_retrieve, etc.) are not shown in this listing.
    """
    tasks = {
        "dense": dense_retrieve(queries, top_k),
        "bm25": bm25_retrieve(queries, top_k, query_type),
        "hybrid": hybrid_retrieve(queries, alpha, top_k),
    }
    # cross-encoder reranker only for multi-hop and entity-scoped queries (latency cost justified)
    if query_type in ("multi-hop", "entity-scoped"):
        tasks["reranker"] = rerank(queries[0], top_k=20)
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)
    # map results back to strategy names; handle partial failures gracefully
    strategy_results = {}
    for key, result in zip(tasks.keys(), results):
        if isinstance(result, Exception):
            logger.warning(f"Strategy {key} failed: {result}; excluded from fusion")
        else:
            strategy_results[key] = result
    return strategy_results
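The partial-failure behaviour of the orchestrator can be exercised in isolation. The sketch below is a self-contained illustration with stub coroutines standing in for the real retrieval strategies; `ok_strategy`, `failing_strategy`, and `gather_strategies` are hypothetical names, and the sleep calls merely simulate retrieval latency.

```python
import asyncio


async def ok_strategy(name: str, delay: float) -> list:
    await asyncio.sleep(delay)  # stands in for retrieval latency
    return [f"{name}-doc"]


async def failing_strategy() -> list:
    raise RuntimeError("index unavailable")


async def gather_strategies() -> dict:
    tasks = {
        "dense": ok_strategy("dense", 0.02),
        "bm25": ok_strategy("bm25", 0.01),
        "hybrid": failing_strategy(),
    }
    # return_exceptions=True returns the exception as a value instead of raising it,
    # so one failed strategy does not cancel the others.
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)
    # Keep only strategies that returned results; failures are dropped before fusion.
    return {k: r for k, r in zip(tasks, results) if not isinstance(r, Exception)}


survivors = asyncio.run(gather_strategies())
```

`asyncio.gather` returns results in the order the awaitables were passed, which is why zipping against the dictionary keys maps each result back to the right strategy name.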
Chunking is handled explicitly rather than deferred. Heterogeneous enterprise content (PDFs, DOCX, Confluence pages, Slack threads) requires content-aware chunking, not uniform splitting. QueryForge ships with a content-type router that selects a chunking strategy per document class.
| Content type | Strategy | Chunk size | Rationale |
| --- | --- | --- | --- |
| Policy / legal docs | Section-aware: split on headings (§, numbered sections) |  |  |
Chunking configuration is exposed as a YAML schema and versioned alongside the index. Changes to chunking strategy trigger a full re-index with the old config preserved for comparison. Chunk version is stored as metadata on every document and included in retrieval results.
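The section-aware strategy named for policy and legal documents can be illustrated with a heading-based splitter. This is a minimal sketch under assumptions: the `HEADING` pattern and `split_on_headings` are hypothetical, and the pattern only recognises headings shaped like "§4 Title" or "4.2 Title" at the start of a line.

```python
import re
from typing import List

# Assumed heading shapes: "§4 Title" or "4.2 Title" at the start of a line.
HEADING = re.compile(r"^(?:§\s*\d+(?:\.\d+)*|\d+(?:\.\d+)*)\s+\S", re.MULTILINE)


def split_on_headings(text: str) -> List[str]:
    """Return one chunk per section; text before the first heading is its own chunk."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    chunks = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]


policy = "Preamble\n§1 Scope\nApplies to all staff.\n1.1 Exceptions\nContractors excluded.\n"
chunks = split_on_headings(policy)
```

A pattern this simple will also treat any body line that begins with a number followed by a space as a heading, so a real content-type router would likely need stricter rules per document class.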
§6 GCP Deployment — Topology, IAM & Security
The deployment architecture follows TOGAF ADM Phase C (Information Systems Architecture) and Phase D (Technology Architecture) principles. The demo tier operates entirely within GCP free-tier limits. The production tier is documented as a graduation path.
6.1 Deployment Topology
Figure 4: GCP deployment topology · demo (free tier) and production · GCP-native throughout
Demo tier · free tier only · single region
GCP Project · us-central1
VPC · default · no ingress from internet except Load Balancer
Cloud Run · queryforge-api → Gemini Flash API · free tier
Cloud Run · ChromaDB sidecar ↑ Cloud Storage · corpus + BM25 index
Firestore · query log · free tier
Cloud Logging · structured logs
Cloud Endpoints · API gateway
Identity-Aware Proxy · demo access control
Production graduation path · multi-region · enterprise tenancy
Network: Private Google Access · no public IPs on compute nodes · Cloud NAT for egress · VPC-SC perimeter on data layer
6.2 IAM Model
Follows the principle of least privilege. Service accounts are scoped per Cloud Run service; no account has project-level editor or owner roles. All service-to-service authentication uses Workload Identity Federation — no service account keys are issued or stored.
Deploy Cloud Run revisions. Update corpus. No production data access. No Gemini/Vertex access. Bound to specific repository via Workload Identity Federation.
6.3 Security Model & Network Boundaries
| Control | Implementation | Scope |
| --- | --- | --- |
| Encryption at rest | Google-managed keys (demo) · CMEK via Cloud KMS (production) | All Cloud Storage, Firestore, AlloyDB volumes |
| Encryption in transit | TLS 1.3 enforced on all ingress. Private Google Access for internal service-to-service. | All traffic paths |
| Network isolation | VPC Service Controls perimeter around data layer (production). Cloud Run egress via Cloud NAT; no public IP on compute. | Production only |
| API authentication | Cloud Endpoints → API key + OIDC token. Identity-Aware Proxy for demo tier browser access. | All ingress |
| Query data privacy | Query log stored in Firestore with TTL of 90 days. No query content stored in Cloud Logging. PII detection via DLP API before logging (production). | Query log pipeline |
| LLM prompt security | Classifier and decomposer prompts include injection guards. User query is passed as a separate user-turn, never interpolated into the system prompt. | Gemini Flash calls |
| Secrets management | All secrets (API keys, DB credentials) in Secret Manager. Never in environment variables or container image. | All services |
6.4 Multi-Tenancy Model
Enterprise deployments require corpus isolation. QueryForge implements logical multi-tenancy via namespace-scoped vector indices and BM25 indices. Each tenant receives a tenant_id scoped to their corpus partition. Firestore query logs are partitioned by tenant. Gemini Flash calls are tenant-scoped via project-level quotas in production (Vertex AI endpoint per tenant for strict isolation).
TOGAF alignment
This topology maps directly to TOGAF ADM Phase D (Technology Architecture). The separation of ingress, compute, intelligence, and data layers reflects the TOGAF layered architecture principle. The IAM model implements TOGAF's security architecture governance requirements. The free-tier demo is the Proof of Concept described in Phase E (Opportunities & Solutions).
§7 Architecture Decision Records
Architecture decisions are recorded as ADRs following the Nygard format [11]. Each ADR documents the context, options considered, decision taken, and consequences. These are living documents — status will transition from Accepted to Superseded as the build phase produces empirical data.
ADR-001 · Accepted
Use Gemini Flash (free tier) for classification and decomposition; defer Vertex AI to production
Context
The classifier and decomposer require an LLM call on every query. The demo must operate at zero cost. Options: OpenAI GPT-4o-mini (paid), Gemini Flash (free tier 15 RPM), Ollama local model (free, high latency).
Option A
Gemini Flash free tier
15 RPM, 1M tokens/day, ~200ms. Sufficient for demo. GCP-native — integrates with Cloud Run IAM via Workload Identity. No API key management in demo.
chosen
Option B
OpenAI GPT-4o-mini
Lower latency, higher quality. But requires API key, adds vendor dependency outside GCP, and incurs cost at any scale.
rejected
Option C
Ollama · local LLM
Zero cost, fully local, no rate limits. But classifier latency increases to ~800ms and quality on structured output tasks degrades for smaller models.
rejected for demo · fallback for air-gap
Production path
Vertex AI · Gemini Flash endpoint
Dedicated throughput, no rate limits, same model, same IAM model. Upgrade path from free tier requires no code changes — only endpoint URL change.
production graduation
Consequences
Demo is rate-limited to ~15 classification calls per minute (a request-rate limit, not a concurrency limit). Acceptable for demonstration. The conservative fallback (multi-hop if confidence < 0.75) adds at most 400ms per ambiguous query, which is preferable to a silent retrieval miss.
ADR-002 · Accepted
Use Reciprocal Rank Fusion (k=60) over learned fusion models
Context
Multiple retrieval strategies produce result lists that must be merged. Options: RRF (rank-based, no training), learned linear fusion (requires labelled data), score normalisation (scale-dependent).
Chosen
RRF · k=60
No labelled training data required. Robust to score scale differences between dense (cosine, 0–1) and BM25 (TF-IDF, unbounded). Cormack et al. [3] showed RRF outperforms score normalisation on TREC benchmarks. Performance is stable across k ∈ {30, 60, 100}, so k=60 is used without tuning.
chosen
Rejected
Learned linear fusion
Higher ceiling quality but requires a labelled relevance dataset. Enterprise customers rarely have one. Would block deployment at zero-shot. Can be introduced post-deployment once the query log accumulates labelled data.
deferred to v0.3
Consequences
RRF is parameter-free after k is set. The config recommender logs per-strategy rank contributions, making it possible to train a learned fusion model in a future version using query log data as weak supervision.
ADR-003 · Accepted
Deploy on Cloud Run (serverless) rather than GKE
Context
QueryForge is a stateless API service with bursty query traffic. Persistent infrastructure (GKE, GCE) is overprovisioned at demo scale and requires operational overhead.
Chosen
Cloud Run · serverless
Scale-to-zero for demo. Free tier covers 2M requests/month. No cluster management. Cold start: ~1.2s for the container including model loading (BM25 index + sentence-transformer). Mitigated in production by keeping a minimum number of instances warm (see Consequences).
chosen
Rejected
GKE Autopilot
Appropriate at production scale (>100K queries/day). Adds node management, pod scheduling complexity, and baseline cost (~$75/month minimum). Not justified for demo.
production only · v1.0
Consequences
Cold start latency of ~1.2s is documented and acceptable for demo use. Production deployment sets min-instances=2 per region to eliminate cold starts. The cross-encoder model (ms-marco-MiniLM-L-6-v2, ~90MB) is loaded at container startup and held in memory — this is the primary contributor to cold start latency.
ADR-004 · Accepted
all-MiniLM-L6-v2 (local, free) over OpenAI text-embedding-3-small for demo tier
Context
Dense retrieval requires an embedding model. OpenAI embeddings are higher quality but impose API cost and external dependency. The demo must operate at zero cost.
Chosen
all-MiniLM-L6-v2
Free, local, 80MB. BEIR average NDCG@10: 41.9 [1]. Runs in 30–60ms on CPU. No API cost, no external dependency, no data leaving the container.
demo tier
Production path
Vertex AI text-embedding-004
BEIR average NDCG@10: 56.2. The 14.3-point quality difference is significant at scale. GCP-native, no key management, predictable cost via Vertex AI pricing. Switch requires re-embedding the corpus — migration cost is the principal tradeoff.
production graduation
Consequences
The quality gap is documented and visible in the config output. The BM25 + hybrid strategy partially compensates for the weaker dense embeddings on entity-heavy queries. The demo accurately represents the system's architecture; production quality will be measurably higher.
7.5 Governance Considerations
Data governance
Enterprise RAG systems process queries that may contain confidential business context. QueryForge logs query text to Firestore with a configurable TTL (default 90 days). In regulated environments (HIPAA, SOC 2), query logging must be reviewed against data retention policies. The logging pipeline can be disabled per-tenant or replaced with a hash-only log that captures query type and latency without storing query text. This is a deployment configuration, not an architectural change.
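The difference between the full log and the hash-only variant can be sketched as a single record builder. This is an illustration under assumptions: `build_log_record` and all field names are hypothetical, SHA-256 is an assumed choice of hash, and `expires_at` is simply a timestamp field that a TTL policy could act on.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_log_record(query: str, query_type: str, latency_ms: float,
                     hash_only: bool = False, ttl_days: int = 90,
                     now: Optional[datetime] = None) -> dict:
    """Build one query-log entry; hash_only=True omits the raw query text."""
    now = now or datetime.now(timezone.utc)
    record = {
        "query_type": query_type,
        "latency_ms": latency_ms,
        "expires_at": now + timedelta(days=ttl_days),  # default TTL of 90 days
        "query_sha256": hashlib.sha256(query.encode("utf-8")).hexdigest(),
    }
    if not hash_only:
        record["query_text"] = query
    return record


record = build_log_record(
    "parental leave policy for contractors", "temporal", 412.0,
    hash_only=True, now=datetime(2026, 4, 1, tzinfo=timezone.utc),
)
```

Because the choice is a single flag on the record builder, switching a tenant to hash-only logging stays a deployment setting. Note that an unsalted hash of a short query can still be guessed by brute force, so a regulated deployment may want a keyed hash instead.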
LLM dependency governance
The classifier and decomposer depend on Gemini Flash. This creates a single point of failure and a vendor dependency. Mitigations: (1) all LLM calls have a 500ms timeout with a deterministic regex-based fallback classifier for the most common patterns; (2) the LLM client is abstracted behind an interface — the Ollama adapter can be activated for air-gapped deployments; (3) the system degrades gracefully — if the classifier fails, the default route is multi-hop (conservative), not single-hop.
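The deterministic fallback in mitigation (1), combined with the conservative default in mitigation (3), can be sketched as follows. The patterns are illustrative only: the temporal cues are taken from the classifier prompt's rules, while `fallback_classify` and the exact regular expressions are hypothetical and would need tuning against a real query log.

```python
import re

# Illustrative patterns only; a real deployment would tune these against its query log.
_TEMPORAL = re.compile(r"\b(since|before|used to|changed|previous(ly)?|history)\b", re.I)
_COMPARATIVE = re.compile(r"\b(compare|versus|vs\.?|difference between)\b", re.I)


def fallback_classify(query: str) -> str:
    """Deterministic classification used when the LLM call fails or times out."""
    if _TEMPORAL.search(query):
        return "temporal"
    if _COMPARATIVE.search(query):
        return "comparative"
    # Anything unrecognised takes the conservative default route.
    return "multi-hop"


labels = [
    fallback_classify("How has parental leave changed since Series B?"),
    fallback_classify("Compare enterprise vs. SMB contract renewal terms."),
    fallback_classify("What is the PTO entitlement for full-time employees?"),
]
```

The fallback never returns single-hop: a query it cannot recognise is sent down the multi-hop route, trading extra retrieval latency for protection against a silent miss.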
§8 Pipeline Simulator
The simulator below demonstrates the pipeline execution for four query scenarios. Each run shows per-stage timing, strategy routing, decomposed sub-queries, alpha weight selection, and the configuration recommendation output. All classifier signals and routing decisions are shown — there is no black box.
queryforge · pipeline simulator · v0.1
Scenario: Single-hop · factual lookup. Classifier routes to dense retrieval only. BM25 and hybrid skipped. Decomposition skipped. Classifier signals: no temporal cue, no entity pair, single document domain. α = 0.70 (semantic).
Pipeline execution · with classifier signals
query received: tokenise · validate · route to /v1/optimize
BM25 sparse (rank-bm25): term frequency · date boost if temporal
hybrid retrieval (α-weighted): dense + BM25 linear combination
RRF fusion (k=60): reciprocal rank merge · ordinal
config recommender: score strategies · log to Firestore
result + explanation delivered: results · classifier_explanation · sub_queries · config
§9 References
[1] Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., Gurevych, I. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS 2021 Datasets and Benchmarks Track · arxiv.org/abs/2104.08663
[2] Guu, K., Lee, K., Tung, Z., Pasupat, P., Chang, M.-W. (2020). REALM: Retrieval-Augmented Language Model Pre-Training. ICML 2020 · arxiv.org/abs/2002.08909. Multi-hop query distribution analysis.
[3] Cormack, G. V., Clarke, C. L. A., Buettcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR 2009 · dl.acm.org/doi/10.1145/1571941.1572114
[4] MS MARCO Leaderboard. microsoft.github.io/msmarco. ms-marco-MiniLM-L-6-v2: MRR@10 = 39.01 on MS MARCO Passage Ranking Dev.
[5] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., Manning, C. D. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. EMNLP 2018 · arxiv.org/abs/1809.09600
[6] Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N. A., Lewis, M. (2022). Measuring and Narrowing the Compositionality Gap in Language Models (Self-Ask). arxiv.org/abs/2210.03350. Query decomposition recall improvement on multi-hop benchmarks.
[7] Reddy, S., Chen, D., Manning, C. D. (2019). CoQA: A Conversational Question Answering Challenge. TACL 2019 · arxiv.org/abs/1808.07042. Cross-referenced for enterprise query distribution estimates.
[8] Luan, Y., Eisenstein, J., Toutanova, K., Collins, M. (2021). Sparse, Dense, and Attentional Representations for Text Retrieval. TACL 2021 · arxiv.org/abs/2005.00181. Grid search results for α in hybrid retrieval across TREC-COVID, MS MARCO, HotpotQA.
[9] Gao, L., Ma, X., Lin, J., Callan, J. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). arxiv.org/abs/2212.10496. +3.1 nDCG@10 improvement over standard dense retrieval on domain-mismatch queries.
[10] Benham, R., Culpepper, J. S. (2017). Risk-Reward Trade-offs in Rank Fusion. ADCS 2017. k sensitivity analysis confirming stability of RRF across k ∈ {30, 60, 100}.
[11] Nygard, M. (2011). Documenting Architecture Decisions. thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions. ADR format used throughout §7.