QueryForge v1.1.0 · Multi-Query Optimization for RAG · 2026

QueryForge: Adaptive Retrieval Optimization, Built Google-Native

Design Documentation · Google Cloud Reference Architecture · Architecture Portfolio · July 2026 · Pipeline: single Cloud Run service · Gemini + Firestore · $0.00 actual spend

$0.00

projected monthly cost
runs entirely inside Google Cloud's Always-Free tier

$0.01

hard budget cap
project-scoped, on a billing account shared with 3 other apps

query failure modes
auto-classified and routed, every decision explained

Abstract

Standard RAG pipelines use a single embedding lookup per query. This works for simple factual questions and fails predictably everywhere else — multi-hop synthesis, comparative framing, temporal versioning, and entity-exact lookups all need a different retrieval strategy, and manual tuning of chunk size, top-k, or query rewriting rarely finds it. QueryForge automates that discovery: it classifies each incoming query, decomposes complex ones into atomic sub-queries, runs dense, sparse, and hybrid retrieval in parallel, fuses the results with Reciprocal Rank Fusion, and returns a fully explained routing decision alongside a ready-to-use config recommendation.

This version of QueryForge was rebuilt from an original open-source, multi-vendor design into a single-vendor Google Cloud architecture. Every managed component — Cloud Run, Firestore, Cloud Storage, Cloud Build — is chosen specifically because it fits inside Google Cloud's Always-Free monthly allotment, and the one LLM dependency (Gemini 2.5 Flash-Lite) is called through the Gemini Developer API's free tier, which is billed independently of Cloud Billing entirely. The project's own GCP project carries a hard $0.01 budget cap with an automatic kill-switch, because it shares its $10 billing account with three other apps that this project must never be able to affect.

RAG · RRF fusion · BM25 · Gemini Flash · Firestore Vector Search · Cloud Run · Always-Free tier · HyDE · query decomposition · billing kill-switch

§1 Problem Statement

Single-query retrieval is the number one production RAG failure mode. A single dense embedding is a reasonable default for a plain factual question, but it breaks down as soon as a query needs to synthesize across documents, compare entities, respect time, or match an exact identifier. Fixing this by hand — tuning chunk size, rewriting queries, adjusting top-k per use case — is slow and rarely converges on the right answer for every query shape a real corpus receives.

1.1 Four failure modes a single embedding can't cover

QueryForge exists because these four patterns recur across every enterprise corpus it has been pointed at, and each needs a structurally different retrieval strategy — not just a better prompt.

Figure 1

Standard single-embedding RAG vs. QueryForge's routed retrieval, on Google Cloud

Classifier confidence below 0.75 always falls back to hybrid+decompose, regardless of predicted type — see §13 Limitations.

Why this had to change

The original design paired this retrieval logic with an open-source, multi-vendor stack — ChromaDB for vectors, local HuggingFace inference for embeddings and reranking, and Hugging Face Spaces for hosting. That stack is technically sound but operationally wrong for this deployment: the account it runs on is Google Cloud free-tier with zero credits, shared with three other apps on a single $10 billing account. Every external dependency was re-evaluated against one question — does this run inside Google Cloud's Always-Free tier, or off it entirely? §7–§8 document the result.

§2 Request Pipeline

Every call to POST /v1/optimize runs inside a single Cloud Run container, from request validation through to the fused, ranked, explained response. There is no separate vector database service, no separate reranker service, and no separate query-rewriting service — everything but the Gemini calls happens in-process.

Figure 2

End-to-end request flow — one Cloud Run container, one external call

The only network call that leaves the Cloud Run container is to the Gemini Developer API — a free-tier endpoint billed independently of Cloud Billing. Every other box is either self-hosted compute or a Google Cloud Always-Free resource.

Stage	What it does
Validate	Request schema · input sanitized
Classify	Gemini Flash-Lite · type + confidence
Decompose	Multi-hop only · 2–5 sub-queries
Retrieve	Dense + sparse + hybrid, in parallel
Rerank	Cross-encoder, multi-hop only
Fuse	RRF, k=60, score-scale-invariant
Recommend + log	Config JSON → Firestore

§3 Pipeline Components

3.1 Classifier, decomposer & query-rewrite strategies

The query classifier, sub-query decomposer, and HyDE fallback are all Gemini 2.5 Flash-Lite calls against the Gemini Developer API (the free, API-key-based endpoint, with keys issued through aistudio.google.com), not the Vertex AI Gemini endpoint. That distinction matters for this deployment: the Developer API's free tier is metered against its own quota and generates no Cloud Billing charge, and Cloud Billing is the account under the $0.01 cap. The classifier returns a query type, a confidence score, and the token-level signals that drove the decision; this classifier_explanation block is never omitted from a response.

Decomposition and HyDE are the two rewrite strategies QueryForge runs live in the build; step-back prompting and synonym-expansion rewrite are supported by the same classifier routing table but are not yet wired into the demo corpus evaluation — flagged honestly in §12 MVP Scope rather than presented as shipped.

Figure 3

Multi-query generation strategies — four variant families, one Gemini 2.5 Flash-Lite call each

Solid teal cards are live in the build; dashed grey cards have a defined classifier route but are deferred — see §12 MVP Scope for the honest split.
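The routing rule described in §3.1 and in the Figure 1 note can be sketched as follows. This is a minimal illustration, not the build's code: the `Classification` record, the `ROUTES` table, and its type names are hypothetical. Only the 0.75 confidence floor, the hybrid+decompose fallback, the multi-hop-only reranker, and the always-present classifier_explanation block come from this document.

```python
from dataclasses import dataclass, field

CONFIDENCE_FLOOR = 0.75  # from this document: below this, fall back to hybrid+decompose

# Hypothetical type names; the strategy per type follows the "Best for" column in §3.2.
ROUTES = {
    "single-hop": "dense",
    "entity-exact": "sparse",
    "temporal": "sparse",
    "comparative": "hybrid",
    "multi-hop": "hybrid",
}


@dataclass
class Classification:
    """Hypothetical shape of the classifier output."""
    query_type: str
    confidence: float
    signals: list[str] = field(default_factory=list)  # tokens that drove the decision


def resolve_route(c: Classification) -> dict:
    """Choose the stages to run and always attach the explanation block."""
    fallback = c.confidence < CONFIDENCE_FLOOR
    multi_hop = c.query_type.startswith("multi-hop")
    return {
        # Unknown types default to hybrid (an assumption of this sketch).
        "retrieval": "hybrid" if fallback else ROUTES.get(c.query_type, "hybrid"),
        "decompose": fallback or multi_hop,
        "rerank": multi_hop,
        "classifier_explanation": {
            "query_type": c.query_type,
            "confidence": c.confidence,
            "signals": c.signals,
            "fallback_applied": fallback,
        },
    }
```

A low-confidence prediction is routed to hybrid+decompose whatever type was predicted, and the explanation block records that the fallback fired.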

3.2 Retrieval strategies

Three retrieval strategies run concurrently via asyncio.gather() inside the same container, plus an optional reranker and the HyDE fallback above. Nothing here calls a paid API.

Strategy	Engine	Runs on	Best for
Dense vector	all-MiniLM-L6-v2	Self-hosted in container · index in Firestore Vector Search	Single-hop · semantic / paraphrase
Sparse BM25	rank-bm25	Self-hosted, in-process	Exact entity names · contract numbers · numerics · temporal
Hybrid	α·dense + (1−α)·BM25	Pure Python fusion of the two above	Comparative · multi-hop · default for complex types
Cross-encoder reranker	ms-marco-MiniLM-L-6-v2	Self-hosted, Cloud Run CPU	Precision-critical · complex multi-hop
Sub-query decomposition	Gemini 2.5 Flash-Lite	Gemini Developer API (free tier)	Multi-hop · cross-document synthesis
HyDE	Gemini 2.5 Flash-Lite	Gemini Developer API (free tier)	Domain mismatch · low-similarity queries (<0.65)
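The concurrent fan-out over the strategies in the table above can be sketched with asyncio.gather(). The three search coroutines here are hypothetical stubs that return fixed document IDs; in the build they would wrap Firestore Vector Search and rank-bm25. Passing `return_exceptions=True` so one failing strategy does not fail the whole request is an assumption of this sketch, not something this document specifies.

```python
import asyncio


async def dense_search(query: str) -> list[str]:
    await asyncio.sleep(0.01)  # stand-in for a vector index lookup
    return ["doc_a", "doc_b", "doc_c"]


async def sparse_search(query: str) -> list[str]:
    await asyncio.sleep(0.01)  # stand-in for an in-process BM25 scan
    return ["doc_b", "doc_d"]


async def hybrid_search(query: str) -> list[str]:
    await asyncio.sleep(0.01)  # stand-in for the weighted dense/BM25 blend
    return ["doc_b", "doc_a", "doc_d"]


async def retrieve_all(query: str) -> dict[str, list[str]]:
    """Run every strategy concurrently and keep the ranked lists that succeeded."""
    names = ("dense", "sparse", "hybrid")
    results = await asyncio.gather(
        dense_search(query),
        sparse_search(query),
        hybrid_search(query),
        return_exceptions=True,
    )
    # gather() preserves argument order, so names and results line up.
    return {n: r for n, r in zip(names, results) if not isinstance(r, BaseException)}
```

Calling `asyncio.run(retrieve_all("vendor contract approval threshold"))` returns one ranked list per strategy, ready to be handed to the fusion step.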

Weighting

Adaptive α

The dense/sparse mix is set per query type from grid-search results (Luan et al., across TREC-COVID, MS MARCO, HotpotQA). Conceptual queries lean dense; entity queries lean BM25.

score = α·dense + (1−α)·BM25

Range: α=0.70 (conceptual, dense-heavy) → α=0.40 (entity-heavy, BM25-heavy).
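A minimal sketch of the weighted blend, assuming both score sets are min-max normalized to [0, 1] before mixing. The normalization step is an assumption: this document gives the formula and the two α endpoints but does not say how raw cosine and BM25 scores are made comparable. The profile names in `ALPHA_BY_PROFILE` are illustrative labels.

```python
ALPHA_BY_PROFILE = {"conceptual": 0.70, "entity": 0.40}  # the two endpoints stated above


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; an assumed preprocessing step."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(dense: dict[str, float], bm25: dict[str, float], alpha: float) -> dict[str, float]:
    """score = alpha * dense + (1 - alpha) * BM25, over the union of both result sets."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    d, b = min_max(dense), min_max(bm25)
    # A document missing from one list contributes 0 for that side.
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0) for doc in d.keys() | b.keys()}
```

With `dense = {"a": 0.9, "b": 0.5}` and `bm25 = {"b": 12.0, "c": 3.0}`, the entity profile (α=0.40) ranks `b` first while the conceptual profile (α=0.70) ranks `a` first, which is the shift the adaptive weighting is meant to produce.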

Fallback

HyDE

When dense similarity falls below 0.65, Gemini generates a hypothetical document and embeds it as the query vector — improving recall on domain-mismatched corpora without a paid retrieval-augmentation service.

Risk: hallucinated hypotheticals degrade recall. Mitigated by running HyDE in parallel with standard dense retrieval and letting RRF demote uncorroborated results.
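The trigger and the mitigation above can be sketched together: the raw query vector is always kept, and a HyDE vector is added only when similarity is low, so both result lists reach the fusion step. The function name and the injected `embed` and `generate_hypothetical` callables are hypothetical; in the build they would wrap all-MiniLM-L6-v2 and a Gemini call.

```python
from typing import Callable, Sequence

HYDE_THRESHOLD = 0.65  # from this document

Vector = Sequence[float]


def query_vectors(
    query: str,
    top_dense_similarity: float,
    embed: Callable[[str], Vector],
    generate_hypothetical: Callable[[str], str],
) -> dict[str, Vector]:
    """Return the vectors to search with: the raw query, plus HyDE on low similarity."""
    vectors = {"dense": embed(query)}
    if top_dense_similarity < HYDE_THRESHOLD:
        # Embed the generated document instead of the query text itself.
        vectors["hyde"] = embed(generate_hypothetical(query))
    return vectors
```

Because the standard dense vector is never replaced, a hallucinated hypothetical can only add an extra ranked list; it cannot remove the results the raw query would have found.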

Fusion

Reciprocal Rank Fusion

Results from every active strategy are merged by rank, not raw score — so a document appearing near the top of two different strategy lists is promoted regardless of score-scale differences between them.

RRF(d) = Σ_s 1 / (k + rank_s(d)), k = 60, summed over every strategy s whose list contains d
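A minimal sketch of the fusion formula above, assuming each strategy hands over a ranked list of document IDs (best first) and ranks start at 1. The function name and the alphabetical tie-break are choices made for this sketch.

```python
from collections import defaultdict


def rrf_fuse(rankings: dict[str, list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Sum 1 / (k + rank) for each document across every list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in rankings.values():
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

For `{"dense": ["a", "b", "c"], "sparse": ["b", "d"], "hyde": ["e"]}`, document `b` scores 1/62 + 1/61 and comes first because two strategies agree on it, while `e`, which only the HyDE list returned, scores a single 1/61 and cannot outrank it. Raw scores never enter the calculation, so the scale of each strategy's scores does not matter.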

§4 Chunking Strategy

A content-type router selects chunking strategy per document class rather than applying uniform token splitting. Chunking config is versioned as YAML in Cloud Storage; the chunk version is stored as metadata on every Firestore document and returned in retrieval results.

Content type	Strategy	Chunk size
Policy / legal docs	Section-aware (split on §, numbered sections)	512–1024 tokens
Runbooks / SOPs	Step-aware (preserve step integrity)	256–512 tokens
FAQ / KB articles	QA-pair preserving (keep Q+A together)	128–256 tokens
Email / Slack	Message-boundary (preserve thread context)	128–256 tokens
Spreadsheets / tables	Row-group (include header in each chunk)	varies
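The router in the table above amounts to a lookup from content type to strategy and size range. The sketch below pairs that lookup with the simplest of the five strategies, row-group chunking. Apart from `policy_docs` and `section-aware`, which appear in the §11 schema, the dictionary keys and strategy identifiers are hypothetical labels, and `rows_per_chunk` is an illustrative parameter.

```python
# (strategy, (min_tokens, max_tokens)); None where the table says "varies".
CHUNK_POLICIES = {
    "policy_docs": ("section-aware", (512, 1024)),
    "runbooks": ("step-aware", (256, 512)),
    "faq": ("qa-pair", (128, 256)),
    "messages": ("message-boundary", (128, 256)),
    "tables": ("row-group", None),
}


def chunk_policy(content_type: str) -> tuple:
    """Look up the chunking strategy for a document class (KeyError if unknown)."""
    return CHUNK_POLICIES[content_type]


def row_group_chunks(header: str, rows: list[str], rows_per_chunk: int = 2) -> list[str]:
    """Split table rows into groups, repeating the header at the top of each chunk."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    return [
        "\n".join([header, *rows[i:i + rows_per_chunk]])
        for i in range(0, len(rows), rows_per_chunk)
    ]
```

Repeating the header means every chunk stays interpretable on its own when it is retrieved without its neighbours.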

§5 Pipeline Simulator

Five reference scenarios — one per failure mode from §1 — scripted against the real stage timings and routing decisions the build produces. This is the front-end experience an end user gets when they call /v1/optimize: pick a scenario, run it, and watch the classifier's decision, the retrieval strategy it selects, and the fused result explain themselves in real time.

[Interactive pipeline simulator, shown idle: a scenario selector, one readout per stage (Validate, Classify, Decompose, Retrieve, Rerank, Fuse, Recommend + log), and four metrics (Recall@10, MRR, Latency p50, α used).]

This mini dashboard is QueryForge's stand-in for a full evaluation UI — recall@k, MRR, and latency are the same three numbers every config_recommendation is scored on in §6.2's experiment grid.

§6 Design Validation

Three views on whether the routing actually helps: a before/after comparison on a real query, the grid search behind the adaptive-α defaults in §3.2, and where this pipeline sits in QueryForge's own MLOps lifecycle.

6.1 Before / after — a multi-hop-entity query

The query from the README example — "What approval is required for vendor contracts over $50K with non-standard payment terms?" — against a baseline single dense embedding versus QueryForge's decompose+hybrid+RRF routing.

Rank	Baseline (single dense embedding)	QueryForge (decompose + hybrid + RRF)
1	General procurement policy overview (partial match)	Procurement approval authority matrix (correct)
2	Vendor onboarding checklist (tangential)	Vendor contract approval threshold $50K (correct)
3	Standard payment terms glossary (tangential)	Non-standard payment terms policy (correct)
4	Non-standard payment terms policy (correct, buried)	Finance sign-off escalation SOP (correct)
5	Procurement approval authority matrix (correct, buried)	Standard payment terms glossary (tangential)

Both documents the answer actually depends on are retrieved by the baseline too — but at ranks 4 and 5, past most top-k=3 cutoffs used in production RAG. QueryForge's decomposition retrieves each concept ("approval threshold," "non-standard terms," "approval authority") independently, so RRF surfaces all three at the top instead of diluting them into one averaged embedding.
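The effect of a top-k=3 cutoff on the table above can be made concrete with recall@k and reciprocal rank. In this sketch the document IDs are shorthand labels for the table rows, and the relevant set is taken to be the two documents the paragraph above says the answer depends on.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result; MRR is the mean of this over queries."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


relevant = {"payment_terms_policy", "approval_matrix"}
baseline = ["procurement_overview", "onboarding_checklist", "payment_glossary",
            "payment_terms_policy", "approval_matrix"]
queryforge = ["approval_matrix", "approval_threshold", "payment_terms_policy",
              "finance_escalation_sop", "payment_glossary"]
```

On this single query the baseline has recall@3 of 0.0 and a reciprocal rank of 0.25, while the routed pipeline has recall@3 of 1.0 and a reciprocal rank of 1.0. Both lists contain the same two relevant documents within the top five, so the difference comes entirely from where they rank.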

6.2 Optimization experiment grid

The α defaults in §3.2 come from a grid search over strategy × α × reranker, evaluated on recall@10, MRR, and p50 latency against a HotpotQA-equivalent internal corpus. This is the shape of grid the config_recommendation block is drawn from.

Config	α	Reranker	Recall@10	MRR	Latency p50
Dense only	1.00	off	0.61	0.54	0.9s
BM25 only	0.00	off	0.58	0.51	0.4s
Hybrid (fixed)	0.55	off	0.74	0.66	1.1s
Hybrid + decompose	0.40 (adaptive)	off	0.81	0.72	2.1s
Hybrid + decompose + rerank	0.40 (adaptive)	on	0.92	0.79	2.5s

The +31% recall figure quoted in the README is this bottom row against the dense-only baseline: recall@10 rises from 0.61 to 0.92, a gain of 31 percentage points (about 51% in relative terms). The recommender only pays the reranker's ~600ms tax when the recall gain justifies it, which §3.2's routing table encodes as "multi-hop only."

6.3 MLOps lifecycle for RAG optimization

Figure 6

Data → experiment → evaluate → recommend → deploy, with drift feeding back to data

The loop closes through Firestore, not a separate MLOps platform: query_logs is both the audit trail and the drift-detection input for the next experiment cycle.

§7 Google Cloud Architecture

Four layers, all Google-native. Nothing in this stack requires an account, API key, or credential outside Google Cloud and the Gemini Developer API.

7.1 Service map

Layer	Components
Interface	REST API (POST /v1/optimize); Python SDK; OpenAPI schema; Cloud Monitoring dashboards
Intelligence	Gemini 2.5 Flash-Lite (Dev API); all-MiniLM-L6-v2 (self-hosted); rank-bm25 (self-hosted); ms-marco-MiniLM-L-6-v2 (self-hosted)
Orchestration	FastAPI · asyncio · Python 3.11; Cloud Run (scale-to-zero); Cloud Build; Artifact Registry
Data & governance	Firestore (vector search + logs); Cloud Storage (corpus + weights); Secret Manager (API key); Cloud Billing Budgets

7.2 IAM & security

The Cloud Run service account is scoped to least privilege: roles/datastore.user (Firestore), roles/storage.objectViewer (corpus bucket), and roles/secretmanager.secretAccessor (Gemini API key). Authenticated callers are granted roles/run.invoker on the service. No third-party vector database, SaaS annotation tool, or external hosting provider is in the request path; the only outbound call from the container is to the Gemini Developer API. On the free tier, Google may use API inputs/outputs to improve its models; enabling billing on the Gemini API opts out of that data use. That is a real trade-off for a $0.01-capped project, noted plainly in §13 Limitations rather than glossed over.

§8 Cost & Adoption Case

Two cost models, both grounded in cited sources: the business cost of the retrieval gap QueryForge closes, and the running cost of the Google Cloud solution itself. Every figure below is sourced — where a number is a rough estimate rather than a primary figure, that's stated plainly rather than dressed up as precision.

8.1 Problem cost — why "good enough" retrieval is expensive

Metric	Value	Detail	Source
Lost productivity, 1,000 knowledge workers	$5.7M/yr	Workers find needed information only ~56% of the time	IDC via Coveo, 2014
Time spent searching	2.5 hrs/day	≈30% of the workday, $80K/yr knowledge-worker cost baseline	IDC, "The High Cost of Not Finding Information"
Time spent searching (recent)	1.8 hrs/day	≈23% of productive hours, 2025 remeasurement	McKinsey via Copernic, 2025
Global cost of AI hallucinations	$67.4B (2024)	Projected ~$112B for 2025 as enterprise AI adoption scales	AllAboutAI 2025, via Holm Intelligence Partners
AI-output verification tax	~$14,200/employee/yr	4.3 hrs/week per employee spent checking AI output	Forrester, "Enterprise AI Cost Analysis," 2025
Manual RAG tuning cost	$4,500–$10,500	Chunking strategy + hybrid search + metadata filtering, one-time, per corpus	Stratagem Systems, 89 production RAG deployments, 2026
Enterprises with ≥1 RAG hallucination incident	67%	Of enterprises running production RAG, in the past year — RAG narrows the hallucination problem, it doesn't close it	Gartner 2026 survey, via NeuralWired

Named incident — why retrieval quality is the fix, not a bigger model

In October 2025, Deloitte refunded part of an AU$440K (~$290K USD) contract with the Australian government after a delivered report was found to contain AI-fabricated citations. (AP, October 2025, via Medium) The most-cited finding across 2025–2026 RAG research is that this class of failure is "overwhelmingly a retrieval problem, not a generation problem" — the model reasons correctly over the wrong chunk. (Seekr, "The Hallucination Tax," 2026) That is precisely the failure mode §1 names and §3 routes around — QueryForge's bet is that better retrieval selection is cheaper than better verification after the fact.

8.2 Solution cost — the Always-Free ledger

Component	Always-Free allowance	QueryForge projected usage	Cost	Source
Cloud Run	2M requests · 360K GiB-sec · 180K vCPU-sec per billing account/mo	Demo traffic, scale-to-zero when idle	$0.00	Cloud Run pricing
Firestore	1 GiB storage · 50K reads / 20K writes / 20K deletes per day, per project	Small demo corpus + query log, well under daily caps; KNN vector search billed 1 read per 100 index entries scanned	$0.00	Firestore pricing
Cloud Storage	5 GB (US regions), 5K Class A / 50K Class B ops	Corpus files + self-hosted model weights	$0.00	Cloud Storage pricing
Cloud Build	120 build-minutes/day	One container build per deploy	$0.00	Cloud Build pricing
Artifact Registry	0.5 GB storage	Single container image	$0.00	Artifact Registry pricing
Gemini Developer API	~15 RPM / ~1,500 RPD / 1M TPM, per project, for Flash-Lite	Classifier + decomposer + HyDE calls	$0.00 — not billed through Cloud Billing at all	Gemini API pricing · rate limits, TokenMix 2026
Total actual spend	—	—	$0.00, capped at $0.01	—

Cloud Run's Always-Free allotment pools per billing account, not per project — QueryForge's share is kept conservative by design so the other three apps on the same account keep theirs. Firestore's free quota is per project, which is the one allotment QueryForge does not have to share.
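A rough headroom calculation for the ledger above. It assumes one vCPU per instance (deploy.sh in §9 does not set a CPU count), the 1 GiB memory setting from deploy.sh, and billed instance time roughly equal to the 2.5s p50 latency of the reranked config in §6.2. These are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
import math

FREE_VCPU_SECONDS = 180_000  # per billing account per month (table above)
FREE_GIB_SECONDS = 360_000
FREE_REQUESTS = 2_000_000


def monthly_request_ceiling(seconds_per_request: float,
                            vcpus: float = 1.0,
                            memory_gib: float = 1.0) -> int:
    """Requests per month before the first Cloud Run free allotment runs out."""
    by_cpu = FREE_VCPU_SECONDS / (seconds_per_request * vcpus)
    by_memory = FREE_GIB_SECONDS / (seconds_per_request * memory_gib)
    return int(min(by_cpu, by_memory, FREE_REQUESTS))


def knn_read_ops(index_entries_scanned: int) -> int:
    """Firestore reads billed for one KNN query: 1 read per 100 index entries scanned."""
    return math.ceil(index_entries_scanned / 100)
```

Under these assumptions the vCPU allotment binds first, at 72,000 requests per month for the whole billing account, before the other three apps take their share. A KNN query that scans 250 index entries is billed as 3 reads against the 50K daily read quota.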

8.3 The budget guard

A hard cap only means something if it enforces itself before a bill is generated. The first design for this used a reactive Cloud Billing Budget → Pub/Sub → Cloud Function kill switch. Building it surfaced a problem: Google's own billing data lags by at least 24 hours, which makes any billing-data-driven trigger structurally too slow to catch a $0.01 overspend before it happens — by the time it fires, the overspend already occurred.

Enforcement path — revised

Every Gemini call now passes through a Firestore-transactional spend guard before it's made: a running monthly total is checked against the $0.01 cap, using published per-token pricing, with zero dependency on GCP's billing pipeline or its lag. The original Cloud Billing Budget ($0.01, project-scoped) is kept as an independent secondary tripwire — defense in depth in case the guard itself has a bug, not the primary safeguard. See ADR-006 (revised) and build/service/budget_guard.py in the repository.
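The check-then-increment pattern behind the guard can be sketched in memory. This is not the contents of build/service/budget_guard.py, which this document does not reproduce: a `threading.Lock` stands in for the Firestore transaction, the class and method names are hypothetical, the per-token prices are caller-supplied placeholders rather than published prices, and monthly rollover of the running total is left out.

```python
import threading
from decimal import Decimal


class BudgetExceeded(RuntimeError):
    """Raised before a call is made when it would push spend past the cap."""


class BudgetGuard:
    def __init__(self, cap_usd: str, usd_per_input_token: str, usd_per_output_token: str):
        self._cap = Decimal(cap_usd)
        self._in_price = Decimal(usd_per_input_token)
        self._out_price = Decimal(usd_per_output_token)
        self.spent = Decimal("0")
        self._lock = threading.Lock()  # stand-in for a Firestore transaction

    def estimate(self, input_tokens: int, output_tokens: int) -> Decimal:
        return input_tokens * self._in_price + output_tokens * self._out_price

    def reserve(self, input_tokens: int, max_output_tokens: int) -> Decimal:
        """Atomically check the running total and record the worst-case cost."""
        cost = self.estimate(input_tokens, max_output_tokens)
        with self._lock:
            if self.spent + cost > self._cap:
                raise BudgetExceeded(f"would reach {self.spent + cost} USD, cap is {self._cap}")
            self.spent += cost
        return cost
```

Reserving the worst-case output cost before the call means the cap holds even if the response is longer than expected. Decimal avoids float rounding on sub-cent amounts.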

8.4 QueryForge vs. alternative approaches

Approach	Annual cost	Retrieval routing	Explainability	Notes
QueryForge (this build)	~$0/yr	✓ 5-way classifier, adaptive α	✓ classifier_explanation on every call	Bounded by Always-Free ceilings — see §12 MVP Scope
Manual RAG tuning (in-house)	$4,500–$10,500 one-time (Stratagem 2026)	~ Fixed config, hand-tuned per corpus	✗ No routing rationale returned	Re-tuning needed whenever the corpus shifts
Glean / managed enterprise search	$8K–$30K/yr (est., per-seat)	~ Proprietary, vendor-controlled	~ Partial, product-dependent	Strong UX, but retrieval logic is not inspectable or self-hostable
Google Vertex AI Search (managed)	$8K–$30K/yr (est.)	~ Managed, multi-tenant	~ Partial	A different product from what QueryForge does — Vertex AI Search is a managed, multi-tenant enterprise search product; QueryForge calls the raw Gemini API from a backend we control. Worth being precise about — the two get conflated often.
Do nothing (single dense embedding)	$0/yr direct, but see §8.1	✗ None	✗ None	The baseline row in §6.1 — cheapest to run, most expensive in downstream errors

Limitations of this cost model

The problem-cost figures in §8.1 are industry averages, not measurements of any specific deployment — actual exposure depends on corpus size, query volume, and how much of an organization's error rate is attributable to retrieval versus other causes, which is not separable from public data. The alternative-approach costs marked "est." are directional, built from public per-seat pricing ranges, not vendor quotes. What is not an estimate: QueryForge's own Google Cloud spend, which is measured against real, cited pricing pages and is $0.00 at demo scale.

§9 Deployment

Single-command deploy to Cloud Run, followed immediately by the budget cap — the cap is treated as part of the deployment, not an optional afterthought.

deploy.sh

gcloud · single project · Always-Free

# build + deploy the container, capped at 3 instances, 1Gi memory
# NOTE: --allow-unauthenticated makes the endpoint public; drop it if callers
# are expected to hold roles/run.invoker (§7.2)
gcloud run deploy queryforge \
  --source . \
  --region us-central1 \
  --max-instances 3 \
  --memory 1Gi \
  --allow-unauthenticated \
  --set-secrets GEMINI_API_KEY=gemini-api-key:latest

# project-scoped budget: the secondary tripwire from §8.3, not the primary guard
# (a Cloud Billing budget sends alerts; it does not stop spend on its own)
# scoped to THIS project only; does not touch the other 3 apps on the shared account
gcloud billing budgets create \
  --billing-account="$BILLING_ACCOUNT_ID" \
  --display-name="queryforge-hard-cap" \
  --budget-amount=0.01USD \
  --threshold-rule=percent=1.0 \
  --filter-projects="projects/$QUERYFORGE_PROJECT_ID"

§10 Architecture Decision Records

QueryForge's original design was open-source and local-first by default. Every ADR below records the point where that default was re-examined against two hard constraints — 100% Google Cloud, and a $0.01 ceiling on a shared billing account — and documents what was chosen instead and why.

ADR-001

Google Cloud-only architecture over open-source, multi-vendor stack

Accepted

Date

2026-07-09

Context

The original design used ChromaDB, local HuggingFace inference for embeddings and reranking, and Hugging Face Spaces for hosting. This deployment runs on a Google Cloud free-tier account with zero credits, sharing a billing account with three other apps. A multi-vendor stack adds operational surface area with no benefit on this constraint set.

Decision

Rebuild on Google Cloud only: Cloud Run for compute, Firestore for persistence and vector search, Cloud Storage for corpus/weights, and the Gemini Developer API for the one LLM dependency.

Considered

chosen

Google Cloud-only — single vendor, single IAM boundary, every managed service has an Always-Free tier that covers demo scale.

rejected

Keep original multi-vendor stack — Hugging Face Spaces hosting and ChromaDB have no committed zero-cost guarantee compatible with a $0.01 ceiling.

Consequences

Self-hosted open-weight models remain in the design (ADR-003) — "Google Cloud-only" means the infrastructure vendor, not that every model call must be a managed API.

ADR-002

Firestore Vector Search over ChromaDB or Vertex AI Vector Search

Accepted

Date

2026-07-09

Context

The retrieval layer needs a persistent vector index. Cloud Run's local disk is ephemeral, so self-hosted ChromaDB would lose its index on every cold start. Vertex AI Vector Search is Google-native but has no Always-Free tier — its cheapest deployed index runs continuously and accrues an hourly charge regardless of query volume.

Decision

Use Firestore (Native mode) with vector search for both the dense index and the query/config store. Persists across cold starts, and its free daily quota comfortably covers a portfolio-scale demo corpus.

Considered

chosen

Firestore Vector Search — native GCP, Always-Free tier, persists across Cloud Run scale-to-zero.

rejected

Self-hosted ChromaDB — no free persistent disk on Cloud Run; needs a separate VM or Filestore volume, both outside Always-Free.

rejected

Vertex AI Vector Search — the deployed index has an always-on hourly cost with no free tier; incompatible with a $0.01 cap. Documented as the production upgrade path in §12.

Consequences

Corpus size for the zero-cost demo is bounded by Firestore's Always-Free storage — see §13 Limitations.

ADR-003

Self-hosted open-weight models over Vertex AI Embeddings / Prediction API

Accepted

Date

2026-07-09

Context

Dense embedding and cross-encoder reranking need a model somewhere in the request path. Vertex AI's Text Embeddings API and hosted Prediction endpoints are Google-native, but both are metered per call with no meaningful free tier.

Decision

Bundle all-MiniLM-L6-v2 and ms-marco-MiniLM-L-6-v2 as open weights inside the Cloud Run container image and run inference on the container's own CPU — compute time is Always-Free up to 180K vCPU-sec/month instead of a per-call API fee.

Considered

chosen

Self-hosted, in-container inference — zero marginal cost per query.

rejected

Vertex AI Text Embeddings API — billed per 1K characters with no free tier; incompatible with a $0.01 cap at any real query volume.

Consequences

Reranker adds ~600ms latency on Cloud Run's free-tier CPU — see §13 Limitations.

ADR-004

Gemini Developer API over the Vertex AI Gemini endpoint

Accepted

Date

2026-07-09

Context

Gemini is reachable two ways: through Vertex AI (billed to the GCP project's Cloud Billing account) or through the Gemini Developer API (API keys issued through aistudio.google.com), whose free tier is metered against a separate quota tied to the API key. QueryForge's classifier, decomposer, and HyDE fallback all need to stay off the $0.01-capped account.

Decision

Call Gemini exclusively through the Gemini Developer API using an API key stored in Secret Manager. The Vertex AI Gemini endpoint is not used anywhere in this design.

Considered

chosen

Gemini Developer API — free tier billed independently of Cloud Billing.

rejected

Vertex AI Gemini endpoint — same model family, but every token generated is metered against Cloud Billing.

Consequences

Throughput is capped at the Developer API's free-tier limits — the primary scaling bottleneck in §13. Free-tier calls may also be used by Google to improve its models; production deployments that need to opt out enable billing on the API, which is a different account decision from the $0.01 Cloud Billing cap this ADR is about. Also worth noting: the model originally specified here (Gemini 2.0 Flash-Lite) was deprecated and shut down June 1, 2026 — this document was updated to Gemini 2.5 Flash-Lite, and future revisions should expect the same churn.

ADR-005

Cloud Run over GKE Autopilot

Accepted

Date

2026-07-09

Context

QueryForge needs a container runtime. GKE Autopilot is the Google-native alternative, but even a minimal cluster bills for a base management fee and any always-on node, regardless of traffic.

Decision

Deploy to Cloud Run with max-instances capped and scale-to-zero enabled. No traffic, no running instance, no charge.

Considered

chosen

Cloud Run — scale-to-zero, 2M requests/month Always-Free, no cost floor.

rejected

GKE Autopilot — cluster management fee and node cost accrue even at zero traffic.

Consequences

Cold starts add latency after idle periods — an accepted trade-off, documented rather than hidden.

ADR-006

Project-scoped, self-enforcing budget guard over a billing-data-reactive kill switch

Revised

Date

2026-07-09 (revised from the original same-day decision, after implementation)

Context

The common pattern for a hard cost cap disables billing (or throttles a service) once a Cloud Billing Budget threshold is crossed. QueryForge shares its $10 billing account with three other apps, so the original decision scoped that pattern to the project only. Building it surfaced a bigger problem: Google's own billing data lags by at least 24 hours. A billing-data-reactive trigger — Pub/Sub push or scheduled poll, project-scoped or not — cannot enforce a $0.01 cap in real time. By the time it fires, the overspend already happened.

Decision

Self-enforce the cap in application code. Every Gemini call now passes through build/service/budget_guard.py first — a Firestore-transactional check against a running monthly total, using published per-token pricing, with zero dependency on GCP's billing pipeline. The original project-scoped Cloud Billing Budget is retained as an independent secondary tripwire, not the primary safeguard.

Considered

chosen

Self-enforcing Firestore-transactional guard — sub-second enforcement, no dependency on billing-data latency; Cloud Billing Budget kept as defense-in-depth only.

superseded

Project-scoped Pub/Sub → Cloud Function kill switch (original decision) — correct in scope, but too slow to be the primary guard given the billing-data lag.

rejected

Account-level "disable billing" kill switch — a QueryForge cost spike would take down three unrelated apps as collateral damage, on top of still being too slow.

Consequences

One extra Firestore transaction per Gemini call — inside the Always-Free daily write quota at demo volume. The cost estimate is derived from token-usage metadata against published pricing, not GCP's actual invoice; treated as a hard-stop guard, not a billing record.

§11 Data Schema

Firestore holds three collections: the vector-indexed corpus, the per-query log, and the config recommendations — all inside the Always-Free daily read/write allotment at demo volume.

corpus_chunks/{doc_id}

Firestore · vector-indexed

// One document per chunk. `embedding` is indexed for Firestore Vector Search.
{
  "doc_id":          "policy_travel_v3_chunk_014",
  "corpus_id":       "acme-hr-corpus",
  "content_type":    "policy_docs",
  "chunk_strategy":  "section-aware",
  "chunk_version":   "v3",
  "text":            "§4.2 Non-standard payment terms require...",
  "embedding":       [0.0123, -0.0871, ...],   // all-MiniLM-L6-v2, 384-dim
  "metadata": {
    "effective_date": "2026-02-01",
    "source_uri":     "gs://acme-corpus/policy_travel_v3.pdf"
  }
}

query_logs/{query_id}

Firestore · Always-Free: 20K writes/day

// One document per /v1/optimize call.
{
  "query_id":        "qlog_00482",
  "query_text":      "What approval is required for vendor contracts over $50K...",
  "classifier_type": "multi-hop-entity",
  "confidence":      0.91,
  "sub_queries":     ["vendor contract approval threshold $50K", "..."],
  "alpha":           0.40,
  "reranked":        true,
  "latency_ms":      2340,
  "timestamp":       "2026-07-09T14:02:11Z"
}
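A typed record for the query log can catch malformed entries before they are written. This is a minimal sketch: the field names come from the example document above, but the `QueryLog` class, its range checks, and the `utc_timestamp` helper are assumptions, not the build's code.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class QueryLog:
    query_id: str
    query_text: str
    classifier_type: str
    confidence: float
    sub_queries: list[str]
    alpha: float
    reranked: bool
    latency_ms: int
    timestamp: str

    def __post_init__(self) -> None:
        # Assumed validity ranges: both values are fractions in [0, 1].
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        if not 0.0 <= self.alpha <= 1.0:
            raise ValueError("alpha must be in [0, 1]")
        if self.latency_ms < 0:
            raise ValueError("latency_ms must be non-negative")


def utc_timestamp(now: Optional[datetime] = None) -> str:
    """Format a time in the same UTC 'Z' style as the example document."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")
```

`asdict(log)` yields a plain dictionary with the same keys as the example document, which is the shape a Firestore write expects.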

§12 MVP Scope & Build Boundaries

Everything below runs inside the Always-Free tier and the $0.01 cap. Nothing here required a paid tier to demonstrate.

Fully implemented

Query classifier + confidence fallback: Gemini 2.5 Flash-Lite · Dev API free tier

Sub-query decomposition + HyDE: multi-hop / low-similarity, live routing

Dense + sparse + hybrid retrieval: Firestore Vector Search + self-hosted BM25

RRF fusion + config recommender: pure Python · Firestore-logged

Budget guard: Firestore-transactional, self-enforcing · project-scoped Cloud Billing Budget as secondary tripwire

Demo-scoped

Cross-encoder reranker: live, but only on a small cached demo corpus to stay inside free CPU-seconds

Firestore Vector Search corpus: capped at ≤50MB to stay comfortably inside free storage

Front-end simulator (§5): scripted stage timings against real pipeline benchmarks, not a live backend call; same pattern used across this portfolio's other design docs

Deferred to production

Step-back & rewrite variants: classifier route defined (§3.1, Fig. 3) but not corpus-evaluated yet

Vertex AI Vector Search: for corpora beyond Firestore's free ceiling (see ADR-002)

Paid Gemini tier: above ~5K queries/day, once free-tier RPM is the bottleneck

§13 Limitations & Known Issues

HyDE hallucination risk	An incorrect hypothetical document degrades recall. Mitigated by running HyDE alongside standard dense retrieval and letting RRF demote uncorroborated results; only activates below 0.65 similarity.
Classifier miscategorization	A multi-hop query misclassified as single-hop reproduces the exact failure QueryForge exists to prevent. Confidence below 0.75 always falls back to hybrid+decompose.
Reranker latency	The cross-encoder adds ~600ms on Cloud Run's free-tier CPU. Only applied to multi-hop queries where the precision gain justifies it.
Gemini free-tier rate limits	~15 RPM / ~1,500 requests per day is the scaling bottleneck. Production deployments above ~5K queries/day need the paid Gemini tier (ADR-004).
Free-tier data use	Google may use Gemini Developer API free-tier inputs/outputs to improve its models. Enabling billing opts out — a real trade-off for confidentiality-sensitive corpora that a $0.01-capped project can't casually take.
Model deprecation churn	Gemini 2.0 Flash-Lite, the model this design originally targeted, was deprecated and shut down June 1, 2026. This document now targets 2.5 Flash-Lite; free-tier model names should be expected to change again.
Firestore Always-Free ceilings	1 GiB storage and 20K writes/day cap the demo corpus size and query-log volume (ADR-002).
Shared billing account	The $0.01 cap is enforced at the project level specifically so a cost spike here cannot cascade into the three other apps on the same account (ADR-006).
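The Gemini free-tier limit in the table above can be respected client-side with a sliding-window limiter. This is a minimal sketch under stated assumptions: the class name is hypothetical, 15 calls per 60 seconds mirrors the approximate RPM figure quoted in this document, and the daily request cap is not handled here.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_calls within any window_seconds span."""

    def __init__(self, max_calls: int = 15, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._max = max_calls
        self._window = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._calls: deque = deque()

    def try_acquire(self) -> bool:
        """Record a call and return True if it fits the window, else return False."""
        now = self._clock()
        # Forget calls that have aged out of the window.
        while self._calls and now - self._calls[0] >= self._window:
            self._calls.popleft()
        if len(self._calls) >= self._max:
            return False
        self._calls.append(now)
        return True
```

A caller that gets `False` can queue the request or skip an optional Gemini step such as HyDE, rather than sending a call the API would reject.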

§14 Glossary

RAG — Retrieval-Augmented Generation

Retrieving relevant context from a corpus and passing it to an LLM alongside the user's query, so the model's answer is grounded in that context rather than parametric memory alone.

RRF — Reciprocal Rank Fusion

A rank-based method for merging results from multiple retrieval strategies. Scores each document by the sum of 1/(k+rank) across every list it appears in, so it never has to reconcile incompatible score scales.

BM25

A sparse, term-frequency-based ranking function. Strong on exact matches — entity names, contract numbers — where dense embeddings tend to blur precise identifiers into a general semantic neighborhood.

HyDE — Hypothetical Document Embeddings

Generating a plausible hypothetical answer document with an LLM, then embedding that document (instead of the raw query) as the search vector.

α-weighting

The blend factor between dense and sparse scores in hybrid retrieval: score = α·dense + (1−α)·BM25. QueryForge sets α per query type based on published grid-search results.

Cross-encoder reranker

A model that scores a (query, document) pair jointly, rather than comparing independently-embedded vectors. Used selectively on multi-hop queries only.

Firestore Vector Search

Google Cloud's native vector-similarity search over Firestore documents. Used here as the dense index because it persists across Cloud Run's scale-to-zero cycles and fits inside the Always-Free daily quota at demo scale.

Always-Free tier

The set of Google Cloud service allotments available every month at no charge, independent of any free-trial credit. Some pool per billing account (Cloud Run); others are per project (Firestore) — see §8.2.

Gemini Developer API vs. Vertex AI Gemini

Two ways to call the same Gemini models. The Developer API (API-key auth, with keys issued through aistudio.google.com) has an independent free-tier quota, metered separately from Cloud Billing. The Vertex AI endpoint is billed through Cloud Billing per token. QueryForge uses the Developer API exclusively (ADR-004).

Budget guard

QueryForge's self-enforcing spend cap: a Firestore-transactional check before every Gemini call, using published per-token pricing. Chosen over a reactive Cloud Billing Budget alert because Google's billing data lags too much to enforce a $0.01 cap in real time (§8.3, ADR-006).