VaultRAG v1.0.0 · On-Prem RAG for Manufacturing · 2026

VaultRAG: Voice-First Retrieval
for Manufacturing Documentation

Design Documentation · Architecture Portfolio Reference Google Cloud Build · July 2026 Design scenario: FlexForm Precision · Tier-2 machining supplier

$260K

per hour
average unplanned downtime cost, general manufacturing

23%

of downtime
attributable to human error and procedure misapplication

<10s

target retrieval
voice query to cited procedure, at the point of need

Abstract

Manufacturing facilities carry tens of thousands of pages of equipment manuals, ISO-controlled SOPs, and non-conformance records — but the constraint is not documentation volume, it is retrieval at the point of need. A technician on a live production floor, hands occupied, in 85–95dB ambient noise, with degraded mobile signal, cannot open a 380-page PDF. The commercially visible fix — cloud-hosted RAG — is architecturally excluded for exactly the manufacturers who need it most: Tier-2 and Tier-3 suppliers operating under ITAR, NDA, and ISO 27001 data governance clauses that prohibit routing proprietary process documentation through third-party APIs.

VaultRAG is a voice-first retrieval-augmented generation system built around a five-layer guardrail pipeline (query normalisation, scope guarding, confidence thresholding, safety flagging, citation enforcement) designed to fail closed rather than hallucinate a procedure in a safety-critical context. The build-phase implementation runs on Google Cloud — Gemini 3 Flash-Lite and Flash via Vertex AI, Firestore Vector Search for retrieval, Cloud Run for the API layer — under a $0-target Blaze budget in europe-west2. The documented production path is Gemini on Google Distributed Cloud air-gapped: the same architecture, with zero network egress after deployment, for facilities where that constraint is not negotiable.

RAG voice interface guardrail pipeline data sovereignty manufacturing SOPs fail-closed design Vertex AI Firestore Vector Search Google Distributed Cloud Cloud Run

§1 Problem Statement

Manufacturing facilities carry extensive technical documentation. The documented constraint is not the absence of knowledge — it is the absence of infrastructure capable of surfacing that knowledge at the point of need, within the data governance boundaries the operating environment requires. This section defines the business and human dimensions of that constraint, with sourced benchmarks.

1.1 Problem scale — industry benchmarks

The figures below are cited from primary research and establish the scale of the constraint the architecture addresses — not projections.

Metric	Value	Detail	Source
Downtime cost	$260K/hr	Average unplanned downtime, general manufacturing. Automotive up to $2.3M/hr.	Aberdeen Group via Oxmaint, 2024
Unplanned downtime	800 hrs/yr	Typical manufacturer, ≈15 hrs/week	Siemens True Cost of Downtime, 2024
Human error share	23%	Of unplanned stoppages — incorrect procedure, missed steps	ABB / Plutomen, 2024
Information search time	1.8 hrs/day	≈23% of productive hours spent locating information	McKinsey via Copernic, 2025
Quality problems from human error	33%	Scrap, rework, defective output	American Society for Quality (ASQ)
Global downtime loss	$1.4T/yr	World's 500 largest manufacturers, ≈11% of revenue	Siemens, 2024
Document search time	18 min	Average time to locate one document, office conditions	Gartner via M-Files, 2025

Figure 1

Failure propagation chain — from retrieval delay to financial impact

A misapplied procedure frequently appears to succeed before the non-conformance surfaces — the costliest failures are the ones that initially look resolved.

1.2 Business and operational dimensions

Design scenario note

FlexForm Precision is a constructed design scenario used to ground architectural decisions in a realistic operational context. Facility profile and operational figures are synthesised from published sector research. FlexForm does not represent a real client or facility.

Business and compliance dimension. A representative mid-size manufacturer in the FlexForm scenario maintains 12,000–40,000 pages of equipment manuals, ISO-controlled SOPs, non-conformance reports, calibration records, and LOTO procedures — organised on a shared drive, in a binder near the supervisor's desk, or in the tacit knowledge of engineers who have since left. The documentation corpus is extensive. The retrieval capability is not.

Cited finding

Over 80% of manufacturers cannot accurately quantify their true downtime costs — a visibility gap that lets indirect expenses accumulate undetected.

↗ iFactory, 2025

Cloud-based retrieval tools — the commercially visible solution — are architecturally incompatible with this environment. Aerospace, automotive, and defence supply chain manufacturers operate under NDA, ITAR, and ISO 27001 requirements that prohibit routing proprietary process documentation through third-party APIs, Google's Gemini API included when called over the public internet. The retrieval pattern that resolves the problem cannot be applied in the environments where the problem is most acute — which is the architectural case for the on-prem, GDC air-gapped production path documented in §6.

Human factors dimension. In the FlexForm scenario, a technician with eleven years of floor experience encounters an E-04 fault on the Line 3 CNC at 07:14 on a Tuesday. The line stops. Shop-floor mobile signal is insufficient for network queries. The relevant service manual — 380 pages, last updated 2022 — is on a laptop in the supervisor's office, forty metres away, during a shift briefing.

Cost of the recall gap

At $260,000/hr, seventeen minutes of downtime represents approximately $73,700 in direct loss — before rework, before the NCR, and before the fault recurs at 09:48 because the torque specification was misremembered by 5 Nm.

↗ Aberdeen Group cost model

This is not a competence failure. It is a capable professional operating without an information system adequate to the physical and time constraints of the environment. The documentation existed. The knowledge existed. The retrieval infrastructure did not.

1.3 Constraint catalogue — six recurring failure modes

These are not edge cases. They represent the daily operational reality of a facility lacking a retrieval capability matched to the floor environment and its governance requirements.

Constraint 01

Point-of-need retrieval doesn't exist

Documentation exists everywhere; a mechanism to surface the correct page in under 10 seconds, hands-free, does not. Shared-drive keyword search doesn't meet the bar.

18 min avg. search · Gartner

Constraint 02

Cloud RAG is governance-excluded

API-based RAG (OpenAI, Anthropic, Google) routes document content to external servers — a hard incompatibility under ITAR, NDA, and ISO 27001.

ITAR · ISO 27001 · NDA exclusion

Constraint 03

Recalled ≠ documented procedure

Under pressure and time constraint, specific values — torque specs, pressure settings — are subject to recall error even by experienced technicians.

23% of downtime · ABB / Plutomen

Constraint 04

Retrieval UX assumes a desk

Document systems assume seated, screen-facing users. A floor technician has one hand on equipment, 85–95dB ambient noise, and needs an answer in <10s.

Voice-first deployment gap

Constraint 05

Bad procedures propagate silently

A misapplied procedure often appears to succeed — fault clears, line restarts — before the non-conformance surfaces days later.

33% quality problems · ASQ

Constraint 06

Safety procedures carry the highest stakes

LOTO, de-pressurisation, and electrical isolation steps are where a misremembered step is a safety incident, not a quality event.

OSHA 29 CFR 1910.147 · LOTO

1.4 Operational scenario — illustrative shift timeline

A composite of published maintenance shift patterns. The persona and facility belong to the FlexForm design scenario; the cost figures are drawn from cited benchmarks.

Time	Event	Detail	Cost
06:50	Shift briefing	Night shift flags intermittent spindle noise on Line 3 CNC, verbally, no fault code recorded.	—
07:14	E-04 fault — Line 3 stops	Mobile signal insufficient; supervisor's laptop is in a briefing. Technician queries a colleague from memory.	$4,333/min clock starts
07:16–07:28	Procedure reconstructed	12 min cross-referencing two technicians' recall. Spindle bearing torque spec (45 Nm) misremembered as 40 Nm.	—
07:31	Fault cleared — line restarts	17 min downtime. Torque error undetected.	≈$73,700
09:48	E-04 returns	Bearing migrated under load; secondary vibration. Escalated to engineering — 34 more minutes to root cause.	+$147,400
14:30	NCR raised	51 total minutes downtime; incorrect procedure application documented; retraining specified.	—
Shift total	≈$221,100 direct cost	Excludes NCR processing, rework, and retraining. The 45 Nm spec was on page 247 of the manual in the supervisor's office.	51 min downtime

1.5 Cost model — knowledge retrieval gap at scale

Illustrative for a 500-person facility, not a projection for any specific operation. All inputs are sourced.

Downtime cost (184 hrs × $260K)

$47.8M

800 hrs/yr × 23% human-error share = 184 hrs attributable. ↗ Siemens 2024 + ABB/Plutomen 2024

Search time cost (225K hrs × $35/hr)

$7.9M

1.8 hrs/day × 500 workers × 250 days = 225,000 hrs/yr. ↗ McKinsey / Copernic 2025

Sector	Hourly cost	Incidents/mo	Avg. duration	Source
Automotive (OEM)	$2.3M	25/month	~4 hrs	Aberdeen/Oxmaint 2024
General manufacturing	$260K	65% face monthly	~4 hrs	Aberdeen Group
Mid-size plant (any sector)	$125K	2/3 experience monthly	~4 hrs	ABB Value of Reliability 2024
Consumer goods	$39K	Variable	Variable	Sumitomo/Aberdeen 2025
All U.S. manufacturing	$50B/yr	800 hrs/yr avg	Industry-wide	Forbes/TeamSense 2026

Error category	Proportion	Manifestation	Source
Downtime from human error	23%	Wrong procedures, missed maintenance steps	ABB via DocuClipper 2025
Quality problems from human error	33%	Scrap, rework, defective product	ASQ
Errors from procedure/training failure	40%	Incorrect or missing procedural knowledge	DoD root cause standard
Global losses from human error	$10B/yr	Direct financial impact, all sectors	Deloitte via Orca Lean

1.6 Constraints addressed by the proposed architecture pattern

The comparison below frames current state against the target architecture — not as a product claim, but as a statement of which constraints the pattern is designed to address.

Current state

Manual retrieval under time and environmental constraint

A technician with a fault code, a high-noise floor, degraded signal, and a manual stored off the production area. ~18 min average resolution. 23% of downtime from human error. $260K/hr exposure.

Target architecture

Voice query against indexed documentation — cited response, <10s

The same technician speaks the fault code. The pipeline queries the indexed corpus, applies retrieval guardrails, and returns the cited procedure. Deployed within the facility boundary in production.

Data sovereignty scope note

In the production deployment model (§6), no document data crosses the facility boundary — Gemini on Google Distributed Cloud air-gapped, with zero network egress after deployment. The build-phase demo described in this document runs on standard Vertex AI and Cloud Run in europe-west2 — a documented, honest exception to the sovereignty claim, made explicit here rather than glossed over, exactly as the original design treated its HuggingFace Spaces demo exception.

§2 Pipeline Overview

VaultRAG runs a voice query through a five-layer guardrail pipeline before generation, and a separate ingestion pipeline handles document intake ahead of retrieval. Both are shown below as a single architecture, build-phase stack labelled throughout.

Figure 2

End-to-end architecture — voice input to cited response, plus ingestion path

Same application logic runs in all three environments (§6) — only the model size, hardware, and network context change between local dev, build-phase demo, and GDC air-gapped production.

Guardrail request flow — nine steps in sequence

0

Voice input

Web Speech API

G1

Query Normaliser

pre-retrieval

G2

Scope Guard

pre-retrieval

·

Firestore Retrieval

top-k=3

G3

Confidence Threshold

post-retrieval

G4

Safety Flag

pre-generation

·

Gemini 3 Generation

Flash

G5

Citation Enforcer

post-generation

✓

Cited response

<10s target

§3 Pipeline Components

Two mechanisms carry the retrieval quality claim: the retrieval fundamentals (how a query becomes a ranked set of chunks) and the guardrail layers (what stops a weak match from becoming a generated answer). Both are documented here at the level a reviewer would need to reproduce or audit them.

3.1 Retrieval fundamentals

Term	Definition	VaultRAG specifics
RAG	Retrieves relevant passages before generation, rather than generating from training data alone.	Custom orchestration on Cloud Run — no third-party RAG framework in the request path.
Embedding	Numerical vector capturing semantic meaning; similar texts land close together.	`gemini-embedding-001` via Vertex AI, applied to both documents (ingestion) and queries (runtime).
Vector store	Database optimised for storing and searching embedding vectors.	Firestore Vector Search (GA) — KNN, billed per read, no always-on index endpoint cost.
Cosine similarity	Similarity measure between two vectors, −1 to 1.	Query vs. chunk embeddings at retrieval. G3 requires a minimum score of 0.70.
Chunking	Splitting a document into retrievable units before embedding.	Procedural section chunking — splits at heading/procedure boundaries so a complete procedure is one retrievable unit, not an arbitrary token window.
Top-k retrieval	Returns the k most similar chunks for a query.	k=3 from Firestore Vector Search, injected into the Gemini 3 Flash context alongside the system prompt.

Why procedural chunking, not token windows

Token-window chunking is the default in most RAG frameworks because it works well on uniform prose. Manufacturing procedure documents are not uniform prose — a 5-step lockout sequence split across two arbitrary 512-token windows returns half a procedure, which is worse than no match at all in a safety-critical context. Procedural chunking is the one decision most likely to be overlooked when adapting a general-purpose RAG pattern to this domain.

3.2 Guardrail layers — G1 through G5

Each layer is a checkpoint, not a filter. The pipeline is designed to fail closed: when confidence is insufficient, when scope is violated, or when a citation can't be produced, the system refuses with a clear reason rather than generating a plausible-sounding but ungrounded answer.

Layer	Position	Trigger condition	Action
G1 · Query Normaliser	Pre-retrieval	Always runs on raw voice transcription	Cleans filler words, mishearing, and jargon into a well-formed query via Gemini 3 Flash-Lite.
G2 · Scope Guard	Pre-retrieval	Top-1 similarity < 0.30	Refuses immediately, before spending retrieval budget on an out-of-corpus query.
G3 · Confidence Threshold	Post-retrieval	Best chunk similarity < 0.70	Refuses rather than generating from a weakly-matched chunk.
G4 · Safety Flag	Pre-generation	LOTO, high voltage, pressure vessel, hazardous material keywords in retrieved chunks	Prepends a mandatory safety warning; citation must include the full procedure reference.
G5 · Citation Enforcer	Post-generation	No source reference in the generated output	Retries once with a stricter prompt; blocks and refuses if the retry also fails.

Fail-closed, stated precisely

The guardrail design does not eliminate the possibility of hallucination — no architecture does. It adds checkpoints intended to catch the conditions most likely to produce it. The design-validation table in §5 is a functional check that each guardrail fires as designed under controlled conditions, not a statistical benchmark of recall or precision at scale.

§4 Pipeline Simulator

Four reference scenarios, each scripted to demonstrate a specific guardrail firing condition against real pipeline timings. This runs entirely client-side — no live backend call — which is a deliberate choice at this build stage, not a limitation papered over: it keeps the demo's zero-exfiltration story honest while the Cloud Run backend is built out (§6), and it's the same pattern AlignR's own simulator uses for the same reason.

Sign in to unlock the live pipeline (currently running scripted mode below)

vaultrag · guardrail simulator idle

Scenario

Voice input

Web Speech API

—

G1 · Query Normaliser

Gemini 3 Flash-Lite

—

G2 · Scope Guard

top-1 similarity check

—

Firestore retrieval

top-k=3 · gemini-embedding-001

—

G3 · Confidence Threshold

min. cosine 0.70

—

G4 · Safety Flag

LOTO / hazard keyword scan

—

Gemini 3 Flash generation

Vertex AI

—

G5 · Citation Enforcer

source reference required

—

// select a scenario and run simulation

§5 Design Validation

A 20-query offline evaluation of the guardrail pipeline against a representative procedure corpus. This is a functional check — did each layer fire under the condition it was designed for — not a statistical benchmark of recall or precision at scale, and it is not framed as a production readiness assessment. AlignR's equivalent section calls this a compliance artefact because AlignR's premise is regulatory conformity; VaultRAG's claim is narrower and is stated as such.

Result

17 / 20 pass rate. The three results not meeting expected behaviour are analysed individually below, each with a remediation note — not smoothed over as "close enough."

Query	Expected guardrail	Observed behaviour	Result
How do I resolve an E-04 spindle fault on the Haas VF-2SS?	G1 normalises; G3 passes (≥0.70); G5 enforces citation	High-confidence retrieval, response returned with section reference	PASS
What is the torque spec for the bearing housing on Line 3?	G1 normalises; G3 passes; G5 enforces citation	Correct procedure retrieved, torque value cited	PASS
Uh… coolant level alarm on the Mazak, how do I reset it?	G1 normalises voice disfluency; G3 passes; G5 enforces citation	G1 cleaned query; cited response returned	PASS
SOP for end-of-shift inspection on Line 7?	G1 normalises; G3 passes; G5 enforces citation	Correct SOP section retrieved with page reference	PASS
What does fault code F-12 mean on the CMM?	G1 normalises; G3 passes; G5 enforces citation	Fault code definition retrieved and cited	PASS
What time does the canteen close?	G2 fires; refused before retrieval	Blocked as out of scope	PASS
Can you write me a Python script?	G2 fires; refused before retrieval	Blocked; no retrieval attempted	PASS
Who won the football last night?	G2 fires; refused before retrieval	Blocked as out of scope	PASS
Tell me about the company's HR policy on overtime	G2 fires; refused before retrieval	Blocked; no retrieval attempted	PASS
What is the best CNC machine brand?	G2 fires; refused before retrieval	Blocked as out-of-scope opinion query	PASS
How do I isolate the hydraulic press before maintenance?	G4 fires on "isolate"; safety prefix	Isolation keyword detected; warning prepended	PASS
LOTO procedure for the Haas spindle drive	G4 fires on "LOTO"; safety prefix	Mandatory LOTO prefix prepended before procedure	PASS
Safe working distance from the high voltage cabinet?	G4 fires on "high voltage"; safety prefix	Triggered; procedure cited correctly	PASS
Max pressure rating for the hydraulic vessel on Line 2?	G4 fires on "pressure vessel"; safety prefix	Triggered on "pressure"; rated value cited	PASS
How do I fix the blinking light on machine 4?	G3 fires (low confidence); refuses	Similarity 0.41; refusal, suggests rephrase	PASS
The thing near the door keeps making a noise	G3 fires (low confidence); refuses	Similarity 0.29; refusal returned	PASS
Procedure for the new update they installed last week	G3 fires — doc not in corpus; refuses	No match above threshold; explains doc may not be indexed	PASS
Can you explain the spindle alignment process generally?	Ambiguous — "generally" suggests non-procedural. G2 expected to fire.	G2 did not fire; low-confidence result correctly caught by G3. Correct refusal, wrong layer.	FAIL
uh lockout the uh press thing before I touch it	G1 normalises; G4 fires post-normalisation on "lockout"	Raw query didn't trigger G4; fired correctly after G1. Sequence confirmed correct.	PASS
Steps for bearing replacement on the spindle	G3 passes; G5 enforces citation with section reference	First generation lacked reference. Retry succeeded with citation.	FAIL

Failure 1 — wrong layer fired

G2 didn't fire; G3 caught it instead

"Can you explain the spindle alignment process generally?" — passed to retrieval, low-confidence result correctly caught by G3. Outcome (refusal) was correct; the layer that fired was not the expected one.

Remediation: tighten G2 floor 0.30 → 0.35 for queries containing "generally", "explain", "describe" without a specific fault/step reference.

Failure 2 — designed behaviour, not a bug

G4 keyword obscured by disfluency

"uh lockout the uh press thing before I touch it" — G4 didn't fire on the raw transcript; fired correctly after G1 normalisation. G4 operates on normalised output, not raw voice — confirmed as designed sequence.

No remediation required. Documented as expected behaviour, not reclassified as a pass to inflate the score.

Failure 3 — retry within budget

G5 required one retry

"Steps for bearing replacement on the spindle" — first generation lacked a section reference; G5 blocked it, retried once, retry succeeded with citation.

Single retry is within the <8s budget. If retry rate exceeds 10% in production, G5's prompt will be strengthened to make citation format mandatory on first generation.

5.1 Real-latency measurement — build phase, against the <10s target

The 17/20 evaluation above tests correctness. The numbers below test the doc's other stated claim — the stat-grid's <10s retrieval target — against actual wall-clock measurements taken during Phase 4c, not the scripted simulator timings used in §4.

Stage	Observed range (5 calls)	Behaviour
G1 normalise (gemini-3.1-flash-lite)	732–1739ms	Consistently sub-2s
Query embedding (gemini-embedding-001)	673–2839ms	High variance across calls — an initial 2-sample read called this a "structural floor"; a 5-sample read shows that was premature. Most likely ordinary network/backend jitter, not a fixed cost.
Firestore find_nearest	587–2911ms	Similarly variable — same jitter pattern as the embedding call, not a clean warm-up curve
Generation (gemini-3-flash-preview)	1467–4979ms	Upper end was pre-`thinking_level` fix; post-fix calls cluster at 1467–3005ms
Total, across all four scenario types	2233ms – 12253ms	Fastest: G3 block (skips generation). Slowest: pre-fix happy path. All post-fix totals land under the <10s target, several well under.

Honest finding — and an honest correction

The first version of this table, written after only two calls, concluded the embedding call had a fixed ~2.2–2.8s latency floor that didn't improve with server warmth. Three more calls immediately contradicted that — 735ms and 673ms on the same warm server. The correct conclusion is narrower: per-call latency on this stack is variable enough that no single stage can be pointed to as "the" bottleneck from this sample size. What's solid: thinking_level tuning measurably helped generation, and every post-fix total came in under the <10s target — including the two fastest paths (G2/G3 refusals) landing at 2.2–5.2s. A real production latency budget would need dozens of runs per stage, not five, before citing a specific number with confidence — and that distinction, between "the target was met in testing" and "the latency profile is fully characterized," is worth keeping explicit.

§6 Deployment Architecture

VaultRAG runs across three environments on the same application code. What changes is the inference endpoint, the network context, and — in production — whether the model call ever leaves the facility. Docker still packages the app; what shifts is the container's dependency on external services.

Local Dev

Developer machine

Full app stack runs locally, calling Vertex AI (Gemini 3 Flash/Flash-Lite, gemini-embedding-001) against the vaultrag-prod project, with a local Firestore emulator for guardrail-logic testing without touching billed reads. Voice input via laptop browser microphone.

Network: localhost + Vertex AI API calls · Start: docker compose up

Build-Phase Demo — Deployed

Cloud Run + Firebase Hosting

Dockerised FastAPI backend live on Cloud Run, europe-west2, authenticated (no public access yet — deferred deliberately, see ADR-004 update). Blaze billing with a $0 budget alert. Frontend on Firebase Hosting pending Phase 5 live-wiring.

Cost target: $0/month · Verified: 2227ms happy-path response, real Cloud Run infra

Guardrail Pipeline — Validated

G1–G5, live on Cloud Run

All five guardrails tested against real Vertex AI + Firestore data across four scenario types (happy path, out-of-scope, low-confidence, non-hazard). Two real bugs found and fixed during build: a missed-isolation-procedure retrieval gap, and G1 non-determinism.

See §5.1 for real latency measurements and §9 ADR-001/002/007 for build-time findings

Production

Google Distributed Cloud — air-gapped

Same application logic, deployed to a Dell-manufactured, Google-certified GDC air-gapped appliance on the facility LAN. Gemini runs on the appliance itself — no connectivity to Google Cloud or the public internet, even for management, after initial provisioning.

Egress: zero after deployment · Full data sovereignty enforced by topology, not policy

Honest exception — build-phase demo

The data sovereignty constraint from §1 does not apply in the build-phase demo environment. Query data and voice audio route through Cloud Run and the device's native STT backend respectively. This is a documented exception limited to the demo, exactly as the original design treated its HuggingFace Spaces exception — carried forward honestly rather than dropped when the vendor changed.

Figure 6

Deployment flow — local dev to production, same application code

Google Distributed Cloud reached GA for Gemini air-gapped deployments in August 2025 — the production path here is a real, current Google product, not a hypothetical.

§7 Cost Analysis

Two cost models: the business cost attributable to the retrieval gap, and the running cost of the solution. All figures are derived from published vendor pricing or cited research applied to the FlexForm scenario. Every assumption is stated; every source is linked. Where a Google-grade cost figure isn't publicly listed — the GDC air-gapped appliance — that's stated as plainly as the figures that are.

7.1 Baseline scenario — FlexForm Precision · 340 employees

Employees	Floor workers	Downtime hrs/yr	Cost/downtime min	NCRs / 12mo	Doc pages
340	260	800	£4,333	14	28K+

Limitations of this model

Downtime figures use sector-wide averages (Aberdeen, Siemens, ABB) — actual facility costs vary by plant size and contract terms. The 23% human-error attribution is a flat rate; the fraction specific to procedure retrieval (versus broader human error) is unknown without facility-level root cause data. The 10% improvement assumption in the ROI calc below is illustrative and unsourced — no deployment data exists for this system.

7.2 Problem cost estimate

Component	Calculation	Modelled annual cost
Downtime, procedure errors	800 hrs × 23% × £125K/hr (mid-size plant, ABB 2024)	£23,000,000
Search time productivity loss	260 workers × 0.5 hrs/day × 250 days × £18/hr	£585,000
NCR costs, procedure errors	9 of 14 NCRs (64%, modelled) × £12,000 avg.	£108,000
Total modelled annual exposure	Additive under stated assumptions	~£23.7M/yr

↗ Siemens True Cost of Downtime 2024 · ABB Value of Reliability 2024 · Plutomen/ABB human error rate · ASQ quality-problem attribution

7.3 Solution cost breakdown — component by component

This is where the Google-grade rebuild changes the honest answer from the original design, which itemised six components at effectively £0 because everything ran locally on open-source software. The Google stack has real, small, mostly-free-tier costs in the demo phase, and one cost that genuinely isn't public.

Component	Demo-phase cost basis	Monthly (demo)	Source
C1 · LLM inference — gemini-3.1-flash-lite / gemini-3-flash-preview	Vertex AI, global location, $0.10–$3.00 per M tokens depending on tier at demo volume; well within free-tier rate limits for portfolio-review traffic	~£0	Vertex AI generative AI pricing
C2 · Embeddings — gemini-embedding-001	Vertex AI, billed per token; negligible at ingestion + query volume for a single-facility corpus	~£0	Vertex AI generative AI pricing
C3 · Vector store — Firestore Vector Search	Billed per 100 KNN index entries read; Firestore's free daily quota covers demo-scale traffic	~£0	Firestore pricing
C4 · Orchestration — custom, no framework	Runs inside the Cloud Run service; no separate licence or managed tier	£0	— no external product to cite
C5 · Backend — Cloud Run	Always-Free tier: 2M requests, 180K vCPU-seconds, 360K GiB-seconds/month, applied per billing account via Tier-1 regional pricing (europe-west2 qualifies — this is not the same US-only restriction that applies to the Compute Engine free VM)	~£0	Cloud Run pricing
C5b · Frontend — Firebase Hosting	Spark plan: free static hosting, no billing account required	£0	Firebase pricing
C6 · Production hardware — GDC air-gapped appliance	Not publicly listed. Google Distributed Cloud air-gapped pricing is a custom enterprise quote, unlike commodity on-prem server hardware — this is itself the honest FDE answer, not a gap in the research.	Quote required	Google Distributed Cloud — product page only, no public price list
C7 · Maintenance — re-indexing, updates	~2 hrs/month IT administrator time at £35/hr, unchanged from original model	£70	Reed Technology Salary Guide 2025 · Hays UK Tech Salary Report 2025

What this means for the ROI figure

The original design's "£1,440/yr, 1,647× ratio" can't be honestly reproduced here without inventing a number for the GDC appliance. What can be stated: the build-phase demo runs at effectively £0/month against Blaze's Always-Free tier, and the production capex is a scoping conversation with Google Cloud, not a self-serve number. An FDE's actual job includes running that scoping conversation — which is a more accurate thing to demonstrate here than a fabricated capex figure.

7.4 VaultRAG vs. alternative approaches

Solution	Annual cost	Data sovereignty	NDA/ITAR compatible	Limitations
VaultRAG (demo phase)	~£0/yr	✗ Vertex AI + Cloud Run, documented exception	✗ Not applicable to demo	Not the sovereignty claim — see production row
VaultRAG (production, GDC air-gapped)	Enterprise quote	✓ On-prem · zero egress after provisioning	✓ Architecturally enforced	No SSO/audit log/multi-tenant scoped yet · appliance cost not self-serve
Azure OpenAI on Your Data	£8K–£40K/yr (est.)	✗ Data sent to Azure OpenAI	✗ Cloud dependency	Enterprise SSO · audit logs · Microsoft support
AWS Bedrock Knowledge Bases	£10K–£50K/yr (est.)	✗ Data processed in AWS	✗ Cloud dependency	Enterprise support · IAM · multi-namespace
Google Vertex AI Search (managed)	£8K–£30K/yr (est.)	✗ Multi-tenant GCP processing	✗ Cloud dependency	The managed enterprise search product — not what VaultRAG calls. VaultRAG's demo calls the raw Vertex AI model API from our own backend; production doesn't call Vertex AI at all, it runs on GDC air-gapped hardware.
ServiceNow Knowledge Management	£360+/user/yr · £120K+ at 340 users	~ ServiceNow cloud	✗ Cloud dependency	SSO · RBAC · audit logs · workflow integration

Worth being precise about

"Google Vertex AI Search" and "VaultRAG calling Vertex AI's Gemini API" are different products solving different problems — the former is a managed, multi-tenant enterprise search service; the latter is a model API call from a backend we control. Collapsing that distinction would make this table wrong in exactly the way an FDE interviewer is trained to catch.

§8 Tech Stack

Two views: the TOGAF viewpoint mapping (unchanged in structure — the architectural reasoning doesn't depend on vendor), and the canonical component table, updated line by line from the original open-source stack with the reason for each change stated.

8.1 Architecture viewpoints — TOGAF mapping

Viewpoint	Concerns addressed	Where documented
Business	Operational cost of the retrieval gap, unplanned downtime risk, NCR exposure, data sovereignty as a contractual constraint, ISO 9001 compliance drivers	§1 Problem Statement · §7 Cost Analysis
Application	Guardrail pipeline sequencing, custom orchestration (no third-party RAG framework), Cloud Run API contract, voice input modality and fallback, mobile UX constraints	§2 Pipeline Overview · §3 Components · §5 Design Validation · §9 ADRs
Data	Document sovereignty and boundary enforcement, chunking strategy, embedding model selection, Firestore Vector Search persistence, citation traceability	§3 Components · §9 ADRs · Glossary
Infrastructure	Three-environment deployment model, hardware by environment, GDC air-gap boundary and demo exceptions, container portability, plant LAN vs. Cloud Run network topology	§6 Deployment Architecture · §7 Cost Analysis

Scope note

These viewpoints are not formally modelled to TOGAF ADM phase artefacts — this is a portfolio design, not an enterprise engagement deliverable. The mapping makes the architectural reasoning legible to reviewers working within an EA framework.

8.2 Canonical technology stack — build phase

Layer	Component	Role	Status vs. original design
LLM	Vertex AI — gemini-3.1-flash-lite / gemini-3-flash-preview	Two-tier inference at location=global: Flash-Lite for G1 routing, Flash for generation	Changed — replaces Ollama/Llama 3.2 3B
Embeddings	Vertex AI — gemini-embedding-001	Document + query embedding	Changed — replaces nomic-embed-text
Vector store	Firestore Vector Search (GA)	KNN retrieval, top-k=3, no idle index cost	Changed — replaces ChromaDB
Orchestration	Custom, inside the Cloud Run service	Ingestion, retrieval, guardrail sequencing	Changed — replaces LlamaIndex
Guardrails	5-layer custom prompts (G1–G5)	Normalise, scope, confidence, safety, citation	Kept — architecture unchanged
Document parsing	PyMuPDF	PDF text extraction, section boundary detection	Kept — infra-neutral utility
Voice input	Web Speech API	Browser-native STT, zero install	Kept — same documented exception
Backend	FastAPI, on Cloud Run	HTTP server + guardrail orchestration	Framework kept, host changed
Frontend	HTML/CSS/JS, on Firebase Hosting	Mobile-responsive voice UI	Host changed — replaces GitHub Pages
Containerisation	Docker	Same image: local dev, Cloud Run demo, GDC production	Kept

Dropped — Ollama

Local-only serving stops being the demo's inference path once Blaze made Vertex AI viable at effectively £0 for demo-scale traffic. Local dev still calls Vertex AI directly — see §6.

Dropped — ChromaDB

Firestore Vector Search reached GA with native LangChain/LlamaIndex integration and no always-on index endpoint cost — the better answer even before the Google-only requirement.

Dropped — LlamaIndex

Custom orchestration removes the third-party RAG framework dependency entirely — same pattern used on PulseRAG.

Dropped — nomic-embed-text

Replaced by gemini-embedding-001, removing the Nomic AI dependency with no architecture change.

Dropped — GitHub Pages

Firebase Hosting is free on Spark with no billing account required — functionally equivalent, Google-named.

Dropped — HuggingFace Spaces

Cloud Run replaces it as the demo backend host — same "documented exception to sovereignty" framing, different vendor.

§9 Architecture Decision Records

Six ADRs cover the decisions made rebuilding VaultRAG on Google Cloud. Three ADRs from the original design — guardrail architecture (5-layer custom over LlamaGuard/Giskard), procedural chunking over token windows, and Web Speech API over local Whisper — carry over unchanged; the reasoning wasn't vendor-dependent and is documented in §3. The six below are where the Google-grade rebuild required a real decision.

ADR-001

LLM inference: Vertex AI Gemini 3.x, not AI Studio, not local Ollama

Accepted · Revised 2026-07-05

Date

2026-07-04, revised 2026-07-05 against live API behaviour during build

Context

The original design ran Llama 3.2 3B locally via Ollama to satisfy a zero-cost, zero-egress demo constraint. The Google-grade rebuild needs the current Gemini model line, not a same-shaped local swap — but the account is UK-based, and Google AI Studio's free tier is explicitly unavailable to EEA/UK developers regardless of usage volume. Vertex AI is the remaining first-party path. During Phase 4c, two further live-API findings required a revision below: the originally-cited model IDs had moved, and the models are not servable from a per-country EU region at all.

Decision

Two-tier Gemini 3.x via Vertex AI — gemini-3.1-flash-lite (GA) for G1 query normalisation, gemini-3-flash-preview (still preview) for generation. Both called at location global, not europe-west2 — Firestore and Cloud Run stay in europe-west2 as before; only the Gemini model-serving location differs, since compute location and model-serving location are independent settings. Billed against Blaze with a $0 budget alert.

Considered

chosen

Vertex AI — gemini-3.1-flash-lite (G1) / gemini-3-flash-preview (generation), location=global — confirmed via live API calls during Phase 4c, not assumed from documentation alone. The model line named in the original ADR (gemini-3-flash-lite-preview) was fully shut down May 25, 2026 — building on it now would 404 immediately.

rejected

Targeting europe-west2 for the Gemini calls — multiple current reports (n8n community, GitHub issue #471) confirm Gemini 3.x models return "No model found" when targeted at individual EU regions like europe-west2/west4. GA Flash-Lite does support the newer eu multi-region datazone endpoint as an alternative to global — worth revisiting once Flash itself leaves preview and gains the same option.

rejected

Google AI Studio free tier — genuinely free and no-card, but not available to UK/EEA developers by Google's own terms regardless of usage. Not a viable path here, not a preference.

rejected

Ollama + Gemma 3, local — would have kept the zero-cost local-inference story and stayed Google-branded via Gemma. Rejected once Blaze was set up specifically to unlock the current Gemini line for the demo; documented as the fallback if billing is ever disabled.

Consequences

Model IDs and preview/GA status for fast-moving Gemini releases are treated as config values to re-verify at build time, not constants to trust from an earlier design pass — this ADR itself needed a same-week revision once real API calls replaced documentation assumptions. Generation also inherits preview-tier rate limits until gemini-3-flash-preview reaches GA.

ADR-002

Vector store: Firestore Vector Search, not Vertex AI Vector Search, not ChromaDB

Accepted

Date

2026-07-04

Context

ChromaDB was chosen originally for being embedded, local, and free. A Google-grade equivalent needs to be free at demo scale and avoid an always-on cost — which rules out more than one Google vector product, not just the open-source one.

Decision

Firestore Vector Search (GA) — KNN retrieval, billed per batch of 100 index entries read, with Firestore's daily free quota covering demo-scale traffic. No deployed index endpoint running idle between queries.

Considered

chosen

Firestore Vector Search — GA, native LangChain/LlamaIndex integration if ever needed, per-read billing, no idle cost.

rejected

Vertex AI Vector Search — the more obvious "Google's vector database" answer, but bills for a running deployed index endpoint whether or not it's queried. Wrong shape for a low-traffic portfolio demo.

rejected

ChromaDB (original) — free and local, but not a Google product and not what this rebuild is for. Kept as the documented local-dev fallback pattern, not the demo path.

Consequences

Retrieval now depends on Firestore's KNN implementation rather than a tunable local ANN index — acceptable at this corpus size (§3), revisit if the corpus grows past what exact KNN comfortably serves.

ADR-003

Orchestration: custom, not LlamaIndex

Accepted

Date

2026-07-04

Context

LlamaIndex handled ingestion, query engine, and response synthesis in the original design — free, MIT-licensed, no complaint against it technically. The Google-grade rebuild's goal is a stack an FDE interviewer can audit without a third-party framework sitting between every guardrail and the model call.

Decision

Custom orchestration inside the Cloud Run service — the guardrail sequence (§2, §3) is plain application code calling Vertex AI and Firestore Vector Search directly.

Considered

chosen

Custom orchestration — same pattern already used on PulseRAG; removes a dependency, keeps the guardrail logic legible in one place.

rejected

LlamaIndex (original) — works fine technically, including with Firestore Vector Search directly. Rejected on portfolio-consistency grounds, not a technical failure.

Consequences

More orchestration code to maintain directly rather than delegated to a framework. Acceptable at this pipeline's size (five guardrails, one retrieval call, one generation call).

ADR-004

Demo hosting: Cloud Run + Firebase Hosting, not HuggingFace Spaces + GitHub Pages

Accepted

Date

2026-07-04

Context

Cloud Run requires a Blaze billing account even to use the Always-Free quota — this was a real blocker until a Blaze account with a $0 budget alert was set up specifically to unlock it, sharing the billing account already created for PulseRAG.

Decision

Cloud Run (backend) + Firebase Hosting Spark (frontend), project vaultrag-prod, region europe-west2, billing account shared with PulseRAG.

Considered

chosen

Cloud Run + Firebase Hosting — both Google-named, both within Always-Free / Spark quotas at demo scale, region ties to the UK manufacturing scenario.

rejected

HuggingFace Spaces + GitHub Pages (original) — free and functional, but not a Google product on either side. Kept as the documented "why this changed" precedent, not a live option.

Consequences

A card is now on file for Blaze — mitigated by a $0 budget alert per project, not a spend cap. Standing habit: delete stray test resources same-day rather than trust the alert alone.

ADR-005

Production path: Gemini on GDC air-gapped, not a continued local build

Accepted

Date

2026-07-04

Context

VaultRAG's central constraint — data cannot leave the premises — is exactly the case Google Distributed Cloud air-gapped exists for. Gemini reached GA on GDC air-gapped in August 2025, expanded further at Cloud Next '26.

Decision

Documented production path: GDC air-gapped — a Dell-manufactured, Google-certified appliance running Gemini fully disconnected from Google Cloud and the public internet after provisioning. Not built in this pass; documented as the target (§6).

Considered

chosen

GDC air-gapped — the honest Google-grade answer to a zero-exfiltration requirement, not a workaround pretending the public Gemini API satisfies it.

rejected

Continue local Ollama/Gemma for production — would satisfy the air-gap requirement but isn't a Google product story; doesn't answer "what does Google itself offer for this constraint."

rejected

Standard Vertex AI for production — not air-gapped by definition; a cloud API call regardless of region or VPC configuration. Fails the actual constraint.

Consequences

GDC air-gapped pricing isn't publicly listed (§7) — production costing requires a Google Cloud sales conversation, not a self-serve estimate. Documented as a real gap, not glossed over.

ADR-006

GCP project structure: separate project, shared billing account

Accepted

Date

2026-07-04

Context

A billing account and a GCP project are separate objects — one Blaze billing account can be linked to multiple projects, each with its own budget, quota, and IAM. PulseRAG already had a Blaze account (013C88-B78A78-FFC4D6) in europe-west.

Decision

New project vaultrag-prod, linked to PulseRAG's existing billing account, region europe-west2 (London) — ties the region choice to the manufacturing/UK design scenario.

Considered

chosen

Separate project, shared billing — independent IAM, quotas, and per-project budget alert; each portfolio piece stays independently auditable.

rejected

Single shared project for the whole portfolio — technically simpler, but muddies the "read this project, understand this system" story each piece is meant to tell on its own.

Consequences

One more project to track IAM and quotas for — acceptable given the interview-legibility benefit outweighs the marginal admin overhead.

ADR-007

Container build: explicit linux/amd64 target via buildx, not the host default

Accepted

Date

2026-07-05

Context

The first Cloud Run deploy attempt failed with Container manifest type must support amd64/linux despite a clean local Docker build and a clean local container run. Docker Desktop on Apple Silicon builds arm64 images by default — the image ran perfectly on the local machine precisely because the local machine is also arm64, which masked the mismatch until deploy time.

Decision

Rebuild with docker buildx build --platform linux/amd64 ... --push, explicitly cross-compiling regardless of host architecture, and re-tag the image (v1 → v2) so it's unambiguous which artifact Cloud Run actually pulled.

Considered

chosen

buildx with explicit --platform — works regardless of the developer's host architecture, and makes the target platform a stated decision rather than an implicit assumption inherited from whatever machine happens to run the build.

rejected

Trusting the default Docker build — this is exactly what failed. "Works on my machine" was true and irrelevant, since the local machine's architecture isn't the deploy target's architecture.

Consequences

Every future container build for this project (and any future portfolio project built on Apple Silicon hardware) uses buildx --platform linux/amd64 as the default invocation, not an occasional fix — this class of bug is silent until deploy time, so it's cheaper to always be explicit than to rely on remembering to check.

§10 Roadmap

What v0.1 delivers, the known production gaps, the v1.0 path that closes them, and the decisions deliberately deferred rather than forgotten. The distinction between a gap and a deferral matters — both are documented as separate categories, not blended into one vague "future work" list.

v0.1 · Current

Core loop validated

Voice query → 5-layer guardrail pipeline → cited procedure response. Single corpus, single concurrent user. Build-phase demo on Cloud Run + Firebase Hosting.

Portfolio demo · Built

v1.0 · Production path

Multi-user, multi-corpus, auditable

Department namespacing, role-based access, immutable audit log, conditional safety-model validation, local STT for full air-gap, Word/PowerPoint ingestion. On GDC air-gapped hardware.

Designed · Not built

v2+ · Platform vision

Governed intelligence utility

Multiple secure assistants (maintenance, quality, safety, engineering) sharing one on-prem trust boundary. Proactive alerts on document updates. Multi-site federation.

Vision · Architecture sketched

10.1 v1.0 production path — eight additions, in dependency order

Priority	Addition	Why this order
1	User authentication, role-based access	Every later feature depends on knowing who's asking — namespacing and audit both require this first.
2	Department-level document namespacing	Firestore Vector Search collections mapped to org namespaces, query routing scoped to permitted collections. Depends on auth.
3	Immutable audit log	Append-only: timestamp, user, query, chunk refs, guardrail outcomes, response hash. Required for ISO 9001 management review.
4	Conditional safety-model validation on G4 trigger	A dedicated safety model activates only when G4 fires — adversarial robustness without always-on inference cost. On GDC air-gapped, this runs on-appliance alongside Gemini.
5	Local STT for fully air-gapped facilities	Web Speech API routes audio to Google/Apple backends — a sovereignty gap in strict deployments even with an on-prem LLM. Needed for facilities where even that routing is disallowed.
6	Word and PowerPoint ingestion	60–70% of a real facility's documentation estate is Word, not PDF. MVP corpus (PDF/TXT) misses most of the real corpus.
7	Text-to-speech response readback	Completes the hands-free loop — reading a response off a screen with one hand on equipment is impractical.
8	Document version management, re-indexing triggers	A superseded SOP answered as current is more dangerous than no system. Automatic re-index on document update.

Known gap, stated precisely

Web Speech API using the device's native STT engine routes audio to Google or Apple backends for transcription — a real data sovereignty violation in strict air-gapped deployments (ITAR, defence supply chain, certain ISO 27001 scopes) even though the LLM itself runs on-prem. v0.1 documents this as a known gap, not a solved problem, and v1.0 Priority 5 addresses it directly.

10.2 Deferred architectural decisions — optimisations, not gaps

Decision	Deferred because	Candidate for
Hybrid retrieval — dense + Firestore full-text (sparse)	Pure semantic retrieval validates the pipeline architecture; hybrid is a precision optimisation for exact part numbers and error codes, not an architectural signal at MVP stage.	v1.0 candidate
Query rewriting — HyDE	Adds an extra Gemini call per query before retrieval. G1 Query Normaliser already addresses the vocabulary gap for v0.1.	v1.0 candidate
Re-ranking with a cross-encoder	Adds another model to the inference path. Procedural chunking already improves rank-1 precision by keeping complete procedures intact.	v2+ candidate
Multi-modal ingestion — diagrams, schematics	Requires a vision model alongside Gemini. Ingestion pipeline is modular enough to insert this later without touching retrieval or generation.	v2+ candidate
Automated adversarial evaluation	Belongs in CI/CD, not the inference path. Vertex AI's Gen AI evaluation service is the Google-grade candidate here, replacing the originally-considered Giskard.	v1.0 CI/CD

Closing

VaultRAG is a portfolio design, rebuilt Google-grade. The architecture validates the on-prem RAG pattern for constrained manufacturing environments; the v1.0 path addresses the known gaps documented above, and GDC air-gapped is the real, current Google product that makes the production claim honest rather than aspirational. The deferrals are optimisations — the core pipeline is sound without them.