Architecture Portfolio · 2026 · Manufacturing · Enterprise AI

VaultRAG — local-first RAG
for constrained manufacturing environments

A design study in voice-first document retrieval for factory floor use. The architecture is built around three constraints that exclude cloud RAG services: data sovereignty requirements, air-gapped network conditions, and zero ongoing API cost. All inference runs on-prem. This portfolio documents the design decisions, trade-offs, and known limitations.

Voice-first interface | On-prem LLM inference | Local vector store | LlamaIndex | ChromaDB | Ollama · Llama 3.2 | 5-layer guardrails | FastAPI · Docker | Portfolio prototype
$260K per hour: average cost of unplanned manufacturing downtime · Aberdeen Group
18 min per search: average time to locate the correct procedure under fault conditions
On-prem inference target: in the production deployment, no data leaves the facility network. Demo runs on HuggingFace Spaces.
< 8s to answer: from voice query to cited, step-by-step procedure response
01 — Design context

Four conditions that make this architecture viable

The manufacturing knowledge retrieval problem is not new. What changed between 2022 and 2024 is the availability of open-source components capable of running the full RAG pipeline locally — without cloud dependencies, without API costs, and without sending proprietary documents outside the facility network.

Condition — 01
Small local LLMs reached usable instruction-following quality

Llama 3.2 3B runs on commodity hardware with instruction-following quality sufficient for structured procedure retrieval. Paired with a retrieval step that selects the relevant passages from a 400-page equipment manual, the model can return the correct procedure while running entirely on a facility server: no API calls, no internet dependency at inference time.

Llama 3.2 · Meta · 2024
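As a rough illustration of what "no API calls" means in practice, the sketch below builds a request for a locally running Ollama server and sends it over localhost. It assumes Ollama's default port (11434) and its `/api/generate` endpoint; the system prompt wording, the `build_request` and `ask_local_llm` helper names, and the `llama3.2:3b` model tag are illustrative assumptions, not the project's actual code.

```python
import json
from urllib import request

# Assumption: Ollama is listening on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Assumption: illustrative wording, paraphrasing the constraints in the
# architecture diagram (step-by-step, max 5 steps, citation mandatory).
SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    "Give step-by-step instructions, at most 5 steps, "
    "and cite the source document."
)


def build_request(question: str, context: str, model: str = "llama3.2:3b") -> dict:
    """Build the JSON payload for a single, non-streaming generation call."""
    return {
        "model": model,
        "system": SYSTEM_PROMPT,
        "prompt": f"Context:\n{context}\n\nQuestion: {question}",
        "stream": False,  # ask for one complete JSON reply instead of chunks
    }


def ask_local_llm(question: str, context: str, url: str = OLLAMA_URL,
                  timeout: float = 60.0) -> str:
    """Send the payload to the local server; traffic stays on the host."""
    body = json.dumps(build_request(question, context)).encode("utf-8")
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask_local_llm(...)` requires a running Ollama instance with the model already pulled; `build_request(...)` can be inspected on its own without one.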
Condition — 02
Browser-native voice input removed the app-install barrier

Speech recognition through the Web Speech API is available in Chromium-based browsers and Safari, though not in Firefox, and some implementations send audio to a vendor-hosted recognition service. Where it is supported, a technician can speak a query one-handed without installing anything. In the production deployment this is replaced by local Whisper STT to eliminate the external STT backend dependency; that trade-off is documented in ADR-005.

Web Speech API · Baseline 2023
Condition — 03
Embedded vector stores made local semantic search viable

ChromaDB and similar embedded stores reduced the retrieval layer to a single dependency with persistent local storage. Semantic search over a facility's document corpus now requires no external service, no managed database, and no ongoing cost — making it deployable inside an air-gapped network.

ChromaDB · Apache 2.0 · 2024
Condition — 04
Data sovereignty requirements exclude cloud RAG services

Manufacturing firms in aerospace, automotive, and defence supply chains increasingly prohibit sending proprietary process data to third-party APIs. This contractual constraint — not a preference — architecturally excludes cloud-hosted RAG services and creates a genuine requirement for the on-prem pattern this project addresses.

ISO 27001 · ITAR · NDA clauses
02 — The problem

The knowledge retrieval gap in manufacturing operations

The design scenario addresses two related problems: a business cost measured in downtime and rework, and an operational constraint measured in the time it takes a technician to locate and apply the correct procedure under fault conditions. Both are documented in detail on page 02.

Business dimension

Technical documentation that cannot be queried under fault conditions

A mid-size manufacturer typically maintains 12,000–40,000 pages of technical documentation — equipment manuals, ISO-controlled SOPs, non-conformance reports, maintenance logs. This documentation exists as PDFs on network drives and printed binders. It is not queryable. When a machine faults, the retrieval process is manual, sequential, and slow.

Every minute of that search is unplanned downtime. Every wrong procedure is extended downtime plus a potential non-conformance record. The cost is documented; the retrieval mechanism has not changed.

$260K/hr
Average unplanned downtime cost in general manufacturing · Aberdeen Group via Oxmaint 2024
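To make the scale concrete, the headline figures can be combined in a back-of-envelope calculation. The sketch below assumes the line is stopped for the entire search and that the average hourly cost applies linearly to short intervals; both are simplifying assumptions, so the outputs are illustrative rather than measured results.

```python
# Figures quoted on this page.
DOWNTIME_COST_PER_HOUR = 260_000  # USD, average unplanned downtime cost
MANUAL_SEARCH_MINUTES = 18        # average time to locate the procedure
TARGET_ANSWER_SECONDS = 8         # design target, voice query to cited answer


def downtime_cost(seconds: float,
                  cost_per_hour: float = DOWNTIME_COST_PER_HOUR) -> float:
    """Cost of a stoppage of the given length, assuming linear scaling."""
    return cost_per_hour * seconds / 3600


manual_cost = downtime_cost(MANUAL_SEARCH_MINUTES * 60)
target_cost = downtime_cost(TARGET_ANSWER_SECONDS)

print(f"Manual search: ${manual_cost:,.0f}")  # Manual search: $78,000
print(f"Target answer: ${target_cost:,.0f}")  # Target answer: $578
```

Under these assumptions an 18-minute search corresponds to roughly $78,000 of downtime per incident, before counting any cost from applying the wrong procedure.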
Operational dimension

A retrieval failure under operational constraints, not a knowledge failure

The Haas CNC has thrown an E-04 fault the floor hasn't seen before. The line is stopped. The manual is a 380-page PDF on a laptop forty metres away. The technician asks a colleague who thinks he remembers the procedure. They apply it. The fault clears and returns two hours later — now a recurring incident in the NCR log.

The knowledge existed in the documentation. The failure was retrieval: noise, distance, time pressure, and no device suitable for navigating a 380-page PDF one-handed under a running machine.

23%
Of unplanned manufacturing stoppages attributable to human error on the floor, including wrong or incomplete procedure application · ABB / Plutomen 2024
Full problem analysis — page 02 →
03 — Architecture overview

Three-layer design — field interface, RAG core, document store

The architecture separates concerns across three layers. In the target production deployment, all three layers run inside the facility network: no data crosses the boundary at any point. The portfolio demo runs on HuggingFace Spaces with browser-native STT — those are documented exceptions that apply to the prototype only. The diagram below reflects the production architecture.

System architecture — production deployment model · Three-layer design
LAYER 01 · Field Interface
Mobile PWA: browser, no install
Web Speech API: voice to text, demo only (v1.0: local Whisper STT)
Plant WiFi: internal network only
Text fallback: keyboard input backup

LAYER 02 · RAG Intelligence Core
FastAPI gateway: HTTP · JSON · auth · rate limit
5-layer guardrail pipeline: G1 Query Normaliser (voice denoising) · G2 Scope Guard (off-topic reject) · G3 Confidence Threshold (low score refuse) · G4 Safety Flag (LOTO / hazard warn) · G5 Citation Enforcer (source required)
ChromaDB retrieval: top-k=3 · nomic-embed-text · cosine similarity · persistent local vector store · procedural chunking · similarity threshold: 0.70
Ollama · Llama 3.2 · 3B Instruct · local inference. System prompt: step-by-step · max 5 steps · citation mandatory · safety prefix on G4 trigger
LlamaIndex orchestrator: query engine · document ingestion pipeline · index management · response synthesis

LAYER 03 · Data & Documents
Equipment manuals: PDF · structured procedures · Haas, Fanuc, ABB, Siemens
SOPs & work instructions: PDF · TXT · Markdown · ISO 9001 · AS9100 controlled
Incident / NCR reports: PDF · structured text · non-conformance records
Ingestion pipeline: PyMuPDF parsing · procedural chunking · nomic-embed-text
ChromaDB: persistent embeddings · local disk · no cloud

Air-gap boundary: production deployment only; not enforced in demo.
Connector labels: HTTP · Cited response
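The retrieval parameters in the diagram (top-k=3, cosine similarity, similarity threshold 0.70) can be illustrated with a minimal pure-Python sketch. This is not ChromaDB's API: the `retrieve` function, the chunk ids, and the in-memory dictionary are hypothetical stand-ins, and real embeddings from nomic-embed-text would have far more dimensions than the toy vectors a reader might try it with.

```python
import math

TOP_K = 3                    # from the diagram: top-k=3
SIMILARITY_THRESHOLD = 0.70  # from the diagram: similarity threshold 0.70


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float],
             chunks: dict[str, list[float]],
             k: int = TOP_K,
             threshold: float = SIMILARITY_THRESHOLD) -> list[tuple[str, float]]:
    """Return up to k (chunk_id, score) pairs, best first, dropping any
    candidate whose similarity falls below the threshold."""
    scored = [(cid, cosine_similarity(query_vec, vec))
              for cid, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(cid, score) for cid, score in scored[:k] if score >= threshold]
```

An empty result is the signal a confidence guardrail would act on, refusing to answer rather than generating from weak matches. One practical note, stated as an assumption about deployment: vector stores configured for cosine space often report a distance (1 minus similarity) rather than a similarity, so the threshold would need converting before comparison.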
Guardrail pipeline — request flow · 5 sequential layers
Voice input → G1 Query Normaliser (pre-retrieval) → G2 Scope Guard (pre-retrieval) → ChromaDB retrieval → G3 Confidence Threshold (post-retrieval) → G4 Safety Flag (pre-generation) → Llama 3.2 generation → G5 Citation Enforcer (post-generation) → Cited response
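The request flow above can be sketched as a chain of small functions, one per guardrail. Everything below is an assumption made for illustration: the keyword lists, the refusal messages, the `Chunk` type, and the choice to append a missing citation rather than reject the answer are not taken from the project's code. Retrieval and generation are passed in as callables so the sketch runs without ChromaDB or Ollama.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Chunk:
    text: str
    source: str   # e.g. document name and page
    score: float  # similarity score from retrieval


# Hypothetical vocabularies and messages, for illustration only.
FILLERS = {"um", "uh", "er"}
SCOPE_TERMS = {"fault", "alarm", "procedure", "spindle", "reset", "maintenance"}
HAZARD_TERMS = ("lockout", "loto", "high voltage", "hydraulic")
OUT_OF_SCOPE = "This assistant only answers equipment and procedure questions."
REFUSAL = "No sufficiently relevant procedure found. Consult the controlled document."


def g1_normalise(query: str) -> str:
    """G1: lowercase, strip punctuation, drop spoken filler words."""
    words = re.sub(r"[^a-z0-9\- ]", " ", query.lower()).split()
    return " ".join(w for w in words if w not in FILLERS)


def g2_in_scope(query: str) -> bool:
    """G2: accept only queries mentioning at least one in-scope term."""
    return any(word in SCOPE_TERMS for word in query.split())


def g3_confident(chunks: list[Chunk], threshold: float = 0.70) -> list[Chunk]:
    """G3: keep only chunks at or above the similarity threshold."""
    return [c for c in chunks if c.score >= threshold]


def g4_safety_flag(chunks: list[Chunk]) -> bool:
    """G4: flag responses whose source text mentions a hazard term."""
    text = " ".join(c.text.lower() for c in chunks)
    return any(term in text for term in HAZARD_TERMS)


def g5_enforce_citation(answer: str, chunks: list[Chunk]) -> str:
    """G5: make sure the answer names a retrieved source."""
    if any(c.source in answer for c in chunks):
        return answer
    return f"{answer} [Source: {chunks[0].source}]"


def answer_query(raw_query: str,
                 retrieve: Callable[[str], list[Chunk]],
                 generate: Callable[[str, list[Chunk], bool], str]) -> str:
    """Run the five guardrails around retrieval and generation, in order."""
    query = g1_normalise(raw_query)
    if not g2_in_scope(query):
        return OUT_OF_SCOPE
    chunks = g3_confident(retrieve(query))
    if not chunks:
        return REFUSAL
    draft = generate(query, chunks, g4_safety_flag(chunks))
    return g5_enforce_citation(draft, chunks)
```

Keeping each layer a pure function makes it possible to unit-test the guardrails separately from the model, which matters when the refusal paths are the safety-relevant behaviour.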
Architecture & design decisions → Run the simulator →
Architecture documentation — six sections