01 — Version Overview

Three stages, one coherent path

Each version solves a specific problem and sets up the next. v0.1 proves the core loop works. v1.0 makes it production-ready. The future vision extends the platform to adjacent use cases without re-solving trust or safety.

v0.1
Current · MVP
Core loop validated

Voice query → 5-layer guardrail pipeline → cited procedure response. On-prem deployment model. Single document corpus. Single concurrent user. Deployed on HuggingFace Spaces for portfolio demonstration purposes — the production deployment model is documented in page-03.

Portfolio demo · Built
v1.0
Production path
Multi-user, multi-corpus, auditable

Department-level document namespacing. Role-based query permissions. Immutable audit log of every query and response. LlamaGuard integration on the safety path. Whisper local for fully air-gapped voice. Word and PowerPoint ingestion. Production hardware with GPU acceleration.

Designed · Not built
v2+
Platform vision
Governed intelligence utility

VaultRAG as a platform abstraction layer: multiple secure assistants (maintenance, quality, safety, engineering) sharing a common on-prem trust infrastructure without re-solving the sovereignty and guardrail problems for each new use case. Proactive alerts when documents are updated. Multi-site federation.

Vision · Architecture sketched
02 — MVP Scope

v0.1 feature inventory

The scoping principle for v0.1 was: every component must either make the core loop function or make a guardrail fire. Anything that did neither was deferred. This table is the complete inventory.

VaultRAG v0.1 — complete feature scope
Feature | v0.1 status | Decision rationale
Voice input via Web Speech API | In v0.1 | Core user interface requirement. A technician on the floor cannot type reliably. Without voice, the system fails its primary user scenario.
Text fallback input | In v0.1 | Voice recognition fails in extreme noise environments (> 100 dB). Keyboard fallback ensures the system never leaves a user without recourse.
PDF and TXT document ingestion | In v0.1 | The two dominant formats for manufacturing documentation. Everything else is roadmap.
Procedural section chunking | In v0.1 | Safety-critical requirement. Token-window chunking splits procedures and produces incomplete, potentially dangerous, responses. See ADR-008.
5-layer guardrail pipeline (G1–G5) | In v0.1 | Safety requirement in a manufacturing context. All five layers implemented as custom prompt-based logic. Not optional for the core architecture validation.
ChromaDB vector store | In v0.1 | Core retrieval infrastructure. The RAG pipeline cannot function without a vector store.
Llama 3.2 3B via Ollama | In v0.1 | Local LLM is the architectural invariant. Ollama is the simplest local serving solution. 3B chosen for demo environment RAM constraints. See ADR-001.
Mobile-responsive chat UI | In v0.1 | The primary user is on a phone on the factory floor. A desktop-first UI fails the primary persona entirely.
Single document corpus | In v0.1 | Multi-corpus management adds indexing complexity, namespace isolation, and corpus selection logic that are v1.0 concerns. MVP validates the single-corpus pipeline first.
User authentication / login | Deferred v1.0 | Authentication is an operational requirement but not a pipeline validation requirement. It adds session management, token handling, and role logic that are not needed to validate the core RAG loop. Documented as a production gap below.
Department-level document namespacing | Deferred v1.0 | Role-based corpus access requires user identity, which requires authentication. The dependency chain means this follows auth, not precedes it.
Immutable audit log | Deferred v1.0 | Essential for ISO 9001 compliance and internal governance. Not required to validate that the pipeline retrieves and cites correctly. Documented as a production gap below.
LlamaGuard integration | Deferred v1.0 | RAM-infeasible alongside Llama 3.2 3B on free demo tier. Custom G4 Safety Flag provides the MVP safety layer. LlamaGuard activates on G4 trigger in v1.0: conditional, not always-on. See ADR-003.
Word / PowerPoint ingestion | Deferred v1.0 | Many SOPs live in .docx format. python-docx integration is straightforward but adds a dependency and format-specific chunking logic. Deferred to keep MVP ingestion pipeline clean.
Whisper local voice (air-gapped) | Deferred v1.0 | Web Speech API uses device STT (Google/Apple backend). For fully air-gapped facilities, local Whisper is required. Added RAM overhead makes it a production concern, not MVP. Documented as a production gap below. See ADR-005.
Text-to-speech response readback | Deferred v1.0 | Logical extension of the voice-first design. Web Speech Synthesis API makes this a one-line addition. Deferred to keep MVP UI surface minimal.
Query history and session memory | Deferred v1.0 | Persistent conversation history requires session storage and raises data retention questions. Each v0.1 query is stateless by design: simpler, safer, easier to audit.
Multi-site federation | v2+ vision | Multiple facility instances sharing a common document standard but isolated corpora. Requires a federation layer and inter-site governance model. Platform-level concern.
Proactive document update alerts | v2+ vision | Notify technicians when a procedure they previously queried has been updated. Requires query history (v1.0), document versioning, and a notification pipeline.
Custom disclaimer injection per department | v2+ vision | Prepend department-specific legal or compliance text to responses. Requires department namespacing and user identity, both v1.0 foundations.
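
The rationale for procedural section chunking can be illustrated with a small sketch. This is not the VaultRAG implementation: it assumes a section starts at a numbered heading such as "4.1 Lockout procedure" (a hypothetical convention, since ADR-008 is not reproduced here), and the names HEADING, chunk_by_section, chunk_by_window, and SAMPLE are invented for the example.

```python
import re

# Assumption: a section starts at a line such as "4.1 Lockout procedure".
# The real heading convention (ADR-008) is not shown in this document.
HEADING = re.compile(r"^\d+(\.\d+)*\s+\S")


def chunk_by_section(text):
    """Split at section headings so each chunk holds one whole procedure."""
    chunks = []
    title, body = "", []
    for line in text.splitlines():
        if HEADING.match(line):
            # A new heading closes the chunk that was being collected.
            if title or body:
                chunks.append({"title": title, "body": "\n".join(body)})
            title, body = line.strip(), []
        elif line.strip():
            body.append(line.strip())
    if title or body:
        chunks.append({"title": title, "body": "\n".join(body)})
    return chunks


def chunk_by_window(text, size):
    """Naive fixed-size word window, shown only for contrast."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


SAMPLE = """4.1 Lockout procedure
Step 1: Notify affected personnel.
Step 2: Shut down the machine.
Step 3: Isolate the energy source and apply the lock.
4.2 Restart procedure
Step 1: Remove tools from the work area.
Step 2: Remove the lock and restore energy.
"""
```

On SAMPLE, chunk_by_section yields two chunks, each containing a complete procedure, while chunk_by_window(SAMPLE, 12) ends its first window partway through step 2, so a response grounded on that window alone would omit the isolation step.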
03 — Known Production Gaps — v0.1

Known production gaps — v0.1

These are missing requirements in the current build — not deliberate architectural deferrals. They represent real limitations that would need to be addressed before v0.1 could be operated in a production manufacturing environment. Each has a documented v1.0 mitigation.

Data sovereignty scope
In the production deployment model, no data crosses the facility boundary. The demo prototype uses HuggingFace Spaces for inference and Web Speech API for voice — both are documented exceptions. v1.0 mitigations (local Whisper, on-prem server) restore the sovereignty invariant. The architecture is designed so the mitigations are additive — the application code does not change.
No user authentication — open to any device on the plant network

v0.1 has no login, session management, or identity layer. Any device connected to the plant WiFi can query the system and retrieve procedural documentation, including safety-critical LOTO procedures. There is no mechanism to restrict access by role, department, or individual user.

Security gap
v1.0 mitigation: JWT-based authentication. Role-based access control. Department-level namespace scoping tied to authenticated user identity.
v1.0 · Priority 1
No audit log — queries and responses are not persisted

v0.1 has no record of what was queried, what was retrieved, what guardrails fired, or what response was delivered. In a regulated manufacturing environment operating under ISO 9001, the absence of an audit trail means the system cannot be included in a management review and AI-assisted decisions cannot be traced.

Compliance gap
v1.0 mitigation: Append-only audit log with timestamp, user ID, normalised query, chunk references, guardrail outcomes, and response hash. Written before response delivery. Cannot be modified or deleted.
v1.0 · Priority 3
Single document namespace — no department-level isolation

v0.1 indexes all documents into a single ChromaDB collection. Any authenticated user (once auth is added) would have access to all indexed documents regardless of department, classification, or need-to-know. A maintenance technician would be able to query engineering drawings or quality management records they are not authorised to access.

Data integrity gap
v1.0 mitigation: ChromaDB collections mapped to organisational namespaces. Query routing scoped to the authenticated user's permitted collections. Depends on auth (Priority 1).
v1.0 · Priority 2
Web Speech API routes audio externally — violates air-gap in strict deployments

The Web Speech API uses the device's native speech recognition engine, which in practice routes audio to Google or Apple backends for transcription. In facilities operating under strict air-gap requirements — ITAR-controlled environments, defence supply chain, certain ISO 27001 scopes — this is a data sovereignty violation even if the inference layer is on-prem.

Sovereignty gap
v1.0 mitigation: Local Whisper small (244M parameters) in a dedicated Docker container. Audio routed to the on-prem server, transcribed locally, no external API call. GPU-accelerated: <1s for a 5-second query.
v1.0 · Priority 5
04 — v1.0 Production Path

v1.0 production path — eight additions

These are not features for their own sake. Each v1.0 addition resolves a specific gap between the portfolio prototype and a system a manufacturer would trust in production. The ordering is intentional — auth enables namespacing; namespacing enables audit; audit enables compliance.

v1.0 · Priority 1
User authentication and role-based access

JWT-based authentication. Each user belongs to one or more departments. Queries are scoped to the user's permitted corpus namespaces. A maintenance technician cannot query the Legal department's documents. A user who belongs to two departments, such as an engineering manager, can query both.

Why first: Without auth, the system cannot enforce data governance at the user level. Every subsequent v1.0 feature depends on knowing who is asking.
v1.0 · Priority 2
Department-level document namespacing

ChromaDB collections mapped to organisational namespaces. Maintenance docs, quality SOPs, engineering drawings, and safety procedures indexed in isolated collections. Query routing respects the authenticated user's permitted namespaces.

Why now: A single flat corpus is a demo simplification. Production facilities have strict need-to-know controls on document access — especially in ISO-audited environments.
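
One way to picture the query routing is sketched below. The department names, collection names, and the `search` callable are all hypothetical; in a real deployment `search` would presumably wrap a per-collection ChromaDB query, but that wiring is an assumption and is left out here.

```python
# Hypothetical mapping; the roadmap does not name the actual collections.
NAMESPACE_COLLECTIONS = {
    "maintenance": "docs_maintenance",
    "quality": "docs_quality",
    "engineering": "docs_engineering",
    "safety": "docs_safety",
}


def permitted_collections(departments):
    """Resolve a user's departments to collection names, ignoring unknown ones."""
    return sorted(
        {NAMESPACE_COLLECTIONS[d] for d in departments if d in NAMESPACE_COLLECTIONS}
    )


def route_query(query, departments, search):
    """Run search(collection_name, query) against permitted collections only.

    `search` stands in for a per-collection vector query and is expected to
    return (distance, chunk_id) pairs. Collections outside the user's
    namespaces are never touched, rather than filtered after retrieval.
    """
    names = permitted_collections(departments)
    if not names:
        raise PermissionError("user has no permitted namespaces")
    hits = []
    for name in names:
        hits.extend((dist, name, chunk_id) for dist, chunk_id in search(name, query))
    return sorted(hits)  # smallest distance first


# Toy stand-in for the vector store, used only to exercise the routing.
SEARCHED = []


def fake_search(name, query):
    SEARCHED.append(name)
    return [(0.2, name + "#1")]
```

Scoping before retrieval, instead of filtering results afterwards, means a chunk from an unpermitted namespace never enters the prompt context in the first place.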
v1.0 · Priority 3
Immutable audit log — every query, every response

Append-only log with timestamp, user ID, normalised query, retrieved chunk references, guardrail outcomes, and response hash. Written before the response is delivered. Cannot be modified or deleted. Reviewed in ISO 9001 management reviews.

Why now: Regulated manufacturers require evidence that every AI-assisted decision is traceable. Without an audit trail, VaultRAG cannot be used in an ISO-audited environment. This is a compliance gate, not a nice-to-have.
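
The record shape and the write-before-delivery ordering can be sketched as follows. The hash chain linking each record to its predecessor is an assumption added for tamper evidence; the roadmap only states that the log is append-only and cannot be modified. Class and function names are hypothetical, and real immutability would also need storage-level controls that code alone cannot provide.

```python
import hashlib
import io
import json


class AuditLog:
    """Append-only log; each record stores the hash of the previous line."""

    def __init__(self, stream):
        self._stream = stream  # any text stream opened for appending
        self._prev = "0" * 64  # sentinel hash for the first record

    def append(self, *, timestamp, user_id, query, chunk_refs, guardrails, response):
        record = {
            "timestamp": timestamp,
            "user_id": user_id,
            "query": query,
            "chunk_refs": chunk_refs,
            "guardrails": guardrails,
            # Only the hash of the response is stored, not its text.
            "response_hash": hashlib.sha256(response.encode("utf-8")).hexdigest(),
            "prev_hash": self._prev,
        }
        line = json.dumps(record, sort_keys=True)
        self._stream.write(line + "\n")
        self._stream.flush()
        self._prev = hashlib.sha256(line.encode("utf-8")).hexdigest()
        return self._prev  # head hash, to be stored somewhere the app cannot edit


def verify_chain(lines, head=None):
    """Check that every record points at the hash of the line before it."""
    prev = "0" * 64
    for line in lines:
        if json.loads(line)["prev_hash"] != prev:
            return False
        prev = hashlib.sha256(line.encode("utf-8")).hexdigest()
    return head is None or prev == head


def deliver(log, *, response, **fields):
    """Write the audit record first; return the response only if that succeeds."""
    log.append(response=response, **fields)
    return response
```

Because deliver appends before returning, a failed write raises and the response is never sent, which matches the "written before the response is delivered" requirement.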
v1.0 · Priority 4
LlamaGuard on G4 trigger — conditional, not always-on

LlamaGuard 2 activates only when the custom G4 Safety Flag fires — adding a safety model validation layer for the highest-stakes queries without the always-on RAM overhead. G4 acts as a triage gate; LlamaGuard acts as a secondary validator on G4-triggered queries only.

Why now: Custom keyword-based G4 has false-negative risk for novel hazard phrasing not in the keyword list. LlamaGuard adds adversarial robustness. The conditional activation pattern solves the RAM constraint identified in ADR-003.
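
The conditional activation pattern might look like the sketch below. The keyword list, outcome labels, and stub validator are hypothetical placeholders: the roadmap does not list the G4 keywords, and the stub merely stands in for a safety model that is loaded on first use.

```python
from typing import Callable, Optional

# Hypothetical keyword list; the actual G4 keywords are not given in the roadmap.
G4_KEYWORDS = ("lockout", "loto", "high voltage", "confined space", "bypass")


def g4_safety_flag(query: str) -> bool:
    """Cheap triage gate: fires on any known hazard keyword."""
    text = query.lower()
    return any(keyword in text for keyword in G4_KEYWORDS)


class ConditionalSafetyCheck:
    """Runs the secondary validator only for queries that trip G4."""

    def __init__(self, load_validator: Callable[[], Callable[[str], bool]]):
        self._load_validator = load_validator
        self._validator: Optional[Callable[[str], bool]] = None
        self.load_count = 0  # how many times the model was loaded

    def check(self, query: str) -> str:
        if not g4_safety_flag(query):
            return "g4_clear"  # the secondary model is never loaded or run
        if self._validator is None:
            # Lazy load: the RAM cost is paid only once G4 has fired.
            self._validator = self._load_validator()
            self.load_count += 1
        return "validated" if self._validator(query) else "blocked"


def load_stub_validator():
    """Stand-in for loading a safety model; True means judged safe."""
    return lambda query: "bypass" not in query.lower()
```

In this arrangement a query that G4 does not flag never reaches the validator, so the keyword list still bounds what the second stage can see.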
v1.0 · Priority 5
Whisper local STT for air-gapped facilities

Replace Web Speech API with locally deployed Whisper small (244M parameters) for facilities that prohibit any device-level cloud connection. Whisper runs in a separate Docker container, receives audio via internal HTTP, returns transcription with no external calls. GPU-accelerated: <1s for a 5-second query.

Why now: Web Speech API routes audio to Google or Apple STT backends — this violates strict air-gapped requirements. Facilities operating under classified or defence supply chain constraints need local STT. v0.1 documents this as a known sovereignty gap.
v1.0 · Priority 6
Word and PowerPoint document ingestion

python-docx for .docx parsing with heading-based chunking. python-pptx for .pptx parsing with slide-title chunking. Many manufacturing SOPs are authored in Word; many training materials in PowerPoint. Extending ingestion beyond PDF/TXT captures the majority of the real-world corpus.

Why now: The MVP corpus is limited to PDF and TXT. A real facility's documentation estate is typically 60–70% Word documents. Without .docx ingestion, significant portions of the knowledge base are inaccessible.
v1.0 · Priority 7
Text-to-speech response readback

Web Speech Synthesis API reads the response aloud after delivery. The technician holds their phone near their ear, speaks the query, and hears the procedure read step by step — fully hands-free. A single JavaScript call on top of the existing response flow.

Why now: The floor use case requires hands-free operation. Reading a response off a screen while one hand is on equipment is impractical. TTS completes the voice-first loop. It is deliberately not in v0.1 to keep the UI surface minimal and testable.
v1.0 · Priority 8
Document version management and re-indexing triggers

When a document is updated (new SOP revision, revised manual), the system detects the version change, re-indexes the affected chunks, and invalidates any cached responses referencing the old version. A document management webhook or scheduled check triggers re-ingestion automatically.

Why now: An ISO-controlled SOP that changes must be reflected immediately in the knowledge base. A system that answers from a superseded procedure is potentially more dangerous than no system at all. Version management is a production safety requirement.
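
A minimal sketch of the detect, re-index, invalidate sequence. It assumes a content hash is the change signal; a document management system might instead expose revision metadata, which the roadmap leaves open. The class name and the two callbacks are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Content hash used as the version signal (an assumption of this sketch)."""
    return hashlib.sha256(content).hexdigest()


class VersionIndex:
    """Tracks the last indexed fingerprint per document."""

    def __init__(self, reindex, invalidate_cache):
        self._seen = {}
        self._reindex = reindex              # callable(doc_id, content)
        self._invalidate = invalidate_cache  # callable(doc_id)

    def check(self, doc_id: str, content: bytes) -> bool:
        """Return True if the document was new or changed and got re-indexed."""
        new = fingerprint(content)
        old = self._seen.get(doc_id)
        if old == new:
            return False  # unchanged: nothing to do
        if old is not None:
            # Cached responses may cite the superseded revision.
            self._invalidate(doc_id)
        self._reindex(doc_id, content)
        self._seen[doc_id] = new
        return True
```

Either trigger named above, a webhook or a scheduled check, could call check with the current file contents; the index itself does not care which one fired.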
05 — Production Architecture

How v0.1 becomes v1.0

The v0.1 architecture is designed so that every v1.0 addition slots into the existing structure without requiring a rebuild. The core pipeline — voice → guardrails → retrieval → generation → citation — does not change. Each v1.0 component wraps or extends an existing layer.

v0.1 → v1.0 upgrade path · additive, not replacement
v0.1 · MVP
Web Speech API · voice → text (browser)
G1–G5 Guardrails · custom prompt-based · no extra model
ChromaDB · single corpus · flat namespace
Llama 3.2 3B · Ollama · local inference
FastAPI + HTML UI · mobile-responsive · no auth
PDF / TXT ingestion · PyMuPDF · 2 formats

v1.0 · Production (additive upgrades)
Whisper Local STT · replaces Web Speech · air-gapped
TTS Readback · Web Speech Synthesis · hands-free
G1–G5 + LlamaGuard on G4 trigger · LlamaGuard activates conditionally · G4 triage gate · adversarial robustness on safety-critical queries only
ChromaDB · namespaced · dept-level isolation
Audit Log · append-only · ISO 9001
Llama 3.1 8B · Ollama · upgrade model string only
GPU-accelerated server · 32GB RAM · RTX 3080+
JWT Auth + Role-Based Access + Session Management · user identity → dept namespace scoping → audit trail attribution
PDF · TXT · DOCX · PPTX · Version Management · python-docx + python-pptx + version change detection + auto re-index trigger
06 — Deferred Architectural Decisions

Deferred architectural decisions

These are architectural questions that were considered for v0.1 and deliberately excluded — not because they are unimportant, but because they do not affect the core architectural validation. They are not production gaps; they are optimisations and enhancements that sit above the minimum viable architecture.

Hybrid retrieval — dense + sparse (BM25)

Pure semantic search (dense retrieval) can miss exact keyword matches for specific part numbers, error codes, and model identifiers — exactly the kind of query a maintenance technician makes. Hybrid retrieval combining ChromaDB cosine similarity with a BM25 sparse index would improve precision on these high-specificity queries.

Deferred because: Pure semantic retrieval is sufficient to validate the pipeline architecture. Hybrid retrieval is an optimisation — valuable, but it does not change the guardrail design, the deployment model, or the sovereignty invariant. Added complexity without added architectural signal at MVP stage.
v1.0 candidate
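
To make the hybrid idea concrete, the sketch below scores a toy corpus with BM25 and merges that ranking with a dense ranking. The merge uses reciprocal rank fusion, which is a choice made for this example; the roadmap does not say how the two signals would be combined. The corpus, the query, and the dense ranking are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep hyphenated codes such as 'e-417' as one token."""
    return re.findall(r"[a-z0-9\-]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank doc ids by Okapi BM25 score. `docs` must be a non-empty dict."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    avgdl = sum(len(t) for t in tokens.values()) / len(tokens)
    df = Counter(term for t in tokens.values() for term in set(t))
    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            idf = math.log((len(docs) - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over the input rankings."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda d: (-fused[d], d))


DOCS = {
    "c1": "Replace bearing on conveyor motor",
    "c2": "Error code E-417 indicates spindle overtemperature",
    "c3": "Spindle maintenance schedule and lubrication",
}
# Invented dense ranking in which the semantic model prefers c3.
DENSE_RANKING = ["c3", "c2", "c1"]
```

For the query "what does E-417 mean", BM25 puts c2 first because only that chunk contains the exact code, and fusing with DENSE_RANKING keeps c2 on top even though the dense ranking placed it second.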
Query rewriting — HyDE (Hypothetical Document Embeddings)

HyDE generates a hypothetical answer to the query before retrieval and uses that hypothetical answer's embedding to search the corpus — improving retrieval quality for queries phrased very differently from the document vocabulary. Relevant for voice queries where the transcription may not match the terminology in the manual.

Deferred because: Adds an extra LLM call per query (generation before retrieval), increasing latency and complexity. G1 Query Normaliser addresses the vocabulary gap for v0.1. HyDE is the next step if retrieval quality proves insufficient at production scale.
v1.0 candidate
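
The HyDE flow reduces to three calls, sketched here with toy stand-ins. The bag-of-words `embed`, the two-document corpus, and the canned `fake_generate` output are all illustrative assumptions; a real pipeline would use the local LLM and a dense embedding model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding', standing in for a dense model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


CORPUS = {
    "sop-12": "Relieve hydraulic accumulator pressure before servicing the press",
    "sop-30": "Lubricate the conveyor chain weekly",
}


def search(vector, top_k):
    """Rank corpus entries by cosine similarity to the given vector."""
    ranked = sorted(CORPUS, key=lambda d: (-cosine(vector, embed(CORPUS[d])), d))
    return ranked[:top_k]


def fake_generate(prompt):
    """Canned stand-in for the extra LLM call that HyDE adds."""
    return "Relieve the hydraulic accumulator pressure, then lock out the press before servicing."


def hyde_retrieve(query, generate, embed_fn, search_fn, top_k=1):
    """Embed a hypothetical answer instead of the raw query, then search."""
    hypothetical = generate(f"Write a short manual passage that answers: {query}")
    return search_fn(embed_fn(hypothetical), top_k)


QUERY = "how do I bleed it off so I can work on that stamping machine"
```

QUERY shares no vocabulary with sop-12, so its toy embedding has zero similarity to the right procedure, while the hypothetical answer uses the manual's own terms and retrieves it. The cost is the extra generation step before every retrieval.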
Re-ranking with a cross-encoder model

After top-k retrieval, a cross-encoder re-ranker evaluates each retrieved chunk in the context of the specific query and re-orders them by relevance, typically improving precision at rank 1, which is the chunk most likely to ground the response. A lightweight cross-encoder, or a late-interaction model such as ColBERT, would be candidates.

Deferred because: Adds another model to the inference stack (additional RAM, additional latency). The procedural chunking strategy already improves retrieval precision by keeping complete procedures intact. Re-ranking is an optimisation for larger corpora where multiple procedures may appear relevant.
v2+ candidate
Multi-modal ingestion — diagrams, schematics, images

Manufacturing manuals contain critical information in diagrams, wiring schematics, and exploded-view drawings. A purely text-based RAG system cannot index this content. A multi-modal pipeline using a vision model (LLaVA, Phi-3 Vision) to generate textual descriptions of figures before indexing would capture this signal.

Deferred because: Requires a vision model alongside the text LLM — significant RAM increase. Figure description quality with local models is variable. The architecture is designed to accommodate this: the ingestion pipeline is modular, and a vision pre-processing step can be inserted before ChromaDB indexing without changing the retrieval or generation layers.
v2+ candidate
Giskard automated vulnerability scanning

Giskard is a testing framework for LLM applications — it automatically generates adversarial test cases (prompt injections, hallucination probes, bias tests) and evaluates the pipeline against them. Valuable for production QA but not a runtime guardrail.

Deferred because: Giskard belongs in the CI/CD pipeline, not in the inference path. It is a pre-deployment testing tool. It will be added to the GitHub Actions workflow before v1.0 release.
v1.0 CI/CD
VaultRAG is a portfolio prototype. The architecture is designed to validate the on-prem RAG pattern for constrained manufacturing environments. The v1.0 path addresses the known production gaps documented above. The architectural deferrals are optimisations — the core pipeline is sound without them.