MLOps

ML Engineering
& MLOps.

Four ML components power this system. Each has a model selection rationale, a monitoring strategy, a drift detection mechanism, a retraining trigger, and a documented response to the hardest pushbacks a GCP architect or ML engineer would raise in a design review.

Speech-to-Text
Whisper large-v3
OpenAI OSS · Nov 2023
Cloud Run · GPU T4 spot · asia-south1
OSS containerised
Translation / NLU
NLLB-200 3.3B
Meta AI OSS · Jul 2022
Cloud Run · shared GPU · bundled with STT
OSS containerised
LLM Inference
Gemini 1.5 Flash
Google · Vertex AI GA
Vertex AI · managed endpoint · asia-south1
GCP managed
Embeddings / RAG
text-embedding-004
Google · Vertex AI GA
Vertex AI · pgvector · Supabase
GCP managed
01 — Model selection decisions & rebuttals
Every decision.
Every pushback answered.
Each model choice was made against a specific alternative. The decision, the alternative, and the rebuttal to the hardest counter-argument are documented here — the format a design review board or a GCP customer's chief architect would expect.
Decision
Deploy Whisper large-v3 as a containerised Cloud Run service on T4 spot GPU instances rather than calling Google Cloud STT v2 via API.

Whisper achieves WER < 8% across 99 languages. For Hindi, Telugu, and Marathi at the regional dialect level spoken by Rathi Textiles workers, Whisper large-v3 outperforms Google STT v2 on code-switching (Hinglish, Tenglish) by approximately 12–18% relative WER improvement.

Cost: ~$0.0003/min on spot GPU vs Google STT v2 at $0.006/15s ($0.024/min) — roughly a 99% per-minute cost reduction at this interaction volume.
Alternative considered
Google Cloud STT v2 with Chirp model.

Chirp offers multilingual support including Indic languages and is fully managed with a Google SLA. No infrastructure to maintain. Simpler operational model.

Rejected because: (1) Roughly 80× the per-minute cost ($0.024/min vs ~$0.0003/min) with no quality advantage at the specific language/dialect profile of this use case. (2) Google STT still lags on code-switched speech — a Rathi Textiles worker saying "Kal mujhe chutti chahiye, casual leave" mixes Hindi and English in a single utterance. Whisper handles this natively. (3) Vendor dependency for a cost-critical component.
Rebuttals
"What's your SLA guarantee without a managed service?"
Cloud Run provides a 99.95% uptime SLA on the container infrastructure. The model itself is stateless — a failed container is replaced within seconds. We maintain a minimum-instance warm configuration (1 instance always on) eliminating cold-start latency. Fallback to Google STT v2 is wired in on any Whisper service error — the caller never experiences a failure, only a slightly higher per-call cost on that interaction.
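The fallback wiring can be sketched as a thin wrapper. This is a minimal illustration, not the production client: the engine callables are injected stand-ins so the failover logic itself is explicit and testable.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Transcript:
    text: str
    confidence: float
    engine: str  # "whisper" or "google-stt-v2"

def transcribe_with_fallback(
    audio: bytes,
    primary: Callable[[bytes], Transcript],
    fallback: Callable[[bytes], Transcript],
) -> Transcript:
    """Try the Whisper Cloud Run service first; on any error,
    fall back to Google STT v2 so the caller never sees a failure."""
    try:
        return primary(audio)
    except Exception:
        # Failed container, spot preemption, or timeout: absorb the
        # error and pay the higher per-call managed-API cost instead.
        return fallback(audio)
```

Because the dependencies are injected, the same wrapper is exercised in unit tests with stub engines, and in production with the real Whisper and Google STT clients.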
"Model updates require a redeployment — how do you manage that?"
Whisper versions are pinned in the container image tag. Model updates are a Cloud Run revision deploy — zero-downtime traffic splitting, automated rollback on error rate spike. The same CI/CD pipeline handles it. We evaluate new Whisper releases against a held-out Indic language test set before promoting to production.
Decision
Deploy NLLB-200 3.3B (Meta AI, No Language Left Behind) bundled in the same Cloud Run container as Whisper. NLLB-200 was trained specifically on 200 languages including low-resource Indic languages, with a focus on translation quality for languages underrepresented in standard training corpora.

Task: normalise code-switched, dialect-heavy transcripts into structured English intent. Not full document translation — a narrow, well-defined task where NLLB-200 excels.

Cost: zero marginal cost — bundled in existing Cloud Run compute.
Alternative considered
Google Cloud Translation API v3 (Neural Machine Translation).

Managed, scalable, integrates natively with GCP. Supports 100+ languages including Hindi, Telugu, Marathi.

Rejected because: (1) $20/1M characters — at 500 interactions/month of ~200 characters each (~100K characters/month), this sits within Google's monthly free tier today, but it remains a marginal cost with no quality advantage once volume grows. (2) Google NMT performs well on formal text; it performs less well on the colloquial, code-switched utterances this system actually receives. NLLB-200 was trained with this exact use case in scope.
Rebuttals
"NLLB-200 3.3B is a large model — does it fit in the same container as Whisper?"
At inference, NLLB-200 3.3B requires ~7GB GPU memory. Whisper large-v3 requires ~10GB. A single A100 or L4 Cloud Run GPU instance (24–40GB VRAM) runs both comfortably. On T4 (16GB), we run Whisper large-v3 and NLLB-200 1.3B (the distilled variant) — equivalent quality for our specific task of intent extraction from short utterances.
"Isn't this over-engineering for a translation task that could use a prompt?"
Using Gemini Flash for translation would add ~200 tokens per interaction to the LLM call — approximately $0.015/month at our scale. Not cost-significant, but architecturally wrong. Translation is a deterministic, well-bounded task with established evaluation metrics (BLEU, chrF). Offloading it to a general-purpose LLM introduces unnecessary non-determinism into a pipeline step that should be stable and measurable.
Decision
Retain Gemini 1.5 Flash on Vertex AI for intent routing and Leave Agent reasoning. At $0.075/1M input tokens and $0.30/1M output tokens, it is the cheapest capable hosted LLM from any major provider as of Q1 2026.

This is the one component where we explicitly chose managed over OSS. The LLM is the critical path for latency, reasoning quality, and policy RAG synthesis. A wrong decision here — an incorrectly denied leave, a misread policy clause — has direct business and legal consequences.
Alternative considered
Llama 3.1 8B Instruct via Ollama on Cloud Run CPU, or Mistral 7B.

Both are capable instruction-following models. Both are free at inference. Both would eliminate the Gemini Flash cost entirely (~$0.04/month at our scale).

Rejected because: P99 latency on CPU Cloud Run exceeds 8 seconds for a complex policy RAG synthesis prompt (500+ tokens context). A worker on an IVR call waiting 8+ seconds for a leave decision is an unacceptable UX. GPU Cloud Run for Llama 3.1 8B costs more per month than Gemini Flash at our scale — the economics invert.
Rebuttals
"As scale grows, Gemini Flash costs grow linearly. At 500 employees the LLM cost dominates."
Agreed — and this is the planned migration point. At ~200 employees the economics shift. The architecture pins Gemini Flash via an abstracted LLM interface (BaseLLM in LangGraph). Swapping to a self-hosted Llama 3.1 70B on a dedicated Cloud Run GPU instance at that scale is a config change, not a rewrite. The migration trigger is defined: when monthly LLM cost exceeds the cost of a dedicated L4 GPU instance (~$180/month).
"Vertex AI data residency — does worker conversation data leave India?"
All Vertex AI inference is configured to asia-south1 (Mumbai). GCP's data residency controls prevent data routing outside the specified region for inference. For DPDP Act 2023 compliance, this is the required configuration. The Vertex AI service agreement includes a data processing addendum that satisfies Indian data localisation requirements for HR data.
Decision
Use retrieval-augmented generation (RAG) with pgvector to ground every HR decision in the policy document. The policy PDF is chunked (256 tokens, 32 token overlap), embedded via text-embedding-004, and stored in Supabase pgvector. Every Leave Agent query retrieves the top-3 relevant chunks and constructs the prompt context dynamically.

When Priya updates the policy PDF, re-indexing runs automatically. Policy changes take effect immediately — without a model update, without a retraining job, without a deployment.
Alternative considered
Fine-tuning Gemini Flash on Rathi Textiles HR policy Q&A pairs.

Fine-tuning would bake policy knowledge into the model weights. Potentially faster inference (no retrieval step). Could improve reasoning consistency on edge cases.

Rejected because: (1) Policy changes require a retraining job — unacceptable for a small business owner who needs to update leave quotas or add a new leave type without involving a developer. (2) Fine-tuned knowledge cannot be audited — you cannot ask "which policy clause did you use for this decision?" (3) Vertex AI fine-tuning cost for a small dataset does not justify the operational overhead at SMB scale.
Rebuttals
"RAG can hallucinate — the model might synthesise a policy that doesn't exist in the document."
Hallucination risk is mitigated at three layers. First: the system prompt instructs the agent to answer only from retrieved context and to return low confidence when context is insufficient. Second: every response includes the retrieved clause ID and page number — if the clause doesn't exist in the document, the confidence score drops below 0.80 and HITL fires. Third: the audit log records the exact retrieved chunks used for every decision — hallucinated reasoning would produce a non-existent clause ID, immediately detectable in review.
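The second layer of that defence reduces to a grounding check: a cited clause that does not appear in the retrieved context forces confidence below the HITL threshold. A sketch, with illustrative field names:

```python
def grounding_check(cited_clause_ids: list[str],
                    retrieved_chunks: list[dict],
                    confidence: float) -> dict:
    """If the agent cites a clause ID that is not present in the
    retrieved context, force confidence below the 0.80 HITL
    threshold so the decision escalates to a human."""
    known = {c["clause_id"] for c in retrieved_chunks}
    ungrounded = [cid for cid in cited_clause_ids if cid not in known]
    if ungrounded:
        confidence = min(confidence, 0.79)  # guarantees HITL fires
    return {
        "confidence": confidence,
        "ungrounded_citations": ungrounded,
        "hitl": confidence < 0.80,
    }
```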
"pgvector on Supabase free tier — what happens when the policy document grows?"
A comprehensive 30-page SMB HR policy produces approximately 800–1,200 chunks at 256 tokens, consuming ~4MB of vector storage. Supabase free tier provides 500MB. At Rathi Textiles scale, we never approach the limit. At 200+ employees with a complex policy, migration to Supabase Pro ($25/month) or Vertex AI Vector Search adds one environment variable change — the pgvector query interface is identical.
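The storage claim is simple arithmetic. The vector math below is exact (4-byte floats × 768 dimensions); the per-chunk metadata figure is an assumption:

```python
def pgvector_storage_mb(n_chunks: int, dims: int = 768,
                        metadata_bytes_per_chunk: int = 1_500) -> float:
    """Rough pgvector footprint: 4-byte float per dimension, plus
    assumed per-chunk metadata (clause_id, page, offsets, text)."""
    vector_bytes = n_chunks * dims * 4
    meta_bytes = n_chunks * metadata_bytes_per_chunk
    return (vector_bytes + meta_bytes) / 1_000_000

# ~900 chunks: vectors alone ≈ 2.8 MB; with metadata ≈ 4 MB,
# comfortably inside the Supabase free tier's 500 MB.
```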
Decision
Whisper fine-tuning jobs (triggered when STT confidence drift is detected) run as Cloud Run GPU jobs — not Vertex AI Pipelines. The job: loads base Whisper large-v3, applies LoRA fine-tuning on accumulated low-confidence transcripts with human-corrected labels, validates WER on hold-out set, pushes new container image to Artifact Registry, deploys to Cloud Run as a new revision with traffic splitting.

The entire pipeline is a ~200-line Python script containerised in a Cloud Run job.
Alternative considered
Vertex AI Pipelines (Kubeflow-based) with Vertex AI Training for fine-tuning.

Vertex AI Pipelines provides DAG orchestration, experiment tracking, model registry, lineage tracking, and managed training infrastructure. It is the canonical GCP MLOps solution.

Rejected because: operationally over-engineered for this use case. Whisper fine-tuning on SMB dialect data runs in ~40 minutes on a single T4 GPU. Vertex AI Pipelines introduces YAML DAG definitions, managed pipeline runner costs, and a Kubeflow learning curve — none of which add value for a single-step fine-tuning job that runs quarterly at most.
Rebuttals
"Without Vertex AI Pipelines you lose experiment tracking, model lineage, and reproducibility."
Experiment tracking is handled by Cloud Logging (training metrics written as structured log events) and Artifact Registry (container images tagged with model version, training data hash, WER on validation set). Model lineage is the container image tag itself — a deterministic reference to the exact model weights used. This is not as feature-rich as Vertex AI Experiments, but it is sufficient for the audit and reproducibility requirements of an SMB HR system, and costs zero.
"When should this system graduate to Vertex AI Pipelines?"
When two conditions are met: (1) Fine-tuning frequency exceeds monthly — meaning dialect drift is significant enough to require continuous learning. (2) Multiple model variants are being evaluated in parallel (e.g., Whisper large-v3 vs a distilled variant for cost reduction). At that point, the orchestration overhead of Vertex AI Pipelines pays for itself in visibility and governance. The migration is a pipeline YAML definition — the training code is unchanged.
02 — RAG indexing pipeline
From PDF upload
to live policy governance.
When Priya uploads a new HR Policy PDF — or updates an existing one — this pipeline runs automatically. From upload to the moment the Leave Agent reasons against the new document: under 3 minutes, with zero manual intervention.
Step 01
Document Ingestion
PDF uploaded via WhatsApp or owner dashboard
Priya sends the PDF to the system WhatsApp number or uploads via dashboard
Cloud Function triggered on storage event · metadata extracted (version, upload timestamp, sha256 hash)
Previous version archived in Cloud Storage · not deleted — all versions retained for audit
< 5s
Step 02
Text Extraction
PDF → clean text with clause structure preserved
PyMuPDF extracts text with layout awareness · section headings and clause numbers retained
Clause numbering (§4.2, §6.1) is extracted as metadata · used for citation in agent responses
Tables (leave entitlement matrices) extracted and serialised as structured text
< 10s
Step 03
Chunking
256-token chunks with 32-token overlap · clause-aware boundaries
Chunk size: 256 tokens · overlap: 32 tokens · preserves cross-sentence context
Clause-aware splitting: chunks never break mid-clause · §4.2 stays in one chunk wherever possible
Each chunk tagged: { doc_version, clause_id, page, chunk_index, char_start, char_end }
< 5s
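The clause-aware chunking step can be sketched as below. This is a simplified stand-in: whitespace tokens substitute for a real tokenizer, and `§`-style headings are the clause boundary signal.

```python
import re

def chunk_policy(text: str, chunk_tokens: int = 256, overlap: int = 32):
    """Clause-aware chunking sketch: split on clause headings like
    '§4.2' so a clause stays in one chunk wherever possible, then
    window clauses longer than the chunk size with overlap."""
    clauses = re.split(r"(?=§\d+(?:\.\d+)*)", text)
    chunks = []
    for clause in filter(str.strip, clauses):
        tokens = clause.split()
        step = chunk_tokens - overlap
        for start in range(0, max(len(tokens), 1), step):
            window = tokens[start:start + chunk_tokens]
            if window:
                chunks.append(" ".join(window))
            if start + chunk_tokens >= len(tokens):
                break
    return chunks
```

In production each chunk would also carry the { doc_version, clause_id, page, chunk_index, char_start, char_end } tags described above.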
Step 04
Embedding
text-embedding-004 via Vertex AI · 768-dimensional vectors
Batched embedding calls to Vertex AI text-embedding-004 · 768 dimensions
A 30-page policy produces ~900 chunks · embedded via batched API calls · ~$0.0009 total embedding cost
Embeddings stored in Supabase pgvector with HNSW index for sub-10ms retrieval at SMB scale
< 90s
Step 05
Index Promotion
Atomic swap · old index retired · Priya notified
New chunks written to staging table · validated (chunk count, embedding dimensions, clause coverage)
Atomic swap: active_policy_version updated in Firestore · old chunks moved to archive table
Priya receives WhatsApp confirmation: "Policy updated. 847 clauses indexed. Now live."
< 10s
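The validation gate that guards the atomic swap can be expressed as a pure function. A sketch; the minimum chunk-count floor is an assumed parameter:

```python
def validate_staging(chunks: list[dict], expected_dims: int = 768,
                     min_chunks: int = 50) -> list[str]:
    """Pre-swap validation: returns a list of failures. An empty
    list means the staging index may be promoted; any failure
    triggers rollback to the previous index."""
    failures = []
    if len(chunks) < min_chunks:
        failures.append(f"chunk count {len(chunks)} below floor {min_chunks}")
    bad_dims = [c["chunk_index"] for c in chunks
                if len(c["embedding"]) != expected_dims]
    if bad_dims:
        failures.append(f"wrong embedding dims in chunks {bad_dims}")
    if not any(c.get("clause_id") for c in chunks):
        failures.append("no clause_id coverage: extraction likely failed")
    return failures
```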
03 — Monitoring & observability
Six signals.
One alerts Priya.
Five signals are technical — monitored by Cloud Monitoring with automated alerting to the engineering on-call. One signal — HITL queue depth — is surfaced directly to Priya on WhatsApp. She does not need a dashboard. She needs to know when something needs her.
Signal 01 · STT
P95
Whisper transcription latency
Measured from audio received at Pub/Sub to transcript published back to Pub/Sub. Includes GPU inference time and network transit.
Alert: P95 > 6s · Action: check GPU spot availability, scale up, or failover to Google STT
Signal 02 · STT quality
CONF
STT confidence distribution
Rolling 7-day distribution of Whisper confidence scores. A shift in the distribution (e.g. mean confidence falling from 0.94 to 0.87) signals dialect drift or audio quality degradation.
Alert: 7-day mean confidence < 0.85 · Action: review low-confidence samples, trigger fine-tuning evaluation
Signal 03 · RAG quality
P@1
Policy RAG retrieval precision
For each policy query, the top-1 retrieved clause is logged with its cosine similarity score. Precision@1 is the proportion of retrievals where the correct clause is ranked first (validated by HITL resolutions).
Alert: 7-day P@1 < 0.85 · Action: review chunking strategy, re-embed with updated parameters
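Computing precision@1 from the HITL-validated retrieval log might look like this (field names are illustrative):

```python
def precision_at_1(retrieval_log: list[dict]) -> float:
    """P@1 over HITL-validated retrievals: each entry pairs the
    top-1 retrieved clause with the clause a human resolution
    confirmed as correct. Unvalidated entries are excluded."""
    validated = [r for r in retrieval_log if r.get("correct_clause")]
    if not validated:
        return 1.0  # no validated samples yet: do not alert
    hits = sum(r["top1_clause"] == r["correct_clause"] for r in validated)
    return hits / len(validated)
```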
Signal 04 · HITL · surfaced to Priya
Q
HITL queue depth
The only signal surfaced directly to Priya. If more than 3 escalations are pending simultaneously, she receives a WhatsApp summary. Normal rate: 1–3 HITL escalations per 50 interactions.
Alert to Priya: queue depth > 3 · WhatsApp summary sent · No engineering action required
Signal 05 · Agent
E2E
End-to-end interaction latency
Total time from WhatsApp message received at Meta API to confirmation message delivered. Includes all agent steps, Firestore reads, RAG lookup, and outbound notification. Target SLA: P99 < 60 seconds.
Alert: P99 > 45s · Action: profile slow steps via Cloud Trace, check Firestore index health
Signal 06 · Cost
₹/mo
Monthly infrastructure spend
GCP billing alert configured at ₹1,500/month (75% of ₹2,000 budget). Breakdown by component logged weekly to Cloud Logging. Anomaly detection on day-on-day spend delta.
Alert: projected monthly spend > ₹1,500 · Action: review interaction volume spike, check for runaway jobs
04 — Drift detection & retraining
Four types of drift.
Four different responses.
Drift in this system manifests differently for each ML component. STT drift is acoustic — workers' speech patterns change over time or new workers join with different dialects. RAG drift is structural — the policy document changes. LLM drift is behavioural — the hosted model's outputs shift across versions. Embedding drift is silent — a model version change makes the index and the query space diverge. Each requires a different response.
Drift Type 01 · STT
Acoustic drift — dialect and speaker population shift
As Rathi Textiles hires workers from different regions or as the existing workforce's speech patterns change with tenure, Whisper's per-speaker accuracy can drift. This is detected as a downward shift in the rolling 7-day mean confidence score.

Low-confidence transcripts (confidence < 0.85) are automatically flagged and stored with a TTL of 90 days. When flagged samples exceed 15% of weekly volume, a human-review batch is triggered — the system surfaces the audio clips and transcripts to an admin interface for correction.

Corrected pairs accumulate into a fine-tuning dataset. When the dataset reaches 500 pairs, a LoRA fine-tuning job is triggered on Cloud Run GPU.
Response: LoRA fine-tuning on Cloud Run GPU · ~40 min · automated rollback if WER regresses > 5% vs baseline
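The two thresholds above (15% weekly flag rate, 500 corrected pairs) compose into a small policy function. A sketch with illustrative action names:

```python
def stt_drift_actions(weekly_total: int, weekly_flagged: int,
                      accumulated_pairs: int) -> list[str]:
    """Drift policy: 'flagged' means transcript confidence < 0.85.
    A >15% weekly flag rate queues a human-review batch; 500
    accumulated corrected pairs trigger the LoRA fine-tuning job."""
    actions = []
    if weekly_total and weekly_flagged / weekly_total > 0.15:
        actions.append("queue_human_review_batch")
    if accumulated_pairs >= 500:
        actions.append("trigger_lora_finetune_job")
    return actions
```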
Drift Type 02 · RAG
Policy drift — document update invalidates index
When Priya updates the HR policy — a new leave type, a revised notice period, a change to the grievance procedure — the existing vector index is stale from the moment of upload.

This is not gradual drift — it is an instantaneous invalidation event. The re-indexing pipeline (documented in Section 02) handles this entirely automatically. The system detects the new document version via Cloud Storage event trigger and runs the full pipeline within 3 minutes.

The old index is archived, not deleted. If the re-indexing fails validation (chunk count drops unexpectedly, embedding dimensions mismatch), the system rolls back to the previous index and alerts the engineer on call.
Response: automatic re-indexing pipeline on document upload · < 3 min · atomic swap with rollback on validation failure
Drift Type 03 · LLM
Model version drift — Gemini Flash behaviour changes across releases
Google periodically releases new versions of Gemini Flash. Each new version can change reasoning behaviour, response format, or instruction-following characteristics — even when the model name is the same.

The system pins to a specific Gemini model version string (e.g. gemini-1.5-flash-002) rather than the latest alias. Version upgrades are deliberate, tested events — not automatic.

Before promoting a new Gemini version to production: a shadow evaluation runs the new model against a held-out set of 200 leave request scenarios, comparing decision accuracy against ground-truth outcomes. Promotion requires ≥ 98% decision agreement with the baseline.
Response: model version pinned in config · shadow evaluation before promotion · 98% decision agreement threshold
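The promotion gate reduces to a decision-agreement calculation over the 200-scenario held-out set. A minimal sketch:

```python
def decision_agreement(baseline: list[str], candidate: list[str]) -> float:
    """Fraction of held-out scenarios where the candidate Gemini
    version makes the same leave decision as the pinned baseline."""
    assert len(baseline) == len(candidate)
    same = sum(b == c for b, c in zip(baseline, candidate))
    return same / len(baseline)

def may_promote(baseline: list[str], candidate: list[str],
                threshold: float = 0.98) -> bool:
    """Promotion requires >= 98% decision agreement with baseline."""
    return decision_agreement(baseline, candidate) >= threshold
```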
Drift Type 04 · RAG quality
Embedding model drift — text-embedding-004 version changes
The embedding model used to index the policy document and the embedding model used at query time must be the same version. If Google updates text-embedding-004 and the index was built with an earlier version, cosine similarity scores become unreliable — the index is effectively stale even though the policy document hasn't changed.

Embedding model version is stored as metadata alongside each chunk in Supabase. A nightly job compares the active embedding model version against the version used to build the index. On mismatch: full re-embedding is triggered automatically.
Response: embedding model version tracked per index · nightly version check · automatic re-embedding on version mismatch
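The nightly staleness check itself is a set comparison over the per-chunk version metadata:

```python
def index_is_stale(active_model_version: str,
                   chunk_versions: set[str]) -> bool:
    """True if any stored chunk was embedded with a model version
    other than the one currently serving query-time embeddings."""
    return bool(chunk_versions) and chunk_versions != {active_model_version}

# On True, the nightly job re-embeds every chunk with the active
# version and swaps the index atomically, same as a policy update.
```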
Whisper LoRA fine-tuning job — Cloud Run GPU (simplified)
# cloud_run_jobs/whisper_finetune/main.py
# Triggered when low-confidence sample count > 500 pairs

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import LoraConfig, get_peft_model
from google.cloud import storage, firestore, logging

BASE_MODEL = "openai/whisper-large-v3"
LORA_RANK = 16
WER_THRESHOLD = 0.05  # rollback if WER regresses > 5% vs baseline


def run_finetune_job():
    # 1. Load training pairs from Firestore /stt_corrections
    pairs = load_correction_pairs(min_count=500)

    # 2. LoRA configuration — minimal parameter update
    lora_config = LoraConfig(
        r=LORA_RANK,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
    )
    model = get_peft_model(
        WhisperForConditionalGeneration.from_pretrained(BASE_MODEL),
        lora_config,
    )

    # 3. Train on corrected pairs
    train(model, pairs, epochs=3, batch_size=8)

    # 4. Evaluate WER on held-out set — rollback gate
    new_wer = evaluate_wer(model, holdout_set)
    base_wer = get_baseline_wer()  # from Firestore model_versions
    if new_wer > base_wer + WER_THRESHOLD:
        log_rollback(new_wer, base_wer)
        return  # do not promote

    # 5. Build new container image, push to Artifact Registry
    image_tag = build_and_push_image(model, new_wer)

    # 6. Cloud Run traffic split: 10% new → 90% old, monitor 24h
    deploy_with_traffic_split(image_tag, new_traffic=10)

    # 7. Write model version to Firestore for audit trail
    register_model_version(image_tag, new_wer, training_pairs=len(pairs))
05 — Model registry & versioning
Every model version.
Every deployment. Auditable.
For a system making employment decisions, model versioning is not a best practice — it is a compliance requirement. In a labour dispute, the question "which model made this decision on this date?" must have a precise, retrievable answer.
Model version registry schema — Firestore /model_versions/{id}
model_name · string — e.g. "whisper-large-v3-lora-v4"
artifact_uri · string — Artifact Registry image digest (sha256)
deployed_at · timestamp — UTC ISO 8601
wer_validation · float — WER on held-out test set at deploy time
training_pairs · int — number of correction pairs used in fine-tune
traffic_weight · float — current Cloud Run traffic split (0.0–1.0)
status · "shadow" | "canary" | "production" | "rolled_back"
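A sketch of the record the fine-tuning job would write to /model_versions, mirroring the schema above (the helper name and call shape are illustrative, not the production code):

```python
from datetime import datetime, timezone

VALID_STATUSES = {"shadow", "canary", "production", "rolled_back"}

def make_model_version_record(model_name: str, artifact_uri: str,
                              wer_validation: float, training_pairs: int,
                              status: str = "shadow",
                              traffic_weight: float = 0.0) -> dict:
    """Builds the Firestore /model_versions document. New versions
    start in 'shadow' with zero traffic, per the deployment stages."""
    assert status in VALID_STATUSES
    assert 0.0 <= traffic_weight <= 1.0
    return {
        "model_name": model_name,
        "artifact_uri": artifact_uri,
        "deployed_at": datetime.now(timezone.utc).isoformat(),
        "wer_validation": wer_validation,
        "training_pairs": training_pairs,
        "traffic_weight": traffic_weight,
        "status": status,
    }
```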
Deployment strategy — Cloud Run traffic splitting
Stage 1 — Shadow (0% traffic)
New model receives all requests as a shadow copy. Responses logged but not returned to users. WER and confidence distributions compared against production model. Duration: 24 hours.
Stage 2 — Canary (10% traffic)
10% of live interactions routed to new model. Confidence scores and error rates monitored. Automated rollback if error rate exceeds baseline + 2%. Duration: 48 hours.
Stage 3 — Production (100% traffic)
Full traffic promotion on manual approval after canary success. Previous model revision retained for 30-day instant rollback. Model version written to all subsequent audit log entries.
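The Stage 2 rollback rule can be expressed directly; the return labels here are illustrative:

```python
def canary_verdict(baseline_error_rate: float, canary_error_rate: float,
                   margin: float = 0.02) -> str:
    """Stage 2 gate: automated rollback if the canary's error rate
    exceeds the production baseline by more than 2 percentage
    points; otherwise the canary awaits manual promotion."""
    if canary_error_rate > baseline_error_rate + margin:
        return "rolled_back"
    return "awaiting_manual_promotion"
```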