Page 07 · ML Engineering & MLOps

Five models. One explainability
contract. Zero black boxes.

Every ML model in the Autonomous Supply Chain is designed with its explanation contract before a single line of training code is written. SHAP values generated at inference. Model Cards versioned alongside models. EU AI Act Articles 9, 10, 13, 14, 15, and 17 satisfied by design. GDPR Article 22 position documented per model. Rollback SLA: ≤15 minutes.

5 — EU AI Act High-Risk Models
8 — Vertex AI Pipeline Steps
3 — Shared Feature Groups
≤15m — Model Rollback SLA
Section 01 · Five ML Models — EU AI Act Annex III

Five high-risk models.
Each with a defined XAI contract.

Every model below is classified as High-Risk under EU AI Act Annex III. Each carries a SHAP explanation contract — specified before training begins, generated at inference, and written to the immutable audit log before any procurement, sourcing, or operational action is taken.

Model 01
DemandIQ
Forecast Model
XGBoost + ARIMA Ensemble
Time-series ensemble
12-week rolling forecast
EU AI Act · Annex III High-Risk
Input Features — Vertex AI Feature Store
  • Salesforce pipeline value per SKU (30/60/90-day)
  • SAP IBP historical shipment data (36-month rolling)
  • Hospital procurement index (regional, by device category)
  • Macroeconomic indicators (EUR/MYR FX, PMI, logistics indices)
  • Regulatory approval calendar (new device approvals by market)
Target 12-week rolling demand forecast per SKU per site
Evaluation Metrics MAPE ≤12% (target) · WAPE · Bias
Model Card Intended use · Known limitations · Bias analysis · Training data provenance · Performance by device category
⟨φ⟩
XAI Contract · SHAP TreeExplainer
Top-5 feature contributions per forecast line — generated at inference time, written to audit log before any procurement trigger fires. Explanation artifact versioned with model in Vertex AI Model Registry.
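The top-5 selection step of this contract can be sketched in a few lines. This is an illustrative stand-in only: it assumes the per-line SHAP vector has already been produced by TreeExplainer, and all feature names and values below are made up for the example.

```python
def top_k_attributions(shap_values: dict[str, float], k: int = 5) -> list[tuple[str, float]]:
    """Rank feature contributions for one forecast line by absolute SHAP value."""
    ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]

# Illustrative SHAP vector for one SKU/site forecast line (values invented)
shap_line = {
    "sf_pipeline_value_90d": 0.42,
    "ibp_shipments_36m": -0.31,
    "hospital_procurement_index": 0.18,
    "eur_myr_fx": -0.05,
    "pmi_index": 0.02,
    "approval_calendar_signal": 0.01,
}

# The artifact written to the audit log before any procurement trigger fires
audit_record = {
    "model": "demandiq",
    "model_version": "v2.4.1",
    "top_features": top_k_attributions(shap_line, k=5),
}
```

Versioning the explanation artifact alongside the model means an auditor can replay exactly this ranking for any historical forecast line.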
Model 02
SupplierSentinel
Risk Classifier
Multi-factor XGBoost Classifier
Continuous risk scoring
per supplier per dimension
EU AI Act · Annex III High-Risk
Input Features — Vertex AI Feature Store
  • Supplier financial health score (Altman Z-score derived)
  • Geopolitical risk index (supplier country / region)
  • ESG compliance score (third-party + self-reported)
  • Sub-tier concentration score (single-source dependency)
  • News sentiment score (Gemini Vertex AI Search output)
  • Historical delivery performance (SAP Ariba)
  • Lead time volatility (SAP IBP)
Target Risk score 0–1 per supplier, per dimension, updated continuously
Evaluation Metrics AUC-ROC ≥0.85 · Precision/Recall at 0.75 threshold
Model Card Bias analysis by supplier geography · Known limitations on sub-tier data quality
⟨φ⟩
XAI Contract · SHAP TreeExplainer
Top-3 risk dimensions per supplier per event — generated before any sourcing action is taken. Supplier-level SHAP report attached to every sourcing recommendation delivered to procurement agent.
Model 03
InventoryOrchestrator
Reorder Model
Multi-output Regression
+ Safety stock optimiser
per SKU per site
EU AI Act · Annex III High-Risk
Input Features — Vertex AI Feature Store
  • DemandIQ forecast output (primary input)
  • SupplierSentinel risk score (safety stock buffer multiplier)
  • Current stock level per SKU per site (SAP S/4HANA real-time)
  • Lead time distribution per supplier (historical)
  • Holding cost per SKU (SAP finance integration)
  • Stockout cost estimate per SKU (CFO-approved parameters)
Target Reorder point + safety stock quantity per SKU per site
Evaluation Metrics Inventory cost reduction % · Stockout rate
⟨φ⟩
XAI Contract · SHAP
Attribution split across DemandIQ forecast vs SupplierSentinel risk vs lead time uncertainty — enabling procurement to understand why each reorder quantity was computed, not just what it is.
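For intuition on how the three attribution sources combine, the classical reorder-point formula below shows the shape of the computation. It is a textbook baseline, not the production multi-output regression: the demand term comes from DemandIQ, the lead-time terms from supplier history, and the risk multiplier stands in for the SupplierSentinel buffer. All numeric values are illustrative.

```python
import math

def reorder_point(
    mean_demand: float,      # DemandIQ forecast, units/week
    sigma_demand: float,     # forecast uncertainty, units/week
    mean_lead_time: float,   # weeks, historical per supplier
    sigma_lead_time: float,  # weeks
    z: float = 1.65,         # ~95% cycle service level
    risk_multiplier: float = 1.0,  # SupplierSentinel-derived buffer, >= 1.0
) -> tuple[float, float]:
    """Classical reorder point with lead-time-aware safety stock."""
    safety_stock = risk_multiplier * z * math.sqrt(
        mean_lead_time * sigma_demand**2 + mean_demand**2 * sigma_lead_time**2
    )
    rop = mean_demand * mean_lead_time + safety_stock
    return rop, safety_stock

rop, safety_stock = reorder_point(120, 25, 4, 0.5, risk_multiplier=1.2)
```

SHAP attribution over a learned model generalises exactly this decomposition: how much of the quantity came from demand, how much from lead-time uncertainty, how much from supplier risk.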
Model 04
QualityTrace NCR
Root Cause Classifier
Multi-class Gradient Boosted
+ Gemini document
intelligence layer
EU AI Act · Annex III High-Risk
Input Features — Vertex AI Feature Store
  • Batch record attributes (SAP S/4HANA + Veeva Vault)
  • Supplier certificate compliance status
  • Incoming inspection result vectors
  • Historical NCR root cause patterns (36-month)
  • Device lineage graph features (supplier → component → sub-assembly → finished device)
Target Root cause category: material defect · process deviation · supplier non-conformance · equipment failure · documentation error
Evaluation Metrics Top-1 accuracy ≥78% · Top-3 accuracy ≥93%
⟨φ⟩
XAI Contract · SHAP — 2h NCR SLA
Top contributing batch record features per root cause hypothesis — generated within 2h of NCR creation and attached to quality engineer HITL review surface before investigation is assigned.
⚖ MDR Obligation — Device lineage trace output satisfies MDR Article 87 vigilance reporting data requirement
Model 05
ContractIntelligence
TCO & Risk Scorer
Gemini 1.5 Pro + TCO Regression
1M token context
200+ clause types
EU AI Act · Annex III High-Risk
Two-layer Architecture
  • Gemini layer: Full-document clause classification and risk scoring across 200+ clause types. 1M token context — no chunking required.
  • TCO layer: Risk-adjusted total cost of ownership regression incorporating Gemini clause risk scores + SupplierSentinel risk + historical performance.
TCO Regression Features
  • Clause risk vector (Gemini output)
  • SupplierSentinel score · Historical delivery/quality performance
  • Unit price · Logistics cost estimate · Compliance cost estimate
Target Risk-adjusted TCO score per supplier per contract
⟨φ⟩
XAI Contract · SHAP + Gemini Clause Citations
SHAP on TCO regression — attribution to contract risk, supplier risk, and cost components. Plus Gemini clause citations for top-3 risk clauses, surfaced in legal reviewer HITL interface before any counter-proposal is drafted.
Section 02 · Vertex AI MLOps Pipeline

Eight-step training pipeline.
Applied uniformly across all five models.

Every model passes through the same Vertex AI Pipelines topology. Promotion is gated on metric thresholds. Model Cards are auto-generated from training run metadata. High-risk models under EU AI Act require a manual gate before full deployment.

01
Data Validation Great Expectations Pipeline Gate

Great Expectations schema validation runs against the Feature Store snapshot before any training computation begins. If schema drift is detected against the registered feature contract, the pipeline fails immediately — no training job is submitted, no compute is wasted, and an alert is raised to the ML Engineer on-call via PagerDuty.
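A minimal stand-in for the gate's fail-fast behaviour, assuming nothing about the real expectation suite. The production gate uses Great Expectations against the registered feature contract; the contract columns and types below are invented for illustration.

```python
# Fail before any training compute is requested if the snapshot
# drifts from the registered feature contract. (Illustrative only —
# the production gate is a Great Expectations suite.)
FEATURE_CONTRACT = {
    "sku_id": str,
    "site_id": str,
    "ibp_forecast_baseline": float,
    "salesforce_pipeline_value_90d": float,
}

class SchemaDriftError(Exception):
    """Raised instead of submitting a training job; triggers the on-call alert."""

def validate_snapshot(rows: list[dict]) -> None:
    for i, row in enumerate(rows):
        missing = FEATURE_CONTRACT.keys() - row.keys()
        if missing:
            raise SchemaDriftError(f"row {i}: missing columns {sorted(missing)}")
        for col, typ in FEATURE_CONTRACT.items():
            if not isinstance(row[col], typ):
                raise SchemaDriftError(f"row {i}: {col} is not {typ.__name__}")
```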

02
Feature Engineering Vertex AI Feature Store

All feature transformations are executed with lineage metadata written to the Feature Store at each step. Transformation logic is versioned alongside the feature group — ensuring that every model version can be exactly reproduced from a point-in-time feature snapshot, satisfying EU AI Act data provenance requirements.

03
Distributed Model Training Vertex AI Training Vertex AI Vizier

Distributed training on Vertex AI Training with hyperparameter tuning via Vertex AI Vizier. Each training run generates a full experiment lineage record: hyperparameter configuration, training data snapshot hash, framework version, and compute configuration — all captured in the Vertex AI Experiments registry before the model artifact is produced.

04
Model Evaluation Champion Gate

Evaluation metrics are computed on the held-out evaluation set and compared to the registered champion model performance. Promotion to the next pipeline step is gated on metric thresholds — MAPE for DemandIQ, AUC-ROC for classifiers. A challenger model that does not improve on the champion is blocked from registration.
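The gate logic is direction-aware, because "better" means lower for MAPE and higher for AUC-ROC. A sketch of that condition, with model keys and metric names chosen for the example:

```python
# Champion gate: a challenger that does not improve on the registered
# champion's primary metric is blocked from registration.
GATES = {
    "demandiq": ("mape", "lower"),          # lower is better
    "suppliersentinel": ("auc_roc", "higher"),  # higher is better
}

def passes_champion_gate(model: str, challenger: float, champion: float) -> bool:
    _metric, direction = GATES[model]
    return challenger < champion if direction == "lower" else challenger > champion
```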

05
Model Card Generation Automated

Model Card is auto-populated from training run metadata, evaluation results, and bias analysis computed over demographic slices of the evaluation set. Sections include: intended use, training data provenance, evaluation results, bias analysis, ethical considerations, EU AI Act classification, regulatory obligations, and version history. No manual authoring required.

06
Model Registration Vertex AI Model Registry

Model artifact, Model Card, and SHAP explainer artifact are registered together as a versioned bundle in Vertex AI Model Registry. Each version carries a regulatory metadata tag set: EU AI Act risk class, applicable regulatory obligations, Model Card reference, and the identity of the training run that produced it.

07
Canary Deployment 10% Traffic Split

The challenger model receives 10% of live inference traffic alongside the champion. Champion/challenger performance comparison runs over a 48-hour observation window. Both models log SHAP explanations to the audit trail during this period, enabling side-by-side explainability comparison as well as metric comparison. Canary window / retraining conflict protocol: For SupplierSentinel (weekly retraining cadence), if a drift event fires during an active 48-hour canary window, the canary is immediately paused, the champion serves 100% traffic, and the retraining pipeline takes priority. The canary observation window restarts fresh once the new challenger clears the champion gate. This conflict protocol is encoded as a pipeline condition, not an operational procedure.
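Because the conflict protocol is a pipeline condition, its effect on the traffic split can be expressed as a pure state transition. The field names and 90/10 starting split below are a sketch of that condition, not the production pipeline definition:

```python
from dataclasses import dataclass

@dataclass
class CanaryState:
    champion_traffic: int = 90     # percent
    challenger_traffic: int = 10
    window_elapsed_h: float = 0.0  # progress through the 48h window
    paused: bool = False

def on_drift_event(state: CanaryState) -> CanaryState:
    """Drift during an active canary: pause the canary, restore the champion
    to 100% traffic, and reset the observation window. The window restarts
    fresh once the retrained challenger clears the champion gate."""
    return CanaryState(
        champion_traffic=100,
        challenger_traffic=0,
        window_elapsed_h=0.0,
        paused=True,
    )

paused_state = on_drift_event(CanaryState(window_elapsed_h=30.0))
```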

08
Full Promotion Conditional Promote — Metrics + Manual Gate

Promotion to 100% traffic is conditional on two gates passing in sequence. First, canary metrics must hold against champion thresholds over the full observation window. Second, because all five models are EU AI Act High-Risk, full promotion requires explicit manual approval from the designated ML Engineer and a compliance sign-off record before traffic shift. Neither gate can be bypassed. Rollback capability is preserved: if post-promotion metrics degrade within the 72-hour monitoring window, automated rollback to the previous champion is triggered and logged to the immutable audit trail.

⚖ EU AI Act Article 14 — Human oversight gate required before full deployment of any high-risk model
↩ Rollback — automated revert to champion within 72h window if post-promotion performance degrades
Section 03 · Drift Detection & Retraining Triggers

Three drift thresholds.
One automated response chain.

Feature drift, prediction drift, and performance drift are monitored continuously via Vertex AI Model Monitoring and Cloud Monitoring custom metrics. Every drift event is timestamped, logged with the model version reference, and routed through PagerDuty before any retraining trigger fires.

Feature Drift
  • Threshold PSI > 0.2 on any top-5 feature
  • Detection Method Vertex AI Model Monitoring — feature distribution monitoring
  • Response Alert → PagerDuty → ML Engineer on-call. Investigation required before retraining decision.
  • Audit Drift event written to immutable audit log with timestamp and model version reference.
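The PSI threshold above is the standard binned formulation. A minimal sketch, assuming both distributions have already been bucketed into per-bin proportions (Vertex AI Model Monitoring computes this internally; the bin values here are invented):

```python
import math

def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions
    (each list holds per-bin proportions summing to ~1)."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]   # training-time feature distribution
current  = [0.10, 0.20, 0.30, 0.40]   # live serving distribution
drifted = psi(baseline, current) > 0.2  # fires the PagerDuty alert
```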
Prediction Drift
  • Threshold KL divergence > 0.1 on output distribution
  • Detection Method Cloud Monitoring custom metrics — prediction distribution tracking per endpoint
  • Response Alert + mandatory investigation. Distinguishes between data shift and model degradation before triggering retraining.
  • Audit Drift event written to audit log. Investigation outcome recorded before any retraining pipeline is initiated.
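The KL-divergence check can be sketched the same way, over binned output distributions. Direction of the divergence (live vs reference) is one convention among several; the distributions below are illustrative:

```python
import math

def kl_divergence(p: list[float], q: list[float], eps: float = 1e-9) -> float:
    """KL(P || Q) over binned distributions (per-bin proportions)."""
    return sum(pi * math.log(max(pi, eps) / max(qi, eps)) for pi, qi in zip(p, q))

reference = [0.05, 0.20, 0.50, 0.20, 0.05]   # champion output distribution
live      = [0.02, 0.10, 0.40, 0.33, 0.15]   # current endpoint distribution
alert = kl_divergence(live, reference) > 0.1  # mandatory investigation
```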
Performance Drift
  • DemandIQ Threshold MAPE degrades >5pp against registered champion baseline
  • Classifier Threshold AUC-ROC drops >0.05 against registered champion baseline
  • Response Automatic retraining pipeline trigger. Champion remains serving until challenger is promoted through full pipeline.
Retraining Cadence
  • SupplierSentinel Weekly — high event frequency, geopolitical volatility
  • DemandIQ · InventoryOrchestrator Monthly — demand signal lag, procurement cycle alignment
  • QualityTrace · ContractIntelligence Quarterly — lower event frequency, stable clause taxonomy
Drift Detection Infrastructure — Event Flow
Vertex AI Model Monitoring
Cloud Monitoring Custom Metrics
Alerting Policy
PagerDuty
ML Engineer On-Call
Immutable Audit Log
⚖ GDPR Article 17 — Audit Log Retention & Right to Erasure
SHAP explanation artifacts written to the audit log at inference time include input feature values. Where any feature value is derived from personal data (e.g. named contact data within supplier records, news sentiment attributed to identifiable individuals), the following controls apply:
  • Retention policy: Audit log entries are retained for 7 years to satisfy MDR and ISO 13485 traceability obligations. SHAP artifacts containing personal data are pseudonymised at write time — the supplier entity ID is stored; raw personal identifiers are not written to the audit log.
  • Erasure conflict resolution: Where a GDPR Art. 17 erasure request conflicts with a mandatory MDR retention obligation, the MDR obligation takes precedence under GDPR Art. 17(3)(b) (processing necessary for compliance with a legal obligation). The DPO is notified of each such conflict and maintains a register.
  • Immutability scope: "Immutable audit log" means append-only with cryptographic integrity verification — not that erasure is technically impossible. The pseudonymisation layer ensures erasure of the personal identifier is possible without invalidating the audit record integrity.
Section 04 · Model Card — Reference Example

DemandIQ Forecast Model.
Fully worked Model Card.

The Model Card template below is the standard applied across all five models. Every section is auto-populated from Vertex AI training run metadata and evaluation results — Model Cards are generated artifacts of the pipeline, not manually authored documents.

DemandIQ Forecast Model

XGBoost + ARIMA Time-Series Ensemble · 12-week rolling demand forecast per SKU per site

Version: v2.4.1
Registry: vertex-model-registry/demandiq
EU AI Act Class: High-Risk Annex III
Card generated: 2026-04-09T08:14:22Z
Architecture: XGBoost + ARIMA hybrid ensemble. XGBoost captures non-linear relationships across cross-sectional features. ARIMA models residual temporal structure. Ensemble weights determined by Vertex AI Vizier hyperparameter tuning.
  • Training framework: XGBoost 1.7 · statsmodels ARIMA
  • Training platform: Vertex AI Training (custom container)
  • Hyperparameter tuning: Vertex AI Vizier — Bayesian optimisation
  • SHAP explainer: TreeExplainer (model-native, fast)
Primary use: 12-week rolling demand forecast per SKU per distribution site — used as the primary input to InventoryOrchestrator for reorder point and safety stock computation.
  • Intended users: Procurement planning agents; Supply Chain Planning function
  • Out-of-scope: Forecasting demand for device categories with fewer than 24 months of SAP IBP history
  • Out-of-scope: Markets where regulatory approval calendar data is unavailable
Data sources:
  • SAP IBP historical shipments — 36-month rolling window per SKU per site
  • Salesforce pipeline value — 30/60/90-day cohorts per SKU
  • Hospital procurement index — regional, by device category (third-party)
  • EUR/MYR FX, PMI, logistics cost indices — macroeconomic data feed
  • Regulatory approval calendar — new device approvals by market (regulatory intelligence provider)
Train / Validation / Test Split — Temporal
Strict temporal split to prevent leakage: Train 2023-04-01→2025-09-30 · Validation 2025-10-01→2025-12-31 · Test (held-out) 2026-01-01→2026-03-31. No future data leaks into training window. Evaluation set is re-anchored at each retraining cycle — held-out window always represents the most recent 13 weeks not seen during training.
Training data snapshot: 2023-04-01 to 2026-03-31 · Snapshot hash: sha256:7f4a…c21b
MAPE 9.4%
WAPE 8.1%
Bias +0.3%
Champion Δ MAPE −1.2pp
Evaluated on held-out 13-week rolling window across all active SKUs and sites. Performance by device category available in the full evaluation report attached to this Model Card version in Vertex AI Model Registry.
MAPE evaluated separately by device category, geographic region, and forecast horizon (4w, 8w, 12w). Known performance degradation at 12-week horizon for newly approved device categories with fewer than 6 months of post-approval shipment data — flagged as a known limitation.
  • MAPE at 4-week horizon: 6.8% (all categories)
  • MAPE at 12-week horizon: 12.1% (all categories)
  • MAPE at 12-week horizon — new device categories: 18.4%
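The three headline metrics are standard forecast-accuracy definitions; a compact reference implementation, with an invented four-line evaluation sample:

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, % — skips zero-actual lines."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    return 100 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)

def wape(actual, forecast):
    """Weighted APE, % — total absolute error over total actual volume."""
    return 100 * sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(abs(a) for a in actual)

def bias(actual, forecast):
    """Signed forecast bias, % — positive means systematic over-forecasting."""
    return 100 * sum(f - a for a, f in zip(actual, forecast)) / sum(actual)

actual   = [100, 200, 50, 400]
forecast = [110, 190, 55, 380]
```

WAPE weights errors by volume, which is why it sits below MAPE here: the largest SKU line is also the most accurately forecast.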
Human oversight: All procurement recommendations derived from DemandIQ output above a quantity or value threshold route to a Procurement Manager HITL checkpoint before a purchase order is raised. SHAP top-5 feature contributions are presented at the HITL checkpoint.
  • No demographic data is used as a model feature
  • Forecast outputs do not directly affect individual employment decisions
  • Model Card versioned alongside model — auditors read the operational registry
GDPR Article 22 — Automated Decision-Making
DemandIQ outputs are used as inputs to a human decision, not as autonomous procurement decisions. A qualified Procurement Manager reviews and approves all PO triggers above the CFO-defined financial threshold. Below-threshold POs are automated but do not individually affect natural persons in the sense of Article 22. Right to explanation is satisfied by the SHAP report presented at HITL. No solely-automated decision with significant legal effect on a natural person is made from DemandIQ output alone.
Risk class: High-Risk — Annex III, Point 2 (AI systems used in the management and operation of critical infrastructure, where failure could pose risks to the health, safety, or fundamental rights of natural persons).
  • Article 9: Risk management system documented and versioned in Model Registry
  • Article 10: Training data governance enforced via Feature Store lineage
  • Article 13: Transparency — Model Card accessible to deployer and affected parties
  • Article 14: Human oversight — HITL checkpoint specified before deployment
  • Article 15: Accuracy, robustness & cybersecurity — OOD detection, adversarial input gate, graceful degradation, VPC endpoint isolation (see Section 06)
  • Article 17: Quality management — pipeline gated at evaluation step
ISO 13485: Demand forecast outputs are used to determine safety stock levels for medical device components — supply chain traceability maintained via InventoryOrchestrator integration.
  • Forecast lineage (input features → forecast output → reorder trigger) preserved in audit log
  • Model version referenced in every procurement trigger event
  • Retraining pipeline output subject to same registration and promotion gate
EU AI Act Article 15 — Accuracy, Robustness & Cybersecurity
Robustness requirements maintained throughout model lifecycle:
  • Out-of-distribution detection: inference requests with feature values outside ±3σ of training distribution are flagged and routed to HITL before PO action
  • Adversarial input handling: Great Expectations schema gate at inference rejects malformed or anomalous feature payloads
  • Data degradation resilience: model degrades gracefully when ≤2 non-critical features are missing — missing feature imputation is logged and flagged in SHAP output
  • Cybersecurity: model endpoint accessible only via VPC Service Controls perimeter; no public inference endpoint exposed
Section 05 · Feature Store Design

Three shared feature groups.
No feature duplication across models.

Feature groups are designed to be shared across models — eliminating training/serving skew and ensuring that a feature value computed once is consumed consistently by every model that depends on it. Feature freshness SLOs are enforced by the ingestion pipeline, not just measured. Access to feature groups is governed by IAM role-based access control — models and agents consume features via service accounts with least-privilege scope. GDPR data minimisation applies: no feature group stores personal data beyond what is necessary for the declared model purpose; supplier_features contains entity-level scores only, not the underlying personal data records from which scores are derived.

supplier_features
Key: supplier_id
  • supplier_id
  • risk_score_composite
  • financial_health_score
  • esg_score
  • geopolitical_index
  • lead_time_p50
  • lead_time_p95
  • delivery_performance_rate
Consumers
SupplierSentinel InventoryOrchestrator ContractIntelligence
Freshness SLO ≤60 seconds Real-time Pub/Sub ingestion
demand_features
Key: sku_id · site_id
  • sku_id
  • site_id
  • salesforce_pipeline_value_30d
  • salesforce_pipeline_value_90d
  • ibp_forecast_baseline
  • hospital_procurement_index
  • macro_pmi_index
Consumers
DemandIQ InventoryOrchestrator
Freshness SLO ≤1 hour SAP IBP export + Salesforce sync
quality_features
Key: supplier_id · batch_id
  • supplier_id
  • batch_id
  • inspection_pass_rate
  • ncr_frequency_90d
  • cert_compliance_status
Consumers
QualityTrace SupplierSentinel
Freshness SLO ≤4 hours Veeva Vault + SAP batch sync
Full Pipeline SLA — Trigger to Model-Ready-for-Canary
XGBoost/classifier models: ≤4 hours · Gemini-based ContractIntelligence: ≤8 hours
SupplierSentinel Effective Monitoring Gap
Weekly retraining + 4h pipeline SLA + 48h canary = max 52h gap — within acceptable operational window. Canary conflict protocol (see Step 07) prevents gap extension.
Model · Retraining Cadence · Feature Drift Threshold · Performance Drift Trigger
DemandIQ · Monthly · PSI > 0.2 on any top-5 feature · MAPE degrades >5pp vs champion
SupplierSentinel · Weekly · PSI > 0.2 on any top-5 feature · AUC-ROC drops >0.05 vs champion
InventoryOrchestrator · Monthly · PSI > 0.2 on any top-5 feature · Inventory cost KPI degrades >3%
QualityTrace · Quarterly · PSI > 0.2 on any top-5 feature · Top-1 accuracy drops >5pp vs champion
  • Compensating control: MDR-class NCRs always route to quality engineer HITL regardless of model confidence — classification degradation cannot silently affect safety-critical decisions
ContractIntelligence · Quarterly · PSI > 0.2 on clause risk vector distribution · Procurement outcome correlation < 0.65 on 90-day rolling window (KL divergence on TCO output distribution > 0.1 used as leading indicator)
Section 06 · Robustness, Rollback & GDPR Article 22

EU AI Act Article 15.
Accuracy, robustness, and cybersecurity — throughout the lifecycle.

Article 15 requires that high-risk AI systems achieve appropriate levels of accuracy, robustness, and cybersecurity across their operational lifetime — not just at deployment. This section specifies the robustness controls applied uniformly to all five models, the rollback procedure triggered when a promoted model degrades, and the GDPR Article 22 position for the two models whose outputs most directly affect third-party decisions.

Out-of-Distribution Detection
  • Trigger Any inference request with ≥1 feature value outside ±3σ of training distribution
  • Response Request flagged. Prediction returned with ood_flag=true. Downstream agent routes to HITL — no autonomous action taken on OOD inferences.
  • Audit OOD event logged with feature vector, model version, and routing decision.
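The ±3σ trigger reduces to a simple per-feature check against training-time statistics. A sketch with invented feature names and statistics:

```python
def ood_flags(
    features: dict[str, float],
    train_stats: dict[str, tuple[float, float]],  # feature -> (mean, sigma)
    k: float = 3.0,
) -> list[str]:
    """Return the features whose values fall outside mean ± k·sigma of the
    training distribution; any hit sets ood_flag=true and routes to HITL."""
    return [
        name for name, value in features.items()
        if abs(value - train_stats[name][0]) > k * train_stats[name][1]
    ]

train_stats = {"lead_time_p95": (14.0, 3.0), "risk_score_composite": (0.35, 0.1)}
request = {"lead_time_p95": 26.0, "risk_score_composite": 0.4}
flagged = ood_flags(request, train_stats)   # lead_time_p95 is 4 sigma out
```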
Adversarial & Malformed Input Handling
  • Inference Gate Great Expectations schema validation runs at inference — same contract as training pipeline Step 01. Malformed payloads are rejected before reaching model endpoint.
  • Endpoint Security All model endpoints are private, accessible only within VPC Service Controls perimeter. No public inference endpoint. IAP enforced for human-facing HITL surfaces.
  • Threat Model Model endpoints are not externally reachable. Primary attack surface is internal — covered by VPC perimeter, service account least-privilege, and Cloud Armor on API Gateway.
Graceful Degradation Under Feature Unavailability
  • Tolerance Models tolerate absence of ≤2 non-critical features. Missing features are imputed using training-set median. Imputation is flagged in SHAP output.
  • Critical Feature Unavailability If a critical feature (defined per model in Feature Store metadata) is unavailable, inference is blocked and escalated to ML Engineer on-call. No prediction with unknown reliability is served.
  • Classification Critical features per model are designated at registration time and versioned in Vertex AI Model Registry alongside the model artifact.
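The degradation policy above combines three rules: impute non-critical gaps, cap how many, and hard-block on critical gaps. A sketch under those rules, with invented feature names and medians:

```python
class CriticalFeatureUnavailable(Exception):
    """Inference is blocked and escalated to ML Engineer on-call."""

TRAIN_MEDIANS = {"esg_score": 0.62, "pmi_index": 51.0, "lead_time_p50": 9.0}
CRITICAL = {"risk_score_composite"}   # designated per model at registration time
MAX_MISSING = 2

def impute(features: dict) -> tuple[dict, list[str]]:
    """Impute up to MAX_MISSING non-critical features with training medians.
    Returns the completed vector plus the imputed names, which are flagged
    downstream in the SHAP output."""
    missing = [k for k, v in features.items() if v is None]
    if any(k in CRITICAL for k in missing):
        raise CriticalFeatureUnavailable(f"critical feature missing: {missing}")
    if len(missing) > MAX_MISSING:
        raise ValueError(f"too many missing features: {missing}")
    completed = dict(features)
    for k in missing:
        completed[k] = TRAIN_MEDIANS[k]
    return completed, missing
```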
Rollback Procedure — All Five Models
  • Rollback Trigger Post-promotion performance metrics degrade beyond threshold within 72-hour monitoring window, OR a compliance issue is identified in the promoted model.
  • Rollback SLA Champion model restored to 100% traffic within ≤15 minutes of rollback decision. Traffic shift is automated via Vertex AI endpoint split update.
  • Rollback Authority Automated rollback: triggered by metric threshold breach. Manual rollback: any designated ML Engineer or Compliance Officer can trigger via ITSM ticket — no change freeze required.
  • Audit Rollback event written to immutable audit log: timestamp, triggered-by identity, reason code, previous champion version, and post-rollback verification metric. EU AI Act Article 9 rollback record.
GDPR Article 22 — Automated Decision-Making Position · Per Model
DemandIQ
  • Applicability Not applicable — outputs are inputs to a human procurement decision, not autonomous decisions affecting natural persons directly.
  • Safeguard HITL gate above financial threshold. Procurement Manager approval required before PO raised.
  • Right to Explanation SHAP top-5 features presented at HITL checkpoint.
SupplierSentinel
  • Applicability Potentially applicable — risk scores may produce significant effects on supplier commercial relationships (individuals acting as sole traders or named representatives).
  • Safeguard All sourcing exclusion or downgrade actions require human procurement agent review. Score alone cannot trigger supplier removal — a human decision is required. HITL is structural, not configurable.
  • Right to Explanation Top-3 risk dimensions + SHAP per supplier event available to affected party on request via DPO channel. Response SLA: 72 hours.
InventoryOrchestrator
  • Applicability Not applicable — reorder decisions affect inventory levels, not natural persons.
  • Safeguard HITL gate for orders above CFO-defined financial threshold.
  • Right to Explanation SHAP attribution available in procurement dashboard.
QualityTrace
  • Applicability Not applicable — root cause classification affects batch records, not natural persons directly.
  • Safeguard All MDR-class NCRs route to quality engineer HITL unconditionally.
  • Right to Explanation Root cause SHAP report attached to quality engineer review surface.
ContractIntelligence
  • Applicability Potentially applicable — clause risk scoring and TCO output may significantly affect contract outcomes for natural persons party to supplier agreements.
  • Safeguard Legal reviewer HITL is mandatory before any counter-proposal is drafted. No contract action is taken from model output alone.
  • Right to Explanation Gemini clause citations + SHAP TCO attribution surfaced in legal reviewer interface and available to affected party on request.