Page 07 · ML Engineering & MLOps

Five models. One explainability
contract. Zero black boxes.

Every ML model in the Autonomous Supply Chain is designed with its explanation contract before a single line of training code is written. SHAP values generated at inference. Model Cards versioned alongside models. EU AI Act Articles 9, 10, 13, 14, 15, and 17 satisfied by design. GDPR Article 22 position documented per model. Rollback SLA: ≤15 minutes.

5 — EU AI Act High-Risk Models
8 — Vertex AI Pipeline Steps
3 — Shared Feature Groups
≤15m — Model Rollback SLA
Section 01 · Five ML Models — EU AI Act Annex III

Five high-risk models.
Each with a defined XAI contract.

Every model below is classified as High-Risk under EU AI Act Annex III. Each carries a SHAP explanation contract — specified before training begins, generated at inference, and written to the immutable audit log before any procurement, sourcing, or operational action is taken.

Model 01
DemandIQ
Forecast Model
XGBoost + ARIMA Ensemble
Time-series ensemble
12-week rolling forecast
EU AI Act · Annex III High-Risk
Input Features — Vertex AI Feature Store
  • Salesforce pipeline value per SKU (30/60/90-day)
  • SAP IBP historical shipment data (36-month rolling)
  • Hospital procurement index (regional, by device category)
  • Macroeconomic indicators (EUR/MYR FX, PMI, logistics indices)
  • Regulatory approval calendar (new device approvals by market)
Target 12-week rolling demand forecast per SKU per site
Evaluation Metrics MAPE ≤12% (target) · WAPE · Bias
Model Card Intended use · Known limitations · Bias analysis · Training data provenance · Performance by device category
⟨φ⟩
XAI Contract · SHAP TreeExplainer
Top-5 feature contributions per forecast line — generated at inference time, written to audit log before any procurement trigger fires. Explanation artifact versioned with model in Vertex AI Model Registry.
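The top-5 selection step of this contract can be sketched in a few lines. This is an illustrative stand-in only: it assumes the per-line SHAP vector has already been produced by TreeExplainer, and all feature names and values below are made up for the example.

```python
def top_k_attributions(shap_values: dict[str, float], k: int = 5) -> list[tuple[str, float]]:
    """Rank feature contributions for one forecast line by absolute SHAP value."""
    ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]

# Illustrative SHAP vector for one SKU/site forecast line (values invented)
shap_line = {
    "sf_pipeline_value_90d": 0.42,
    "ibp_shipments_36m": -0.31,
    "hospital_procurement_index": 0.18,
    "eur_myr_fx": -0.05,
    "pmi_index": 0.02,
    "approval_calendar_signal": 0.01,
}

# The artifact written to the audit log before any procurement trigger fires
audit_record = {
    "model": "demandiq",
    "model_version": "v2.4.1",
    "top_features": top_k_attributions(shap_line, k=5),
}
```

Versioning the explanation artifact alongside the model means an auditor can replay exactly this ranking for any historical forecast line.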
Model 02
SupplierSentinel
Risk Classifier
Multi-factor XGBoost Classifier
Continuous risk scoring
per supplier per dimension
EU AI Act · Annex III High-Risk
Input Features — Vertex AI Feature Store
  • Supplier financial health score (Altman Z-score derived)
  • Geopolitical risk index (supplier country / region)
  • ESG compliance score (third-party + self-reported)
  • Sub-tier concentration score (single-source dependency)
  • News sentiment score (Gemini Vertex AI Search output)
  • Historical delivery performance (SAP Ariba)
  • Lead time volatility (SAP IBP)
Target Risk score 0–1 per supplier, per dimension, updated continuously
Evaluation Metrics AUC-ROC ≥0.85 · Precision/Recall at 0.75 threshold
Model Card Bias analysis by supplier geography · Known limitations on sub-tier data quality
⟨φ⟩
XAI Contract · SHAP TreeExplainer
Top-3 risk dimensions per supplier per event — generated before any sourcing action is taken. Supplier-level SHAP report attached to every sourcing recommendation delivered to procurement agent.
Model 03
InventoryOrchestrator
Reorder Model
Multi-output Regression
+ Safety stock optimiser
per SKU per site
EU AI Act · Annex III High-Risk
Input Features — Vertex AI Feature Store
  • DemandIQ forecast output (primary input)
  • SupplierSentinel risk score (safety stock buffer multiplier)
  • Current stock level per SKU per site (SAP S/4HANA real-time)
  • Lead time distribution per supplier (historical)
  • Holding cost per SKU (SAP finance integration)
  • Stockout cost estimate per SKU (CFO-approved parameters)
Target Reorder point + safety stock quantity per SKU per site
Evaluation Metrics Inventory cost reduction % · Stockout rate
⟨φ⟩
XAI Contract · SHAP
Attribution split across DemandIQ forecast vs SupplierSentinel risk vs lead time uncertainty — enabling procurement to understand why each reorder quantity was computed, not just what it is.
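For intuition on how the three attribution sources combine, the classical reorder-point formula below shows the shape of the computation. It is a textbook baseline, not the production multi-output regression: the demand term comes from DemandIQ, the lead-time terms from supplier history, and the risk multiplier stands in for the SupplierSentinel buffer. All numeric values are illustrative.

```python
import math

def reorder_point(
    mean_demand: float,      # DemandIQ forecast, units/week
    sigma_demand: float,     # forecast uncertainty, units/week
    mean_lead_time: float,   # weeks, historical per supplier
    sigma_lead_time: float,  # weeks
    z: float = 1.65,         # ~95% cycle service level
    risk_multiplier: float = 1.0,  # SupplierSentinel-derived buffer, >= 1.0
) -> tuple[float, float]:
    """Classical reorder point with lead-time-aware safety stock."""
    safety_stock = risk_multiplier * z * math.sqrt(
        mean_lead_time * sigma_demand**2 + mean_demand**2 * sigma_lead_time**2
    )
    rop = mean_demand * mean_lead_time + safety_stock
    return rop, safety_stock

rop, safety_stock = reorder_point(120, 25, 4, 0.5, risk_multiplier=1.2)
```

SHAP attribution over a learned model generalises exactly this decomposition: how much of the quantity came from demand, how much from lead-time uncertainty, how much from supplier risk.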
Model 04
QualityTrace NCR
Root Cause Classifier
Multi-class Gradient Boosted
+ Gemini document
intelligence layer
EU AI Act · Annex III High-Risk
Input Features — Vertex AI Feature Store
  • Batch record attributes (SAP S/4HANA + Veeva Vault)
  • Supplier certificate compliance status
  • Incoming inspection result vectors
  • Historical NCR root cause patterns (36-month)
  • Device lineage graph features (supplier → component → sub-assembly → finished device)
Target Root cause category: material defect · process deviation · supplier non-conformance · equipment failure · documentation error
Evaluation Metrics Top-1 accuracy ≥78% · Top-3 accuracy ≥93%
⟨φ⟩
XAI Contract · SHAP — 2h NCR SLA
Top contributing batch record features per root cause hypothesis — generated within 2h of NCR creation and attached to quality engineer HITL review surface before investigation is assigned.
⚖ MDR Obligation — Device lineage trace output satisfies MDR Article 87 vigilance reporting data requirement
Model 05
ContractIntelligence
TCO & Risk Scorer
Gemini 1.5 Pro + TCO Regression
1M token context
200+ clause types
EU AI Act · Annex III High-Risk
Two-layer Architecture
  • Gemini layer: Full-document clause classification and risk scoring across 200+ clause types. 1M token context — no chunking required.
  • TCO layer: Risk-adjusted total cost of ownership regression incorporating Gemini clause risk scores + SupplierSentinel risk + historical performance.
TCO Regression Features
  • Clause risk vector (Gemini output)
  • SupplierSentinel score · Historical delivery/quality performance
  • Unit price · Logistics cost estimate · Compliance cost estimate
Target Risk-adjusted TCO score per supplier per contract
⟨φ⟩
XAI Contract · SHAP + Gemini Clause Citations
SHAP on TCO regression — attribution to contract risk, supplier risk, and cost components. Plus Gemini clause citations for top-3 risk clauses, surfaced in legal reviewer HITL interface before any counter-proposal is drafted.
Section 02 · Vertex AI MLOps Pipeline

Eight-step training pipeline.
Applied uniformly across all five models.

Every model passes through the same Vertex AI Pipelines topology. Promotion is gated on metric thresholds. Model Cards are auto-generated from training run metadata. High-risk models under EU AI Act require a manual gate before full deployment.

01
Data Validation Great Expectations Pipeline Gate

Great Expectations schema validation runs against the Feature Store snapshot before any training computation begins. If schema drift is detected against the registered feature contract, the pipeline fails immediately — no training job is submitted, no compute is wasted, and an alert is raised to the ML Engineer on-call via PagerDuty.
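A minimal stand-in for the gate's fail-fast behaviour, assuming nothing about the real expectation suite. The production gate uses Great Expectations against the registered feature contract; the contract columns and types below are invented for illustration.

```python
# Fail before any training compute is requested if the snapshot
# drifts from the registered feature contract. (Illustrative only —
# the production gate is a Great Expectations suite.)
FEATURE_CONTRACT = {
    "sku_id": str,
    "site_id": str,
    "ibp_forecast_baseline": float,
    "salesforce_pipeline_value_90d": float,
}

class SchemaDriftError(Exception):
    """Raised instead of submitting a training job; triggers the on-call alert."""

def validate_snapshot(rows: list[dict]) -> None:
    for i, row in enumerate(rows):
        missing = FEATURE_CONTRACT.keys() - row.keys()
        if missing:
            raise SchemaDriftError(f"row {i}: missing columns {sorted(missing)}")
        for col, typ in FEATURE_CONTRACT.items():
            if not isinstance(row[col], typ):
                raise SchemaDriftError(f"row {i}: {col} is not {typ.__name__}")
```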

02
Feature Engineering Vertex AI Feature Store

All feature transformations are executed with lineage metadata written to the Feature Store at each step. Transformation logic is versioned alongside the feature group — ensuring that every model version can be exactly reproduced from a point-in-time feature snapshot, satisfying EU AI Act data provenance requirements.

03
Distributed Model Training Vertex AI Training Vertex AI Vizier

Distributed training on Vertex AI Training with hyperparameter tuning via Vertex AI Vizier. Each training run generates a full experiment lineage record: hyperparameter configuration, training data snapshot hash, framework version, and compute configuration — all captured in the Vertex AI Experiments registry before the model artifact is produced.

04
Model Evaluation Champion Gate

Evaluation metrics are computed on the held-out evaluation set and compared to the registered champion model performance. Promotion to the next pipeline step is gated on metric thresholds — MAPE for DemandIQ, AUC-ROC for classifiers. A challenger model that does not improve on the champion is blocked from registration.
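The gate logic is direction-aware, because "better" means lower for MAPE and higher for AUC-ROC. A sketch of that condition, with model keys and metric names chosen for the example:

```python
# Champion gate: a challenger that does not improve on the registered
# champion's primary metric is blocked from registration.
GATES = {
    "demandiq": ("mape", "lower"),          # lower is better
    "suppliersentinel": ("auc_roc", "higher"),  # higher is better
}

def passes_champion_gate(model: str, challenger: float, champion: float) -> bool:
    _metric, direction = GATES[model]
    return challenger < champion if direction == "lower" else challenger > champion
```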

05
Model Card Generation Automated

Model Card is auto-populated from training run metadata, evaluation results, and bias analysis computed over demographic slices of the evaluation set. Sections include: intended use, training data provenance, evaluation results, bias analysis, ethical considerations, EU AI Act classification, regulatory obligations, and version history. No manual authoring required.

06
Model Registration Vertex AI Model Registry

Model artifact, Model Card, and SHAP explainer artifact are registered together as a versioned bundle in Vertex AI Model Registry. Each version carries a regulatory metadata tag set: EU AI Act risk class, applicable regulatory obligations, Model Card reference, and the identity of the training run that produced it.

07
Canary Deployment 10% Traffic Split

The challenger model receives 10% of live inference traffic alongside the champion. Champion/challenger performance comparison runs over a 48-hour observation window. Both models log SHAP explanations to the audit trail during this period, enabling side-by-side explainability comparison as well as metric comparison. Canary window / retraining conflict protocol: For SupplierSentinel (weekly retraining cadence), if a drift event fires during an active 48-hour canary window, the canary is immediately paused, the champion serves 100% traffic, and the retraining pipeline takes priority. The canary observation window restarts fresh once the new challenger clears the champion gate. This conflict protocol is encoded as a pipeline condition, not an operational procedure.
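Because the conflict protocol is a pipeline condition, its effect on the traffic split can be expressed as a pure state transition. The field names and 90/10 starting split below are a sketch of that condition, not the production pipeline definition:

```python
from dataclasses import dataclass

@dataclass
class CanaryState:
    champion_traffic: int = 90     # percent
    challenger_traffic: int = 10
    window_elapsed_h: float = 0.0  # progress through the 48h window
    paused: bool = False

def on_drift_event(state: CanaryState) -> CanaryState:
    """Drift during an active canary: pause the canary, restore the champion
    to 100% traffic, and reset the observation window. The window restarts
    fresh once the retrained challenger clears the champion gate."""
    return CanaryState(
        champion_traffic=100,
        challenger_traffic=0,
        window_elapsed_h=0.0,
        paused=True,
    )

paused_state = on_drift_event(CanaryState(window_elapsed_h=30.0))
```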

08
Full Promotion Conditional Promote — Metrics + Manual Gate

Promotion to 100% traffic is conditional on two gates passing in sequence. First, canary metrics must hold against champion thresholds over the full observation window. Second, because all five models are EU AI Act High-Risk, full promotion requires explicit manual approval from the designated ML Engineer and a compliance sign-off record before traffic shift. Neither gate can be bypassed. Rollback capability is preserved: if post-promotion metrics degrade within the 72-hour monitoring window, automated rollback to the previous champion is triggered and logged to the immutable audit trail.

⚖ EU AI Act Article 14 — Human oversight gate required before full deployment of any high-risk model
↩ Rollback — automated revert to champion within 72h window if post-promotion performance degrades
Section 03 · Drift Detection & Retraining Triggers

Three drift thresholds.
One automated response chain.

Feature drift, prediction drift, and performance drift are monitored continuously via Vertex AI Model Monitoring and Cloud Monitoring custom metrics. Every drift event is timestamped, logged with the model version reference, and routed through PagerDuty before any retraining trigger fires.

Feature Drift
  • Threshold PSI > 0.2 on any top-5 feature
  • Detection Method Vertex AI Model Monitoring — feature distribution monitoring
  • Response Alert → PagerDuty → ML Engineer on-call. Investigation required before retraining decision.
  • Audit Drift event written to immutable audit log with timestamp and model version reference.
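The PSI threshold above is the standard binned formulation. A minimal sketch, assuming both distributions have already been bucketed into per-bin proportions (Vertex AI Model Monitoring computes this internally; the bin values here are invented):

```python
import math

def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions
    (each list holds per-bin proportions summing to ~1)."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]   # training-time feature distribution
current  = [0.10, 0.20, 0.30, 0.40]   # live serving distribution
drifted = psi(baseline, current) > 0.2  # fires the PagerDuty alert
```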
Prediction Drift
  • Threshold KL divergence > 0.1 on output distribution
  • Detection Method Cloud Monitoring custom metrics — prediction distribution tracking per endpoint
  • Response Alert + mandatory investigation. Distinguishes between data shift and model degradation before triggering retraining.
  • Audit Drift event written to audit log. Investigation outcome recorded before any retraining pipeline is initiated.
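The KL-divergence check can be sketched the same way, over binned output distributions. Direction of the divergence (live vs reference) is one convention among several; the distributions below are illustrative:

```python
import math

def kl_divergence(p: list[float], q: list[float], eps: float = 1e-9) -> float:
    """KL(P || Q) over binned distributions (per-bin proportions)."""
    return sum(pi * math.log(max(pi, eps) / max(qi, eps)) for pi, qi in zip(p, q))

reference = [0.05, 0.20, 0.50, 0.20, 0.05]   # champion output distribution
live      = [0.02, 0.10, 0.40, 0.33, 0.15]   # current endpoint distribution
alert = kl_divergence(live, reference) > 0.1  # mandatory investigation
```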
Performance Drift
  • DemandIQ Threshold MAPE degrades >5pp against registered champion baseline
  • Classifier Threshold AUC-ROC drops >0.05 against registered champion baseline
  • Response Automatic retraining pipeline trigger. Champion remains serving until challenger is promoted through full pipeline.
Retraining Cadence
  • SupplierSentinel Weekly — high event frequency, geopolitical volatility
  • DemandIQ · InventoryOrchestrator Monthly — demand signal lag, procurement cycle alignment
  • QualityTrace · ContractIntelligence Quarterly — lower event frequency, stable clause taxonomy
Drift Detection Infrastructure — Event Flow
Vertex AI Model Monitoring
Cloud Monitoring Custom Metrics
Alerting Policy
PagerDuty
ML Engineer On-Call
Immutable Audit Log
⚖ GDPR Article 17 — Audit Log Retention & Right to Erasure
SHAP explanation artifacts written to the audit log at inference time include input feature values. Where any feature value is derived from personal data (e.g. named contact data within supplier records, news sentiment attributed to identifiable individuals), the following controls apply:
  • Retention policy: Audit log entries are retained for 7 years to satisfy MDR and ISO 13485 traceability obligations. SHAP artifacts containing personal data are pseudonymised at write time — the supplier entity ID is stored; raw personal identifiers are not written to the audit log.
  • Erasure conflict resolution: Where a GDPR Art. 17 erasure request conflicts with a mandatory MDR retention obligation, the MDR obligation takes precedence under GDPR Art. 17(3)(b) (processing necessary for compliance with a legal obligation). The DPO is notified of each such conflict and maintains a register.
  • Immutability scope: "Immutable audit log" means append-only with cryptographic integrity verification — not that erasure is technically impossible. The pseudonymisation layer ensures erasure of the personal identifier is possible without invalidating the audit record integrity.
Section 04 · Model Card — Reference Example

DemandIQ Forecast Model.
Fully worked Model Card.

The Model Card template below is the standard applied across all five models. Every section is auto-populated from Vertex AI training run metadata and evaluation results — Model Cards are generated artifacts of the pipeline, not manually authored documents.

DemandIQ Forecast Model

XGBoost + ARIMA Time-Series Ensemble · 12-week rolling demand forecast per SKU per site

Version: v2.4.1
Registry: vertex-model-registry/demandiq
EU AI Act Class: High-Risk Annex III
Card generated: 2026-04-09T08:14:22Z
Architecture: XGBoost + ARIMA hybrid ensemble. XGBoost captures non-linear relationships across cross-sectional features. ARIMA models residual temporal structure. Ensemble weights determined by Vertex AI Vizier hyperparameter tuning.
  • Training framework: XGBoost 1.7 · statsmodels ARIMA
  • Training platform: Vertex AI Training (custom container)
  • Hyperparameter tuning: Vertex AI Vizier — Bayesian optimisation
  • SHAP explainer: TreeExplainer (model-native, fast)
Primary use: 12-week rolling demand forecast per SKU per distribution site — used as the primary input to InventoryOrchestrator for reorder point and safety stock computation.
  • Intended users: Procurement planning agents; Supply Chain Planning function
  • Out-of-scope: Forecasting demand for device categories with fewer than 24 months of SAP IBP history
  • Out-of-scope: Markets where regulatory approval calendar data is unavailable
Data sources:
  • SAP IBP historical shipments — 36-month rolling window per SKU per site
  • Salesforce pipeline value — 30/60/90-day cohorts per SKU
  • Hospital procurement index — regional, by device category (third-party)
  • EUR/MYR FX, PMI, logistics cost indices — macroeconomic data feed
  • Regulatory approval calendar — new device approvals by market (regulatory intelligence provider)
Train / Validation / Test Split — Temporal
Strict temporal split to prevent leakage: Train 2023-04-01→2025-09-30 · Validation 2025-10-01→2025-12-31 · Test (held-out) 2026-01-01→2026-03-31. No future data leaks into training window. Evaluation set is re-anchored at each retraining cycle — held-out window always represents the most recent 13 weeks not seen during training.
Training data snapshot: 2023-04-01 to 2026-03-31 · Snapshot hash: sha256:7f4a…c21b
MAPE 9.4%
WAPE 8.1%
Bias +0.3%
Champion Δ MAPE −1.2pp
Evaluated on held-out 13-week rolling window across all active SKUs and sites. Performance by device category available in the full evaluation report attached to this Model Card version in Vertex AI Model Registry.
MAPE evaluated separately by device category, geographic region, and forecast horizon (4w, 8w, 12w). Known performance degradation at 12-week horizon for newly approved device categories with fewer than 6 months of post-approval shipment data — flagged as a known limitation.
  • MAPE at 4-week horizon: 6.8% (all categories)
  • MAPE at 12-week horizon: 12.1% (all categories)
  • MAPE at 12-week horizon — new device categories: 18.4%
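The three headline metrics are standard forecast-accuracy definitions; a compact reference implementation, with an invented four-line evaluation sample:

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, % — skips zero-actual lines."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    return 100 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)

def wape(actual, forecast):
    """Weighted APE, % — total absolute error over total actual volume."""
    return 100 * sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(abs(a) for a in actual)

def bias(actual, forecast):
    """Signed forecast bias, % — positive means systematic over-forecasting."""
    return 100 * sum(f - a for a, f in zip(actual, forecast)) / sum(actual)

actual   = [100, 200, 50, 400]
forecast = [110, 190, 55, 380]
```

WAPE weights errors by volume, which is why it sits below MAPE here: the largest SKU line is also the most accurately forecast.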
Human oversight: All procurement recommendations derived from DemandIQ output above a quantity or value threshold route to a Procurement Manager HITL checkpoint before a purchase order is raised. SHAP top-5 feature contributions are presented at the HITL checkpoint.
  • No demographic data is used as a model feature
  • Forecast outputs do not directly affect individual employment decisions
  • Model Card versioned alongside model — auditors read the operational registry
GDPR Article 22 — Automated Decision-Making
DemandIQ outputs are used as inputs to a human decision, not as autonomous procurement decisions. A qualified Procurement Manager reviews and approves all PO triggers above the CFO-defined financial threshold. Below-threshold POs are automated but do not individually affect natural persons in the sense of Article 22. Right to explanation is satisfied by the SHAP report presented at HITL. No solely-automated decision with significant legal effect on a natural person is made from DemandIQ output alone.
Risk class: High-Risk — Annex III, Point 2 (AI systems used in the management and operation of critical infrastructure, where failure could pose risks to the health, safety, or fundamental rights of natural persons).
  • Article 9: Risk management system documented and versioned in Model Registry
  • Article 10: Training data governance enforced via Feature Store lineage
  • Article 13: Transparency — Model Card accessible to deployer and affected parties
  • Article 14: Human oversight — HITL checkpoint specified before deployment
  • Article 15: Accuracy, robustness & cybersecurity — OOD detection, adversarial input gate, graceful degradation, VPC endpoint isolation (see Section 06)
  • Article 17: Quality management — pipeline gated at evaluation step
ISO 13485: Demand forecast outputs are used to determine safety stock levels for medical device components — supply chain traceability maintained via InventoryOrchestrator integration.
  • Forecast lineage (input features → forecast output → reorder trigger) preserved in audit log
  • Model version referenced in every procurement trigger event
  • Retraining pipeline output subject to same registration and promotion gate
EU AI Act Article 15 — Accuracy, Robustness & Cybersecurity
Robustness requirements maintained throughout model lifecycle:
  • Out-of-distribution detection: inference requests with feature values outside ±3σ of training distribution are flagged and routed to HITL before PO action
  • Adversarial input handling: Great Expectations schema gate at inference rejects malformed or anomalous feature payloads
  • Data degradation resilience: model degrades gracefully when ≤2 non-critical features are missing — missing feature imputation is logged and flagged in SHAP output
  • Cybersecurity: model endpoint accessible only via VPC Service Controls perimeter; no public inference endpoint exposed
Section 05 · Feature Store Design

Three shared feature groups.
No feature duplication across models.

Feature groups are designed to be shared across models — eliminating training/serving skew and ensuring that a feature value computed once is consumed consistently by every model that depends on it. Feature freshness SLOs are enforced by the ingestion pipeline, not just measured. Access to feature groups is governed by IAM role-based access control — models and agents consume features via service accounts with least-privilege scope. GDPR data minimisation applies: no feature group stores personal data beyond what is necessary for the declared model purpose; supplier_features contains entity-level scores only, not the underlying personal data records from which scores are derived.

supplier_features
Key: supplier_id
  • supplier_id
  • risk_score_composite
  • financial_health_score
  • esg_score
  • geopolitical_index
  • lead_time_p50
  • lead_time_p95
  • delivery_performance_rate
Consumers
SupplierSentinel InventoryOrchestrator ContractIntelligence
Freshness SLO ≤60 seconds Real-time Pub/Sub ingestion
demand_features
Key: sku_id · site_id
  • sku_id
  • site_id
  • salesforce_pipeline_value_30d
  • salesforce_pipeline_value_90d
  • ibp_forecast_baseline
  • hospital_procurement_index
  • macro_pmi_index
Consumers
DemandIQ InventoryOrchestrator
Freshness SLO ≤1 hour SAP IBP export + Salesforce sync
quality_features
Key: supplier_id · batch_id
  • supplier_id
  • batch_id
  • inspection_pass_rate
  • ncr_frequency_90d
  • cert_compliance_status
Consumers
QualityTrace SupplierSentinel
Freshness SLO ≤4 hours Veeva Vault + SAP batch sync
Full Pipeline SLA — Trigger to Model-Ready-for-Canary
XGBoost/classifier models: ≤4 hours · Gemini-based ContractIntelligence: ≤8 hours
SupplierSentinel Effective Monitoring Gap
Weekly retraining + 4h pipeline SLA + 48h canary = max 52h gap — within acceptable operational window. Canary conflict protocol (see Step 07) prevents gap extension.
Model · Retraining Cadence · Feature Drift Threshold · Performance Drift Trigger
DemandIQ · Monthly · PSI > 0.2 on any top-5 feature · MAPE degrades >5pp vs champion
SupplierSentinel · Weekly · PSI > 0.2 on any top-5 feature · AUC-ROC drops >0.05 vs champion
InventoryOrchestrator · Monthly · PSI > 0.2 on any top-5 feature · Inventory cost KPI degrades >3%
QualityTrace · Quarterly · PSI > 0.2 on any top-5 feature · Top-1 accuracy drops >5pp vs champion
  • Compensating control: MDR-class NCRs always route to quality engineer HITL regardless of model confidence — classification degradation cannot silently affect safety-critical decisions
ContractIntelligence · Quarterly · PSI > 0.2 on clause risk vector distribution · Procurement outcome correlation < 0.65 on 90-day rolling window (KL divergence on TCO output distribution > 0.1 used as leading indicator)
Section 06 · Robustness, Rollback & GDPR Article 22

EU AI Act Article 15.
Accuracy, robustness, and cybersecurity — throughout the lifecycle.

Article 15 requires that high-risk AI systems achieve appropriate levels of accuracy, robustness, and cybersecurity across their operational lifetime — not just at deployment. This section specifies the robustness controls applied uniformly to all five models, the rollback procedure triggered when a promoted model degrades, and the GDPR Article 22 position for the two models whose outputs most directly affect third-party decisions.

Out-of-Distribution Detection
  • Trigger Any inference request with ≥1 feature value outside ±3σ of training distribution
  • Response Request flagged. Prediction returned with ood_flag=true. Downstream agent routes to HITL — no autonomous action taken on OOD inferences.
  • Audit OOD event logged with feature vector, model version, and routing decision.
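The ±3σ trigger reduces to a simple per-feature check against training-time statistics. A sketch with invented feature names and statistics:

```python
def ood_flags(
    features: dict[str, float],
    train_stats: dict[str, tuple[float, float]],  # feature -> (mean, sigma)
    k: float = 3.0,
) -> list[str]:
    """Return the features whose values fall outside mean ± k·sigma of the
    training distribution; any hit sets ood_flag=true and routes to HITL."""
    return [
        name for name, value in features.items()
        if abs(value - train_stats[name][0]) > k * train_stats[name][1]
    ]

train_stats = {"lead_time_p95": (14.0, 3.0), "risk_score_composite": (0.35, 0.1)}
request = {"lead_time_p95": 26.0, "risk_score_composite": 0.4}
flagged = ood_flags(request, train_stats)   # lead_time_p95 is 4 sigma out
```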
Adversarial & Malformed Input Handling
  • Inference Gate Great Expectations schema validation runs at inference — same contract as training pipeline Step 01. Malformed payloads are rejected before reaching model endpoint.
  • Endpoint Security All model endpoints are private, accessible only within VPC Service Controls perimeter. No public inference endpoint. IAP enforced for human-facing HITL surfaces.
  • Threat Model Model endpoints are not externally reachable. Primary attack surface is internal — covered by VPC perimeter, service account least-privilege, and Cloud Armor on API Gateway.
Graceful Degradation Under Feature Unavailability
  • Tolerance Models tolerate absence of ≤2 non-critical features. Missing features are imputed using training-set median. Imputation is flagged in SHAP output.
  • Critical Feature Unavailability If a critical feature (defined per model in Feature Store metadata) is unavailable, inference is blocked and escalated to ML Engineer on-call. No prediction with unknown reliability is served.
  • Classification Critical features per model are designated at registration time and versioned in Vertex AI Model Registry alongside the model artifact.
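The degradation policy above combines three rules: impute non-critical gaps, cap how many, and hard-block on critical gaps. A sketch under those rules, with invented feature names and medians:

```python
class CriticalFeatureUnavailable(Exception):
    """Inference is blocked and escalated to ML Engineer on-call."""

TRAIN_MEDIANS = {"esg_score": 0.62, "pmi_index": 51.0, "lead_time_p50": 9.0}
CRITICAL = {"risk_score_composite"}   # designated per model at registration time
MAX_MISSING = 2

def impute(features: dict) -> tuple[dict, list[str]]:
    """Impute up to MAX_MISSING non-critical features with training medians.
    Returns the completed vector plus the imputed names, which are flagged
    downstream in the SHAP output."""
    missing = [k for k, v in features.items() if v is None]
    if any(k in CRITICAL for k in missing):
        raise CriticalFeatureUnavailable(f"critical feature missing: {missing}")
    if len(missing) > MAX_MISSING:
        raise ValueError(f"too many missing features: {missing}")
    completed = dict(features)
    for k in missing:
        completed[k] = TRAIN_MEDIANS[k]
    return completed, missing
```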
Rollback Procedure — All Five Models
  • Rollback Trigger Post-promotion performance metrics degrade beyond threshold within 72-hour monitoring window, OR a compliance issue is identified in the promoted model.
  • Rollback SLA Champion model restored to 100% traffic within ≤15 minutes of rollback decision. Traffic shift is automated via Vertex AI endpoint split update.
  • Rollback Authority Automated rollback: triggered by metric threshold breach. Manual rollback: any designated ML Engineer or Compliance Officer can trigger via ITSM ticket — no change freeze required.
  • Audit Rollback event written to immutable audit log: timestamp, triggered-by identity, reason code, previous champion version, and post-rollback verification metric. EU AI Act Article 9 rollback record.
GDPR Article 22 — Automated Decision-Making Position · Per Model
DemandIQ
  • Applicability Not applicable — outputs are inputs to a human procurement decision, not autonomous decisions affecting natural persons directly.
  • Safeguard HITL gate above financial threshold. Procurement Manager approval required before PO raised.
  • Right to Explanation SHAP top-5 features presented at HITL checkpoint.
SupplierSentinel
  • Applicability Potentially applicable — risk scores may produce significant effects on supplier commercial relationships (individuals acting as sole traders or named representatives).
  • Safeguard All sourcing exclusion or downgrade actions require human procurement agent review. Score alone cannot trigger supplier removal — a human decision is required. HITL is structural, not configurable.
  • Right to Explanation Top-3 risk dimensions + SHAP per supplier event available to affected party on request via DPO channel. Response SLA: 72 hours.
InventoryOrchestrator
  • Applicability Not applicable — reorder decisions affect inventory levels, not natural persons.
  • Safeguard HITL gate for orders above CFO-defined financial threshold.
  • Right to Explanation SHAP attribution available in procurement dashboard.
QualityTrace
  • Applicability Not applicable — root cause classification affects batch records, not natural persons directly.
  • Safeguard All MDR-class NCRs route to quality engineer HITL unconditionally.
  • Right to Explanation Root cause SHAP report attached to quality engineer review surface.
ContractIntelligence
  • Applicability Potentially applicable — clause risk scoring and TCO output may significantly affect contract outcomes for natural persons party to supplier agreements.
  • Safeguard Legal reviewer HITL is mandatory before any counter-proposal is drafted. No contract action is taken from model output alone.
  • Right to Explanation Gemini clause citations + SHAP TCO attribution surfaced in legal reviewer interface and available to affected party on request.