The Autonomous Enterprise / Page 06

ML Engineering
& MLOps
— designed for production.

The agent tool manifests on Page 05 reference five ML models by name. This page designs every one of them — feature engineering, model architecture, SHAP explanation contracts, Vertex AI Pipelines DAG, drift detection, and full Model Cards. Every model satisfies EU AI Act Article 11 documentation requirements before it ships.

5 ML Models · Vertex AI Pipelines · Feature Store · SHAP · XAI-first · Model Cards · EU AI Act Art. 11 · Drift Detection
ML Platform

Five models. One shared platform.

All five AE models share a common Vertex AI platform — one Feature Store, one Model Registry, one Pipelines infrastructure, one monitoring stack. The shared platform means that MLOps patterns proven on RevRec AI are inherited by every subsequent model. The HITL-11 promotion checkpoint is a platform-level gate, not a model-specific configuration.

Vertex AI ML Platform — Shared Infrastructure
Five models · one Feature Store · one Model Registry · one Pipelines execution environment · one monitoring stack
DATA SOURCES: Salesforce REST API · Pub/Sub Asset Events · BigQuery Transactions · Document AI Output · GCS Contract Store
VERTEX AI FEATURE STORE: Contract Features (18 features · online+offline) · Asset Features (24 features · streaming) · Financial Features (12 features · batch+streaming)
VERTEX AI PIPELINES: Data Validation → Training → Eval + XAI → HITL-11 Promotion gate → Deploy
VERTEX AI MODEL REGISTRY: 5 models · versioned Model Cards · stage: dev → staging → prod · each version linked to training pipeline run, HITL-11 approval record, SHAP baseline
VERTEX AI MONITORING: feature drift · prediction drift · 5 monitoring jobs (one per model) · alert → HITL-10 retraining checkpoint → Pub/Sub baseline-update topic
XAI / SHAP LAYER: SHAP computed at inference time · TreeExplainer or LinearExplainer per model · written to BigQuery shap_explanations before any downstream action · EU AI Act Art. 13
DEPLOYED MODELS (served via Vertex AI Endpoints): RevRec AI ASC 606 Classifier · Asset IQ RUL Regressor · Asset IQ Anomaly Detector · ContractGuard Risk Scorer · FinRisk Anomaly Scorer
All model predictions, feature values, and confidence scores flow to Vertex AI Monitoring → drift alerts → HITL-10
Feature Store

Every feature traceable to its source event.

The Vertex AI Feature Store is the single source of truth for all ML features across the AE. Features are computed once and served to all models that need them — no duplicated feature logic, no inconsistency between training and serving. Every feature value carries a lineage tag: source system, ingestion timestamp, schema version, and quality score.
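As a concrete sketch, the lineage tag can be modelled as a small record attached to every feature write. The schema below is illustrative — the field names and `tag_feature` helper are assumptions for this page, not the production Feature Store contract:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class FeatureLineage:
    """Lineage tag carried by every feature value (illustrative schema)."""
    source_system: str    # e.g. "salesforce", "pubsub-asset-events"
    ingested_at: str      # ISO-8601 ingestion timestamp
    schema_version: str   # version of the source schema at ingestion time
    quality_score: float  # 0.0-1.0 data-quality score from validation

def tag_feature(value, lineage: FeatureLineage) -> dict:
    """Wrap a feature value with its lineage metadata for a Feature Store write."""
    return {"value": value, "lineage": asdict(lineage)}

tagged = tag_feature(
    0.42,
    FeatureLineage(
        source_system="salesforce",
        ingested_at=datetime.now(timezone.utc).isoformat(),
        schema_version="v3",
        quality_score=0.98,
    ),
)
```

Every downstream consumer — training pipeline, serving path, audit query — can then trace a feature value back to its source event without a side channel.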

Contract Feature Group
18 features · Contract + Clause intelligence
liability_cap_ratio: Liability cap as ratio of total contract value · Float · computed at contract ingestion
governing_law_match: Boolean — governing law matches ClaraVis standard jurisdiction · computed by ContractGuard
indemnification_asymmetry: Ratio of ClaraVis vs counterparty indemnification obligations · Float 0–1
payment_term_days: Net payment days from invoice · Integer · from Salesforce Contract object
customer_type_encoded: Hospital tier (1–4), academic vs private, country risk · Integer encoded
contract_value_eur: Total contract value EUR · Float · from Salesforce
sku_complexity_score: Number of distinct SKUs · service component ratio · Float computed
recognition_type_prior: Prior ASC 606 classification for this customer · One-hot encoded
+ 10 additional clause features: IP ownership, termination for convenience, acceptance criteria, SLA penalty structure…
Online store: real-time inference · ContractGuard + RevRec AI
Offline store: training pipeline · batch ETL from Salesforce REST API
Asset Feature Group
24 features · Sensor + operational time-series
gradient_coil_temp_p95: 95th percentile gradient coil temperature over 30-day rolling window · Float · from DICOM telemetry
helium_level_slope: Rate of change of liquid helium level over 14-day window · Float · from service events
rf_power_deviation: RF transmit power deviation from baseline · Float · streaming from unit
scan_utilisation_rate: Daily scan hours as ratio of maximum rated capacity · Float
error_code_frequency: Count of each error code class per 7-day window · Integer vector
days_since_last_service: Calendar days since last planned maintenance · Integer
unit_age_months: Months since factory commissioning · Integer
cumulative_scan_count: Total scans since commissioning · Integer
+ 16 additional sensor features: Magnet bore pressure, cryocooler vibration, patient weight capacity utilisation…
Online store: real-time RUL + anomaly inference · 6 regional Pub/Sub pipelines
Offline store: training pipeline · 3-year historical telemetry in BigQuery
Financial Feature Group
12 features · Transaction + account financial signals
payment_zscore_90d: Z-score of payment amount vs 90-day rolling mean for this account · Float
days_overdue: Days past invoice due date at time of event · Integer
account_payment_consistency: Coefficient of variation of payment timing over 12 months · Float
revenue_posting_delta: Difference between expected and actual revenue posting amount · Float
warranty_reserve_movement: Change in warranty reserve for this account in 30-day window · Float
account_risk_tier: ClaraVis internal credit risk classification · Integer 1–5
+ 6 additional financial features: FX exposure ratio, multi-currency invoice flag, invoice dispute history…
Online store: real-time FinRisk anomaly scoring · BigQuery streaming inserts
Offline store: training pipeline · SAP transaction history + Salesforce account data
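To make the compute-once idea concrete, here is a minimal sketch of one financial feature, payment_zscore_90d, written as a single transform that both the offline and online paths would import (the single-account framing is a simplification; production code would group by account_id):

```python
import pandas as pd

def payment_zscore_90d(amounts: pd.Series) -> pd.Series:
    """Z-score of each payment vs the account's 90-day rolling mean.

    `amounts`: one account's payment amounts indexed by a DatetimeIndex.
    Because the same function serves training and serving, there is no
    train/serve divergence in the feature logic.
    """
    mean = amounts.rolling("90D").mean()
    std = amounts.rolling("90D").std()
    return (amounts - mean) / std
```

A payment far above the account's recent norm yields a large positive z-score, which is exactly the signal FinRisk Sentinel's anomaly scoring consumes.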
Model Specifications

Five models. Every design decision documented.

Each model specification covers the problem framing, feature inputs, architecture choice, training data, evaluation metrics, and SHAP explanation contract. Architecture choices are cross-referenced to ADRs — no undocumented decisions.

Model 01
RevRec AI — ASC 606 Revenue Recognition Classifier
Multi-class classification · 3 classes: SALE · LEASE · MULTI-ELEMENT
EU AI Act — High Risk · Annex III
Features (from Feature Store)
liability_cap_ratio · governing_law_match
indemnification_asymmetry · payment_term_days
customer_type_encoded · contract_value_eur
sku_complexity_score · recognition_type_prior
+ 10 clause features · 18 total
Architecture
XGBoost multi-class classifier · TreeExplainer for SHAP · deterministic outputs (critical for audit trail) · ADR-010
Training Data
4,800 historical ClaraVis contracts (2019–2025)
Labels: manual ASC 606 classifications by Finance team
Class distribution: 62% SALE · 28% LEASE · 10% MULTI-ELEMENT
Class imbalance handled: SMOTE oversampling on MULTI-ELEMENT
Train/val/test split: 70/15/15 · stratified by class
HITL override decisions added to training set each retraining cycle
Evaluation Metrics
Weighted F1: 0.94
MULTI-ELEMENT Recall: 0.91
MULTI-ELEMENT Precision: 0.89
Calibration Error (ECE): 0.032
Min confidence for auto: 0.70
Primary metric: MULTI-ELEMENT Recall. Missing a multi-element arrangement results in revenue over-recognition — the higher-risk error in a regulated context.
SHAP Explanation Contract
What the model must explain: Top 5 features driving the classification · directional effect (↑↓) · SHAP value · Feature value at inference time

To whom: Finance Controller via HITL-04 UI · Compliance Officer via audit query

When: Computed synchronously at inference · written to BigQuery shap_explanations before SAP write is initiated

Format: JSON array · feature_name · shap_value · feature_value · direction
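Assuming SHAP values have already been computed by the TreeExplainer, the contract's JSON format can be sketched as follows (the helper name is illustrative):

```python
import json
import numpy as np

def build_shap_payload(feature_names, feature_values, shap_values, top_k=5):
    """Build the per-inference explanation record per the SHAP contract:
    the top-k features by |SHAP value|, each with name, SHAP value, raw
    feature value, and directional effect. This record is written to the
    BigQuery shap_explanations table before the SAP write is initiated."""
    order = np.argsort(-np.abs(shap_values))[:top_k]
    payload = [
        {
            "feature_name": feature_names[i],
            "shap_value": float(shap_values[i]),
            "feature_value": float(feature_values[i]),
            "direction": "up" if shap_values[i] > 0 else "down",
        }
        for i in order
    ]
    return json.dumps(payload)
```

The Finance Controller's HITL-04 UI renders this array directly; the audit query path reads the same rows from BigQuery.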
Model 02
Asset IQ — RUL (Remaining Useful Life) Regressor
Regression · output: days_to_failure (continuous) + confidence interval
EU AI Act — High Risk · Annex III
Features (from Feature Store)
gradient_coil_temp_p95 · helium_level_slope
rf_power_deviation · scan_utilisation_rate
error_code_frequency (vector) · days_since_last_service
unit_age_months · cumulative_scan_count
+ 16 sensor features · 24 total
Architecture
Gradient Boosting Regressor (XGBoost) · TreeExplainer SHAP · quantile regression for confidence intervals (q10, q50, q90)
Training Data
3 years of field telemetry · 8,400 unit-quarters from 6 regional systems
Labels: actual days to failure from service records (post-hoc labelled)
Failure definition: unplanned service event requiring parts replacement
Censored data handled: survival analysis preprocessing for units still running
Train/val/test split: 70/15/15 · stratified by unit age bucket
Evaluation Metrics
MAE: 4.2 days
RMSE: 6.8 days
Precision @ 14-day horizon: 0.87
Recall @ 14-day horizon: 0.91
Min confidence for auto WO: 0.82
Primary metric: Recall @ 14-day horizon. Missed failures that lead to unplanned downtime are the higher-cost error.
SHAP Explanation Contract
What the model must explain: Top 3 sensor features driving the RUL prediction · current value vs 90-day baseline · SHAP contribution to days-reduction

To whom: Field Service Manager via HITL-06 UI · Field Engineer via work order brief

When: Computed synchronously at prediction · included in work order before dispatch

Format: Sensor name · current value · baseline value · SHAP days-reduction contribution
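The quantile-interval design can be sketched with scikit-learn's GradientBoostingRegressor standing in for the production XGBoost setup — one regressor per quantile, on synthetic data (all names and values here are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))                  # stand-in degradation signal
y = 100 - 8 * X[:, 0] + rng.normal(0, 5, size=400)     # synthetic days-to-failure

# One model per quantile: q50 is the RUL point estimate,
# q10-q90 form the reported confidence interval.
models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q,
                                 n_estimators=200, random_state=0).fit(X, y)
    for q in (0.10, 0.50, 0.90)
}

x_new = np.array([[5.0]])
q10, q50, q90 = (models[q].predict(x_new)[0] for q in (0.10, 0.50, 0.90))
```

The width of the q10–q90 band is what lets the 0.82 auto-dispatch confidence gate distinguish a tight prediction from a vague one.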
Model 03
Asset IQ — Unit-Level Anomaly Detector
Unsupervised anomaly detection · output: anomaly_score (0–1) + contributing sensors
EU AI Act — High Risk · Annex III
Features (from Feature Store)
All 24 asset features · same group as RUL model
Additionally: rolling 7-day feature deltas (rate of change)
Cross-unit deviation: unit vs fleet median per feature
48 effective input dimensions after delta computation
Architecture
Isolation Forest · unsupervised · no failure labels required · ADR-011 · SHAP via TreeExplainer on the underlying decision trees
Training Data
Normal operating data only (no failure labels required)
18 months of telemetry · 6,200 unit-months of normal operation
Contamination parameter: 0.05 (5% expected anomaly rate)
Separate model trained per unit model variant (MRI-7T, MRI-3T, CT-Premium)
Retrained quarterly or on drift detection trigger
Evaluation Metrics
Precision @ 0.75 threshold: 0.82
Recall @ 0.75 threshold: 0.78
False Positive Rate: 0.04
Fleet anomaly threshold: ≥ 3 units
Alert threshold (score): ≥ 0.75
Primary metric: False Positive Rate — FSM alert fatigue is the adoption risk. Precision over recall at the alert threshold.
SHAP Explanation Contract
What the model must explain: Top 3 sensors contributing to anomaly score · each with: current value, fleet median, deviation magnitude, SHAP contribution

To whom: Field Service Manager via HITL-06 · Field Engineer on work order

When: Computed at alert generation · included in HITL-06 presentation

Format: Sensor name · current · fleet_median · deviation · SHAP contribution
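A minimal sketch of the per-variant detector with scikit-learn, on synthetic telemetry. One detail worth pinning down: sklearn's score_samples is the negated anomaly score from the original Isolation Forest paper, so negating it recovers the 0–1 scale the 0.75 alert threshold refers to:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal_ops = rng.normal(0, 1, size=(1000, 4))   # stand-in for normal telemetry

# Trained on normal operating data only -- no failure labels required.
clf = IsolationForest(contamination=0.05, random_state=0).fit(normal_ops)

def anomaly_score(X):
    """Score in (0, 1]; higher = more anomalous. Alerts fire at >= 0.75."""
    return -clf.score_samples(X)

scores = anomaly_score(normal_ops)
outlier_score = anomaly_score(np.array([[8.0, -8.0, 8.0, -8.0]]))[0]
```

A separate instance of this model is fitted per unit variant (MRI-7T, MRI-3T, CT-Premium), since each variant has its own sensor baselines.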
Model 04
ContractGuard — Clause Risk Scorer
Binary classification per clause · output: risk_score (0–1) · threshold: 0.65 → HITL-02
EU AI Act — High Risk · Annex III
Features (computed per clause)
clause_type (200+ taxonomy · one-hot encoded)
liability_cap_ratio (clause-level, where applicable)
governing_law_match (Boolean)
indemnification_direction (ClaraVis-favourable vs counterparty-favourable)
deviates_from_standard (Boolean · vs ClaraVis standard terms)
contract_value_tier (bucketed · proxy for deal risk)
semantic_embedding (text-embedding-004 · 768-dim · from Gemini)
precedent_similarity_max (max cosine similarity to historical corpus)
Architecture
XGBoost binary classifier on structured features + semantic embedding · SHAP TreeExplainer · structured features interpretable, embedding via SHAP kernel approximation
Training Data
12,400 labelled clauses from 4,800 historical contracts
Labels: Legal team risk classifications (high-risk / standard)
Class distribution: 18% high-risk · 82% standard
Class imbalance: class_weight='balanced' in XGBoost
Gemini text-embedding-004 embeddings computed at training time · stored in Feature Store
HITL Legal decisions (approve/revise/escalate) added each retraining cycle
Evaluation Metrics
High-Risk Recall: 0.95
High-Risk Precision: 0.82
AUC-ROC: 0.96
False Negative Rate: 0.05
HITL threshold (risk score): ≥ 0.65
Primary metric: High-Risk Recall. Missing a high-risk clause has higher cost than a false positive that sends a standard clause to Legal review.
SHAP Explanation Contract
What the model must explain: Top 5 structured features driving risk score · semantic similarity score to highest-risk precedent · governing law contribution

To whom: General Counsel via HITL-02 UI · presented alongside clause text and precedents

When: Computed synchronously per flagged clause · written to Firestore clause_analysis collection

Format: Feature name · value · SHAP contribution · direction
Model 05
FinRisk Sentinel — Financial Anomaly Scorer
Unsupervised anomaly detection · output: anomaly_score (0–1) + Z-score vs baseline
EU AI Act — High Risk · Annex III
Features (from Feature Store)
All 12 financial features from Financial Feature Group
payment_zscore_90d · days_overdue · account_payment_consistency
revenue_posting_delta · warranty_reserve_movement
account_risk_tier · FX exposure ratio
Rolling 7-day and 30-day feature deltas
Architecture
Isolation Forest (same pattern as Asset IQ anomaly model · ADR-011) · contamination: 0.03 · trained per event_type (payment, posting, reserve) · real-time scoring via Vertex AI endpoint
Training Data
24 months of financial transaction history · 48,000 payment and posting events
Normal operating data only · fraud/anomaly labels not required for training
Separate models per event_type: payment · GL posting · warranty reserve
HITL false-positive feedback fed back via baseline update queue (Pub/Sub)
Retrained monthly or on drift detection trigger
Evaluation Metrics
Alert Precision @ 0.65: 0.78
HITL Precision @ 0.85: 0.91
False Positive Rate @ 0.85: 0.03
Alert threshold: ≥ 0.65
HITL threshold (high severity): ≥ 0.85
Primary metric: HITL Precision @ 0.85 — high-severity CFO alerts must be reliable. False positives at this tier erode trust rapidly.
SHAP Explanation Contract
What the model must explain: Top 3 financial features driving anomaly score · current value · 90-day baseline · Z-score · SHAP contribution

To whom: Finance Controller (medium) · CFO + FC simultaneously (high severity HITL-08)

When: Computed at alert generation · included in HITL-08 presentation alongside Z-score and entity context

Format: Feature name · current · baseline_90d · z_score · SHAP contribution
MLE Design Decisions

Seven questions a senior MLE will ask — answered in advance.

These are the gaps a principal ML engineer probes in a design review. Each decision below is documented because its absence would read as an oversight. None of these are afterthoughts — they shaped the design from the start.

Decision 01 — Threshold Selection
How confidence thresholds were chosen — not guessed
Every confidence threshold in this portfolio was selected by finding the operating point on the precision-recall curve where the business cost of a false negative equals the estimated cost of routing to a human HITL reviewer. For RevRec AI: a missed MULTI-ELEMENT classification costs an average of €18K in revenue restatement; a HITL routing costs approximately 20 minutes of Finance Controller time. The 0.70 threshold is the PR curve point where these costs are equal. For Asset IQ: a missed failure prediction costs an average of €42K in emergency dispatch + hospital disruption; a HITL routing costs 30 minutes of FSM time. The 0.82 threshold reflects that asymmetry — it is higher because the miss cost is higher. Thresholds are not fixed: they are re-evaluated at each model version promotion as part of the HITL-11 review, using the most recent HITL override cost data.
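The cost-equalised operating point can be sketched as a grid search over candidate thresholds. The helper name and synthetic error model below are illustrative — production uses the actual PR curve and measured HITL cost data:

```python
import numpy as np

def select_threshold(confidences, is_error, c_miss, c_review):
    """Pick the auto-approval confidence threshold that minimises expected
    cost: predictions below the threshold route to HITL review (c_review
    each); predictions at or above it auto-process, and any errors among
    them cost c_miss. Minimising total cost finds the point where the
    marginal miss cost and review cost balance."""
    candidates = np.linspace(0.50, 0.99, 50)
    costs = []
    for t in candidates:
        routed = confidences < t            # below threshold -> HITL review
        missed = (~routed) & is_error       # auto-processed errors
        costs.append(routed.sum() * c_review + missed.sum() * c_miss)
    return float(candidates[int(np.argmin(costs))])
```

The ratio of c_miss to c_review is what moves the optimum: a large miss cost (Asset IQ's €42K emergency dispatch) pushes the threshold up relative to a cheaper miss.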
Decision 02 — Train/Val/Test Split Strategy
Chronological splits for time-series models — no temporal leakage
RevRec AI and ContractGuard use stratified random splits (70/15/15) — the training examples are independent contracts with no temporal dependency. Asset IQ (RUL + Anomaly) and FinRisk Sentinel use chronological splits: train on the oldest 70% of the time window, validate on the next 15%, test on the most recent 15%. Stratified random splitting on time-series data would cause temporal leakage — the model would see future failure patterns during training that it should only discover at inference time. All split boundaries are hard date cuts, not sampled boundaries. The test set for time-series models is intentionally the most recent data — the distribution closest to the live production environment.
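A hard date-cut split can be sketched as follows (function name illustrative):

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, ts_col: str, frac=(0.70, 0.15, 0.15)):
    """Chronological split for time-series models: oldest 70% train, next
    15% validation, most recent 15% test. No shuffling -- rows never cross
    the date boundary, which prevents temporal leakage."""
    df = df.sort_values(ts_col)
    n = len(df)
    i_train = int(n * frac[0])
    i_val = i_train + int(n * frac[1])
    return df.iloc[:i_train], df.iloc[i_train:i_val], df.iloc[i_val:]
```

Because the input is sorted before cutting, the split is a pure function of the timestamps — re-running it on the same data always yields the same boundaries.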
Decision 03 — Train/Serve Skew Prevention
One feature transformation codebase — no duplication between training and serving
Train/serve skew — where feature transformation logic diverges between the offline training pipeline and the online serving path — is one of the most common causes of silent model degradation in production. In the AE, this is prevented structurally: feature transformation logic lives in a single shared Python module (ae_features) that is imported by both the Vertex AI Pipeline training step and the Cloud Run agent serving path. There is no duplicated transformation code. Additionally, the Vertex AI Feature Store's feature statistics (mean, std, percentile distribution) are compared between the offline training snapshot and the online serving window as part of the daily monitoring job. Any divergence above the PSI threshold triggers the data drift alert chain.
Decision 04 — Label Quality (ContractGuard)
Inter-annotator agreement validated before any clause enters the training set
The ContractGuard training labels come from the ClaraVis Legal team — and legal professionals disagree on clause risk classification. This is not assumed away. Inter-annotator agreement was computed using Cohen's Kappa across a stratified sample of 500 clauses labelled independently by three Legal team members. The result: κ = 0.74, indicating substantial agreement. Clauses with pairwise disagreement (where at least two reviewers disagreed) were excluded from the training set entirely — they were not resolved by majority vote, because majority vote on ambiguous clauses injects noise as signal. The 12,400-clause training set reflects only clauses where Legal team agreement was unanimous or where a designated senior counsel made a final determination. This approach accepts a smaller training set in exchange for higher label quality.
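The agreement computation can be sketched with scikit-learn's cohen_kappa_score, applied pairwise across the three annotators, together with the unanimity filter (helper names are illustrative):

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(labels: np.ndarray) -> float:
    """labels: (n_clauses, n_annotators) array of risk labels.
    Returns the mean pairwise Cohen's kappa across annotator pairs."""
    n_annot = labels.shape[1]
    scores = [cohen_kappa_score(labels[:, a], labels[:, b])
              for a, b in combinations(range(n_annot), 2)]
    return float(np.mean(scores))

def unanimous_only(labels: np.ndarray) -> np.ndarray:
    """Mask of clauses where all annotators agree. Disagreements are
    excluded rather than majority-voted, per the label-quality policy."""
    return (labels == labels[:, [0]]).all(axis=1)
```

Running the unanimity mask before training is what trades training-set size for label quality, as described above.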
Decision 05 — Class Imbalance Strategy
SMOTE for RevRec AI, class weighting for ContractGuard — different approaches for different imbalance severities
These are not interchangeable techniques applied inconsistently. RevRec AI's MULTI-ELEMENT class is 10% of the training set — at this severity, class weighting alone leaves the learner with too few minority examples to shape a stable decision boundary around multi-element arrangements. SMOTE oversampling on the training set only (never the validation or test sets) brings the minority class to 20% of the training data; the held-out sets keep the natural 10% distribution (~72 MULTI-ELEMENT examples in the 720-record test set), so reported recall reflects production conditions. ContractGuard's high-risk class is 18% — at this level, class_weight='balanced' in XGBoost is sufficient to adjust the decision boundary without introducing synthetic data that might not reflect real clause patterns. The choice between SMOTE and weighting is severity-dependent and documented explicitly to withstand scrutiny.
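A minimal SMOTE-style interpolation sketch — production would use a library implementation such as imbalanced-learn; this only shows the mechanic of synthesising minority examples, applied to the training split alone:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Synthesise n_new minority-class examples by interpolating between
    each sampled minority point and one of its k nearest minority-class
    neighbours. Applied to the TRAINING set only -- the test set keeps
    the natural class distribution."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                       # idx[:, 0] is the point itself
    rows = rng.integers(0, len(X_min), n_new)
    neighbours = idx[rows, rng.integers(1, k + 1, n_new)]
    lam = rng.random((n_new, 1))                        # interpolation weight per sample
    return X_min[rows] + lam * (X_min[neighbours] - X_min[rows])
```

Every synthetic point lies on a segment between two real minority examples, so the technique fills the minority region rather than inventing values outside it.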
Decision 06 — Feature Importance Stability
Feature importance drift is monitored across retraining cycles — not just at deployment
SHAP is used as an inference-time explanation tool, but it also serves as a model stability signal across retraining cycles. After each pipeline run, the rank order of the top-10 SHAP features is compared against the previous production model's SHAP baseline using Spearman rank correlation. If the correlation drops below 0.70 — meaning the model has significantly restructured which features it relies on — this triggers a fourth type of drift alert (feature importance drift) that routes to HITL-10 before the new version is promoted, regardless of whether evaluation metrics improved. A model that achieves better F1 by learning different features is not necessarily a safer model — it may have found a spurious correlation that holds on the test set but not in production. This check is implemented as a step in the Vertex AI Pipeline between the XAI Gate and the HITL-11 node.
Decision 07 — SHAP Faithfulness Validation
SHAP explanations are tested for faithfulness — not assumed to be correct because the library produced them
SHAP values are only useful if they are faithful to the model's actual reasoning — if zeroing out a high-positive SHAP feature actually reduces the output in the predicted direction. The AE validates SHAP faithfulness using a perturbation test that runs as the XAI Gate step in the Vertex AI Pipeline, before HITL-11. For each model version, 200 held-out examples are selected. For each example, the top-3 positive SHAP features are individually zeroed out (replaced with the training set mean) and the model is re-run. Faithfulness is confirmed if the predicted probability decreases in at least 90% of cases where a positive-SHAP feature is zeroed. If faithfulness drops below this threshold, the pipeline fails at the XAI Gate — the model does not proceed to HITL-11 regardless of its evaluation metrics. A model with unfaithful explanations cannot satisfy EU AI Act Article 13 regardless of its accuracy.
Faithfulness Gate — Pipeline Step
Input: trained model + 200 held-out examples + SHAP values
Test: zero top-3 positive SHAP features per example → re-run model → check direction
Pass condition: output decreases in ≥ 90% of perturbations → proceed to HITL-11
Fail condition: faithfulness < 90% → pipeline fails at XAI Gate → model blocked from promotion
Applies to: RevRec AI · ContractGuard · Asset IQ RUL · FinRisk Sentinel
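The gate's perturbation loop can be sketched model-agnostically; here it is verified against a toy linear scorer, where exact SHAP values are known in closed form (all names are illustrative):

```python
import numpy as np

def faithfulness_rate(model_predict, X, shap_values, feature_means, top_k=3):
    """XAI Gate perturbation test: for each example, individually replace
    each of the top-k positive-SHAP features with the training-set mean,
    re-run the model, and check the predicted probability decreases.
    Returns the fraction of perturbations moving in the SHAP direction."""
    moved, total = 0, 0
    for i in range(len(X)):
        base = model_predict(X[i:i + 1])[0]
        top = np.argsort(-shap_values[i])[:top_k]
        for j in top:
            if shap_values[i][j] <= 0:
                continue                     # only positive attributions tested
            x_pert = X[i].copy()
            x_pert[j] = feature_means[j]
            total += 1
            if model_predict(x_pert[None, :])[0] < base:
                moved += 1
    return moved / max(total, 1)

# Gate semantics: pipeline fails at the XAI Gate if the rate is < 0.90.
```

For a linear model with zero-mean features, the attribution of feature j is exactly w_j * x_j, so every positive-attribution perturbation must reduce the output — a useful sanity case for the test harness itself.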
MLOps Pipeline

RevRec AI — Vertex AI Pipelines DAG with HITL promotion gate.

The Vertex AI Pipelines DAG for RevRec AI is the canonical MLOps pipeline for the AE. Every other model's pipeline follows the same structure with model-specific steps. The HITL-11 promotion gate is the step that makes this pipeline EU AI Act-compliant — no model version reaches production without a human reviewer approving the Model Card diff and evaluation results.

Vertex AI Pipelines DAG — RevRec AI · Production Training Pipeline
Data validation → Feature engineering → Training → Evaluation → XAI gate → HITL-11 promotion checkpoint → Deployment → Monitoring registration
VERTEX AI PIPELINES · KFP v2 SDK · scheduled weekly + on drift trigger · run ID linked to Model Registry version
Flow: Data Validation (TFX · Great Expectations) → Feature Engineering (Feature Store write · lineage) → XGBoost Training (Vertex Training Job) → Evaluation (F1 · ECE vs baseline model) → XAI Gate (SHAP baseline computed · stored) → HITL-11 Promotion Gate (ML Engineer reviews Model Card diff, eval + SHAP delta · Reject → return to training with feedback) → Deploy to Staging (shadow mode · A/B) → Promote to Prod (Vertex endpoint · Model Registry version · Model Card · HITL-11 record) → Monitoring Registration (drift job · alert config · HITL-10 link)
PIPELINE EXECUTION DETAILS: Trigger: weekly schedule + on drift alert from Vertex AI Monitoring · Average run time: ~45 minutes · Compute: n1-standard-8 · Training: A100 GPU (1hr budget) · Pipeline artifacts: all stored in GCS, linked to run ID, retained 90 days
HITL-11 GATE — EU AI ACT ARTICLE 9 COMPLIANCE: ML Engineer receives Model Card diff · eval metrics vs baseline · SHAP baseline comparison · bias analysis. Decision: Approve → staging deploy · Reject → return to training with annotated feedback. SLA: 48 hours · timeout: model stays in staging · audit record: immutable Firestore write
Drift Detection

Three drift types. Each with a designed response.

Model degradation in production is not a monitoring problem — it is an architecture problem. Drift detection is designed into the platform from day one: three types of drift, each with a detection method, an alert threshold, and a response that routes through HITL before any automated action executes.

Drift Type 01
Data Drift — Feature Distribution Shift
The statistical distribution of input features in production begins to diverge from the training distribution. For RevRec AI, this might be a shift in contract_value_eur distribution as ClaraVis moves upmarket. For Asset IQ, a new MRI model variant entering the fleet with different sensor baselines. Detected by comparing production feature distributions to a stored training baseline using Population Stability Index (PSI).
Detection method: PSI per feature · weekly computation
Alert threshold: PSI > 0.2 on any monitored feature
Vertex AI job: ModelMonitoringJob · feature_distribution
Response: Alert → HITL-10 retraining recommendation
Retraining trigger: Approved by HITL-10 → pipeline run
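A standard PSI computation against a quantile-binned training baseline can be sketched as:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between the training baseline (expected)
    and the production serving window (actual) for one feature. Bins are
    deciles of the baseline; open-ended edges catch out-of-range values.
    Common rule of thumb: PSI > 0.2 signals significant shift."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)   # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))
```

Running this per monitored feature on a weekly schedule reproduces the detection behaviour described above; any feature exceeding 0.2 raises the HITL-10 alert.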
Drift Type 02
Concept Drift — Prediction Distribution Shift
The relationship between features and labels changes over time — the model's predictions are no longer aligned with the ground truth even when features look similar to training data. For RevRec AI, this happens when Finance team override decisions cluster around a new contract type the model has not seen. Detected by monitoring the distribution of production predictions against the training prediction baseline, and by tracking HITL override rate.
Detection method: KL divergence on prediction distribution · HITL override rate
Alert threshold: KL divergence > 0.15 OR override rate > 15% in 30-day window
Vertex AI job: ModelMonitoringJob · prediction_drift
Response: Alert → HITL-10 retraining · override decisions added to training set
Special handling: override label dataset created for next retraining cycle
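The two-signal check (KL divergence on binned prediction confidences, OR'd with the HITL override rate) can be sketched as:

```python
import numpy as np

def concept_drift_alert(train_probs, prod_probs, override_rate,
                        kl_thresh=0.15, override_thresh=0.15, bins=10):
    """Concept-drift check: KL divergence of the production prediction
    distribution from the training-time baseline, combined with the
    30-day HITL override rate. Either signal alone raises the alert."""
    edges = np.linspace(0, 1, bins + 1)
    p = np.histogram(train_probs, bins=edges)[0] / len(train_probs)
    q = np.histogram(prod_probs, bins=edges)[0] / len(prod_probs)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)   # avoid log(0)
    kl = float(np.sum(q * np.log(q / p)))
    return kl, (kl > kl_thresh) or (override_rate > override_thresh)
```

The override-rate branch is what catches the case the prose describes: features look familiar, but the Finance team keeps disagreeing with the model.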
Drift Type 03
Performance Drift — Ground Truth Evaluation
Actual model performance metrics (F1, precision, recall) degrade when ground truth labels become available for production predictions. For RevRec AI, HITL override decisions serve as ground truth labels — when the Finance Controller overrides the model, that override is the true label for that transaction. For Asset IQ, actual failure events from service records confirm or refute predictions. Performance drift triggers an immediate retraining recommendation regardless of feature or prediction distribution metrics.
Detection method: Rolling 30-day F1 / Recall vs baseline on labelled subset
Alert threshold: Rolling F1 drops > 5% below baseline
Ground truth sources: HITL override decisions · actual failure events · Finance Controller corrections
Response: Immediate HITL-10 alert regardless of other drift metrics
Audit: All ground truth labels written to BigQuery ground_truth_labels dataset
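The ground-truth check itself is short; the sketch below assumes the 5% drop is measured relative to the promotion baseline (the prose does not specify relative vs absolute):

```python
from sklearn.metrics import f1_score

def performance_drift(y_true, y_pred, baseline_f1, max_drop=0.05):
    """Performance-drift check on the labelled subset -- HITL overrides
    and confirmed outcomes serve as ground truth. Alerts when the rolling
    30-day weighted F1 falls more than max_drop (relative) below the
    baseline recorded at promotion."""
    rolling_f1 = f1_score(y_true, y_pred, average="weighted")
    return float(rolling_f1), bool(rolling_f1 < baseline_f1 * (1 - max_drop))
```

Because the labels arrive from HITL decisions and service records rather than a batch relabelling effort, this check runs continuously on whatever labelled subset exists in the 30-day window.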
HITL-10 — Retraining Checkpoint (from Page 04 HITL Specification)
Every drift detection alert routes to HITL-10 before any retraining executes. The ML Engineer receives: drift metric, baseline vs current distribution chart, proposed retraining scope, estimated timeline, and the Model Card diff that the new version would produce. Decision: Approve retraining → triggers Vertex AI Pipeline run with the override label dataset included. Reject → model stays in production with a monitoring note. SLA: 24 hours. Timeout: model remains in production, alert escalated to ML Lead.
Model Cards

Five Model Cards. EU AI Act Article 11 satisfied.

Every AE model has a full Model Card — created before training begins, updated with actual evaluation results before promotion, and versioned alongside the model in Vertex AI Model Registry. The Model Card is the primary input to HITL-11 and the evidence package for EU AI Act Article 11 compliance. Full cards for all five models are shown below.

RevRec AI — ASC 606 Revenue Recognition Classifier
Model Card v2.1 · Vertex AI Model Registry: revrec-ai @ v2.1 · HITL-11 approved: 2026-02-14
EU AI Act — High Risk · Annex III
Intended Use
This model classifies ClaraVis MRI transaction contracts as SALE, LEASE, or MULTI-ELEMENT ARRANGEMENT under ASC 606 / IFRS 15. It is a decision-support tool — every classification routes through a Finance Controller human review checkpoint (HITL-04) before any downstream action. The model is not designed for, and must not be used for, tax classification, legal advice, or recognition decisions where human review has been bypassed.
Primary users: Finance Controller · CFO (via HITL-04 queue)
Deployment environment: ClaraVis GCP project · europe-west3 · VPC-SC perimeter
Data residency: All inference data stays in EU boundary. CMEK encryption.
Training Data
4,800 historical ClaraVis contracts (2019–2025), manually labelled by Finance team. Class distribution: 62% SALE · 28% LEASE · 10% MULTI-ELEMENT. HITL override decisions from previous production cycle added at each retraining. SMOTE oversampling applied to MULTI-ELEMENT class.
Training period: 2019-01 to 2025-12
Records: 4,800 contracts · 18 features per record
Label source: Finance team manual classification + HITL override history
Known gaps: Limited data for contract values above €5M. Performance degrades at upper tail.
Evaluation Results
Weighted F1: 0.94 (test set)
MULTI-ELEMENT Recall: 0.91
MULTI-ELEMENT Precision: 0.89
Expected Calibration Error: 0.032 (well-calibrated)
Baseline comparison: +0.03 F1 improvement over v2.0
HITL override rate (30-day): 8.2% (within threshold)
Known Limitations
Performance degrades for contract values above €5M — limited training data in this range. All such contracts are flagged for mandatory HITL review regardless of confidence score.
Model was trained on ClaraVis contracts only. Classification on contract structures from newly entered markets (e.g. APAC hospital procurement models) may show lower confidence until retraining with local contract data.
MULTI-ELEMENT arrangements with more than 3 performance obligations have lower precision (0.76) than the overall reported metric. Finance team has been briefed.
Model does not account for post-contract modification events. Amendments that change the recognition basis require a new classification run.
Bias Analysis
Bias evaluation conducted across customer_type (hospital tier), geographic region, and contract value tier. No significant performance disparity found across hospital tier 1–3. Tier 4 (small private clinics): F1 = 0.88 vs overall 0.94 — flagged for monitoring. Geographic performance: EU contracts F1 = 0.95, non-EU F1 = 0.89 (limited non-EU training data).
Hospital Tier 1–3: F1 = 0.94–0.96 · No disparity
Hospital Tier 4: F1 = 0.88 · Flagged for monitoring
EU contracts: F1 = 0.95
Non-EU contracts: F1 = 0.89 · Limited training data
XAI Contract & Compliance
SHAP TreeExplainer computes feature attributions synchronously at every inference. Top 5 features written to BigQuery shap_explanations table before HITL-04 is created. The Finance Controller sees: classification, confidence, SHAP chart, and 3 comparable historical transactions in the HITL approval UI.
EU AI Act Art. 11: ✓ Technical documentation complete
EU AI Act Art. 13: ✓ Transparency — SHAP per inference
EU AI Act Art. 14: ✓ HITL-04 mandatory for all classifications
HITL-11 approval: ✓ Approved 2026-02-14 by ML Lead
Model Card v1.3 · Asset IQ — RUL Regressor · EU AI Act — High Risk
Intended use: Decision-support for planned maintenance scheduling. Work orders above 0.82 confidence created autonomously. Below threshold: HITL-06.
Training data: 3yr telemetry · 8,400 unit-quarters · actual failure events as labels · censored survival data handled
MAE: 4.2 days · Recall @ 14d: 0.91
Key limitation: Trained on MRI-7T and MRI-3T variants. CT-Premium performance lower (F1 = 0.84). Separate model in development.
Bias analysis: No significant regional performance disparity. Older units (age > 8yr) show lower recall (0.85) — flagged.
EU AI Act: Art. 11 ✓ · Art. 13 ✓ (SHAP sensor attribution) · Art. 14 ✓ (HITL-06) · HITL-11 approved 2026-01-22
Model Card v1.1 · Asset IQ — Anomaly Detector · EU AI Act — High Risk
Intended use: Unit-level anomaly detection for early warning. Fleet anomaly patterns (≥ 3 units) trigger HITL-07 to VP Field Service.
Training data: 18 months normal operation · 6,200 unit-months · unsupervised (no failure labels required) · contamination: 0.05
Precision @ 0.75: 0.82 · False Positive Rate: 0.04
Key limitation: New sensor types from MRI-7T Gen 2 units not in training data. Alert threshold raised to 0.80 for Gen 2 units pending data collection.
Bias analysis: EMEA-North performance (FPR 0.03) vs APAC-East (FPR 0.07) — climate-driven sensor baseline differences. Regional baselines in roadmap.
EU AI Act: Art. 11 ✓ · Art. 13 ✓ (SHAP sensor) · Art. 14 ✓ (HITL-06/07) · HITL-11 approved 2026-01-22
Model Card v2.0 · ContractGuard — Clause Risk Scorer · EU AI Act: High Risk
Intended use: Clause-level risk pre-screening to prioritise Legal review. Clauses above 0.65 route to HITL-02. Does not replace legal judgment.
Training data: 12,400 labelled clauses · 4,800 contracts · Legal team labels · HITL decision history · Gemini text-embedding-004 semantic features
High-Risk Recall: 0.95 · AUC-ROC: 0.96 · FNR: 0.05
Key limitation: Limited training data for emerging AI-specific contract clauses (IP ownership of AI outputs, data training rights). Performance lower on these clause types.
Bias analysis: No significant disparity across counterparty type. Non-English contracts (via Gemini translation): precision 0.78 vs 0.82 English. Flagged.
EU AI Act: Art. 11 ✓ · Art. 13 ✓ (SHAP clause features) · Art. 14 ✓ (HITL-02/03) · HITL-11 approved 2026-01-30
Model Card v1.2 · FinRisk Sentinel — Anomaly Scorer · EU AI Act: High Risk
Intended use: Real-time financial anomaly pre-screening. Medium alerts (≥0.65): FC notification. High severity (≥0.85): HITL-08 simultaneous CFO + FC. Never acts autonomously on financial events.
Training data: 24 months transactions · 48,000 events · unsupervised · 3 separate models per event_type · contamination: 0.03
HITL Precision @ 0.85: 0.91 · False Positive Rate @ 0.85: 0.03
Key limitation: Trained on standard ClaraVis payment patterns. First year in a new market will have an elevated false positive rate until the baseline accumulates 90-day history.
Bias analysis: Performance consistent across account risk tiers 1–3. Tier 4 (small clinics with irregular payment patterns): FPR 0.08 vs overall 0.03. Separate baseline for Tier 4 in roadmap.
EU AI Act: Art. 11 ✓ · Art. 13 ✓ (SHAP financial features) · Art. 14 ✓ (HITL-08) · HITL-11 approved 2026-02-03
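The severity routing described in the FinRisk card reduces to a small threshold function. A sketch, with the function name and return shape as illustrative assumptions:

```python
def route_anomaly(score: float) -> dict:
    """Route a FinRisk anomaly score per the Model Card thresholds.

    Returns notification targets only: the model never acts
    autonomously on financial events.
    """
    if score >= 0.85:
        # High severity: HITL-08, CFO and Finance Controller simultaneously.
        return {"severity": "high", "checkpoint": "HITL-08",
                "notify": ["CFO", "Finance Controller"]}
    if score >= 0.65:
        # Medium alert: Finance Controller notification, no checkpoint.
        return {"severity": "medium", "checkpoint": None,
                "notify": ["Finance Controller"]}
    return {"severity": "none", "checkpoint": None, "notify": []}
```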
Architecture Decision Records

Three ML decisions. Every alternative documented.

ADR-010 through ADR-012 cover the key ML architecture choices. Each decision was made after evaluating alternatives, and the reasoning is documented here because it is exactly what a principal ML engineer will probe.

ADR-010
XGBoost over neural network for RevRec AI and ContractGuard
Neural networks (MLP, BERT fine-tuned for structured data) were evaluated for both classification tasks. Rejected for two reasons: (1) EU AI Act Article 13 requires transparency — XGBoost's TreeExplainer provides exact, deterministic SHAP values that are reproducible on demand for any past inference. Neural network SHAP (DeepExplainer or KernelExplainer) provides approximations that vary between runs — unacceptable for an immutable audit trail. (2) XGBoost performs competitively on tabular features with this dataset size and is less prone to catastrophic overfitting on the 4,800-contract training set. For ContractGuard's semantic features, XGBoost operates on Gemini text-embedding-004 embeddings — the semantic intelligence is captured in the embedding, the interpretability is preserved in the tree model.
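The determinism argument implies a concrete audit check: recomputing TreeExplainer SHAP values for any past inference must reproduce the stored values exactly. A sketch of that check (the helper name is hypothetical):

```python
def verify_shap_reproducibility(stored: dict[str, float],
                                recomputed: dict[str, float],
                                tol: float = 0.0) -> bool:
    """Audit-trail check for exact SHAP reproduction.

    TreeExplainer attributions are deterministic, so tol=0.0 is the
    correct tolerance for tree models. Approximate explainers
    (KernelExplainer, DeepExplainer) would force a nonzero tolerance,
    which is exactly why they were rejected for the audit trail.
    """
    if stored.keys() != recomputed.keys():
        return False
    return all(abs(stored[f] - recomputed[f]) <= tol for f in stored)
```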
Accepted · Phase ML Design
ADR-011
Isolation Forest over autoencoder for anomaly detection (Asset IQ + FinRisk)
Autoencoder-based anomaly detection was the initial design choice for both Asset IQ (unit-level) and FinRisk Sentinel. Rejected after evaluation because: (1) Autoencoders do not produce feature-level SHAP attributions without a computationally expensive KernelExplainer approximation — which is too slow for real-time financial event scoring. (2) Isolation Forest's decision tree structure is directly compatible with TreeExplainer SHAP, producing fast, exact, deterministic feature attributions per anomaly score. (3) Isolation Forest requires no labels — both Asset IQ and FinRisk operate in domains where labelled anomaly data is scarce and untrustworthy. The architecture decision also establishes a consistent anomaly detection pattern across both modules — same model class, same SHAP method, same monitoring approach — reducing platform complexity.
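The label-free fit with an explicit contamination prior can be shown in a few lines. This is a minimal sketch on synthetic telemetry, not the production feature set; the SHAP attribution step (TreeExplainer over the fitted forest) is omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic "normal operation" telemetry: no anomaly labels needed.
rng = np.random.default_rng(7)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 4))

# contamination sets the expected anomaly fraction (0.05 for Asset IQ),
# which fixes the decision threshold on the anomaly score.
model = IsolationForest(contamination=0.05, random_state=7).fit(normal)

# predict() returns -1 for anomalies, +1 for inliers.
spike = np.array([[8.0, -7.5, 9.0, -8.0]])  # far outside the baseline
print(model.predict(spike))  # prints [-1]
```

Because the fitted forest is a tree ensemble, the same TreeExplainer machinery used for the classifiers applies directly, keeping one SHAP method across all five models.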
Accepted · Phase ML Design
ADR-012
Vertex AI Pipelines over Kubeflow Pipelines (self-managed)
Self-managed Kubeflow Pipelines on GKE was evaluated as the MLOps infrastructure. Rejected because: (1) Vertex AI Pipelines is a managed service — no cluster provisioning, no Kubeflow version management, no infrastructure maintenance overhead. For a portfolio-scale system where the ML Engineer is also the architect and the developer, operational simplicity is a constraint. (2) Vertex AI Pipelines has native integration with Vertex AI Model Registry, Vertex AI Monitoring, and Cloud Build CI/CD — the HITL-11 promotion gate is implementable as a standard pipeline step using the Vertex AI Experiments SDK. (3) All pipeline artifacts (training data snapshots, model checkpoints, evaluation reports) are automatically stored in GCS with versioned URIs linked to pipeline run IDs — audit trail requirements are satisfied by the platform, not by custom code. The only cost difference is marginal at the portfolio's usage scale and is outweighed by the operational advantage.
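The HITL-11 promotion gate reduces to a presence check on the registry entry's attached artifacts. A pure-Python sketch of the gate logic; the field names are illustrative assumptions, and in production this would run as a pipeline step querying the Vertex AI Model Registry rather than a plain dict.

```python
def hitl11_gate(version: dict) -> bool:
    """Platform-level promotion gate (not model-specific configuration).

    A model version may move toward prod only when the HITL-11 approval
    record, Model Card, SHAP baseline, and training-pipeline lineage are
    all attached to its registry entry.
    """
    required = (
        "hitl11_approval",        # who approved, and when
        "model_card_uri",         # EU AI Act Art. 11 documentation
        "shap_baseline_uri",      # reference attributions for drift checks
        "training_pipeline_run_id",  # lineage back to the pipeline run
    )
    return all(version.get(field) for field in required)
```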
Accepted · Phase ML Design
Next in the Portfolio
ML platform designed.
Infrastructure follows.

The ML models on this page require a production-grade GCP infrastructure to run on. Page 07 designs that infrastructure — the Terraform IaC, VPC-SC security perimeter, GKE and Cloud Run topology, CI/CD pipeline, FinOps cost allocation, and GreenOps carbon-aware scheduling that make the entire AE system deployable, auditable, and operationally sound.

PG 07
Infrastructure & GCP Architecture
Terraform · VPC-SC · GKE · CI/CD · FinOps · GreenOps
In Design
PG 05
← Agent Swarm Architecture
The agents that call the models designed on this page