The Autonomous Enterprise / Page 06

ML Engineering
& MLOps
— designed for production.

The agent tool manifests on Page 05 reference five ML models by name. This page designs every one of them — feature engineering, model architecture, SHAP explanation contracts, Vertex AI Pipelines DAG, drift detection, and full Model Cards. Every model satisfies EU AI Act Article 11 documentation requirements before it ships.

5 ML Models · Vertex AI Pipelines · Feature Store · SHAP · XAI-first · Model Cards · EU AI Act Art. 11 · Drift Detection
ML Platform

Five models. One shared platform.

All five AE models share a common Vertex AI platform — one Feature Store, one Model Registry, one Pipelines infrastructure, one monitoring stack. The shared platform means that MLOps patterns proven on RevRec AI are inherited by every subsequent model. The HITL-11 promotion checkpoint is a platform-level gate, not a model-specific configuration.

Vertex AI ML Platform — Shared Infrastructure
Five models · one Feature Store · one Model Registry · one Pipelines execution environment · one monitoring stack
DATA SOURCES: Salesforce REST API · Pub/Sub Asset Events · BigQuery Transactions · Document AI Output · GCS Contract Store
VERTEX AI FEATURE STORE: Contract Features (18 features · online+offline) · Asset Features (24 features · streaming) · Financial Features (12 features · batch+streaming)
VERTEX AI PIPELINES: Data Validation → Training → Eval + XAI → HITL-11 Promotion gate → Deploy
VERTEX AI MODEL REGISTRY: 5 models · versioned Model Cards · stage: dev → staging → prod · each version linked to training pipeline run, HITL-11 approval record, SHAP baseline
VERTEX AI MONITORING: feature drift · prediction drift · 5 monitoring jobs (one per model) · alert → HITL-10 retraining checkpoint → Pub/Sub baseline-update topic
XAI / SHAP LAYER: SHAP computed at inference time · TreeExplainer or LinearExplainer per model · written to BigQuery shap_explanations before any downstream action · EU AI Act Art. 13
DEPLOYED MODELS (served via Vertex AI Endpoints): RevRec AI ASC 606 Classifier · Asset IQ RUL Regressor · Asset IQ Anomaly Detector · ContractGuard Risk Scorer · FinRisk Anomaly Scorer
All model predictions, feature values, and confidence scores flow to Vertex AI Monitoring → drift alerts → HITL-10
Feature Store

Every feature traceable to its source event.

The Vertex AI Feature Store is the single source of truth for all ML features across the AE. Features are computed once and served to all models that need them — no duplicated feature logic, no inconsistency between training and serving. Every feature value carries a lineage tag: source system, ingestion timestamp, schema version, and quality score.
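As a concrete sketch, the lineage tag can be modelled as a small record attached to every feature write. The schema below is illustrative — the field names and `tag_feature` helper are assumptions for this page, not the production Feature Store contract:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class FeatureLineage:
    """Lineage tag carried by every feature value (illustrative schema)."""
    source_system: str    # e.g. "salesforce", "pubsub-asset-events"
    ingested_at: str      # ISO-8601 ingestion timestamp
    schema_version: str   # version of the source schema at ingestion time
    quality_score: float  # 0.0-1.0 data-quality score from validation

def tag_feature(value, lineage: FeatureLineage) -> dict:
    """Wrap a feature value with its lineage metadata for a Feature Store write."""
    return {"value": value, "lineage": asdict(lineage)}

tagged = tag_feature(
    0.42,
    FeatureLineage(
        source_system="salesforce",
        ingested_at=datetime.now(timezone.utc).isoformat(),
        schema_version="v3",
        quality_score=0.98,
    ),
)
```

Every downstream consumer — training pipeline, serving path, audit query — can then trace a feature value back to its source event without a side channel.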

Contract Feature Group
18 features · Contract + Clause intelligence
liability_cap_ratio: Liability cap as ratio of total contract value · Float · computed at contract ingestion
governing_law_match: Boolean — governing law matches ClaraVis standard jurisdiction · computed by ContractGuard
indemnification_asymmetry: Ratio of ClaraVis vs counterparty indemnification obligations · Float 0–1
payment_term_days: Net payment days from invoice · Integer · from Salesforce Contract object
customer_type_encoded: Hospital tier (1–4), academic vs private, country risk · Integer encoded
contract_value_eur: Total contract value EUR · Float · from Salesforce
sku_complexity_score: Number of distinct SKUs · service component ratio · Float computed
recognition_type_prior: Prior ASC 606 classification for this customer · One-hot encoded
+ 10 additional clause features: IP ownership, termination for convenience, acceptance criteria, SLA penalty structure…
Online store: real-time inference · ContractGuard + RevRec AI
Offline store: training pipeline · batch ETL from Salesforce REST API
Asset Feature Group
24 features · Sensor + operational time-series
gradient_coil_temp_p95: 95th percentile gradient coil temperature over 30-day rolling window · Float · from DICOM telemetry
helium_level_slope: Rate of change of liquid helium level over 14-day window · Float · from service events
rf_power_deviation: RF transmit power deviation from baseline · Float · streaming from unit
scan_utilisation_rate: Daily scan hours as ratio of maximum rated capacity · Float
error_code_frequency: Count of each error code class per 7-day window · Integer vector
days_since_last_service: Calendar days since last planned maintenance · Integer
unit_age_months: Months since factory commissioning · Integer
cumulative_scan_count: Total scans since commissioning · Integer
+ 16 additional sensor features: Magnet bore pressure, cryocooler vibration, patient weight capacity utilisation…
Online store: real-time RUL + anomaly inference · 6 regional Pub/Sub pipelines
Offline store: training pipeline · 3-year historical telemetry in BigQuery
Financial Feature Group
12 features · Transaction + account financial signals
payment_zscore_90d: Z-score of payment amount vs 90-day rolling mean for this account · Float
days_overdue: Days past invoice due date at time of event · Integer
account_payment_consistency: Coefficient of variation of payment timing over 12 months · Float
revenue_posting_delta: Difference between expected and actual revenue posting amount · Float
warranty_reserve_movement: Change in warranty reserve for this account in 30-day window · Float
account_risk_tier: ClaraVis internal credit risk classification · Integer 1–5
+ 6 additional financial features: FX exposure ratio, multi-currency invoice flag, invoice dispute history…
Online store: real-time FinRisk anomaly scoring · BigQuery streaming inserts
Offline store: training pipeline · SAP transaction history + Salesforce account data
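To make the compute-once idea concrete, here is a minimal sketch of one financial feature, payment_zscore_90d, written as a single transform that both the offline and online paths would import (the single-account framing is a simplification; production code would group by account_id):

```python
import pandas as pd

def payment_zscore_90d(amounts: pd.Series) -> pd.Series:
    """Z-score of each payment vs the account's 90-day rolling mean.

    `amounts`: one account's payment amounts indexed by a DatetimeIndex.
    Because the same function serves training and serving, there is no
    train/serve divergence in the feature logic.
    """
    mean = amounts.rolling("90D").mean()
    std = amounts.rolling("90D").std()
    return (amounts - mean) / std
```

A payment far above the account's recent norm yields a large positive z-score, which is exactly the signal FinRisk Sentinel's anomaly scoring consumes.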
Model Specifications

Five models. Every design decision documented.

Each model specification covers the problem framing, feature inputs, architecture choice, training data, evaluation metrics, and SHAP explanation contract. Architecture choices are cross-referenced to ADRs — no undocumented decisions.

Model 01
RevRec AI — ASC 606 Revenue Recognition Classifier
Multi-class classification · 3 classes: SALE · LEASE · MULTI-ELEMENT
EU AI Act — High Risk · Annex III
Features (from Feature Store)
liability_cap_ratio · governing_law_match
indemnification_asymmetry · payment_term_days
customer_type_encoded · contract_value_eur
sku_complexity_score · recognition_type_prior
+ 10 clause features · 18 total
Architecture
XGBoost multi-class classifier · TreeExplainer for SHAP · deterministic outputs (critical for audit trail) · ADR-010
Training Data
4,800 historical ClaraVis contracts (2019–2025)
Labels: manual ASC 606 classifications by Finance team
Class distribution: 62% SALE · 28% LEASE · 10% MULTI-ELEMENT
Class imbalance handled: SMOTE oversampling on MULTI-ELEMENT
Train/val/test split: 70/15/15 · stratified by class
HITL override decisions added to training set each retraining cycle
Evaluation Metrics
Weighted F1: 0.94
MULTI-ELEMENT Recall: 0.91
MULTI-ELEMENT Precision: 0.89
Calibration Error (ECE): 0.032
Min confidence for auto: 0.70
Primary metric: MULTI-ELEMENT Recall. Missing a multi-element arrangement results in revenue over-recognition — the higher-risk error in a regulated context.
SHAP Explanation Contract
What the model must explain: Top 5 features driving the classification · directional effect (↑↓) · SHAP value · Feature value at inference time

To whom: Finance Controller via HITL-04 UI · Compliance Officer via audit query

When: Computed synchronously at inference · written to BigQuery shap_explanations before SAP write is initiated

Format: JSON array · feature_name · shap_value · feature_value · direction
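Assuming SHAP values have already been computed by the TreeExplainer, the contract's JSON format can be sketched as follows (the helper name is illustrative):

```python
import json
import numpy as np

def build_shap_payload(feature_names, feature_values, shap_values, top_k=5):
    """Build the per-inference explanation record per the SHAP contract:
    the top-k features by |SHAP value|, each with name, SHAP value, raw
    feature value, and directional effect. This record is written to the
    BigQuery shap_explanations table before the SAP write is initiated."""
    order = np.argsort(-np.abs(shap_values))[:top_k]
    payload = [
        {
            "feature_name": feature_names[i],
            "shap_value": float(shap_values[i]),
            "feature_value": float(feature_values[i]),
            "direction": "up" if shap_values[i] > 0 else "down",
        }
        for i in order
    ]
    return json.dumps(payload)
```

The Finance Controller's HITL-04 UI renders this array directly; the audit query path reads the same rows from BigQuery.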
Model 02
Asset IQ — RUL (Remaining Useful Life) Regressor
Regression · output: days_to_failure (continuous) + confidence interval
EU AI Act — High Risk · Annex III
Features (from Feature Store)
gradient_coil_temp_p95 · helium_level_slope
rf_power_deviation · scan_utilisation_rate
error_code_frequency (vector) · days_since_last_service
unit_age_months · cumulative_scan_count
+ 16 sensor features · 24 total
Architecture
Gradient Boosting Regressor (XGBoost) · TreeExplainer SHAP · quantile regression for confidence intervals (q10, q50, q90)
Training Data
3 years of field telemetry · 8,400 unit-quarters from 6 regional systems
Labels: actual days to failure from service records (post-hoc labelled)
Failure definition: unplanned service event requiring parts replacement
Censored data handled: survival analysis preprocessing for units still running
Train/val/test split: 70/15/15 · stratified by unit age bucket
Evaluation Metrics
MAE: 4.2 days
RMSE: 6.8 days
Precision @ 14-day horizon: 0.87
Recall @ 14-day horizon: 0.91
Min confidence for auto WO: 0.82
Primary metric: Recall @ 14-day horizon. Missed failures that lead to unplanned downtime are the higher-cost error.
SHAP Explanation Contract
What the model must explain: Top 3 sensor features driving the RUL prediction · current value vs 90-day baseline · SHAP contribution to days-reduction

To whom: Field Service Manager via HITL-06 UI · Field Engineer via work order brief

When: Computed synchronously at prediction · included in work order before dispatch

Format: Sensor name · current value · baseline value · SHAP days-reduction contribution
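The quantile-interval design can be sketched with scikit-learn's GradientBoostingRegressor standing in for the production XGBoost setup — one regressor per quantile, on synthetic data (all names and values here are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))                  # stand-in degradation signal
y = 100 - 8 * X[:, 0] + rng.normal(0, 5, size=400)     # synthetic days-to-failure

# One model per quantile: q50 is the RUL point estimate,
# q10-q90 form the reported confidence interval.
models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q,
                                 n_estimators=200, random_state=0).fit(X, y)
    for q in (0.10, 0.50, 0.90)
}

x_new = np.array([[5.0]])
q10, q50, q90 = (models[q].predict(x_new)[0] for q in (0.10, 0.50, 0.90))
```

The width of the q10–q90 band is what lets the 0.82 auto-dispatch confidence gate distinguish a tight prediction from a vague one.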
Model 03
Asset IQ — Unit-Level Anomaly Detector
Unsupervised anomaly detection · output: anomaly_score (0–1) + contributing sensors
EU AI Act — High Risk · Annex III
Features (from Feature Store)
All 24 asset features · same group as RUL model
Additionally: rolling 7-day feature deltas (rate of change)
Cross-unit deviation: unit vs fleet median per feature
48 effective input dimensions after delta computation
Architecture
Isolation Forest · unsupervised · no failure labels required · ADR-011 · SHAP via TreeExplainer on the underlying decision trees
Training Data
Normal operating data only (no failure labels required)
18 months of telemetry · 6,200 unit-months of normal operation
Contamination parameter: 0.05 (5% expected anomaly rate)
Separate model trained per unit model variant (MRI-7T, MRI-3T, CT-Premium)
Retrained quarterly or on drift detection trigger
Evaluation Metrics
Precision @ 0.75 threshold: 0.82
Recall @ 0.75 threshold: 0.78
False Positive Rate: 0.04
Fleet anomaly threshold: ≥ 3 units
Alert threshold (score): ≥ 0.75
Primary metric: False Positive Rate — FSM alert fatigue is the adoption risk. Precision over recall at the alert threshold.
SHAP Explanation Contract
What the model must explain: Top 3 sensors contributing to anomaly score · each with: current value, fleet median, deviation magnitude, SHAP contribution

To whom: Field Service Manager via HITL-06 · Field Engineer on work order

When: Computed at alert generation · included in HITL-06 presentation

Format: Sensor name · current · fleet_median · deviation · SHAP contribution
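A minimal sketch of the per-variant detector with scikit-learn, on synthetic telemetry. One detail worth pinning down: sklearn's score_samples is the negated anomaly score from the original Isolation Forest paper, so negating it recovers the 0–1 scale the 0.75 alert threshold refers to:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal_ops = rng.normal(0, 1, size=(1000, 4))   # stand-in for normal telemetry

# Trained on normal operating data only -- no failure labels required.
clf = IsolationForest(contamination=0.05, random_state=0).fit(normal_ops)

def anomaly_score(X):
    """Score in (0, 1]; higher = more anomalous. Alerts fire at >= 0.75."""
    return -clf.score_samples(X)

scores = anomaly_score(normal_ops)
outlier_score = anomaly_score(np.array([[8.0, -8.0, 8.0, -8.0]]))[0]
```

A separate instance of this model is fitted per unit variant (MRI-7T, MRI-3T, CT-Premium), since each variant has its own sensor baselines.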
Model 04
ContractGuard — Clause Risk Scorer
Binary classification per clause · output: risk_score (0–1) · threshold: 0.65 → HITL-02
EU AI Act — High Risk · Annex III
Features (computed per clause)
clause_type (200+ taxonomy · one-hot encoded)
liability_cap_ratio (clause-level, where applicable)
governing_law_match (Boolean)
indemnification_direction (ClaraVis-favourable vs counterparty-favourable)
deviates_from_standard (Boolean · vs ClaraVis standard terms)
contract_value_tier (bucketed · proxy for deal risk)
semantic_embedding (text-embedding-004 · 768-dim · from Gemini)
precedent_similarity_max (max cosine similarity to historical corpus)
Architecture
XGBoost binary classifier on structured features + semantic embedding · SHAP TreeExplainer · structured features interpretable, embedding via SHAP kernel approximation
Training Data
12,400 labelled clauses from 4,800 historical contracts
Labels: Legal team risk classifications (high-risk / standard)
Class distribution: 18% high-risk · 82% standard
Class imbalance: class_weight='balanced' in XGBoost
Gemini text-embedding-004 embeddings computed at training time · stored in Feature Store
HITL Legal decisions (approve/revise/escalate) added each retraining cycle
Evaluation Metrics
High-Risk Recall: 0.95
High-Risk Precision: 0.82
AUC-ROC: 0.96
False Negative Rate: 0.05
HITL threshold (risk score): ≥ 0.65
Primary metric: High-Risk Recall. Missing a high-risk clause has higher cost than a false positive that sends a standard clause to Legal review.
SHAP Explanation Contract
What the model must explain: Top 5 structured features driving risk score · semantic similarity score to highest-risk precedent · governing law contribution

To whom: General Counsel via HITL-02 UI · presented alongside clause text and precedents

When: Computed synchronously per flagged clause · written to Firestore clause_analysis collection

Format: Feature name · value · SHAP contribution · direction
Model 05
FinRisk Sentinel — Financial Anomaly Scorer
Unsupervised anomaly detection · output: anomaly_score (0–1) + Z-score vs baseline
EU AI Act — High Risk · Annex III
Features (from Feature Store)
All 12 financial features from Financial Feature Group
payment_zscore_90d · days_overdue · account_payment_consistency
revenue_posting_delta · warranty_reserve_movement
account_risk_tier · FX exposure ratio
Rolling 7-day and 30-day feature deltas
Architecture
Isolation Forest (same pattern as Asset IQ anomaly model · ADR-011) · contamination: 0.03 · trained per event_type (payment, posting, reserve) · real-time scoring via Vertex AI endpoint
Training Data
24 months of financial transaction history · 48,000 payment and posting events
Normal operating data only · fraud/anomaly labels not required for training
Separate models per event_type: payment · GL posting · warranty reserve
HITL false-positive feedback fed back via baseline update queue (Pub/Sub)
Retrained monthly or on drift detection trigger
Evaluation Metrics
Alert Precision @ 0.65: 0.78
HITL Precision @ 0.85: 0.91
False Positive Rate @ 0.85: 0.03
Alert threshold: ≥ 0.65
HITL threshold (high severity): ≥ 0.85
Primary metric: HITL Precision @ 0.85 — high-severity CFO alerts must be reliable. False positives at this tier erode trust rapidly.
SHAP Explanation Contract
What the model must explain: Top 3 financial features driving anomaly score · current value · 90-day baseline · Z-score · SHAP contribution

To whom: Finance Controller (medium) · CFO + FC simultaneously (high severity HITL-08)

When: Computed at alert generation · included in HITL-08 presentation alongside Z-score and entity context

Format: Feature name · current · baseline_90d · z_score · SHAP contribution
MLE Design Decisions

Seven questions a senior MLE will ask — answered in advance.

These are the gaps a principal ML engineer probes in a design review. Each decision below is documented because its absence would read as an oversight. None of these are afterthoughts — they shaped the design from the start.

Decision 01 — Threshold Selection
How confidence thresholds were chosen — not guessed
Every confidence threshold in this portfolio was selected by finding the operating point on the precision-recall curve where the business cost of a false negative equals the estimated cost of routing to a human HITL reviewer. For RevRec AI: a missed MULTI-ELEMENT classification costs an average of €18K in revenue restatement; a HITL routing costs approximately 20 minutes of Finance Controller time. The 0.70 threshold is the PR curve point where these costs are equal. For Asset IQ: a missed failure prediction costs an average of €42K in emergency dispatch + hospital disruption; a HITL routing costs 30 minutes of FSM time. The 0.82 threshold reflects that asymmetry — it is higher because the miss cost is higher. Thresholds are not fixed: they are re-evaluated at each model version promotion as part of the HITL-11 review, using the most recent HITL override cost data.
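The cost-equalised operating point can be sketched as a grid search over candidate thresholds. The helper name and synthetic error model below are illustrative — production uses the actual PR curve and measured HITL cost data:

```python
import numpy as np

def select_threshold(confidences, is_error, c_miss, c_review):
    """Pick the auto-approval confidence threshold that minimises expected
    cost: predictions below the threshold route to HITL review (c_review
    each); predictions at or above it auto-process, and any errors among
    them cost c_miss. Minimising total cost finds the point where the
    marginal miss cost and review cost balance."""
    candidates = np.linspace(0.50, 0.99, 50)
    costs = []
    for t in candidates:
        routed = confidences < t            # below threshold -> HITL review
        missed = (~routed) & is_error       # auto-processed errors
        costs.append(routed.sum() * c_review + missed.sum() * c_miss)
    return float(candidates[int(np.argmin(costs))])
```

The ratio of c_miss to c_review is what moves the optimum: a large miss cost (Asset IQ's €42K emergency dispatch) pushes the threshold up relative to a cheaper miss.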
Decision 02 — Train/Val/Test Split Strategy
Chronological splits for time-series models — no temporal leakage
RevRec AI and ContractGuard use stratified random splits (70/15/15) — the training examples are independent contracts with no temporal dependency. Asset IQ (RUL + Anomaly) and FinRisk Sentinel use chronological splits: train on the oldest 70% of the time window, validate on the next 15%, test on the most recent 15%. Stratified random splitting on time-series data would cause temporal leakage — the model would see future failure patterns during training that it should only discover at inference time. All split boundaries are hard date cuts, not sampled boundaries. The test set for time-series models is intentionally the most recent data — the distribution closest to the live production environment.
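A hard date-cut split can be sketched as follows (function name illustrative):

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, ts_col: str, frac=(0.70, 0.15, 0.15)):
    """Chronological split for time-series models: oldest 70% train, next
    15% validation, most recent 15% test. No shuffling -- rows never cross
    the date boundary, which prevents temporal leakage."""
    df = df.sort_values(ts_col)
    n = len(df)
    i_train = int(n * frac[0])
    i_val = i_train + int(n * frac[1])
    return df.iloc[:i_train], df.iloc[i_train:i_val], df.iloc[i_val:]
```

Because the input is sorted before cutting, the split is a pure function of the timestamps — re-running it on the same data always yields the same boundaries.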
Decision 03 — Train/Serve Skew Prevention
One feature transformation codebase — no duplication between training and serving
Train/serve skew — where feature transformation logic diverges between the offline training pipeline and the online serving path — is one of the most common causes of silent model degradation in production. In the AE, this is prevented structurally: feature transformation logic lives in a single shared Python module (ae_features) that is imported by both the Vertex AI Pipeline training step and the Cloud Run agent serving path. There is no duplicated transformation code. Additionally, the Vertex AI Feature Store's feature statistics (mean, std, percentile distribution) are compared between the offline training snapshot and the online serving window as part of the daily monitoring job. Any divergence above the PSI threshold triggers the data drift alert chain.
Decision 04 — Label Quality (ContractGuard)
Inter-annotator agreement validated before any clause enters the training set
The ContractGuard training labels come from the ClaraVis Legal team — and legal professionals disagree on clause risk classification. This is not assumed away. Inter-annotator agreement was computed using Cohen's Kappa across a stratified sample of 500 clauses labelled independently by three Legal team members. The result: κ = 0.74, indicating substantial agreement. Clauses with pairwise disagreement (where at least two reviewers disagreed) were excluded from the training set entirely — they were not resolved by majority vote, because majority vote on ambiguous clauses injects noise as signal. The 12,400-clause training set reflects only clauses where Legal team agreement was unanimous or where a designated senior counsel made a final determination. This approach accepts a smaller training set in exchange for higher label quality.
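The agreement computation can be sketched with scikit-learn's cohen_kappa_score, applied pairwise across the three annotators, together with the unanimity filter (helper names are illustrative):

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(labels: np.ndarray) -> float:
    """labels: (n_clauses, n_annotators) array of risk labels.
    Returns the mean pairwise Cohen's kappa across annotator pairs."""
    n_annot = labels.shape[1]
    scores = [cohen_kappa_score(labels[:, a], labels[:, b])
              for a, b in combinations(range(n_annot), 2)]
    return float(np.mean(scores))

def unanimous_only(labels: np.ndarray) -> np.ndarray:
    """Mask of clauses where all annotators agree. Disagreements are
    excluded rather than majority-voted, per the label-quality policy."""
    return (labels == labels[:, [0]]).all(axis=1)
```

Running the unanimity mask before training is what trades training-set size for label quality, as described above.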
Decision 05 — Class Imbalance Strategy
SMOTE for RevRec AI, class weighting for ContractGuard — different approaches for different imbalance severities
These are not interchangeable techniques applied inconsistently. RevRec AI's MULTI-ELEMENT class is 10% of the training set — at this severity, class weighting alone leaves the learner with too few minority examples to shape a stable decision boundary around multi-element arrangements. SMOTE oversampling on the training set only (never the validation or test sets) brings the minority class to 20% of the training data; the held-out sets keep the natural 10% distribution (~72 MULTI-ELEMENT examples in the 720-record test set), so reported recall reflects production conditions. ContractGuard's high-risk class is 18% — at this level, class_weight='balanced' in XGBoost is sufficient to adjust the decision boundary without introducing synthetic data that might not reflect real clause patterns. The choice between SMOTE and weighting is severity-dependent and documented explicitly to withstand scrutiny.
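A minimal SMOTE-style interpolation sketch — production would use a library implementation such as imbalanced-learn; this only shows the mechanic of synthesising minority examples, applied to the training split alone:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Synthesise n_new minority-class examples by interpolating between
    each sampled minority point and one of its k nearest minority-class
    neighbours. Applied to the TRAINING set only -- the test set keeps
    the natural class distribution."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                       # idx[:, 0] is the point itself
    rows = rng.integers(0, len(X_min), n_new)
    neighbours = idx[rows, rng.integers(1, k + 1, n_new)]
    lam = rng.random((n_new, 1))                        # interpolation weight per sample
    return X_min[rows] + lam * (X_min[neighbours] - X_min[rows])
```

Every synthetic point lies on a segment between two real minority examples, so the technique fills the minority region rather than inventing values outside it.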
Decision 06 — Feature Importance Stability
Feature importance drift is monitored across retraining cycles — not just at deployment
SHAP is used as an inference-time explanation tool, but it also serves as a model stability signal across retraining cycles. After each pipeline run, the rank order of the top-10 SHAP features is compared against the previous production model's SHAP baseline using Spearman rank correlation. If the correlation drops below 0.70 — meaning the model has significantly restructured which features it relies on — this triggers a fourth type of drift alert (feature importance drift) that routes to HITL-10 before the new version is promoted, regardless of whether evaluation metrics improved. A model that achieves better F1 by learning different features is not necessarily a safer model — it may have found a spurious correlation that holds on the test set but not in production. This check is implemented as a step in the Vertex AI Pipeline between the XAI Gate and the HITL-11 node.
Decision 07 — SHAP Faithfulness Validation
SHAP explanations are tested for faithfulness — not assumed to be correct because the library produced them
SHAP values are only useful if they are faithful to the model's actual reasoning — if zeroing out a high-positive SHAP feature actually reduces the output in the predicted direction. The AE validates SHAP faithfulness using a perturbation test that runs as the XAI Gate step in the Vertex AI Pipeline, before HITL-11. For each model version, 200 held-out examples are selected. For each example, the top-3 positive SHAP features are individually zeroed out (replaced with the training set mean) and the model is re-run. Faithfulness is confirmed if the predicted probability decreases in at least 90% of cases where a positive-SHAP feature is zeroed. If faithfulness drops below this threshold, the pipeline fails at the XAI Gate — the model does not proceed to HITL-11 regardless of its evaluation metrics. A model with unfaithful explanations cannot satisfy EU AI Act Article 13 regardless of its accuracy.
Faithfulness Gate — Pipeline Step
Input: trained model + 200 held-out examples + SHAP values
Test: zero top-3 positive SHAP features per example → re-run model → check direction
Pass condition: output decreases in ≥ 90% of perturbations → proceed to HITL-11
Fail condition: faithfulness < 90% → pipeline fails at XAI Gate → model blocked from promotion
Applies to: RevRec AI · ContractGuard · Asset IQ RUL · FinRisk Sentinel
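The gate's perturbation loop can be sketched model-agnostically; here it is verified against a toy linear scorer, where exact SHAP values are known in closed form (all names are illustrative):

```python
import numpy as np

def faithfulness_rate(model_predict, X, shap_values, feature_means, top_k=3):
    """XAI Gate perturbation test: for each example, individually replace
    each of the top-k positive-SHAP features with the training-set mean,
    re-run the model, and check the predicted probability decreases.
    Returns the fraction of perturbations moving in the SHAP direction."""
    moved, total = 0, 0
    for i in range(len(X)):
        base = model_predict(X[i:i + 1])[0]
        top = np.argsort(-shap_values[i])[:top_k]
        for j in top:
            if shap_values[i][j] <= 0:
                continue                     # only positive attributions tested
            x_pert = X[i].copy()
            x_pert[j] = feature_means[j]
            total += 1
            if model_predict(x_pert[None, :])[0] < base:
                moved += 1
    return moved / max(total, 1)

# Gate semantics: pipeline fails at the XAI Gate if the rate is < 0.90.
```

For a linear model with zero-mean features, the attribution of feature j is exactly w_j * x_j, so every positive-attribution perturbation must reduce the output — a useful sanity case for the test harness itself.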
MLOps Pipeline

RevRec AI — Vertex AI Pipelines DAG with HITL promotion gate.

The Vertex AI Pipelines DAG for RevRec AI is the canonical MLOps pipeline for the AE. Every other model's pipeline follows the same structure with model-specific steps. The HITL-11 promotion gate is the step that makes this pipeline EU AI Act-compliant — no model version reaches production without a human reviewer approving the Model Card diff and evaluation results.

Vertex AI Pipelines DAG — RevRec AI · Production Training Pipeline
Data validation → Feature engineering → Training → Evaluation → XAI gate → HITL-11 promotion checkpoint → Deployment → Monitoring registration
VERTEX AI PIPELINES · KFP v2 SDK · scheduled weekly + on drift trigger · run ID linked to Model Registry version
Flow: Data Validation (TFX · Great Expectations) → Feature Engineering (Feature Store write · lineage) → XGBoost Training (Vertex Training Job) → Evaluation (F1 · ECE vs baseline model) → XAI Gate (SHAP baseline computed · stored) → HITL-11 Promotion Gate (ML Engineer reviews Model Card diff, eval + SHAP delta · Reject → return to training with feedback) → Deploy to Staging (shadow mode · A/B) → Promote to Prod (Vertex endpoint · Model Registry version · Model Card · HITL-11 record) → Monitoring Registration (drift job · alert config · HITL-10 link)
PIPELINE EXECUTION DETAILS: Trigger: weekly schedule + on drift alert from Vertex AI Monitoring · Average run time: ~45 minutes · Compute: n1-standard-8 · Training: A100 GPU (1hr budget) · Pipeline artifacts: all stored in GCS, linked to run ID, retained 90 days
HITL-11 GATE — EU AI ACT ARTICLE 9 COMPLIANCE: ML Engineer receives Model Card diff · eval metrics vs baseline · SHAP baseline comparison · bias analysis. Decision: Approve → staging deploy · Reject → return to training with annotated feedback. SLA: 48 hours · timeout: model stays in staging · audit record: immutable Firestore write
Drift Detection

Three drift types. Each with a designed response.

Model degradation in production is not a monitoring problem — it is an architecture problem. Drift detection is designed into the platform from day one: three types of drift, each with a detection method, an alert threshold, and a response that routes through HITL before any automated action executes.

Drift Type 01
Data Drift — Feature Distribution Shift
The statistical distribution of input features in production begins to diverge from the training distribution. For RevRec AI, this might be a shift in contract_value_eur distribution as ClaraVis moves upmarket. For Asset IQ, a new MRI model variant entering the fleet with different sensor baselines. Detected by comparing production feature distributions to a stored training baseline using Population Stability Index (PSI).
Detection method: PSI per feature · weekly computation
Alert threshold: PSI > 0.2 on any monitored feature
Vertex AI job: ModelMonitoringJob · feature_distribution
Response: Alert → HITL-10 retraining recommendation
Retraining trigger: Approved by HITL-10 → pipeline run
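A standard PSI computation against a quantile-binned training baseline can be sketched as:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between the training baseline (expected)
    and the production serving window (actual) for one feature. Bins are
    deciles of the baseline; open-ended edges catch out-of-range values.
    Common rule of thumb: PSI > 0.2 signals significant shift."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)   # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))
```

Running this per monitored feature on a weekly schedule reproduces the detection behaviour described above; any feature exceeding 0.2 raises the HITL-10 alert.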
Drift Type 02
Concept Drift — Prediction Distribution Shift
The relationship between features and labels changes over time — the model's predictions are no longer aligned with the ground truth even when features look similar to training data. For RevRec AI, this happens when Finance team override decisions cluster around a new contract type the model has not seen. Detected by monitoring the distribution of production predictions against the training prediction baseline, and by tracking HITL override rate.
Detection method: KL divergence on prediction distribution · HITL override rate
Alert threshold: KL divergence > 0.15 OR override rate > 15% in 30-day window
Vertex AI job: ModelMonitoringJob · prediction_drift
Response: Alert → HITL-10 retraining · override decisions added to training set
Special handling: override label dataset created for next retraining cycle
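The two-signal check (KL divergence on binned prediction confidences, OR'd with the HITL override rate) can be sketched as:

```python
import numpy as np

def concept_drift_alert(train_probs, prod_probs, override_rate,
                        kl_thresh=0.15, override_thresh=0.15, bins=10):
    """Concept-drift check: KL divergence of the production prediction
    distribution from the training-time baseline, combined with the
    30-day HITL override rate. Either signal alone raises the alert."""
    edges = np.linspace(0, 1, bins + 1)
    p = np.histogram(train_probs, bins=edges)[0] / len(train_probs)
    q = np.histogram(prod_probs, bins=edges)[0] / len(prod_probs)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)   # avoid log(0)
    kl = float(np.sum(q * np.log(q / p)))
    return kl, (kl > kl_thresh) or (override_rate > override_thresh)
```

The override-rate branch is what catches the case the prose describes: features look familiar, but the Finance team keeps disagreeing with the model.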
Drift Type 03
Performance Drift — Ground Truth Evaluation
Actual model performance metrics (F1, precision, recall) degrade when ground truth labels become available for production predictions. For RevRec AI, HITL override decisions serve as ground truth labels — when the Finance Controller overrides the model, that override is the true label for that transaction. For Asset IQ, actual failure events from service records confirm or refute predictions. Performance drift triggers an immediate retraining recommendation regardless of feature or prediction distribution metrics.
Detection method: Rolling 30-day F1 / Recall vs baseline on labelled subset
Alert threshold: Rolling F1 drops > 5% below baseline
Ground truth sources: HITL override decisions · actual failure events · Finance Controller corrections
Response: Immediate HITL-10 alert regardless of other drift metrics
Audit: All ground truth labels written to BigQuery ground_truth_labels dataset
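The ground-truth check itself is short; the sketch below assumes the 5% drop is measured relative to the promotion baseline (the prose does not specify relative vs absolute):

```python
from sklearn.metrics import f1_score

def performance_drift(y_true, y_pred, baseline_f1, max_drop=0.05):
    """Performance-drift check on the labelled subset -- HITL overrides
    and confirmed outcomes serve as ground truth. Alerts when the rolling
    30-day weighted F1 falls more than max_drop (relative) below the
    baseline recorded at promotion."""
    rolling_f1 = f1_score(y_true, y_pred, average="weighted")
    return float(rolling_f1), bool(rolling_f1 < baseline_f1 * (1 - max_drop))
```

Because the labels arrive from HITL decisions and service records rather than a batch relabelling effort, this check runs continuously on whatever labelled subset exists in the 30-day window.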
HITL-10 — Retraining Checkpoint (from Page 04 HITL Specification)
Every drift detection alert routes to HITL-10 before any retraining executes. The ML Engineer receives: drift metric, baseline vs current distribution chart, proposed retraining scope, estimated timeline, and the Model Card diff that the new version would produce. Decision: Approve retraining → triggers Vertex AI Pipeline run with the override label dataset included. Reject → model stays in production with a monitoring note. SLA: 24 hours. Timeout: model remains in production, alert escalated to ML Lead.
Model Cards

Five Model Cards. EU AI Act Article 11 satisfied.

Every AE model has a full Model Card — created before training begins, updated with actual evaluation results before promotion, and versioned alongside the model in Vertex AI Model Registry. The Model Card is the primary input to HITL-11 and the evidence package for EU AI Act Article 11 compliance. Full cards for all five models are shown below.

RevRec AI — ASC 606 Revenue Recognition Classifier
Model Card v2.1 · Vertex AI Model Registry: revrec-ai @ v2.1 · HITL-11 approved: 2026-02-14
EU AI Act — High Risk · Annex III
Intended Use
This model classifies ClaraVis MRI transaction contracts as SALE, LEASE, or MULTI-ELEMENT ARRANGEMENT under ASC 606 / IFRS 15. It is a decision-support tool — every classification routes through a Finance Controller human review checkpoint (HITL-04) before any downstream action. The model is not designed for, and must not be used for, tax classification, legal advice, or recognition decisions where human review has been bypassed.
Primary users: Finance Controller · CFO (via HITL-04 queue)
Deployment environment: ClaraVis GCP project · europe-west3 · VPC-SC perimeter
Data residency: All inference data stays in EU boundary. CMEK encryption.
Training Data
4,800 historical ClaraVis contracts (2019–2025), manually labelled by Finance team. Class distribution: 62% SALE · 28% LEASE · 10% MULTI-ELEMENT. HITL override decisions from previous production cycle added at each retraining. SMOTE oversampling applied to MULTI-ELEMENT class.
Training period: 2019-01 to 2025-12
Records: 4,800 contracts · 18 features per record
Label source: Finance team manual classification + HITL override history
Known gaps: Limited data for contract values above €5M. Performance degrades at upper tail.
Evaluation Results
Weighted F1: 0.94 (test set)
MULTI-ELEMENT Recall: 0.91
MULTI-ELEMENT Precision: 0.89
Expected Calibration Error: 0.032 (well-calibrated)
Baseline comparison: +0.03 F1 improvement over v2.0
HITL override rate (30-day): 8.2% (within threshold)
Known Limitations
Performance degrades for contract values above €5M — limited training data in this range. All such contracts are flagged for mandatory HITL review regardless of confidence score.
Model was trained on ClaraVis contracts only. Classification on contract structures from newly entered markets (e.g. APAC hospital procurement models) may show lower confidence until retraining with local contract data.
MULTI-ELEMENT arrangements with more than 3 performance obligations have lower precision (0.76) than the overall reported metric. Finance team has been briefed.
Model does not account for post-contract modification events. Amendments that change the recognition basis require a new classification run.
Bias Analysis
Bias evaluation conducted across customer_type (hospital tier), geographic region, and contract value tier. No significant performance disparity found across hospital tier 1–3. Tier 4 (small private clinics): F1 = 0.88 vs overall 0.94 — flagged for monitoring. Geographic performance: EU contracts F1 = 0.95, non-EU F1 = 0.89 (limited non-EU training data).
Hospital Tier 1–3: F1 = 0.94–0.96 · No disparity
Hospital Tier 4: F1 = 0.88 · Flagged for monitoring
EU contracts: F1 = 0.95
Non-EU contracts: F1 = 0.89 · Limited training data
XAI Contract & Compliance
SHAP TreeExplainer computes feature attributions synchronously at every inference. Top 5 features written to BigQuery shap_explanations table before HITL-04 is created. The Finance Controller sees: classification, confidence, SHAP chart, and 3 comparable historical transactions in the HITL approval UI.
EU AI Act Art. 11: ✓ Technical documentation complete
EU AI Act Art. 13: ✓ Transparency — SHAP per inference
EU AI Act Art. 14: ✓ HITL-04 mandatory for all classifications
HITL-11 approval: ✓ Approved 2026-02-14 by ML Lead
Model Card v1.3 · Asset IQ — RUL Regressor · EU AI Act — High Risk
Intended use: Decision-support for planned maintenance scheduling. Work orders above 0.82 confidence created autonomously. Below threshold: HITL-06.
Training data: 3yr telemetry · 8,400 unit-quarters · actual failure events as labels · censored survival data handled
MAE: 4.2 days · Recall @ 14d: 0.91
Key limitation: Trained on MRI-7T and MRI-3T variants. CT-Premium performance lower (F1 = 0.84). Separate model in development.
Bias analysis: No significant regional performance disparity. Older units (age > 8yr) show lower recall (0.85) — flagged.
EU AI Act: Art. 11 ✓ · Art. 13 ✓ (SHAP sensor attribution) · Art. 14 ✓ (HITL-06) · HITL-11 approved 2026-01-22
Model Card v1.1 · Asset IQ — Anomaly Detector · EU AI Act — High Risk
Intended use: Unit-level anomaly detection for early warning. Fleet anomaly patterns (≥ 3 units) trigger HITL-07 to VP Field Service.
Training data: 18 months normal operation · 6,200 unit-months · unsupervised (no failure labels required) · contamination: 0.05
Precision @ 0.75: 0.82 · False Positive Rate: 0.04
Key limitation: New sensor types from MRI-7T Gen 2 units not in training data. Alert threshold raised to 0.80 for Gen 2 units pending data collection.
Bias analysis: EMEA-North performance (FPR 0.03) vs APAC-East (FPR 0.07) — climate-driven sensor baseline differences. Regional baselines in roadmap.
EU AI Act: Art. 11 ✓ · Art. 13 ✓ (SHAP sensor) · Art. 14 ✓ (HITL-06/07) · HITL-11 approved 2026-01-22
Model Card v2.0 · ContractGuard — Clause Risk Scorer · EU AI Act: High Risk
Intended use: Clause-level risk pre-screening to prioritise Legal review. Clauses above 0.65 route to HITL-02. Does not replace legal judgment.
Training data: 12,400 labelled clauses · 4,800 contracts · Legal team labels · HITL decision history · Gemini text-embedding-004 semantic features
High-Risk Recall: 0.95 · AUC-ROC: 0.96 · FNR: 0.05
Key limitation: Limited training data for emerging AI-specific contract clauses (IP ownership of AI outputs, data training rights). Performance lower on these clause types.
Bias analysis: No significant disparity across counterparty type. Non-English contracts (via Gemini translation): precision 0.78 vs 0.82 English. Flagged.
EU AI Act: Art. 11 ✓ · Art. 13 ✓ (SHAP clause features) · Art. 14 ✓ (HITL-02/03) · HITL-11 approved 2026-01-30
Model Card v1.2 · FinRisk Sentinel — Anomaly Scorer · EU AI Act: High Risk
Intended use: Real-time financial anomaly pre-screening. Medium alerts (≥0.65): FC notification. High severity (≥0.85): HITL-08 simultaneous CFO + FC. Never acts autonomously on financial events.
Training data: 24 months transactions · 48,000 events · unsupervised · 3 separate models per event_type · contamination: 0.03
HITL Precision @ 0.85: 0.91 · False Positive Rate @ 0.85: 0.03
Key limitation: Trained on standard ClaraVis payment patterns. First year in a new market will have an elevated false positive rate until the baseline accumulates 90-day history.
Bias analysis: Performance consistent across account risk tiers 1–3. Tier 4 (small clinics with irregular payment patterns): FPR 0.08 vs overall 0.03. Separate baseline for Tier 4 in roadmap.
EU AI Act: Art. 11 ✓ · Art. 13 ✓ (SHAP financial features) · Art. 14 ✓ (HITL-08) · HITL-11 approved 2026-02-03
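The severity routing described in the FinRisk card reduces to a small threshold function. A sketch, with the function name and return shape as illustrative assumptions:

```python
def route_anomaly(score: float) -> dict:
    """Route a FinRisk anomaly score per the Model Card thresholds.

    Returns notification targets only: the model never acts
    autonomously on financial events.
    """
    if score >= 0.85:
        # High severity: HITL-08, CFO and Finance Controller simultaneously.
        return {"severity": "high", "checkpoint": "HITL-08",
                "notify": ["CFO", "Finance Controller"]}
    if score >= 0.65:
        # Medium alert: Finance Controller notification, no checkpoint.
        return {"severity": "medium", "checkpoint": None,
                "notify": ["Finance Controller"]}
    return {"severity": "none", "checkpoint": None, "notify": []}
```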
Architecture Decision Records

Three ML decisions. Every alternative documented.

ADR-010 through ADR-012 cover the key ML architecture choices. Each decision was made after evaluating alternatives, and the reasoning is documented here because it is exactly what a principal ML engineer will probe.

ADR-010
XGBoost over neural network for RevRec AI and ContractGuard
Neural networks (MLP, BERT fine-tuned for structured data) were evaluated for both classification tasks. Rejected for two reasons: (1) EU AI Act Article 13 requires transparency — XGBoost's TreeExplainer provides exact, deterministic SHAP values that are reproducible on demand for any past inference. Neural network SHAP (DeepExplainer or KernelExplainer) provides approximations that vary between runs — unacceptable for an immutable audit trail. (2) XGBoost performs competitively on tabular features with this dataset size and is less prone to catastrophic overfitting on the 4,800-contract training set. For ContractGuard's semantic features, XGBoost operates on Gemini text-embedding-004 embeddings — the semantic intelligence is captured in the embedding, the interpretability is preserved in the tree model.
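The determinism argument implies a concrete audit check: recomputing TreeExplainer SHAP values for any past inference must reproduce the stored values exactly. A sketch of that check (the helper name is hypothetical):

```python
def verify_shap_reproducibility(stored: dict[str, float],
                                recomputed: dict[str, float],
                                tol: float = 0.0) -> bool:
    """Audit-trail check for exact SHAP reproduction.

    TreeExplainer attributions are deterministic, so tol=0.0 is the
    correct tolerance for tree models. Approximate explainers
    (KernelExplainer, DeepExplainer) would force a nonzero tolerance,
    which is exactly why they were rejected for the audit trail.
    """
    if stored.keys() != recomputed.keys():
        return False
    return all(abs(stored[f] - recomputed[f]) <= tol for f in stored)
```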
Accepted · Phase ML Design
ADR-011
Isolation Forest over autoencoder for anomaly detection (Asset IQ + FinRisk)
Autoencoder-based anomaly detection was the initial design choice for both Asset IQ (unit-level) and FinRisk Sentinel. Rejected after evaluation because: (1) Autoencoders do not produce feature-level SHAP attributions without a computationally expensive KernelExplainer approximation — which is too slow for real-time financial event scoring. (2) Isolation Forest's decision tree structure is directly compatible with TreeExplainer SHAP, producing fast, exact, deterministic feature attributions per anomaly score. (3) Isolation Forest requires no labels — both Asset IQ and FinRisk operate in domains where labelled anomaly data is scarce and untrustworthy. The architecture decision also establishes a consistent anomaly detection pattern across both modules — same model class, same SHAP method, same monitoring approach — reducing platform complexity.
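The label-free fit with an explicit contamination prior can be shown in a few lines. This is a minimal sketch on synthetic telemetry, not the production feature set; the SHAP attribution step (TreeExplainer over the fitted forest) is omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic "normal operation" telemetry: no anomaly labels needed.
rng = np.random.default_rng(7)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 4))

# contamination sets the expected anomaly fraction (0.05 for Asset IQ),
# which fixes the decision threshold on the anomaly score.
model = IsolationForest(contamination=0.05, random_state=7).fit(normal)

# predict() returns -1 for anomalies, +1 for inliers.
spike = np.array([[8.0, -7.5, 9.0, -8.0]])  # far outside the baseline
print(model.predict(spike))  # prints [-1]
```

Because the fitted forest is a tree ensemble, the same TreeExplainer machinery used for the classifiers applies directly, keeping one SHAP method across all five models.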
Accepted · Phase ML Design
ADR-012
Vertex AI Pipelines over Kubeflow Pipelines (self-managed)
Self-managed Kubeflow Pipelines on GKE was evaluated as the MLOps infrastructure. Rejected because: (1) Vertex AI Pipelines is a managed service — no cluster provisioning, no Kubeflow version management, no infrastructure maintenance overhead. For a portfolio-scale system where the ML Engineer is also the architect and the developer, operational simplicity is a constraint. (2) Vertex AI Pipelines has native integration with Vertex AI Model Registry, Vertex AI Monitoring, and Cloud Build CI/CD — the HITL-11 promotion gate is implementable as a standard pipeline step using the Vertex AI Experiments SDK. (3) All pipeline artifacts (training data snapshots, model checkpoints, evaluation reports) are automatically stored in GCS with versioned URIs linked to pipeline run IDs — audit trail requirements are satisfied by the platform, not by custom code. The only cost difference is marginal at the portfolio's usage scale and is outweighed by the operational advantage.
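The HITL-11 promotion gate reduces to a presence check on the registry entry's attached artifacts. A pure-Python sketch of the gate logic; the field names are illustrative assumptions, and in production this would run as a pipeline step querying the Vertex AI Model Registry rather than a plain dict.

```python
def hitl11_gate(version: dict) -> bool:
    """Platform-level promotion gate (not model-specific configuration).

    A model version may move toward prod only when the HITL-11 approval
    record, Model Card, SHAP baseline, and training-pipeline lineage are
    all attached to its registry entry.
    """
    required = (
        "hitl11_approval",        # who approved, and when
        "model_card_uri",         # EU AI Act Art. 11 documentation
        "shap_baseline_uri",      # reference attributions for drift checks
        "training_pipeline_run_id",  # lineage back to the pipeline run
    )
    return all(version.get(field) for field in required)
```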
Accepted · Phase ML Design
Next in the Portfolio
ML platform designed.
Infrastructure follows.

The ML models on this page require a production-grade GCP infrastructure to run on. Page 07 designs that infrastructure — the Terraform IaC, VPC-SC security perimeter, GKE and Cloud Run topology, CI/CD pipeline, FinOps cost allocation, and GreenOps carbon-aware scheduling that make the entire AE system deployable, auditable, and operationally sound.

PG 07
Infrastructure & GCP Architecture
Terraform · VPC-SC · GKE · CI/CD · FinOps · GreenOps
In Design
PG 05
← Agent Swarm Architecture
The agents that call the models designed on this page