Pages 01–09 design and deploy the AS Suite. This page answers what happens next: Phase H (Architecture Change Management) at the suite level. It covers cross-module SLO interactions, alert routing across all 11 HITL checkpoints, the model governance calendar, the on-call runbook, and the quarterly chaos engineering cadence that proves the system is as resilient as it is specified to be.
Page 07 defined per-module SLOs for infrastructure. This table extends those definitions to cover the full AS Suite post go-live — including the cross-module burn rate interactions that only become visible when all eight modules are running simultaneously. The audit trail SLO at 99.99% is the anchor that constrains all other error budget decisions.
| Module | Domain | SLI — what is measured | SLO Target | Error Budget (30d) | Burn Rate Alert | Cross-module interaction |
|---|---|---|---|---|---|---|
| CCAI Sales Agent | Commercial | % requests returning 2xx within 8s | 99.5% | 3.6 hours | 14.4× over 1h | Outage routes inbound inquiries to AE directly — graceful degradation with no downstream module dependency |
| ContractGuard | Commercial | % analyses completing without error within 120s | 99.0% | 7.2 hours | 6× over 6h | Outage: RevRec AI falls back to reading contract data directly from Salesforce — reduced feature richness, fully operational |
| RevRec AI | Financial | % classifications completing + SHAP generated within 10s | 99.9% | 43 minutes | 36× over 1h — page immediately | Any classification failure blocks the SAP write by design. FinRisk Sentinel continues independently — no dependency. Strategy Dashboard loses RevRec panel data. |
| FinRisk Sentinel | Financial | % financial events scored within 5 minutes of ingestion | 99.5% | 3.6 hours | 14.4× over 1h | Outage during streaming lag: anomaly alerts stop. CFO and Finance Controller must monitor BigQuery directly. No impact on other modules. |
| Asset IQ | Operations | % daily prediction runs completing within 2h window | 99.0% | 7.2 hours | 6× over 6h | Daily run failure: next scheduled run is the recovery. Fleet anomaly detection pauses. GreenOps loses Asset IQ batch jobs to schedule — no other cross-module impact. |
| GreenOps Platform | Operations | % batch ML jobs scheduled to optimal carbon window within ±6h | 95.0% | 36 hours | Non-critical — alert at 50% budget burn | Outage: batch ML training runs immediately without carbon deferral. Financial and operational SLOs unaffected. ESG reporting loses carbon savings data for the affected window. |
| Strategy Dashboard | Platform | % dashboard panels rendering with data <60s stale | 99.0% | 7.2 hours | 6× over 6h | Read-only. Dashboard outage has zero impact on any operational module. CTO and CFO fall back to direct BigQuery queries during recovery window. |
| HITL Framework | Platform | % HITL checkpoints created and presented within 60s of trigger | 99.9% | 43 minutes | 36× over 1h — page immediately | Critical shared dependency. HITL failure blocks: RevRec AI SAP writes, ContractGuard counter-proposals, Asset IQ work orders, FinRisk CFO alerts. All modules degrade simultaneously. P0 regardless of which HITL checkpoint fails. |
| Audit Trail | Platform | % agent actions with audit record committed within 2s | 99.99% | 4 minutes | Any failure = immediate page | The anchor SLO. A gap in the audit trail is a compliance event under EU AI Act Art. 12 — not an operational incident. All modules must halt non-essential operations until audit trail is restored. 4-minute error budget means there is effectively no tolerance for audit write failures. |
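The error budgets and burn-rate thresholds in the table follow from simple arithmetic over the 30-day window. The sketch below reproduces them, assuming that "burn rate" means the multiple of the budget-neutral error rate; the function names are illustrative and not part of the AS Suite.

```python
from datetime import timedelta

WINDOW = timedelta(days=30)  # SLO window used in the table


def error_budget(slo_target_pct: float, window: timedelta = WINDOW) -> timedelta:
    """Time the SLI may be out of target within the window."""
    return window * (1 - slo_target_pct / 100)


def budget_consumed(burn_rate: float, alert_window: timedelta,
                    window: timedelta = WINDOW) -> float:
    """Fraction of the full budget spent if burn_rate holds for alert_window."""
    return burn_rate * (alert_window / window)


# SLO targets taken from the table above.
for name, target in [("RevRec AI", 99.9), ("CCAI Sales Agent", 99.5),
                     ("GreenOps Platform", 95.0), ("Audit Trail", 99.99)]:
    minutes = error_budget(target).total_seconds() / 60
    print(f"{name}: {minutes:.1f} min of error budget per 30 days")

# Burn-rate alerts from the table, as a share of the 30-day budget.
for rate, hours in [(14.4, 1), (6, 6), (36, 1)]:
    share = budget_consumed(rate, timedelta(hours=hours))
    print(f"{rate}x over {hours}h spends {share:.0%} of the budget")
```

Read this way, the 14.4× over 1h alert fires once about 2% of the monthly budget is gone, while the 6× over 6h and 36× over 1h alerts each correspond to about 5%. The 36× alerts page immediately because a 43-minute budget would be exhausted in roughly 20 hours at that rate.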
A cross-module view of what triggers each alert priority, who receives it, what the expected response is, and the SLA for that response. This matrix covers both infrastructure alerts (from page 07) and HITL-specific operational alerts that only emerge once all modules are live simultaneously.
Model governance is not a deployment-time activity. Every model in the AS Suite has a recurring governance cadence — the scheduled events that keep each model compliant, well-calibrated, and trusted by the humans who approve its outputs. This calendar makes the governance posture operational, not aspirational.
| Model | Event | Cadence | Trigger condition | What is reviewed | Owner · Gate |
|---|---|---|---|---|---|
| RevRec AI ASC 606 Classifier | Drift check | Weekly | Scheduled · automatic | PSI per feature vs training baseline · KL divergence on prediction distribution · HITL override rate in 30-day window vs previous 30 days | ML Engineer · automated |
| RevRec AI ASC 606 Classifier | SHAP stability | Monthly | After each retraining cycle | Spearman rank correlation of top-10 SHAP features vs previous production model. Alert if ρ < 0.70: feature importance drift triggers HITL-10. | ML Engineer · HITL-10 |
| RevRec AI ASC 606 Classifier | Model Card update | Monthly | After HITL-11 promotion | Evaluation metrics vs previous version · bias analysis refresh · HITL override decision dataset incorporated · ECE recalculated | ML Lead · HITL-11 |
| RevRec AI ASC 606 Classifier | EU AI Act review | Quarterly | Scheduled · Q1/Q2/Q3/Q4 | Full Article 9 risk management review · Art. 13 SHAP faithfulness test · Art. 14 HITL checkpoint audit · Model Card completeness check | Compliance Officer · CCO sign-off |
| Asset IQ RUL Regressor | Drift check | Weekly | Scheduled · automatic | Sensor feature PSI vs training baseline · MAE on rolling 30-day labelled subset (confirmed failures) · Precision@14d tracking | ML Engineer · automated |
| Asset IQ RUL Regressor | Ground truth label review | Monthly | Scheduled · automatic | All confirmed failure events from the past month matched against predictions. False negatives (missed failures) flagged for training set inclusion. Censored data window updated. | ML Engineer · Field Service Lead |
| Asset IQ RUL Regressor | ISO 13485 DHR audit | Quarterly | Scheduled · Q1/Q2/Q3/Q4 | Device History Record completeness check: every work order generated by Asset IQ must have a traceable DHR event in BigQuery. Any gap is a regulatory finding. | Quality / Regulatory · QA Manager |
| Asset IQ Anomaly Isolation Forest | False positive rate | Weekly | Scheduled · automatic | FPR in production vs training baseline (0.04). Regional breakdown: APAC-East historically runs higher. Alert if any region exceeds 0.08. | ML Engineer · automated |
| Asset IQ Anomaly Isolation Forest | Contamination review | Quarterly | Scheduled | Review contamination parameter (currently 0.05) against observed anomaly rate in production. If production anomaly rate diverges by >2× from contamination setting, retrain with updated parameter. | ML Engineer · ML Lead |
| Asset IQ Anomaly Isolation Forest | Regional baseline update | Quarterly | On roadmap | Deploy separate baseline models per region to address EMEA-North vs APAC-East FPR disparity. Each regional model trained on regional telemetry only. | ML Engineer · Field Service Lead |
| ContractGuard Clause Risk Scorer | Legal label refresh | Monthly | | HITL Legal decisions (approve/revise/escalate) from the past month added to the training candidate set. Inter-annotator agreement re-computed on any new clause types. Clauses with disagreement excluded. | ML Engineer · General Counsel |
| ContractGuard Clause Risk Scorer | Non-English performance | Quarterly | Scheduled | High-Risk Recall and Precision for non-English contracts (currently Precision 0.78 vs 0.82 English). Track improvement trajectory as non-English training data accumulates from HITL decisions. | ML Engineer · Legal Lead |
| FinRisk Sentinel Anomaly Scorer | Baseline update | Monthly | | Retrain on rolling 24-month window. False positive feedback from HITL decisions incorporated via Pub/Sub baseline update queue. Per-event-type model refresh (payment, GL posting, warranty reserve). | ML Engineer · Finance Controller |
| FinRisk Sentinel Anomaly Scorer | Tier 4 FPR review | Quarterly | Scheduled | Small clinic (Tier 4) FPR currently 0.08 vs overall 0.03. Track whether accumulated Tier 4 HITL decisions are improving the baseline. Separate Tier 4 model on roadmap if FPR remains >0.06 after 2 quarters. | ML Engineer · Finance Controller |
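Two of the recurring checks in the calendar, PSI drift and SHAP rank stability, reduce to short formulas. The sketch below shows one way to compute them in plain Python. It assumes pre-binned feature proportions for PSI and the same tie-free feature set in both models for the Spearman check. The function names and feature names are hypothetical; only the 0.70 threshold comes from the calendar.

```python
import math

SHAP_RHO_THRESHOLD = 0.70  # from the governance calendar: alert if rho < 0.70


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Both inputs are lists of bin proportions that each sum to 1.
    eps guards against empty bins (an assumption of this sketch).
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


def shap_ranks(importances):
    """Map feature -> rank (1 = most important) from mean |SHAP| values."""
    ordered = sorted(importances, key=importances.get, reverse=True)
    return {feature: i + 1 for i, feature in enumerate(ordered)}


def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    d2 = sum((rank_a[f] - rank_b[f]) ** 2 for f in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))


def needs_hitl_10(prev_importances, new_importances):
    """True when feature-importance drift should raise the HITL-10 review."""
    rho = spearman_rho(shap_ranks(prev_importances), shap_ranks(new_importances))
    return rho < SHAP_RHO_THRESHOLD


# Hypothetical mean |SHAP| values for the previous and retrained model.
prev = {"contract_term": 0.9, "payment_schedule": 0.7, "discount": 0.4, "region": 0.1}
new = {"contract_term": 0.8, "payment_schedule": 0.3, "discount": 0.6, "region": 0.2}
print("PSI:", round(psi([0.5, 0.5], [0.6, 0.4]), 4))
print("HITL-10 needed:", needs_hitl_10(prev, new))
```

A production check would more likely use `scipy.stats.spearmanr`, which handles ties, and would need a rule for features that enter or leave the top 10 between model versions.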
The runbook below covers the most consequential incident type: a combined HITL framework + Audit Trail failure — the scenario where EU AI Act compliance exposure and operational failure occur simultaneously. Every step is specific, timed, and traceable. The runbook is a living artifact — updated after every incident retrospective.
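The stuck-checkpoint check used in this runbook looks for HITL events still in the PENDING state more than five minutes after creation. As a rough illustration, the same predicate can be written as a pure Python filter over already-fetched records. The record shape (dicts with `id`, `state`, and `created_at`) is an assumption of this sketch, and no Firestore client is involved.

```python
from datetime import datetime, timedelta, timezone

STUCK_AFTER = timedelta(minutes=5)  # mirrors the T-5min cutoff in the runbook query


def stuck_checkpoints(events, now, stuck_after=STUCK_AFTER):
    """Return events that are still PENDING and older than the cutoff."""
    cutoff = now - stuck_after
    return [
        e for e in events
        if e["state"] == "PENDING" and e["created_at"] < cutoff
    ]


# Hypothetical records, as they might be read from the hitl_events collection.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    {"id": "a", "state": "PENDING", "created_at": now - timedelta(minutes=9)},
    {"id": "b", "state": "PENDING", "created_at": now - timedelta(minutes=2)},
    {"id": "c", "state": "APPROVED", "created_at": now - timedelta(minutes=30)},
]
for event in stuck_checkpoints(events, now):
    print("stuck:", event["id"])
```

Keeping the predicate separate from the data source makes it straightforward to unit test and to reuse against an export of the collection during an incident retrospective.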
gcloud firestore query --collection hitl_events --filter "state=PENDING AND created_at<[T-5min]"
Page 07 defined six infrastructure-level chaos experiments. These six extend that programme to the suite level, testing cross-module failure propagation, the HITL framework under load, and model governance under adversarial conditions. Every experiment has a specific expected outcome. If the outcome does not match, the architecture has a gap.
ADR-017 covers the most consequential Day 2 design decision: how HITL SLA breaches are handled at the suite level when the Finance Controller is the single point of approval for all RevRec AI classifications.
Pages 01–10 constitute a complete enterprise architecture portfolio covering TOGAF Phases A through H: strategy, stakeholder analysis, architecture development, delivery planning, agent and ML design, infrastructure, adoption, suite index, and operational governance. Every decision is documented. Every claim traces to an artifact.