The Autonomous Seller / Page 10

Day 2 Operations: TOGAF Phase H

Pages 01–09 design and deploy the AS. This page answers what happens next: Phase H (Architecture Change Management) at the suite level. It covers cross-module SLO interactions, alert routing across all 11 HITL checkpoints, the model governance calendar, the on-call runbook, and the quarterly chaos engineering cadence that proves the system is as resilient as it is specified to be.

TOGAF Phase H · Cross-Module SLOs · 11 HITL Checkpoints · Model Governance Calendar · Chaos Engineering Cadence · ADR-017
Cross-Module SLOs

Eight modules. One operational posture.

Page 07 defined per-module SLOs for infrastructure. This table extends those definitions to cover the full AS Suite post go-live — including the cross-module burn rate interactions that only become visible when all eight modules are running simultaneously. The audit trail SLO at 99.99% is the anchor that constrains all other error budget decisions.

| Module | Domain | SLI (what is measured) | SLO Target | Error Budget (30d) | Burn Rate Alert | Cross-module interaction |
|---|---|---|---|---|---|---|
| CCAI Sales Agent | Commercial | % requests returning 2xx within 8s | 99.5% | 3.6 hours | 14.4× over 1h | Outage routes inbound inquiries to AE directly: graceful degradation with no downstream module dependency |
| ContractGuard | Commercial | % analyses completing without error within 120s | 99.0% | 7.2 hours | 6× over 6h | Outage: RevRec AI falls back to reading contract data directly from Salesforce. Reduced feature richness, fully operational |
| RevRec AI | Financial | % classifications completing + SHAP generated within 10s | 99.9% | 43 minutes | 36× over 1h (page immediately) | Any classification failure blocks the SAP write by design. FinRisk Sentinel continues independently (no dependency). Strategy Dashboard loses RevRec panel data. |
| FinRisk Sentinel | Financial | % financial events scored within 5 minutes of ingestion | 99.5% | 3.6 hours | 14.4× over 1h | Outage during streaming lag: anomaly alerts stop. CFO and Finance Controller must monitor BigQuery directly. No impact on other modules. |
| Asset IQ | Operations | % daily prediction runs completing within 2h window | 99.0% | 7.2 hours | 6× over 6h | Daily run failure: next scheduled run is the recovery. Fleet anomaly detection pauses. GreenOps loses Asset IQ batch jobs to schedule. No other cross-module impact. |
| GreenOps | Platform Operations | % batch ML jobs scheduled to optimal carbon window within ±6h | 95.0% | 36 hours | Non-critical (alert at 50% budget burn) | Outage: batch ML training runs immediately without carbon deferral. Financial and operational SLOs unaffected. ESG reporting loses carbon savings data for the affected window. |
| Strategy Dashboard | Platform | % dashboard panels rendering with data <60s stale | 99.0% | 7.2 hours | 6× over 6h | Read-only. Dashboard outage has zero impact on any operational module. CTO and CFO fall back to direct BigQuery queries during recovery window. |
| HITL Framework | Platform | % HITL checkpoints created and presented within 60s of trigger | 99.9% | 43 minutes | 36× over 1h (page immediately) | Critical shared dependency. HITL failure blocks: RevRec AI SAP writes, ContractGuard counter-proposals, Asset IQ work orders, FinRisk CFO alerts. All modules degrade simultaneously. P0 regardless of which HITL checkpoint fails. |
| Audit Trail | Platform | % agent actions with audit record committed within 2s | 99.99% | 4 minutes | Any failure = immediate page | The anchor SLO. A gap in the audit trail is a compliance event under EU AI Act Art. 12, not an operational incident. All modules must halt non-essential operations until audit trail is restored. 4-minute error budget means there is effectively no tolerance for audit write failures. |
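The error budgets and burn-rate thresholds in the table follow from simple arithmetic over the 30-day window. The sketch below is a minimal illustration, assuming a 30-day (720-hour) window and the usual convention that a burn rate of 1× consumes exactly the whole budget over the window. The function names are illustrative and not part of the AS codebase.

```python
WINDOW_DAYS = 30  # error budget window used in the SLO table


def error_budget_minutes(slo_target: float, window_days: int = WINDOW_DAYS) -> float:
    """Minutes of SLO violation allowed in the window for a target such as 0.999."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_fraction_burned(burn_rate: float, alert_hours: float,
                           window_days: int = WINDOW_DAYS) -> float:
    """Fraction of the window's budget consumed if burn_rate holds for alert_hours."""
    return burn_rate * alert_hours / (window_days * 24)


if __name__ == "__main__":
    for name, target in [("RevRec AI", 0.999), ("Audit Trail", 0.9999)]:
        print(f"{name}: {error_budget_minutes(target):.1f} min budget")
    print(f"14.4x over 1h burns {budget_fraction_burned(14.4, 1):.0%} of the budget")
```

Under these assumptions a 99.9% target allows about 43 minutes and a 99.99% target about 4.3 minutes per 30 days. A 14.4× burn sustained for 1 hour consumes 2% of the budget, while 6× over 6 hours and 36× over 1 hour each consume 5%.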
Critical cross-module burn rate interaction
If HITL Framework and Audit Trail burn simultaneously — the two platform SLOs with 36× and immediate page thresholds — the on-call responder must treat this as a P0 multi-module incident regardless of whether individual module SLOs are within budget. A HITL failure that also produces audit write gaps is a regulatory exposure, not just an operational one. The incident commander role activates automatically under this condition and cannot be delegated below Staff Engineer level.
Alert Routing Matrix

Every HITL breach. Every SLO burn. One routing decision.

A cross-module view of what triggers each alert priority, who receives it, what the expected response is, and the SLA for that response. This matrix covers both infrastructure alerts (from page 07) and HITL-specific operational alerts that only emerge once all modules are live simultaneously.

P0 — Immediate page
Compliance-affecting failure
· Audit trail write failure (any module)
· HITL framework creation failure (any checkpoint)
· RevRec AI HITL creation failure — SAP write blocked
· VPC-SC perimeter breach attempt
· SCC CRITICAL security finding
· HITL Framework + Audit Trail simultaneous burn
Receives: On-call SRE + ML Lead + Incident Commander (Staff Eng+)
SLA: Acknowledge within 5 minutes · incident declared within 10
EU AI Act: Any P0 involving audit or HITL must be logged as a compliance incident — documented in Firestore incident record before resolution
P1 — Page within 15 min
Business-critical degradation
· HITL SLA breach across ≥2 modules simultaneously
· FinRisk Sentinel streaming lag >10 minutes
· Circuit breaker open on Orchestrator (all agents affected)
· RevRec AI error budget >50% burned in <6h
· HITL override rate spike >30% in any module (30-min window)
· Asset IQ daily run missed by >2 hours
Receives: On-call SRE + module owner
SLA: Acknowledge within 15 minutes · root cause within 1 hour
Note: HITL override rate spike may indicate model drift before Vertex AI Monitoring detects it — route immediately to ML Lead even if infrastructure is healthy
P2 — Notify within 1 hour
Elevated risk — monitor closely
· Error budget burn rate >14.4× on any module (1h window)
· SCC HIGH security finding
· Model drift alert from Vertex AI Monitoring (any model)
· HITL override rate >15% sustained over 30-day window
· CMEK key rotation failure
· Budget alert at 80% of monthly allocation
Receives: On-call SRE (notification only, no page)
SLA: Acknowledge within 1 hour · remediation plan within 4 hours
HITL override rate: If sustained above 15% → trigger HITL-10 retraining review regardless of whether drift detection has fired
P3 — Next business day
Operational awareness items
· Budget alert at 50% of monthly allocation
· GreenOps scheduling missed carbon window (>25% of jobs)
· CCAI Sales Agent elevated P99 latency (>6s)
· SCC MEDIUM finding
· Data Governance quarantine rate >2% of daily ingestion
· Feature Store PSI warning (>0.1, below alert threshold of 0.2)
Receives: Slack notification to #as-ops channel
SLA: Review at next standup · remediation within 5 business days
Data Governance quarantine rate: Early warning for schema drift — review before it triggers the PSI alert threshold
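The PSI warning (>0.1) and alert (>0.2) thresholds referenced in the matrix can be made concrete with a short calculation. This is a minimal sketch of the standard Population Stability Index over pre-binned proportions. The binning scheme, the `eps` floor for empty bins, and the function names are assumptions for illustration, not the Feature Store's actual implementation.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two per-bin proportion vectors."""
    if len(expected) != len(actual):
        raise ValueError("both distributions must use the same bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # floor empty bins to avoid log(0)
        total += (a - e) * math.log(a / e)
    return total


def psi_status(value: float, warn: float = 0.1, alert: float = 0.2) -> str:
    """Map a PSI value to the P3 warning / drift alert bands from the matrix."""
    if value > alert:
        return "alert"
    if value > warn:
        return "warning"
    return "ok"


baseline = [0.25, 0.25, 0.25, 0.25]   # training-time bin proportions (made up)
production = [0.10, 0.20, 0.30, 0.40]  # shifted production proportions (made up)
print(psi_status(psi(baseline, production)))
```

Values between the two thresholds would surface as the P3 Feature Store warning. Values above 0.2 correspond to the drift alert.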
Model Governance Calendar

Five models. One governance cadence.

Model governance is not a deployment-time activity. Every model in the AS Suite has a recurring governance cadence — the scheduled events that keep each model compliant, well-calibrated, and trusted by the humans who approve its outputs. This calendar makes the governance posture operational, not aspirational.

| Model | Event | Cadence | Trigger condition | What is reviewed | Owner · Gate |
|---|---|---|---|---|---|
| RevRec AI (ASC 606 Classifier) | Drift check | Weekly | Scheduled · automatic | PSI per feature vs training baseline · KL divergence on prediction distribution · HITL override rate in 30-day window vs previous 30 days | ML Engineer · automated |
| | SHAP stability | Monthly | After each retraining cycle | Spearman rank correlation of top-10 SHAP features vs previous production model. Alert if ρ < 0.70: this indicates feature importance drift and triggers HITL-10. | ML Engineer · HITL-10 |
| | Model Card update | Monthly | After HITL-11 promotion | Evaluation metrics vs previous version · bias analysis refresh · HITL override decision dataset incorporated · ECE recalculated | ML Lead · HITL-11 |
| | EU AI Act review | Quarterly | Scheduled · Q1/Q2/Q3/Q4 | Full Article 9 risk management review · Art. 13 SHAP faithfulness test · Art. 14 HITL checkpoint audit · Model Card completeness check | Compliance Officer · CCO sign-off |
| Asset IQ RUL (RUL Regressor) | Drift check | Weekly | Scheduled · automatic | Sensor feature PSI vs training baseline · MAE on rolling 30-day labelled subset (confirmed failures) · Precision@14d tracking | ML Engineer · automated |
| | Ground truth label review | Monthly | Scheduled · automatic | All confirmed failure events from the past month matched against predictions. False negatives (missed failures) flagged for training set inclusion. Censored data window updated. | ML Engineer · Field Service Lead |
| | ISO 13485 DHR audit | Quarterly | Scheduled · Q1/Q2/Q3/Q4 | Device History Record completeness check: every work order generated by Asset IQ must have a traceable DHR event in BigQuery. Any gap is a regulatory finding. | Quality / Regulatory · QA Manager |
| Asset IQ Anomaly (Isolation Forest) | False positive rate | Weekly | Scheduled · automatic | FPR in production vs training baseline (0.04). Regional breakdown: APAC-East historically runs higher. Alert if any region exceeds 0.08. | ML Engineer · automated |
| | Contamination review | Quarterly | Scheduled | Review contamination parameter (currently 0.05) against observed anomaly rate in production. If production anomaly rate diverges by >2× from contamination setting, retrain with updated parameter. | ML Engineer · ML Lead |
| | Regional baseline update | Quarterly | On roadmap | Deploy separate baseline models per region to address EMEA-North vs APAC-East FPR disparity. Each regional model trained on regional telemetry only. | ML Engineer · Field Service Lead |
| ContractGuard (Clause Risk Scorer) | Legal label refresh | Monthly | | HITL Legal decisions (approve/revise/escalate) from the past month added to the training candidate set. Inter-annotator agreement re-computed on any new clause types. Clauses with disagreement excluded. | ML Engineer · General Counsel |
| | Non-English performance | Quarterly | Scheduled | High-Risk Recall and Precision for non-English contracts (currently Precision 0.78 vs 0.82 English). Track improvement trajectory as non-English training data accumulates from HITL decisions. | ML Engineer · Legal Lead |
| FinRisk Sentinel (Anomaly Scorer) | Baseline update | Monthly | | Retrain on rolling 24-month window. False positive feedback from HITL decisions incorporated via Pub/Sub baseline update queue. Per-event-type model refresh (payment, GL posting, warranty reserve). | ML Engineer · Finance Controller |
| | Tier 4 FPR review | Quarterly | Scheduled | Small clinic (Tier 4) FPR currently 0.08 vs overall 0.03. Track whether accumulated Tier 4 HITL decisions are improving the baseline. Separate Tier 4 model on roadmap if FPR remains >0.06 after 2 quarters. | ML Engineer · Finance Controller |
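The monthly SHAP stability check for RevRec AI compares feature-importance rankings between model versions. The sketch below computes Spearman's ρ from two importance orderings and applies the 0.70 threshold from the calendar. It assumes both models rank the same ten features with no ties. How the production check handles features that enter or leave the top 10 is not specified here, and the feature names are placeholders.

```python
from typing import Sequence


def spearman_rho(prev_order: Sequence[str], new_order: Sequence[str]) -> float:
    """Spearman rank correlation between two orderings of the same features (no ties)."""
    if (len(set(prev_order)) != len(prev_order)
            or len(new_order) != len(prev_order)
            or set(prev_order) != set(new_order)):
        raise ValueError("orderings must contain the same distinct features")
    n = len(prev_order)
    if n < 2:
        raise ValueError("need at least two features")
    new_rank = {feature: rank for rank, feature in enumerate(new_order)}
    d_squared = sum((rank - new_rank[f]) ** 2 for rank, f in enumerate(prev_order))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


def shap_rank_drifted(rho: float, threshold: float = 0.70) -> bool:
    """True when rank correlation falls below the HITL-10 trigger threshold."""
    return rho < threshold


previous = [f"feature_{i}" for i in range(10)]  # placeholder names, most important first
candidate = list(reversed(previous))            # worst case: ranking fully inverted
print(spearman_rho(previous, candidate), shap_rank_drifted(spearman_rho(previous, candidate)))
```

A fully inverted ranking gives ρ = -1 and would trigger the review, while swapping two adjacent features barely moves ρ away from 1.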
On-Call Runbook

A P0 fires at 2am. Exactly what happens next.

The runbook below covers the most consequential incident type: a combined HITL framework + Audit Trail failure — the scenario where EU AI Act compliance exposure and operational failure occur simultaneously. Every step is specific, timed, and traceable. The runbook is a living artifact — updated after every incident retrospective.

01
Page fires — acknowledge within 5 minutes
PagerDuty alert fires simultaneously to On-Call SRE, ML Lead, and Incident Commander (Staff Engineer or above). Acknowledge in PagerDuty within 5 minutes — this timestamps the incident start for EU AI Act compliance documentation. If acknowledgement does not occur within 5 minutes, PagerDuty escalates to the Engineering Manager automatically.
SLA: 5 minutes to acknowledge · EU AI Act: incident clock starts at page time, not acknowledgement time
02
Open incident record in Firestore before any remediation action
Before touching any infrastructure, the on-call SRE opens an incident record in the Firestore incident_records collection with: incident_id, page_timestamp, acknowledger, initial_symptoms. This is not optional — under EU AI Act Article 12, any gap in system operation that affects the audit trail must itself be documented. The incident record is the documentation of the gap. Write it first.
SLA: Incident record must exist before any terraform, gcloud, or kubectl command is executed in production
03
Establish blast radius — which HITL checkpoints are currently blocked?
Query Firestore for any HITL checkpoints in PENDING state that have not transitioned in the past 5 minutes. These are checkpoints that were triggered but the HITL framework could not present to the reviewer. Each one represents a decision that is blocked. List them: checkpoint ID, module, persona, SLA deadline. If any HITL-04 (RevRec AI) checkpoint is blocked, escalate immediately to Finance Controller — they need to know that a SAP write is pending human approval that cannot be presented to them through the system.
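Query (pseudocode; run it through a Firestore client library or the console, since gcloud firestore manages databases, indexes, and import/export rather than document queries): collection hitl_events, filter state = PENDING AND created_at < [T-5min]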
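The blast-radius check in this step reduces to a filter over checkpoint records. The sketch below runs that filter over in-memory dictionaries so that it stays self-contained. In production the same predicate would be expressed as a query through a Firestore client library. The `state` and `created_at` fields come from the query above, while `checkpoint_id`, `checkpoint_type`, `module`, and the `APPROVED` state are hypothetical names used for illustration.

```python
from datetime import datetime, timedelta, timezone


def stale_pending(events, now, max_age=timedelta(minutes=5)):
    """Return PENDING checkpoints created more than max_age before now."""
    cutoff = now - max_age
    return [e for e in events if e["state"] == "PENDING" and e["created_at"] < cutoff]


def needs_finance_controller_escalation(blocked):
    """True if any blocked checkpoint is a HITL-04 (RevRec AI) review."""
    return any(e["checkpoint_type"] == "HITL-04" for e in blocked)


now = datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc)
events = [
    {"checkpoint_id": "c1", "checkpoint_type": "HITL-04", "module": "RevRec AI",
     "state": "PENDING", "created_at": now - timedelta(minutes=9)},
    {"checkpoint_id": "c2", "checkpoint_type": "other", "module": "Asset IQ",
     "state": "PENDING", "created_at": now - timedelta(minutes=2)},
    {"checkpoint_id": "c3", "checkpoint_type": "other", "module": "ContractGuard",
     "state": "APPROVED", "created_at": now - timedelta(minutes=30)},
]
blocked = stale_pending(events, now)
print([e["checkpoint_id"] for e in blocked], needs_finance_controller_escalation(blocked))
```

Only the checkpoint that is both PENDING and older than five minutes counts as blocked. Recently created checkpoints may still be presented normally.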
04
Check Security Command Center before touching infrastructure
Open SCC before any remediation. If the HITL or Audit Trail failure was triggered by a VPC-SC perimeter violation or a CMEK key issue, remediating the surface symptom without addressing the root cause will cause a second incident. SCC findings take 60 seconds to propagate after a CRITICAL event — if no CRITICAL finding is present and the incident is 5+ minutes old, the failure is likely infrastructure, not security.
SLA: SCC check must complete before any Cloud Run restart or terraform apply is attempted
05
Restore HITL framework — audit trail restoration runs in parallel
Two tracks run simultaneously. Track A: On-Call SRE restores the HITL Cloud Run service (check Cloud Run revision health, check Firestore connectivity, check Secret Manager token validity). Track B: Second responder verifies Firestore audit write path — run a synthetic audit record write to the hitl_events collection and confirm it commits within 2s. If either track fails, escalate to the Principal Engineer on the engineering escalation path immediately.
Target: HITL framework restored within 15 minutes of incident start. Audit trail restored within 10 minutes (tighter — every minute of gap is a compliance exposure).
06
Replay any blocked HITL checkpoints
Once the HITL framework is confirmed healthy, replay each blocked checkpoint identified in Step 03. The Orchestrator's replay function re-presents the checkpoint to the appropriate reviewer with a note that the presentation was delayed due to a system incident. The delay timestamp and incident ID are written to the HITL event record as an immutable annotation — the reviewer sees the original decision context plus the incident note. EU AI Act auditors reviewing the record will see a delay, not a gap.
All blocked checkpoints must be replayed within 30 minutes of HITL framework restoration. Each replay is logged to the incident record.
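The "delay, not a gap" property of a replay depends on annotations being appended and never overwritten. The sketch below models this with frozen dataclasses: annotating a replay returns a new record and leaves the original untouched. The class and field names are hypothetical, and the real HITL event record lives in Firestore rather than in memory.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Annotation:
    incident_id: str
    delay_seconds: float
    note: str


@dataclass(frozen=True)
class HitlEvent:
    checkpoint_id: str
    triggered_at: datetime
    annotations: tuple = ()  # append-only history of incident notes


def annotate_replay(event: HitlEvent, incident_id: str, replayed_at: datetime) -> HitlEvent:
    """Return a copy of the event with a delay annotation appended."""
    delay = (replayed_at - event.triggered_at).total_seconds()
    note = Annotation(incident_id, delay, "Presentation delayed due to a system incident")
    return replace(event, annotations=event.annotations + (note,))


t0 = datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc)
original = HitlEvent("chk-1", t0)
replayed = annotate_replay(original, "inc-42", t0 + timedelta(minutes=30))
print(replayed.annotations[0].delay_seconds)
```

Because the records are frozen, any attempt to edit an existing annotation raises an error instead of silently rewriting history.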
07
Declare incident resolved and trigger retrospective
Incident is resolved when: HITL framework is healthy (synthetic checkpoint creates successfully), Audit Trail write SLO is green (verified by Cloud Monitoring), all blocked checkpoints have been replayed, and the incident record in Firestore is updated with: resolution_timestamp, root_cause_category, immediate_remediation_actions. The retrospective ticket must be created within 24 hours of resolution — not after the next sprint planning. It must include: timeline, blast radius (which HITL checkpoints were blocked, which modules were affected), root cause, and the specific architecture change or runbook update that prevents recurrence.
Retrospective ticket: within 24 hours of resolution. Architecture change PR (if required): within 1 sprint.
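The resolution criteria in Step 07 can be expressed as a single check that lists whatever is still outstanding. This is a sketch only. The three incident record fields are the ones named in the step, while the function name, its parameters, and the message wording are assumptions.

```python
REQUIRED_FIELDS = ("resolution_timestamp", "root_cause_category",
                   "immediate_remediation_actions")


def unmet_resolution_criteria(hitl_synthetic_ok, audit_slo_green,
                              blocked_ids, replayed_ids, incident_record):
    """Return a list of unmet criteria; an empty list means the incident can be resolved."""
    unmet = []
    if not hitl_synthetic_ok:
        unmet.append("HITL synthetic checkpoint did not create successfully")
    if not audit_slo_green:
        unmet.append("Audit Trail write SLO is not green")
    missing = sorted(set(blocked_ids) - set(replayed_ids))
    if missing:
        unmet.append(f"checkpoints not replayed: {missing}")
    absent = [f for f in REQUIRED_FIELDS if not incident_record.get(f)]
    if absent:
        unmet.append(f"incident record missing fields: {absent}")
    return unmet


record = {"resolution_timestamp": "2025-01-01T02:40:00Z",
          "root_cause_category": "infrastructure"}
print(unmet_resolution_criteria(True, True, ["c1", "c2"], ["c1"], record))
```

Returning the list of gaps, not just a boolean, gives the incident commander the remaining work items directly.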
Chaos Engineering Cadence

Suite-level failure drills. Quarterly. Mandatory.

Page 07 defined six infrastructure-level chaos experiments. The six experiments below extend that programme to the suite level, testing cross-module failure propagation, the HITL framework under load, and model governance under adversarial conditions. Every experiment has a specific expected outcome. If the outcome does not match, the architecture has a gap.

Suite Experiment 01
Quarterly · Month 1
HITL framework failure under full suite load
Simulate HITL framework unavailability for 8 minutes during a period when all four HITL-generating modules (RevRec AI, ContractGuard, Asset IQ, FinRisk Sentinel) have active checkpoints in-flight simultaneously.
Expected: All four modules enter CIRCUIT OPEN state independently. Each module's circuit breaker routes its pending decision to a manual fallback — Finance Controller, Legal, FSM, and CFO each receive a manual review notification with the context package the agent had assembled. No module waits for another. HITL event records for blocked checkpoints show the incident annotation after replay. Cross-module burn rate remains within P1 threshold (not P0) because each module degrades independently.
Suite Experiment 02
Quarterly · Month 1
Model drift injection — RevRec AI and ContractGuard simultaneously
Inject synthetic feature distribution shift simultaneously into the contract_value_eur feature (RevRec AI) and the governing_law_match feature (ContractGuard). Both should trigger PSI > 0.2 alert within the weekly drift detection window.
Expected: Vertex AI Monitoring fires two separate drift alerts within 7 days of injection. Each routes independently to HITL-10 retraining checkpoint. The HITL-10 queue shows both alerts simultaneously — ML Engineer receives them as separate items with separate evidence packages. Critically: RevRec AI and ContractGuard continue operating in production during the drift alert period — drift does not trigger automatic shutdown, only a retraining recommendation.
Suite Experiment 03
Quarterly · Month 2
Data Governance quarantine cascade
Introduce a schema version mismatch in the APAC-East asset telemetry pipeline that affects 15% of daily records. Verify that Data Governance correctly quarantines the affected records, that Asset IQ inference excludes them, and that the quarantine does not propagate to ContractGuard or RevRec AI feature groups.
Expected: Data Governance quarantines affected records within one ingestion cycle. Asset IQ runs on the 85% of clean records — prediction quality is lower but inference continues. Strategy Dashboard shows Asset IQ data quality indicator as degraded. ContractGuard and RevRec AI feature stores are unaffected — they use independent feature groups. Data steward receives a P2 alert, not a P0. Quarantine is lifted only after data steward approves the corrected records.
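The quarantine behaviour this experiment verifies amounts to partitioning records by schema version and letting inference run on the clean subset. The sketch below shows that partition and the resulting quarantine rate. The `schema_version` field, the version strings, and the function names are hypothetical, and the actual Data Governance rules are likely richer than a single version check.

```python
def partition_by_schema(records, expected_version):
    """Split records into (clean, quarantined) by schema version match."""
    clean = [r for r in records if r["schema_version"] == expected_version]
    quarantined = [r for r in records if r["schema_version"] != expected_version]
    return clean, quarantined


def quarantine_rate(clean, quarantined):
    """Share of ingested records that were quarantined."""
    total = len(clean) + len(quarantined)
    return len(quarantined) / total if total else 0.0


# 100 synthetic telemetry records, 15 of them with a mismatched schema version.
records = ([{"asset_id": i, "schema_version": "v2"} for i in range(85)]
           + [{"asset_id": i, "schema_version": "v3"} for i in range(85, 100)])
clean, quarantined = partition_by_schema(records, expected_version="v2")
print(len(clean), len(quarantined), quarantine_rate(clean, quarantined))
```

In this synthetic run inference would proceed on the 85 clean records, and the 15% rate is well above the 2% daily quarantine rate that raises a P3 notification in the routing matrix.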
Suite Experiment 04
Quarterly · Month 2
HITL SLA breach cascade across all modules
Simulate all Finance Controller HITL checkpoints (HITL-04 for RevRec AI) timing out simultaneously — the Finance Controller is unreachable for 4 hours and 5 minutes, triggering the SLA breach escalation path.
Expected: Cloud Scheduler timeout job fires at t+4h for each blocked HITL-04. Each escalates to CFO automatically with the original classification, SHAP explanation, and SLA breach note. No SAP writes occur during the breach period — the write guard holds regardless of SLA status. The HITL event records for each breached checkpoint show the escalation as an immutable annotation. EU AI Act compliance record shows SLA breaches documented, not hidden. The Strategy Dashboard HITL panel shows elevated breach count — CTO can see the operational state in real time.
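The expected outcome hinges on escalation changing who reviews a checkpoint without ever changing whether it is approved. The sketch below captures that invariant. It assumes a 4-hour SLA and uses hypothetical dictionary fields (`reviewer`, `approved`, `annotations`). The real timeout job runs in Cloud Scheduler against Firestore records.

```python
from datetime import datetime, timedelta, timezone

SLA = timedelta(hours=4)  # HITL-04 review SLA


def escalate_if_breached(checkpoint: dict, now: datetime) -> dict:
    """Return the checkpoint, re-routed to the CFO if its SLA has been exceeded.

    Approval state is never modified here: escalation cannot unblock the SAP write.
    """
    breached = now - checkpoint["created_at"] >= SLA
    if checkpoint["approved"] or not breached or checkpoint["reviewer"] == "CFO":
        return checkpoint
    note = {"type": "sla_breach", "escalated_at": now.isoformat(),
            "previous_reviewer": checkpoint["reviewer"]}
    return {**checkpoint, "reviewer": "CFO",
            "annotations": checkpoint["annotations"] + [note]}


t0 = datetime(2025, 1, 1, 8, 0, tzinfo=timezone.utc)
pending = {"checkpoint_id": "chk-7", "reviewer": "Finance Controller",
           "approved": False, "created_at": t0, "annotations": []}
escalated = escalate_if_breached(pending, t0 + timedelta(hours=4, minutes=5))
print(escalated["reviewer"], escalated["approved"])
```

After 4 hours and 5 minutes the reviewer becomes the CFO while `approved` stays false, which matches the requirement that no SAP write occurs during the breach period.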
Suite Experiment 05
Quarterly · Month 3
SHAP faithfulness gate failure in production pipeline
Inject a synthetic model version into the RevRec AI Vertex AI Pipeline where the SHAP perturbation test returns faithfulness of 87% — below the 90% threshold that gates HITL-11 promotion.
Expected: Pipeline fails at the XAI Gate step. Model does not proceed to HITL-11 regardless of its evaluation metrics. ML Engineer receives a pipeline failure notification with the specific faithfulness test result and the examples where perturbation did not produce the expected direction change. Previous production model remains in service. Incident is logged as a pipeline failure, not an operational incident — the gate worked correctly. Retrospective reviews whether the faithfulness failure was due to a model architecture change or a data quality issue.
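The gate in this experiment can be pictured as a threshold check over perturbation results. The sketch below assumes faithfulness is the fraction of test examples where perturbing a feature moved the prediction in the direction its SHAP value implied. That definition, the exception class, and the function names are illustrative assumptions, not the pipeline's actual XAI Gate code.

```python
class XaiGateFailure(Exception):
    """Raised when SHAP faithfulness falls below the promotion threshold."""

    def __init__(self, score, failing_indices):
        super().__init__(f"faithfulness {score:.2f} below threshold")
        self.score = score
        self.failing_indices = failing_indices


def faithfulness(results):
    """results: list of (expected_sign, observed_delta) pairs, one per perturbed example."""
    agreeing = sum(1 for sign, delta in results if sign * delta > 0)
    return agreeing / len(results)


def xai_gate(results, threshold=0.90):
    """Return the faithfulness score, or raise with the failing examples attached."""
    score = faithfulness(results)
    if score < threshold:
        failing = [i for i, (sign, delta) in enumerate(results) if sign * delta <= 0]
        raise XaiGateFailure(score, failing)
    return score


# Synthetic run: 87 of 100 perturbations move the prediction as expected.
results = [(1, 0.2)] * 87 + [(1, -0.1)] * 13
try:
    xai_gate(results)
except XaiGateFailure as exc:
    print(f"gate blocked promotion: {exc}, {len(exc.failing_indices)} failing examples")
```

Attaching the failing example indices to the exception mirrors the expectation that the ML Engineer receives the specific cases where the direction change did not occur.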
Suite Experiment 06
Quarterly · Month 3
Full suite rebuild from Terraform state — suite-level verification
Extend the page-07 infrastructure rebuild experiment to the full AS Suite: after rebuilding the GCP infrastructure from Terraform state, verify that all eight module pages are reachable, that the HITL framework creates and presents a synthetic checkpoint, that Vertex AI Feature Store serves the expected features, and that the Strategy Dashboard renders all four panels with live data within 10 minutes of infrastructure restore.
Expected: Full infrastructure rebuild in under 45 minutes (page-07 target). Suite-level verification adds 10 minutes. Total time from the start of terraform apply to full suite operational: under 55 minutes. This experiment validates that the AS is not just infrastructure-as-code but suite-as-code: the complete operational state is reproducible from the repository and the Terraform state file alone.
Architecture Decision Record

One operational decision worth documenting.

ADR-017 covers the most consequential Day 2 design decision: how HITL SLA breaches are handled at the suite level when the Finance Controller is the single point of approval for all RevRec AI classifications.

ADR-017 · Day 2 Operations
Automatic CFO escalation on HITL-04 SLA breach — not manual triage
Decision
When a HITL-04 (RevRec AI Finance Controller review) checkpoint exceeds its 4-hour SLA, escalation to the CFO is triggered automatically by Cloud Scheduler — not by a human triage decision. The CFO receives the original classification, SHAP explanation, comparable transactions, and a note that the Finance Controller SLA was exceeded. The SLA breach is documented as an immutable annotation in the HITL event record. SAP write remains blocked throughout — the breach does not bypass the approval requirement.
Alternatives Rejected
Manual triage on SLA breach: requires a human to notice the breach, assess its severity, decide who to escalate to, and initiate the escalation. In a 24-hour operation with a Finance Controller who may be in a different timezone, a manual triage step adds unpredictable latency and introduces a human failure point in a compliance-critical path. The EU AI Act does not accept "the escalation was delayed because no one noticed the SLA breach" as a valid explanation for a gap in the oversight record. Auto-escalation is structurally superior because it is deterministic.
Allow timeout to auto-approve: categorically rejected. The SAP write guard is architecturally enforced: no approval record, no write, regardless of SLA status. Timeout-based auto-approval would require removing the mandatory HITL record ID parameter from the SAP write call, which would violate the core architectural constraint established in ADR-003.
Consequences
The CFO receives HITL-04 escalations whenever the Finance Controller is unavailable or unresponsive. This is an intentional design choice: the CFO is the appropriate escalation authority for revenue recognition decisions, and their receiving a HITL queue item is not an exceptional event; it is the designed fallback. The consequence of this design is that the CFO must be trained on the HITL-04 review UI and understand the SHAP explanation format before the AS goes live. This training requirement is an H2 go-live prerequisite, not an operational afterthought. The 24-hour retrospective requirement after any SLA breach ensures that repeated breaches are diagnosed and addressed, whether the root cause is Finance Controller availability, HITL UI usability, or model confidence distribution.
Accepted · Day 2 Operations
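The write guard that ADR-017 relies on can be illustrated with a function that refuses to build a SAP write unless it is handed the ID of an approved HITL record. This is a sketch of the constraint only. The function signature, the `decision` field, and the in-memory record store are hypothetical, and no real SAP API is called.

```python
class WriteBlocked(Exception):
    """Raised when a SAP write is attempted without an approved HITL record."""


def build_sap_write(classification: dict, hitl_record_id: str, hitl_records: dict) -> dict:
    """Return the write payload only if hitl_record_id points at an approved record."""
    record = hitl_records.get(hitl_record_id)
    if record is None:
        raise WriteBlocked(f"no HITL record {hitl_record_id!r}")
    if record.get("decision") != "approved":
        # An SLA breach or escalation note does not count as approval.
        raise WriteBlocked(f"HITL record {hitl_record_id!r} is not approved")
    return {"classification": classification, "hitl_record_id": hitl_record_id}


hitl_records = {
    "rec-1": {"decision": "approved", "reviewer": "CFO"},
    "rec-2": {"decision": None, "annotations": ["sla_breach"]},
}
print(build_sap_write({"contract": "C-100", "category": "over_time"}, "rec-1", hitl_records))
```

Because the record ID is a required argument and the approval check happens inside the write path, a timeout alone can never produce a write, which is the property the rejected auto-approve alternative would have broken.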
Portfolio Complete
Ten pages.
One complete system.

Pages 01–10 constitute a complete enterprise architecture portfolio covering TOGAF Phases A through H: strategy, stakeholder analysis, architecture development, delivery planning, agent and ML design, infrastructure, adoption, suite index, and operational governance. Every decision is documented. Every claim traces to an artifact.
