Day 2 Operations — Page 10 · The Autonomous Seller

Cross-Module SLOs

Eight modules. One operational posture.

Page 07 defined per-module SLOs for infrastructure. This table extends those definitions to cover the full AS Suite post go-live — including the cross-module burn rate interactions that only become visible when all eight modules are running simultaneously. The audit trail SLO at 99.99% is the anchor that constrains all other error budget decisions.

Module	Domain	SLI — what is measured	SLO Target	Error Budget (30d)	Burn Rate Alert	Cross-module interaction
CCAI Sales Agent	Commercial	% requests returning 2xx within 8s	99.5%	3.6 hours	14.4× over 1h	Outage routes inbound inquiries to AE directly — graceful degradation with no downstream module dependency
ContractGuard	Commercial	% analyses completing without error within 120s	99.0%	7.2 hours	6× over 6h	Outage: RevRec AI falls back to reading contract data directly from Salesforce — reduced feature richness, fully operational
RevRec AI	Financial	% classifications completing + SHAP generated within 10s	99.9%	43 minutes	36× over 1h — page immediately	Any classification failure blocks the SAP write by design. FinRisk Sentinel continues independently — no dependency. Strategy Dashboard loses RevRec panel data.
FinRisk Sentinel	Financial	% financial events scored within 5 minutes of ingestion	99.5%	3.6 hours	14.4× over 1h	Outage during streaming lag: anomaly alerts stop. CFO and Finance Controller must monitor BigQuery directly. No impact on other modules.
Asset IQ	Operations	% daily prediction runs completing within 2h window	99.0%	7.2 hours	6× over 6h	Daily run failure: next scheduled run is the recovery. Fleet anomaly detection pauses. GreenOps has no Asset IQ batch jobs to schedule until the next run; no other cross-module impact.
GreenOps Platform	Operations	% batch ML jobs scheduled to optimal carbon window within ±6h	95.0%	36 hours	Non-critical — alert at 50% budget burn	Outage: batch ML training runs immediately without carbon deferral. Financial and operational SLOs unaffected. ESG reporting loses carbon savings data for the affected window.
Strategy Dashboard	Platform	% dashboard panels rendering with data <60s stale	99.0%	7.2 hours	6× over 6h	Read-only. Dashboard outage has zero impact on any operational module. CTO and CFO fall back to direct BigQuery queries during recovery window.
HITL Framework	Platform	% HITL checkpoints created and presented within 60s of trigger	99.9%	43 minutes	36× over 1h — page immediately	Critical shared dependency. HITL failure blocks: RevRec AI SAP writes, ContractGuard counter-proposals, Asset IQ work orders, FinRisk CFO alerts. All modules degrade simultaneously. P0 regardless of which HITL checkpoint fails.
Audit Trail	Platform	% agent actions with audit record committed within 2s	99.99%	4 minutes	Any failure = immediate page	The anchor SLO. A gap in the audit trail is a compliance event under EU AI Act Art. 12 — not an operational incident. All modules must halt non-essential operations until audit trail is restored. 4-minute error budget means there is effectively no tolerance for audit write failures.
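
The error budgets and burn-rate thresholds in the table follow from simple arithmetic, sketched below. This is a minimal illustration that assumes a time-based budget over the 30-day window (the SLIs above are request- or event-based, so real budgets would be counted in events); the function names are hypothetical.

```python
from datetime import timedelta

WINDOW = timedelta(days=30)  # SLO window used in the table


def error_budget(slo_target: float, window: timedelta = WINDOW) -> timedelta:
    """Allowed 'bad' time in the window for an SLO target such as 0.999."""
    return window * (1.0 - slo_target)


def budget_fraction_burned(burn_rate: float, alert_window: timedelta,
                           window: timedelta = WINDOW) -> float:
    """Share of the whole budget consumed if burn_rate holds for alert_window."""
    return burn_rate * (alert_window / window)


if __name__ == "__main__":
    for name, target in [("RevRec AI", 0.999), ("Audit Trail", 0.9999)]:
        minutes = error_budget(target).total_seconds() / 60
        print(f"{name}: {minutes:.1f} minutes of budget per 30 days")
    burned = budget_fraction_burned(14.4, timedelta(hours=1))
    print(f"14.4x for 1h burns {burned:.0%} of the budget")
```

Under these assumptions a 99.9% target yields about 43 minutes, a 99.99% target about 4 minutes, and a 14.4× burn sustained for one hour consumes 2% of the 30-day budget; 6× over 6h and 36× over 1h each consume 5%.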

Critical cross-module burn rate interaction

If the HITL Framework and Audit Trail SLOs burn simultaneously (the two platform SLOs with 36× and immediate-page thresholds), the on-call responder must treat this as a P0 multi-module incident regardless of whether individual module SLOs are within budget. A HITL failure that also produces audit write gaps is a regulatory exposure, not just an operational one. The incident commander role activates automatically under this condition and cannot be delegated below Staff Engineer level.

Alert Routing Matrix

Every HITL breach. Every SLO burn. One routing decision.

A cross-module view of what triggers each alert priority, who receives it, what the expected response is, and the SLA for that response. This matrix covers both infrastructure alerts (from page 07) and HITL-specific operational alerts that only emerge once all modules are live simultaneously.

P0 — Immediate page

Compliance-affecting failure

· Audit trail write failure (any module)
· HITL framework creation failure (any checkpoint)
· RevRec AI HITL creation failure — SAP write blocked
· VPC-SC perimeter breach attempt
· SCC CRITICAL security finding
· HITL Framework + Audit Trail simultaneous burn

Receives: On-call SRE + ML Lead + Incident Commander (Staff Eng+)
SLA: Acknowledge within 5 minutes · incident declared within 10
EU AI Act: Any P0 involving audit or HITL must be logged as a compliance incident — documented in Firestore incident record before resolution

P1 — Page within 15 min

Business-critical degradation

· HITL SLA breach across ≥2 modules simultaneously
· FinRisk Sentinel streaming lag >10 minutes
· Circuit breaker open on Orchestrator (all agents affected)
· RevRec AI error budget >50% burned in <6h
· HITL override rate spike >30% in any module (30-min window)
· Asset IQ daily run missed by >2 hours

Receives: On-call SRE + module owner
SLA: Acknowledge within 15 minutes · root cause within 1 hour
Note: HITL override rate spike may indicate model drift before Vertex AI Monitoring detects it — route immediately to ML Lead even if infrastructure is healthy

P2 — Notify within 1 hour

Elevated risk — monitor closely

· Error budget burn rate >14.4× on any module (1h window)
· SCC HIGH security finding
· Model drift alert from Vertex AI Monitoring (any model)
· HITL override rate >15% sustained over 30-day window
· CMEK key rotation failure
· Budget alert at 80% of monthly allocation

Receives: On-call SRE (notification only, no page)
SLA: Acknowledge within 1 hour · remediation plan within 4 hours
HITL override rate: If sustained above 15% → trigger HITL-10 retraining review regardless of whether drift detection has fired

P3 — Next business day

Operational awareness items

· Budget alert at 50% of monthly allocation
· GreenOps scheduling missed carbon window (>25% of jobs)
· CCAI Sales Agent elevated P99 latency (>6s)
· SCC MEDIUM finding
· Data Governance quarantine rate >2% of daily ingestion
· Feature Store PSI warning (>0.1, below alert threshold of 0.2)

Receives: Slack notification to #as-ops channel
SLA: Review at next standup · remediation within 5 business days
Data Governance quarantine rate: Early warning for schema drift — review before it triggers the PSI alert threshold
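
A few of the routing rules in the matrix above can be sketched as plain threshold functions. This is an illustrative sketch only: the function names are hypothetical, the thresholds are copied from the P0 to P3 lists, and treating the budget thresholds as inclusive (at or above 80% / 50%) is an assumption.

```python
from typing import Optional


def override_rate_priority(rate_30min: float, rate_30day: float) -> Optional[str]:
    """HITL override rate: >30% in a 30-min window is P1, >15% over 30 days is P2."""
    if rate_30min > 0.30:
        return "P1"
    if rate_30day > 0.15:
        return "P2"
    return None


def scc_priority(severity: str) -> Optional[str]:
    """Security Command Center finding severity to alert priority."""
    return {"CRITICAL": "P0", "HIGH": "P2", "MEDIUM": "P3"}.get(severity.upper())


def budget_alert_priority(spend_fraction: float) -> Optional[str]:
    """Monthly budget spend: 80% of allocation is P2, 50% is P3."""
    if spend_fraction >= 0.80:
        return "P2"
    if spend_fraction >= 0.50:
        return "P3"
    return None


if __name__ == "__main__":
    print(override_rate_priority(rate_30min=0.35, rate_30day=0.10))
    print(scc_priority("high"))
    print(budget_alert_priority(0.55))
```

Checking the short-window spike before the sustained rate means a module that satisfies both conditions is routed at the higher priority.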

Model Governance Calendar

Five models. One governance cadence.

Model governance is not a deployment-time activity. Every model in the AS Suite has a recurring governance cadence — the scheduled events that keep each model compliant, well-calibrated, and trusted by the humans who approve its outputs. This calendar makes the governance posture operational, not aspirational.

Model	Event	Cadence	Trigger condition	What is reviewed	Owner · Gate
RevRec AI ASC 606 Classifier	Drift check	Weekly	Scheduled · automatic	PSI per feature vs training baseline · KL divergence on prediction distribution · HITL override rate in 30-day window vs previous 30 days	ML Engineer · automated
	SHAP stability	Monthly	After each retraining cycle	Spearman rank correlation of top-10 SHAP features vs previous production model. If ρ < 0.70, raise a feature-importance drift alert and trigger HITL-10.	ML Engineer · HITL-10
	Model Card update	Monthly	After HITL-11 promotion	Evaluation metrics vs previous version · bias analysis refresh · HITL override decision dataset incorporated · ECE recalculated	ML Lead · HITL-11
	EU AI Act review	Quarterly	Scheduled · Q1/Q2/Q3/Q4	Full Article 9 risk management review · Art. 13 SHAP faithfulness test · Art. 14 HITL checkpoint audit · Model Card completeness check	Compliance Officer · CCO sign-off
Asset IQ RUL Regressor	Drift check	Weekly	Scheduled · automatic	Sensor feature PSI vs training baseline · MAE on rolling 30-day labelled subset (confirmed failures) · Precision@14d tracking	ML Engineer · automated
	Ground truth label review	Monthly	Scheduled · automatic	All confirmed failure events from the past month matched against predictions. False negatives (missed failures) flagged for training set inclusion. Censored data window updated.	ML Engineer · Field Service Lead
	ISO 13485 DHR audit	Quarterly	Scheduled · Q1/Q2/Q3/Q4	Device History Record completeness check — every work order generated by Asset IQ must have a traceable DHR event in BigQuery. Any gap is a regulatory finding.	Quality / Regulatory · QA Manager
Asset IQ Anomaly Isolation Forest	False positive rate	Weekly	Scheduled · automatic	FPR in production vs training baseline (0.04). Regional breakdown — APAC-East historically runs higher. Alert if any region exceeds 0.08.	ML Engineer · automated
	Contamination review	Quarterly	Scheduled	Review contamination parameter (currently 0.05) against observed anomaly rate in production. If production anomaly rate diverges by >2× from contamination setting, retrain with updated parameter.	ML Engineer · ML Lead
	Regional baseline update	Quarterly	On roadmap	Deploy separate baseline models per region to address EMEA-North vs APAC-East FPR disparity. Each regional model trained on regional telemetry only.	ML Engineer · Field Service Lead
ContractGuard Clause Risk Scorer	Legal label refresh	Monthly		HITL Legal decisions (approve/revise/escalate) from the past month added to the training candidate set. Inter-annotator agreement re-computed on any new clause types. Clauses with disagreement excluded.	ML Engineer · General Counsel
ContractGuard Clause Risk Scorer	Non-English performance	Quarterly	Scheduled	High-Risk Recall and Precision for non-English contracts (currently Precision 0.78 vs 0.82 English). Track improvement trajectory as non-English training data accumulates from HITL decisions.	ML Engineer · Legal Lead
FinRisk Sentinel Anomaly Scorer	Baseline update	Monthly		Retrain on rolling 24-month window. False positive feedback from HITL decisions incorporated via Pub/Sub baseline update queue. Per-event-type model refresh (payment, GL posting, warranty reserve).	ML Engineer · Finance Controller
FinRisk Sentinel Anomaly Scorer	Tier 4 FPR review	Quarterly	Scheduled	Small clinic (Tier 4) FPR currently 0.08 vs overall 0.03. Track whether accumulated Tier 4 HITL decisions are improving the baseline. Separate Tier 4 model on roadmap if FPR remains >0.06 after 2 quarters.	ML Engineer · Finance Controller
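
Two of the statistics the calendar relies on, PSI for drift checks and Spearman rank correlation for SHAP stability, can be sketched in a few lines. This is a minimal sketch under stated assumptions: PSI inputs are already binned into proportions, a small epsilon guards against empty bins, and the Spearman formula used here is the tie-free form over ranks 1..n. The 0.1 / 0.2 PSI thresholds and the ρ < 0.70 gate come from this page; function names are hypothetical.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index over pre-binned proportions (same bins, same order)."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must use the same bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total


def psi_status(value: float) -> str:
    """Map a PSI value to the warning (>0.1) and alert (>0.2) thresholds."""
    if value > 0.2:
        return "alert"
    if value > 0.1:
        return "warning"
    return "ok"


def spearman_rho(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman correlation for two tie-free rankings of the same n features."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("need two rankings of the same length (n >= 2)")
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - (6.0 * d2) / (n * (n * n - 1))


def shap_stability_triggers_review(rho: float, threshold: float = 0.70) -> bool:
    """True when feature-importance ranking has drifted enough to trigger HITL-10."""
    return rho < threshold
```

In use, `rank_a` would hold each top-10 feature's rank in the previous production model and `rank_b` its rank in the retrained candidate; how features that drop out of the top 10 are handled is left open here.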

On-Call Runbook

A P0 fires at 2am. Exactly what happens next.

The runbook below covers the most consequential incident type: a combined HITL framework + Audit Trail failure — the scenario where EU AI Act compliance exposure and operational failure occur simultaneously. Every step is specific, timed, and traceable. The runbook is a living artifact — updated after every incident retrospective.

Page fires — acknowledge within 5 minutes

PagerDuty alert fires simultaneously to On-Call SRE, ML Lead, and Incident Commander (Staff Engineer or above). Acknowledge in PagerDuty within 5 minutes; the acknowledgement timestamp is recorded for EU AI Act compliance documentation, while the incident clock itself starts at page time. If acknowledgement does not occur within 5 minutes, PagerDuty escalates to the Engineering Manager automatically.

SLA: 5 minutes to acknowledge · EU AI Act: incident clock starts at page time, not acknowledgement time

Open incident record in Firestore before any remediation action

Before touching any infrastructure, the on-call SRE opens an incident record in the Firestore incident_records collection with: incident_id, page_timestamp, acknowledger, initial_symptoms. This is not optional — under EU AI Act Article 12, any gap in system operation that affects the audit trail must itself be documented. The incident record is the documentation of the gap. Write it first.

SLA: Incident record must exist before any terraform, gcloud, or kubectl command is executed in production
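
The "record first, remediate second" rule can be enforced in tooling rather than by convention. The sketch below is illustrative only: it uses an in-memory stand-in for the Firestore incident_records collection, the four record fields are the ones named in this step, and `IncidentLog` and `run_remediation` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict


@dataclass(frozen=True)
class IncidentRecord:
    incident_id: str
    page_timestamp: datetime
    acknowledger: str
    initial_symptoms: str


class IncidentLog:
    """In-memory stand-in for the Firestore incident_records collection."""

    def __init__(self) -> None:
        self._records: Dict[str, IncidentRecord] = {}

    def open(self, record: IncidentRecord) -> None:
        if record.incident_id in self._records:
            raise ValueError(f"incident {record.incident_id} already exists")
        self._records[record.incident_id] = record

    def exists(self, incident_id: str) -> bool:
        return incident_id in self._records


def run_remediation(log: IncidentLog, incident_id: str, command: str) -> str:
    """Refuse any production command until the incident record exists."""
    if not log.exists(incident_id):
        raise PermissionError("open the incident record before remediation")
    return f"would run: {command}"  # a real wrapper would execute the command


if __name__ == "__main__":
    log = IncidentLog()
    log.open(IncidentRecord("INC-0001", datetime.now(timezone.utc),
                            "on-call-sre", "HITL checkpoints not presenting"))
    print(run_remediation(log, "INC-0001", "kubectl rollout status"))
```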

Establish blast radius — which HITL checkpoints are currently blocked?

Query Firestore for any HITL checkpoints in PENDING state that have not transitioned in the past 5 minutes. These are checkpoints that were triggered but the HITL framework could not present to the reviewer. Each one represents a decision that is blocked. List them: checkpoint ID, module, persona, SLA deadline. If any HITL-04 (RevRec AI) checkpoint is blocked, escalate immediately to Finance Controller — they need to know that a SAP write is pending human approval that cannot be presented to them through the system.

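Query (illustrative, not a literal command; the gcloud CLI does not expose a Firestore document query, so run it from the Firestore console or a client library): collection hitl_events, filter state = PENDING AND created_at < [T-5min]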
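
The blast-radius check can be sketched as a filter over checkpoint records. This sketch keeps the logic in plain Python over in-memory dictionaries; in production the same predicate would be issued through a Firestore client library. The `state` and `created_at` fields come from the query in this step; `checkpoint_id`, `checkpoint_type`, `module`, `persona`, and `sla_deadline` are assumed field names for the items the step asks responders to list.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_pending(checkpoints: List[Dict], now: datetime,
                  max_age: timedelta = timedelta(minutes=5)) -> List[Dict]:
    """Checkpoints still PENDING that were created more than max_age ago."""
    cutoff = now - max_age
    blocked = [c for c in checkpoints
               if c["state"] == "PENDING" and c["created_at"] < cutoff]
    # Most urgent first: earliest SLA deadline at the top of the list.
    return sorted(blocked, key=lambda c: c["sla_deadline"])


def needs_finance_controller_escalation(blocked: List[Dict]) -> bool:
    """True if any blocked checkpoint is a HITL-04 (RevRec AI) review."""
    return any(c["checkpoint_type"] == "HITL-04" for c in blocked)


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 2, 10, tzinfo=timezone.utc)
    events = [
        {"checkpoint_id": "cp-1", "checkpoint_type": "HITL-04", "module": "RevRec AI",
         "persona": "Finance Controller", "state": "PENDING",
         "created_at": now - timedelta(minutes=9),
         "sla_deadline": now + timedelta(hours=3, minutes=51)},
        {"checkpoint_id": "cp-2", "checkpoint_type": "HITL-04", "module": "RevRec AI",
         "persona": "Finance Controller", "state": "PENDING",
         "created_at": now - timedelta(minutes=2),
         "sla_deadline": now + timedelta(hours=3, minutes=58)},
    ]
    for c in stale_pending(events, now):
        print(c["checkpoint_id"], c["module"], c["persona"], c["sla_deadline"].isoformat())
```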

Check Security Command Center before touching infrastructure

Open SCC before any remediation. If the HITL or Audit Trail failure was triggered by a VPC-SC perimeter violation or a CMEK key issue, remediating the surface symptom without addressing the root cause will cause a second incident. SCC findings take 60 seconds to propagate after a CRITICAL event — if no CRITICAL finding is present and the incident is 5+ minutes old, the failure is likely infrastructure, not security.

SLA: SCC check must complete before any Cloud Run restart or terraform apply is attempted

Restore HITL framework — audit trail restoration runs in parallel

Two tracks run simultaneously. Track A: On-Call SRE restores the HITL Cloud Run service (check Cloud Run revision health, check Firestore connectivity, check Secret Manager token validity). Track B: Second responder verifies Firestore audit write path — run a synthetic audit record write to the hitl_events collection and confirm it commits within 2s. If either track fails, escalate to the Principal Engineer on the engineering escalation path immediately.

Target: HITL framework restored within 15 minutes of incident start. Audit trail restored within 10 minutes (tighter — every minute of gap is a compliance exposure).

Replay any blocked HITL checkpoints

Once the HITL framework is confirmed healthy, replay each blocked checkpoint identified in the blast-radius step (Step 03). The Orchestrator's replay function re-presents the checkpoint to the appropriate reviewer with a note that the presentation was delayed due to a system incident. The delay timestamp and incident ID are written to the HITL event record as an immutable annotation, so the reviewer sees the original decision context plus the incident note. EU AI Act auditors reviewing the record will see a delay, not a gap.

All blocked checkpoints must be replayed within 30 minutes of HITL framework restoration. Each replay is logged to the incident record.
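
One way to picture the "immutable annotation" on replay is an append-only record: replay never edits the original event, it produces a new version with one more annotation. The sketch below is an assumption about shape only; the class and field names are hypothetical and the Orchestrator's actual replay function is not shown in this page.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Tuple


@dataclass(frozen=True)
class IncidentAnnotation:
    incident_id: str
    originally_triggered_at: datetime
    replayed_at: datetime


@dataclass(frozen=True)
class HitlEvent:
    checkpoint_id: str
    state: str
    annotations: Tuple[IncidentAnnotation, ...] = ()


def replay(event: HitlEvent, incident_id: str,
           triggered_at: datetime, now: datetime) -> HitlEvent:
    """Return a new event version carrying the delay annotation; the input is untouched."""
    note = IncidentAnnotation(incident_id, triggered_at, now)
    return replace(event, annotations=event.annotations + (note,))


if __name__ == "__main__":
    t0 = datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc)
    t1 = datetime(2025, 1, 1, 2, 25, tzinfo=timezone.utc)
    original = HitlEvent("cp-1", "PENDING")
    replayed = replay(original, "INC-0001", t0, t1)
    delay = replayed.annotations[0].replayed_at - replayed.annotations[0].originally_triggered_at
    print(f"{replayed.checkpoint_id} delayed by {delay}")
```

Frozen dataclasses only prevent accidental mutation in process; the storage layer would still need its own write-once guarantee.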

Declare incident resolved and trigger retrospective

Incident is resolved when: HITL framework is healthy (synthetic checkpoint creates successfully), Audit Trail write SLO is green (verified by Cloud Monitoring), all blocked checkpoints have been replayed, and the incident record in Firestore is updated with: resolution_timestamp, root_cause_category, immediate_remediation_actions. The retrospective ticket must be created within 24 hours of resolution — not after the next sprint planning. It must include: timeline, blast radius (which HITL checkpoints were blocked, which modules were affected), root cause, and the specific architecture change or runbook update that prevents recurrence.

Retrospective ticket: within 24 hours of resolution. Architecture change PR (if required): within 1 sprint.

Chaos Engineering Cadence

Suite-level failure drills. Quarterly. Mandatory.

Page 07 defined six infrastructure-level chaos experiments. These six extend that programme to the suite level — testing cross-module failure propagation, the HITL framework under load, and model governance under adversarial conditions. Every experiment has a specific expected outcome. If the outcome does not match, the architecture has a gap.

Suite Experiment 01

Quarterly · Month 1

HITL framework failure under full suite load

Simulate HITL framework unavailability for 8 minutes during a period when all four HITL-generating modules (RevRec AI, ContractGuard, Asset IQ, FinRisk Sentinel) have active checkpoints in-flight simultaneously.

Expected: All four modules enter CIRCUIT OPEN state independently. Each module's circuit breaker routes its pending decision to a manual fallback — Finance Controller, Legal, FSM, and CFO each receive a manual review notification with the context package the agent had assembled. No module waits for another. HITL event records for blocked checkpoints show the incident annotation after replay. Cross-module burn rate remains within P1 threshold (not P0) because each module degrades independently.
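
The expected behaviour, four breakers opening independently and each routing to its own manual fallback, can be sketched as follows. This is an illustrative model, not the suite's implementation: the failure threshold is a hypothetical value, and the module-to-persona mapping is inferred from the order in which the modules and personas are listed in this experiment.

```python
from dataclasses import dataclass

# Assumed mapping, inferred from the listing order in the experiment text.
FALLBACK = {
    "RevRec AI": "Finance Controller",
    "ContractGuard": "Legal",
    "Asset IQ": "FSM",
    "FinRisk Sentinel": "CFO",
}


@dataclass
class Breaker:
    module: str
    threshold: int = 3      # hypothetical: consecutive failures before opening
    failures: int = 0
    state: str = "CLOSED"

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.state = "OPEN"

    def record_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"

    def route(self, decision: str) -> str:
        """Send the pending decision to the HITL framework or the manual fallback."""
        if self.state == "OPEN":
            return f"manual review -> {FALLBACK[self.module]}: {decision}"
        return f"HITL framework: {decision}"


if __name__ == "__main__":
    breakers = {m: Breaker(m) for m in FALLBACK}
    for _ in range(3):
        breakers["RevRec AI"].record_failure()  # only this module sees failures
    for name, b in breakers.items():
        print(name, b.state, "|", b.route("pending decision"))
```

Because each breaker holds its own counter, one module opening does not change the state of the other three, which is the independence the experiment is checking for.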

Suite Experiment 02

Quarterly · Month 1

Model drift injection — RevRec AI and ContractGuard simultaneously

Inject synthetic feature distribution shift simultaneously into the contract_value_eur feature (RevRec AI) and the governing_law_match feature (ContractGuard). Both should trigger PSI > 0.2 alert within the weekly drift detection window.

Expected: Vertex AI Monitoring fires two separate drift alerts within 7 days of injection. Each routes independently to HITL-10 retraining checkpoint. The HITL-10 queue shows both alerts simultaneously — ML Engineer receives them as separate items with separate evidence packages. Critically: RevRec AI and ContractGuard continue operating in production during the drift alert period — drift does not trigger automatic shutdown, only a retraining recommendation.

Suite Experiment 03

Quarterly · Month 2

Data Governance quarantine cascade

Introduce a schema version mismatch in the APAC-East asset telemetry pipeline that affects 15% of daily records. Verify that Data Governance correctly quarantines the affected records, that Asset IQ inference excludes them, and that the quarantine does not propagate to ContractGuard or RevRec AI feature groups.

Expected: Data Governance quarantines affected records within one ingestion cycle. Asset IQ runs on the 85% of clean records — prediction quality is lower but inference continues. Strategy Dashboard shows Asset IQ data quality indicator as degraded. ContractGuard and RevRec AI feature stores are unaffected — they use independent feature groups. Data steward receives a P2 alert, not a P0. Quarantine is lifted only after data steward approves the corrected records.

Suite Experiment 04

Quarterly · Month 2

HITL SLA breach cascade across all modules

Simulate all Finance Controller HITL checkpoints (HITL-04 for RevRec AI) timing out simultaneously — the Finance Controller is unreachable for 4 hours and 5 minutes, triggering the SLA breach escalation path.

Expected: Cloud Scheduler timeout job fires at t+4h for each blocked HITL-04. Each escalates to CFO automatically with the original classification, SHAP explanation, and SLA breach note. No SAP writes occur during the breach period — the write guard holds regardless of SLA status. The HITL event records for each breached checkpoint show the escalation as an immutable annotation. EU AI Act compliance record shows SLA breaches documented, not hidden. The Strategy Dashboard HITL panel shows elevated breach count — CTO can see the operational state in real time.
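
The timeout check that the scheduled job performs can be sketched as a pure function over pending checkpoints. This is a minimal sketch assuming the 4-hour HITL-04 SLA stated on this page; the record fields and the shape of the escalation payload are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

HITL_04_SLA = timedelta(hours=4)


def escalations_due(checkpoints: List[Dict], now: datetime) -> List[Dict]:
    """Build one CFO escalation per HITL-04 checkpoint that is pending past its SLA."""
    due = []
    for c in checkpoints:
        if c["state"] == "PENDING" and now - c["created_at"] >= HITL_04_SLA:
            due.append({
                "checkpoint_id": c["checkpoint_id"],
                "escalate_to": "CFO",
                "annotation": "Finance Controller SLA exceeded",
                "sap_write_allowed": False,  # the breach never unlocks the write
            })
    return due


if __name__ == "__main__":
    created = datetime(2025, 1, 1, 8, 0, tzinfo=timezone.utc)
    pending = [{"checkpoint_id": "cp-1", "state": "PENDING", "created_at": created}]
    # Mirrors the experiment: Finance Controller unreachable for 4h05m.
    print(escalations_due(pending, created + timedelta(hours=4, minutes=5)))
```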

Suite Experiment 05

Quarterly · Month 3

SHAP faithfulness gate failure in production pipeline

Inject a synthetic model version into the RevRec AI Vertex AI Pipeline where the SHAP perturbation test returns faithfulness of 87% — below the 90% threshold that gates HITL-11 promotion.

Expected: Pipeline fails at the XAI Gate step. Model does not proceed to HITL-11 regardless of its evaluation metrics. ML Engineer receives a pipeline failure notification with the specific faithfulness test result and the examples where perturbation did not produce the expected direction change. Previous production model remains in service. Incident is logged as a pipeline failure, not an operational incident — the gate worked correctly. Retrospective reviews whether the faithfulness failure was due to a model architecture change or a data quality issue.
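
The gate logic itself is small. The sketch below assumes faithfulness is the share of perturbation tests whose prediction moved in the direction the SHAP attribution implied, and that a score exactly at the 90% threshold passes; both are assumptions, and the function names are hypothetical.

```python
from typing import Sequence

FAITHFULNESS_THRESHOLD = 0.90  # gate for HITL-11 promotion, per this page


def faithfulness(direction_matched: Sequence[bool]) -> float:
    """Share of perturbation tests where the prediction moved as SHAP implied."""
    if not direction_matched:
        raise ValueError("no perturbation results supplied")
    return sum(direction_matched) / len(direction_matched)


def xai_gate(score: float, threshold: float = FAITHFULNESS_THRESHOLD) -> bool:
    """True if the candidate model may proceed to HITL-11."""
    return score >= threshold


if __name__ == "__main__":
    results = [True] * 87 + [False] * 13  # the injected 87% case
    score = faithfulness(results)
    failing = [i for i, ok in enumerate(results) if not ok]
    print(f"faithfulness={score:.2f} proceed={xai_gate(score)} failing_examples={len(failing)}")
```

Keeping the indices of the failing examples is what lets the pipeline notification point the ML Engineer at the specific perturbations that did not behave as expected.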

Suite Experiment 06

Quarterly · Month 3

Full suite rebuild from Terraform state — suite-level verification

Extend the page-07 infrastructure rebuild experiment to the full AS Suite: after rebuilding the GCP infrastructure from Terraform state, verify that all eight module pages are reachable, that the HITL framework creates and presents a synthetic checkpoint, that Vertex AI Feature Store serves the expected features, and that the Strategy Dashboard renders all four panels with live data within 10 minutes of infrastructure restore.

Expected: Full infrastructure rebuild in under 45 minutes (page-07 target). Suite-level verification adds 10 minutes. Total time from the start of the rebuild to full suite operational: under 55 minutes. This experiment validates that the AS is not just infrastructure-as-code but suite-as-code: the complete operational state is reproducible from the repository and the Terraform state file alone.

Architecture Decision Record

One operational decision worth documenting.

ADR-017 covers the most consequential Day 2 design decision: how HITL SLA breaches are handled at the suite level when the Finance Controller is the single point of approval for all RevRec AI classifications.

ADR-017 · Day 2 Operations

Automatic CFO escalation on HITL-04 SLA breach — not manual triage

Decision

When a HITL-04 (RevRec AI Finance Controller review) checkpoint exceeds its 4-hour SLA, escalation to the CFO is triggered automatically by Cloud Scheduler — not by a human triage decision. The CFO receives the original classification, SHAP explanation, comparable transactions, and a note that the Finance Controller SLA was exceeded. The SLA breach is documented as an immutable annotation in the HITL event record. SAP write remains blocked throughout — the breach does not bypass the approval requirement.
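
The key property of this decision, that escalation changes who reviews but never unlocks the write, can be sketched as a guard on the write call. This is an illustration under stated assumptions: the state names, the `approvals` lookup, and the `sap_write` signature are hypothetical, with only the mandatory HITL record ID taken from this ADR.

```python
from typing import Dict


class WriteBlocked(Exception):
    """Raised when a SAP write is attempted without an approved HITL record."""


def sap_write(classification_id: str, hitl_record_id: str,
              approvals: Dict[str, str]) -> str:
    """Post to SAP only if the referenced HITL record is approved."""
    status = approvals.get(hitl_record_id)
    if status != "APPROVED":
        # PENDING, ESCALATED_TO_CFO, or an unknown record all block the write.
        raise WriteBlocked(f"{hitl_record_id}: status={status!r}")
    return f"posted {classification_id} under {hitl_record_id}"


if __name__ == "__main__":
    approvals = {"hitl-04-001": "ESCALATED_TO_CFO"}
    try:
        sap_write("cls-42", "hitl-04-001", approvals)
    except WriteBlocked as exc:
        print("blocked:", exc)
    approvals["hitl-04-001"] = "APPROVED"  # CFO approves after escalation
    print(sap_write("cls-42", "hitl-04-001", approvals))
```

Making the record ID a required positional parameter means a caller cannot omit it by accident, which is the structural point the ADR makes about timeout-based auto-approval.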

Alternatives Rejected

Manual triage on SLA breach: requires a human to notice the breach, assess its severity, decide who to escalate to, and initiate the escalation. In a 24-hour operation with a Finance Controller who may be in a different timezone, a manual triage step adds unpredictable latency and introduces a human failure point in a compliance-critical path. The EU AI Act does not accept "the escalation was delayed because no one noticed the SLA breach" as a valid explanation for a gap in the oversight record. Auto-escalation is structurally superior because it is deterministic.

Allow timeout to auto-approve: categorically rejected. The SAP write guard is architecturally enforced: no approval record, no write, regardless of SLA status. Timeout-based auto-approval would require removing the mandatory HITL record ID parameter from the SAP write call, which would violate the core architectural constraint established in ADR-003.

Consequences

The CFO receives HITL-04 escalations whenever the Finance Controller is unavailable or unresponsive. This is an intentional design choice: the CFO is the appropriate escalation authority for revenue recognition decisions, and their receiving a HITL queue item is not an exceptional event; it is the designed fallback. The consequence of this design is that the CFO must be trained on the HITL-04 review UI and understand the SHAP explanation format before the AS goes live. This training requirement is an H2 go-live prerequisite, not an operational afterthought. The 24-hour retrospective requirement after any SLA breach ensures that repeated breaches are diagnosed and addressed, whether the root cause is Finance Controller availability, HITL UI usability, or model confidence distribution.

Accepted · Day 2 Operations