The Autonomous Enterprise / Page 07

Infrastructure &
GCP Architecture
— every resource in code.

The Phase D reference architecture from Page 03 expressed as Terraform resources, security controls, operational procedures, and SLO definitions. Every GCP resource is provisioned by code. Every security control is enforced at the infrastructure layer. Nothing is configured manually.

Terraform IaC · VPC-SC · BeyondCorp · CMEK · Workload Identity · Cloud Build CI/CD · FinOps · GreenOps · SLO · DR · Chaos Engineering
Infrastructure Design Principles

Six principles. The rules every resource must satisfy.

These principles are derived from the Architecture Principles established in TOGAF Preliminary Phase (Page 03). They are the engineering expression of those principles — specific enough to evaluate any infrastructure decision against.

IP-01
Every resource is provisioned by Terraform — no manual state
Every GCP resource — VPC, subnet, IAM binding, Cloud Run service, Firestore database, BigQuery dataset — is declared in Terraform and provisioned by terraform apply. No manual console changes. If a resource cannot be expressed in Terraform, it is a design gap, not a configuration exception.
P-08 · FDA change control · ISO 27001
IP-02
Data residency is enforced by Organisation Policy — not application config
All resources are constrained to europe-west3 (Frankfurt) by an Organisation Policy constraint (gcp.resourceLocations) applied at the GCP organisation root, with europe-west4 permitted solely as the DR replication target. Application code cannot override this. Any resource deployed to another region fails with a policy violation before it is created.
P-06 · GDPR · C-04
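Expressed in Terraform, the residency constraint is a single org-level policy. A sketch assuming the google_org_policy_policy resource — the variable names and the region value-group syntax are illustrative:

```hcl
# modules/security/org_policy.tf (sketch) — pin all resource
# locations to europe-west3 at the organisation root.
resource "google_org_policy_policy" "resource_locations" {
  name   = "organizations/${var.org_id}/policies/gcp.resourceLocations"
  parent = "organizations/${var.org_id}"

  spec {
    rules {
      values {
        # "in:" value groups expand to all locations in the region
        allowed_values = ["in:europe-west3-locations"]
      }
    }
  }
}
```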
IP-03
Zero-trust: no implicit trust based on network location
BeyondCorp enforces identity verification at every access request. Service accounts use Workload Identity Federation — no key files exist anywhere in the codebase or the infrastructure state. Every service-to-service call is authenticated. The VPC is a deployment boundary, not a trust boundary.
P-11 · ISO 27001 · AR-07
IP-04
ClaraVis holds the encryption keys — Google does not
All storage resources (BigQuery, Firestore, GCS, Cloud Run data at rest) use Customer-Managed Encryption Keys via Cloud KMS. ClaraVis holds key custody. Key rotation policy: 90 days. In the event of a contract termination, ClaraVis can revoke key access and all data becomes cryptographically inaccessible.
P-11 · GDPR Art. 17 · ISO 27001
IP-05
Cost allocation is tagged from first terraform apply
Every resource carries four mandatory labels: module (which AE module), env (dev/staging/prod), cost-centre (ClaraVis internal code), and data-classification (public/internal/confidential/restricted). Labels enforced by a Terraform module-level validation. Resources without required labels fail plan validation before they reach apply.
P-12 · FinOps standard
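A module-level validation of the four mandatory labels might look like this — a sketch; the variable name and messages are illustrative:

```hcl
# modules/*/variables.tf (sketch) — reject any plan whose labels
# are missing a mandatory key or use an unknown classification.
variable "labels" {
  type = map(string)

  validation {
    condition = alltrue([
      for k in ["module", "env", "cost-centre", "data-classification"] :
      contains(keys(var.labels), k)
    ])
    error_message = "Labels must include module, env, cost-centre and data-classification."
  }

  validation {
    condition = contains(
      ["public", "internal", "confidential", "restricted"],
      lookup(var.labels, "data-classification", "")
    )
    error_message = "data-classification must be one of: public, internal, confidential, restricted."
  }
}
```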
IP-06
Security posture is continuously validated — not point-in-time audited
Security Command Center Premium is enabled. Every infrastructure change triggers a security posture scan. Findings above HIGH severity block the Cloud Build deployment pipeline. The security posture dashboard is a live view of the infrastructure state, not a monthly report.
P-11 · ISO 27001 · FDA software validation
Terraform IaC

Module structure and production HCL.

The infrastructure is organised into six Terraform modules. The root module wires them together and passes shared variables. Three HCL snippets below show the most architecturally significant resources — the ones a Google Cloud architect will ask to see first.

Repository Structure
infra/terraform/
├── root/
│ ├── main.tf — module composition
│ ├── variables.tf — region, project, env
│ ├── outputs.tf
│ └── backend.tf — GCS state backend
├── modules/networking/
│ ├── vpc.tf — shared VPC + subnets
│ ├── vpc_sc.tf — VPC-SC perimeter
│ ├── firewall.tf
│ └── cloud_armor.tf
├── modules/security/
│ ├── iam.tf — SA + WIF bindings
│ ├── kms.tf — CMEK key rings
│ ├── secret_manager.tf
│ └── org_policy.tf — region constraint
├── modules/compute/
│ ├── cloud_run.tf — agent services
│ └── gke_autopilot.tf — batch ML
├── modules/data/
│ ├── bigquery.tf — datasets + CMEK
│ ├── firestore.tf
│ ├── pubsub.tf — topics + subscriptions
│ └── gcs.tf — contract store bucket
└── modules/monitoring/
  ├── slo.tf — SLO definitions
  ├── alerts.tf — budget + drift alerts
  └── dashboards.tf
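The backend.tf referenced above pins state to a GCS bucket. A minimal sketch — the bucket name and prefix are illustrative, and backend blocks cannot interpolate variables, so the values are literals:

```hcl
# root/backend.tf — GCS remote state with native locking.
terraform {
  backend "gcs" {
    bucket = "claravis-ae-tf-state"   # illustrative bucket name
    prefix = "ae/root"                # state path for the root module
  }
}
```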
VPC-SC Perimeter
Workload Identity
Cloud Run Service
CMEK Key
# modules/networking/vpc_sc.tf
# VPC Service Controls perimeter — enforces data residency
# and prevents exfiltration outside europe-west3

resource "google_access_context_manager_service_perimeter" "claravis_ae" {
  parent = "accessPolicies/${var.access_policy_id}"
  name   = "accessPolicies/${var.access_policy_id}/servicePerimeters/claravis_ae_perimeter"
  title  = "ClaraVis AE Production Perimeter"

  spec {
    resources = ["projects/${var.project_number}"]

    restricted_services = [
      "bigquery.googleapis.com",
      "firestore.googleapis.com",
      "storage.googleapis.com",
      "aiplatform.googleapis.com",
      "run.googleapis.com",
      "pubsub.googleapis.com",
      "secretmanager.googleapis.com",
      "cloudkms.googleapis.com",
    ]

    access_levels = [
      google_access_context_manager_access_level.claravis_corp_devices.name,
    ]

    vpc_accessible_services {
      enable_restriction = true
      allowed_services   = ["RESTRICTED-SERVICES"]
    }

    ingress_policies {
      ingress_from {
        identity_type = "SERVICE_ACCOUNT"
        identities = [
          "serviceAccount:${var.orchestrator_sa}",
          "serviceAccount:${var.cloud_build_sa}",
        ]
      }
      ingress_to {
        resources = ["*"]
        operations { service_name = "*" }
      }
    }
  }

  use_explicit_dry_run_spec = false
}
# modules/security/iam.tf
# Workload Identity Federation — no service account key files
# Workloads authenticate keylessly: Cloud Run via the attached runtime SA, GKE via Workload Identity

resource "google_service_account" "contractguard_sa" {
  project      = var.project_id
  account_id   = "contractguard-sa"
  display_name = "ContractGuard Agent Service Account"
  description  = "Least-privilege SA for ContractGuard Cloud Run service. No key files created."
}

# Workload Identity binding — GKE batch workloads impersonate the SA
# (Cloud Run attaches the SA directly via template.service_account)
resource "google_service_account_iam_member" "contractguard_wif" {
  service_account_id = google_service_account.contractguard_sa.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[${var.namespace}/contractguard]"
}

# Minimal permissions — only what ContractGuard needs
resource "google_project_iam_member" "contractguard_permissions" {
  for_each = toset([
    "roles/datastore.user",           # Firestore read/write
    "roles/bigquery.dataEditor",       # BigQuery audit writes
    "roles/storage.objectViewer",      # GCS contract bucket read
    "roles/secretmanager.secretAccessor", # Secret Manager read
    "roles/aiplatform.user",           # Vertex AI endpoint invoke
    "roles/cloudkms.cryptoKeyDecrypter", # CMEK decrypt (own key ring only)
  ])
  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.contractguard_sa.email}"
}

# Org policy: deny SA key creation — enforced at org root
resource "google_org_policy_policy" "deny_sa_key_creation" {
  name   = "organizations/${var.org_id}/policies/iam.disableServiceAccountKeyCreation"
  parent = "organizations/${var.org_id}"
  spec {
    rules { enforce = "TRUE" }
  }
}
# modules/compute/cloud_run.tf
# ContractGuard agent service — stateless, VPC-native, CMEK

resource "google_cloud_run_v2_service" "contractguard" {
  project  = var.project_id
  location = var.region   # europe-west3 — enforced by org policy
  name     = "contractguard-${var.env}"

  template {
    service_account = google_service_account.contractguard_sa.email

    scaling {
      min_instance_count = 0   # scale-to-zero in non-prod
      max_instance_count = 10
    }

    vpc_access {
      connector = google_vpc_access_connector.ae_connector.id
      egress    = "PRIVATE_RANGES_ONLY"
    }

    containers {
      image = "${var.region}-docker.pkg.dev/${var.project_id}/ae-agents/contractguard:${var.image_tag}"

      resources {
        limits = {
          cpu    = "2"
          memory = "4Gi"
        }
        cpu_idle          = true   # CPU only allocated during request
        startup_cpu_boost = true
      }

      env {
        name = "PROJECT_ID"
        value = var.project_id
      }
      env {
        name = "SFDC_TOKEN_SECRET"
        value_source {
          secret_key_ref {
            secret  = google_secret_manager_secret.sfdc_oauth_token.secret_id
            version = "latest"
          }
        }
      }
    }

    labels = {
      module              = "contractguard"
      env                 = var.env
      cost-centre         = var.cost_centre
      data-classification = "confidential"
    }
  }

  traffic {
    type    = "TRAFFIC_TARGET_ALLOCATION_TYPE_LATEST"
    percent = 100
  }
}
# modules/security/kms.tf
# CMEK key ring and keys — ClaraVis holds custody
# Rotation: 90 days. Google cannot access encrypted data.

resource "google_kms_key_ring" "claravis_ae" {
  project  = var.project_id
  name     = "claravis-ae-keyring"
  location = var.region   # europe-west3 — keys stay in EU
}

resource "google_kms_crypto_key" "bigquery_cmek" {
  name     = "bigquery-cmek"
  key_ring = google_kms_key_ring.claravis_ae.id

  rotation_period = "7776000s"  # 90 days in seconds

  version_template {
    algorithm        = "GOOGLE_SYMMETRIC_ENCRYPTION"
    protection_level = "HSM"    # Hardware Security Module
  }

  labels = {
    module              = "platform"
    data-classification = "restricted"
  }
}

resource "google_kms_crypto_key" "firestore_cmek" {
  name     = "firestore-cmek"
  key_ring = google_kms_key_ring.claravis_ae.id
  rotation_period = "7776000s"
  version_template {
    algorithm        = "GOOGLE_SYMMETRIC_ENCRYPTION"
    protection_level = "HSM"
  }
}

# BigQuery dataset with CMEK
resource "google_bigquery_dataset" "audit" {
  project     = var.project_id
  dataset_id  = "ae_audit"
  location    = var.region
  description = "Immutable audit trail — all agent actions, HITL events, SHAP explanations"

  default_encryption_configuration {
    kms_key_name = google_kms_crypto_key.bigquery_cmek.id
  }

  labels = {
    module              = "platform"
    env                 = var.env
    data-classification = "restricted"
    cost-centre         = var.cost_centre
  }
}
Network Topology

Shared VPC. Three subnets. One perimeter.

The network design follows the GCP shared VPC pattern — a single host project owns the network, service projects host the workloads. Three subnets map to the three compute layers: agents, data/ML, and infrastructure. The VPC-SC perimeter wraps the entire project boundary and prevents data exfiltration regardless of application behaviour.

Network Topology — Shared VPC · VPC-SC · Private Service Connect
europe-west3 (Frankfurt) · all resources · VPC-SC perimeter · BeyondCorp access · Cloud Armor WAF
VPC-SC perimeter: claravis_ae_perimeter — all services restricted · no egress outside europe-west3
Shared VPC: claravis-ae-vpc · host project claravis-ae-host · region europe-west3
Edge: Cloud Armor WAF (OWASP rules · DDoS · rate limiting · IP allowlist) · BeyondCorp / IAP (identity-aware proxy · device trust · MFA enforced)
SUBNET ae-agents-subnet · 10.10.1.0/24 · Private Google Access enabled — CCAI Agent, ContractGuard, RevRec AI, Asset IQ, Orchestrator (all Cloud Run · Orchestrator handles A2A)
SUBNET ae-data-ml-subnet · 10.10.2.0/24 · managed services via PSC — BigQuery (CMEK · europe-west3), Firestore (CMEK · native mode), Pub/Sub (6 topics · CMEK), Vertex AI (pipelines + feature store + endpoints), GKE Autopilot (batch ML training workloads)
SUBNET ae-infra-subnet · 10.10.3.0/24 · management plane — Cloud KMS (CMEK key rings · HSM), Secret Manager (OAuth tokens · API keys), Cloud Monitoring (SLO · alerts · dashboards), Security Command Center (continuous posture scan), Artifact Registry (signed container images), Cloud Build (CI/CD · Terraform runner)
Private Service Connect: all managed service API calls route via internal IP endpoints — no public internet traversal
External: Salesforce (REST API · OAuth 2.0 · via NAT) · SAP S/4HANA (middleware bridge · mock in demo · VPC-internal · WIF-authenticated)
Agent subnet (ae-agents)
Data/ML subnet (ae-data-ml)
Infra subnet (ae-infra)
VPC-SC perimeter
External systems
Security Architecture

Zero-trust enforced at every layer — not just the perimeter.

The CISO requirement from Page 02 (S-09) is satisfied structurally — not by policy. Every security control below is enforced by infrastructure code, not by operational procedure. A misconfigured application cannot bypass these controls because they are not application-level configurations.

Identity & Access
Zero-trust IAM — no implicit trust
One service account per agent — minimum required permissions only (least privilege)
Workload Identity Federation — no service account key files exist anywhere in the codebase
Org Policy: iam.disableServiceAccountKeyCreation enforced at organisation root
BeyondCorp IAP — all access to management interfaces requires device trust verification + MFA
Org Policy: iam.allowedPolicyMemberDomains — only ClaraVis identities can be bound to IAM roles
All IAM bindings managed exclusively by Terraform — no console grants
Data Protection
CMEK + VPC-SC — dual-layer data protection
CMEK on all storage: BigQuery, Firestore, GCS, Cloud Run — ClaraVis holds key custody via Cloud KMS HSM
Key rotation: 90 days — automated via Terraform rotation_period configuration
VPC-SC perimeter — all API calls to restricted services must originate from within the perimeter
Org Policy: gcp.resourceLocations constrains all resources to europe-west3/europe-west4 only
DLP API — automatic PII detection on data entering the GCS contract store bucket
All data in transit: TLS 1.3 minimum — enforced by Cloud Load Balancer SSL policy
Network Security
Defence in depth — four network layers
Cloud Armor WAF: OWASP CRS rules + rate limiting + IP allowlisting; the static Cloud NAT egress IP is pre-registered with Salesforce's allowlist
VPC firewall: default-deny-all ingress — only explicitly allowed traffic is permitted
Private Google Access: all managed service API calls via internal IP (PSC) — no public internet
Cloud NAT: outbound internet access (Salesforce API) via NAT — agents have no public IP
Shared VPC: network management centralised in host project — service projects cannot modify network
VPC Flow Logs: enabled on all subnets — retained 30 days in Cloud Logging
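The NAT-plus-egress-firewall pattern above can be sketched as follows. Resource names and variables (including var.salesforce_ip_ranges) are illustrative; 199.36.153.4/30 is the restricted.googleapis.com VIP range used by Private Google Access:

```hcl
# Static egress IP — registered with Salesforce's allowlist
resource "google_compute_address" "salesforce_egress" {
  name   = "salesforce-egress-ip"
  region = var.region
}

resource "google_compute_router" "ae" {
  name    = "ae-router"
  network = var.vpc_id
  region  = var.region
}

# Cloud NAT — outbound internet for private Cloud Run instances
resource "google_compute_router_nat" "ae" {
  name                   = "ae-nat"
  router                 = google_compute_router.ae.name
  region                 = var.region
  nat_ip_allocate_option = "MANUAL_ONLY"
  nat_ips                = [google_compute_address.salesforce_egress.self_link]

  source_subnetwork_ip_ranges_to_nat = "LIST_OF_SUBNETWORKS"
  subnetwork {
    name                    = var.agents_subnet_id
    source_ip_ranges_to_nat = ["ALL_IP_RANGES"]
  }
}

# Default-deny egress at lowest priority...
resource "google_compute_firewall" "deny_all_egress" {
  name               = "deny-all-egress"
  network            = var.vpc_id
  direction          = "EGRESS"
  priority           = 65534
  destination_ranges = ["0.0.0.0/0"]
  deny { protocol = "all" }
}

# ...with explicit allows for Private Google Access and Salesforce
resource "google_compute_firewall" "allow_google_apis" {
  name               = "allow-restricted-googleapis"
  network            = var.vpc_id
  direction          = "EGRESS"
  priority           = 1000
  destination_ranges = ["199.36.153.4/30"]   # restricted.googleapis.com
  allow {
    protocol = "tcp"
    ports    = ["443"]
  }
}

resource "google_compute_firewall" "allow_salesforce" {
  name               = "allow-salesforce-egress"
  network            = var.vpc_id
  direction          = "EGRESS"
  priority           = 1000
  destination_ranges = var.salesforce_ip_ranges   # illustrative variable
  allow {
    protocol = "tcp"
    ports    = ["443"]
  }
}
```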
Secrets Management
Secret Manager — no hardcoded credentials
All secrets in Secret Manager — Salesforce OAuth tokens, any third-party API keys
Secrets referenced via Cloud Run valueSource.secretKeyRef — never injected as plain env vars in Terraform
Secret access via service account IAM binding — secretmanager.secretAccessor role per SA, per secret
Secret rotation: Salesforce OAuth tokens rotated every 30 days via Cloud Scheduler + Cloud Functions
Pre-commit hook: detect-secrets scanner runs on every git commit — fails on any credential pattern
Audit log: every secret access logged to Cloud Audit Logs — alerts on anomalous access patterns
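The 30-day rotation cadence can be declared on the secret itself — Secret Manager publishes a rotation notification to Pub/Sub, which the rotation function consumes. A sketch; secret and topic names are illustrative, and the Secret Manager service agent additionally needs roles/pubsub.publisher on the topic:

```hcl
resource "google_pubsub_topic" "secret_rotation" {
  name = "secret-rotation-events"
}

resource "google_secret_manager_secret" "sfdc_oauth_token" {
  secret_id = "sfdc-oauth-token"

  replication {
    user_managed {
      replicas { location = var.region }   # keep replicas in-region
    }
  }

  # Secret Manager notifies the topic every 30 days; the rotation
  # Cloud Function then writes the new secret version.
  rotation {
    rotation_period    = "2592000s"             # 30 days
    next_rotation_time = var.first_rotation_time
  }

  topics {
    name = google_pubsub_topic.secret_rotation.id
  }
}
```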
Posture Management
Security Command Center — continuous validation
Security Command Center Premium — enabled at organisation level
Every Terraform apply triggers a post-deployment posture scan
HIGH and CRITICAL findings block the Cloud Build deployment pipeline — ADR-014
Container images scanned by Artifact Analysis before deployment — known CVEs blocked
Binary Authorization — only images signed by Cloud Build SA are deployable to Cloud Run / GKE
Vulnerability scanning: OS and application layer — automated weekly on all running container images
Audit & Compliance
Immutable audit — every action logged
Cloud Audit Logs: DATA_READ, DATA_WRITE, ADMIN_ACTIVITY enabled on all services — retained 400 days
Audit logs exported to BigQuery via Log Sink — cross-queryable with application audit trail
Log bucket: locked (WORM) — audit logs cannot be deleted or modified for 400 days
HITL event records in Firestore: immutable by write pattern — no update or delete operations permitted
Compliance dashboard: Forseti Security / Config Validator checks Terraform state against CIS GCP Benchmark
ISO 27001 evidence: Cloud Audit Logs + Security Command Center findings exportable as compliance evidence package
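The export sink and the locked log bucket from the list above might be declared as follows (names are illustrative):

```hcl
# Route audit logs to the ae_audit BigQuery dataset
resource "google_logging_project_sink" "audit_to_bq" {
  name                   = "audit-to-bigquery"
  destination            = "bigquery.googleapis.com/projects/${var.project_id}/datasets/ae_audit"
  filter                 = "logName:\"cloudaudit.googleapis.com\""
  unique_writer_identity = true
}

# The sink's generated identity needs write access on the dataset
resource "google_bigquery_dataset_iam_member" "sink_writer" {
  dataset_id = "ae_audit"
  role       = "roles/bigquery.dataEditor"
  member     = google_logging_project_sink.audit_to_bq.writer_identity
}

# Locked (WORM) Cloud Logging bucket — 400-day immutable retention.
# locked = true is irreversible: retention cannot later be shortened.
resource "google_logging_project_bucket_config" "audit" {
  project        = var.project_id
  location       = var.region
  bucket_id      = "audit-worm"
  retention_days = 400
  locked         = true
}
```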
Compute Architecture

Cloud Run for agents. GKE Autopilot for batch ML.

The compute split between Cloud Run and GKE Autopilot is ADR-003 from Page 03 — restated here with the full operational rationale. Cloud Run handles the stateless, request-driven agent workloads. GKE Autopilot handles the batch, long-running ML training jobs that benefit from GPU access and longer execution windows.

Service · Compute Platform · Scaling · Resources · Justification · Cost Model

CCAI Sales Agent · Cloud Run v2
0–10 instances · request-driven · 2 vCPU · 4Gi · CPU idle
Stateless conversational handler. Request duration: 1–8 s. Scale-to-zero eliminates idle cost. Session state in Firestore — no affinity required.
Per-request billing · ~€0.08/1K req

ContractGuard · Cloud Run v2
0–10 instances · request-driven · 4 vCPU · 8Gi · startup boost
Long-running requests (Gemini 1M-token analysis: 30–90 s). Startup CPU boost reduces cold-start latency. Higher memory for Gemini response buffering.
Per-request · ~€0.40/analysis

RevRec AI Agent · Cloud Run v2
0–10 instances · request-driven · 2 vCPU · 4Gi
Stateless classification handler. Request duration: 2–5 s (XGBoost inference + SHAP). Scale-to-zero acceptable — RevRec runs on contract-signed events, not continuously.
Per-request · ~€0.12/classification

Asset IQ Agent · Cloud Run v2 (inference) + GKE Autopilot (batch feature engineering)
0–5 Cloud Run instances · GKE scheduled jobs · 2 vCPU · 4Gi (Cloud Run) · n2-standard-4 (GKE)
RUL inference: Cloud Run (request-driven, short). Daily feature engineering over 12,000 units: GKE Autopilot (batch, long-running, benefits from parallelism).
Mixed: per-request + batch pod billing

FinRisk Sentinel · Cloud Run v2 · always-on
1–5 instances · min 1 (streaming) · 2 vCPU · 4Gi
Streaming anomaly detection requires a persistent connection to BigQuery streaming inserts. Min 1 instance eliminates cold-start latency on financial event processing.
Always-on: ~€45/month per instance

Orchestrator · Cloud Run v2
0–20 instances · request-driven · 2 vCPU · 4Gi
A2A message handler. Short requests (dispatch + state write). Highest instance count — the Orchestrator fans out to multiple agents simultaneously and must not become a bottleneck.
Per-request · ~€0.06/dispatch

ML Training (all models) · GKE Autopilot
0–N pods · job-scoped · n1-standard-8 + A100 GPU (training) · n1-standard-4 (evaluation)
ADR-015. Vertex AI Pipelines submits training jobs to GKE Autopilot. A100 GPU required for Gemini embedding computation (ContractGuard feature engineering). Scale-to-zero between pipeline runs.
Pod billing · GPU ~€2.80/hour · billed only during training
CI/CD Pipeline

Seven steps. Three gates. One Cloud Build pipeline.

Every code change — agent code, Terraform IaC, ML pipeline definition — passes through the same Cloud Build pipeline. Three gates enforce quality, security, and architectural compliance before anything reaches production. The FDA 21 CFR 820 change control requirement is satisfied by the mandatory Terraform plan review gate and the Binary Authorization signing step.

STEP 01
Source Trigger
GitHub webhook · PR merge to main
STEP 02
Test Suite
pytest · coverage ≥ 80% · contract tests
GATE 01
Security Scan
detect-secrets · Trivy CVE · SCC findings
STEP 03
Container Build
Docker build · Artifact Registry push
GATE 02
Terraform Plan
terraform plan · infracost · manual review gate for infra PRs
STEP 04
Deploy Staging
terraform apply · Cloud Run staging · integration tests
GATE 03
Promote Production
Binary Authorization · signed image · manual approval for prod
Cloud Build Pipeline — Trigger → Test → Build → Gate → Deploy
FDA 21 CFR 820 change control: Terraform plan review (Gate 02) + Binary Authorization signing (Gate 03) create the auditable change record
GitHub PR merged to main → Cloud Build: pytest + coverage → detect-secrets + Trivy CVE scan → Gate 01 (SCC HIGH finding blocks the PR and notifies the author) → Docker build · push to Artifact Registry · image signing → Gate 02 (terraform plan · manual review for infra changes) → deploy to staging (terraform apply · integration tests) → Gate 03 (Binary Authorization · manual production approval) → deploy to production (terraform apply · Cloud Run) → SCC post-deploy scan
Operations

SLOs. FinOps. GreenOps. DR. Chaos engineering.

Operational maturity is not an afterthought in this architecture — it is provisioned in the same Terraform run as the infrastructure. SLO definitions, budget alerts, carbon-aware scheduling, and DR configuration are all infrastructure-as-code artifacts.

SLO / SLI Definitions — per module
Module · SLI (what is measured) · SLO Target · Error Budget (30d) · Alert Threshold · Burn-Rate Alert
CCAI Sales Agent · % requests returning 2xx within 8 s · 99.5% availability · 3.6 hours · P99 latency > 8 s for 5 min; error rate > 1% for 10 min · 14.4× over 1h
ContractGuard · % analyses completing without error within 120 s · 99.0% availability · 7.2 hours · analysis error rate > 2% for 15 min; P95 latency > 90 s · 6× over 6h
RevRec AI · % classifications completing + SHAP generated within 10 s · 99.9% availability · 43 minutes · any classification error; SHAP generation failure; SAP write failure · 36× over 1h — page immediately
Asset IQ · % daily prediction runs completing within 2h window · 99.0% availability · 7.2 hours · daily run misses 2h window; prediction error rate > 5% · 6× over 6h
FinRisk Sentinel · % financial events scored within 5 minutes of ingestion · 99.5% availability · 3.6 hours · streaming lag > 5 min; anomaly scoring failure rate > 1% · 14.4× over 1h
HITL Framework · % HITL checkpoints created and presented within 60 s of trigger · 99.9% availability · 43 minutes · HITL creation failure; SLA breach rate > 5% across all checkpoints · 36× over 1h — page immediately
Audit Trail · % agent actions with audit record committed within 2 s · 99.99% availability · 4 minutes · any audit write failure — critical · any failure = immediate page
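As IaC, one row of the table above translates into a google_monitoring_slo plus a burn-rate alert policy. A sketch for the CCAI row — service IDs, the filter variables, and the notification channel are illustrative:

```hcl
resource "google_monitoring_custom_service" "ccai" {
  service_id   = "ccai-sales-agent"
  display_name = "CCAI Sales Agent"
}

# 99.5% of requests return 2xx within 8 s, over a rolling 30 days
resource "google_monitoring_slo" "ccai_availability" {
  service             = google_monitoring_custom_service.ccai.service_id
  slo_id              = "ccai-availability"
  display_name        = "CCAI — 99.5% good requests (30d)"
  goal                = 0.995
  rolling_period_days = 30

  request_based_sli {
    good_total_ratio {
      good_service_filter  = var.ccai_good_filter    # 2xx AND latency <= 8s
      total_service_filter = var.ccai_total_filter   # all requests
    }
  }
}

# Fast-burn alert: 14.4x budget burn over 1h pages the on-call
resource "google_monitoring_alert_policy" "ccai_fast_burn" {
  display_name = "CCAI SLO fast burn (14.4x over 1h)"
  combiner     = "OR"

  conditions {
    display_name = "burn rate > 14.4"
    condition_threshold {
      filter          = "select_slo_burn_rate(\"${google_monitoring_slo.ccai_availability.id}\", \"1h\")"
      comparison      = "COMPARISON_GT"
      threshold_value = 14.4
      duration        = "0s"
    }
  }

  notification_channels = [var.oncall_channel_id]
}
```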
FinOps — Cost Allocation
Four mandatory labels. Tagged from day one.
module — contractguard · revrec-ai · asset-iq · finrisk · ccai-sales · platform · greenops · strategy-dashboard
env — dev · staging · prod
cost-centre — ClaraVis internal code · maps to finance cost centre in SAP
data-classification — public · internal · confidential · restricted
Budget alerts: 50% · 80% · 100% of monthly budget per module. Alert to ClaraVis finance contact + Cloud Build notification. Resources without required labels fail terraform plan validation.
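The per-module budget alerts can be provisioned alongside the labels. A sketch for one module — the amount, notification channel, and the label-filter shape are illustrative:

```hcl
resource "google_billing_budget" "contractguard" {
  billing_account = var.billing_account_id
  display_name    = "AE budget — contractguard"

  budget_filter {
    projects = ["projects/${var.project_number}"]
    labels   = { module = "contractguard" }   # scope spend by module label
  }

  amount {
    specified_amount {
      currency_code = "EUR"
      units         = "500"   # illustrative monthly amount
    }
  }

  # 50% / 80% / 100% thresholds per the FinOps policy
  threshold_rules { threshold_percent = 0.5 }
  threshold_rules { threshold_percent = 0.8 }
  threshold_rules { threshold_percent = 1.0 }

  all_updates_rule {
    monitoring_notification_channels = [var.finance_channel_id]
  }
}
```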
GreenOps — Carbon-Aware Scheduling
Compute-intensive workloads scheduled to low-carbon windows.
The GreenOps module (Page 04, FRD-06) integrates with the infrastructure layer via Cloud Scheduler and the GCP Carbon Footprint API. ML training jobs submitted to GKE Autopilot via Vertex AI Pipelines are deferred to grid-carbon-optimal windows within a ±6 hour scheduling flexibility window. ESG metrics written to BigQuery for EU CSRD reporting.
Scheduling flexibility: ±6 hours for batch ML training · ±2 hours for daily Asset IQ feature engineering
Carbon intensity source: GCP Carbon Footprint API · europe-west3 grid signal
ESG output: Monthly BigQuery report → Looker Studio dashboard → EU CSRD Scope 3 evidence
Hard limit: RevRec AI and FinRisk Sentinel are never deferred — financial latency SLOs take precedence
Disaster Recovery — RTO / RPO Targets
Tier 1 — Critical (Revenue + Compliance)
RevRec AI · HITL Framework · Audit Trail
These three components directly affect EU AI Act compliance and financial posting integrity. Any downtime creates a compliance gap. The HITL audit store in Firestore is protected by continuous backup with point-in-time recovery; the BigQuery audit dataset is replicated to europe-west4 as a secondary region. RevRec AI inference is stateless — Cloud Run redeploys in under 3 minutes.
RTO 15 minutes
RPO 0 (multi-region replication)
Backup Continuous · Firestore PITR 7 days
Tier 2 — Important (Operations)
Asset IQ · FinRisk Sentinel · ContractGuard
Important operational modules. Downtime creates business impact but not immediate compliance exposure. Asset IQ runs on a daily cadence — a 4-hour RTO is acceptable (next daily run is the recovery). FinRisk Sentinel has a 5-minute streaming SLO — recovery within 1 hour restores monitoring coverage. ContractGuard is document-processing driven — queued documents can be re-processed after recovery.
RTO 1–4 hours
RPO 1 hour (Firestore PITR)
Backup Hourly exports to GCS
Tier 3 — Standard (Sales + Reporting)
CCAI Sales Agent · Strategy Dashboard · GreenOps
Business-important but not operationally critical. CCAI Sales Agent downtime routes inbound inquiries directly to Account Executives (the pre-AE process). Strategy Dashboard is read-only BigQuery — restores on next Cloud Build deploy. GreenOps scheduling defers to non-carbon-aware schedule during recovery window.
RTO 4–8 hours
RPO 24 hours
Backup Daily GCS snapshots
Infrastructure Recovery
Full Environment Rebuild from Terraform State
The entire GCP environment is reproducible from the Terraform state file stored in a CMEK-encrypted GCS bucket in a separate GCP project. A full environment rebuild from terraform apply is tested quarterly as part of the chaos engineering programme. Estimated rebuild time: 45 minutes for the full production environment. The Terraform state bucket is the single most critical recovery artifact — it has its own versioning, Object Lock (WORM), and cross-project replication.
Full rebuild ~45 minutes
State bucket WORM + versioning + cross-project replica
Test cadence Quarterly full rebuild drill
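The versioning-and-recovery portion of the state bucket might be declared as below — a sketch; bucket and key names are illustrative, and the Object Lock and cross-project replication mentioned above are configured separately and not shown:

```hcl
resource "google_storage_bucket" "tf_state" {
  name                        = "claravis-ae-tf-state"   # illustrative
  project                     = var.state_project_id     # separate project
  location                    = var.region
  uniform_bucket_level_access = true
  public_access_prevention    = "enforced"

  # Every state write keeps the previous version recoverable
  versioning { enabled = true }

  # Retain noncurrent state versions for the 90-day window
  lifecycle_rule {
    action { type = "Delete" }
    condition {
      days_since_noncurrent_time = 90
      with_state                 = "ARCHIVED"
    }
  }

  encryption {
    default_kms_key_name = var.state_cmek_key_id   # CMEK per IP-04
  }
}
```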
Chaos Engineering — Scheduled Failure Drills
Chaos Experiment 01
Specialist agent failure mid-task
Inject a forced error response from the ContractGuard Cloud Run service after 3 successful clause analyses. The Orchestrator has dispatched a 12-clause contract analysis — 3 succeed, then the agent returns a 500 error.
Expected: Orchestrator opens circuit breaker after 3rd failure. Task routes to HITL fallback. Finance Controller receives manual review request with the 3 completed clause analyses and a note that automated analysis failed for clauses 4–12. Partial work preserved in Firestore. No data loss. Circuit half-opens after 30s.
Chaos Experiment 02
Firestore write failure during HITL creation
Block Firestore writes for the HITL event collection for 60 seconds during an active RevRec AI classification run. The model has completed inference and SHAP computation — the HITL-04 creation write fails.
Expected: RevRec AI agent enters ERROR state. SAP write is never initiated (SAP write requires HITL record ID as mandatory parameter — the write is physically impossible without it). RevRec AI agent retries HITL creation 3 times with exponential backoff. On third failure: task fails, Finance Controller receives manual classification alert. Audit record of the attempted HITL creation is preserved in Cloud Logging even if Firestore write failed.
Chaos Experiment 03
VPC-SC perimeter breach attempt
From inside the VPC, attempt to write a BigQuery record to a dataset in a different GCP project outside the VPC-SC perimeter. Simulate a misconfigured application that attempts to exfiltrate data to an external BigQuery project.
Expected: VPC-SC perimeter blocks the BigQuery API call with a PERMISSION_DENIED error before it reaches the destination project. The attempted write is logged to Cloud Audit Logs as a VPC-SC violation. Security Command Center generates a HIGH finding. Cloud Build pipeline deployment is blocked until the finding is resolved and acknowledged. No data is exfiltrated.
Chaos Experiment 04
HITL SLA timeout — escalation path test
Create a HITL-04 (RevRec AI Finance Controller review) and then simulate the Finance Controller taking no action for 4 hours and 5 minutes — 5 minutes past the 4-hour SLA defined in the HITL specification.
Expected: Cloud Scheduler timeout job fires at t+4h. Escalation to CFO triggered automatically. A second HITL checkpoint is created for the CFO with the original classification, SHAP explanation, and a note that the FC SLA was exceeded. The timeout event is written to the Firestore HITL record as an immutable escalation event. EU AI Act compliance record shows the escalation — the SLA breach is documented, not hidden.
Chaos Experiment 05
CMEK key rotation — zero-downtime verification
Trigger a manual CMEK key rotation for the BigQuery CMEK key during active RevRec AI classification traffic. Verify that ongoing classifications are not interrupted and that new data written after rotation uses the new key version.
Expected: GCP automatically re-encrypts new data with the rotated key. Existing data remains accessible (GCP retains previous key versions). In-flight RevRec AI requests complete successfully. Key rotation event appears in Cloud Audit Logs. Monitoring shows no latency spike or error rate increase during the rotation window. CMEK dashboard shows both old and new key versions active.
Chaos Experiment 06
Full environment rebuild from Terraform state
Quarterly: destroy the entire staging environment (terraform destroy) and rebuild it from scratch (terraform apply) using only the Terraform state file from GCS and the container images from Artifact Registry. Measure total rebuild time and verify all integration tests pass post-rebuild.
Expected: Complete environment rebuild in under 45 minutes. All Cloud Run services healthy within 3 minutes of apply completing. Firestore data restored from PITR backup within 10 minutes. All integration tests pass. This experiment validates the DR runbook and the infrastructure-as-code principle — if the environment cannot be rebuilt from code, the IaC is incomplete.
On-Call Escalation Path
P0 — Immediate page
Audit trail write failure · RevRec AI HITL creation failure · VPC-SC breach · SCC CRITICAL finding
P1 — Page within 15 min
HITL SLA breach > 2 modules simultaneously · FinRisk streaming lag > 10 min · Circuit breaker open on Orchestrator
P2 — Notify within 1 hour
Error budget burn rate > 14.4× · Asset IQ daily run delayed > 2h · SCC HIGH finding · Model drift alert
P3 — Next business day
Budget alert at 80% · CCAI Sales Agent elevated latency · GreenOps scheduling missed window · SCC MEDIUM finding
Infrastructure Design Decisions

Seven questions a senior GCP architect will ask — answered in advance.

These are the gaps a Google Cloud architect or SRE probes in a design review. Each answer below is documented because the absence of it suggests the design hasn't been thought through to production depth.

Decision 01 — Terraform State Bucket IAM
The state bucket is the highest-value attack target in the infrastructure — it is protected accordingly
Write access to the Terraform state file is equivalent to full infrastructure control — an attacker who can write to state can inject arbitrary resources or exfiltrate every secret referenced in outputs. The state bucket IAM has exactly two principals with write access: the Cloud Build SA (for pipeline runs only, scoped to the specific bucket path) and a break-glass admin SA that requires MFA re-authentication for every use and generates a Cloud Audit Log CRITICAL entry on activation. No human principal has direct bucket write access in normal operations. Object versioning is enabled — any state corruption is recoverable to any point within the 90-day retention window. The state bucket lives in a separate GCP project from the workload project, isolated by a distinct VPC-SC perimeter.
Decision 02 — Terraform State Locking
Concurrent applies are prevented by GCS native locking — no custom lock management required
The GCS Terraform backend provides native state locking — when a terraform apply begins, it writes a lock file alongside the state object using a generation precondition, so the lock is acquired atomically. Any concurrent apply attempt fails immediately with a lock error and the conflicting pipeline is halted before it modifies any resources. Cloud Build pipelines that target the same Terraform workspace are serialised via a Cloud Build trigger concurrency limit of 1 — a second trigger fires only after the first completes or fails. Lock acquisition and release events are written to Cloud Audit Logs. In the event of a failed apply that leaves a stale lock (e.g. a Cloud Build runner crash), the lock can be force-released (terraform force-unlock) only by the break-glass admin SA — not by any pipeline SA.
Decision 03 — Secret Rotation Without Redeployment
Rotated secrets are picked up by running services on the next request — no redeployment required
All Cloud Run services resolve secrets at request time via the Secret Manager API, referencing the latest version alias rather than a pinned version number. (Environment-variable secret references are deliberately avoided for rotated credentials: Cloud Run resolves those only at instance startup, so a running instance would hold the old value until it was recycled.) When the Salesforce OAuth token rotation function writes a new secret version to Secret Manager, the next request from any Cloud Run service resolves latest to the new version automatically. No redeployment, no restart, no configuration change. The rotation function (Cloud Scheduler → Cloud Functions → Secret Manager) runs on a 30-day cadence for Salesforce tokens, immediately on demand for any secret flagged by the security posture scan. Each rotation event is logged to Cloud Audit Logs and triggers a Secret Manager rotation notification to the on-call channel. The old secret version is disabled (not deleted) for 7 days after rotation — enabling rollback if the new credential is rejected.
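The rotation schedule and its notification hook can be declared on the secret itself. A sketch under assumed names (secret ID, topic name); Secret Manager publishes a rotation event to the topic, which triggers the rotation function:

```hcl
# Pub/Sub topic that the rotation Cloud Function subscribes to.
resource "google_pubsub_topic" "secret_rotation" {
  name = "secret-rotation-events" # assumption: topic name
}

# Salesforce OAuth token, pinned to europe-west3, rotated every 30 days.
resource "google_secret_manager_secret" "sf_oauth" {
  secret_id = "salesforce-oauth-token" # assumption: secret id

  replication {
    user_managed {
      replicas {
        location = "europe-west3" # data residency: Frankfurt only
      }
    }
  }

  rotation {
    rotation_period    = "2592000s" # 30 days
    next_rotation_time = "2025-01-01T00:00:00Z"
  }

  # Rotation events are published here; the topic triggers the rotation function.
  topics {
    name = google_pubsub_topic.secret_rotation.id
  }
}
```

The user_managed replication block doubles as a residency control: the secret payload physically exists only in europe-west3, consistent with IP-02.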
Decision 04 — VPC-SC and Salesforce Outbound Traffic
VPC-SC restricts GCP managed service APIs — it does not block outbound internet traffic to Salesforce
VPC Service Controls operates at the GCP API layer — it restricts cross-project and external calls to managed services like BigQuery, Firestore, and Vertex AI. It does not intercept or block outbound TCP traffic from Cloud Run to the public internet. Salesforce REST API calls exit the VPC via Cloud NAT (which provides outbound internet access from private Cloud Run instances without assigning public IPs) and reach Salesforce's servers directly. The Cloud NAT egress IP range is static and pre-registered with Salesforce's IP allowlist. A VPC firewall egress policy restricts outbound HTTPS from the agent subnet to the published Salesforce API ranges — all other outbound internet traffic is denied by a default-deny egress rule. (Cloud Armor is an ingress WAF attached to the load balancer; it plays no role in egress control.) This separation is intentional: VPC-SC protects ClaraVis's data within GCP; Cloud NAT static IPs plus firewall egress rules protect the outbound channel to Salesforce.
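The egress policy pair reads naturally in Terraform: a low-priority default deny, and a higher-priority allow scoped to the Salesforce ranges. Network name and the destination CIDR are illustrative assumptions (the real rule would reference Salesforce's published IP ranges):

```hcl
# Default: deny all egress from the agent subnet. Lowest practical priority.
resource "google_compute_firewall" "deny_all_egress" {
  name      = "agent-deny-all-egress"
  network   = "shared-vpc" # assumption: shared VPC network name
  direction = "EGRESS"
  priority  = 65534

  deny {
    protocol = "all"
  }
  destination_ranges = ["0.0.0.0/0"]
}

# Allow HTTPS egress to the pre-registered Salesforce API ranges only.
resource "google_compute_firewall" "allow_salesforce_egress" {
  name      = "agent-allow-salesforce-egress"
  network   = "shared-vpc"
  direction = "EGRESS"
  priority  = 1000 # evaluated before the default deny

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }
  destination_ranges = ["13.108.0.0/14"] # assumption: Salesforce published range
}
```

Because firewall rules match on IP ranges rather than hostnames, the allow rule must track Salesforce's published ranges; hostname-level egress filtering, if ever required, would be a job for Secure Web Proxy rather than firewall rules.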
Decision 05 — Local Terraform Apply Prevention
IAM design prevents any developer from running terraform apply against production — Cloud Build SA is the only principal with apply permissions
Developer GCP credentials are granted read-only access to the production project (roles/viewer + specific read roles for debugging). The IAM roles required to provision or modify infrastructure — roles/compute.admin, roles/iam.securityAdmin, VPC-SC admin — are bound exclusively to the Cloud Build SA. A developer who clones the repo and runs terraform apply locally with their own credentials will receive PERMISSION_DENIED on the first resource that requires elevated roles. This is enforced structurally by IAM — not by policy, convention, or honour system. Developers have full apply access to the dev environment, where a separate SA with equivalent permissions exists. The separation of dev vs prod SA permissions is provisioned in Terraform and enforced by the Org Policy constraint on the prod project folder.
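The structural enforcement is just IAM bindings declared in Terraform. A sketch with assumed project IDs, group, and SA emails — the point is the asymmetry: developers get viewer on prod, and only the Cloud Build SA holds apply-level roles:

```hcl
# Developers: read-only on the production project.
resource "google_project_iam_member" "dev_viewer" {
  project = "claravis-prod"                     # assumption: prod project id
  member  = "group:developers@claravis.example" # assumption: developer group
  role    = "roles/viewer"
}

# Cloud Build SA: the only principal with infrastructure-modifying roles.
resource "google_project_iam_member" "cloudbuild_compute_admin" {
  project = "claravis-prod"
  member  = "serviceAccount:cloudbuild@claravis-prod.iam.gserviceaccount.com"
  role    = "roles/compute.admin"
}

resource "google_project_iam_member" "cloudbuild_security_admin" {
  project = "claravis-prod"
  member  = "serviceAccount:cloudbuild@claravis-prod.iam.gserviceaccount.com"
  role    = "roles/iam.securityAdmin"
}
```

A developer running terraform apply locally authenticates as themselves, plans successfully against read access, and fails with PERMISSION_DENIED on the first create or update call.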
Decision 06 — Terraform Module Versioning
Shared modules are version-pinned — a change to the networking module cannot silently affect the compute module
Each internal Terraform module is tagged in the Git repository using semantic versioning (e.g. networking/v1.3.0). Consuming modules reference a specific tag via the Git source syntax: source = "git::https://github.com/...//modules/networking?ref=v1.3.0". A module update requires a version tag bump and a PR review before any consuming module can reference the new version. The root module's .terraform.lock.hcl file pins provider versions with hash verification — provider upgrades are explicit, reviewable, and auditable. This makes the entire dependency graph reproducible: given the same lock file and the same tagged module versions, terraform init always produces an identical provider and module set — a property the quarterly full environment rebuild drill (Chaos Experiment 06) explicitly tests.
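In the root module this looks like the following sketch. The repository path is a hypothetical stand-in (the real repo path is elided above), and the provider version range is an assumption:

```hcl
# Module pinned to an immutable Git tag; bumping ref requires a PR review.
module "networking" {
  source = "git::https://github.com/claravis/infra.git//modules/networking?ref=v1.3.0" # assumption: repo path

  # ... module inputs ...
}

# Provider versions are pinned here; exact versions and hashes are recorded
# in .terraform.lock.hcl by `terraform providers lock`.
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.30" # assumption: pinned range
    }
  }
}
```

Committing .terraform.lock.hcl to the repository is what makes the rebuild drill meaningful: terraform init verifies provider hashes against the lock file rather than trusting whatever the registry serves that day.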
Decision 07 — Binary Authorization Attestation Chain & Break-Glass
Attestation model, signing key custody, and the emergency bypass procedure — specified before the first deployment
The Binary Authorization attestation chain has two paths. The standard path: Cloud Build SA signs each container image digest using a Cloud KMS asymmetric key (EC P-256) during the Gate 03 step. The attestation is stored in Container Analysis and verified by the Binary Authorization policy on every Cloud Run and GKE deployment — unsigned images are rejected at the API layer before the container ever starts. The break-glass path: a secondary attestor SA exists for emergency hotfixes that cannot wait for the full CI/CD pipeline. Access to the break-glass attestor requires MFA re-authentication, generates a Security Command Center CRITICAL finding within 60 seconds of use, and mandates a post-incident review ticket within 24 hours. The break-glass key is a separate Cloud KMS asymmetric key with its own key ring, accessible only to two named security leads. Every break-glass use is logged to an immutable Cloud Audit Log bucket with 7-year retention — satisfying the FDA change control audit trail requirement for emergency software changes.
Attestation Chain Summary
Standard path: Cloud Build SA → KMS sign (EC P-256) → Container Analysis attestation → Binary Auth policy enforces on deploy
Key custody: Standard key — Cloud Build SA only. Break-glass key — 2 named security leads, MFA-gated, separate key ring
Break-glass trigger: SCC CRITICAL finding within 60s · mandatory post-incident review within 24h · immutable audit log 7yr retention
FDA 21 CFR 820: Break-glass audit trail satisfies emergency software change control documentation requirement
Unsigned image policy: DENY — no exceptions. Images without a valid attestation are rejected at the API layer before scheduling.
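The DENY policy in the summary above maps to a Binary Authorization policy resource. A minimal sketch with assumed project and note names; the break-glass attestor would be a second attestor resource with its own KMS key ring, omitted here:

```hcl
# Attestor backed by the Container Analysis note that Cloud Build writes to.
resource "google_binary_authorization_attestor" "standard" {
  name    = "cloud-build-attestor"
  project = "claravis-prod" # assumption: project id

  attestation_authority_note {
    note_reference = "projects/claravis-prod/notes/build-attestation" # assumption: note name
  }
}

# Default rule: block and audit-log any image without a valid attestation.
resource "google_binary_authorization_policy" "policy" {
  project = "claravis-prod"

  default_admission_rule {
    evaluation_mode  = "REQUIRE_ATTESTATION"
    enforcement_mode = "ENFORCED_BLOCK_AND_AUDIT_LOG"
    require_attestations_by = [
      google_binary_authorization_attestor.standard.name,
    ]
  }

  global_policy_evaluation_mode = "ENABLE"
}
```

ENFORCED_BLOCK_AND_AUDIT_LOG is what realises "rejected at the API layer": the deployment call fails, and the denial is visible in Cloud Audit Logs before any container is scheduled.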
Architecture Decision Records

Three infrastructure decisions. Every alternative documented.

ADR-013 through ADR-015 cover the key infrastructure architecture choices. Each was made after evaluating the alternatives — the reasoning is documented here because it is the reasoning a Google Cloud architect or SRE will probe.

ADR-013
Shared VPC over separate VPCs per module
Separate VPCs per module (one per AE module) were evaluated as an isolation pattern. Rejected because: (1) VPC peering between 8 module VPCs creates an O(n²) peering mesh that becomes operationally complex and approaches GCP peering limits. (2) Shared VPC centralises network management in the host project — module service accounts cannot modify network topology, which strengthens the security posture. (3) Private Service Connect endpoints are provisioned once in the shared VPC and consumed by all modules — vs. 8 separate PSC configurations. The security isolation between modules is maintained at the IAM and VPC-SC layer, not the network boundary layer. Modules in the same VPC but different subnets cannot communicate without explicit firewall rules — the default-deny-all ingress policy enforces this.
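The Shared VPC pattern from ADR-013 reduces to three resource types: a host project owning the network, service-project attachments (one per AE module), and the default-deny ingress rule. A sketch with assumed project and network names:

```hcl
# Host project owns the shared network and all topology.
resource "google_compute_shared_vpc_host_project" "host" {
  project = "claravis-network-host" # assumption: host project id
}

# Each AE module's project attaches as a service project; module SAs
# consume subnets but cannot modify topology.
resource "google_compute_shared_vpc_service_project" "agent_module" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = "claravis-agent-module" # assumption: one per AE module
}

# Default-deny-all ingress: cross-subnet traffic between modules requires
# an explicit higher-priority allow rule.
resource "google_compute_firewall" "deny_all_ingress" {
  name      = "deny-all-ingress"
  project   = "claravis-network-host"
  network   = "shared-vpc" # assumption: network name
  direction = "INGRESS"
  priority  = 65534

  deny {
    protocol = "all"
  }
  source_ranges = ["0.0.0.0/0"]
}
```

Centralising the firewall rules in the host project is the concrete form of the security claim in the ADR: module service accounts hold no roles in the host project, so network topology is out of their reach by construction.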
Accepted · Phase Infra Design
ADR-014
Cloud Armor over third-party WAF (Cloudflare, Imperva)
Third-party WAF solutions were evaluated — Cloudflare and Imperva both offer mature OWASP protection and DDoS mitigation. Rejected because: (1) Third-party WAFs route traffic outside the GCP network before it reaches Cloud Load Balancer — this means traffic traverses a third-party's infrastructure before entering the VPC-SC perimeter. In a ClaraVis data residency context, any traffic routing outside GCP before the perimeter is a data sovereignty risk that requires legal review. (2) Cloud Armor integrates natively with Cloud Load Balancer and Security Command Center — findings from Cloud Armor violations appear in the SCC dashboard alongside infrastructure findings. (3) Cloud Armor's Adaptive Protection (ML-based DDoS detection) is included in the Cloud Armor Enterprise tier — comparable capability to third-party solutions without the data routing concern.
Accepted · Phase Infra Design
ADR-015
GKE Autopilot over GKE Standard for batch ML workloads
GKE Standard was evaluated for the batch ML training workloads submitted by Vertex AI Pipelines. Rejected in favour of GKE Autopilot because: (1) GKE Standard requires node pool management — selecting machine types, managing node lifecycle, patching OS images. Autopilot abstracts all of this — Vertex AI Pipelines submits pod specs, GKE Autopilot provisions the right nodes. (2) Autopilot's per-pod billing model is more cost-efficient for intermittent batch workloads — Standard charges for node runtime regardless of whether pods are scheduled. (3) Autopilot automatically applies GKE hardening (no SSH access, enforced container security context, workload identity pre-configured). For a portfolio-scale deployment where the ML engineer is also the infra engineer, Autopilot's managed approach is the correct tradeoff — operational simplicity over fine-grained control.
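The operational simplicity argument in ADR-015 is visible in how little Terraform the cluster needs. A sketch with assumed names; there are no node pool resources because Autopilot provisions nodes from the pod specs Vertex AI Pipelines submits:

```hcl
# Autopilot cluster: no node pools to declare, per-pod billing,
# GKE hardening (no SSH, Workload Identity) applied by default.
resource "google_container_cluster" "batch_ml" {
  name             = "batch-ml-autopilot" # assumption: cluster name
  project          = "claravis-prod"      # assumption: project id
  location         = "europe-west3"
  enable_autopilot = true

  release_channel {
    channel = "REGULAR"
  }
}
```

An equivalent GKE Standard configuration would add one or more google_container_node_pool resources plus machine-type, autoscaling, and upgrade settings per pool — exactly the management surface the ADR chose to eliminate.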
Accepted · Phase Infra Design
Next in the Portfolio
Infrastructure designed.
GTM strategy follows.

Six design phases complete. Page 08 answers the commercial question: how does the AE reach market? Buyer journey mapping, value proposition by persona, adoption phasing, and commercial model options — the architecture expressed as a go-to-market strategy.

PG 08
Go-to-Market Strategy
Buyer journey · Value proposition · Adoption phasing · Commercial model
In Design
PG 06
← ML Engineering & MLOps
The ML platform this infrastructure runs