Infrastructure & GCP Architecture — Page 07

Infrastructure Design Principles

Six principles. The rules every resource must satisfy.

These principles are derived from the Architecture Principles established in TOGAF Preliminary Phase (Page 03). They are the engineering expression of those principles — specific enough to evaluate any infrastructure decision against.

IP-01

Every resource is provisioned by Terraform — no manual state

Every GCP resource — VPC, subnet, IAM binding, Cloud Run service, Firestore database, BigQuery dataset — is declared in Terraform and provisioned by terraform apply. No manual console changes. If a resource cannot be expressed in Terraform, it is a design gap, not a configuration exception.

P-08 · FDA change control · ISO 27001

IP-02

Data residency is enforced by Organisation Policy — not application config

All resources are constrained to europe-west3 (Frankfurt) by an Organisation Policy constraint applied at the GCP organisation root. Application code cannot override this. New resources deployed to any other region will fail with a policy violation before they are created.

P-06 · GDPR · C-04

IP-03

Zero-trust: no implicit trust based on network location

BeyondCorp enforces identity verification at every access request. Service accounts use Workload Identity Federation — no key files exist anywhere in the codebase or the infrastructure state. Every service-to-service call is authenticated. The VPC is a deployment boundary, not a trust boundary.

P-11 · ISO 27001 · AR-07

IP-04

ClaraVis holds the encryption keys — Google does not

All storage resources (BigQuery, Firestore, GCS, Cloud Run data at rest) use Customer-Managed Encryption Keys via Cloud KMS. ClaraVis holds key custody. Key rotation policy: 90 days. In the event of a contract termination, ClaraVis can revoke key access and all data becomes cryptographically inaccessible.

P-11 · GDPR Art. 17 · ISO 27001

IP-05

Cost allocation is tagged from first terraform apply

Every resource carries four mandatory labels: module (which AE module), env (dev/staging/prod), cost-centre (ClaraVis internal code), and data-classification (public/internal/confidential/restricted). Labels enforced by a Terraform module-level validation. Resources without required labels fail plan validation before they reach apply.

P-12 · FinOps standard

IP-06

Security posture is continuously validated — not point-in-time audited

Security Command Center Premium is enabled. Every infrastructure change triggers a security posture scan. Findings above HIGH severity block the Cloud Build deployment pipeline. The security posture dashboard is a live view of the infrastructure state, not a monthly report.

P-11 · ISO 27001 · FDA software validation

Terraform IaC

Module structure and production HCL.

The infrastructure is organised into six Terraform modules. The root module wires them together and passes shared variables. Three HCL snippets below show the most architecturally significant resources — the ones a Google Cloud architect will ask to see first.

Repository Structure

infra/terraform/

├── root/

│ ├── main.tf — module composition

│ ├── variables.tf — region, project, env

│ ├── outputs.tf

│ └── backend.tf — GCS state backend

├── modules/networking/

│ ├── vpc.tf — shared VPC + subnets

│ ├── vpc_sc.tf — VPC-SC perimeter

│ ├── firewall.tf

│ └── cloud_armor.tf

├── modules/security/

│ ├── iam.tf — SA + WIF bindings

│ ├── kms.tf — CMEK key rings

│ ├── secret_manager.tf

│ └── org_policy.tf — region constraint

├── modules/compute/

│ ├── cloud_run.tf — agent services

│ └── gke_autopilot.tf — batch ML

├── modules/data/

│ ├── bigquery.tf — datasets + CMEK

│ ├── firestore.tf

│ ├── pubsub.tf — topics + subscriptions

│ └── gcs.tf — contract store bucket

└── modules/monitoring/

├── slo.tf — SLO definitions

├── alerts.tf — budget + drift alerts

└── dashboards.tf

VPC-SC Perimeter

Workload Identity

Cloud Run Service

CMEK Key

# modules/networking/vpc_sc.tf
# VPC Service Controls perimeter — enforces data residency
# and prevents exfiltration outside europe-west3

resource "google_access_context_manager_service_perimeter" "claravis_ae" {
  parent = "accessPolicies/${var.access_policy_id}"
  name   = "accessPolicies/${var.access_policy_id}/servicePerimeters/claravis_ae_perimeter"
  title  = "ClaraVis AE Production Perimeter"

  spec {
    resources = ["projects/${var.project_number}"]

    restricted_services = [
      "bigquery.googleapis.com",
      "firestore.googleapis.com",
      "storage.googleapis.com",
      "aiplatform.googleapis.com",
      "run.googleapis.com",
      "pubsub.googleapis.com",
      "secretmanager.googleapis.com",
      "cloudkms.googleapis.com",
    ]

    access_levels = [
      google_access_context_manager_access_level.claravis_corp_devices.name,
    ]

    vpc_accessible_services {
      enable_restriction = true
      allowed_services   = ["RESTRICTED-SERVICES"]
    }

    ingress_policies {
      ingress_from {
        identity_type = "SERVICE_ACCOUNT"
        identities = [
          "serviceAccount:${var.orchestrator_sa}",
          "serviceAccount:${var.cloud_build_sa}",
        ]
      }
      ingress_to {
        resources = ["*"]
        operations { service_name = "*" }
      }
    }
  }

  use_explicit_dry_run_spec = false
}

# modules/security/iam.tf
# Workload Identity Federation — no service account key files
# Cloud Run services authenticate via WIF, not key files

resource "google_service_account" "contractguard_sa" {
  project      = var.project_id
  account_id   = "contractguard-sa"
  display_name = "ContractGuard Agent Service Account"
  description  = "Least-privilege SA for ContractGuard Cloud Run service. No key files created."
}

# Workload Identity binding — Cloud Run service → SA
resource "google_service_account_iam_member" "contractguard_wif" {
  service_account_id = google_service_account.contractguard_sa.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[${var.namespace}/contractguard]"
}

# Minimal permissions — only what ContractGuard needs
resource "google_project_iam_member" "contractguard_permissions" {
  for_each = toset([
    "roles/datastore.user",           # Firestore read/write
    "roles/bigquery.dataEditor",       # BigQuery audit writes
    "roles/storage.objectViewer",      # GCS contract bucket read
    "roles/secretmanager.secretAccessor", # Secret Manager read
    "roles/aiplatform.user",           # Vertex AI endpoint invoke
    "roles/cloudkms.cryptoKeyDecrypter", # CMEK decrypt (own key ring only)
  ])
  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.contractguard_sa.email}"
}

# Org policy: deny SA key creation — enforced at org root
resource "google_org_policy_policy" "deny_sa_key_creation" {
  name   = "organizations/${var.org_id}/policies/iam.disableServiceAccountKeyCreation"
  parent = "organizations/${var.org_id}"
  spec {
    rules { enforce { enforce = true } }
  }
}

# modules/compute/cloud_run.tf
# ContractGuard agent service — stateless, VPC-native, CMEK

resource "google_cloud_run_v2_service" "contractguard" {
  project  = var.project_id
  location = var.region   # europe-west3 — enforced by org policy
  name     = "contractguard-${var.env}"

  template {
    service_account = google_service_account.contractguard_sa.email

    scaling {
      min_instance_count = 0   # scale-to-zero in non-prod
      max_instance_count = 10
    }

    vpc_access {
      connector = google_vpc_access_connector.ae_connector.id
      egress    = "PRIVATE_RANGES_ONLY"
    }

    containers {
      image = "${var.region}-docker.pkg.dev/${var.project_id}/ae-agents/contractguard:${var.image_tag}"

      resources {
        limits = {
          cpu    = "2"
          memory = "4Gi"
        }
        cpu_idle          = true   # CPU only allocated during request
        startup_cpu_boost = true
      }

      env {
        name = "PROJECT_ID"
        value = var.project_id
      }
      env {
        name = "SFDC_TOKEN_SECRET"
        value_source {
          secret_key_ref {
            secret  = google_secret_manager_secret.sfdc_oauth_token.secret_id
            version = "latest"
          }
        }
      }
    }

    labels = {
      module              = "contractguard"
      env                 = var.env
      cost-centre         = var.cost_centre
      data-classification = "confidential"
    }
  }

  traffic {
    type    = "TRAFFIC_TARGET_ALLOCATION_TYPE_LATEST"
    percent = 100
  }
}

# modules/security/kms.tf
# CMEK key ring and keys — ClaraVis holds custody
# Rotation: 90 days. Google cannot access encrypted data.

resource "google_kms_key_ring" "claravis_ae" {
  project  = var.project_id
  name     = "claravis-ae-keyring"
  location = var.region   # europe-west3 — keys stay in EU
}

resource "google_kms_crypto_key" "bigquery_cmek" {
  name     = "bigquery-cmek"
  key_ring = google_kms_key_ring.claravis_ae.id

  rotation_period = "7776000s"  # 90 days in seconds

  version_template {
    algorithm        = "GOOGLE_SYMMETRIC_ENCRYPTION"
    protection_level = "HSM"    # Hardware Security Module
  }

  labels = {
    module              = "platform"
    data-classification = "restricted"
  }
}

resource "google_kms_crypto_key" "firestore_cmek" {
  name     = "firestore-cmek"
  key_ring = google_kms_key_ring.claravis_ae.id
  rotation_period = "7776000s"
  version_template {
    algorithm        = "GOOGLE_SYMMETRIC_ENCRYPTION"
    protection_level = "HSM"
  }
}

# BigQuery dataset with CMEK
resource "google_bigquery_dataset" "audit" {
  project     = var.project_id
  dataset_id  = "ae_audit"
  location    = var.region
  description = "Immutable audit trail — all agent actions, HITL events, SHAP explanations"

  default_encryption_configuration {
    kms_key_name = google_kms_crypto_key.bigquery_cmek.id
  }

  labels = {
    module              = "platform"
    env                 = var.env
    data-classification = "restricted"
    cost-centre         = var.cost_centre
  }
}

Network Topology

Shared VPC. Three subnets. One perimeter.

The network design follows the GCP shared VPC pattern — a single host project owns the network, service projects host the workloads. Three subnets map to the three compute layers: agents, data/ML, and infrastructure. The VPC-SC perimeter wraps the entire project boundary and prevents data exfiltration regardless of application behaviour.

Network Topology — Shared VPC · VPC-SC · Private Service Connect

europe-west3 (Frankfurt) · all resources · VPC-SC perimeter · BeyondCorp access · Cloud Armor WAF

Agent subnet (ae-agents)

Data/ML subnet (ae-data-ml)

Infra subnet (ae-infra)

VPC-SC perimeter

External systems

Security Architecture

Zero-trust enforced at every layer — not just the perimeter.

The CISO requirement from Page 02 (S-09) is satisfied structurally — not by policy. Every security control below is enforced by infrastructure code, not by operational procedure. A misconfigured application cannot bypass these controls because they are not application-level configurations.

Identity & Access

Zero-trust IAM — no implicit trust

One service account per agent — minimum required permissions only (least privilege)

Workload Identity Federation — no service account key files exist anywhere in the codebase

Org Policy: iam.disableServiceAccountKeyCreation enforced at organisation root

BeyondCorp IAP — all access to management interfaces requires device trust verification + MFA

Org Policy: iam.allowedPolicyMemberDomains — only ClaraVis identities can be bound to IAM roles

All IAM bindings managed exclusively by Terraform — no console grants

Data Protection

CMEK + VPC-SC — dual-layer data protection

CMEK on all storage: BigQuery, Firestore, GCS, Cloud Run — ClaraVis holds key custody via Cloud KMS HSM

Key rotation: 90 days — automated via Terraform rotation_period configuration

VPC-SC perimeter — all API calls to restricted services must originate from within the perimeter

Org Policy: gcp.resourceLocations constrains all resources to europe-west3/europe-west4 only

DLP API — automatic PII detection on data entering the GCS contract store bucket

All data in transit: TLS 1.3 minimum — enforced by Cloud Load Balancer SSL policy

Network Security

Defence in depth — four network layers

Cloud Armor WAF: OWASP CRS rules + rate limiting + IP allowlist for Salesforce egress NAT IP

VPC firewall: default-deny-all ingress — only explicitly allowed traffic is permitted

Private Google Access: all managed service API calls via internal IP (PSC) — no public internet

Cloud NAT: outbound internet access (Salesforce API) via NAT — agents have no public IP

Shared VPC: network management centralised in host project — service projects cannot modify network

VPC Flow Logs: enabled on all subnets — retained 30 days in Cloud Logging

Secrets Management

Secret Manager — no hardcoded credentials

All secrets in Secret Manager — Salesforce OAuth tokens, any third-party API keys

Secrets referenced via Cloud Run valueSource.secretKeyRef — never injected as plain env vars in Terraform

Secret access via service account IAM binding — secretmanager.secretAccessor role per SA, per secret

Secret rotation: Salesforce OAuth tokens rotated every 30 days via Cloud Scheduler + Cloud Functions

Pre-commit hook: detect-secrets scanner runs on every git commit — fails on any credential pattern

Audit log: every secret access logged to Cloud Audit Logs — alerts on anomalous access patterns

Posture Management

Security Command Center — continuous validation

Security Command Center Premium — enabled at organisation level

Every Terraform apply triggers a post-deployment posture scan

HIGH and CRITICAL findings block the Cloud Build deployment pipeline — ADR-014

Container images scanned by Artifact Analysis before deployment — known CVEs blocked

Binary Authorization — only images signed by Cloud Build SA are deployable to Cloud Run / GKE

Vulnerability scanning: OS and application layer — automated weekly on all running container images

Audit & Compliance

Immutable audit — every action logged

Cloud Audit Logs: DATA_READ, DATA_WRITE, ADMIN_ACTIVITY enabled on all services — retained 400 days

Audit logs exported to BigQuery via Log Sink — cross-queryable with application audit trail

Log bucket: locked (WORM) — audit logs cannot be deleted or modified for 400 days

HITL event records in Firestore: immutable by write pattern — no update or delete operations permitted

Compliance dashboard: Forseti Security / Config Validator checks Terraform state against CIS GCP Benchmark

ISO 27001 evidence: Cloud Audit Logs + Security Command Center findings exportable as compliance evidence package

Compute Architecture

Cloud Run for agents. GKE Autopilot for batch ML.

The compute split between Cloud Run and GKE Autopilot is ADR-003 from Page 03 — restated here with the full operational rationale. Cloud Run handles the stateless, request-driven agent workloads. GKE Autopilot handles the batch, long-running ML training jobs that benefit from GPU access and longer execution windows.

Service	Compute Platform	Scaling	Resources	Justification	Cost Model
CCAI Sales Agent	Cloud Run v2	0–10 instances · request-driven	2 vCPU · 4Gi · CPU idle	Stateless conversational handler. Request duration: 1–8s. Scale-to-zero eliminates idle cost. Session state in Firestore — no affinity required.	Per-request billing · ~€0.08/1K req
ContractGuard	Cloud Run v2	0–10 instances · request-driven	4 vCPU · 8Gi · startup boost	Long-running requests (Gemini 1M token analysis: 30–90s). Startup CPU boost reduces cold start latency. Higher memory for Gemini response buffering.	Per-request · ~€0.40/analysis
RevRec AI Agent	Cloud Run v2	0–10 instances · request-driven	2 vCPU · 4Gi	Stateless classification handler. Request duration: 2–5s (XGBoost inference + SHAP). Scale-to-zero acceptable — RevRec runs on contract signed events, not continuously.	Per-request · ~€0.12/classification
Asset IQ Agent	Cloud Run v2 (inference) + GKE (batch feature eng.)	0–5 Cloud Run · GKE: scheduled jobs	2 vCPU · 4Gi (CR) · n2-standard-4 (GKE)	RUL inference: Cloud Run (request-driven, short). Daily feature engineering over 12,000 units: GKE Autopilot (batch, long-running, benefit from parallelism).	Mixed: per-request + batch pod billing
FinRisk Sentinel	Cloud Run v2 · always-on	1–5 instances · min 1 (streaming)	2 vCPU · 4Gi	Streaming anomaly detection requires a persistent connection to BigQuery streaming inserts. Min 1 instance to eliminate cold start latency on financial event processing.	Always-on: ~€45/month per instance
Orchestrator	Cloud Run v2	0–20 instances · request-driven	2 vCPU · 4Gi	A2A message handler. Short requests (dispatch + state write). Highest instance count — the Orchestrator fans out to multiple agents simultaneously and must not become a bottleneck.	Per-request · ~€0.06/dispatch
ML Training (all models)	GKE Autopilot	0–N pods · job-scoped	n1-standard-8 + A100 GPU (training) · n1-standard-4 (evaluation)	ADR-015. Vertex AI Pipelines submits training jobs to GKE Autopilot. A100 GPU required for Gemini embedding computation (ContractGuard feature engineering). Scale-to-zero between pipeline runs.	Pod billing · GPU: ~€2.80/hour · billed only during training

CI/CD Pipeline

Seven steps. Three gates. One Cloud Build pipeline.

Every code change — agent code, Terraform IaC, ML pipeline definition — passes through the same Cloud Build pipeline. Three gates enforce quality, security, and architectural compliance before anything reaches production. The FDA 21 CFR 820 change control requirement is satisfied by the mandatory Terraform plan review gate and the Binary Authorization signing step.

STEP 01

Source Trigger

GitHub webhook · PR merge to main

STEP 02

Test Suite

pytest · coverage ≥ 80% · contract tests

GATE 01

Security Scan

detect-secrets · Trivy CVE · SCC findings

STEP 03

Container Build

Docker build · Artifact Registry push

GATE 02

Terraform Plan

terraform plan · infracost · manual review gate for infra PRs

STEP 04

Deploy Staging

terraform apply · Cloud Run staging · integration tests

GATE 03

Promote Production

Binary Authorization · signed image · manual approval for prod

Cloud Build Pipeline — Trigger → Test → Build → Gate → Deploy

FDA 21 CFR 820 change control: Terraform plan review (Gate 02) + Binary Authorization signing (Gate 03) create the auditable change record

Operations

SLOs. FinOps. GreenOps. DR. Chaos engineering.

Operational maturity is not an afterthought in this architecture — it is provisioned in the same Terraform run as the infrastructure. SLO definitions, budget alerts, carbon-aware scheduling, and DR configuration are all infrastructure-as-code artifacts.

SLO / SLI Definitions — per module

Module	SLI (what is measured)	SLO Target	Error Budget (30d)	Alert Threshold	Burn Rate Alert
CCAI Sales Agent	% requests returning 2xx within 8s	99.5% availability	3.6 hours	P99 latency > 8s for 5 min · error rate > 1% for 10 min	14.4× over 1h
ContractGuard	% analyses completing without error within 120s	99.0% availability	7.2 hours	Analysis error rate > 2% for 15 min · P95 latency > 90s	6× over 6h
RevRec AI	% classifications completing + SHAP generated within 10s	99.9% availability	43 minutes	Any classification error · SHAP generation failure · SAP write failure	36× over 1h — page immediately
Asset IQ	% daily prediction runs completing within 2h window	99.0% availability	7.2 hours	Daily run misses 2h window · prediction error rate > 5%	6× over 6h
FinRisk Sentinel	% financial events scored within 5 minutes of ingestion	99.5% availability	3.6 hours	Streaming lag > 5 min · anomaly scoring failure rate > 1%	14.4× over 1h
HITL Framework	% HITL checkpoints created and presented within 60s of trigger	99.9% availability	43 minutes	HITL creation failure · SLA breach rate > 5% across all checkpoints	36× over 1h — page immediately
Audit Trail	% agent actions with audit record committed within 2s	99.99% availability	4 minutes	Any audit write failure — critical · immediate page	Any failure = immediate page

FinOps — Cost Allocation

Four mandatory labels. Tagged from day one.

Label

module

contractguard · revrec-ai · asset-iq · finrisk · ccai-sales · platform · greenops · strategy-dashboard

Label

env

dev · staging · prod

Label

cost-centre

ClaraVis internal code · maps to finance cost centre in SAP

Label

data-classification

public · internal · confidential · restricted

Budget alerts: 50% · 80% · 100% of monthly budget per module. Alert to ClaraVis finance contact + Cloud Build notification. Resources without required labels fail terraform plan validation.

GreenOps — Carbon-Aware Scheduling

Compute-intensive workloads scheduled to low-carbon windows.

The GreenOps module (Page 04, FRD-06) integrates with the infrastructure layer via Cloud Scheduler and the GCP Carbon Footprint API. ML training jobs submitted to GKE Autopilot via Vertex AI Pipelines are deferred to grid-carbon-optimal windows within a ±6 hour scheduling flexibility window. ESG metrics written to BigQuery for EU CSRD reporting.

Scheduling flexibility: ±6 hours for batch ML training · ±2 hours for daily Asset IQ feature engineering

Carbon intensity source: GCP Carbon Footprint API · europe-west3 grid signal

ESG output: Monthly BigQuery report → Looker Studio dashboard → EU CSRD Scope 3 evidence

Hard limit: RevRec AI and FinRisk Sentinel are never deferred — financial latency SLOs take precedence

Disaster Recovery — RTO / RPO Targets

Tier 1 — Critical (Revenue + Compliance)

RevRec AI · HITL Framework · Audit Trail

These three components directly affect EU AI Act compliance and financial posting integrity. Any downtime creates a compliance gap. Backed by Cloud Spanner (HITL) in multi-region active-active configuration for the HITL audit store. BigQuery audit dataset replicated to europe-west4 as a secondary region. RevRec AI inference is stateless — Cloud Run redeploys in under 3 minutes.

RTO 15 minutes

RPO 0 (multi-region replication)

Backup Continuous · Firestore PITR 7 days

Tier 2 — Important (Operations)

Asset IQ · FinRisk Sentinel · ContractGuard

Important operational modules. Downtime creates business impact but not immediate compliance exposure. Asset IQ runs on a daily cadence — a 4-hour RTO is acceptable (next daily run is the recovery). FinRisk Sentinel has a 5-minute streaming SLO — recovery within 1 hour restores monitoring coverage. ContractGuard is document-processing driven — queued documents can be re-processed after recovery.

RTO 1–4 hours

RPO 1 hour (Firestore PITR)

Backup Hourly exports to GCS

Tier 3 — Standard (Sales + Reporting)

CCAI Sales Agent · Strategy Dashboard · GreenOps

Business-important but not operationally critical. CCAI Sales Agent downtime routes inbound inquiries directly to Account Executives (the pre-AE process). Strategy Dashboard is read-only BigQuery — restores on next Cloud Build deploy. GreenOps scheduling defers to non-carbon-aware schedule during recovery window.

RTO 4–8 hours

RPO 24 hours

Backup Daily GCS snapshots

Infrastructure Recovery

Full Environment Rebuild from Terraform State

The entire GCP environment is reproducible from the Terraform state file stored in a CMEK-encrypted GCS bucket in a separate GCP project. A full environment rebuild from terraform apply is tested quarterly as part of the chaos engineering programme. Estimated rebuild time: 45 minutes for the full production environment. The Terraform state bucket is the single most critical recovery artifact — it has its own versioning, Object Lock (WORM), and cross-project replication.

Full rebuild ~45 minutes

State bucket WORM + versioning + cross-project replica

Test cadence Quarterly full rebuild drill

Chaos Engineering — Scheduled Failure Drills

Chaos Experiment 01

Specialist agent failure mid-task

Inject a forced error response from the ContractGuard Cloud Run service after 3 successful clause analyses. The Orchestrator has dispatched a 12-clause contract analysis — 3 succeed, then the agent returns a 500 error.

Expected: Orchestrator opens circuit breaker after 3rd failure. Task routes to HITL fallback. Finance Controller receives manual review request with the 3 completed clause analyses and a note that automated analysis failed for clauses 4–12. Partial work preserved in Firestore. No data loss. Circuit half-opens after 30s.

Chaos Experiment 02

Firestore write failure during HITL creation

Block Firestore writes for the HITL event collection for 60 seconds during an active RevRec AI classification run. The model has completed inference and SHAP computation — the HITL-04 creation write fails.

Expected: RevRec AI agent enters ERROR state. SAP write is never initiated (SAP write requires HITL record ID as mandatory parameter — the write is physically impossible without it). RevRec AI agent retries HITL creation 3 times with exponential backoff. On third failure: task fails, Finance Controller receives manual classification alert. Audit record of the attempted HITL creation is preserved in Cloud Logging even if Firestore write failed.

Chaos Experiment 03

VPC-SC perimeter breach attempt

From inside the VPC, attempt to write a BigQuery record to a dataset in a different GCP project outside the VPC-SC perimeter. Simulate a misconfigured application that attempts to exfiltrate data to an external BigQuery project.

Expected: VPC-SC perimeter blocks the BigQuery API call with a PERMISSION_DENIED error before it reaches the destination project. The attempted write is logged to Cloud Audit Logs as a VPC-SC violation. Security Command Center generates a HIGH finding. Cloud Build pipeline deployment is blocked until the finding is resolved and acknowledged. No data is exfiltrated.

Chaos Experiment 04

HITL SLA timeout — escalation path test

Create a HITL-04 (RevRec AI Finance Controller review) and then simulate the Finance Controller taking no action for 4 hours and 5 minutes — 5 minutes past the 4-hour SLA defined in the HITL specification.

Expected: Cloud Scheduler timeout job fires at t+4h. Escalation to CFO triggered automatically. A second HITL checkpoint is created for the CFO with the original classification, SHAP explanation, and a note that the FC SLA was exceeded. The timeout event is written to the Firestore HITL record as an immutable escalation event. EU AI Act compliance record shows the escalation — the SLA breach is documented, not hidden.

Chaos Experiment 05

CMEK key rotation — zero-downtime verification

Trigger a manual CMEK key rotation for the BigQuery CMEK key during active RevRec AI classification traffic. Verify that ongoing classifications are not interrupted and that new data written after rotation uses the new key version.

Expected: GCP automatically re-encrypts new data with the rotated key. Existing data remains accessible (GCP retains previous key versions). In-flight RevRec AI requests complete successfully. Key rotation event appears in Cloud Audit Logs. Monitoring shows no latency spike or error rate increase during the rotation window. CMEK dashboard shows both old and new key versions active.

Chaos Experiment 06

Full environment rebuild from Terraform state

Quarterly: destroy the entire staging environment (terraform destroy) and rebuild it from scratch (terraform apply) using only the Terraform state file from GCS and the container images from Artifact Registry. Measure total rebuild time and verify all integration tests pass post-rebuild.

Expected: Complete environment rebuild in under 45 minutes. All Cloud Run services healthy within 3 minutes of apply completing. Firestore data restored from PITR backup within 10 minutes. All integration tests pass. This experiment validates the DR runbook and the infrastructure-as-code principle — if the environment cannot be rebuilt from code, the IaC is incomplete.

On-Call Escalation Path

P0 — Immediate page

Audit trail write failure · RevRec AI HITL creation failure · VPC-SC breach · SCC CRITICAL finding

P1 — Page within 15 min

HITL SLA breach > 2 modules simultaneously · FinRisk streaming lag > 10 min · Circuit breaker open on Orchestrator

P2 — Notify within 1 hour

Error budget burn rate > 14.4× · Asset IQ daily run delayed > 2h · SCC HIGH finding · Model drift alert

P3 — Next business day

Budget alert at 80% · CCAI Sales Agent elevated latency · GreenOps scheduling missed window · SCC MEDIUM finding

Infrastructure Design Decisions

Seven questions a senior GCP architect will ask — answered in advance.

These are the gaps a Google Cloud architect or SRE probes in a design review. Each answer below is documented because the absence of it suggests the design hasn't been thought through to production depth.

Decision 01 — Terraform State Bucket IAM

The state bucket is the highest-value attack target in the infrastructure — it is protected accordingly

Write access to the Terraform state file is equivalent to full infrastructure control — an attacker who can write to state can inject arbitrary resources or exfiltrate every secret referenced in outputs. The state bucket IAM has exactly two principals with write access: the Cloud Build SA (for pipeline runs only, scoped to the specific bucket path) and a break-glass admin SA that requires MFA re-authentication for every use and generates a Cloud Audit Log CRITICAL entry on activation. No human principal has direct bucket write access in normal operations. Object versioning is enabled — any state corruption is recoverable to any point within the 90-day retention window. The state bucket lives in a separate GCP project from the workload project, isolated by a distinct VPC-SC perimeter.

Decision 02 — Terraform State Locking

Concurrent applies are prevented by GCS native locking — no custom lock management required

The GCS Terraform backend provides native state locking via object metadata — when a terraform apply begins, it acquires an exclusive lock on the state object. Any concurrent apply attempt fails immediately with a lock error and the conflicting pipeline is halted before it modifies any resources. Cloud Build pipelines that target the same Terraform workspace are serialised via a Cloud Build trigger concurrency limit of 1 — a second trigger fires only after the first completes or fails. Lock acquisition and release events are written to Cloud Audit Logs. In the event of a failed apply that leaves a stale lock (e.g. Cloud Build runner crash), the lock can be force-released only by the break-glass admin SA — not by any pipeline SA.

Decision 03 — Secret Rotation Without Redeployment

Rotated secrets are picked up by running services on the next request — no redeployment required

All Cloud Run services reference secrets via valueSource.secretKeyRef.version: latest — not a pinned version number. When the Salesforce OAuth token rotation function writes a new secret version to Secret Manager, the next request from any Cloud Run service that calls Secret Manager resolves latest to the new version automatically. No redeployment, no restart, no configuration change. The rotation function (Cloud Scheduler → Cloud Functions → Secret Manager) runs on a 30-day cadence for Salesforce tokens, immediately on demand for any secret flagged by the security posture scan. Each rotation event is logged to Cloud Audit Logs and triggers a Secret Manager rotation notification to the on-call channel. The old secret version is disabled (not deleted) for 7 days after rotation — enabling rollback if the new credential is rejected.

Decision 04 — VPC-SC and Salesforce Outbound Traffic

VPC-SC restricts GCP managed service APIs — it does not block outbound internet traffic to Salesforce

VPC Service Controls operates at the GCP API layer — it restricts cross-project and external calls to services like BigQuery, Firestore, and Vertex AI. It does not intercept or block outbound TCP traffic from Cloud Run to the public internet. Salesforce REST API calls exit the VPC via Cloud NAT (which provides outbound internet access from private Cloud Run instances without assigning public IPs) and reach Salesforce's servers directly. The Salesforce NAT egress IP range is static and pre-registered with Salesforce's IP allowlist. Cloud Armor's WAF policy includes an egress rule that restricts outbound HTTP/S traffic to the Salesforce API hostname only — all other outbound internet traffic from the agent subnet is blocked at the firewall level. This separation is intentional: VPC-SC protects ClaraVis's data within GCP; Cloud Armor + firewall rules protect the outbound channel to Salesforce.

Decision 05 — Local Terraform Apply Prevention

IAM design prevents any developer from running terraform apply against production — Cloud Build SA is the only principal with apply permissions

Developer GCP credentials are granted read-only access to the production project (roles/viewer + specific read roles for debugging). The IAM roles required to provision or modify infrastructure — roles/compute.admin, roles/iam.securityAdmin, VPC-SC admin — are bound exclusively to the Cloud Build SA. A developer who clones the repo and runs terraform apply locally with their own credentials will receive PERMISSION_DENIED on the first resource that requires elevated roles. This is enforced structurally by IAM — not by policy, convention, or honour system. Developers have full apply access to the dev environment, where a separate SA with equivalent permissions exists. The separation of dev vs prod SA permissions is provisioned in Terraform and enforced by the Org Policy constraint on the prod project folder.

Decision 06 — Terraform Module Versioning

Shared modules are version-pinned — a change to the networking module cannot silently affect the compute module

Each internal Terraform module is tagged in the Git repository using semantic versioning (e.g. networking/v1.3.0). Consuming modules reference a specific tag via the Git source syntax: source = "git::https://github.com/...//modules/networking?ref=v1.3.0". A module update requires a version tag bump and a PR review before any consuming module can reference the new version. The root module's terraform.lock.hcl file pins provider versions with hash verification — provider upgrades are explicit, reviewable, and auditable. This means the entire dependency graph is reproducible from the lock file alone: given the same lock file and the same tagged module versions, terraform init always produces an identical provider and module set — a property the quarterly full environment rebuild drill (Chaos Experiment 06) explicitly tests.

Decision 07 — Binary Authorization Attestation Chain & Break-Glass

Attestation model, signing key custody, and the emergency bypass procedure — specified before the first deployment

The Binary Authorization attestation chain has two paths. The standard path: Cloud Build SA signs each container image digest using a Cloud KMS asymmetric key (EC P-256) during the Gate 03 step. The attestation is stored in Container Analysis and verified by the Binary Authorization policy on every Cloud Run and GKE deployment — unsigned images are rejected at the API layer before the container ever starts. The break-glass path: a secondary attestor SA exists for emergency hotfixes that cannot wait for the full CI/CD pipeline. Access to the break-glass attestor requires MFA re-authentication, generates a Security Command Center CRITICAL finding within 60 seconds of use, and mandates a post-incident review ticket within 24 hours. The break-glass key is a separate Cloud KMS asymmetric key with its own key ring, accessible only to two named security leads. Every break-glass use is logged to an immutable Cloud Audit Log bucket with 7-year retention — satisfying the FDA change control audit trail requirement for emergency software changes.

Attestation Chain Summary

Standard path: Cloud Build SA → KMS sign (EC P-256) → Container Analysis attestation → Binary Auth policy enforces on deploy

Key custody: Standard key — Cloud Build SA only. Break-glass key — 2 named security leads, MFA-gated, separate key ring

Break-glass trigger: SCC CRITICAL finding within 60s · mandatory post-incident review within 24h · immutable audit log 7yr retention

FDA 21 CFR 820: Break-glass audit trail satisfies emergency software change control documentation requirement

Unsigned image policy: DENY — no exceptions. Images without a valid attestation are rejected at the API layer before scheduling.

Architecture Decision Records

Three infrastructure decisions. Every alternative documented.

ADR-013 through ADR-015 cover the key infrastructure architecture choices. Each was made after evaluating the alternatives — the reasoning is documented here because it is the reasoning a Google Cloud architect or SRE will probe.

ADR-013

Shared VPC over separate VPCs per module

Separate VPCs per module (one per AE module) were evaluated as an isolation pattern. Rejected because: (1) VPC peering between 8 module VPCs creates an O(n²) peering mesh that becomes operationally complex and approaches GCP peering limits. (2) Shared VPC centralises network management in the host project — module service accounts cannot modify network topology, which strengthens the security posture. (3) Private Service Connect endpoints are provisioned once in the shared VPC and consumed by all modules — vs. 8 separate PSC configurations. The security isolation between modules is maintained at the IAM and VPC-SC layer, not the network boundary layer. Modules in the same VPC but different subnets cannot communicate without explicit firewall rules — the default-deny-all ingress policy enforces this.

Accepted · Phase Infra Design

ADR-014

Cloud Armor over third-party WAF (Cloudflare, Imperva)

Third-party WAF solutions were evaluated — Cloudflare and Imperva both offer mature OWASP protection and DDoS mitigation. Rejected because: (1) Third-party WAFs route traffic outside the GCP network before it reaches Cloud Load Balancer — this means traffic traverses a third-party's infrastructure before entering the VPC-SC perimeter. In a ClaraVis data residency context, any traffic routing outside GCP before the perimeter is a data sovereignty risk that requires legal review. (2) Cloud Armor integrates natively with Cloud Load Balancer and Security Command Center — findings from Cloud Armor violations appear in the SCC dashboard alongside infrastructure findings. (3) Cloud Armor's Adaptive Protection (ML-based DDoS detection) is included in the Cloud Armor Enterprise tier — comparable capability to third-party solutions without the data routing concern.

Accepted · Phase Infra Design

ADR-015

GKE Autopilot over GKE Standard for batch ML workloads

GKE Standard was evaluated for the batch ML training workloads submitted by Vertex AI Pipelines. Rejected in favour of GKE Autopilot because: (1) GKE Standard requires node pool management — selecting machine types, managing node lifecycle, patching OS images. Autopilot abstracts all of this — Vertex AI Pipelines submits pod specs, GKE Autopilot provisions the right nodes. (2) Autopilot's per-pod billing model is more cost-efficient for intermittent batch workloads — Standard charges for node runtime regardless of whether pods are scheduled. (3) Autopilot automatically applies GKE hardening (no SSH access, enforced container security context, workload identity pre-configured). For a portfolio-scale deployment where the ML engineer is also the infra engineer, Autopilot's managed approach is the correct tradeoff — operational simplicity over fine-grained control.

Accepted · Phase Infra Design