The Phase D reference architecture from Page 03, expressed as Terraform resources, security controls, operational procedures, and SLO definitions. Every GCP resource is provisioned by code. Every security control is enforced at the infrastructure layer. Nothing is configured manually.
These principles are derived from the Architecture Principles established in TOGAF Preliminary Phase (Page 03). They are the engineering expression of those principles — specific enough to evaluate any infrastructure decision against.
The infrastructure is organised into six Terraform modules. The root module wires them together and passes shared variables. The four HCL snippets below show the most architecturally significant resources — the ones a Google Cloud architect will ask to see first.
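A minimal sketch of the root module wiring. The variable names and module outputs here are illustrative assumptions, not the actual module interfaces; the repository path in the source string is elided as elsewhere on this page:

```hcl
# main.tf (root module) — illustrative wiring; variable names and
# outputs are assumptions, not the actual module interfaces
module "networking" {
  source           = "git::https://github.com/...//modules/networking?ref=v1.3.0" # org path elided
  project_id       = var.project_id
  region           = var.region
  access_policy_id = var.access_policy_id
}

module "security" {
  source     = "git::https://github.com/...//modules/security?ref=v1.1.0" # org path elided
  project_id = var.project_id
  org_id     = var.org_id
}

module "compute" {
  source     = "git::https://github.com/...//modules/compute?ref=v1.2.0" # org path elided
  project_id = var.project_id
  region     = var.region
  env        = var.env

  # shared state flows between modules as explicit outputs, never implicitly
  connector_id = module.networking.connector_id
  agent_sas    = module.security.agent_service_accounts
}
```

The design choice worth noting: modules share nothing implicitly. Every cross-module dependency is an explicit output-to-input edge, which keeps the dependency graph reviewable in a single file.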
```hcl
# modules/networking/vpc_sc.tf
# VPC Service Controls perimeter — enforces data residency
# and prevents exfiltration outside europe-west3

resource "google_access_context_manager_service_perimeter" "claravis_ae" {
  parent = "accessPolicies/${var.access_policy_id}"
  name   = "accessPolicies/${var.access_policy_id}/servicePerimeters/claravis_ae_perimeter"
  title  = "ClaraVis AE Production Perimeter"

  spec {
    resources = ["projects/${var.project_number}"]

    restricted_services = [
      "bigquery.googleapis.com",
      "firestore.googleapis.com",
      "storage.googleapis.com",
      "aiplatform.googleapis.com",
      "run.googleapis.com",
      "pubsub.googleapis.com",
      "secretmanager.googleapis.com",
      "cloudkms.googleapis.com",
    ]

    access_levels = [
      google_access_context_manager_access_level.claravis_corp_devices.name,
    ]

    vpc_accessible_services {
      enable_restriction = true
      allowed_services   = ["RESTRICTED-SERVICES"]
    }

    ingress_policies {
      ingress_from {
        identity_type = "SERVICE_ACCOUNT"
        identities = [
          "serviceAccount:${var.orchestrator_sa}",
          "serviceAccount:${var.cloud_build_sa}",
        ]
      }
      ingress_to {
        resources = ["*"]
        operations {
          service_name = "*"
        }
      }
    }
  }

  use_explicit_dry_run_spec = false
}
```
```hcl
# modules/security/iam.tf
# Workload Identity Federation — no service account key files
# Cloud Run services authenticate via WIF, not key files

resource "google_service_account" "contractguard_sa" {
  project      = var.project_id
  account_id   = "contractguard-sa"
  display_name = "ContractGuard Agent Service Account"
  description  = "Least-privilege SA for ContractGuard Cloud Run service. No key files created."
}

# Workload Identity binding — Cloud Run service → SA
resource "google_service_account_iam_member" "contractguard_wif" {
  service_account_id = google_service_account.contractguard_sa.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[${var.namespace}/contractguard]"
}

# Minimal permissions — only what ContractGuard needs
resource "google_project_iam_member" "contractguard_permissions" {
  for_each = toset([
    "roles/datastore.user",               # Firestore read/write
    "roles/bigquery.dataEditor",          # BigQuery audit writes
    "roles/storage.objectViewer",         # GCS contract bucket read
    "roles/secretmanager.secretAccessor", # Secret Manager read
    "roles/aiplatform.user",              # Vertex AI endpoint invoke
    "roles/cloudkms.cryptoKeyDecrypter",  # CMEK decrypt (own key ring only)
  ])

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.contractguard_sa.email}"
}

# Org policy: deny SA key creation — enforced at org root
resource "google_org_policy_policy" "deny_sa_key_creation" {
  name   = "organizations/${var.org_id}/policies/iam.disableServiceAccountKeyCreation"
  parent = "organizations/${var.org_id}"

  spec {
    rules {
      enforce = "TRUE" # boolean-constraint rules take the string "TRUE"
    }
  }
}
```
```hcl
# modules/compute/cloud_run.tf
# ContractGuard agent service — stateless, VPC-native, CMEK

resource "google_cloud_run_v2_service" "contractguard" {
  project  = var.project_id
  location = var.region # europe-west3 — enforced by org policy
  name     = "contractguard-${var.env}"

  template {
    service_account = google_service_account.contractguard_sa.email

    scaling {
      min_instance_count = 0 # scale-to-zero in non-prod
      max_instance_count = 10
    }

    vpc_access {
      connector = google_vpc_access_connector.ae_connector.id
      egress    = "PRIVATE_RANGES_ONLY"
    }

    containers {
      image = "${var.region}-docker.pkg.dev/${var.project_id}/ae-agents/contractguard:${var.image_tag}"

      resources {
        limits = {
          cpu    = "4"   # sized for long-running Gemini analyses
          memory = "8Gi" # headroom for Gemini response buffering
        }
        cpu_idle          = true # CPU only allocated during request
        startup_cpu_boost = true
      }

      env {
        name  = "PROJECT_ID"
        value = var.project_id
      }
      env {
        name = "SFDC_TOKEN_SECRET"
        value_source {
          secret_key_ref {
            secret  = google_secret_manager_secret.sfdc_oauth_token.secret_id
            version = "latest"
          }
        }
      }
    }

    labels = {
      module              = "contractguard"
      env                 = var.env
      cost-centre         = var.cost_centre
      data-classification = "confidential"
    }
  }

  traffic {
    type    = "TRAFFIC_TARGET_ALLOCATION_TYPE_LATEST"
    percent = 100
  }
}
```
```hcl
# modules/security/kms.tf
# CMEK key ring and keys — ClaraVis holds key custody
# Rotation: 90 days. Disabling a key renders the encrypted data inaccessible.

resource "google_kms_key_ring" "claravis_ae" {
  project  = var.project_id
  name     = "claravis-ae-keyring"
  location = var.region # europe-west3 — keys stay in EU
}

resource "google_kms_crypto_key" "bigquery_cmek" {
  name            = "bigquery-cmek"
  key_ring        = google_kms_key_ring.claravis_ae.id
  rotation_period = "7776000s" # 90 days in seconds

  version_template {
    algorithm        = "GOOGLE_SYMMETRIC_ENCRYPTION"
    protection_level = "HSM" # Hardware Security Module
  }

  labels = {
    module              = "platform"
    data-classification = "restricted"
  }
}

resource "google_kms_crypto_key" "firestore_cmek" {
  name            = "firestore-cmek"
  key_ring        = google_kms_key_ring.claravis_ae.id
  rotation_period = "7776000s"

  version_template {
    algorithm        = "GOOGLE_SYMMETRIC_ENCRYPTION"
    protection_level = "HSM"
  }
}

# BigQuery dataset with CMEK
resource "google_bigquery_dataset" "audit" {
  project     = var.project_id
  dataset_id  = "ae_audit"
  location    = var.region
  description = "Immutable audit trail — all agent actions, HITL events, SHAP explanations"

  default_encryption_configuration {
    kms_key_name = google_kms_crypto_key.bigquery_cmek.id
  }

  labels = {
    module              = "platform"
    env                 = var.env
    data-classification = "restricted"
    cost-centre         = var.cost_centre
  }
}
```
The network design follows the GCP shared VPC pattern — a single host project owns the network, service projects host the workloads. Three subnets map to the three compute layers: agents, data/ML, and infrastructure. The VPC-SC perimeter wraps the entire project boundary and prevents data exfiltration regardless of application behaviour.
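A sketch of that host-project layout. The network name, subnet names, and CIDR ranges below are assumptions for illustration; only the three-subnet split itself comes from the design:

```hcl
# modules/networking/vpc.tf — sketch; names and CIDR ranges are illustrative
resource "google_compute_network" "ae_vpc" {
  project                 = var.host_project_id
  name                    = "claravis-ae-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_shared_vpc_host_project" "host" {
  project = var.host_project_id
}

resource "google_compute_shared_vpc_service_project" "agents" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = var.agents_project_id
}

# One subnet per compute layer: agents, data/ML, infrastructure
resource "google_compute_subnetwork" "layer" {
  for_each = {
    agents  = "10.10.0.0/20"
    data-ml = "10.10.16.0/20"
    infra   = "10.10.32.0/20"
  }

  project                  = var.host_project_id
  name                     = "${each.key}-subnet"
  region                   = var.region
  network                  = google_compute_network.ae_vpc.id
  ip_cidr_range            = each.value
  private_ip_google_access = true # reach Google APIs without public egress
}
```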
The CISO requirement from Page 02 (S-09) is satisfied structurally — not by policy. Every security control below is enforced by infrastructure code, not by operational procedure. A misconfigured application cannot bypass these controls because they are not application-level configurations.
The compute split between Cloud Run and GKE Autopilot is ADR-003 from Page 03 — restated here with the full operational rationale. Cloud Run handles the stateless, request-driven agent workloads. GKE Autopilot handles the batch, long-running ML training jobs that benefit from GPU access and longer execution windows.
| Service | Compute Platform | Scaling | Resources | Justification | Cost Model |
|---|---|---|---|---|---|
| CCAI Sales Agent | Cloud Run v2 | 0–10 instances · request-driven | 2 vCPU · 4Gi · CPU idle | Stateless conversational handler. Request duration: 1–8s. Scale-to-zero eliminates idle cost. Session state in Firestore — no affinity required. | Per-request billing · ~€0.08/1K req |
| ContractGuard | Cloud Run v2 | 0–10 instances · request-driven | 4 vCPU · 8Gi · startup boost | Long-running requests (Gemini 1M token analysis: 30–90s). Startup CPU boost reduces cold start latency. Higher memory for Gemini response buffering. | Per-request · ~€0.40/analysis |
| RevRec AI Agent | Cloud Run v2 | 0–10 instances · request-driven | 2 vCPU · 4Gi | Stateless classification handler. Request duration: 2–5s (XGBoost inference + SHAP). Scale-to-zero acceptable — RevRec runs on contract signed events, not continuously. | Per-request · ~€0.12/classification |
| Asset IQ Agent | Cloud Run v2 (inference) + GKE (batch feature eng.) | 0–5 Cloud Run · GKE: scheduled jobs | 2 vCPU · 4Gi (CR) · n2-standard-4 (GKE) | RUL inference: Cloud Run (request-driven, short). Daily feature engineering over 12,000 units: GKE Autopilot (batch, long-running, benefit from parallelism). | Mixed: per-request + batch pod billing |
| FinRisk Sentinel | Cloud Run v2 · always-on | 1–5 instances · min 1 (streaming) | 2 vCPU · 4Gi | Streaming anomaly detection requires a persistent connection to BigQuery streaming inserts. Min 1 instance to eliminate cold start latency on financial event processing. | Always-on: ~€45/month per instance |
| Orchestrator | Cloud Run v2 | 0–20 instances · request-driven | 2 vCPU · 4Gi | A2A message handler. Short requests (dispatch + state write). Highest instance count — the Orchestrator fans out to multiple agents simultaneously and must not become a bottleneck. | Per-request · ~€0.06/dispatch |
| ML Training (all models) | GKE Autopilot | 0–N pods · job-scoped | n1-standard-8 + A100 GPU (training) · n1-standard-4 (evaluation) | ADR-015. Vertex AI Pipelines submits training jobs to GKE Autopilot. A100 GPU required for Gemini embedding computation (ContractGuard feature engineering). Scale-to-zero between pipeline runs. | Pod billing · GPU: ~€2.80/hour · billed only during training |
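The GKE Autopilot side of the split is a deliberately small resource: Autopilot has no node pool configuration, so GPU training pods request accelerators in the pod spec submitted by Vertex AI Pipelines, not in Terraform. A sketch, with the cluster name and release channel as assumptions:

```hcl
# modules/compute/gke.tf — sketch; name and channel are assumptions
resource "google_container_cluster" "ml_training" {
  project          = var.project_id
  name             = "ae-ml-training"
  location         = var.region # regional Autopilot cluster
  enable_autopilot = true

  release_channel {
    channel = "REGULAR"
  }
}
```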
Every code change — agent code, Terraform IaC, ML pipeline definition — passes through the same Cloud Build pipeline. Three gates enforce quality, security, and architectural compliance before anything reaches production. The FDA 21 CFR 820 change control requirement is satisfied by the mandatory Terraform plan review gate and the Binary Authorization signing step.
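The Binary Authorization half of that gate can be expressed in Terraform roughly as follows. The attestor name and the Container Analysis note variable are assumptions; the enforcement mode is the standard block-and-log setting:

```hcl
# Sketch: only images attested by the Cloud Build pipeline may deploy
resource "google_binary_authorization_policy" "policy" {
  project = var.project_id

  default_admission_rule {
    evaluation_mode  = "REQUIRE_ATTESTATION"
    enforcement_mode = "ENFORCED_BLOCK_AND_AUDIT_LOG"
    require_attestations_by = [
      google_binary_authorization_attestor.build.name,
    ]
  }
}

resource "google_binary_authorization_attestor" "build" {
  project = var.project_id
  name    = "cloud-build-attestor" # assumed name

  attestation_authority_note {
    note_reference = var.build_note_name # Container Analysis note, assumed variable
  }
}
```

An unsigned image, including one built on a developer laptop, is rejected at deploy time regardless of who submits it.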
Operational maturity is not an afterthought in this architecture — it is provisioned in the same Terraform run as the infrastructure. SLO definitions, budget alerts, carbon-aware scheduling, and DR configuration are all infrastructure-as-code artifacts.
| Module | SLI (what is measured) | SLO Target | Error Budget (30d) | Alert Threshold | Burn Rate Alert |
|---|---|---|---|---|---|
| CCAI Sales Agent | % requests returning 2xx within 8s | 99.5% availability | 3.6 hours | P99 latency > 8s for 5 min · error rate > 1% for 10 min | 14.4× over 1h |
| ContractGuard | % analyses completing without error within 120s | 99.0% availability | 7.2 hours | Analysis error rate > 2% for 15 min · P95 latency > 90s | 6× over 6h |
| RevRec AI | % classifications completing + SHAP generated within 10s | 99.9% availability | 43 minutes | Any classification error · SHAP generation failure · SAP write failure | 36× over 1h — page immediately |
| Asset IQ | % daily prediction runs completing within 2h window | 99.0% availability | 7.2 hours | Daily run misses 2h window · prediction error rate > 5% | 6× over 6h |
| FinRisk Sentinel | % financial events scored within 5 minutes of ingestion | 99.5% availability | 3.6 hours | Streaming lag > 5 min · anomaly scoring failure rate > 1% | 14.4× over 1h |
| HITL Framework | % HITL checkpoints created and presented within 60s of trigger | 99.9% availability | 43 minutes | HITL creation failure · SLA breach rate > 5% across all checkpoints | 36× over 1h — page immediately |
| Audit Trail | % agent actions with audit record committed within 2s | 99.99% availability | 4 minutes | Any audit write failure — critical · immediate page | Any failure = immediate page |
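The Audit Trail row above, expressed as a Terraform SLO. The monitoring service reference and the metric filters are assumptions here; the real filters depend on how audit writes are instrumented:

```hcl
# Sketch: 99.99% of audit writes committed within 2s, 30-day rolling window
resource "google_monitoring_slo" "audit_trail" {
  project      = var.project_id
  service      = var.audit_service_id # Cloud Monitoring service, assumed variable
  slo_id       = "audit-write-availability"
  display_name = "Audit record committed within 2s"

  goal                = 0.9999
  rolling_period_days = 30

  request_based_sli {
    good_total_ratio {
      good_service_filter  = var.audit_good_filter  # e.g. successful writes under 2s
      total_service_filter = var.audit_total_filter # all audit write attempts
    }
  }
}
```

Because the SLO is a resource, the burn-rate alert policies in the table can reference it by ID, and a change to any target is a reviewed Terraform diff rather than a console edit.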
These are the gaps a Google Cloud architect or SRE probes in a design review. Each answer below is documented because its absence would suggest the design has not been thought through to production depth.
**How do running services pick up a rotated secret?** valueSource.secretKeyRef.version: latest — not a pinned version number. When the Salesforce OAuth token rotation function writes a new secret version to Secret Manager, the next request from any Cloud Run service that calls Secret Manager resolves latest to the new version automatically. No redeployment, no restart, no configuration change. The rotation function (Cloud Scheduler → Cloud Functions → Secret Manager) runs on a 30-day cadence for Salesforce tokens, and immediately on demand for any secret flagged by the security posture scan. Each rotation event is logged to Cloud Audit Logs and triggers a Secret Manager rotation notification to the on-call channel. The old secret version is disabled (not deleted) for 7 days after rotation — enabling rollback if the new credential is rejected.

**Who can run terraform apply against production?** Only the Cloud Build service account. Developers hold read-only roles in the prod project (roles/viewer + specific read roles for debugging). The IAM roles required to provision or modify infrastructure — roles/compute.admin, roles/iam.securityAdmin, VPC-SC admin — are bound exclusively to the Cloud Build SA. A developer who clones the repo and runs terraform apply locally with their own credentials will receive PERMISSION_DENIED on the first resource that requires elevated roles. This is enforced structurally by IAM — not by policy, convention, or honour system. Developers have full apply access to the dev environment, where a separate SA with equivalent permissions exists. The separation of dev vs prod SA permissions is provisioned in Terraform and enforced by the Org Policy constraint on the prod project folder.

**How are module versions pinned?** Each module is tagged independently in Git (e.g. networking/v1.3.0). Consuming modules reference a specific tag via the Git source syntax: source = "git::https://github.com/...//modules/networking?ref=v1.3.0". A module update requires a version tag bump and a PR review before any consuming module can reference the new version. The root module's .terraform.lock.hcl file pins provider versions with hash verification — provider upgrades are explicit, reviewable, and auditable.
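The 30-day rotation cadence described above can also be declared on the secret itself; a sketch, assuming the Pub/Sub topic and timing variables (Secret Manager publishes the rotation notification to the topic, which triggers the rotation function):

```hcl
# Sketch: 30-day rotation schedule with Pub/Sub notification
resource "google_secret_manager_secret" "sfdc_oauth_token" {
  project   = var.project_id
  secret_id = "sfdc-oauth-token"

  replication {
    user_managed {
      replicas {
        location = var.region # keep the secret in europe-west3
      }
    }
  }

  rotation {
    rotation_period    = "2592000s"             # 30 days
    next_rotation_time = var.next_rotation_time # RFC3339 timestamp, assumed variable
  }

  topics {
    name = var.rotation_topic # Pub/Sub topic that triggers the rotation function
  }
}
```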
This means the entire dependency graph is reproducible from the lock file alone: given the same lock file and the same tagged module versions, terraform init always produces an identical provider and module set — a property the quarterly full environment rebuild drill (Chaos Experiment 06) explicitly tests.

ADR-013 through ADR-015 cover the key infrastructure architecture choices. Each was made after evaluating the alternatives — the reasoning is documented here because it is the reasoning a Google Cloud architect or SRE will probe.
Six design phases complete. Page 08 answers the commercial question: how does the AE reach market? Buyer journey mapping, value proposition by persona, adoption phasing, and commercial model options — the architecture expressed as a go-to-market strategy.