Agent Swarm Architecture — Page 05 · The Autonomous Enterprise

Swarm Topology

One orchestrator. Five specialists.
One shared context.

The AE agent swarm uses Google ADK for agent definition, A2A protocol for inter-agent communication, and MCP for tool access. The Orchestrator is the single point of task dispatch — it never executes business logic directly. Specialist agents are stateless and idempotent. All state lives in Firestore. All tool calls are audited.

[Figure: Agent Swarm, Full Topology. Shows the Orchestrator, 5 specialist agents, the tool layer, external systems, and A2A communication paths. Legend: Orchestrator (ADK) · Specialist agents · MCP tool layer (GCP) · External systems · A2A task dispatch · MCP tool call]

Orchestrator Agent

The dispatcher — never the executor.

The Orchestrator is the only agent that receives external requests. It never executes business logic directly. It decomposes tasks, routes sub-tasks to specialist agents via A2A, tracks task completion across the swarm, handles agent failures via circuit breaker, and maintains the global conversation context. It is the only agent with write access to the Orchestrator state collection in Firestore.

State Machine

IDLE — awaiting task dispatch

→ on: inbound A2A task message

DECOMPOSING — breaking task into sub-tasks

→ on: decomposition complete

DISPATCHING — sending sub-tasks to agents via A2A

→ on: all dispatches acknowledged

AWAITING — monitoring specialist agent completions

→ on: all sub-tasks complete / on: HITL pause received

HITL PAUSE — waiting for human approval on one or more sub-tasks

→ on: HITL approved / on: HITL rejected → ROLLBACK

CIRCUIT OPEN — specialist agent failed · fallback active

→ on: retry threshold exceeded

COMPLETE — all sub-tasks resolved · audit record committed
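The state list above can be read as a transition table. The following is a minimal sketch in Python, not the ADK implementation: the event names are hypothetical, and two edges the page does not spell out (where HITL approval resumes, and which state a specialist failure is reported from) are assumptions marked in the comments.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    DECOMPOSING = "DECOMPOSING"
    DISPATCHING = "DISPATCHING"
    AWAITING = "AWAITING"
    HITL_PAUSE = "HITL_PAUSE"
    ROLLBACK = "ROLLBACK"
    CIRCUIT_OPEN = "CIRCUIT_OPEN"
    COMPLETE = "COMPLETE"


# (current state, event) -> next state. Event names are illustrative only.
TRANSITIONS = {
    (State.IDLE, "task_received"): State.DECOMPOSING,
    (State.DECOMPOSING, "decomposition_complete"): State.DISPATCHING,
    (State.DISPATCHING, "all_dispatches_acked"): State.AWAITING,
    (State.AWAITING, "all_subtasks_complete"): State.COMPLETE,
    (State.AWAITING, "hitl_pause_received"): State.HITL_PAUSE,
    (State.AWAITING, "specialist_failed"): State.CIRCUIT_OPEN,  # assumed edge
    (State.HITL_PAUSE, "hitl_approved"): State.AWAITING,        # assumed target
    (State.HITL_PAUSE, "hitl_rejected"): State.ROLLBACK,
}


def step(state: State, event: str) -> State:
    """Return the next state, or raise if the event is not legal in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event!r}") from None
```

Keeping the table as data means an illegal transition fails loudly instead of silently moving the Orchestrator into an unspecified state.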

Tool Manifest

a2a.dispatch_task: Send task to specialist agent · returns task_id + ack

a2a.get_task_status: Poll specialist agent for task state · returns current FSM state

a2a.cancel_task: Cancel in-flight task · triggers specialist rollback state

firestore.write_orchestration_state: Persist orchestrator FSM state · atomic write with task context

firestore.write_audit_record: Immutable audit log · action_id · timestamp · agent_id · state

pubsub.publish_event: Broadcast cross-agent event · topic: orchestrator-events

hitl.create_checkpoint: Create HITL state node · links to HITL-spec checkpoint ID

hitl.await_decision: Block orchestrator until HITL resolution · timeout: per spec

Circuit Breaker Configuration

Failure threshold: 3 failures within 60s window

Open state duration: 30 seconds before half-open probe

Fallback behaviour: HITL escalation · route to human immediately

Audit action: Always · circuit events → Firestore

Agent Specifications

Five agents. Every state. Every tool. Every boundary.

Each specialist agent is defined by three things: its state machine (what states it can be in and what triggers each transition), its tool manifest (the exact MCP tools it is permitted to call), and its autonomy boundary (the line between what it does autonomously and what it escalates to a human). These are not descriptions — they are specifications.
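One way to make a tool manifest enforceable rather than descriptive is an allowlist check in front of every tool call. The sketch below is an assumption about how such a check might look: `TOOL_MANIFESTS` and `authorize_tool_call` are hypothetical names, the agent IDs are illustrative, and only a few of the tools listed on this page are included.

```python
from typing import Dict, FrozenSet

# Subset of the manifests listed on this page; a real registry would hold every tool.
TOOL_MANIFESTS: Dict[str, FrozenSet[str]] = {
    "orchestrator": frozenset(
        {"a2a.dispatch_task", "a2a.get_task_status", "hitl.create_checkpoint"}
    ),
    "revrec": frozenset(
        {"asc606_model.classify", "shap.explain", "sap.post_journal_entry"}
    ),
}


def authorize_tool_call(agent_id: str, tool_name: str) -> None:
    """Raise PermissionError unless the tool is in the agent's manifest."""
    allowed = TOOL_MANIFESTS.get(agent_id, frozenset())
    if tool_name not in allowed:
        raise PermissionError(f"{agent_id} is not permitted to call {tool_name}")
```

An unknown agent gets an empty manifest, so the check denies by default.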

Agent 01

CCAI Sales Agent

Multi-turn conversational agent handling inbound MRI inquiries. Manages qualification, configuration, and CPQ initiation autonomously through 11 turns before escalating to a human AE.

State Machine

IDLE

→ on: inbound inquiry event (Pub/Sub)

QUALIFYING — turn 1–4: budget, authority, need, timeline

→ on: qualification complete

CONFIGURING — turn 5–8: clinical requirements, MRI model fit, BOM

→ on: BOM validated

PRICING — turn 9–11: pricing estimate, delivery timeline, proposal draft

→ on: turn 11 reached OR commercial terms entered

HITL-01 — generating briefing doc · awaiting AE engagement

→ on: AE confirms engagement

HANDED OFF — Salesforce Opportunity created · audit record committed

→ on: agent failure at any state

CIRCUIT OPEN — escalate to VP Sales · preserve conversation state

Tool Manifest (MCP)

gemini.generate_response: Multi-turn conversation · system prompt: sales qualification playbook

salesforce.get_account: Lookup hospital account by domain · returns account ID + history

salesforce.create_opportunity: Create Opportunity on escalation · stage: Qualification · auto-populates from conversation context

salesforce.create_activity: Log every conversation turn as Activity on the Opportunity

product_catalogue.get_sku: Retrieve MRI model specs, configurations, and pricing tiers

bom.validate_configuration: Run BOM validation against applications engineering rules

document.generate_brief: Generate structured briefing document from conversation context

firestore.write_conversation_state: Persist turn-by-turn conversation state · enables resume after HITL

hitl.create_checkpoint: HITL-01 · present briefing doc to AE · await engagement confirmation

Autonomy Boundary & Thresholds

Qualification confidence threshold: ≥ 0.75 → auto

BOM validation required before pricing: Always

Escalation trigger (turn count): Turn 11

Commercial terms detected: Immediate HITL

Circuit breaker threshold: 3 failures / 60s

Conversation state TTL (Firestore): 7 days

Does autonomously:

Qualification questions (turns 1–4)

Clinical configuration matching against product catalogue

BOM validation and pricing estimate generation

Salesforce Opportunity creation and Activity logging

Briefing document generation

Escalates to a human:

Any discussion of commercial terms, discounts, or deal structure

Escalation to human AE (HITL-01 checkpoint)

Custom clinical configuration outside standard catalogue
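The turn ranges and escalation triggers above can be expressed as two small functions. This is a sketch under stated assumptions: the function names and the returned reason strings are hypothetical, and detection of commercial terms is taken as an input rather than implemented.

```python
from typing import Optional

ESCALATION_TURN = 11  # "Escalation trigger (turn count)" in the thresholds above


def phase_for_turn(turn: int) -> str:
    """Map a turn number to the conversational phase named in the state machine."""
    if 1 <= turn <= 4:
        return "QUALIFYING"
    if 5 <= turn <= 8:
        return "CONFIGURING"
    if 9 <= turn <= 11:
        return "PRICING"
    raise ValueError(f"turn out of range: {turn}")


def escalation_reason(turn: int, commercial_terms_detected: bool) -> Optional[str]:
    """Return why the agent must hand off at HITL-01, or None to continue."""
    if commercial_terms_detected:
        return "commercial_terms"  # immediate HITL at any turn
    if turn >= ESCALATION_TURN:
        return "turn_limit"
    return None
```

Checking commercial terms first reflects the "Immediate HITL" rule: it applies at any turn, not only once turn 11 is reached.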

Agent 02

ContractGuard Agent

Document-native contract intelligence agent. Reads full contracts via Gemini 1.5 Pro 1M token context, performs clause-level analysis, risk scores non-standard terms, and routes to Legal HITL before any counter-proposal is drafted.

State Machine

IDLE

→ on: contract uploaded to GCS (Pub/Sub trigger)

INGESTING — Document AI parsing · GCS → structured clause list

→ on: parse complete · clause count > 0

ANALYSING — Gemini 1.5 Pro full-document reasoning · clause classification

→ on: analysis complete

SCORING — risk model inference · SHAP attribution per flagged clause

→ on: risk scores above threshold detected

HITL-02 / HITL-03 — Legal review queue · awaiting approval per flagged clause

→ on: all HITL decisions received (approve / revise / escalate)

DRAFTING — generating counter-proposal based on HITL decisions

→ on: draft complete

COMPLETE — contract analysis + HITL record + counter-proposal committed

→ on: Document AI parse failure OR Gemini timeout

CIRCUIT OPEN — fallback to manual Legal review · preserve document state

Tool Manifest (MCP)

gcs.read_document: Read contract from GCS bucket · CMEK-encrypted · returns raw bytes

document_ai.parse_contract: Extract structured clause list with positions, types, and metadata

gemini.analyse_contract: Full 1M-token context pass · system prompt: clause classification + risk taxonomy

risk_model.score_clause: XGBoost clause risk model · returns risk score + SHAP attribution

vector_store.find_precedents: Find 3 most similar clauses from historical contract corpus · returns similarity scores

gemini.generate_counter: Draft counter-position for flagged clause based on HITL decision and ClaraVis standard terms

salesforce.update_contract: Write analysis results back to Salesforce Contract object · risk summary field

hitl.create_checkpoint: HITL-02 (risk clause) / HITL-03 (governing law) · present clause + SHAP + precedents

firestore.write_clause_analysis: Persist per-clause analysis, risk scores, SHAP values, and HITL decisions

Autonomy Boundary & Thresholds

Clause risk threshold → HITL: ≥ 0.65

Governing law non-standard: Always HITL-03

Liability cap ratio threshold: > 3× contract value

Gemini confidence (analysis): ≥ 0.80 → auto

Max contract size (tokens): 900K tokens

Circuit breaker threshold: 3 failures / 60s

Does autonomously:

Document AI parsing and clause extraction

Standard clause classification (200+ types)

Precedent search and similarity scoring

Risk scoring below HITL threshold

Escalates to a human:

Any clause with risk score ≥ 0.65 → HITL-02

All non-standard governing law clauses → HITL-03

Counter-proposal generation (requires approved HITL record first)

Contracts above 900K tokens → manual Legal review
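The routing rules above reduce to a per-clause decision plus a contract-level size check. The sketch below assumes a hypothetical `Clause` record and route labels; it takes risk scores as given rather than calling the risk model.

```python
from dataclasses import dataclass
from typing import List

RISK_THRESHOLD = 0.65          # clause risk threshold from the table above
MAX_CONTRACT_TOKENS = 900_000  # larger contracts go to manual Legal review


@dataclass
class Clause:
    clause_id: str
    risk_score: float
    is_governing_law: bool = False
    is_standard: bool = True


def route_clause(clause: Clause) -> str:
    """Pick the checkpoint for one clause, or AUTO if none is needed."""
    if clause.is_governing_law and not clause.is_standard:
        return "HITL-03"  # non-standard governing law: always reviewed
    if clause.risk_score >= RISK_THRESHOLD:
        return "HITL-02"
    return "AUTO"


def route_contract(token_count: int, clauses: List[Clause]) -> dict:
    """Route a whole contract; only flagged clauses appear in `checkpoints`."""
    if token_count > MAX_CONTRACT_TOKENS:
        return {"route": "MANUAL_LEGAL_REVIEW", "checkpoints": {}}
    flagged = {}
    for clause in clauses:
        decision = route_clause(clause)
        if decision != "AUTO":
            flagged[clause.clause_id] = decision
    return {"route": "HITL" if flagged else "AUTO", "checkpoints": flagged}
```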

Agent 03

RevRec AI Agent

ASC 606 / IFRS 15 revenue recognition classification agent. Classifies every MRI transaction as sale, lease, or multi-element arrangement. Every classification routes through Finance Controller HITL before posting to SAP. No exceptions.

State Machine

IDLE

→ on: contract signed event (Pub/Sub · Salesforce)

EXTRACTING — pulling contract line items, terms, and pricing from Salesforce

→ on: features extracted and validated

CLASSIFYING — ML model inference · ASC 606 rule engine · SHAP computation

→ on: classification complete · confidence ≥ minimum threshold

HITL-04 — Finance Controller review queue · classification + SHAP + comparables

→ on: HITL approved

POSTING — writing classification to Transaction entity · initiating SAP write

→ on: SAP write confirmed

COMPLETE — Transaction entity tagged · SAP posted · audit record committed

→ on: confidence < minimum threshold

HITL-09 — low confidence · manual classification requested

→ on: multi-element detected

HITL-05 — performance obligation split review

Tool Manifest (MCP)

salesforce.get_contract_details: Retrieve contract terms, line items, pricing, and customer type

feature_store.get_features: Retrieve pre-computed transaction features from Vertex AI Feature Store

asc606_model.classify: Run trained classification model · returns class + confidence + raw feature vector

shap.explain: Compute SHAP values for classification · returns top-5 feature attributions

bigquery.find_comparables: Find 3 most similar historical transactions by feature similarity

hitl.create_checkpoint: HITL-04 (standard) / HITL-05 (multi-element) / HITL-09 (low confidence)

bigquery.write_transaction: Write Transaction entity with recognition type + performance obligation tags

sap.post_journal_entry: Initiate SAP GL posting · requires HITL approval record ID as mandatory parameter

Autonomy Boundary & Thresholds

Minimum classification confidence: ≥ 0.70 required

HITL required for all classifications: Always

SAP write without HITL record: Blocked by design

Multi-element threshold: > 1 performance obligation

SHAP generation: Every inference

Circuit breaker threshold: 2 failures / 60s

Does autonomously:

Feature extraction from Salesforce contract

ASC 606 model inference and SHAP computation

Comparable transaction lookup and presentation

Escalates to a human:

Every classification without exception → HITL-04

SAP GL posting: only after HITL approval record committed

Multi-element splits → HITL-05

Low-confidence classifications → HITL-09 manual review
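"Blocked by design" can be made concrete by putting the guard in the posting function's signature. The sketch below is illustrative: `post_journal_entry` here is a local stand-in for the `sap.post_journal_entry` tool, not its real interface, and treating HITL-05 as an addition to HITL-04 (rather than a replacement) is an assumption the page does not settle.

```python
from typing import List, Optional

MIN_CONFIDENCE = 0.70  # minimum classification confidence from the table above


def required_checkpoints(confidence: float, performance_obligations: int) -> List[str]:
    """Every classification is reviewed; which checkpoint depends on the case."""
    if confidence < MIN_CONFIDENCE:
        return ["HITL-09"]  # low confidence: manual classification requested
    checkpoints = ["HITL-04"]  # Finance Controller review, always
    if performance_obligations > 1:
        checkpoints.append("HITL-05")  # assumed: split review in addition to HITL-04
    return checkpoints


def post_journal_entry(entry: dict, hitl_approval_record_id: Optional[str]) -> dict:
    """Stand-in for the SAP posting tool: refuses to run without an approval ID."""
    if not hitl_approval_record_id:
        raise PermissionError("SAP write blocked: no HITL approval record ID")
    return {**entry, "hitl_approval_record_id": hitl_approval_record_id, "status": "POSTED"}
```

Because the approval ID is a required argument, a code path that skips HITL cannot reach the irreversible write.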

Agent 04

Asset IQ Agent

Predictive maintenance intelligence agent. Processes unified asset telemetry from 12,000+ MRI units. Runs two-tier ML: fleet-level RUL prediction and unit-level anomaly detection. Routes below-threshold predictions to Field Service HITL.

State Machine

IDLE

→ on: scheduled cadence trigger (daily) OR asset event (Pub/Sub)

INGESTING — reading asset events from unified Pub/Sub pipeline

→ on: event batch assembled

FEATURE ENGINEERING — computing time-series features per unit

→ on: features computed and stored in Feature Store

RUL PREDICTION — fleet-level model inference · SHAP per unit

→ on: predictions complete

ANOMALY DETECTION — unit-level anomaly scan · cross-regional pattern detection

→ on: predictions above confidence → auto work order / below confidence → HITL

HITL-06 — low confidence prediction · Field Service Manager review

→ on: fleet anomaly pattern detected (cross-regional)

HITL-07 — fleet anomaly alert · VP Field Service + FSM review

→ on: all HITL decisions received

COMPLETE — work orders created · Device entities updated · ISO 13485 DHR written

Tool Manifest (MCP)

pubsub.subscribe_asset_events: Pull from unified asset telemetry topic · batch by device_id + time window

feature_store.write_features: Store time-series features per device for RUL model and drift monitoring

rul_model.predict: Run RUL gradient boosting model · returns days_to_failure + confidence + SHAP

anomaly_model.detect: Isolation Forest unit-level anomaly · returns anomaly score + contributing sensors

shap.explain_sensors: SHAP for sensor time-series features · top-3 sensor attribution per alert

bigquery.query_fleet_patterns: Cross-regional pattern query · find units with similar anomaly profiles

salesforce.create_work_order: Create preventive maintenance work order on Case object

bigquery.update_device: Write RUL score + last prediction timestamp to Device entity

hitl.create_checkpoint: HITL-06 (low confidence) / HITL-07 (fleet anomaly)

bigquery.write_dhr_event: ISO 13485 Device History Record event · maintenance activity log

Autonomy Boundary & Thresholds

RUL prediction confidence → auto work order: ≥ 0.82

RUL prediction confidence → HITL-06: < 0.82

Fleet anomaly (≥ N units): ≥ 3 units → HITL-07

Anomaly score threshold: ≥ 0.75

RUL alert horizon: < 14 days

Circuit breaker threshold: 3 failures / 120s

Does autonomously:

Feature engineering and Feature Store writes

RUL model inference and SHAP computation

Work orders for high-confidence (≥ 0.82) predictions

Device entity updates (RUL score, last prediction)

ISO 13485 DHR event writes

Escalates to a human:

Low-confidence predictions → HITL-06 (FSM approval)

Fleet-level anomaly patterns → HITL-07 (VP Field Service)

Any action that would trigger a potential recall review
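The two tiers above lead to two separate decisions: one per unit prediction, one across the fleet. The sketch below makes two simplifying assumptions that the page does not state: predictions at or beyond the 14-day horizon produce no action, and a fleet anomaly is counted as units whose anomaly score meets the threshold, ignoring profile similarity.

```python
from typing import Dict

RUL_CONFIDENCE_AUTO = 0.82     # at or above: auto work order
ALERT_HORIZON_DAYS = 14        # RUL alert horizon
ANOMALY_SCORE_THRESHOLD = 0.75
FLEET_ANOMALY_MIN_UNITS = 3    # at or above: HITL-07


def route_rul_prediction(days_to_failure: float, confidence: float) -> str:
    """Decide what happens to a single unit's RUL prediction."""
    if days_to_failure >= ALERT_HORIZON_DAYS:
        return "NO_ACTION"  # assumed behaviour outside the alert horizon
    if confidence >= RUL_CONFIDENCE_AUTO:
        return "AUTO_WORK_ORDER"
    return "HITL-06"


def fleet_anomaly_detected(anomaly_scores: Dict[str, float]) -> bool:
    """True when enough units are anomalous to warrant a HITL-07 fleet alert."""
    flagged = [d for d, s in anomaly_scores.items() if s >= ANOMALY_SCORE_THRESHOLD]
    return len(flagged) >= FLEET_ANOMALY_MIN_UNITS
```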

Agent 05

FinRisk Sentinel Agent

Real-time financial anomaly detection agent. Monitors the BigQuery financial event stream continuously. Detects unusual payment patterns, revenue posting discrepancies, and warranty reserve movements. Routes high-severity anomalies to CFO + Finance Controller HITL simultaneously.

State Machine

IDLE

→ on: financial event stream (BigQuery streaming insert)

MONITORING — continuous anomaly scan on incoming financial events

→ on: anomaly score above alert threshold

ENRICHING — computing Z-score vs 90-day baseline · SHAP attribution

→ on: severity classified

ALERTING — medium severity: Finance Controller notification + context package

→ on: high severity detected

HITL-08 — high severity · CFO + Finance Controller simultaneous HITL

→ on: HITL decision received (acknowledge / false positive / escalate)

LEARNING — false positive feedback written to baseline update queue

→ on: feedback processed

RESOLVED — anomaly record committed · decision logged · baseline updated

→ on: BigQuery streaming failure

CIRCUIT OPEN — alert ops team · switch to batch scan fallback

Tool Manifest (MCP)

bigquery.stream_financial_events: Subscribe to financial event stream · filter by event_type: payment, posting, reserve

anomaly_model.score_event: Isolation Forest + statistical anomaly scoring · returns anomaly score + contributing features

bigquery.compute_zscore: Compute Z-score vs 90-day rolling baseline for the event type and entity

shap.explain_anomaly: SHAP attribution for anomaly score · top-3 financial features

bigquery.get_entity_context: Retrieve full entity context (account, contract, recent transactions) for the anomaly

notification.send_alert: Send structured alert package to Finance Controller (medium) or CFO + FC (high)

hitl.create_checkpoint: HITL-08 · simultaneous CFO + Finance Controller · 1-hour SLA

bigquery.write_anomaly_record: Persist anomaly event, scores, SHAP, HITL decision, and resolution to audit dataset

pubsub.publish_baseline_update: Publish false positive feedback to baseline model update queue

Autonomy Boundary & Thresholds

Alert threshold (anomaly score): ≥ 0.65 → alert

HITL threshold (high severity): ≥ 0.85

Z-score alert threshold: ≥ 3.0σ

HITL SLA (high severity): 1 hour

Monitoring cadence: Streaming · sub-5min

Circuit breaker threshold: 5 failures / 120s

Does autonomously:

Continuous anomaly scoring on financial event stream

Z-score computation and SHAP attribution

Medium-severity alerts with context package (no HITL required)

False positive feedback processing to baseline queue

Escalates to a human:

High-severity anomalies (≥ 0.85) → HITL-08 simultaneous CFO + FC

Any anomaly indicating potential regulatory reporting obligation

Anomalies in warranty reserve: always HITL regardless of score
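The Z-score check and the severity routing above can be sketched in a few lines. Several details are assumptions: the baseline is passed in as a plain list (the page computes it over a 90-day window in BigQuery), the population standard deviation is used, the Z-score test is two-sided, and the page does not say how the Z-score and the anomaly score are combined, so they stay separate here.

```python
from statistics import mean, pstdev
from typing import Sequence

ALERT_THRESHOLD = 0.65  # anomaly score at or above: alert
HITL_THRESHOLD = 0.85   # anomaly score at or above: high severity
Z_ALERT = 3.0           # Z-score alert threshold, in standard deviations


def z_score(value: float, baseline: Sequence[float]) -> float:
    """Distance of `value` from the baseline mean, in baseline standard deviations."""
    sigma = pstdev(baseline)
    if sigma == 0:
        return 0.0  # flat baseline: no meaningful deviation can be computed
    return (value - mean(baseline)) / sigma


def exceeds_z_alert(z: float) -> bool:
    return abs(z) >= Z_ALERT  # assumed two-sided


def route_anomaly(anomaly_score: float, event_type: str) -> str:
    """Route a scored event; warranty reserve anomalies always go to a human."""
    if anomaly_score < ALERT_THRESHOLD:
        return "NO_ALERT"
    if event_type == "reserve" or anomaly_score >= HITL_THRESHOLD:
        return "HITL"
    return "ALERT_FINANCE_CONTROLLER"
```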

A2A Protocol

How agents communicate — precisely.

Agent-to-Agent (A2A) is the communication protocol between the Orchestrator and specialist agents. Every message is typed, versioned, and auditable. The sequence below shows a ContractGuard task dispatch and the HITL escalation that follows. The JSON schema below it is the actual message format.

A2A Sequence — ContractGuard Task: Clause Risk Escalation

Orchestrator dispatches task → ContractGuard analyses → HITL-02 pause → Legal approves → counter-proposal generated

A2A Message Schema — Task Dispatch (dispatch_task)

{
  "a2a_version": "1.0",
  "message_type": "TASK_DISPATCH",          // TASK_DISPATCH | TASK_ACK | TASK_UPDATE | TASK_COMPLETE | TASK_ERROR
  "task_id": "task_cg_20260315_001a",       // globally unique · format: task_{agent}_{date}_{seq}
  "correlation_id": "orch_20260315_042",     // orchestration session ID · links all sub-tasks
  "from_agent": "orchestrator",
  "to_agent": "contractguard",
  "timestamp_utc": "2026-03-15T09:14:32Z",
  "task_type": "CONTRACT_ANALYSIS",
  "priority": "NORMAL",                      // NORMAL | HIGH | CRITICAL
  "timeout_seconds": 3600,                  // 1 hour · circuit breaker triggers at 3 failures
  "payload": {
    "contract_id": "sfdc_contract_CV2026_0042",
    "gcs_uri": "gs://claravis-contracts-eu/2026/0042_uniklinik.pdf",
    "counterparty": "Universitätsklinikum München",
    "contract_value_eur": 2840000,
    "analysis_config": {
      "risk_threshold": 0.65,              // clauses above this score → HITL-02
      "governing_law_check": true,         // always trigger HITL-03 if non-standard
      "precedent_count": 3,                // number of similar precedents to surface in HITL
      "generate_counter": "post_hitl_approval"
    }
  },
  "audit": {
    "initiated_by": "orchestrator-sa@claravis-ae-prod.iam.gserviceaccount.com",
    "audit_trail_id": "audit_20260315_cg_001a",  // Firestore document ID · immutable
    "parent_hitl_ids": []                    // populated when this task is triggered by a HITL decision
  }
}
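A receiving agent would normally validate a dispatch message before acknowledging it. The sketch below is one possible validator, not part of the A2A specification: the required-field list is inferred from the sample above, and the `task_id` pattern is a guess at the `task_{agent}_{date}_{seq}` format.

```python
import re
from typing import List

PRIORITIES = {"NORMAL", "HIGH", "CRITICAL"}
REQUIRED_FIELDS = (
    "a2a_version", "message_type", "task_id", "correlation_id",
    "from_agent", "to_agent", "timestamp_utc", "task_type",
    "priority", "timeout_seconds", "payload", "audit",
)
# Assumed reading of task_{agent}_{date}_{seq}, e.g. task_cg_20260315_001a
TASK_ID_RE = re.compile(r"^task_[a-z]+_\d{8}_[0-9a-z]+$")


def validate_dispatch(msg: dict) -> List[str]:
    """Return a list of problems; an empty list means the message is acceptable."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in msg]
    if errors:
        return errors
    if msg["message_type"] != "TASK_DISPATCH":
        errors.append("message_type must be TASK_DISPATCH")
    if msg["priority"] not in PRIORITIES:
        errors.append(f"unknown priority: {msg['priority']}")
    if not TASK_ID_RE.match(msg["task_id"]):
        errors.append(f"malformed task_id: {msg['task_id']}")
    timeout = msg["timeout_seconds"]
    if not isinstance(timeout, int) or timeout <= 0:
        errors.append("timeout_seconds must be a positive integer")
    return errors
```

Returning every problem at once, rather than raising on the first, gives the audit record a complete picture of why a message was rejected.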

A2A Message Schema — HITL Update (task_update → HITL_PAUSE)

{
  "a2a_version": "1.0",
  "message_type": "TASK_UPDATE",
  "task_id": "task_cg_20260315_001a",
  "from_agent": "contractguard",
  "to_agent": "orchestrator",
  "timestamp_utc": "2026-03-15T09:42:18Z",
  "state": "HITL_PAUSE",
  "hitl_context": {
    "hitl_spec_id": "HITL-02",                  // references Page 04 HITL specification
    "hitl_event_id": "hitl_20260315_cg_007",      // Firestore document ID · immutable on creation
    "approver_role": "GENERAL_COUNSEL",
    "sla_deadline_utc": "2026-03-16T09:42:18Z",  // 24-hour SLA per HITL-02 spec
    "timeout_action": "ESCALATE_TO_GC_MANAGER",
    "presented_to_human": {
      "clause_text": "Liability limited to 50% of contract value...",
      "risk_score": 0.82,
      "shap_attribution": [
        { "feature": "liability_cap_ratio", "value": 0.5, "contribution": 0.31 },
        { "feature": "governing_law_match", "value": false, "contribution": 0.24 },
        { "feature": "indemnification_asymmetry", "value": 0.78, "contribution": 0.18 }
      ],
      "precedent_contracts": [
        { "id": "sfdc_contract_CV2024_0108", "similarity": 0.91, "outcome": "negotiated_up_to_80pct" },
        { "id": "sfdc_contract_CV2025_0033", "similarity": 0.87, "outcome": "accepted_with_carve_out" }
      ],
      "decision_options": ["APPROVE_AS_IS", "REQUEST_REVISION", "ESCALATE_EXTERNAL_COUNSEL"]
    }
  }
}
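The `sla_deadline_utc` and `timeout_action` fields imply a simple deadline check on the Orchestrator side. The following sketch shows how the 24-hour deadline in the sample could be derived and tested; the function names are hypothetical and the timestamp format is taken from the messages above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # format of the *_utc fields in the messages above


def parse_utc(ts: str) -> datetime:
    return datetime.strptime(ts, TS_FORMAT).replace(tzinfo=timezone.utc)


def sla_deadline(paused_at_utc: str, sla_hours: int = 24) -> str:
    """Deadline string for a HITL pause that started at `paused_at_utc`."""
    return (parse_utc(paused_at_utc) + timedelta(hours=sla_hours)).strftime(TS_FORMAT)


def timeout_action_due(update: dict, now: datetime) -> Optional[str]:
    """Return the configured timeout_action once the SLA has passed, else None."""
    ctx = update["hitl_context"]
    if now >= parse_utc(ctx["sla_deadline_utc"]):
        return ctx["timeout_action"]
    return None
```

With the sample's pause timestamp, `sla_deadline("2026-03-15T09:42:18Z")` yields the `2026-03-16T09:42:18Z` deadline shown in the message.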

Memory Architecture

Three-tier memory. Each tier with a purpose.

Agent memory is not a monolith. Short-term memory holds the context for the current task — it is ephemeral and task-scoped. Long-term memory holds the institutional knowledge that makes agents smarter over time — contract precedents, historical decisions, asset failure patterns. The shared context bus is the event stream that keeps all agents aware of what other agents are doing.

TIER 01

Short-Term Memory

Firestore · Task-scoped · TTL: 7 days

The working memory for a single agent task. Stores conversation turns (CCAI Sales), document analysis state (ContractGuard), classification context (RevRec), and sensor batch context (Asset IQ). Every write is atomic and timestamped. State is preserved across HITL pauses — the agent can resume from the exact state it was in when it paused.

Schema (agent_task_state collection):
task_id · agent_id · fsm_state
context_payload (JSON) · created_at
last_updated_at · hitl_ids[]
correlation_id · ttl_expires_at
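The schema above maps naturally onto a typed record. The sketch below uses a Python dataclass with the same field names; it does not touch Firestore, and it assumes the 7-day TTL is counted from creation rather than from the last update, which the page does not specify.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional

TTL = timedelta(days=7)


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class AgentTaskState:
    task_id: str
    agent_id: str
    fsm_state: str
    correlation_id: str
    context_payload: dict = field(default_factory=dict)
    hitl_ids: List[str] = field(default_factory=list)
    created_at: datetime = field(default_factory=_now)
    last_updated_at: Optional[datetime] = None
    ttl_expires_at: Optional[datetime] = None

    def __post_init__(self) -> None:
        if self.last_updated_at is None:
            self.last_updated_at = self.created_at
        if self.ttl_expires_at is None:
            self.ttl_expires_at = self.created_at + TTL  # assumed: TTL from creation

    def pause_for_hitl(self, hitl_id: str, at: Optional[datetime] = None) -> None:
        """Record a HITL pause; context_payload is left intact for the resume."""
        self.hitl_ids.append(hitl_id)
        self.fsm_state = "HITL_PAUSE"
        self.last_updated_at = at or _now()
```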

TIER 02

Long-Term Memory

Vertex AI Vector Search · Persistent · Embedding: text-embedding-004

The institutional knowledge base. ContractGuard uses it for precedent search — finding the 3 most similar clauses from ClaraVis's historical contract corpus. RevRec AI uses it for comparable transaction lookup. Asset IQ uses it for cross-regional failure pattern matching. Every HITL decision is written back to long-term memory as a labelled example — the agents get smarter with every human review.

Collections:
contract_clauses · transaction_history
asset_failure_patterns · hitl_decisions
Embedding model: text-embedding-004
Similarity metric: cosine · top-k: 3
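The retrieval settings above (cosine similarity, top-k of 3) can be illustrated with a brute-force search in plain Python. This is only a sketch of the metric: a managed vector index would use approximate nearest-neighbour search, and the toy vectors stand in for real embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], corpus: Dict[str, Sequence[float]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Return the k corpus entries most similar to the query, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```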

TIER 03

Shared Context Bus

Pub/Sub · Cross-agent · Retention: 7 days

The event stream that keeps all agents aware of what is happening across the swarm. When ContractGuard flags a liability clause, FinRisk Sentinel receives the same event through its own subscription and can adjust its financial anomaly baseline accordingly. When Asset IQ detects a fleet-level failure pattern, RevRec AI can factor that into warranty reserve recognition. The shared bus enables cross-module intelligence without direct agent-to-agent coupling.

Topics:
ae-orchestration-events
ae-hitl-events · ae-asset-events
ae-contract-events · ae-financial-events
Retention: 7 days · at-least-once delivery
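At-least-once delivery means a subscriber can see the same event more than once, so handlers need to be idempotent. The sketch below shows one common approach, deduplication by message ID. It is an assumption about how a subscriber might be written: the in-memory set is for illustration, and a real agent would keep seen IDs in durable storage.

```python
from typing import Callable, Set


class IdempotentHandler:
    """Wrap a handler so a redelivered message is processed only once."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()  # illustrative; not durable across restarts

    def handle(self, message_id: str, payload: dict) -> bool:
        """Process the message; return False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Mark as seen only after success, so a failed attempt can be redelivered.
        self._seen.add(message_id)
        return True
```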

Guardrails & Safety

What happens when things go wrong — by design.

A production-grade agent swarm is defined as much by its failure modes as its happy path. Every guardrail below is a design artifact — not a monitoring dashboard added after the fact. The circuit breaker, confidence thresholds, hallucination detection, and fallback behaviours are specified before a line of agent code is written.

Guardrail 01

Circuit Breaker

Every specialist agent is wrapped in a circuit breaker that the Orchestrator monitors. When a specialist agent fails to respond within its timeout, returns an error state, or produces an output that fails schema validation, the Orchestrator opens the circuit for that agent and routes the task to a HITL fallback, in which a human performs the function the agent was trying to perform. After the configurable open period elapses, a single half-open probe request is sent; the circuit closes if that probe succeeds.

States: CLOSED → OPEN → HALF-OPEN → CLOSED
OPEN trigger: 3 failures in 60s window (default)
HALF-OPEN probe: single request after 30s
OPEN action: route to HITL · preserve task state
Audit: every circuit state transition → Firestore
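The state cycle and defaults above can be sketched as a small class. This is a generic circuit breaker under the stated defaults, not the Orchestrator's actual code: the class and method names are hypothetical, the clock is injectable for testing, and audit writes and HITL routing are left out.

```python
import time
from collections import deque


class CircuitBreaker:
    """CLOSED -> OPEN -> HALF_OPEN -> CLOSED, using the default thresholds above."""

    def __init__(self, max_failures=3, window_s=60.0, open_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.open_s = open_s
        self.clock = clock
        self.state = "CLOSED"
        self.failures = deque()  # timestamps of recent failures
        self.opened_at = None

    def allow_request(self) -> bool:
        """True if a call may be attempted; moves OPEN -> HALF_OPEN after open_s."""
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.open_s:
            self.state = "HALF_OPEN"  # let exactly one probe through
            return True
        return self.state == "CLOSED"

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self.state = "CLOSED"
            self.failures.clear()

    def record_failure(self) -> None:
        now = self.clock()
        if self.state == "HALF_OPEN":
            self._open(now)  # a failed probe re-opens the circuit
            return
        self.failures.append(now)
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()  # drop failures outside the window
        if len(self.failures) >= self.max_failures:
            self._open(now)

    def _open(self, now: float) -> None:
        self.state = "OPEN"
        self.opened_at = now
        self.failures.clear()
```

While the breaker is `OPEN`, or `HALF_OPEN` with a probe in flight, `allow_request()` returns `False`, which is the point at which the caller would route the task to the HITL fallback.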

Guardrail 02

Hallucination Detection

LLM outputs that inform business decisions are validated against a schema contract before they are acted on. Gemini responses from ContractGuard clause analysis must conform to the ClauseAnalysis JSON schema — responses that fail validation are retried with a temperature reduction (0.7 → 0.3 → 0.1) before escalating to HITL. For RevRec AI, the classification must be one of three valid ASC 606 types — any other output triggers an immediate HITL-09 manual classification request.

Validation: JSON schema contract per agent output type
Retry strategy: temperature reduction: 0.7 → 0.3 → 0.1
Max retries: 3 · then HITL escalation
All retry attempts: logged to Firestore audit record
Invalid outputs: never acted on · always HITL
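The validate-then-retry loop above can be sketched with the model call abstracted away. In this sketch `generate` is a hypothetical callable standing in for the Gemini request, the validator covers only the RevRec case, the class labels are illustrative spellings of the three ASC 606 types, and the three temperatures are treated as three attempts in total (one reading of "Max retries: 3").

```python
import json
from typing import Callable, List, Optional

TEMPERATURES = (0.7, 0.3, 0.1)  # one attempt per temperature (assumed reading)
VALID_CLASSES = {"sale", "lease", "multi_element"}  # illustrative labels


def is_valid_classification(output: dict) -> bool:
    return output.get("classification") in VALID_CLASSES


def generate_validated(generate: Callable[[float], str],
                       validate: Callable[[dict], bool],
                       audit_log: List[dict]) -> Optional[dict]:
    """Try each temperature in turn; return the first valid output, else None."""
    for attempt, temperature in enumerate(TEMPERATURES, start=1):
        raw = generate(temperature)
        try:
            parsed = json.loads(raw)
            ok = isinstance(parsed, dict) and validate(parsed)
        except json.JSONDecodeError:
            parsed, ok = None, False
        # Every attempt is logged, valid or not.
        audit_log.append({"attempt": attempt, "temperature": temperature, "valid": ok})
        if ok:
            return parsed
    return None  # invalid output is never acted on; the caller escalates to HITL
```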

Guardrail 03

Confidence Thresholds

Every ML model inference and LLM analysis in the swarm produces a confidence score. Scores above the configured threshold allow autonomous action. Scores below the threshold pause the agent and route to HITL — the human gets the agent's best work and decides whether to accept it. Thresholds are configured per agent and per action type, not globally. A low-confidence revenue recognition classification is treated differently from a low-confidence qualification assessment.

CCAI Sales: qualification confidence ≥ 0.75
ContractGuard: Gemini analysis confidence ≥ 0.80
RevRec AI: classification confidence ≥ 0.70 (HITL always regardless)
Asset IQ: RUL confidence ≥ 0.82 for auto work order
FinRisk: anomaly score ≥ 0.85 for high-severity HITL

Guardrail 04

Fallback & Rollback

Every agent task has a defined rollback path — the set of compensating actions that restore the system to its pre-task state if the task fails or is rejected at HITL. Firestore's transactional writes mean that partial state is never committed. The Orchestrator tracks every state transition and can reconstruct the pre-task state from the Firestore audit record for any task that needs to be rolled back. SAP write operations are the only irreversible action — they require a committed HITL approval record as a mandatory input parameter.

Rollback trigger: HITL rejection · circuit open · agent timeout
State preservation: Firestore atomic writes · no partial state
SAP write guard: HITL approval record ID is a required parameter
Rollback audit: rollback action written to Firestore before execution
Salesforce rollback: Opportunity stage reverted · activity log appended
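A rollback path made of compensating actions can be sketched as a ledger that records an undo step alongside each forward step and replays them in reverse. The names below are hypothetical and the audit list stands in for the Firestore audit record; the one detail taken directly from the list above is that the rollback entry is written before the compensating action runs.

```python
from typing import Callable, List, Tuple


class TaskLedger:
    """Track compensating actions for a task and undo them in reverse order."""

    def __init__(self) -> None:
        self._steps: List[Tuple[str, Callable[[], None]]] = []
        self.audit: List[dict] = []  # stand-in for the Firestore audit record

    def record(self, description: str, compensate: Callable[[], None]) -> None:
        """Register how to undo a step that has just been applied."""
        self._steps.append((description, compensate))

    def rollback(self, reason: str) -> None:
        while self._steps:
            description, compensate = self._steps.pop()
            # Audit entry is written before the compensating action executes.
            self.audit.append({"event": "ROLLBACK", "step": description, "reason": reason})
            compensate()
```

A step with no compensating action, such as an SAP posting, never enters the ledger, which is why that write sits behind the HITL approval guard instead.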

Architecture Decision Records

Three decisions. Every alternative documented.

ADR-007 through ADR-009 are produced in the agent swarm design phase. Each states the choice, the alternatives that were evaluated, and why this choice was made — the reasoning that a principal engineer or enterprise architect will probe in any serious design review.

ADR-007

Google ADK over LangGraph or CrewAI

ADK selected as the agent orchestration framework. LangGraph provides excellent graph-based state machine support but runs on arbitrary Python infrastructure — it has no native GCP observability, IAM integration, or Vertex AI deployment path. CrewAI is high-level and fast to prototype but does not expose the state machine primitives required for formal HITL checkpoint specification. ADK runs natively on Cloud Run with Vertex AI integration, has first-class Firestore state management, and its A2A protocol is an open standard — not a proprietary message format locked to one vendor's SDK.

Accepted · Phase Agent Design

ADR-008

A2A protocol over direct HTTP for inter-agent communication

Direct HTTP calls between agents were considered as the simplest integration path. Rejected because: direct HTTP creates tight coupling between agent endpoints, makes circuit breaking the Orchestrator's responsibility rather than a platform concern, and produces no auditable message record. A2A messages are published to Pub/Sub, giving the event bus replay capability, at-least-once delivery guarantees, and a complete message history that is queryable in BigQuery. Every A2A message is also written to the Firestore audit record — direct HTTP calls are not.

Accepted · Phase Agent Design

ADR-009

Firestore over Redis for agent state and HITL audit

Redis was considered for agent short-term memory given its low latency and widespread use for session state. Rejected for two reasons: (1) Redis is an in-memory store, so avoiding data loss on failure requires a persistence configuration that adds operational complexity. (2) The HITL audit requirement mandates that HITL event records are immutable and durable by design, and Redis TTL-based eviction is architecturally incompatible with an immutable audit store. Firestore's transactional writes, native JSON document model, and europe-west3 regional deployment satisfy both the state management and immutable audit requirements in a single managed service.

Accepted · Phase Agent Design
