CTO · S-01
Why a separate module — can't the Feature Store handle validation natively?
"Vertex AI Feature Store has data validation capabilities built in. Why build a separate Data Governance module with TFX, quarantine tables, and a steward interface when the Feature Store already handles ingestion quality?"
Architectural response
Vertex AI Feature Store validates that incoming data conforms to the feature group schema — it ensures the correct feature types are present. It does not validate the business rules that make a record meaningful: a temperature reading of 73.4°C is schema-valid but only meaningful if the freshness is within the expected window, the quality score meets the threshold, the source system is registered, and the lineage tag is attached. The Feature Store also does not write quarantine records, does not alert data stewards, does not track schema drift fingerprints, and does not produce the lineage metadata required for EU AI Act Article 11 technical documentation. Data Governance handles the layer between raw Pub/Sub events and Feature Store ingest — it is not a replacement for Feature Store's ingestion schema enforcement but a prerequisite to it.
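The business-rule layer described above can be sketched as a single validation pass; the field names (freshness_minutes, quality_score, source_system, lineage_tag), the source registry, and the default thresholds are illustrative assumptions, not the production schema:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical registry of known source systems; the real one lives in DG config.
REGISTERED_SOURCES = {"scada-apac-east", "scada-emea-north"}

@dataclass
class ValidationResult:
    valid: bool
    violations: List[str] = field(default_factory=list)

def validate_business_rules(record: dict,
                            max_freshness_min: float = 15,
                            min_quality: float = 0.85) -> ValidationResult:
    """Checks the rules a schema check cannot express: freshness window,
    quality threshold, registered source, attached lineage tag."""
    violations = []
    if record.get("freshness_minutes", float("inf")) > max_freshness_min:
        violations.append("stale_reading")
    if record.get("quality_score", 0.0) < min_quality:
        violations.append("quality_below_threshold")
    if record.get("source_system") not in REGISTERED_SOURCES:
        violations.append("unregistered_source")
    if not record.get("lineage_tag"):
        violations.append("missing_lineage_tag")
    return ValidationResult(valid=not violations, violations=violations)
```

A schema-valid 73.4°C reading with a 60-minute-old timestamp would pass Feature Store ingestion but fail here with "stale_reading", which is exactly the gap the module closes.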
Evidence: ADR-DG01 (TFX vs Feature Store native validation) · Lineage graph (shows the chain DG enables) · EU AI Act Art. 11 (lineage provenance requirement)
CCO · S-02
How does feature lineage satisfy EU AI Act Article 11?
"EU AI Act Article 11 requires technical documentation covering training data — its provenance, characteristics, and quality. How does attaching a lineage tag to a Feature Store record satisfy that obligation in a way that an auditor can verify?"
Architectural response
The lineage graph diagram above shows the complete chain: raw sensor event → DG validation → Feature Store feature → Vertex AI inference → SHAP explanation → HITL record. Every link in that chain is queryable from BigQuery. For any SHAP explanation presented to a human reviewer, an auditor can trace: the lineage_ref field in the SHAP record → the Feature Store feature group entry with the lineage tag → the DG validation log entry confirming schema v2.0 conformance and quality score → the original Pub/Sub message ID and source system. This is not a description of provenance — it is a queryable audit chain. The TFX validation log entry, the quality score, and the schema version are all written to ae_governance.validation_log at validation time. The Article 11 obligation for training data documentation is satisfied by the same lineage chain — the Feature Store offline store records used for model training carry the same lineage tags, so the training dataset's provenance is fully documented.
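The audit walk can be expressed programmatically; the sketch below uses in-memory stand-ins for the BigQuery tables, and every table and field name other than lineage_ref and ae_governance.validation_log is an assumption for illustration:

```python
# Hypothetical in-memory stand-ins for the tables in the audit chain.
shap_records = {"shap-001": {"lineage_ref": "lin-42"}}
feature_store = {"lin-42": {"feature_group": "asset_telemetry", "validation_id": "val-7"}}
validation_log = {"val-7": {"schema_version": "2.0", "quality_score": 0.91,
                            "pubsub_message_id": "msg-9981",
                            "source_system": "scada-apac-east"}}

def trace_provenance(shap_id: str) -> dict:
    """Walk SHAP record -> Feature Store entry -> DG validation log -> raw message,
    the same joins an auditor would run in BigQuery."""
    lineage_ref = shap_records[shap_id]["lineage_ref"]
    feature = feature_store[lineage_ref]
    validation = validation_log[feature["validation_id"]]
    return {
        "shap_id": shap_id,
        "lineage_ref": lineage_ref,
        "schema_version": validation["schema_version"],
        "quality_score": validation["quality_score"],
        "pubsub_message_id": validation["pubsub_message_id"],
        "source_system": validation["source_system"],
    }
```

The point of the sketch is that each hop is a key lookup, not a narrative: if any link were missing, the trace would fail loudly rather than produce an unverifiable answer.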
Evidence: Lineage graph (§02) · ae_governance.validation_log · Feature Store lineage_ref field · EU AI Act Art. 11 training data documentation
Enterprise Architect · S-08
What happens to downstream modules when a large batch is quarantined?
"If APAC-East sends 847 records and all are quarantined, does Asset IQ stop making predictions for APAC-East units? For how long? And is there any risk that the quarantine cascades to other regional streams?"
Architectural response
Asset IQ continues running on all non-APAC-East records without interruption — the quarantine is scoped to the batch and region that failed validation. The Asset IQ RUL batch job runs across all 12,000 units but reads feature values from the Feature Store; if an APAC-East unit's feature values are stale (because new records are quarantined), the RUL model uses the last valid feature snapshot. Stale features are flagged — the quality_score for those units drops below the 0.85 high-confidence threshold, routing APAC-East unit predictions to HITL-06 rather than auto-work-order until the quarantine is resolved. This is the correct degraded behaviour: APAC-East units get lower-confidence predictions (requiring FSM review) rather than no predictions. The quarantine cannot cascade to other regional streams because the Pub/Sub topics are independent — ae-asset-events has per-region message filtering, and the DG agent processes each message independently. An APAC-East schema violation does not affect EMEA-North validation.
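The non-cascade property follows directly from per-message processing; a minimal sketch, assuming a region field on each message and a simple schema_ok flag (both illustrative, not the production message shape):

```python
quarantined_regions = set()

def process_message(msg: dict) -> str:
    """Validate one message independently. A failure quarantines only
    that message's region; other regional streams are never touched."""
    if not msg.get("schema_ok", True):
        quarantined_regions.add(msg["region"])
        return "quarantined"
    return "ingested"
```

Because no shared state is consulted during validation, a batch of 847 failing APAC-East messages adds one region to the quarantine set and has no effect on the EMEA-North path through the same code.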
Evidence: Data flow sequence (Asset IQ continues for 5 regions) · Asset IQ HITL-06 confidence threshold (stale features → lower confidence → HITL routing) · Pub/Sub per-region message filtering
Asset IQ — Field Service Manager
If APAC-East data is quarantined, does Asset IQ stop predicting for those units?
"We have 23 units in the APAC-East region. If their sensor data is quarantined for 4+ hours, do I have any visibility into their health status? Or am I flying blind while the data steward resolves the schema issue?"
Architectural response
You are not flying blind. Asset IQ uses the last validated feature snapshot for APAC-East units — the features from before the quarantine event are still in the Feature Store with their quality scores intact. The daily RUL batch job will still produce predictions for all 23 APAC-East units using yesterday's feature values, with a lower confidence score (because the freshness component of the quality metric degrades as features become stale). Predictions below the 0.82 confidence threshold route to your HITL-06 queue with a note: "Feature freshness below threshold — APAC-East schema quarantine active." You see the prediction, the SHAP attribution based on yesterday's data, and the staleness flag. You can decide to schedule preventive maintenance or wait for the quarantine to resolve. The system never silently stops predicting — it degrades gracefully with visible flags.
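The graceful degradation can be sketched as a freshness decay on the prediction confidence; the 0.82 threshold and the staleness note come from the design above, while the linear decay rate and function names are assumptions for illustration:

```python
def degraded_confidence(base_confidence: float, feature_age_hours: float,
                        decay_per_hour: float = 0.01) -> float:
    """Confidence degrades as the last validated feature snapshot ages."""
    return max(0.0, base_confidence - decay_per_hour * feature_age_hours)

def route_prediction(unit_id: str, base_confidence: float,
                     feature_age_hours: float, threshold: float = 0.82) -> dict:
    """Route below-threshold predictions to the FSM's HITL-06 queue
    with a visible staleness flag, never silently dropping the unit."""
    conf = degraded_confidence(base_confidence, feature_age_hours)
    stale = conf < threshold
    return {
        "unit": unit_id,
        "confidence": round(conf, 3),
        "route": "HITL-06" if stale else "auto",
        "note": ("Feature freshness below threshold — APAC-East schema "
                 "quarantine active") if stale else None,
    }
```

After a day on yesterday's snapshot, a unit that would normally auto-route drops under the threshold and lands in the HITL-06 queue with the flag attached, which is the behaviour the response describes.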
Evidence: Asset IQ Feature Store (freshness component in quality score) · HITL-06 interface (staleness flag) · Data flow sequence (stale feature note in Asset IQ HITL output)
CISO · S-09
Who has access to the quarantine dataset — it contains raw telemetry?
"The ae_quarantine BigQuery dataset contains raw telemetry that failed validation — including the original field values, the source system identifier, and the device ID. Who has access to this dataset, and for how long is the data retained?"
Architectural response
The ae_quarantine dataset has four principals: the DG agent SA (dg-sa@, write access — inserts quarantine records), the data steward role (read access — reviews violations), the Data Governance admin SA (read-write — for reinstatement processing), and the audit SA (read-only — 7-year audit access). No developer SA has access. The data classification for quarantine records is the same as the source data — Internal for asset telemetry. Quarantine records are retained for 90 days after resolution (reinstatement or discard), then automatically deleted by a BigQuery table expiry policy. Unresolved quarantine records are retained indefinitely until a steward decision is made. The raw record content in ae_quarantine is the same data that would have been ingested if validation had passed — it carries no additional sensitivity. The VPC-SC perimeter applies to ae_quarantine as it does to all ae_ datasets.
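The retention rule reduces to one function; a sketch of the policy logic only (the 90-day figure is from the policy above, the BigQuery expiry itself is enforced by the table's expiration setting, not application code):

```python
from datetime import datetime, timedelta
from typing import Optional

RETENTION_DAYS = 90  # post-resolution retention before table expiry deletes the record

def quarantine_expiry(resolved_at: Optional[datetime]) -> Optional[datetime]:
    """Unresolved records (resolved_at is None) are kept indefinitely
    until a steward decision; resolved ones expire 90 days later."""
    if resolved_at is None:
        return None
    return resolved_at + timedelta(days=RETENTION_DAYS)
```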
Evidence: Page 07 BigQuery IAM (ae_quarantine access policy) · 90-day quarantine retention (BigQuery table expiry) · Data classification: Internal (asset telemetry)
CFO · S-03
What is the cost of running Data Governance continuously on all six streams?
"Data Governance is always-on, processing three Pub/Sub topics continuously. What is the monthly infrastructure cost, and is there a cheaper architecture — batch validation, for example — that achieves the same quality guarantees?"
Architectural response
Data Governance's Cloud Run cost is minimal because it is stateless and event-driven — it runs only when a Pub/Sub message arrives. At ClaraVis's message volume (approximately 12,000 asset events per day + financial and contract events ≈ ~15,000 messages/day total), the Cloud Run processing time per message is approximately 80ms. Total monthly Cloud Run compute: ~15,000 × 30 × 0.08s = 36,000 seconds of request time, or 20 vCPU-hours at 2 vCPU = ~€0.14/month at Cloud Run pricing. BigQuery writes for validation log, quarantine records, and lineage tags: approximately €2/month at standard insertion pricing. Total Data Governance monthly infrastructure cost: approximately €2.20. Batch validation (running quality checks once daily) was rejected because schema violations in asset telemetry need to be detected before the daily RUL batch job runs — a quarantine event discovered in a batch validation run at 03:00 UTC would mean an entire day of potentially corrupt features already in the Feature Store. Streaming validation at the point of ingestion is both cheaper and faster than batch validation with quarantine rollback.
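The volume arithmetic behind the estimate can be reproduced directly; the message counts and per-message latency are the figures from the text, and the final euro cost additionally depends on the Cloud Run per-vCPU-second rate and free tier, so only the time quantities are computed here:

```python
messages_per_day = 15_000       # asset + financial + contract events
seconds_per_message = 0.08      # ~80ms Cloud Run processing per message
vcpus = 2                       # service runs at 2 vCPU

monthly_request_seconds = messages_per_day * 30 * seconds_per_message
monthly_vcpu_seconds = monthly_request_seconds * vcpus
monthly_vcpu_hours = monthly_vcpu_seconds / 3600
```

Billable compute is charged per vCPU-second, which is why the 36,000 seconds of request time at 2 vCPU translate to 20 vCPU-hours, a rounding error next to the BigQuery insertion cost.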
Evidence: Cloud Run pricing (event-driven, not instance-idle) · ADR-DG02 (quarantine-then-review over reject) · Asset IQ batch RUL (needs clean features before 02:00 UTC run)