Data Governance is deployed in Horizon 1 — before any ML model, before any HITL checkpoint, before any inference in any suite. It validates every record entering the AE data fabric, tags every feature with a traceable lineage, and quarantines anything that fails. Without it, the SHAP explanations that satisfy EU AI Act Article 11 have no verified provenance.
Data Governance sits at the boundary between raw regional data sources and the AE data fabric. Everything entering the Feature Store, BigQuery financial stream, and contract event stream — from any suite — passes through Data Governance first. All suites are blocked until this module is live.
The 6-region canonical schema v2.0 is the contract between every regional data source and the AE data fabric. All suites are blocked until this schema is validated and live. Schema changes must be registered before deployment — unregistered changes trigger quarantine automatically.
| Field | Type | Constraint | Version | Notes |
|---|---|---|---|---|
| entity_id | STRING | REQUIRED | v1.0 | Unique asset identifier · format: {REGION}-{TYPE}-{SEQ} |
| event_id | STRING | REQUIRED | v1.0 | Pub/Sub message ID · used as lineage_ref anchor |
| event_timestamp | TIMESTAMP | REQUIRED | v1.0 | UTC · ISO-8601 · type mismatch triggers quarantine |
| source_region | STRING | REQUIRED | v1.0 | ENUM: EMEA-North · EMEA-West · APAC-East · APAC-South · AMER-East · AMER-West |
| asset_type | STRING | REQUIRED | v1.0 | ENUM: MRI-7T · CT-Premium · MRI-3T · Ultrasound-Elite |
| bearing_vibration_hz | FLOAT | REQUIRED | v1.0 | Range: 0–500 · anomaly threshold: fleet_mean ± 3σ |
| bearing_temp_c | FLOAT | REQUIRED | v1.0 | Range: -10–150°C · null triggers quality score deduction |
| bearing_temp_optical_c | FLOAT | OPTIONAL | v2.1 | New canonical field · maps from optical_sensor_temp_c (APAC-East firmware v4.2.1) |
| power_draw_kw | FLOAT | REQUIRED | v1.0 | Range: 0–50kW · feeds GreenOps carbon model |
| firmware_version | STRING | REQUIRED | v1.0 | Semver · tracks schema drift to firmware release correlation |
| free_text_notes | STRING | OPTIONAL · DLP SCAN | v1.0 | Engineer field notes · DLP API scanned inline before TFX · PII detection routes to DLP HOLD (not schema quarantine) |
| schema_version | STRING | REQUIRED | v1.0 | Written by DG agent post-validation · used as lineage_ref component |
| quality_score | FLOAT | REQUIRED | v1.0 | Written by DG quality scorer · 0–1 · threshold 0.85 for Feature Store · 0.60 for BigQuery |
| lineage_ref | STRING | REQUIRED | v1.0 | Composite: event_id + schema_version + ingest_ts · EU AI Act Art. 11 anchor |
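The constraints in the table can be read as a deterministic conformance check. The following is a minimal sketch in plain Python, not the TFX pipeline itself: the function name `schema_violations`, the sample record values, and the decision to leave null numeric values to the quality scorer are assumptions made for illustration. The three fields written by Data Governance (schema_version, quality_score, lineage_ref) are not required on incoming records in this sketch.

```python
from datetime import datetime, timedelta

REGIONS = {"EMEA-North", "EMEA-West", "APAC-East", "APAC-South", "AMER-East", "AMER-West"}
ASSET_TYPES = {"MRI-7T", "CT-Premium", "MRI-3T", "Ultrasound-Elite"}

# Fields a raw source record must carry (REQUIRED rows not written by DG).
SOURCE_REQUIRED = (
    "entity_id", "event_id", "event_timestamp", "source_region", "asset_type",
    "bearing_vibration_hz", "bearing_temp_c", "power_draw_kw", "firmware_version",
)
OPTIONAL = ("bearing_temp_optical_c", "free_text_notes")
DG_WRITTEN = ("schema_version", "quality_score", "lineage_ref")
CANONICAL_FIELDS = set(SOURCE_REQUIRED) | set(OPTIONAL) | set(DG_WRITTEN)

# Numeric ranges taken from the Notes column.
RANGES = {
    "bearing_vibration_hz": (0.0, 500.0),
    "bearing_temp_c": (-10.0, 150.0),
    "power_draw_kw": (0.0, 50.0),
}


def schema_violations(record):
    """Return a list of human-readable violations; an empty list means conformant."""
    violations = []
    for name in SOURCE_REQUIRED:
        if name not in record:
            violations.append(f"missing required field: {name}")
    # Fields outside the canonical schema are treated as unregistered changes.
    for name in sorted(set(record) - CANONICAL_FIELDS):
        violations.append(f"unregistered field: {name}")
    if "event_timestamp" in record:
        try:
            parsed = datetime.fromisoformat(record["event_timestamp"])
        except (TypeError, ValueError):
            violations.append("event_timestamp: not an ISO-8601 string")
        else:
            if parsed.utcoffset() != timedelta(0):
                violations.append("event_timestamp: not UTC")
    if "source_region" in record and record["source_region"] not in REGIONS:
        violations.append("source_region: not in ENUM")
    if "asset_type" in record and record["asset_type"] not in ASSET_TYPES:
        violations.append("asset_type: not in ENUM")
    for name, (low, high) in RANGES.items():
        value = record.get(name)
        if value is None:
            continue  # assumption: nulls are handled by the quality scorer
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            violations.append(f"{name}: expected a number")
        elif not low <= value <= high:
            violations.append(f"{name}: {value} outside {low}-{high}")
    return violations


record = {  # hypothetical sample values
    "entity_id": "APAC-MRI-0042",
    "event_id": "1234567890",
    "event_timestamp": "2025-01-15T08:30:00+00:00",
    "source_region": "APAC-East",
    "asset_type": "MRI-7T",
    "bearing_vibration_hz": 120.5,
    "bearing_temp_c": 64.0,
    "power_draw_kw": 31.2,
    "firmware_version": "4.2.1",
    "optical_sensor_temp_c": 63.1,  # raw firmware field, not yet mapped
}
print(schema_violations(record))  # ['unregistered field: optical_sensor_temp_c']
```

The sample record carries the raw `optical_sensor_temp_c` field named in the table rather than its canonical counterpart, so the sketch reports it as an unregistered field.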
free_text_notes (an OPTIONAL field, present in ~12% of records at ClaraVis volume) is submitted to the Cloud DLP Content API. Structured numeric fields are not inspected, because DLP inspection adds no value on FLOAT/TIMESTAMP fields. At 15,000 msg/day × 12%, that is ~1,800 DLP API calls/day, or ~54,000/month over a 30-day month. Cloud DLP free tier: 1 unit = 1 API call for content <500KB. Monthly DLP cost: 54,000 calls × $0.003/unit ≈ $162/month. Full revised cost model: Cloud Run ~€0.14 + BigQuery ~€2 + Cloud DLP ~$162 ≈ $165/month total. Batch validation was not considered; see ADR-DG02. DLP cost scales linearly with the prevalence of free_text_notes; an operator flag suppresses the DLP scan when the notes field is absent.
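The DLP arithmetic above can be reproduced in a few lines, which also makes the linear scaling with notes prevalence easy to check. This is a sketch only: the function name `monthly_dlp_cost` is hypothetical, and the 30-day month is an assumption inferred from the 1,800/day to 54,000/month step.

```python
def monthly_dlp_cost(messages_per_day, notes_prevalence,
                     usd_per_unit=0.003, days_per_month=30):
    """Return (calls per day, calls per month, monthly cost in USD)."""
    # One DLP call per record that carries free_text_notes.
    calls_per_day = round(messages_per_day * notes_prevalence)
    calls_per_month = calls_per_day * days_per_month
    cost_usd = calls_per_month * usd_per_unit
    return calls_per_day, calls_per_month, cost_usd


calls_day, calls_month, cost = monthly_dlp_cost(15_000, 0.12)
print(f"{calls_day} calls/day, {calls_month} calls/month, ${cost:.2f}/month")

# Doubling the share of records with notes doubles the DLP cost.
_, _, doubled = monthly_dlp_cost(15_000, 0.24)
print(f"${doubled:.2f}/month at 24% prevalence")
```

With the figures from the text this gives 1,800 calls/day, 54,000 calls/month, and $162.00/month; at 24% prevalence the monthly cost is $324.00.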
Data Governance uses TFX's deterministic validation engine and BigQuery lineage tables, with no ML inference. Every record passes through schema conformance, quality scoring, and lineage tagging before reaching the Feature Store. Records that fail schema validation go to quarantine immediately. Records that pass schema validation but score below 0.60 on quality are also quarantined.
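The routing rule described above, combined with the 0.85 and 0.60 thresholds from the schema table, can be sketched as follows. Several details are assumptions rather than documented behavior: the `Route` names, the inclusive treatment of the threshold values, the reading that a score between 0.60 and 0.85 reaches BigQuery but not the Feature Store, and the `|` separator used to join the lineage_ref components.

```python
from enum import Enum

FEATURE_STORE_THRESHOLD = 0.85
BIGQUERY_THRESHOLD = 0.60


class Route(Enum):
    QUARANTINE = "quarantine"
    BIGQUERY_ONLY = "bigquery_only"
    FEATURE_STORE = "feature_store"


def route(schema_ok, quality_score):
    """Decide where a record goes after schema validation and quality scoring."""
    if not schema_ok:
        return Route.QUARANTINE  # schema failures quarantine immediately
    if quality_score < BIGQUERY_THRESHOLD:
        return Route.QUARANTINE  # passes schema, but quality is too low
    if quality_score < FEATURE_STORE_THRESHOLD:
        return Route.BIGQUERY_ONLY  # assumption: good enough for BigQuery only
    return Route.FEATURE_STORE


def lineage_ref(event_id, schema_version, ingest_ts):
    """Compose event_id + schema_version + ingest_ts (separator is an assumption)."""
    return "|".join((event_id, schema_version, ingest_ts))


print(route(True, 0.91))   # Route.FEATURE_STORE
print(route(True, 0.72))   # Route.BIGQUERY_ONLY
print(route(True, 0.41))   # Route.QUARANTINE
print(route(False, 0.99))  # Route.QUARANTINE
print(lineage_ref("1234567890", "v2.0", "2025-01-15T08:30:02+00:00"))
```

Because every branch is a plain comparison, the same record always produces the same route, which is what makes the outcome auditable without any model inference.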
A record that fails validation is never silently discarded — it moves to QUARANTINED, where it waits for a data steward decision. If reinstated, it re-enters the VALIDATING state with the approved mapping rule applied. If discarded, an audit tombstone is written to ae_governance.discard_log.
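The quarantine lifecycle can be sketched as a small state machine. Only the VALIDATING and QUARANTINED state names and the ae_governance.discard_log table come from the text; the DISCARDED state name, the class and function names, the tombstone fields, and the dictionary form of the mapping rule are hypothetical. An in-memory list stands in for the discard log table.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class QuarantinedRecord:
    payload: dict
    violation: str
    state: str = "QUARANTINED"


def reinstate(record, mapping_rule):
    """Apply an approved field-renaming rule and send the record back to validation."""
    if record.state != "QUARANTINED":
        raise ValueError(f"cannot reinstate from state {record.state}")
    # Rename source fields to canonical names; unmapped fields pass through.
    record.payload = {mapping_rule.get(k, k): v for k, v in record.payload.items()}
    record.state = "VALIDATING"
    return record


def discard(record, steward, discard_log):
    """Write an audit tombstone, then mark the record as discarded."""
    if record.state != "QUARANTINED":
        raise ValueError(f"cannot discard from state {record.state}")
    discard_log.append({
        "event_id": record.payload.get("event_id"),
        "violation": record.violation,
        "steward": steward,
        "discarded_at": datetime.now(timezone.utc).isoformat(),
    })
    record.state = "DISCARDED"
    return record


# Stand-in for the ae_governance.discard_log table.
discard_log = []

first = QuarantinedRecord(
    payload={"event_id": "evt-1", "optical_sensor_temp_c": 63.1},
    violation="unregistered field: optical_sensor_temp_c",
)
reinstate(first, {"optical_sensor_temp_c": "bearing_temp_optical_c"})
print(first.state, first.payload)

second = QuarantinedRecord(payload={"event_id": "evt-2"}, violation="missing fields")
discard(second, steward="steward@example.com", discard_log=discard_log)
print(second.state, len(discard_log))
```

Both transitions start from QUARANTINED and nothing else, so a record cannot leave quarantine without an explicit steward decision, and a discard cannot happen without a tombstone being written first.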
The data steward interface is operationally different from every other HITL in any suite. No time pressure (P2, not P0), no financial consequence to the individual decision, no dual-reviewer requirement. The steward's job is schema triage: understand the violation, confirm the mapping rule, reinstate or discard.
The Data Governance Ops Dashboard is the primary operational interface for the platform on-call engineer. It shows pipeline throughput per region, active quarantine queue depth, SLO burn rate, and dead-letter queue (DLQ) status. All data updates live: every record processed, quarantined, or sent to the DLQ is visible within 5 seconds of the event.
Select a scenario and run the simulation. Watch records arrive via Pub/Sub, pass through the TFX validation pipeline, and route to their outcome. The schema violation scenario shows the full quarantine-then-review path with a data steward alert — click Reinstate to watch the records flow through to the Feature Store.
Data Governance is the H1 prerequisite. Deployment order: M-08 Data Governance (H1 · PI-1) → M-06 GreenOps (H3 · PI-7) → M-07 Strategy Dashboard (H3 · PI-8). No suite can operate on the AE Platform until the 6-region canonical schema is validated and the Feature Store lineage pipeline is live.