CTO · S-01
Why a separate module — can't the Feature Store handle validation natively?
"Vertex AI Feature Store has data validation capabilities built in. Why build a separate Data Governance module with TFX, quarantine tables, and a steward interface when the Feature Store already handles ingestion quality?"
Architectural response
Vertex AI Feature Store validates that incoming data conforms to the feature group schema — it ensures the correct feature types are present. It does not validate the business rules that make a record meaningful: a temperature reading of 73.4°C is schema-valid but only meaningful if the freshness is within the expected window, the quality score meets the threshold, the source system is registered, and the lineage tag is attached. The Feature Store also does not write quarantine records, does not alert data stewards, does not track schema drift fingerprints, and does not produce the lineage metadata required for EU AI Act Article 11 technical documentation. Data Governance handles the layer between raw Pub/Sub events and Feature Store ingest — it is not a replacement for Feature Store's ingestion schema enforcement but a prerequisite to it.
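The business-rule layer described above can be sketched as a single validation pass; the field names (freshness_minutes, quality_score, source_system, lineage_tag), the source registry, and the default thresholds are illustrative assumptions, not the production schema:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical registry of known source systems; the real one lives in DG config.
REGISTERED_SOURCES = {"scada-apac-east", "scada-emea-north"}

@dataclass
class ValidationResult:
    valid: bool
    violations: List[str] = field(default_factory=list)

def validate_business_rules(record: dict,
                            max_freshness_min: float = 15,
                            min_quality: float = 0.85) -> ValidationResult:
    """Checks the rules a schema check cannot express: freshness window,
    quality threshold, registered source, attached lineage tag."""
    violations = []
    if record.get("freshness_minutes", float("inf")) > max_freshness_min:
        violations.append("stale_reading")
    if record.get("quality_score", 0.0) < min_quality:
        violations.append("quality_below_threshold")
    if record.get("source_system") not in REGISTERED_SOURCES:
        violations.append("unregistered_source")
    if not record.get("lineage_tag"):
        violations.append("missing_lineage_tag")
    return ValidationResult(valid=not violations, violations=violations)
```

A schema-valid 73.4°C reading with a 60-minute-old timestamp would pass Feature Store ingestion but fail here with "stale_reading", which is exactly the gap the module closes.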
Evidence: ADR-DG01 (TFX vs Feature Store native validation) · Lineage graph (shows the chain DG enables) · EU AI Act Art. 11 (lineage provenance requirement)
CCO · S-02
How does feature lineage satisfy EU AI Act Article 11?
"EU AI Act Article 11 requires technical documentation covering training data — its provenance, characteristics, and quality. How does attaching a lineage tag to a Feature Store record satisfy that obligation in a way that an auditor can verify?"
Architectural response
The lineage graph diagram above shows the complete chain: raw sensor event → DG validation → Feature Store feature → Vertex AI inference → SHAP explanation → HITL record. Every link in that chain is queryable from BigQuery. For any SHAP explanation presented to a human reviewer, an auditor can trace: the lineage_ref field in the SHAP record → the Feature Store feature group entry with the lineage tag → the DG validation log entry confirming schema v2.0 conformance and quality score → the original Pub/Sub message ID and source system. This is not a description of provenance — it is a queryable audit chain. The TFX validation log entry, the quality score, and the schema version are all written to ae_governance.validation_log at validation time. The Article 11 obligation for training data documentation is satisfied by the same lineage chain — the Feature Store offline store records used for model training carry the same lineage tags, so the training dataset's provenance is fully documented.
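The audit walk can be expressed programmatically; the sketch below uses in-memory stand-ins for the BigQuery tables, and every table and field name other than lineage_ref and ae_governance.validation_log is an assumption for illustration:

```python
# Hypothetical in-memory stand-ins for the tables in the audit chain.
shap_records = {"shap-001": {"lineage_ref": "lin-42"}}
feature_store = {"lin-42": {"feature_group": "asset_telemetry", "validation_id": "val-7"}}
validation_log = {"val-7": {"schema_version": "2.0", "quality_score": 0.91,
                            "pubsub_message_id": "msg-9981",
                            "source_system": "scada-apac-east"}}

def trace_provenance(shap_id: str) -> dict:
    """Walk SHAP record -> Feature Store entry -> DG validation log -> raw message,
    the same joins an auditor would run in BigQuery."""
    lineage_ref = shap_records[shap_id]["lineage_ref"]
    feature = feature_store[lineage_ref]
    validation = validation_log[feature["validation_id"]]
    return {
        "shap_id": shap_id,
        "lineage_ref": lineage_ref,
        "schema_version": validation["schema_version"],
        "quality_score": validation["quality_score"],
        "pubsub_message_id": validation["pubsub_message_id"],
        "source_system": validation["source_system"],
    }
```

The point of the sketch is that each hop is a key lookup, not a narrative: if any link were missing, the trace would fail loudly rather than produce an unverifiable answer.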
Evidence: Lineage graph (§02) · ae_governance.validation_log · Feature Store lineage_ref field · EU AI Act Art. 11 training data documentation
Enterprise Architect · S-08
What happens to downstream modules when a large batch is quarantined?
"If APAC-East sends 847 records and all are quarantined, does Asset IQ stop making predictions for APAC-East units? For how long? And is there any risk that the quarantine cascades to other regional streams?"
Architectural response
Asset IQ continues running on all non-APAC-East records without interruption — the quarantine is scoped to the batch and region that failed validation. The Asset IQ RUL batch job runs across all 12,000 units but reads feature values from the Feature Store; if an APAC-East unit's feature values are stale (because new records are quarantined), the RUL model uses the last valid feature snapshot. Stale features are flagged — the quality_score for those units drops below the 0.85 high-confidence threshold, routing APAC-East unit predictions to HITL-06 rather than auto-work-order until the quarantine is resolved. This is the correct degraded behaviour: APAC-East units get lower-confidence predictions (requiring FSM review) rather than no predictions. The quarantine cannot cascade to other regional streams because the Pub/Sub topics are independent — ae-asset-events has per-region message filtering, and the DG agent processes each message independently. An APAC-East schema violation does not affect EMEA-North validation.
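The non-cascade property follows directly from per-message processing; a minimal sketch, assuming a region field on each message and a simple schema_ok flag (both illustrative, not the production message shape):

```python
quarantined_regions = set()

def process_message(msg: dict) -> str:
    """Validate one message independently. A failure quarantines only
    that message's region; other regional streams are never touched."""
    if not msg.get("schema_ok", True):
        quarantined_regions.add(msg["region"])
        return "quarantined"
    return "ingested"
```

Because no shared state is consulted during validation, a batch of 847 failing APAC-East messages adds one region to the quarantine set and has no effect on the EMEA-North path through the same code.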
Evidence: Data flow sequence (Asset IQ continues for 5 regions) · Asset IQ HITL-06 confidence threshold (stale features → lower confidence → HITL routing) · Pub/Sub per-region message filtering
Asset IQ — Field Service Manager
If APAC-East data is quarantined, does Asset IQ stop predicting for those units?
"We have 23 units in the APAC-East region. If their sensor data is quarantined for 4+ hours, do I have any visibility into their health status? Or am I flying blind while the data steward resolves the schema issue?"
Architectural response
You are not flying blind. Asset IQ uses the last validated feature snapshot for APAC-East units — the features from before the quarantine event are still in the Feature Store with their quality scores intact. The daily RUL batch job will still produce predictions for all 23 APAC-East units using yesterday's feature values, with a lower confidence score (because the freshness component of the quality metric degrades as features become stale). Predictions below the 0.82 confidence threshold route to your HITL-06 queue with a note: "Feature freshness below threshold — APAC-East schema quarantine active." You see the prediction, the SHAP attribution based on yesterday's data, and the staleness flag. You can decide to schedule preventive maintenance or wait for the quarantine to resolve. The system never silently stops predicting — it degrades gracefully with visible flags.
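The graceful degradation can be sketched as a freshness decay on the prediction confidence; the 0.82 threshold and the staleness note come from the design above, while the linear decay rate and function names are assumptions for illustration:

```python
def degraded_confidence(base_confidence: float, feature_age_hours: float,
                        decay_per_hour: float = 0.01) -> float:
    """Confidence degrades as the last validated feature snapshot ages."""
    return max(0.0, base_confidence - decay_per_hour * feature_age_hours)

def route_prediction(unit_id: str, base_confidence: float,
                     feature_age_hours: float, threshold: float = 0.82) -> dict:
    """Route below-threshold predictions to the FSM's HITL-06 queue
    with a visible staleness flag, never silently dropping the unit."""
    conf = degraded_confidence(base_confidence, feature_age_hours)
    stale = conf < threshold
    return {
        "unit": unit_id,
        "confidence": round(conf, 3),
        "route": "HITL-06" if stale else "auto",
        "note": ("Feature freshness below threshold — APAC-East schema "
                 "quarantine active") if stale else None,
    }
```

After a day on yesterday's snapshot, a unit that would normally auto-route drops under the threshold and lands in the HITL-06 queue with the flag attached, which is the behaviour the response describes.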
Evidence: Asset IQ Feature Store (freshness component in quality score) · HITL-06 interface (staleness flag) · Data flow sequence (stale feature note in Asset IQ HITL output)
CISO · S-09
Who has access to the quarantine dataset — it contains raw telemetry?
"The ae_quarantine BigQuery dataset contains raw telemetry that failed validation — including the original field values, the source system identifier, and the device ID. Who has access to this dataset, and for how long is the data retained?"
Architectural response
The ae_quarantine dataset has four principals: the DG agent SA (dg-sa@, write access — inserts quarantine records), the data steward role (read access — reviews violations), the Data Governance admin SA (read-write — for reinstatement processing), and the audit SA (read-only — 7-year audit access). No developer SA has access. The data classification for quarantine records is the same as the source data — Internal for asset telemetry. Quarantine records are retained for 90 days after resolution (reinstatement or discard), then automatically deleted by a BigQuery table expiry policy. Unresolved quarantine records are retained indefinitely until a steward decision is made. The raw record content in ae_quarantine is the same data that would have been ingested if validation had passed — it carries no additional sensitivity. The VPC-SC perimeter applies to ae_quarantine as it does to all ae_ datasets.
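The retention rule reduces to one function; a sketch of the policy logic only (the 90-day figure is from the policy above, the BigQuery expiry itself is enforced by the table's expiration setting, not application code):

```python
from datetime import datetime, timedelta
from typing import Optional

RETENTION_DAYS = 90  # post-resolution retention before table expiry deletes the record

def quarantine_expiry(resolved_at: Optional[datetime]) -> Optional[datetime]:
    """Unresolved records (resolved_at is None) are kept indefinitely
    until a steward decision; resolved ones expire 90 days later."""
    if resolved_at is None:
        return None
    return resolved_at + timedelta(days=RETENTION_DAYS)
```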
Evidence: Page 07 BigQuery IAM (ae_quarantine access policy) · 90-day quarantine retention (BigQuery table expiry) · Data classification: Internal (asset telemetry)
CFO · S-03
What is the cost of running Data Governance continuously on all six streams?
"Data Governance is always-on, processing three Pub/Sub topics continuously. What is the monthly infrastructure cost, and is there a cheaper architecture — batch validation, for example — that achieves the same quality guarantees?"
Architectural response
Data Governance's Cloud Run cost is minimal because it is stateless and event-driven — it runs only when a Pub/Sub message arrives. At ClaraVis's message volume (approximately 12,000 asset events per day + financial and contract events ≈ ~15,000 messages/day total), the Cloud Run processing time per message is approximately 80ms. Total monthly Cloud Run compute: ~15,000 × 30 × 0.08s = 36,000 seconds of request time, or 20 vCPU-hours at 2 vCPU = ~€0.14/month at Cloud Run pricing. BigQuery writes for validation log, quarantine records, and lineage tags: approximately €2/month at standard insertion pricing. Total Data Governance monthly infrastructure cost: approximately €2.20. Batch validation (running quality checks once daily) was rejected because schema violations in asset telemetry need to be detected before the daily RUL batch job runs — a quarantine event discovered in a batch validation run at 03:00 UTC would mean an entire day of potentially corrupt features already in the Feature Store. Streaming validation at the point of ingestion is both cheaper and faster than batch validation with quarantine rollback.
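The volume arithmetic behind the estimate can be reproduced directly; the message counts and per-message latency are the figures from the text, and the final euro cost additionally depends on the Cloud Run per-vCPU-second rate and free tier, so only the time quantities are computed here:

```python
messages_per_day = 15_000       # asset + financial + contract events
seconds_per_message = 0.08      # ~80ms Cloud Run processing per message
vcpus = 2                       # service runs at 2 vCPU

monthly_request_seconds = messages_per_day * 30 * seconds_per_message
monthly_vcpu_seconds = monthly_request_seconds * vcpus
monthly_vcpu_hours = monthly_vcpu_seconds / 3600
```

Billable compute is charged per vCPU-second, which is why the 36,000 seconds of request time at 2 vCPU translate to 20 vCPU-hours, a rounding error next to the BigQuery insertion cost.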
Evidence: Cloud Run pricing (event-driven, not instance-idle) · ADR-DG02 (quarantine-then-review over reject) · Asset IQ batch RUL (needs clean features before 02:00 UTC run)