MLOps for an LLM-native pipeline.
The Autonomous Author does not train models. It governs prompts. In an LLM-native pipeline, system prompts are the models — they encode the pipeline's intelligence, define its behaviour, and degrade over time if not managed. This page documents the operational discipline that keeps the pipeline reliable, measurable, and upgradeable.
What MLOps means for a prompt-governed pipeline.
Traditional MLOps governs training pipelines, model registries, feature stores, and inference infrastructure. None of that applies here — the Autonomous Author uses pre-trained models via an API and produces no training artefacts. But the operational challenges are structurally identical: the system can degrade, outputs can drift, "models" (system prompts) need versioning and testing, and upgrades need a decision process.
The mapping is direct. System prompts are models. rules.json is training data. The evaluation harness is the test suite. Output quality monitoring is drift detection. The model upgrade protocol is the promotion gate. Every MLOps concept has a clean analogue in this architecture.
System prompts are models. Govern them accordingly.
Each agent's system prompt is the primary encoding of its behaviour. A change to a system prompt is functionally equivalent to retraining a model — it can improve performance, introduce regressions, change the output schema, or break downstream agents. Every prompt change therefore goes through the same discipline as a code change: PR review, evaluation harness run, and a version bump before deployment.
| Prompt ID | Agent | Version | Last Change | Eval Baseline | Breaking Change |
|---|---|---|---|---|---|
| intake-agent | A-01 Intake | v1.3.0 | Added persona auto-detection for P2 imperative language signals | Dimension avg: 0.91 | No — additive detection logic |
| research-agent | A-02 Research | v1.2.1 | Tightened gap detection — reduced false-positive proper noun flags | Dimension avg: 0.88 | No — output schema unchanged |
| draft-agent-p1 | A-03 Draft (P1) | v2.0.0 | Major: restructured output to enforce Parameters table in all feature docs | Dimension avg: 0.93 | Yes — DraftDocument schema v2 required |
| draft-agent-p2 | A-03 Draft (P2) | v1.4.2 | Strengthened placeholder instruction — reduced inference on missing fields | Dimension avg: 0.89 | No — P-10 compliance improvement |
| ambiguity-detector | A-04 Ambiguity | v1.1.0 | Added implicit assumption detection category (P-09 coverage expansion) | Dimension avg: 0.86 | No — AmbiguityReport gains new flag category |
| compliance-agent | A-05 Compliance | v1.5.3 | Tuned semantic check prompt — reduced false positives on passive voice detection | Dimension avg: 0.94 | No — ComplianceReport schema unchanged |
| review-prep-agent | A-06 Review Prep | v1.0.1 | Fixed priority sort — HIGH ambiguity flags now surface before MED compliance violations | Dimension avg: 0.97 | No — ReviewBundle sort order only |
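The registry above can be modelled as a typed record with a mechanical promotion check. A minimal sketch in TypeScript; the interface and function names are illustrative, not the project's actual code:

```typescript
// Hypothetical prompt-registry entry; fields mirror the table above.
interface PromptRegistryEntry {
  promptId: string;        // e.g. "draft-agent-p2"
  agent: string;           // e.g. "A-03 Draft (P2)"
  version: string;         // semver: a breaking output-schema change needs a MAJOR bump
  evalBaseline: number;    // dimension average the next change must not regress
  breakingChange: boolean;
}

// A prompt change is deployable only if its new eval average meets the
// recorded baseline, and a breaking change carries a MAJOR version bump.
function canDeploy(
  prev: PromptRegistryEntry,
  next: PromptRegistryEntry,
  newEvalAvg: number
): boolean {
  const [prevMajor] = prev.version.split(".").map(Number);
  const [nextMajor] = next.version.split(".").map(Number);
  if (newEvalAvg < prev.evalBaseline) return false;                 // eval regression
  if (next.breakingChange && nextMajor <= prevMajor) return false;  // missing MAJOR bump
  return true;
}
```

The check is deliberately conservative: an eval regression blocks even a non-breaking patch release.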
50 fixtures. 8 dimensions. Every prompt change tested.
The evaluation harness is the test suite for the pipeline. It consists of 50 labelled fixture pairs (input → expected output), scored against 8 quality dimensions. A prompt change that degrades any dimension below its baseline threshold is rejected — regardless of improvements on other dimensions. Regressions in any dimension are blocking, not advisory.
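That blocking rule can be expressed directly: every dimension must meet its baseline, and improvements elsewhere never offset a regression. A sketch with hypothetical names, not the project's actual harness code:

```typescript
// Hypothetical eval gate: a prompt change passes only if every dimension
// meets its baseline threshold. Any single regression blocks the change.
type DimensionScores = Record<string, number>;

function gatePromptChange(
  scores: DimensionScores,
  baselines: DimensionScores
): { pass: boolean; regressions: string[] } {
  const regressions = Object.keys(baselines).filter(
    (dim) => (scores[dim] ?? 0) < baselines[dim]
  );
  return { pass: regressions.length === 0, regressions };
}
```

Returning the failing dimensions rather than an averaged score is the point: a 0.98 on compliance cannot paper over a 0.70 on ambiguity detection.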
Output contract never breaks
Every agent output is validated against its TypeScript-style output schema. Any response that fails to produce a valid typed object fails this dimension with a score of 0.0. Threshold: 1.00 — this is a binary gate, not a percentage. One schema violation on any fixture = blocked.
THRESHOLD = 1.00 · Binary gate
Known violations caught
20 fixture documents are seeded with known violations across all 10 rule categories. Scored as: violations detected / violations present. The threshold of 0.90 allows for up to 2 misses on a 20-violation document — chosen to accommodate edge cases in semantic rules without demanding perfection on hard cases.
THRESHOLD ≥ 0.90 · Recall measure
Vague terms found before review
25 P2 fixtures contain seeded vague quantifiers, undefined terms, and missing error states at known locations. Scored as: flags raised at correct locations / total seeded flags. Threshold 0.88 acknowledges that implicit assumption detection is harder than vague quantifier detection — a small miss rate is acceptable.
THRESHOLD ≥ 0.88 · P2 only
Missing context never inferred
Every P2 fixture with a documented context gap must produce exactly one [REQUIRES INPUT:] placeholder at that gap location. Threshold 1.00 — this is P-10 enforced as a metric. A single instance of the Draft Agent inferring content to fill a known gap is a blocking failure.
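A sketch of how that gate might be checked mechanically. It assumes the placeholder carries the gap label in the form `[REQUIRES INPUT: <gap>]`; the exact placeholder format and the function name are assumptions, not the documented spec:

```typescript
// Hypothetical P-10 check: every documented gap must appear as exactly one
// [REQUIRES INPUT: <gap>] placeholder in the draft. The score is binary.
function placeholderGate(draft: string, knownGaps: string[]): boolean {
  return knownGaps.every((gap) => {
    const marker = `[REQUIRES INPUT: ${gap}]`;
    // Count exact occurrences of the placeholder.
    const count = draft.split(marker).length - 1;
    return count === 1;
  });
}
```

A draft that fills the gap with plausible content instead of the placeholder fails outright, which is exactly the behaviour the metric exists to catch.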
THRESHOLD = 1.00 · P-10 enforcement
Every agent surfaces reasoning
Every fixture run must produce a valid XAI card for every agent — all four fields populated (understood, decided, why, uncertainties) with non-empty values. Any agent that omits its XAI card or produces an empty field fails this dimension. Threshold 1.00 — AR-01 is binary.
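The check itself is trivial to state in code. A sketch, assuming the XAI card is a four-field object (the type and function names are illustrative):

```typescript
// Hypothetical AR-01 check: all four XAI fields must exist and be non-empty.
interface XaiCard {
  understood: string;
  decided: string;
  why: string;
  uncertainties: string;
}

function xaiGate(card: Partial<XaiCard>): boolean {
  return (["understood", "decided", "why", "uncertainties"] as const).every((f) => {
    const value = card[f];
    return typeof value === "string" && value.trim().length > 0;
  });
}
```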
THRESHOLD = 1.00 · AR-01 enforcement
Noise in compliance output controlled
Clean fixtures (documents with no violations) are scored for false-positive compliance flags. Threshold ≤ 0.08 — an 8% false-positive rate corresponds to roughly 1.6 spurious flags on a 20-rule check. Above this threshold, the compliance output becomes noisy enough to erode Maya's trust in the tool.
THRESHOLD ≤ 0.08 · Trust metric
Confidence scores predict review need
For fixtures where human review found issues the agent didn't flag, the agent's confidence score should have been below 0.80. This measures whether confidence scores are predictive of actual quality — not just reported high. Computed as correlation between low confidence and missed flags.
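One simple way to compute that signal is as a fraction rather than a formal correlation coefficient: of the fixtures where human review caught something the agent missed, how many carried a confidence score below 0.80? The names below are illustrative:

```typescript
// Hypothetical D7 computation: confidence is "predictive" if, whenever the
// agent missed issues, its self-reported confidence was already low.
interface FixtureResult {
  confidence: number;  // agent's self-reported confidence, 0..1
  missedFlags: number; // issues human review found that the agent did not
}

function predictiveValidity(results: FixtureResult[]): number {
  const missed = results.filter((r) => r.missedFlags > 0);
  if (missed.length === 0) return 1.0; // nothing missed: vacuously predictive
  const lowConfidence = missed.filter((r) => r.confidence < 0.8);
  return lowConfidence.length / missed.length;
}
```

A score near 1.0 means low confidence reliably accompanies missed flags; a score near 0.0 means the agent reports high confidence even when it is wrong, which is the failure mode this dimension guards against.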
THRESHOLD ≥ 0.80 · Predictive validity
Performance within target
P95 latency for a full P2 pipeline run (the slower path) must stay under 20 seconds on Groq free tier. This is not a quality dimension per se — it is a user experience gate. A pipeline that takes 45 seconds to run will be abandoned. Target: 20s p95, with headroom below the 20-minute total session target.
THRESHOLD ≤ 20s P95 · UX gate
Output quality monitoring — detecting degradation without retraining.
In a traditionally trained model, drift is detected by monitoring the distance between production input distributions and training data distributions. In an LLM-native pipeline, the model doesn't change — but the upstream LLM can change (Groq model updates), the input distribution can change (Maya uses the tool for new doc types), and prompt version changes can introduce regressions. Drift manifests as output quality degradation, not distribution shift.
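A minimal sketch of how that degradation could be detected from session logs: compare a rolling window of recent per-session quality scores against a baseline mean frozen when the current prompt version shipped. The function and parameter names are assumptions, not the project's actual monitoring code:

```typescript
// Hypothetical drift check: flag when the mean of recent session quality
// scores drops more than a tolerance below the frozen baseline mean.
function qualityDrift(
  recentScores: number[], // e.g. the last 10 sessions' dimension averages
  baselineMean: number,   // recorded when the current prompt version shipped
  tolerance = 0.05
): boolean {
  if (recentScores.length === 0) return false;
  const mean = recentScores.reduce((a, b) => a + b, 0) / recentScores.length;
  return baselineMean - mean > tolerance;
}
```

Because the signal is quality scores rather than input distributions, this catches all three degradation sources at once: a silent upstream model update, a shift in Maya's document types, and a regressive prompt change that slipped past the eval harness.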
The compliance rule set — authored, versioned, tested, deployed.
rules.json is not a configuration file — it is the data artefact that makes the Compliance Agent deterministic. It is the closest analogue to a training dataset in this pipeline. Accordingly it has its own lifecycle: authoring, validation, testing, versioning, and deployment. A change to rules.json that introduces false positives is a data quality regression, not a configuration error.
{
"version": "1.5.3",
"generated": "2026-02-14",
"rule_count": 80,
"rules": [
{
"id": "R-047",
"category": "Terminology",
"rule": "Avoid Latin abbreviations: e.g., i.e., etc., vs., via.",
"check_type": "PATTERN",
      "pattern": "\\b(e\\.g\\.|i\\.e\\.|etc\\.|vs\\.|via\\b)",
"severity": "medium",
"fix_template": "Replace '{match}' with: e.g. → 'for example', i.e. → 'that is', etc. → (list items explicitly), vs. → 'versus'",
"style_guide_ref": "Google Developer Style Guide § Abbreviations",
"positive_fixture": "The API supports multiple auth methods, e.g., OAuth and API keys.",
"negative_fixture": "The API supports multiple authentication methods, including OAuth and API keys."
},
{
"id": "R-074",
"category": "Tone",
      "rule": "Avoid 'simple'/'simply', 'easy'/'easily', 'just', 'straightforward', 'obviously', 'trivially'.",
"check_type": "PATTERN",
"pattern": "\\b(simple|simply|easy|easily|just|straightforward|obviously|trivially)\\b",
"severity": "low",
"fix_template": "Remove '{match}' — describe the steps directly without characterising their difficulty.",
"style_guide_ref": "Google Developer Style Guide § Accessible language"
}
]
}
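The positive and negative fixtures embedded in each rule make the rule set self-testing. A deployment-time validation pass might look like this sketch, which compiles every PATTERN rule and checks it against its own fixtures (type and function names are illustrative):

```typescript
// Hypothetical rules.json validation: each PATTERN rule must compile, must
// fire on its positive fixture, and must stay silent on its negative
// fixture. A rule failing any of these is rejected before deployment.
interface ComplianceRule {
  id: string;
  check_type: string;
  pattern?: string;
  positive_fixture?: string;
  negative_fixture?: string;
}

function compilePattern(pattern: string): RegExp | null {
  try {
    return new RegExp(pattern, "i");
  } catch {
    return null; // invalid regex
  }
}

function validateRule(rule: ComplianceRule): string[] {
  if (rule.check_type !== "PATTERN" || !rule.pattern) return [];
  const re = compilePattern(rule.pattern);
  if (re === null) return [`${rule.id}: pattern does not compile`];
  const errors: string[] = [];
  if (rule.positive_fixture && !re.test(rule.positive_fixture))
    errors.push(`${rule.id}: pattern misses its positive fixture`);
  if (rule.negative_fixture && re.test(rule.negative_fixture))
    errors.push(`${rule.id}: pattern fires on its negative fixture`);
  return errors;
}
```

Running this over the full rule set on every rules.json change is what turns a "data quality regression" into a blocked PR instead of a production incident.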
When Groq releases a new model — the decision process.
Model upgrades are the highest-risk operational event in an LLM-native pipeline. A new model can improve performance on some dimensions while silently regressing on others. The upgrade protocol ensures that a model swap only happens after all 8 eval dimensions are verified against the new model — not on the basis of benchmark claims from the model provider.
| Model | D1 Schema | D2 Compliance | D3 Ambiguity | D4 Placeholder | D5 XAI | D6 FP Rate | D7 Conf | D8 Latency | Decision |
|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 70B baseline | 1.00 | 0.94 | 0.91 | 1.00 | 1.00 | 0.06 | 0.84 | 14.2s | CURRENT |
| Llama 3.2 70B hypothetical | 1.00 | 0.96 | 0.93 | 1.00 | 1.00 | 0.05 | 0.87 | 12.8s | PROMOTE |
| Mixtral 8x22B hypothetical | 1.00 | 0.91 | 0.82 | 0.94 | 1.00 | 0.07 | 0.81 | 16.1s | HOLD |
A failed gate means HOLD: the model is held until all 8 dimensions pass. Mixtral fails on D3 (ambiguity detection) and D4 (placeholder insertion), both P2 quality gates. That is not acceptable for a pipeline where DDD spec quality is a first-class requirement.
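The hold/promote decision reduces to a per-dimension gate with a direction: most dimensions are "higher is better", but D6 (false-positive rate) and D8 (latency) are "lower is better". A sketch using the thresholds from the dimensions above; the names are illustrative:

```typescript
// Hypothetical upgrade gate. Thresholds mirror the eval dimensions above;
// D6 and D8 invert the comparison because lower is better.
interface Gate {
  threshold: number;
  higherIsBetter: boolean;
}

const GATES: Record<string, Gate> = {
  D1_schema:      { threshold: 1.00, higherIsBetter: true },
  D2_compliance:  { threshold: 0.90, higherIsBetter: true },
  D3_ambiguity:   { threshold: 0.88, higherIsBetter: true },
  D4_placeholder: { threshold: 1.00, higherIsBetter: true },
  D5_xai:         { threshold: 1.00, higherIsBetter: true },
  D6_fpRate:      { threshold: 0.08, higherIsBetter: false },
  D7_confidence:  { threshold: 0.80, higherIsBetter: true },
  D8_latencyP95:  { threshold: 20,   higherIsBetter: false }, // seconds
};

function upgradeDecision(scores: Record<string, number>): "PROMOTE" | "HOLD" {
  const allPass = Object.entries(GATES).every(([dim, gate]) => {
    const score = scores[dim];
    if (score === undefined) return false; // unmeasured dimension blocks
    return gate.higherIsBetter ? score >= gate.threshold : score <= gate.threshold;
  });
  return allPass ? "PROMOTE" : "HOLD";
}
```

An unmeasured dimension blocks promotion by design: a candidate model is never promoted on partial evidence or provider benchmark claims.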
Three MLOps challenges. Every objection answered.
"You're not doing MLOps. You're doing software engineering with some extra JSON files. Renaming it MLOps is resume padding."
Call it "Prompt Engineering" and skip the MLOps framing entirely. Simpler, less contestable.
MLOps is fundamentally the discipline of keeping ML-powered systems reliable in production. The specific implementation — whether weights or prompts encode the intelligence — is secondary. This pipeline has the same operational risks as a traditionally trained system: output can degrade, the "model" (prompt) can be changed with unintended consequences, the inference backend can change, and there is no natural audit trail without deliberate instrumentation. The MLOps tooling — prompt registry, eval harness, drift detection, upgrade protocol — addresses exactly these risks. The framing is accurate. The artefacts are different; the concerns are identical.
The analogy between prompts and model weights is imperfect — prompts are interpretable and editable in a way weights are not. The governance process is less formal than full MLOps. These limitations are documented in the Glossary under "Prompt Governance."
"A 50-fixture eval set has no statistical power. You cannot draw meaningful conclusions from it. A model that passes 50 fixtures could still fail catastrophically on production inputs."
Expand to 500+ fixtures. More data = more confidence. Harder to argue with.
The eval harness serves a different purpose than a held-out test set in a supervised learning context. It is not measuring generalisation from a large sample — it is verifying that specific, documented behaviours are preserved across prompt changes. The 50 fixtures are carefully constructed to cover all 10 rule categories, all failure modes documented in known_failure_modes in each prompt file, and edge cases identified in production sessions. This is a specification-driven test suite, not a statistical sample. Expanding to 500 fixtures would mostly add redundant coverage of already-tested cases. The investment is in fixture quality and coverage of known failure modes, not fixture quantity.
The eval set will not catch novel failure modes on production inputs that don't resemble any fixture. This is mitigated by the session log drift monitoring — production failures surface as quality degradation signals even if the eval set didn't predict them. The eval set prevents regressions on known behaviours; the drift monitor catches new failures in production.
"Your drift detection relies on session logs stored in the writer's browser. If she clears her cache, you lose your entire drift signal. You have no production telemetry."
Send anonymised session metrics to a free-tier telemetry service (PostHog, Plausible). Reliable, persistent drift signal without depending on browser storage.
AR-03 (client-side only — no backend) and C-03 (no workflow disruption) are binding constraints. Sending session metrics to a telemetry service — even anonymised — crosses the "no data leaves the browser" guarantee. Enterprise writers operate in environments with data governance policies. A tool that sends telemetry, even without document content, could be blocked by IT policy and undermine the enterprise adoption argument. The architectural integrity of the zero-server guarantee is worth more than the operational convenience of external drift monitoring. The drift detection is therefore self-hosted: visible to Maya, voluntary, and entirely within her browser. It is imperfect telemetry by design, and that is documented.
Drift detection depends on Maya's session history being intact in IndexedDB. If she clears browser data, the drift baseline resets. The eval harness (run on prompt changes) provides the primary quality gate — session log drift monitoring is a secondary signal, not the primary defence. The limitation is disclosed.