MLOps for an LLM-native pipeline.
The Autonomous Author does not train models. It governs prompts. In an LLM-native pipeline, system prompts are the models — they encode the pipeline's intelligence, define its behaviour, and degrade over time if not managed. This page documents the operational discipline that keeps the pipeline reliable, measurable, and upgradeable.
What MLOps means for a prompt-governed pipeline.
Traditional MLOps governs training pipelines, model registries, feature stores, and inference infrastructure. None of that applies here — the Autonomous Author uses pre-trained models via an API and produces no training artefacts. But the operational challenges are structurally identical: the system can degrade, outputs can drift, "models" (system prompts) need versioning and testing, and upgrades need a decision process.
The mapping is direct. System prompts are models. rules.json is training data. The evaluation harness is the test suite. Output quality monitoring is drift detection. The model upgrade protocol is the promotion gate. Every MLOps concept has a clean analogue in this architecture.
System prompts are models. Govern them accordingly.
Each agent's system prompt is the primary encoding of its behaviour. A change to a system prompt is functionally equivalent to retraining a model — it can improve performance, introduce regressions, change the output schema, or break downstream agents. Every prompt change therefore goes through the same discipline as a code change: PR review, evaluation harness run, and a version bump before deployment.
| Prompt ID | Agent | Version | Last Change | Eval Baseline | Breaking Change |
|---|---|---|---|---|---|
| intake-agent | A-01 Intake | v1.3.0 | Added persona auto-detection for P2 imperative language signals | Dimension avg: 0.91 | No — additive detection logic |
| research-agent | A-02 Research | v1.2.1 | Tightened gap detection — reduced false-positive proper noun flags | Dimension avg: 0.88 | No — output schema unchanged |
| draft-agent-p1 | A-03 Draft (P1) | v2.0.0 | Major: restructured output to enforce Parameters table in all feature docs | Dimension avg: 0.93 | Yes — DraftDocument schema v2 required |
| draft-agent-p2 | A-03 Draft (P2) | v1.4.2 | Strengthened placeholder instruction — reduced inference on missing fields | Dimension avg: 0.89 | No — P-10 compliance improvement |
| ambiguity-detector | A-04 Ambiguity | v1.1.0 | Added implicit assumption detection category (P-09 coverage expansion) | Dimension avg: 0.86 | No — AmbiguityReport gains new flag category |
| compliance-agent | A-05 Compliance | v1.5.3 | Tuned semantic check prompt — reduced false positives on passive voice detection | Dimension avg: 0.94 | No — ComplianceReport schema unchanged |
| review-prep-agent | A-06 Review Prep | v1.0.1 | Fixed priority sort — HIGH ambiguity flags now surface before MED compliance violations | Dimension avg: 0.97 | No — ReviewBundle sort order only |
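The registry above can be modelled as a typed record with a mechanical promotion check. A minimal sketch in TypeScript; the interface and function names are illustrative, not the project's actual code:

```typescript
// Hypothetical prompt-registry entry; fields mirror the table above.
interface PromptRegistryEntry {
  promptId: string;        // e.g. "draft-agent-p2"
  agent: string;           // e.g. "A-03 Draft (P2)"
  version: string;         // semver: a breaking output-schema change needs a MAJOR bump
  evalBaseline: number;    // dimension average the next change must not regress
  breakingChange: boolean;
}

// A prompt change is deployable only if its new eval average meets the
// recorded baseline, and a breaking change carries a MAJOR version bump.
function canDeploy(
  prev: PromptRegistryEntry,
  next: PromptRegistryEntry,
  newEvalAvg: number
): boolean {
  const [prevMajor] = prev.version.split(".").map(Number);
  const [nextMajor] = next.version.split(".").map(Number);
  if (newEvalAvg < prev.evalBaseline) return false;                 // eval regression
  if (next.breakingChange && nextMajor <= prevMajor) return false;  // missing MAJOR bump
  return true;
}
```

The check is deliberately conservative: an eval regression blocks even a non-breaking patch release.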
50 fixtures. 8 dimensions. Every prompt change tested.
The evaluation harness is the test suite for the pipeline. It consists of 50 labelled fixture pairs (input → expected output), scored against 8 quality dimensions. A prompt change that degrades any dimension below its baseline threshold is rejected — regardless of improvements on other dimensions. Regressions in any dimension are blocking, not advisory.
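That blocking rule can be expressed directly: every dimension must meet its baseline, and improvements elsewhere never offset a regression. A sketch with hypothetical names, not the project's actual harness code:

```typescript
// Hypothetical eval gate: a prompt change passes only if every dimension
// meets its baseline threshold. Any single regression blocks the change.
type DimensionScores = Record<string, number>;

function gatePromptChange(
  scores: DimensionScores,
  baselines: DimensionScores
): { pass: boolean; regressions: string[] } {
  const regressions = Object.keys(baselines).filter(
    (dim) => (scores[dim] ?? 0) < baselines[dim]
  );
  return { pass: regressions.length === 0, regressions };
}
```

Returning the failing dimensions rather than an averaged score is the point: a 0.98 on compliance cannot paper over a 0.70 on ambiguity detection.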
Output contract never breaks
Every agent output is validated against its TypeScript-style output schema. Any response that fails to produce a valid typed object fails this dimension with a score of 0.0. Threshold: 1.00 — this is a binary gate, not a percentage. One schema violation on any fixture = blocked.
THRESHOLD = 1.00 · Binary gate
Known violations caught
20 fixture documents are seeded with known violations across all 10 rule categories. Scored as: violations detected / violations present. The threshold of 0.90 allows for up to 2 misses on a 20-violation document — chosen to accommodate edge cases in semantic rules without demanding perfection on hard cases.
THRESHOLD ≥ 0.90 · Recall measure
Vague terms found before review
25 P2 fixtures contain seeded vague quantifiers, undefined terms, and missing error states at known locations. Scored as: flags raised at correct locations / total seeded flags. Threshold 0.88 acknowledges that implicit assumption detection is harder than vague quantifier detection — a small miss rate is acceptable.
THRESHOLD ≥ 0.88 · P2 only
Missing context never inferred
Every P2 fixture with a documented context gap must produce exactly one [REQUIRES INPUT:] placeholder at that gap location. Threshold 1.00 — this is P-10 enforced as a metric. A single instance of the Draft Agent inferring content to fill a known gap is a blocking failure.
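A sketch of how that gate might be checked mechanically. It assumes the placeholder carries the gap label in the form `[REQUIRES INPUT: <gap>]`; the exact placeholder format and the function name are assumptions, not the documented spec:

```typescript
// Hypothetical P-10 check: every documented gap must appear as exactly one
// [REQUIRES INPUT: <gap>] placeholder in the draft. The score is binary.
function placeholderGate(draft: string, knownGaps: string[]): boolean {
  return knownGaps.every((gap) => {
    const marker = `[REQUIRES INPUT: ${gap}]`;
    // Count exact occurrences of the placeholder.
    const count = draft.split(marker).length - 1;
    return count === 1;
  });
}
```

A draft that fills the gap with plausible content instead of the placeholder fails outright, which is exactly the behaviour the metric exists to catch.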
THRESHOLD = 1.00 · P-10 enforcement
Every agent surfaces reasoning
Every fixture run must produce a valid XAI card for every agent — all four fields populated (understood, decided, why, uncertainties) with non-empty values. Any agent that omits its XAI card or produces an empty field fails this dimension. Threshold 1.00 — AR-01 is binary.
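The check itself is trivial to state in code. A sketch, assuming the XAI card is a four-field object (the type and function names are illustrative):

```typescript
// Hypothetical AR-01 check: all four XAI fields must exist and be non-empty.
interface XaiCard {
  understood: string;
  decided: string;
  why: string;
  uncertainties: string;
}

function xaiGate(card: Partial<XaiCard>): boolean {
  return (["understood", "decided", "why", "uncertainties"] as const).every((f) => {
    const value = card[f];
    return typeof value === "string" && value.trim().length > 0;
  });
}
```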
THRESHOLD = 1.00 · AR-01 enforcement
Noise in compliance output controlled
Clean fixtures (documents with no violations) are scored for false-positive compliance flags. Threshold ≤ 0.08 — an 8% false-positive rate corresponds to roughly 1.6 spurious flags on a 20-rule check. Above this threshold, the compliance output becomes noisy enough to erode Maya's trust in the tool.
THRESHOLD ≤ 0.08 · Trust metric
Confidence scores predict review need
For fixtures where human review found issues the agent didn't flag, the agent's confidence score should have been below 0.80. This measures whether confidence scores are predictive of actual quality — not just reported high. Computed as correlation between low confidence and missed flags.
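One simple way to compute that signal is as a fraction rather than a formal correlation coefficient: of the fixtures where human review caught something the agent missed, how many carried a confidence score below 0.80? The names below are illustrative:

```typescript
// Hypothetical D7 computation: confidence is "predictive" if, whenever the
// agent missed issues, its self-reported confidence was already low.
interface FixtureResult {
  confidence: number;  // agent's self-reported confidence, 0..1
  missedFlags: number; // issues human review found that the agent did not
}

function predictiveValidity(results: FixtureResult[]): number {
  const missed = results.filter((r) => r.missedFlags > 0);
  if (missed.length === 0) return 1.0; // nothing missed: vacuously predictive
  const lowConfidence = missed.filter((r) => r.confidence < 0.8);
  return lowConfidence.length / missed.length;
}
```

A score near 1.0 means low confidence reliably accompanies missed flags; a score near 0.0 means the agent reports high confidence even when it is wrong, which is the failure mode this dimension guards against.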
THRESHOLD ≥ 0.80 · Predictive validity
Performance within target
P95 latency for a full P2 pipeline run (the slower path) must stay under 20 seconds on Groq free tier. This is not a quality dimension per se — it is a user experience gate. A pipeline that takes 45 seconds to run will be abandoned. Target: 20s p95, with headroom below the 20-minute total session target.
THRESHOLD ≤ 20s P95 · UX gate
Output quality monitoring — detecting degradation without retraining.
In a traditionally trained model, drift is detected by monitoring the distance between production input distributions and training data distributions. In an LLM-native pipeline, the model doesn't change — but the upstream LLM can change (Groq model updates), the input distribution can change (Maya uses the tool for new doc types), and prompt version changes can introduce regressions. Drift manifests as output quality degradation, not distribution shift.
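A minimal sketch of how that degradation could be detected from session logs: compare a rolling window of recent per-session quality scores against a baseline mean frozen when the current prompt version shipped. The function and parameter names are assumptions, not the project's actual monitoring code:

```typescript
// Hypothetical drift check: flag when the mean of recent session quality
// scores drops more than a tolerance below the frozen baseline mean.
function qualityDrift(
  recentScores: number[], // e.g. the last 10 sessions' dimension averages
  baselineMean: number,   // recorded when the current prompt version shipped
  tolerance = 0.05
): boolean {
  if (recentScores.length === 0) return false;
  const mean = recentScores.reduce((a, b) => a + b, 0) / recentScores.length;
  return baselineMean - mean > tolerance;
}
```

Because the signal is quality scores rather than input distributions, this catches all three degradation sources at once: a silent upstream model update, a shift in Maya's document types, and a regressive prompt change that slipped past the eval harness.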
The compliance rule set — authored, versioned, tested, deployed.
rules.json is not a configuration file — it is the data artefact that makes the Compliance Agent deterministic. It is the closest analogue to a training dataset in this pipeline. Accordingly it has its own lifecycle: authoring, validation, testing, versioning, and deployment. A change to rules.json that introduces false positives is a data quality regression, not a configuration error.
{
"version": "1.5.3",
"generated": "2026-02-14",
"rule_count": 80,
"rules": [
{
"id": "R-047",
"category": "Terminology",
"rule": "Avoid Latin abbreviations: e.g., i.e., etc., vs., via.",
"check_type": "PATTERN",
      "pattern": "\\b(e\\.g\\.|i\\.e\\.|etc\\.|vs\\.|via\\b)",
"severity": "medium",
"fix_template": "Replace '{match}' with: e.g. → 'for example', i.e. → 'that is', etc. → (list items explicitly), vs. → 'versus'",
"style_guide_ref": "Google Developer Style Guide § Abbreviations",
"positive_fixture": "The API supports multiple auth methods, e.g., OAuth and API keys.",
"negative_fixture": "The API supports multiple authentication methods, including OAuth and API keys."
},
{
"id": "R-074",
"category": "Tone",
      "rule": "Avoid 'simple'/'simply', 'easy'/'easily', 'just', 'straightforward', 'obviously', 'trivially'.",
"check_type": "PATTERN",
"pattern": "\\b(simple|simply|easy|easily|just|straightforward|obviously|trivially)\\b",
"severity": "low",
"fix_template": "Remove '{match}' — describe the steps directly without characterising their difficulty.",
"style_guide_ref": "Google Developer Style Guide § Accessible language"
}
]
}
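The positive and negative fixtures embedded in each rule make the rule set self-testing. A deployment-time validation pass might look like this sketch, which compiles every PATTERN rule and checks it against its own fixtures (type and function names are illustrative):

```typescript
// Hypothetical rules.json validation: each PATTERN rule must compile, must
// fire on its positive fixture, and must stay silent on its negative
// fixture. A rule failing any of these is rejected before deployment.
interface ComplianceRule {
  id: string;
  check_type: string;
  pattern?: string;
  positive_fixture?: string;
  negative_fixture?: string;
}

function compilePattern(pattern: string): RegExp | null {
  try {
    return new RegExp(pattern, "i");
  } catch {
    return null; // invalid regex
  }
}

function validateRule(rule: ComplianceRule): string[] {
  if (rule.check_type !== "PATTERN" || !rule.pattern) return [];
  const re = compilePattern(rule.pattern);
  if (re === null) return [`${rule.id}: pattern does not compile`];
  const errors: string[] = [];
  if (rule.positive_fixture && !re.test(rule.positive_fixture))
    errors.push(`${rule.id}: pattern misses its positive fixture`);
  if (rule.negative_fixture && re.test(rule.negative_fixture))
    errors.push(`${rule.id}: pattern fires on its negative fixture`);
  return errors;
}
```

Running this over the full rule set on every rules.json change is what turns a "data quality regression" into a blocked PR instead of a production incident.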
When Groq releases a new model — the decision process.
Model upgrades are the highest-risk operational event in an LLM-native pipeline. A new model can improve performance on some dimensions while silently regressing on others. The upgrade protocol ensures that a model swap only happens after all 8 eval dimensions are verified against the new model — not on the basis of benchmark claims from the model provider.
| Model | D1 Schema | D2 Compliance | D3 Ambiguity | D4 Placeholder | D5 XAI | D6 FP Rate | D7 Conf | D8 Latency | Decision |
|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 70B baseline | 1.00 | 0.94 | 0.91 | 1.00 | 1.00 | 0.06 | 0.84 | 14.2s | CURRENT |
| Llama 3.2 70B hypothetical | 1.00 | 0.96 | 0.93 | 1.00 | 1.00 | 0.05 | 0.87 | 12.8s | PROMOTE |
| Mixtral 8x22B hypothetical | 1.00 | 0.91 | 0.82 | 0.94 | 1.00 | 0.07 | 0.81 | 16.1s | HOLD |
A failed gate means HOLD: the model is held until all 8 dimensions pass. Mixtral fails on D3 (ambiguity detection) and D4 (placeholder insertion), both P2 quality gates. That is not acceptable for a pipeline where DDD spec quality is a first-class requirement.
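The hold/promote decision reduces to a per-dimension gate with a direction: most dimensions are "higher is better", but D6 (false-positive rate) and D8 (latency) are "lower is better". A sketch using the thresholds from the dimensions above; the names are illustrative:

```typescript
// Hypothetical upgrade gate. Thresholds mirror the eval dimensions above;
// D6 and D8 invert the comparison because lower is better.
interface Gate {
  threshold: number;
  higherIsBetter: boolean;
}

const GATES: Record<string, Gate> = {
  D1_schema:      { threshold: 1.00, higherIsBetter: true },
  D2_compliance:  { threshold: 0.90, higherIsBetter: true },
  D3_ambiguity:   { threshold: 0.88, higherIsBetter: true },
  D4_placeholder: { threshold: 1.00, higherIsBetter: true },
  D5_xai:         { threshold: 1.00, higherIsBetter: true },
  D6_fpRate:      { threshold: 0.08, higherIsBetter: false },
  D7_confidence:  { threshold: 0.80, higherIsBetter: true },
  D8_latencyP95:  { threshold: 20,   higherIsBetter: false }, // seconds
};

function upgradeDecision(scores: Record<string, number>): "PROMOTE" | "HOLD" {
  const allPass = Object.entries(GATES).every(([dim, gate]) => {
    const score = scores[dim];
    if (score === undefined) return false; // unmeasured dimension blocks
    return gate.higherIsBetter ? score >= gate.threshold : score <= gate.threshold;
  });
  return allPass ? "PROMOTE" : "HOLD";
}
```

An unmeasured dimension blocks promotion by design: a candidate model is never promoted on partial evidence or provider benchmark claims.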
Three MLOps challenges. Every objection answered.
"You're not doing MLOps. You're doing software engineering with some extra JSON files. Renaming it MLOps is resume padding."
Call it "Prompt Engineering" and skip the MLOps framing entirely. Simpler, less contestable.
MLOps is fundamentally the discipline of keeping ML-powered systems reliable in production. The specific implementation — whether weights or prompts encode the intelligence — is secondary. This pipeline has the same operational risks as a traditionally trained system: output can degrade, the "model" (prompt) can be changed with unintended consequences, the inference backend can change, and there is no natural audit trail without deliberate instrumentation. The MLOps tooling — prompt registry, eval harness, drift detection, upgrade protocol — addresses exactly these risks. The framing is accurate. The artefacts are different; the concerns are identical.
The analogy between prompts and model weights is imperfect — prompts are interpretable and editable in a way weights are not. The governance process is less formal than full MLOps. These limitations are documented in the Glossary under "Prompt Governance."
"A 50-fixture eval set has no statistical power. You cannot draw meaningful conclusions from it. A model that passes 50 fixtures could still fail catastrophically on production inputs."
Expand to 500+ fixtures. More data = more confidence. Harder to argue with.
The eval harness serves a different purpose than a held-out test set in a supervised learning context. It is not measuring generalisation from a large sample — it is verifying that specific, documented behaviours are preserved across prompt changes. The 50 fixtures are carefully constructed to cover all 10 rule categories, all failure modes documented in known_failure_modes in each prompt file, and edge cases identified in production sessions. This is a specification-driven test suite, not a statistical sample. Expanding to 500 fixtures would mostly add redundant coverage of already-tested cases. The investment is in fixture quality and coverage of known failure modes, not fixture quantity.
The eval set will not catch novel failure modes on production inputs that don't resemble any fixture. This is mitigated by the session log drift monitoring — production failures surface as quality degradation signals even if the eval set didn't predict them. The eval set prevents regressions on known behaviours; the drift monitor catches new failures in production.
"Your drift detection relies on session logs stored in the writer's browser. If she clears her cache, you lose your entire drift signal. You have no production telemetry."
Send anonymised session metrics to a free-tier telemetry service (PostHog, Plausible). Reliable, persistent drift signal without depending on browser storage.
AR-03 (client-side only — no backend) and C-03 (no workflow disruption) are binding constraints. Sending session metrics to a telemetry service — even anonymised — crosses the "no data leaves the browser" guarantee. Enterprise writers operate in environments with data governance policies. A tool that sends telemetry, even without document content, could be blocked by IT policy and undermine the enterprise adoption argument. The architectural integrity of the zero-server guarantee is worth more than the operational convenience of external drift monitoring. The drift detection is therefore self-hosted: visible to Maya, voluntary, and entirely within her browser. It is imperfect telemetry by design, and that is documented.
Drift detection depends on Maya's session history being intact in IndexedDB. If she clears browser data, the drift baseline resets. The eval harness (run on prompt changes) provides the primary quality gate — session log drift monitoring is a secondary signal, not the primary defence. The limitation is disclosed.