The Autonomous Author / Page 05 — MLOps

MLOps for an
LLM-native
pipeline.

The Autonomous Author does not train models. It governs prompts. In an LLM-native pipeline, system prompts are the models — they encode the pipeline's intelligence, define its behaviour, and degrade over time if not managed. This page documents the operational discipline that keeps the pipeline reliable, measurable, and upgradeable.

Prompt Engineering Governance · Evaluation Harness · Drift Detection — Output Quality · rules.json Lifecycle · Model Upgrade Protocol · No Model Training — LLM-native
Framing

What MLOps means for a
prompt-governed pipeline.

Traditional MLOps governs training pipelines, model registries, feature stores, and inference infrastructure. None of that applies here — the Autonomous Author uses pre-trained models via an API and produces no training artefacts. But the operational challenges are structurally identical: the system can degrade, outputs can drift, "models" (system prompts) need versioning and testing, and upgrades need a decision process.

The mapping is direct. System prompts are models. rules.json is training data. The evaluation harness is the test suite. Output quality monitoring is drift detection. The model upgrade protocol is the promotion gate. Every MLOps concept has a clean analogue in this architecture.

MLOPS CONCEPT MAPPING — TRADITIONAL vs LLM-NATIVE PIPELINE
ML Model (trained weights, architecture, checkpoint) → System Prompt: versioned text file encoding agent behaviour and output contract
Model Registry (version-controlled model artefacts + metadata) → Prompt Registry (Git): /prompts/ directory in repo · semver tags · SHA hashes · PR-gated changes
Training Data (labelled dataset used to train/fine-tune the model) → rules.json + Eval Test Set: 80-rule compliance set + 50 labelled input/output pairs for evaluation
Test Suite / Eval (held-out labelled data · accuracy/F1 metrics) → Evaluation Harness: 50 fixture inputs · expected outputs · scored against 8 quality dimensions
Drift Detection (data/concept drift monitoring on production inputs) → Output Quality Monitoring: violation rate, confidence score distribution, placeholder rate tracked per session
Model Promotion Gate (eval thresholds passed → promote to production) → Model Upgrade Protocol: new Groq model runs full eval harness · must match or exceed baseline on all 8 dimensions
Diagram 18 MLOps concept mapping — traditional pipeline vs LLM-native pipeline. Every classical MLOps concern has a direct analogue.
Prompt Governance

System prompts are models.
Govern them accordingly.

Each agent's system prompt is the primary encoding of its behaviour. A change to a system prompt is functionally equivalent to retraining a model — it can improve performance, introduce regressions, change the output schema, or break downstream agents. Every prompt change therefore goes through the same discipline as a code change: PR review, evaluation harness run, and a version bump before deployment.

/prompts/ (Git)
intake-agent.md v1.3.0 · research-agent.md v1.2.1 · draft-agent-p1.md v2.0.0 · draft-agent-p2.md v1.4.2 · ambiguity-detector.md v1.1.0 · compliance-agent.md v1.5.3 · review-prep-agent.md v1.0.1
Each file contains: • version header (semver) • SHA-256 content hash • input schema reference • output schema reference • change log (last 5 entries) • eval baseline scores • model compatibility list • known failure modes. Prompts are loaded at runtime and SHA verified.
Change flow:
1. Propose change · edit the .md file, bump semver
2. Run eval harness · 50 fixtures, 8 dimensions. Fail any gate → REJECT, revise the prompt. Pass all gates → continue
3. PR review · SHA hash verified, CHANGELOG updated
4. Merge → Deploy
Runtime behaviour on pipeline init:
1. Fetch all prompt .md files from /prompts/
2. Compute the SHA-256 of each file
3. Compare against the expected hash in prompt-manifest.json
4. Hash mismatch → the pipeline refuses to start and surfaces an alert
5. Hash match → prompts are loaded into the agent constructors
prompt-manifest.json structure:
{
  "schema_version": "1.0",
  "prompts": [
    {"id": "intake-agent", "version": "1.3.0", "sha256": "a3f9...", "file": "intake-agent.md"},
    ... (one entry per agent)
  ]
}
SHA verification prevents prompt tampering and ensures the deployed version matches the tested version.
Diagram 19 Prompt governance flow — /prompts/ directory structure, change proposal gate, eval harness decision, runtime SHA verification.
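The runtime SHA verification step can be sketched in a few lines. A minimal TypeScript sketch, assuming the prompt-manifest.json entry shape shown above; it uses Node's `crypto` module for hashing (a browser build would use `crypto.subtle.digest` instead), and all function names are illustrative:

```typescript
import { createHash } from "node:crypto";

interface ManifestEntry {
  id: string;
  version: string;
  sha256: string; // hex digest of the prompt file contents
  file: string;
}

// Compute the SHA-256 hex digest of a prompt file's contents.
function sha256Hex(contents: string): string {
  return createHash("sha256").update(contents, "utf8").digest("hex");
}

// Return the ids of prompts whose loaded contents do not match the manifest.
// A non-empty result means the pipeline must refuse to start.
function verifyPrompts(
  manifest: ManifestEntry[],
  files: Map<string, string> // file name -> fetched contents
): string[] {
  const mismatches: string[] = [];
  for (const entry of manifest) {
    const contents = files.get(entry.file);
    if (contents === undefined || sha256Hex(contents) !== entry.sha256) {
      mismatches.push(entry.id);
    }
  }
  return mismatches;
}
```

On pipeline init, a check like `verifyPrompts` guarantees that the deployed prompt text is byte-identical to the version the eval harness tested.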
Prompt ID · Agent · Version · Last Change · Eval Baseline · Breaking Change
intake-agent · A-01 Intake · v1.3.0 · Added persona auto-detection for P2 imperative language signals · Dimension avg: 0.91 · No — additive detection logic
research-agent · A-02 Research · v1.2.1 · Tightened gap detection — reduced false-positive proper noun flags · Dimension avg: 0.88 · No — output schema unchanged
draft-agent-p1 · A-03 Draft (P1) · v2.0.0 · Major: restructured output to enforce Parameters table in all feature docs · Dimension avg: 0.93 · Yes — DraftDocument schema v2 required
draft-agent-p2 · A-03 Draft (P2) · v1.4.2 · Strengthened placeholder instruction — reduced inference on missing fields · Dimension avg: 0.89 · No — P-10 compliance improvement
ambiguity-detector · A-04 Ambiguity · v1.1.0 · Added implicit assumption detection category (P-09 coverage expansion) · Dimension avg: 0.86 · No — AmbiguityReport gains new flag category
compliance-agent · A-05 Compliance · v1.5.3 · Tuned semantic check prompt — reduced false positives on passive voice detection · Dimension avg: 0.94 · No — ComplianceReport schema unchanged
review-prep-agent · A-06 Review Prep · v1.0.1 · Fixed priority sort — HIGH ambiguity flags now surface before MED compliance violations · Dimension avg: 0.97 · No — ReviewBundle sort order only
Evaluation Harness

50 fixtures. 8 dimensions.
Every prompt change tested.

The evaluation harness is the test suite for the pipeline. It consists of 50 labelled fixture pairs (input → expected output), scored against 8 quality dimensions. A prompt change that degrades any dimension below its baseline threshold is rejected — regardless of improvements on other dimensions. Regressions in any dimension are blocking, not advisory.

Fixture Set (50)
P1 Fixtures (25): • Simple endpoint docs (8) • Complex multi-param APIs (8) • Delta updates (5) • Ambiguous tickets (4)
P2 Fixtures (25): • Clean intent statements (8) • Ambiguous specs (8) • Missing context (5) • Known violations (4)
Each fixture: input + expected output + expected XAI card fields
Pipeline run (candidate prompt, all 6 agents) → Scorer (8 dimensions): actual vs expected · schema conformance · violation detection rate · confidence calibration
Eval Report — 8 Dimensions (Dimension · Threshold · Score):
1. Schema conformance · ≥ 1.00 · 1.00
2. Compliance detection rate · ≥ 0.90 · 0.94
3. Ambiguity detection rate (P2) · ≥ 0.88 · 0.91
4. Placeholder insertion (P2) · = 1.00 · 1.00
5. XAI card completeness · = 1.00 · 1.00
6. False positive rate (compliance) · ≤ 0.08 · 0.06
7. Confidence calibration · ≥ 0.80 · 0.84
8. Latency (p95) · ≤ 20s · 14.2s
ALL GATES PASSED — Prompt eligible for merge
Diagram 20 Evaluation harness — 50 fixture set, 8-dimension scoring, pass/fail gate. All dimensions must pass for a prompt change to be merged.
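The pass/fail gate in the eval report is mechanical: each dimension carries a threshold and a comparison direction, and one miss blocks the merge. A hedged TypeScript sketch, with dimension names and thresholds taken from the report above (the `Gate` shape and identifiers are illustrative; binary gates are written as strict equality):

```typescript
type Op = ">=" | "<=" | "==";

interface Gate {
  name: string;
  op: Op;
  threshold: number;
}

// The 8 dimensions with thresholds from the eval report.
const GATES: Gate[] = [
  { name: "schema_conformance", op: "==", threshold: 1.0 },
  { name: "compliance_detection", op: ">=", threshold: 0.9 },
  { name: "ambiguity_detection", op: ">=", threshold: 0.88 },
  { name: "placeholder_insertion", op: "==", threshold: 1.0 },
  { name: "xai_completeness", op: "==", threshold: 1.0 },
  { name: "false_positive_rate", op: "<=", threshold: 0.08 },
  { name: "confidence_calibration", op: ">=", threshold: 0.8 },
  { name: "latency_p95_s", op: "<=", threshold: 20 },
];

function passes(score: number, gate: Gate): boolean {
  switch (gate.op) {
    case ">=": return score >= gate.threshold;
    case "<=": return score <= gate.threshold;
    case "==": return score === gate.threshold;
  }
}

// Names of failed dimensions; empty array = prompt eligible for merge.
// A missing score compares false against every op, so it also fails the gate.
function failedGates(scores: Record<string, number>): string[] {
  return GATES.filter((g) => !passes(scores[g.name], g)).map((g) => g.name);
}
```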
Dimension 01 · Schema Conformance

Output contract never breaks

Every agent output is validated against its TypeScript-style output schema. Any response that fails to produce a valid typed object fails this dimension with a score of 0.0. Threshold: 1.00 — this is a binary gate, not a percentage. One schema violation on any fixture = blocked.

THRESHOLD = 1.00 · Binary gate
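Because the gate is binary, scoring is a strict structural check rather than a similarity measure. A minimal sketch, assuming a hypothetical compliance-style output shape (the real agents each validate against their own documented schemas):

```typescript
// Hypothetical minimal agent output shape, for illustration only.
interface AgentOutput {
  doc_type: string;
  confidence: number;
  violations: { rule_id: string; severity: string }[];
}

// Strict type guard: any missing or mistyped field fails conformance.
function conformsToSchema(value: unknown): value is AgentOutput {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.doc_type === "string" &&
    typeof v.confidence === "number" &&
    Array.isArray(v.violations) &&
    v.violations.every(
      (x) =>
        typeof x === "object" && x !== null &&
        typeof (x as any).rule_id === "string" &&
        typeof (x as any).severity === "string"
    )
  );
}

// Dimension 01 over a fixture run: 1.0 only if every output conforms.
function schemaConformanceScore(outputs: unknown[]): number {
  return outputs.every(conformsToSchema) ? 1.0 : 0.0;
}
```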
Dimension 02 · Compliance Detection Rate

Known violations caught

20 fixture documents are seeded with known violations across all 10 rule categories. Scored as: violations detected / violations present. The threshold of 0.90 allows for up to 2 misses on a 20-violation document — chosen to accommodate edge cases in semantic rules without demanding perfection on hard cases.

THRESHOLD ≥ 0.90 · Recall measure
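The score is plain recall over the seeded violation set. A minimal sketch (identifiers are illustrative):

```typescript
// Dimension 02: recall of seeded violations.
// detected: rule ids the Compliance Agent flagged on the fixture;
// seeded: rule ids deliberately planted in the fixture document.
function complianceDetectionRate(
  detected: Set<string>,
  seeded: Set<string>
): number {
  let hits = 0;
  for (const id of seeded) if (detected.has(id)) hits++;
  return seeded.size === 0 ? 1 : hits / seeded.size;
}
```

Catching 18 of 20 seeded violations yields 0.90, exactly at the threshold.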
Dimension 03 · Ambiguity Detection (P2)

Vague terms found before review

25 P2 fixtures contain seeded vague quantifiers, undefined terms, and missing error states at known locations. Scored as: flags raised at correct locations / total seeded flags. Threshold 0.88 acknowledges that implicit assumption detection is harder than vague quantifier detection — a small miss rate is acceptable.

THRESHOLD ≥ 0.88 · P2 only
Dimension 04 · Placeholder Insertion (P2)

Missing context never inferred

Every P2 fixture with a documented context gap must produce exactly one [REQUIRES INPUT:] placeholder at that gap location. Threshold 1.00 — this is P-10 enforced as a metric. A single instance of the Draft Agent inferring content to fill a known gap is a blocking failure.

THRESHOLD = 1.00 · P-10 enforcement
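The check is exact-count matching against documented gaps, not mere presence of a placeholder. A sketch, assuming placeholders follow the `[REQUIRES INPUT: ...]` form used by the Draft Agent:

```typescript
// Count [REQUIRES INPUT: ...] placeholders in a draft.
function countPlaceholders(draft: string): number {
  return (draft.match(/\[REQUIRES INPUT:[^\]]*\]/g) ?? []).length;
}

// Dimension 04: every documented context gap gets exactly one placeholder.
// Too few means the agent inferred content (a P-10 breach); too many means
// it hallucinated gaps. Either way the fixture fails the binary gate.
function placeholderScore(draft: string, documentedGaps: number): number {
  return countPlaceholders(draft) === documentedGaps ? 1.0 : 0.0;
}
```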
Dimension 05 · XAI Card Completeness

Every agent surfaces reasoning

Every fixture run must produce a valid XAI card for every agent — all four fields populated (understood, decided, why, uncertainties) with non-empty values. Any agent that omits its XAI card or produces an empty field fails this dimension. Threshold 1.00 — AR-01 is binary.

THRESHOLD = 1.00 · AR-01 enforcement
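The AR-01 check reduces to a presence-and-non-emptiness test across all four fields for every agent. A sketch, modelling the uncertainty list as a plain string for simplicity (the real card shape may differ):

```typescript
interface XaiCard {
  understood: string;
  decided: string;
  why: string;
  uncertainties: string;
}

// Dimension 05: every agent produced a card, and all four fields are
// non-empty. A missing card or blank field fails the binary gate.
function xaiComplete(cards: (XaiCard | undefined)[]): boolean {
  return cards.every(
    (c) =>
      c !== undefined &&
      [c.understood, c.decided, c.why, c.uncertainties].every(
        (f) => typeof f === "string" && f.trim().length > 0
      )
  );
}
```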
Dimension 06 · False Positive Rate

Noise in compliance output controlled

Clean fixtures (documents with no violations) are scored for false positive compliance flags. Threshold ≤ 0.08 — a false positive rate of 8% works out to an average of 1.6 false flags on a 20-rule check, roughly one to two spurious flags per clean document. Above this threshold, the compliance output becomes noisy enough to erode Maya's trust in the tool.

THRESHOLD ≤ 0.08 · Trust metric
Dimension 07 · Confidence Calibration

Confidence scores predict review need

For fixtures where human review found issues the agent didn't flag, the agent's confidence score should have been below 0.80. This measures whether confidence scores are predictive of actual quality — not just reported high. Computed as correlation between low confidence and missed flags.

THRESHOLD ≥ 0.80 · Predictive validity
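One way to operationalise this (the page specifies the intent, not the exact estimator, so the formula below is an assumption): among fixtures where human review found issues the agent missed, measure the fraction where the agent's confidence was already below the 0.80 review threshold. That fraction is the calibration score.

```typescript
interface FixtureResult {
  confidence: number;    // agent's reported confidence for the fixture
  missedIssues: boolean; // human review found issues the agent didn't flag
}

// Dimension 07 (assumed estimator): on fixtures with missed issues, how
// often did low confidence correctly predict the need for human review?
function confidenceCalibration(results: FixtureResult[]): number {
  const missed = results.filter((r) => r.missedIssues);
  if (missed.length === 0) return 1.0; // nothing was missed: vacuously calibrated
  const warned = missed.filter((r) => r.confidence < 0.8).length;
  return warned / missed.length;
}
```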
Dimension 08 · Pipeline Latency (p95)

Performance within target

P95 latency for a full P2 pipeline run (the slower path) must stay under 20 seconds on Groq free tier. This is not a quality dimension per se — it is a user experience gate. A pipeline that takes 45 seconds to run will be abandoned. Target: 20s p95, with headroom below the 20-minute total session target.

THRESHOLD ≤ 20s P95 · UX gate
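The p95 figure is a percentile over recorded per-run latencies. A sketch using the nearest-rank method (the page does not specify the estimator, so that choice is an assumption):

```typescript
// Nearest-rank p95 over recorded pipeline latencies (seconds).
function p95(latencies: number[]): number {
  const sorted = [...latencies].sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length); // nearest-rank method
  return sorted[Math.max(0, rank - 1)];
}
```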
Drift Detection

Output quality monitoring —
detecting degradation without retraining.

In a traditionally trained model, drift is detected by monitoring the distance between production input distributions and training data distributions. In an LLM-native pipeline, the model doesn't change — but the upstream LLM can change (Groq model updates), the input distribution can change (Maya uses the tool for new doc types), and prompt version changes can introduce regressions. Drift manifests as output quality degradation, not distribution shift.

DRIFT SIGNAL TAXONOMY — THREE SOURCES · DETECTED · RESPONSE MAPPED
Source 1: LLM Model Update
Signal: Groq announces a new model version or changes the default endpoint model.
Detection: Model version in the API response header is compared to the expected version in pipeline config on every session start.
Response: Run the full eval harness against the new model before accepting. If gates pass → update config. If gates fail → hold the old model, surface an alert to the maintainer.
Source 2: Prompt Regression
Signal: Compliance violation rate rising across recent sessions (tracked in IndexedDB).
Detection: Session log aggregation — if average violations per draft in the last 10 sessions exceeds baseline + 30%, raise a drift alert in the UI.
Response: Surface the alert to writer and maintainer. Check the recent prompt changelog for correlated changes. Re-run the eval harness. Roll back the prompt version if a regression is found.
Source 3: Input Distribution Shift
Signal: Maya starts using the tool for a new doc type (e.g. CLI reference, runbook) not in the eval fixture set; confidence scores drop.
Detection: The Intake Agent produces a doc_type value not in the known enum, flagged automatically in the XAI card uncertainty list.
Response: Add the new doc_type to the intake prompt enum, create a fixture for the new type, and run the eval. This is the prompt version bump trigger for capability expansion.
Diagram 21 Drift signal taxonomy — three sources, detection mechanism, and response protocol for each. All tracked via IndexedDB session log.
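Source 2's detection rule is a rolling-average comparison against baseline. A minimal sketch with the window size and 30% factor taken from the taxonomy above (function and parameter names are illustrative):

```typescript
// Drift alert for Source 2: average violations per draft over the last
// `window` sessions exceeds the baseline by more than `factor - 1` (30%).
function violationDriftAlert(
  violationsPerSession: number[], // chronological session log from IndexedDB
  baseline: number,               // baseline average violations per draft
  window = 10,
  factor = 1.3
): boolean {
  const recent = violationsPerSession.slice(-window);
  if (recent.length === 0) return false; // no sessions, no signal
  const avg = recent.reduce((a, b) => a + b, 0) / recent.length;
  return avg > baseline * factor;
}
```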
rules.json Lifecycle

The compliance rule set —
authored, versioned, tested, deployed.

rules.json is not a configuration file — it is the data artefact that makes the Compliance Agent deterministic. It is the closest analogue to a training dataset in this pipeline. Accordingly it has its own lifecycle: authoring, validation, testing, versioning, and deployment. A change to rules.json that introduces false positives is a data quality regression, not a configuration error.

01
Authoring
Rules authored as JSON objects with id, category, rule_text, check_type (PATTERN/STRUCTURAL/SEMANTIC), regex pattern (if PATTERN), positive_example, and negative_example. Style guide section reference required for every rule.
02
Schema Validation
GitHub Actions validates rules.json against JSON Schema on every PR. Required fields present, check_type is valid enum, PATTERN rules have valid regex, examples non-empty. Schema validation failure blocks merge.
03
Fixture Testing
Each rule must have at least one positive fixture (doc that violates it — should fire) and one negative fixture (doc that doesn't — should not fire). New rules without fixtures are rejected by the GitHub Actions gate.
04
FP Rate Check
The full 50-fixture eval set is run with the new rule. If the new rule increases the false positive rate above 0.08 threshold (Dimension 06), it is rejected. Rule authoring continues until FP rate is controlled.
05
Version Bump
Adding rules = minor bump. Removing or changing existing rules = major bump (breaking change to ComplianceReport structure). Version logged in every ComplianceReport so session history shows which rule version produced each report.
06
Deploy + Verify
Merged to main → GitHub Actions deploys to Pages. Compliance Agent fetches rules.json on pipeline init, validates version matches pipeline config, logs version in session record. SHA verified at load time (same as prompts).
rules.json — Sample Rule Structure
{
  "version": "1.5.3",
  "generated": "2026-02-14",
  "rule_count": 80,
  "rules": [
    {
      "id": "R-047",
      "category": "Terminology",
      "rule": "Avoid Latin abbreviations: e.g., i.e., etc., vs., via.",
      "check_type": "PATTERN",
      "pattern": "\\b(e\\.g\\.|i\\.e\\.|etc\\.|vs\\.|\\bvia\\b)",
      "severity": "medium",
      "fix_template": "Replace '{match}' with: e.g. → 'for example', i.e. → 'that is', etc. → (list items explicitly), vs. → 'versus'",
      "style_guide_ref": "Google Developer Style Guide § Abbreviations",
      "positive_fixture": "The API supports multiple auth methods, e.g., OAuth and API keys.",
      "negative_fixture": "The API supports multiple authentication methods, including OAuth and API keys."
    },
    {
      "id": "R-074",
      "category": "Tone",
      "rule": "Avoid 'simple', 'easy', 'just', 'straightforward', 'obviously', 'trivially'.",
      "check_type": "PATTERN",
      "pattern": "\\b(simple|simply|easy|easily|just|straightforward|obviously|trivially)\\b",
      "severity": "low",
      "fix_template": "Remove '{match}' — describe the steps directly without characterising their difficulty.",
      "style_guide_ref": "Google Developer Style Guide § Accessible language"
    }
  ]
}
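Step 03's fixture gate for PATTERN rules reduces to running each rule's regex against its own declared examples: the positive fixture must fire, the negative must not. A sketch of the core check (the real gate runs in GitHub Actions; the rule shape mirrors the sample above):

```typescript
interface PatternRule {
  id: string;
  pattern: string;          // JS-compatible regex source, as in rules.json
  positive_fixture: string; // must trigger the rule
  negative_fixture: string; // must not trigger the rule
}

// Return the ids of rules whose fixtures do not behave as declared.
// A non-empty result blocks the PR.
function fixtureFailures(rules: PatternRule[]): string[] {
  return rules
    .filter((r) => {
      const re = new RegExp(r.pattern, "i");
      return !re.test(r.positive_fixture) || re.test(r.negative_fixture);
    })
    .map((r) => r.id);
}
```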
Model Upgrade Protocol

When Groq releases a new model —
the decision process.

Model upgrades are the highest-risk operational event in an LLM-native pipeline. A new model can improve performance on some dimensions while silently regressing on others. The upgrade protocol ensures that a model swap only happens after all 8 eval dimensions are verified against the new model — not on the basis of benchmark claims from the model provider.

New Groq model version announced
→ Run full eval harness on the new model: all 50 fixtures, all 8 dimensions, in parallel with the current model
→ Compare all 8 dimensions to baseline: the new model must match or exceed on ALL dimensions
→ All 8 gates pass?
NO → HOLD: keep the current model, document the failing dimensions, monitor the next release
PARTIAL → PROMPT TUNING: adjust the affected prompts, re-run the eval, iterate
YES → Update pipeline-config.json (model_id, expected_version, new baselines recorded) → DEPLOYED · CHANGELOG UPDATED
Diagram 22 Model upgrade decision tree — parallel eval, 8-dimension comparison, three outcomes: HOLD, PROMPT TUNING, DEPLOY.
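The three outcomes can be sketched as a comparison over per-dimension results. The diagram does not say where the line between HOLD and PROMPT TUNING sits, so the rule below — tune prompts when at most two dimensions fail, hold otherwise — is an assumption for illustration:

```typescript
type Outcome = "DEPLOY" | "PROMPT_TUNING" | "HOLD";

interface DimResult {
  name: string;
  passed: boolean; // candidate matched or exceeded baseline on this dimension
}

// Assumption: a small number of failing dimensions is worth a
// prompt-tuning iteration; broader failure means hold the current model.
function upgradeDecision(results: DimResult[], tuningLimit = 2): Outcome {
  const failed = results.filter((r) => !r.passed).length;
  if (failed === 0) return "DEPLOY";
  if (failed <= tuningLimit) return "PROMPT_TUNING";
  return "HOLD";
}
```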
Model · D1 Schema · D2 Compliance · D3 Ambiguity · D4 Placeholder · D5 XAI · D6 FP Rate · D7 Conf · D8 Latency · Decision
Llama 3.1 70B (baseline) · 1.00 · 0.94 · 0.91 · 1.00 · 1.00 · 0.06 · 0.84 · 14.2s · CURRENT
Llama 3.2 70B (hypothetical) · 1.00 · 0.96 · 0.93 · 1.00 · 1.00 · 0.05 · 0.87 · 12.8s · PROMOTE
Mixtral 8x22B (hypothetical) · 1.00 · 0.91 · 0.82 ✗ · 0.94 ✗ · 1.00 · 0.07 · 0.81 · 16.1s · HOLD

A model is held until all 8 dimensions pass. Mixtral fails on D3 (ambiguity detection) and D4 (placeholder insertion) — both P2 quality gates, and not acceptable for a pipeline where DDD spec quality is a first-class requirement.

Rebuttals & Pushbacks

Three MLOps challenges.
Every objection answered.

MLOps — Pushback 01
Calling this "MLOps" when there's no model training is misleading. It's just DevOps with LLM API calls.
The Challenge

"You're not doing MLOps. You're doing software engineering with some extra JSON files. Renaming it MLOps is resume padding."

The Temptation

Call it "Prompt Engineering" and skip the MLOps framing entirely. Simpler, less contestable.

Why We Rejected It

MLOps is fundamentally the discipline of keeping ML-powered systems reliable in production. The specific implementation — whether weights or prompts encode the intelligence — is secondary. This pipeline has the same operational risks as a traditionally trained system: output can degrade, the "model" (prompt) can be changed with unintended consequences, the inference backend can change, and there is no natural audit trail without deliberate instrumentation. The MLOps tooling — prompt registry, eval harness, drift detection, upgrade protocol — addresses exactly these risks. The framing is accurate. The artefacts are different; the concerns are identical.

Trade-off Accepted

The analogy between prompts and model weights is imperfect — prompts are interpretable and editable in a way weights are not. The governance process is less formal than full MLOps. These limitations are documented in the Glossary under "Prompt Governance."

MLOps — Pushback 02
50 fixtures is too small an eval set. Any serious ML evaluation uses thousands of examples.
The Challenge

"A 50-fixture eval set has no statistical power. You cannot draw meaningful conclusions from it. A model that passes 50 fixtures could still fail catastrophically on production inputs."

The Temptation

Expand to 500+ fixtures. More data = more confidence. Harder to argue with.

Why We Rejected It

The eval harness serves a different purpose than a held-out test set in a supervised learning context. It is not measuring generalisation from a large sample — it is verifying that specific, documented behaviours are preserved across prompt changes. The 50 fixtures are carefully constructed to cover all 10 rule categories, all failure modes documented in known_failure_modes in each prompt file, and edge cases identified in production sessions. This is a specification-driven test suite, not a statistical sample. Expanding to 500 fixtures would mostly add redundant coverage of already-tested cases. The investment is in fixture quality and coverage of known failure modes, not fixture quantity.

Trade-off Accepted

The eval set will not catch novel failure modes on production inputs that don't resemble any fixture. This is mitigated by the session log drift monitoring — production failures surface as quality degradation signals even if the eval set didn't predict them. The eval set prevents regressions on known behaviours; the drift monitor catches new failures in production.

MLOps — Pushback 03
Tracking drift via IndexedDB session logs is unreliable — Maya can clear her browser data at any time.
The Challenge

"Your drift detection relies on session logs stored in the writer's browser. If she clears her cache, you lose your entire drift signal. You have no production telemetry."

The Temptation

Send anonymised session metrics to a free-tier telemetry service (PostHog, Plausible). Reliable, persistent drift signal without depending on browser storage.

Why We Rejected It

AR-03 (client-side only — no backend) and C-03 (no workflow disruption) are binding constraints. Sending session metrics to a telemetry service — even anonymised — crosses the "no data leaves the browser" guarantee. Enterprise writers operate in environments with data governance policies. A tool that sends telemetry, even without document content, could be blocked by IT policy and undermine the enterprise adoption argument. The architectural integrity of the zero-server guarantee is worth more than the operational convenience of external drift monitoring. The drift detection is therefore self-hosted: visible to Maya, voluntary, and entirely within her browser. It is imperfect telemetry by design, and that is documented.

Trade-off Accepted

Drift detection depends on Maya's session history being intact in IndexedDB. If she clears browser data, the drift baseline resets. The eval harness (run on prompt changes) provides the primary quality gate — session log drift monitoring is a secondary signal, not the primary defence. The limitation is disclosed.