Notes

December 10, 2025 · AI · Governance · 8 min

Prompt governance in regulated AI environments

How to version, audit, and control prompts in financial and legal systems where every model interaction is a liability. Covers prompt registries, change gates, and evaluation pipelines.

Prompt governance is the discipline of treating prompts as production artifacts — versioned, tested, audited, and attributable. In regulated environments this is not optional.

Why prompts are liabilities

Every prompt that reaches a model in a financial or legal system is a decision-making input. Regulators increasingly require that AI-assisted decisions be explainable and reproducible. A prompt that changed last Tuesday and produces a different answer today, with no record of the change, is an audit failure.

The standard response — "we'll add logging" — is insufficient. Logging records what happened. Governance controls what can happen.

The four layers of prompt governance


1. Registry

Every prompt in production lives in a versioned registry. Schema:

  • id: canonical identifier (snake_case, stable across versions)
  • version: semver
  • status: draft | review | approved | deprecated
  • owner: team or individual
  • hash: content hash (detects silent mutations)
  • created_at / approved_at / deprecated_at

No prompt executes in production without a registry entry in approved status.
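As a rough illustration of the schema above, a registry entry could be modeled like this. It is a sketch, not a reference implementation: the `text` field, the helper names (`content_hash`, `new_draft`, `approve`), the choice of SHA-256, and the idea of re-checking the hash before execution are all assumptions added for the example.

```python
import hashlib
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Status(str, Enum):
    DRAFT = "draft"
    REVIEW = "review"
    APPROVED = "approved"
    DEPRECATED = "deprecated"


def content_hash(text: str) -> str:
    """SHA-256 over the UTF-8 bytes of the prompt text (hash choice is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class PromptEntry:
    id: str                      # canonical snake_case identifier, stable across versions
    version: str                 # semver string, e.g. "1.2.0"
    status: Status
    owner: str
    text: str                    # prompt body; hypothetical field, not in the schema list
    hash: str                    # content hash recorded when the entry is written
    created_at: datetime
    approved_at: Optional[datetime] = None
    deprecated_at: Optional[datetime] = None

    def is_executable(self) -> bool:
        """True only for approved entries whose text still matches the stored hash."""
        return self.status is Status.APPROVED and content_hash(self.text) == self.hash


def new_draft(prompt_id: str, version: str, owner: str, text: str) -> PromptEntry:
    """Create a draft entry and record its content hash."""
    return PromptEntry(
        id=prompt_id,
        version=version,
        status=Status.DRAFT,
        owner=owner,
        text=text,
        hash=content_hash(text),
        created_at=datetime.now(timezone.utc),
    )


def approve(entry: PromptEntry) -> PromptEntry:
    """Return a copy of the entry in approved status (gate checks omitted here)."""
    return replace(entry, status=Status.APPROVED, approved_at=datetime.now(timezone.utc))
```

A draft created with `new_draft(...)` reports `is_executable()` as false until it is approved. An approved entry whose text was altered without updating the hash also reports false.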

2. Change gates

Promotion from draft to approved requires:

  • Diff review by a designated reviewer (not the author)
  • Passing the evaluation suite with defined acceptance thresholds
  • Sign-off captured in the registry (who, when, which evaluation run)

Emergency overrides exist but create an incident record.
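The gate logic can be sketched in a few lines. Everything named here (`GateError`, `SignOff`, `promote`, `emergency_override`, the shape of the incident record) is hypothetical; the point is only that each requirement becomes an explicit check and the sign-off is a stored record rather than a convention.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


class GateError(Exception):
    """Raised when a promotion attempt fails a change gate."""


@dataclass(frozen=True)
class SignOff:
    reviewer: str            # who
    signed_at: datetime      # when
    evaluation_run_id: str   # which evaluation run


def promote(author: str, reviewer: str, evaluation_run_id: str,
            evaluation_passed: bool) -> SignOff:
    """Return a sign-off record if every gate passes, otherwise raise GateError."""
    if reviewer == author:
        raise GateError("reviewer must not be the author")
    if not evaluation_passed:
        raise GateError(f"evaluation run {evaluation_run_id} did not meet its thresholds")
    return SignOff(
        reviewer=reviewer,
        signed_at=datetime.now(timezone.utc),
        evaluation_run_id=evaluation_run_id,
    )


def emergency_override(approver: str, reason: str, incident_log: list) -> dict:
    """Bypass the gates, but always leave an incident record behind."""
    incident = {
        "type": "emergency_override",
        "approver": approver,
        "reason": reason,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    incident_log.append(incident)
    return incident
```

In a real system the returned `SignOff` would be written to the registry alongside the prompt version, and the incident log would be a durable store rather than a list.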

3. Evaluation pipelines

Each prompt version has an associated evaluation set: input-output pairs that define expected behavior. The pipeline runs on every promotion attempt and measures:

  • Correctness (task-specific metrics)
  • Refusal rate (for safety-relevant prompts)
  • Consistency (same input → same output class across N runs)
  • Latency (p50, p95, p99)

Thresholds are set at the prompt level, not globally — a legal extraction prompt and a summary prompt have different acceptance criteria.
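A minimal sketch of per-prompt thresholds follows. The prompt IDs, the threshold values, and the metric key names are illustrative assumptions, and consistency is computed here as the share of runs that land in the most common output class, which is one possible reading of "same output class across N runs".

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Thresholds:
    min_correctness: float
    max_refusal_rate: float
    min_consistency: float
    max_p95_latency_ms: float


# Hypothetical per-prompt acceptance criteria; the numbers are placeholders.
THRESHOLDS: Dict[str, Thresholds] = {
    "contract_clause_extraction": Thresholds(0.98, 0.01, 0.95, 4000.0),
    "meeting_summary": Thresholds(0.85, 0.05, 0.80, 2500.0),
}


def consistency(output_classes: Sequence[str]) -> float:
    """Fraction of runs whose output class equals the most common class."""
    if not output_classes:
        raise ValueError("need at least one run")
    most_common_count = Counter(output_classes).most_common(1)[0][1]
    return most_common_count / len(output_classes)


def passes(prompt_id: str, metrics: Dict[str, float]) -> bool:
    """Compare measured metrics against the thresholds registered for this prompt."""
    t = THRESHOLDS[prompt_id]
    return (
        metrics["correctness"] >= t.min_correctness
        and metrics["refusal_rate"] <= t.max_refusal_rate
        and metrics["consistency"] >= t.min_consistency
        and metrics["p95_latency_ms"] <= t.max_p95_latency_ms
    )
```

The same measured metrics can pass for one prompt and fail for another, which is the reason for keeping thresholds next to the prompt rather than in a global config.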

4. Runtime attribution

Every model call in production carries:

  • prompt_id + prompt_version in the request context
  • outcome_hash in the response log
  • caller_id (which system or user triggered the call)

This creates a complete causal chain from decision to prompt version to evaluation state at the time of approval.
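One way to picture the attribution record is as a small structured log entry. This is a sketch under assumptions: the function names, the `logged_at` field, and the use of SHA-256 over the raw output text for `outcome_hash` are not specified in the text above.

```python
import hashlib
import json
from datetime import datetime, timezone


def attribution_record(prompt_id: str, prompt_version: str,
                       caller_id: str, output: str) -> dict:
    """Build the log entry that ties one model response to a prompt version."""
    return {
        "prompt_id": prompt_id,
        "prompt_version": prompt_version,
        "caller_id": caller_id,
        # Hashing the output lets the log prove what was returned
        # without storing the (possibly sensitive) text itself.
        "outcome_hash": hashlib.sha256(output.encode("utf-8")).hexdigest(),
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


def to_log_line(record: dict) -> str:
    """Serialize with sorted keys so identical records produce identical lines."""
    return json.dumps(record, sort_keys=True)
```

Joining such a record against the registry on `prompt_id` and `prompt_version` yields the sign-off and evaluation run that were in force when the prompt was approved.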

What this prevents

  • Silent prompt drift (a write whose content hash differs from the stored hash for the same id and version is caught at registry write time)
  • Unapproved changes reaching production (gating)
  • "We don't know which version produced this" (attribution)
  • Inability to reproduce a past decision (version + evaluation history preserved)

Implementation starting point

The minimal viable governance stack: a Postgres table as the registry, a CI/CD step that runs the evaluation pipeline on every pull request, and a middleware layer that rejects calls to non-approved prompt IDs.
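The registry table and the rejecting middleware might look roughly like the sketch below. To keep it runnable without a database server, it uses Python's built-in `sqlite3` as a stand-in for Postgres; with a Postgres driver the placeholder syntax and column types would differ. The table name, column set, sample rows, and the `require_approved` function are assumptions made for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE prompt_registry (
    id      TEXT NOT NULL,
    version TEXT NOT NULL,
    status  TEXT NOT NULL
            CHECK (status IN ('draft', 'review', 'approved', 'deprecated')),
    owner   TEXT NOT NULL,
    hash    TEXT NOT NULL,
    PRIMARY KEY (id, version)
)
"""


class PromptNotApproved(Exception):
    """Raised when a call references a prompt that may not run in production."""


def require_approved(conn: sqlite3.Connection, prompt_id: str, version: str) -> str:
    """Return the registered hash if the prompt version is approved, else raise."""
    row = conn.execute(
        "SELECT status, hash FROM prompt_registry WHERE id = ? AND version = ?",
        (prompt_id, version),
    ).fetchone()
    if row is None:
        raise PromptNotApproved(f"{prompt_id}@{version} has no registry entry")
    status, registered_hash = row
    if status != "approved":
        raise PromptNotApproved(f"{prompt_id}@{version} is in status '{status}'")
    return registered_hash


# In-memory database with two illustrative rows.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO prompt_registry VALUES (?, ?, ?, ?, ?)",
    [
        ("contract_extract", "1.0.0", "approved", "legal-eng", "abc123"),
        ("contract_extract", "1.1.0", "draft", "legal-eng", "def456"),
    ],
)
```

A middleware layer would call `require_approved` before forwarding a request to the model and let the exception turn into a rejected call.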

More sophisticated implementations add a dedicated governance API (Prompt-Maker pattern), evaluation dashboards, and automated deprecation when successor versions are approved.

The infrastructure cost is low. The cost of not having it is measured in audit findings.