Prompt governance in regulated AI environments

How to version, audit, and control prompts in financial and legal systems where every model interaction is a liability. Covers prompt registries, change gates, evaluation pipelines, and runtime attribution.

Prompt governance is the discipline of treating prompts as production artifacts — versioned, tested, audited, and attributable. In regulated environments this is not optional.

Why prompts are liabilities

Every prompt that reaches a model in a financial or legal system is a decision-making input. Regulators increasingly require that AI-assisted decisions be explainable and reproducible. If a prompt changed last Tuesday and the same input produces a different answer today, with no record of which version was live, that is an audit failure.

The standard response — "we'll add logging" — is insufficient. Logging records what happened. Governance controls what can happen.

The four layers of prompt governance

1. Registry

Every prompt in production lives in a versioned registry. Schema:

id: canonical identifier (snake_case, stable across versions)
version: semver
status: draft | review | approved | deprecated
owner: team or individual
hash: content hash (detects silent mutations)
created_at / approved_at / deprecated_at: lifecycle timestamps

No prompt executes in production without a registry entry in approved status.
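
A minimal sketch of a registry entry with the fields above, written in Python. The SHA-256 hash, the stored content field, and the helper names new_entry and is_executable are assumptions made for illustration, not part of any particular registry product.

```python
import hashlib
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Status(str, Enum):
    DRAFT = "draft"
    REVIEW = "review"
    APPROVED = "approved"
    DEPRECATED = "deprecated"


def content_hash(text: str) -> str:
    """Hash of the prompt text. SHA-256 is an assumed choice."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class PromptEntry:
    id: str                 # stable across versions, snake_case
    version: str            # semver string
    status: Status
    owner: str
    content: str            # assumed: the prompt text is stored with the entry
    hash: str
    created_at: datetime
    approved_at: Optional[datetime] = None
    deprecated_at: Optional[datetime] = None


def new_entry(id: str, version: str, owner: str, content: str) -> PromptEntry:
    """Create a draft entry and record the hash of its content."""
    return PromptEntry(
        id=id,
        version=version,
        status=Status.DRAFT,
        owner=owner,
        content=content,
        hash=content_hash(content),
        created_at=datetime.now(timezone.utc),
    )


def is_executable(entry: PromptEntry) -> bool:
    """Approved, and the content still matches the recorded hash."""
    return entry.status is Status.APPROVED and content_hash(entry.content) == entry.hash


# Usage: a draft is rejected, an approved entry passes, a mutated one fails.
draft = new_entry("kyc_summary", "1.0.0", "risk-team", "Summarize the client file.")
approved = replace(draft, status=Status.APPROVED, approved_at=datetime.now(timezone.utc))
tampered = replace(approved, content="Summarize the client file briefly.")
```

Checking the hash as well as the status means content edited without a new version is refused even though its status still reads approved.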

2. Change gates

Promotion from draft to approved requires:

Diff review by a designated reviewer (not the author)
Passing the evaluation suite with defined acceptance thresholds
Sign-off captured in the registry (who, when, which evaluation run)

Emergency overrides exist, but each one creates an incident record.
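
The gate can be sketched as a single promotion function. This is an illustration under stated assumptions: the names promote, SignOff, and GateError are hypothetical, and the choice that an emergency override skips the review and evaluation checks (while still demanding an incident ID) is one possible reading of the rule above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


class GateError(Exception):
    """Raised when a promotion attempt fails a gate."""


@dataclass(frozen=True)
class SignOff:
    prompt_id: str
    version: str
    reviewer: str
    evaluation_run_id: str
    signed_at: datetime
    emergency_override: bool = False
    incident_id: Optional[str] = None


def promote(
    prompt_id: str,
    version: str,
    author: str,
    reviewer: str,
    evaluation_run_id: str,
    evaluation_passed: bool,
    emergency_override: bool = False,
    incident_id: Optional[str] = None,
) -> SignOff:
    """Return the sign-off to store in the registry, or raise GateError."""
    if emergency_override:
        # Assumption: an override bypasses the normal checks but must be
        # tied to an incident record.
        if not incident_id:
            raise GateError("emergency override requires an incident record")
    else:
        if reviewer == author:
            raise GateError("reviewer must not be the author")
        if not evaluation_passed:
            raise GateError("evaluation suite did not meet its thresholds")
    return SignOff(
        prompt_id=prompt_id,
        version=version,
        reviewer=reviewer,
        evaluation_run_id=evaluation_run_id,
        signed_at=datetime.now(timezone.utc),
        emergency_override=emergency_override,
        incident_id=incident_id,
    )


# Usage: a normal promotion reviewed by someone other than the author.
signoff = promote(
    "kyc_summary", "1.1.0",
    author="alice", reviewer="bob",
    evaluation_run_id="run-42", evaluation_passed=True,
)
```

Returning a sign-off object instead of a boolean keeps the who, when, and which evaluation run together in one record that can be written to the registry.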

3. Evaluation pipelines

Each prompt version has an associated evaluation set: input-output pairs that define expected behavior. The pipeline runs on every promotion attempt and measures:

Correctness (task-specific metrics)
Refusal rate (for safety-relevant prompts)
Consistency (same input → same output class across N runs)
Latency (p50, p95, p99)

Thresholds are set at the prompt level, not globally — a legal extraction prompt and a summary prompt have different acceptance criteria.
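
As a rough sketch, per-prompt thresholds and the consistency and latency metrics might look like the following. The prompt IDs, the numeric thresholds, and the function names are all illustrative assumptions. Consistency is computed here as the share of N runs that fall in the most common output class, and percentiles use the nearest-rank method. Both are just one reasonable choice.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Thresholds:
    min_correctness: float
    max_refusal_rate: float
    min_consistency: float
    max_p95_latency_ms: float


# Hypothetical per-prompt thresholds; IDs and numbers are illustrative only.
THRESHOLDS: Dict[str, Thresholds] = {
    "legal_clause_extraction": Thresholds(0.98, 0.01, 0.99, 4000.0),
    "case_summary": Thresholds(0.90, 0.05, 0.90, 8000.0),
}


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def consistency(run_once: Callable[[str], str], text: str, n_runs: int) -> float:
    """Share of n_runs whose output class equals the most common class."""
    classes = Counter(run_once(text) for _ in range(n_runs))
    return classes.most_common(1)[0][1] / n_runs


def gate(prompt_id: str, metrics: Dict[str, float]) -> bool:
    """Compare measured metrics against this prompt's own thresholds."""
    t = THRESHOLDS[prompt_id]
    return (
        metrics["correctness"] >= t.min_correctness
        and metrics["refusal_rate"] <= t.max_refusal_rate
        and metrics["consistency"] >= t.min_consistency
        and metrics["p95_latency_ms"] <= t.max_p95_latency_ms
    )
```

With these example numbers, the same measured metrics can pass the gate for the summary prompt and fail it for the extraction prompt, which is the point of setting thresholds per prompt.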

4. Runtime attribution

Every model call in production carries:

prompt_id + prompt_version in the request context
outcome_hash in the response log
caller_id (which system or user triggered the call)

This creates a complete causal chain from decision to prompt version to evaluation state at the time of approval.
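
A minimal sketch of the attribution record written for each call. The function names, the SHA-256 choice for outcome_hash, and the extra logged_at timestamp are assumptions. Only prompt_id, prompt_version, outcome_hash, and caller_id come from the list above.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict


def outcome_hash(output: str) -> str:
    """Hash of the model output, so the log can prove what was returned
    without storing the full text. SHA-256 is an assumed choice."""
    return hashlib.sha256(output.encode("utf-8")).hexdigest()


def attribution_record(
    prompt_id: str, prompt_version: str, caller_id: str, output: str
) -> Dict[str, str]:
    """Build the log entry that ties one response to one prompt version."""
    return {
        "prompt_id": prompt_id,
        "prompt_version": prompt_version,
        "caller_id": caller_id,
        "outcome_hash": outcome_hash(output),
        # Assumed extra field: when the record was written (UTC, ISO 8601).
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


# Usage: emit one JSON line per model call.
record = attribution_record(
    "kyc_summary", "1.1.0", "onboarding-service", "Client risk: low."
)
log_line = json.dumps(record, sort_keys=True)
```

Joining prompt_id and prompt_version from this record against the registry's sign-off data yields the evaluation run that was in force when the version was approved.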

What this prevents

Silent prompt drift (comparing content hashes at registry write time catches content that changed without a new version)
Unapproved changes reaching production (gating)
"We don't know which version produced this" (attribution)
Inability to reproduce a past decision (version + evaluation history preserved)

Implementation starting point

The minimal viable governance stack has three parts: a Postgres table as the registry, a CI/CD step that runs the evaluation pipeline on every pull request, and a middleware layer that rejects calls to non-approved prompt IDs.
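
The registry table and the rejecting middleware can be sketched together. To keep the example runnable without a database server, it uses Python's built-in sqlite3 module as a stand-in for Postgres. The table name, column set, and function names are assumptions, and a real deployment would use a Postgres driver and its own schema.

```python
import sqlite3
from typing import Callable


class PromptNotApproved(Exception):
    """Raised when a call references a prompt that is not approved."""


def create_registry(conn: sqlite3.Connection) -> None:
    # Hypothetical minimal schema; timestamps and sign-off columns omitted.
    conn.execute(
        """
        CREATE TABLE prompt_registry (
            id      TEXT NOT NULL,
            version TEXT NOT NULL,
            status  TEXT NOT NULL
                    CHECK (status IN ('draft', 'review', 'approved', 'deprecated')),
            owner   TEXT NOT NULL,
            hash    TEXT NOT NULL,
            PRIMARY KEY (id, version)
        )
        """
    )


def require_approved(conn: sqlite3.Connection, prompt_id: str, version: str) -> None:
    """Raise unless the exact (id, version) pair exists with approved status."""
    row = conn.execute(
        "SELECT status FROM prompt_registry WHERE id = ? AND version = ?",
        (prompt_id, version),
    ).fetchone()
    if row is None or row[0] != "approved":
        raise PromptNotApproved(f"{prompt_id}@{version} is not approved")


def guarded_call(
    conn: sqlite3.Connection,
    prompt_id: str,
    version: str,
    call_model: Callable[[], str],
) -> str:
    """Middleware: check the registry first, then run the model call."""
    require_approved(conn, prompt_id, version)
    return call_model()


# Usage with an in-memory database and a stubbed model call.
conn = sqlite3.connect(":memory:")
create_registry(conn)
conn.executemany(
    "INSERT INTO prompt_registry VALUES (?, ?, ?, ?, ?)",
    [
        ("kyc_summary", "1.0.0", "approved", "risk-team", "abc123"),
        ("kyc_summary", "1.1.0", "draft", "risk-team", "def456"),
    ],
)
result = guarded_call(conn, "kyc_summary", "1.0.0", lambda: "stubbed output")
```

Unknown prompt IDs and non-approved versions fail the same way, so the middleware denies by default.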

More sophisticated implementations add a dedicated governance API (Prompt-Maker pattern), evaluation dashboards, and automated deprecation when successor versions are approved.

The infrastructure cost is low. The cost of not having it is measured in audit findings.