General-purpose LLMs perform well on isolated regulatory questions. Ask ChatGPT or Claude about a specific BSA/AML requirement or FDIC filing rule and you’ll often get a sharp, well-reasoned answer. But regulatory compliance at a bank isn’t isolated questions — it’s multi-step analysis across overlapping federal and state frameworks, with institution-specific context that builds from one step to the next.

We tested that distinction rigorously. The results were stark.


We benchmarked against real compliance officers

We assembled a panel of 5 compliance advisors — CCOs and BSA Officers from U.S. community banks and mid-size institutions ($800M–$8B in assets), spanning single-state and multi-state operations (up to 5 jurisdictions). We gave them AI-generated compliance output and asked one question: would you sign off on this in front of a regulator?

The task set wasn’t synthetic. We ran 510 compliance tasks drawn from production workflows — real incidents with real operational details: dollar amounts, affected customer counts, filing deadlines, multi-system dependencies. The tasks fell into four categories:

| Category | Tasks | Example |
| --- | --- | --- |
| Regulation gap analysis | 130 | Identify missing regulatory coverage — e.g., state breach notification laws, BSA/FinCEN SAR requirements for non-bank creditors |
| Service outage | 128 | Manage regulatory reporting obligations during a SEV-1 payment processing outage blocking loan disbursements |
| Transaction & identity fraud | 124 | Investigate a 47-application synthetic ID fraud ring ($2.3M exposure), assess SAR filing obligations; flag suspicious ACH patterns and KYC verification failures |
| Security incident | 128 | Assess breach notification triggers after credential compromise, coordinate FDIC filing for PII exposure, determine evidence preservation for unauthorized employee access |

Every task was scored on 5 dimensions — factual accuracy, source traceability, institution specificity, completeness, and uncertainty signaling. All 5 had to pass; a failure on any single dimension failed the task.
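The all-or-nothing rubric can be sketched in a few lines. This is an illustrative reconstruction, not the evaluation code itself; the dimension names come from the article, while the function and data shapes are assumptions.

```python
# Hypothetical sketch of the all-or-nothing scoring rubric.
# Dimension names are from the article; everything else is illustrative.

DIMENSIONS = (
    "factual_accuracy",
    "source_traceability",
    "institution_specificity",
    "completeness",
    "uncertainty_signaling",
)

def task_passes(scores: dict) -> bool:
    """A task passes only if every one of the 5 dimensions passes."""
    return all(scores[d] for d in DIMENSIONS)

scores = {d: True for d in DIMENSIONS}
scores["source_traceability"] = False   # one failed dimension...
print(task_passes(scores))              # ...fails the whole task: False
```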

The baseline: 21%

We tested Claude Opus 4.5 with RAG — institution profile in the system prompt, relevant regulatory text retrieved from a curated corpus. This is a best-case setup for a team using general-purpose LLMs. More sophisticated than a raw ChatGPT conversation, but without domain-specific infrastructure.
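For concreteness, the baseline setup looks roughly like the sketch below: the institution profile goes into the system prompt, and retrieved regulatory passages are prepended to the question. The toy lexical retriever and the prompt wording are assumptions for illustration; a production setup would use embedding-based retrieval.

```python
# Rough sketch of the RAG baseline described above (illustrative only).

def retrieve(question: str, corpus: list, k: int = 3) -> list:
    """Toy lexical retrieval: rank passages by word overlap with the question."""
    q = set(question.lower().split())
    return sorted(corpus, key=lambda p: -len(q & set(p.lower().split())))[:k]

def build_baseline_messages(profile: str, question: str, corpus: list) -> list:
    """Assemble chat messages: profile in the system prompt, passages inline."""
    context = "\n\n".join(retrieve(question, corpus))
    return [
        {"role": "system", "content": f"Institution profile:\n{profile}"},
        {"role": "user",
         "content": f"Relevant regulations:\n{context}\n\nQuestion: {question}"},
    ]
```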

The overall pass rate was 21%.

| Category | Pass Rate |
| --- | --- |
| Regulation gap analysis | 11% |
| Service outage | 30% |
| Transaction & identity fraud | 21% |
| Security incident | 22% |

The model wasn’t incapable. On isolated, well-scoped questions, accuracy was high. Failures concentrated in five specific modes:

| Failure Mode | % of Failures |
| --- | --- |
| Hallucinated specifics — fabricated deadlines, thresholds, or regulatory citations | 34% |
| Context collapse — lost institution-specific details mid-analysis, reverted to generic guidance | 22% |
| Missing cross-references — failed to identify overlapping obligations across frameworks | 18% |
| No source attribution — correct assertions with no traceable citation | 15% |
| False confidence — uncertain or evolving positions presented as settled fact | 11% |

That was our starting point.

What we built

Two infrastructure layers.

The compliance context layer addresses depth. General-purpose interfaces accept a text prompt — but compliance work requires structured, persistent knowledge about the institution. This layer ingests and maintains the business profile, regulatory exposure map, obligation registry, internal policy corpus, and operational documents. When the system analyzes a regulation, it reasons against a deep, structured model of who the customer actually is. Not a one-line description in a prompt.
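A structured institution model of the kind described above might look like the sketch below. Every field name and the example citations are assumptions for illustration, not MidLyr's actual schema.

```python
# Illustrative sketch of a structured institution model (field names
# and citations are assumptions, not MidLyr's schema).
from dataclasses import dataclass, field

@dataclass
class Obligation:
    framework: str        # e.g. "BSA/AML" or "state breach notification"
    citation: str         # the section the obligation traces to
    jurisdiction: str     # "federal" or a specific state code

@dataclass
class InstitutionProfile:
    name: str
    asset_size_usd: int
    jurisdictions: list = field(default_factory=list)
    obligations: list = field(default_factory=list)
    policies: dict = field(default_factory=dict)  # policy name -> text

    def obligations_for(self, jurisdiction: str) -> list:
        """Filter the registry to rules that apply in one jurisdiction."""
        return [o for o in self.obligations
                if o.jurisdiction in ("federal", jurisdiction)]
```

The point of the structure is that analysis can query it ("which obligations apply in Texas?") instead of hoping a one-line prompt description carries the same information.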

The orchestration harness addresses discipline. A single model call — no matter how intelligent — can’t replicate the methodical, multi-step process a compliance analyst follows. The harness decomposes complex analysis into controlled steps:

  • Source grounding. Every claim traces to a specific section, paragraph, and clause. Unattributed assertions are blocked.
  • Persistent reasoning. Context accumulates across steps. The system doesn’t forget its own prior findings.
  • Confidence calibration. The system distinguishes between established requirements, regulatory guidance, and genuine ambiguity.
  • Domain guardrails. Checks that catch common model errors — wrong jurisdiction, superseded rules, confused terminology — built from analysis of 1,000+ model failures and calibrated against expert review.
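The four properties above can be sketched as a single control loop. This is a hedged illustration of the pattern, not MidLyr's implementation: `call_model` stands in for any LLM client, and the `{"claim", "citation"}` result shape is an assumption.

```python
# Hedged sketch of an orchestration loop with source grounding and
# persistent reasoning. `call_model` and the result shape are assumptions.

def run_analysis(task: str, steps: list, context: str, call_model) -> list:
    """Run a compliance task as a sequence of controlled steps."""
    findings = []
    for step in steps:
        # Persistent reasoning: every step sees all prior findings.
        accumulated = context + "".join(
            f"\nFinding: {f['claim']} ({f['citation']})" for f in findings
        )
        result = call_model(f"{step}: {task}", accumulated)
        # Source grounding: block assertions with no traceable citation.
        if not result.get("citation"):
            raise ValueError(f"unattributed assertion at step {step!r}")
        findings.append(result)
    return findings
```

Confidence calibration and domain guardrails would slot in as additional checks on `result` before it is appended, following the same pattern as the citation gate.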

The result: 93%

Same 510 tasks. Same 5-member panel. Same scoring rubric.

The overall pass rate was 93%. Same class of model. The difference was infrastructure.

| Category | Baseline | With MidLyr |
| --- | --- | --- |
| Regulation gap analysis | 11% | 89% |
| Service outage | 30% | 97% |
| Transaction & identity fraud | 21% | 92% |
| Security incident | 22% | 94% |
| Overall | 21% | 93% |

Two failure categories — context collapse and missing source attribution — were eliminated entirely. The remaining failures split across hallucinated specifics, missing cross-references, and false confidence. Some involved recently amended state-level rules that had not yet been incorporated into our corpus — a data freshness issue, not a reasoning failure.

| Failure Mode | Baseline (% of failures) | With Infrastructure (% of failures) |
| --- | --- | --- |
| Hallucinated specifics | 34% | 29% |
| Context collapse | 22% | 0% |
| Missing cross-references | 18% | 43% |
| No source attribution | 15% | 0% |
| False confidence | 11% | 28% |

Why this matters

Teams evaluating AI for compliance often ask: why wouldn’t we just use ChatGPT?

It’s the right question. General-purpose assistants are impressive, accessible, and improving fast. For a quick regulatory lookup or a first-pass summary, they work.

But compliance decisions carry real consequences. A CCO needs traceable sources, structured reasoning, awareness of what applies to their specific institution, and honest signals about uncertainty. General-purpose assistants don’t provide any of that — not because they’re not smart enough, but because they weren’t built to.

These are infrastructure problems, not intelligence problems. The model is the same. The difference is everything around it.


If your compliance team is evaluating AI — or already hitting the limits of general-purpose tools — we’d like to hear from you. Reach out at midlyr.com.