General-purpose LLMs perform well on isolated regulatory questions. Ask ChatGPT or Claude about a specific BSA/AML requirement or FDIC filing rule and you’ll often get a sharp, well-reasoned answer. But regulatory compliance at a bank isn’t isolated questions — it’s multi-step analysis across overlapping federal and state frameworks, with institution-specific context that builds from one step to the next.
We tested that distinction rigorously. The results were stark.
We benchmarked against real compliance officers
We assembled a panel of 5 compliance advisors — CCOs and BSA Officers from U.S. community banks and mid-size institutions ($800M–$8B in assets), spanning single-state and multi-state operations (up to 5 jurisdictions). We gave them AI-generated compliance output and asked one question: would you sign off on this in front of a regulator?
The task set wasn’t synthetic. We ran 510 compliance tasks drawn from production workflows — real incidents with real operational details: dollar amounts, affected customer counts, filing deadlines, multi-system dependencies. The 510 tasks fell into 4 categories:
| Category | Tasks | Example |
|---|---|---|
| Regulation gap analysis | 130 | Identify missing regulatory coverage — e.g., state breach notification laws, BSA/FinCEN SAR requirements for non-bank creditors |
| Service outage | 128 | Manage regulatory reporting obligations during a SEV-1 payment processing outage blocking loan disbursements |
| Transaction & identity fraud | 124 | Investigate a 47-application synthetic ID fraud ring ($2.3M exposure), assess SAR filing obligations; flag suspicious ACH patterns and KYC verification failures |
| Security incident | 128 | Assess breach notification triggers after credential compromise, coordinate FDIC filing for PII exposure, determine evidence preservation for unauthorized employee access |
Every task was scored on 5 dimensions — factual accuracy, source traceability, institution specificity, completeness, and uncertainty signaling. All 5 had to pass; a miss on any one failed the task.
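In rubric terms, that’s a strict conjunction. A minimal sketch of the gate, assuming boolean per-dimension judgments (the field names mirror the 5 dimensions; the structure itself is illustrative, not our scoring tooling):

```python
from dataclasses import dataclass, fields

@dataclass
class TaskScore:
    # One boolean judgment per rubric dimension.
    factual_accuracy: bool
    source_traceability: bool
    institution_specificity: bool
    completeness: bool
    uncertainty_signaling: bool

def task_passes(score: TaskScore) -> bool:
    # Strict conjunction: a task passes only if every dimension passes.
    return all(getattr(score, f.name) for f in fields(score))
```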
The baseline: 21%
We tested Claude Opus 4.5 with RAG — institution profile in the system prompt, relevant regulatory text retrieved from a curated corpus. This is a best-case setup for a team using general-purpose LLMs: more sophisticated than a raw ChatGPT conversation, but without domain-specific infrastructure.
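Concretely, the baseline looked something like the sketch below: institution profile in the system prompt, top-k retrieved passages inlined per task, one model call per task. The `corpus.retrieve` interface and the parameter values are hypothetical, not the exact harness we ran:

```python
def build_baseline_prompt(task: str, institution_profile: str, corpus) -> list[dict]:
    # Hypothetical baseline assembly: the profile rides in the system prompt,
    # retrieved regulatory passages are inlined, and the model answers in a
    # single call with no further structure around it.
    passages = corpus.retrieve(task, k=8)  # vector search over the curated corpus
    system = (
        "You are a bank compliance analyst.\n\n"
        f"Institution profile:\n{institution_profile}\n\n"
        "Relevant regulatory text:\n" + "\n---\n".join(passages)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": task},
    ]
```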
The overall pass rate was 21%.
| Category | Pass Rate |
|---|---|
| Regulation gap analysis | 11% |
| Service outage | 30% |
| Transaction & identity fraud | 21% |
| Security incident | 22% |
The model wasn’t incapable. On isolated, well-scoped questions, accuracy was high. Failures concentrated in five specific modes:
| Failure Mode | % of Failures |
|---|---|
| Hallucinated specifics — fabricated deadlines, thresholds, or regulatory citations | 34% |
| Context collapse — lost institution-specific details mid-analysis, reverted to generic guidance | 22% |
| Missing cross-references — failed to identify overlapping obligations across frameworks | 18% |
| No source attribution — correct assertions with no traceable citation | 15% |
| False confidence — uncertain or evolving positions presented as settled fact | 11% |
That was our starting point.
What we built
Two infrastructure layers.
The compliance context layer addresses depth. General-purpose interfaces accept a text prompt — but compliance work requires structured, persistent knowledge about the institution. This layer ingests and maintains the business profile, regulatory exposure map, obligation registry, internal policy corpus, and operational documents. When the system analyzes a regulation, it reasons against a deep, structured model of who the customer actually is. Not a one-line description in a prompt.
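A rough shape of that structured model, as a sketch: the five stores match the ones named above, but the field types and comments are illustrative rather than our actual schema.

```python
from dataclasses import dataclass

@dataclass
class InstitutionContext:
    # Persistent, structured knowledge the system reasons against,
    # rather than a one-line description in a prompt.
    business_profile: dict     # charter type, asset size, products, footprint
    exposure_map: dict         # jurisdiction -> applicable frameworks
    obligation_registry: list  # tracked obligations: trigger, deadline, owner
    policy_corpus: list        # internal policies, indexed for retrieval
    operational_docs: list     # procedures, system inventories, vendor lists
```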
The orchestration harness addresses discipline. A single model call — no matter how intelligent — can’t replicate the methodical, multi-step process a compliance analyst follows. The harness decomposes complex analysis into controlled steps (a sketch follows the list):
- Source grounding. Every claim traces to a specific section, paragraph, and clause. Unattributed assertions are blocked.
- Persistent reasoning. Context accumulates across steps. The system doesn’t forget its own prior findings.
- Confidence calibration. The system distinguishes between established requirements, regulatory guidance, and genuine ambiguity.
- Domain guardrails. Checks that catch common model errors — wrong jurisdiction, superseded rules, confused terminology — built from analysis of 1,000+ model failures and calibrated against expert review.
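Stitched together, the loop looks roughly like this. Everything here is a sketch: the class and function names are hypothetical, and the real harness does more per step, but the control flow (ground, label, check, accumulate) is the point.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Confidence(Enum):
    # Calibration labels every claim must carry.
    ESTABLISHED_REQUIREMENT = "established requirement"
    REGULATORY_GUIDANCE = "regulatory guidance"
    GENUINE_AMBIGUITY = "genuine ambiguity"

@dataclass
class Claim:
    text: str
    citation: Optional[str]   # e.g. a section/paragraph/clause reference
    confidence: Confidence

@dataclass
class StepResult:
    claims: list

class UngroundedClaimError(Exception):
    pass

def accept(draft: StepResult, guardrails: list) -> StepResult:
    # Source grounding: block any assertion without a traceable citation.
    for claim in draft.claims:
        if not claim.citation:
            raise UngroundedClaimError(claim.text)
    # Domain guardrails: wrong-jurisdiction, superseded-rule,
    # and confused-terminology checks run on every accepted draft.
    for check in guardrails:
        check(draft)
    return draft

def run_analysis(steps: list, guardrails: list) -> list:
    # Persistent reasoning: each accepted step's findings stay in scope for
    # every later step, so the system never loses its own prior conclusions.
    findings = []
    for produce_draft in steps:       # each step is a callable taking findings
        draft = produce_draft(findings)
        findings.append(accept(draft, guardrails))
    return findings
```

The design choice that matters is where rejection happens: an unattributed claim fails at the step boundary, inside the harness, rather than surfacing downstream in front of a reviewer.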
The result: 93%
Same 510 tasks. Same 5-member panel. Same scoring rubric.
The overall pass rate was 93%. Same class of model. The difference was infrastructure.
| Category | Baseline | With MidLyr |
|---|---|---|
| Regulation gap analysis | 11% | 89% |
| Service outage | 30% | 97% |
| Transaction & identity fraud | 21% | 92% |
| Security incident | 22% | 94% |
| Overall | 21% | 93% |
Two failure categories — context collapse and missing source attribution — were eliminated entirely. The remaining failures split across hallucinated specifics, missing cross-references, and false confidence; some involved recently amended state-level rules that had not yet been incorporated into our corpus — a data freshness issue, not a reasoning failure. Keep in mind that the shares below are of a much smaller pool: roughly 400 failing tasks at baseline versus roughly 35 with infrastructure, so missing cross-references grew as a share of failures while shrinking in absolute count.
| Failure Mode | Baseline (% of failures) | With Infrastructure (% of failures) |
|---|---|---|
| Hallucinated specifics | 34% | 29% |
| Context collapse | 22% | 0% |
| Missing cross-references | 18% | 43% |
| No source attribution | 15% | 0% |
| False confidence | 11% | 28% |
Why this matters
Teams evaluating AI for compliance often ask: why wouldn’t we just use ChatGPT?
It’s the right question. General-purpose assistants are impressive, accessible, and improving fast. For a quick regulatory lookup or a first-pass summary, they work.
But compliance decisions carry real consequences. A CCO needs traceable sources, structured reasoning, awareness of what applies to their specific institution, and honest signals about uncertainty. General-purpose assistants don’t provide any of that — not because they’re not smart enough, but because they weren’t built to.
These are infrastructure problems, not intelligence problems. The model is the same. The difference is everything around it.
If your compliance team is evaluating AI — or already hitting the limits of general-purpose tools — we’d like to hear from you. Reach out at midlyr.com.