General-purpose LLMs perform well on isolated regulatory questions. Ask ChatGPT or Claude about a specific BSA/AML requirement or FDIC filing rule and you’ll often get a sharp, well-reasoned answer. But regulatory compliance at a bank isn’t isolated questions — it’s multi-step analysis across overlapping federal and state frameworks, with institution-specific context that builds from one step to the next.
We tested that distinction rigorously. The results were stark.
We benchmarked against real compliance officers
We assembled a panel of 5 compliance advisors — CCOs and BSA Officers from U.S. community banks and mid-size institutions ($800M–$8B in assets), spanning single-state and multi-state operations (up to 5 jurisdictions). We gave them AI-generated compliance output and asked one question: would you sign off on this in front of a regulator?
The task set wasn’t synthetic. We ran 510 compliance tasks drawn from production workflows — real incidents with real operational details: dollar amounts, affected customer counts, filing deadlines, multi-system dependencies. The 510 tasks fell into 4 categories:
| Category | Tasks | Example |
|---|---|---|
| Regulation gap analysis | 130 | Identify missing regulatory coverage — e.g., state breach notification laws, BSA/FinCEN SAR requirements for non-bank creditors |
| Service outage | 128 | Manage regulatory reporting obligations during a SEV-1 payment processing outage blocking loan disbursements |
| Transaction & identity fraud | 124 | Investigate a 47-application synthetic ID fraud ring ($2.3M exposure), assess SAR filing obligations; flag suspicious ACH patterns and KYC verification failures |
| Security incident | 128 | Assess breach notification triggers after credential compromise, coordinate FDIC filing for PII exposure, determine evidence preservation for unauthorized employee access |
Every task was scored on 5 dimensions — factual accuracy, source traceability, institution specificity, completeness, and uncertainty signaling. All 5 had to pass; a miss on any one failed the task.
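In rubric terms, that’s a strict conjunction. A minimal sketch of the gate, assuming boolean per-dimension judgments (the field names mirror the 5 dimensions; the structure itself is illustrative, not our scoring tooling):

```python
from dataclasses import dataclass, fields

@dataclass
class TaskScore:
    # One boolean judgment per rubric dimension.
    factual_accuracy: bool
    source_traceability: bool
    institution_specificity: bool
    completeness: bool
    uncertainty_signaling: bool

def task_passes(score: TaskScore) -> bool:
    # Strict conjunction: a task passes only if every dimension passes.
    return all(getattr(score, f.name) for f in fields(score))
```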
The baseline: 21%
We tested Claude Opus 4.5 with RAG — institution profile in the system prompt, relevant regulatory text retrieved from a curated corpus. This is a best-case setup for a team using general-purpose LLMs: more sophisticated than a raw ChatGPT conversation, but without domain-specific infrastructure.
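Concretely, the baseline looked something like the sketch below: institution profile in the system prompt, top-k retrieved passages inlined per task, one model call per task. The `corpus.retrieve` interface and the parameter values are hypothetical, not the exact harness we ran:

```python
def build_baseline_prompt(task: str, institution_profile: str, corpus) -> list[dict]:
    # Hypothetical baseline assembly: the profile rides in the system prompt,
    # retrieved regulatory passages are inlined, and the model answers in a
    # single call with no further structure around it.
    passages = corpus.retrieve(task, k=8)  # vector search over the curated corpus
    system = (
        "You are a bank compliance analyst.\n\n"
        f"Institution profile:\n{institution_profile}\n\n"
        "Relevant regulatory text:\n" + "\n---\n".join(passages)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": task},
    ]
```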
The overall pass rate was 21%.
| Category | Pass Rate |
|---|---|
| Regulation gap analysis | 11% |
| Service outage | 30% |
| Transaction & identity fraud | 21% |
| Security incident | 22% |
The model wasn’t incapable. On isolated, well-scoped questions, accuracy was high. Failures concentrated in five specific modes:
| Failure Mode | % of Failures |
|---|---|
| Hallucinated specifics — fabricated deadlines, thresholds, or regulatory citations | 34% |
| Context collapse — lost institution-specific details mid-analysis, reverted to generic guidance | 22% |
| Missing cross-references — failed to identify overlapping obligations across frameworks | 18% |
| No source attribution — correct assertions with no traceable citation | 15% |
| False confidence — uncertain or evolving positions presented as settled fact | 11% |
That was our starting point.
What we built
Two infrastructure layers.
The compliance context layer addresses depth. General-purpose interfaces accept a text prompt — but compliance work requires structured, persistent knowledge about the institution. This layer ingests and maintains the business profile, regulatory exposure map, obligation registry, internal policy corpus, and operational documents. When the system analyzes a regulation, it reasons against a deep, structured model of who the customer actually is. Not a one-line description in a prompt.
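A rough shape of that structured model, as a sketch: the five stores match the ones named above, but the field types and comments are illustrative rather than our actual schema.

```python
from dataclasses import dataclass

@dataclass
class InstitutionContext:
    # Persistent, structured knowledge the system reasons against,
    # rather than a one-line description in a prompt.
    business_profile: dict     # charter type, asset size, products, footprint
    exposure_map: dict         # jurisdiction -> applicable frameworks
    obligation_registry: list  # tracked obligations: trigger, deadline, owner
    policy_corpus: list        # internal policies, indexed for retrieval
    operational_docs: list     # procedures, system inventories, vendor lists
```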
The orchestration harness addresses discipline. A single model call — no matter how intelligent — can’t replicate the methodical, multi-step process a compliance analyst follows. The harness decomposes complex analysis into controlled steps (a sketch follows the list):
- Source grounding. Every claim traces to a specific section, paragraph, and clause. Unattributed assertions are blocked.
- Persistent reasoning. Context accumulates across steps. The system doesn’t forget its own prior findings.
- Confidence calibration. The system distinguishes between established requirements, regulatory guidance, and genuine ambiguity.
- Domain guardrails. Checks that catch common model errors — wrong jurisdiction, superseded rules, confused terminology — built from analysis of 1,000+ model failures and calibrated against expert review.
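Stitched together, the loop looks roughly like this. Everything here is a sketch: the class and function names are hypothetical, and the real harness does more per step, but the control flow (ground, label, check, accumulate) is the point.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Confidence(Enum):
    # Calibration labels every claim must carry.
    ESTABLISHED_REQUIREMENT = "established requirement"
    REGULATORY_GUIDANCE = "regulatory guidance"
    GENUINE_AMBIGUITY = "genuine ambiguity"

@dataclass
class Claim:
    text: str
    citation: Optional[str]   # e.g. a section/paragraph/clause reference
    confidence: Confidence

@dataclass
class StepResult:
    claims: list

class UngroundedClaimError(Exception):
    pass

def accept(draft: StepResult, guardrails: list) -> StepResult:
    # Source grounding: block any assertion without a traceable citation.
    for claim in draft.claims:
        if not claim.citation:
            raise UngroundedClaimError(claim.text)
    # Domain guardrails: wrong-jurisdiction, superseded-rule,
    # and confused-terminology checks run on every accepted draft.
    for check in guardrails:
        check(draft)
    return draft

def run_analysis(steps: list, guardrails: list) -> list:
    # Persistent reasoning: each accepted step's findings stay in scope for
    # every later step, so the system never loses its own prior conclusions.
    findings = []
    for produce_draft in steps:       # each step is a callable taking findings
        draft = produce_draft(findings)
        findings.append(accept(draft, guardrails))
    return findings
```

The design choice that matters is where rejection happens: an unattributed claim fails at the step boundary, inside the harness, rather than surfacing downstream in front of a reviewer.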
The result: 93%
Same 510 tasks. Same 5-member panel. Same scoring rubric.
The overall pass rate was 93%. Same class of model. The difference was infrastructure.
| Category | Baseline | With MidLyr |
|---|---|---|
| Regulation gap analysis | 11% | 89% |
| Service outage | 30% | 97% |
| Transaction & identity fraud | 21% | 92% |
| Security incident | 22% | 94% |
| Overall | 21% | 93% |
Two failure categories — context collapse and missing source attribution — were eliminated entirely. The remaining failures split across hallucinated specifics, missing cross-references, and false confidence; some involved recently amended state-level rules that had not yet been incorporated into our corpus — a data freshness issue, not a reasoning failure. Keep in mind that the shares below are of a much smaller pool: roughly 400 failing tasks at baseline versus roughly 35 with infrastructure, so missing cross-references grew as a share of failures while shrinking in absolute count.
| Failure Mode | Baseline (% of failures) | With Infrastructure (% of failures) |
|---|---|---|
| Hallucinated specifics | 34% | 29% |
| Context collapse | 22% | 0% |
| Missing cross-references | 18% | 43% |
| No source attribution | 15% | 0% |
| False confidence | 11% | 28% |
Why this matters
Teams evaluating AI for compliance often ask: why wouldn’t we just use ChatGPT?
It’s the right question. General-purpose assistants are impressive, accessible, and improving fast. For a quick regulatory lookup or a first-pass summary, they work.
But compliance decisions carry real consequences. A CCO needs traceable sources, structured reasoning, awareness of what applies to their specific institution, and honest signals about uncertainty. General-purpose assistants don’t provide any of that — not because they’re not smart enough, but because they weren’t built to.
These are infrastructure problems, not intelligence problems. The model is the same. The difference is everything around it.
If your compliance team is evaluating AI — or already hitting the limits of general-purpose tools — we’d like to hear from you. Reach out at midlyr.com.