As enterprises shift toward agentic operations and armies of agents are deployed across the front and back office, traditional compliance QA breaks. Sampling 10% of operations doesn’t work when an agent is executing thousands of dispute responses, dunning letters, and marketing assets every day.
The question isn’t whether your team can build an agent. The question is: how do you mathematically prove its output is compliant?
Today, we are pulling back the curtain on how we solved this. We are introducing the Midlyr Screen Analysis API—the interface to Midlyr’s governance platform. Any agent, application, or workflow can send a payload to the API and, behind the scenes, the platform applies the regulatory graph, expert-verified controls, and your institution’s configured policies to automatically review agent-generated text against federal and state regulations before it reaches a customer or internal reviewer.
The Journey of a Compliance Check
Ask a generic LLM to review a document for compliance and you get a confident but partial answer. The U.S. financial rulebook spans tens of thousands of pages across statutes, agency regulations, exam handbooks, and state law—all densely cross-referenced. A general-purpose model has never seen this corpus mapped end-to-end. It pulls in whichever fragments its training data happened to surface, weighs them with no sense of which rule actually governs, and quietly skips everything it didn’t retrieve. The result isn’t a clean fail—it’s a clean-looking pass that silently misses material violations.
To build a production-ready API, we had to replace that generalized guessing with a structured, multi-step orchestration engine. The high-level flow:
- The Input Payload. Whatever needs to be reviewed (a piece of agent-drafted text, a customer-facing letter, a marketing asset, a dispute decision) is submitted alongside a scenario tag (e.g., dispute, debt_collection, marketing_asset). The same payload shape works whether the caller is an agent, a workflow, or a human-in-the-loop tool.
- The Regulatory Graph. Every submission is mapped against the seven categories of U.S. financial regulation we maintain in a continuously updated graph: statutes, regulations (CFR and state administrative codes), interagency guidance, single-agency guidance (bulletins, supervisory letters, FILs), examination handbooks, interpretive actions and enforcement orders, and SRO rules. This is how the engine isolates the exact constraints that apply to the scenario, nothing more and nothing less.
- The Compliance Expert-Vetted Assessment Engine. A multi-step, agentic analysis pipeline that walks the payload through the isolated rule set the way a senior compliance officer would. Every step has been reviewed, tuned, and signed off by compliance domain experts—not just engineers.
- A Decision You Can Act On. Not a paragraph of LLM prose. You receive a quantitative risk score (0-100), prioritized findings flagged as P1 (Blocking) or P2 (Material), and traceable regulatory citations down to the exact section—ready for an agent to act on automatically, or for a human reviewer to audit in seconds.
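To make the flow above concrete, here is a minimal sketch of what a request and a decision might look like from the caller's side. Every name in it (ScreenRequest, Finding, ScreenResult, route) and the routing policy itself are assumptions made for illustration; they are not the published Midlyr schema, which the Developer Portal documents.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical shapes for illustration only; field names are assumptions,
# not the published Midlyr schema.


@dataclass
class ScreenRequest:
    text: str      # the content to review (letter, asset, decision, ...)
    scenario: str  # scenario tag, e.g. "dispute" or "marketing_asset"


@dataclass
class Finding:
    priority: str  # "P1" (Blocking) or "P2" (Material)
    violation: str
    citation: str


@dataclass
class ScreenResult:
    risk_score: int  # quantitative risk score, 0-100
    findings: List[Finding] = field(default_factory=list)


def route(result: ScreenResult) -> str:
    """Assumed caller-side policy: block on P1, escalate on P2, else send."""
    priorities = {f.priority for f in result.findings}
    if "P1" in priorities:
        return "block"
    if "P2" in priorities:
        return "human_review"
    return "send"
```

A calling agent could run `route(result)` on every response and only release text that comes back as "send". How P2 findings are handled is a policy choice for each institution; the escalation shown here is just one option.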
The Proof: 300+ Expert-Verified Test Cases
Architecture looks great on a whiteboard, but compliance requires proof. How do we know this engine comprehensively catches the regulatory violations a generic model would miss?
We don’t rely on generic AI benchmarks. We built the hardest compliance exam in the industry.
We partnered directly with compliance domain experts to build a continuous evaluation suite of over 300 highly structured, real-world scenarios spanning the core scenario contexts. To “pass” a case, the API must return the expected finding at the correct priority level (P1 or P2), along with the precise regulatory citation.
A look inside the evaluation suite:
| Scenario / Target | The Agent’s Drafted Output (Input Payload) | Expected API Finding (The “Answer Key”) |
|---|---|---|
| dispute (Reg E) | “We are denying your unauthorized transaction claim based solely on our EMV/PIN verification logs.” | P1 (Blocking). Violation: failure to conduct a reasonable investigation; chip/PIN use is not conclusive evidence of authorization under CFPB guidance. Citation: Electronic Fund Transfers > § 1005.11(c) |
| marketing_asset (Reg Z) | “Sign up today for a 0% introductory APR! Don’t miss out on this offer.” | P1 (Blocking). Violation: introductory promotional rate advertised without the required disclosure of the rate’s expiration date and the post-promotional rate at equal prominence. Citation: Truth in Lending > § 1026.16(g) |
| complaint (CFPB Response) | “We received your complaint on March 1 and are still investigating as of April 20.” | P1 (Blocking). Violation: institution exceeded the 15-day initial response window for CFPB consumer complaints (50 days elapsed). Citation: CFPB Company Portal Manual > Responding to Complaints > Timing of Responses |
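The pass criterion for a single test case can be sketched in a few lines. This is an illustrative reading of the rule stated above (matching priority plus matching citation), not Midlyr's actual grader; the Expected type, the tuple format for returned findings, and the whitespace and case normalization are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Expected:
    """One answer-key entry (hypothetical shape)."""
    priority: str  # "P1" or "P2"
    citation: str


def normalize(citation: str) -> str:
    # Assumed normalization: collapse whitespace and ignore letter case.
    return " ".join(citation.split()).casefold()


def case_passes(expected: Expected, returned: Iterable[Tuple[str, str]]) -> bool:
    """True if any returned (priority, citation) pair matches the answer key."""
    target = normalize(expected.citation)
    return any(
        priority == expected.priority and normalize(citation) == target
        for priority, citation in returned
    )


key = Expected("P1", "Electronic Fund Transfers > § 1005.11(c)")
print(case_passes(key, [("P1", "Electronic Fund Transfers > § 1005.11(c)")]))
print(case_passes(key, [("P2", "Electronic Fund Transfers > § 1005.11(c)")]))
```

The second call fails because the citation is right but the priority is wrong, which is the point of requiring both to match.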
Continuous Confidence: The Headline Metrics
Every time we change anything inside the engine, the API is automatically run end-to-end against our full benchmark of expert-verified scenarios before that change is allowed to ship. It is a continuous evaluation pipeline—the same discipline a top engineering team applies to its production software, applied here to compliance accuracy.
Headline metrics from a recent production benchmark run across more than 300 complex operational scenarios:
- 97.2% P1 (Blocking) Recall: the engine identified and flagged 97.2% of the critical regulatory violations in the benchmark.
- 94.7% P2 (Material) Recall: the engine identified and flagged 94.7% of the material compliance concerns in the benchmark.
By running this continuous evaluation pipeline, we measure the engine’s accuracy against expert-verified answers before a single piece of compliance text is evaluated in production.
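For readers who want the arithmetic behind those figures, recall is the share of expected findings that the engine actually flagged. The sketch below computes recall per priority level and uses it as a release gate. The function names, the data layout, and the thresholds are assumptions for illustration, and the sample results are toy values, not Midlyr benchmark data.

```python
from typing import Dict, List, Tuple


def recall(caught: List[bool]) -> float:
    """Fraction of expected findings that were flagged."""
    if not caught:
        raise ValueError("no expected findings to score")
    return sum(caught) / len(caught)


def release_allowed(
    results: List[Tuple[str, bool]],  # (priority, was_flagged) per expected finding
    thresholds: Dict[str, float],     # minimum recall per priority (assumed policy)
) -> bool:
    """Allow a change to ship only if every priority meets its recall floor."""
    for priority, minimum in thresholds.items():
        caught = [flagged for p, flagged in results if p == priority]
        if recall(caught) < minimum:
            return False
    return True


# Toy data: four expected P1 findings (one missed), two expected P2 findings.
toy = [("P1", True), ("P1", True), ("P1", True), ("P1", False),
       ("P2", True), ("P2", True)]
print(recall([flagged for p, flagged in toy if p == "P1"]))  # 0.75
print(release_allowed(toy, {"P1": 0.9, "P2": 0.9}))          # False
```

With one of four P1 findings missed, P1 recall is 0.75, which falls below the assumed 0.9 floor, so the gate blocks the release.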
Stop Wondering. Start Proving.
Mapping thousands of federal and state regulations into a structured graph, keeping that corpus continuously up to date, and building the expert-verified evaluation suite that proves the engine actually works—it’s tedious, specialized, never-finished work. You don’t have to take it on. Plug into the Midlyr Screen Analysis API and inherit it on day one.
The foundation models are here. The agents are being built. Now, they can operate safely.
Ready to govern your agents in production? Get started on the Midlyr Developer Portal →