RESEARCH · BENCHMARK RELEASE

Introducing GTM‑BENCH

An open benchmark for evaluating whether AI agents can find the right buyers for the right seller, with evidence.
Buyer/seller coherence · June 2026 · Agentic GTM workflows

72 GTM tasks · 11 task types · 15 market categories · 59,881 source queries

We're introducing GTM-Bench, an open benchmark for go-to-market agents. GTM-Bench was built to evaluate whether AI systems can complete real prospecting work end to end: infer what a seller offers, define an actionable ideal customer profile, retrieve matching accounts and contacts, and support every record with evidence.

Each task consists of a natural-language instruction, a controlled data environment, and a requirement that the agent produce three reviewable artifacts: an offer summary, an ICP, and a ranked prospect list. This structure mirrors how prospecting work is actually briefed, performed, and reviewed on GTM teams.

The first version includes 72 tasks across 11 task types and 15 market categories, designed from a taxonomy of 59,881 real prospecting queries submitted to Bebop.ai. We are releasing the task catalog, agent harness, evaluation code, benchmark calculation code, and leaderboard to give model providers, agent builders, researchers, and GTM teams a shared way to measure progress.

You can find the open-source release at gtm-bench.ai.

Why GTM needs it

Most AI benchmarks don't capture the shape of real GTM work.

At Blackpearl, we have spent the past several years building AI systems for go-to-market work. That has forced us to think hard about how to measure whether these systems are actually any good at it.

To date, there has not been a benchmark that isolates the GTM lead-generation problem. Existing evaluations such as CRMArena, CRMArena-Pro, and Microsoft's Sales Research Bench test business-system reasoning: customer service, information retrieval, and grounded reporting over sales data. In coding, benchmarks like SWE-bench and Terminal-Bench became leading indicators of agent capability. GTM has had no equivalent, which means buyers compare systems on demos rather than outcomes.

The gap matters beyond our own industry. Poor matching wastes sales budgets and fills inboxes with irrelevant outreach, and AI has made outbound generation nearly free. A benchmark that rewarded agents for returning more rows would encourage exactly the wrong behaviour.

"A benchmark that rewards agents for returning more rows would encourage the wrong behaviour. GTM-Bench is designed to do the opposite."

From the GTM-Bench paper

How it works

Each task mirrors real GTM work.

Each task gives an agent a natural-language GTM instruction, a controlled data environment, and a standard operating environment. To complete it, the agent must produce three artifacts:

OFFER.md · The offer

What the seller appears to be offering (inferred, not given).

ICP.md · The buyer

The ideal customer profile that follows from the task and offer.

RESULTS.md · The prospects

A ranked list of accounts or contacts, with evidence and activation context.

A useful system cannot jump straight to a lead list. It has to understand the seller, reason about the buyer, retrieve candidates, and decide which records are worth acting on.
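
As a rough illustration of the three-artifact requirement, the sketch below checks that a run directory contains all three files and that none is empty. Only the file names come from the task description above; the function name `missing_artifacts`, the run-directory layout, and the "non-empty" check are assumptions made for this example and are not taken from the GTM-Bench harness.

```python
from pathlib import Path

# File names taken from the task description; the checks are illustrative.
REQUIRED_ARTIFACTS = ("OFFER.md", "ICP.md", "RESULTS.md")


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the required artifacts that are absent or empty in run_dir."""
    missing = []
    for name in REQUIRED_ARTIFACTS:
        path = run_dir / name
        # Treat a whitespace-only file the same as a missing one.
        if not path.is_file() or not path.read_text(encoding="utf-8").strip():
            missing.append(name)
    return missing
```

A run would count as reviewable in this sketch only when `missing_artifacts(run_dir)` returns an empty list. The real harness may validate structure and content far more strictly.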

Scoring

Useful matches, not more rows.

Every returned record is judged on match quality (does this company or contact genuinely fit?) and audit quality (is it the right real-world entity, with supported claims?). Records grade into three bands:

A-grade · +1

Right company, right contact, clear offer need, supported claims. Creates positive utility.

B-grade · 0

Plausible but not activation-ready. Neutral.

Sub-B · −1

Wrong fits, unsupported claims, unresolved identities. Actively reduces score.

An agent cannot win by flooding the evaluator with weak leads. This reflects real GTM work: a bad lead is not just "less good" — it wastes budget, damages trust, and creates spam.
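
The band arithmetic can be sketched in a few lines. This is a minimal illustration that assumes the net score is a plain sum of the per-record utilities listed above (+1, 0, −1). The published scores are fractional, so the actual benchmark calculation evidently applies weighting or aggregation that is not described here. The `summarize` function and the grade labels are hypothetical names for this example.

```python
from collections import Counter

# Per-record utilities taken from the three bands described above.
BAND_UTILITY = {"A": 1, "B": 0, "sub-B": -1}


def summarize(grades: list[str]) -> dict[str, float]:
    """Compute record count, net score, and A-grade rate for graded rows."""
    counts = Counter(grades)
    unknown = set(counts) - set(BAND_UTILITY)
    if unknown:
        raise ValueError(f"unknown grade labels: {sorted(unknown)}")
    total = len(grades)
    # Each A adds one point, each sub-B removes one, and B rows are neutral.
    net = sum(BAND_UTILITY[grade] * n for grade, n in counts.items())
    a_rate = counts["A"] / total if total else 0.0
    return {"records": total, "net_score": net, "a_grade_rate": a_rate}


# Hypothetical run: 4 A-grade, 3 B-grade, and 5 sub-B rows.
example = summarize(["A"] * 4 + ["B"] * 3 + ["sub-B"] * 5)
```

In this made-up run a third of the rows are A-grade, yet the net score is −1 because the five sub-B rows outweigh the four A-grade rows.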

Initial results

Six frontier generalist systems, one purpose-built GTM system.

GTM-Bench v1 · Net score and A-grade rate

| System | Net score | A-grade rate |
| --- | --- | --- |
| Blackpearl RTSA (purpose-built) | +26,615.6 | 40.9% |
| OpenAI GPT-5.5 / Codex | +4,040.9 | 37.7% |
| Claude Sonnet 4.6 / Claude Code | +400.1 | 27.3% |
| Claude Opus 4.7 / Claude Code | −2,476.6 | 31.7% |
| DeepSeek V4 Pro / Hermes | −3,398.0 | 21.8% |
| Gemini 3.5 Flash / Hermes | −10,671.9 | 13.6% |
| Kimi K2.6 / Hermes | −15,402.3 | 22.8% |

GTM-Bench is developed by Blackpearl research; RTSA is a Blackpearl system. The full task catalog, harness, evaluation code, and run artifacts are open source so results can be independently verified. Full results, confidence intervals, and methodology: gtm-bench.ai.

The results show why volume-weighted scoring matters. Some systems returned many plausible-looking rows, but enough were weak, unsupported, or incorrectly matched that their total utility became negative. RTSA and GPT-5.5 had similar A-grade rates — but RTSA produced far more useful volume, which is what matters in production GTM.
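
To make the volume effect concrete, here is a small worked example with hypothetical numbers, not benchmark data. It assumes the simple +1 / 0 / −1 band utilities and ignores whatever additional weighting produces the fractional scores in the table. The `net_score` helper and its percentage parameters are invented for this illustration.

```python
def net_score(rows: int, a_pct: int, sub_b_pct: int) -> float:
    """Net score under +1 / 0 / -1 scoring, given band shares in percent."""
    if a_pct < 0 or sub_b_pct < 0 or a_pct + sub_b_pct > 100:
        raise ValueError("band percentages must be non-negative and sum to <= 100")
    # B-grade rows contribute nothing, so only the A and sub-B shares matter.
    return rows * (a_pct - sub_b_pct) / 100


# Same 40% A-grade rate and 15% sub-B rate, different volumes.
high_volume = net_score(rows=10_000, a_pct=40, sub_b_pct=15)  # 2500.0
low_volume = net_score(rows=1_000, a_pct=40, sub_b_pct=15)    # 250.0

# High volume with more sub-B rows than A-grade rows goes negative.
flooding = net_score(rows=10_000, a_pct=25, sub_b_pct=45)     # -2000.0
```

With identical band rates, ten times the volume gives ten times the net score. Once sub-B rows outnumber A-grade rows, extra volume only deepens the loss.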

"[Quote placeholder — Nick Lissette, CEO. 1–2 sentences on why measurable outcomes matter for GTM AI. To be drafted and approved.]"

Nick Lissette · Chief Executive Officer, Blackpearl Group

What we learned

The strongest runs filtered before they wrote.

Across 432 generalist-agent traces, the strongest runs retrieved broad candidate pools, then used structured filtering, website evidence, and row-level pruning before writing final results. The weakest runs stopped too early, overproduced weak rows, or made claims the evidence couldn't support.

The lesson for GTM agents: retrieval alone is not enough. The hard part is deciding what to keep, what to reject, and when the evidence is strong enough to act.
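
The filter-before-write pattern can be sketched as a pruning step between retrieval and the final results file. Everything in this example is an assumption for illustration: the `Candidate` fields, the `fit_score` scale, and the thresholds are hypothetical and do not describe how any benchmarked system works.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    name: str
    fit_score: float  # hypothetical 0..1 estimate of ICP fit
    evidence: list[str] = field(default_factory=list)  # supporting sources
    identity_resolved: bool = False  # matched to a real-world entity?


def prune(candidates: list[Candidate],
          min_fit: float = 0.7,
          min_evidence: int = 1) -> list[Candidate]:
    """Keep only rows that are resolved, well matched, and evidenced."""
    kept = [
        c for c in candidates
        if c.identity_resolved
        and c.fit_score >= min_fit
        and len(c.evidence) >= min_evidence
    ]
    # Rank the surviving rows by fit, best first, before writing results.
    return sorted(kept, key=lambda c: c.fit_score, reverse=True)


# Hypothetical broad pool retrieved before filtering.
pool = [
    Candidate("Acme Logistics", 0.91, ["acme.example/about"], True),
    Candidate("Globex", 0.85, [], True),                  # no evidence
    Candidate("Initech", 0.55, ["initech.example"], True),  # weak fit
    Candidate("Umbrella", 0.88, ["umbrella.example"], False),  # unresolved
    Candidate("Hooli", 0.78, ["hooli.example/careers"], True),
]
shortlist = [c.name for c in prune(pool)]
```

Here only two of the five retrieved rows survive. Under a scoring scheme that penalizes weak rows, dropping the other three is better than writing them.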

"[Quote placeholder — Sam Daish, CTO, or Max Polaczuk, lead author. 1–2 sentences on the benchmark methodology / open-source release. To be drafted and approved.]"

Sam Daish · Chief Technology Officer, Blackpearl Group

API Partners

From the frontier.

[Placeholder module: partner quotes to be sourced and approved.]

What's next

An early step toward better evaluation for GTM agents.

Future versions will expand task coverage, improve reproducibility, and incorporate richer buyer-affinity signals — going beyond likely fit to estimate whether a matched prospect is actually likely to buy.

Our goal is to help the GTM and AI communities understand where agents are useful today and how to make them better over time. That will take input from more than Blackpearl. If you want to contribute tasks, challenge the rubric, or run your own system against the benchmark, the code and leaderboard are at gtm-bench.ai, or reach the research team at research@blackpearl.com.