RESEARCH · BENCHMARK RELEASE

Introducing GTM-Bench

An open benchmark for evaluating whether AI agents can find the right buyers for the right seller, with evidence.

Buyer/seller coherence

June 2026

Agentic GTM workflows

72 GTM tasks · 11 task types · 15 market categories · 59,881 source queries

We're introducing GTM-Bench, an open benchmark for go-to-market agents. GTM-Bench was built to evaluate whether AI systems can complete real prospecting work end to end: infer what a seller offers, define an actionable ideal customer profile, retrieve matching accounts and contacts, and support every record with evidence.

Each task consists of a natural-language instruction, a controlled data environment, and a requirement that the agent produce three reviewable artifacts: an offer summary, an ICP, and a ranked prospect list. This structure mirrors how prospecting work is actually briefed, performed, and reviewed on GTM teams.

The first version includes 72 tasks across 11 task types and 15 market categories, designed from a taxonomy of 59,881 real prospecting queries submitted to Bebop.ai. We are releasing the task catalog, agent harness, evaluation code, benchmark calculation code, and leaderboard to give model providers, agent builders, researchers, and GTM teams a shared way to measure progress.

You can find the open-source release at gtm-bench.ai.

Why GTM needs it

Most AI benchmarks don't capture the shape of real GTM work.

At Blackpearl, we have spent the past several years building AI systems for go-to-market work. That has forced us to think hard about how to measure whether these systems are actually any good at it.

To date, there has not been a benchmark that isolates the GTM lead-generation problem. Existing evaluations such as CRMArena, CRMArena-Pro, and Microsoft's Sales Research Bench test business-system reasoning: customer service, information retrieval, and grounded reporting over sales data. In coding, benchmarks like SWE-bench and Terminal-Bench became leading indicators of agent capability. GTM has had no equivalent, which means buyers compare systems on demos rather than outcomes.

The gap matters beyond our own industry. Poor matching wastes sales budgets and fills inboxes with irrelevant outreach, and AI has made outbound generation nearly free. A benchmark that rewarded agents for returning more rows would encourage exactly the wrong behaviour.

"A benchmark that rewards agents for returning more rows would encourage the wrong behaviour. GTM-Bench is designed to do the opposite."

From the GTM-Bench paper

How it works

Each task mirrors real GTM work.

Each task gives an agent a natural-language GTM instruction, a controlled data environment, and a standard operating environment. To complete it, the agent must produce three artifacts:

OFFER.md · The offer
What the seller appears to be offering: inferred, not given.

ICP.md · The buyer
The ideal customer profile that follows from the task and offer.

RESULTS.md · The prospects
A ranked list of accounts or contacts, with evidence and activation context.

A useful system cannot jump straight to a lead list. It has to understand the seller, reason about the buyer, retrieve candidates, and decide which records are worth acting on.
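As a rough illustration of the three-artifact contract described above, a harness could check a run's output directory before any grading happens. This is a minimal sketch, not the GTM-Bench harness itself: the function name `missing_artifacts` and the flat directory layout are assumptions, and only the three file names come from the task description.

```python
from pathlib import Path

# File names come from the task description; everything else is illustrative.
REQUIRED_ARTIFACTS = ("OFFER.md", "ICP.md", "RESULTS.md")


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the required artifacts that are absent or empty in run_dir."""
    missing = []
    for name in REQUIRED_ARTIFACTS:
        path = run_dir / name
        # Treat an empty file as missing: a blank ICP is not a reviewable artifact.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

A run that returns an empty list here has met the structural requirement only; whether the offer, ICP, and prospects are any good is a separate question for the scoring step.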

Scoring

Useful matches, not more rows.

Every returned record is judged on match quality (does this company or contact genuinely fit?) and audit quality (is it the right real-world entity, with supported claims?). Records grade into three bands:

A-grade · +1

Right company, right contact, clear offer need, supported claims. Creates positive utility.

B-grade · 0

Plausible but not activation-ready. Neutral.

Sub-B · −1

Wrong fits, unsupported claims, unresolved identities. Actively reduces score.

An agent cannot win by flooding the evaluator with weak leads. This reflects real GTM work: a bad lead is not just "less good" — it wastes budget, damages trust, and creates spam.
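The three bands can be sketched as a simple signed tally. This is an unweighted illustration under stated assumptions, not the benchmark's calculation code: the band labels (`"A"`, `"B"`, `"sub_b"`) and the `summarize` function are hypothetical, and the published net scores are fractional, so the real calculation evidently applies weighting or aggregation that this post does not describe.

```python
from collections import Counter

# Band values as stated in the text: A-grade +1, B-grade 0, Sub-B -1.
BAND_VALUE = {"A": 1, "B": 0, "sub_b": -1}


def summarize(grades: list[str]) -> dict[str, float]:
    """Return the unweighted net score and A-grade rate for graded records."""
    counts = Counter(grades)
    total = len(grades)
    net = sum(BAND_VALUE[band] * n for band, n in counts.items())
    a_rate = counts["A"] / total if total else 0.0
    return {"net_score": net, "a_grade_rate": a_rate}


# A short, careful list versus the same list padded with weak rows.
careful = ["A"] * 8 + ["B"] * 2
flooded = careful + ["sub_b"] * 20

print(summarize(careful))  # net score 8, A-grade rate 0.8
print(summarize(flooded))  # net score -12, A-grade rate about 0.27
```

Both lists contain the same eight A-grade records, but the padded list ends up with a negative net score, which is the behaviour the penalty band is meant to produce.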

Initial results

Six frontier generalist systems, one purpose-built GTM system.

GTM-Bench v1 · Net score and A-grade rate

System                             Net score    A-grade rate
Blackpearl RTSA (purpose-built)    +26,615.6    40.9%
OpenAI GPT-5.5 / Codex              +4,040.9    37.7%
Claude Sonnet 4.6 / Claude Code       +400.1    27.3%
Claude Opus 4.7 / Claude Code       −2,476.6    31.7%
DeepSeek V4 Pro / Hermes            −3,398.0    21.8%
Gemini 3.5 Flash / Hermes          −10,671.9    13.6%
Kimi K2.6 / Hermes                 −15,402.3    22.8%

GTM-Bench is developed by Blackpearl research; RTSA is a Blackpearl system. The full task catalog, harness, evaluation code, and run artifacts are open source so results can be independently verified. Full results, confidence intervals, and methodology: gtm-bench.ai.

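The results show why volume-weighted scoring matters. Some systems returned many plausible-looking rows, but enough were weak, unsupported, or incorrectly matched that their total utility became negative. A-grade rate alone does not predict net score: Claude Opus 4.7 had a higher A-grade rate than Claude Sonnet 4.6 (31.7% vs 27.3%) yet finished with a negative net score. RTSA and GPT-5.5 had similar A-grade rates (40.9% and 37.7%), but RTSA produced far more useful volume, which is what matters in production GTM.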
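To make the gap between the two metrics concrete, the published v1 figures can be ranked both ways. The numbers below are copied from the GTM-Bench v1 results; the variable names are only for illustration.

```python
# (system, net score, A-grade rate in percent), copied from the GTM-Bench v1 results.
RESULTS = [
    ("Blackpearl RTSA", 26615.6, 40.9),
    ("OpenAI GPT-5.5 / Codex", 4040.9, 37.7),
    ("Claude Sonnet 4.6 / Claude Code", 400.1, 27.3),
    ("Claude Opus 4.7 / Claude Code", -2476.6, 31.7),
    ("DeepSeek V4 Pro / Hermes", -3398.0, 21.8),
    ("Gemini 3.5 Flash / Hermes", -10671.9, 13.6),
    ("Kimi K2.6 / Hermes", -15402.3, 22.8),
]

# Rank the same systems by each metric, best first.
by_net = [name for name, _, _ in sorted(RESULTS, key=lambda r: r[1], reverse=True)]
by_rate = [name for name, _, _ in sorted(RESULTS, key=lambda r: r[2], reverse=True)]

for name in by_net:
    net_rank = by_net.index(name) + 1
    rate_rank = by_rate.index(name) + 1
    print(f"{name}: net rank {net_rank}, A-grade rank {rate_rank}")
```

The top two positions agree, but the orderings diverge below that: Kimi K2.6 is fifth by A-grade rate and last by net score, so a leaderboard sorted on A-grade rate alone would tell a different story.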

"The AI industry has become obsessed with output. It has spent far less time measuring outcomes. Poor-quality agentic AI doesn't simply fail to find opportunities - it consumes budgets, wastes sales hours, pollutes CRM systems and sends organisations chasing customers who were never likely to buy. Put bluntly, bad AI may be worse than no AI at all.”

Nick Lissette · Chief Executive Officer, Blackpearl Group

What we learned

The strongest runs filtered before they wrote.

Across 432 generalist-agent traces, the strongest runs retrieved broad candidate pools, then used structured filtering, website evidence, and row-level pruning before writing final results. The weakest runs stopped too early, overproduced weak rows, or made claims the evidence couldn't support.

The lesson for GTM agents: retrieval alone is not enough. The hard part is deciding what to keep, what to reject, and when the evidence is strong enough to act.
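The filter-before-write pattern seen in the strongest traces might look something like the sketch below. Everything here is hypothetical: the `Candidate` fields, the thresholds, and the `filter_then_write` function are illustrative assumptions, not taken from any evaluated agent or from the GTM-Bench harness.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    # All fields are hypothetical; real records would carry richer data.
    company: str
    employees: int
    evidence_urls: list[str] = field(default_factory=list)
    fit_score: float = 0.0  # assumed 0..1 score from some upstream matcher


def filter_then_write(pool, min_employees=50, min_fit=0.7, limit=25):
    """Apply a structured filter, an evidence check, then row-level pruning."""
    # 1. Structured filtering on firmographic fields.
    kept = [c for c in pool if c.employees >= min_employees]
    # 2. Drop rows whose claims have no supporting evidence.
    kept = [c for c in kept if c.evidence_urls]
    # 3. Row-level pruning: keep only confident matches, best first.
    kept = [c for c in kept if c.fit_score >= min_fit]
    kept.sort(key=lambda c: c.fit_score, reverse=True)
    return kept[:limit]
```

The design point is the ordering: the pool starts broad, and rows are rejected at each stage before anything is written, so the final list is short and defensible.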

"When you create a benchmark, the fair question is: how credible is it? Our answer is transparency. Every task, every line of evaluation code and every run artifact is public. Anyone can re-run the experiments and challenge the findings - we hope people do, because that's how benchmarks improve.”

Max Polaczuk · Vice President of AI, Blackpearl Group

API Partners

From the frontier.

"Evaluating models on realistic occupational tasks helps us understand not just how well they perform in the lab, but how they might support people in the work they do every day.”
— OpenAI, on GDPval (openai.com)

"Claude Opus 4.7 and Claude Sonnet 4.6, run through Claude Code, were among the most capable agents evaluated, marked by high-recall retrieval and strong evidence-gathering behaviour across the task set."

"The amount of alpha you can have right now creating good public AI benchmarks is wild, such a big opportunity.”
— Logan Kilpatrick, Google AI Studio
(via X, 5 Jun 2026)

"DeepSeek V4 Pro, an open-weight frontier model run through Hermes, was included to measure how the leading open models compare with closed frontier systems on real GTM work.”

"Moonshot AI's open-weight Kimi K2.6, run through Hermes, was benchmarked across the full 72-task set as part of GTM-Bench's coverage of open frontier models."

"The open-source agent harness from Nous Research, used to run the open-weight models in the evaluation under a consistent, reproducible agent framework.”

"Blackpearl's self-serve GTM platform and the empirical foundation of the benchmark: its 59,881 real prospecting queries were clustered into the taxonomy that the 72 tasks were built from.”

What's next

An early step toward better evaluation for GTM agents.

Future versions will expand task coverage, improve reproducibility, and incorporate richer buyer-affinity signals — going beyond likely fit to estimate whether a matched prospect is actually likely to buy.

Our goal is to help the GTM and AI communities understand where agents are useful today and how to make them better over time. That will take input from more than Blackpearl. If you want to contribute tasks, challenge the rubric, or run your own system against the benchmark, the code and leaderboard are at gtm-bench.ai, or reach the research team at research@blackpearl.com.

