Introducing GTM‑BENCH

We're introducing GTM-Bench, an open benchmark for go-to-market agents. GTM-Bench was built to evaluate whether AI systems can complete real prospecting work end to end: infer what a seller offers, define an actionable ideal customer profile, retrieve matching accounts and contacts, and support every record with evidence.
Each task consists of a natural-language instruction, a controlled data environment, and a requirement that the agent produce three reviewable artifacts: an offer summary, an ICP, and a ranked prospect list. This structure mirrors how prospecting work is actually briefed, performed, and reviewed on GTM teams.
The first version includes 72 tasks across 11 task types and 15 market categories, designed from a taxonomy of 59,881 real prospecting queries submitted to Bebop.ai. We are releasing the task catalog, agent harness, evaluation code, benchmark calculation code, and leaderboard to give model providers, agent builders, researchers, and GTM teams a shared way to measure progress.
You can find the open-source release at gtm-bench.ai.
Most AI benchmarks don't capture the shape of real GTM work.
At Blackpearl, we have spent the past several years building AI systems for go-to-market work. That has forced us to think hard about how to measure whether these systems are actually any good at it.
To date, there has not been a benchmark that isolates the GTM lead-generation problem. Existing evaluations such as CRMArena, CRMArena-Pro, and Microsoft's Sales Research Bench test business-system reasoning: customer service, information retrieval, and grounded reporting over sales data. In coding, benchmarks like SWE-bench and Terminal-Bench became leading indicators of agent capability. GTM has had no equivalent, which means buyers compare systems on demos rather than outcomes.
The gap matters beyond our own industry. Poor matching wastes sales budgets and fills inboxes with irrelevant outreach, a problem that grows now that AI has made outbound generation nearly free. A benchmark that rewarded agents for returning more rows would encourage exactly the wrong behaviour.
"A benchmark that rewards agents for returning more rows would encourage the wrong behaviour. GTM-Bench is designed to do the opposite."
Each task mirrors real GTM work.
Each task gives an agent a natural-language GTM instruction, a controlled data environment, and a standard operating environment. To complete it, the agent must produce three artifacts:
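The offer: a summary of what the seller offers.
The buyer: an actionable ideal customer profile (ICP).
The prospects: a ranked list of matching accounts and contacts, each supported by evidence.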
A useful system cannot jump straight to a lead list. It has to understand the seller, reason about the buyer, retrieve candidates, and decide which records are worth acting on.
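To make the shape of a submission concrete, here is a minimal Python sketch of the three artifacts as one record. The class names, field names, and example data are hypothetical and chosen for illustration only; they are not the schema used by the GTM-Bench harness.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Prospect:
    """One row in the ranked prospect list (field names are hypothetical)."""
    company: str
    contact: Optional[str] = None
    evidence: List[str] = field(default_factory=list)  # sources backing the row


@dataclass
class Submission:
    """The three reviewable artifacts an agent hands back for one task."""
    offer_summary: str  # what the seller offers
    icp: str            # the ideal customer profile
    prospects: List[Prospect] = field(default_factory=list)  # ranked, best first

    def unsupported_rows(self) -> List[Prospect]:
        """Return prospects that cite no evidence at all."""
        return [p for p in self.prospects if not p.evidence]


# Invented example data, for illustration only.
submission = Submission(
    offer_summary="Payroll software for small clinics",
    icp="Independent clinics with 10-50 staff",
    prospects=[
        Prospect("Example Clinic A", "Office manager", ["https://example.com/about"]),
        Prospect("Example Clinic B"),
    ],
)

print([p.company for p in submission.unsupported_rows()])  # ['Example Clinic B']
```

Keeping evidence on each row, rather than in a separate report, is one way to make every record individually reviewable.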
Useful matches, not more rows.
Every returned record is judged on match quality (does this company or contact genuinely fit?) and audit quality (is it the right real-world entity, with supported claims?). Records grade into three bands:
A-grade · +1
B-grade · 0
Sub-B · −1
An agent cannot win by flooding the evaluator with weak leads. This reflects real GTM work: a bad lead is not just "less good" — it wastes budget, damages trust, and creates spam.
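The effect of the three bands can be seen in a small Python sketch. It assumes a run's score is the plain sum of per-record points; that aggregation is an assumption made here for illustration, and the released benchmark calculation code may combine grades differently. The two runs are invented.

```python
# Points per band, taken from the grading bands above.
GRADE_POINTS = {"A": 1, "B": 0, "sub-B": -1}


def net_utility(grades):
    """Sum per-record points (assumed aggregation, not confirmed by the source)."""
    return sum(GRADE_POINTS[g] for g in grades)


# Hypothetical runs, invented for illustration only.
selective = ["A"] * 8 + ["B"] * 2
flooding = ["A"] * 10 + ["B"] * 15 + ["sub-B"] * 25

print(net_utility(selective))  # 8
print(net_utility(flooding))   # -15
```

The flooding run contains more A-grade records than the selective one, yet it ends below zero because each Sub-B row cancels an A-grade row.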
Six frontier generalist systems, one purpose-built GTM system.
The results show why volume-weighted scoring matters. Some systems returned many plausible-looking rows, but enough were weak, unsupported, or incorrectly matched that their total utility became negative. RTSA and GPT-5.5 had similar A-grade rates — but RTSA produced far more useful volume, which is what matters in production GTM.
"[Quote placeholder — Nick Lissette, CEO. 1–2 sentences on why measurable outcomes matter for GTM AI. To be drafted and approved.]"
The strongest runs filtered before they wrote.
Across 432 generalist-agent traces, the strongest runs retrieved broad candidate pools, then used structured filtering, website evidence, and row-level pruning before writing final results. The weakest runs stopped too early, overproduced weak rows, or made claims the evidence couldn't support.
The lesson for GTM agents: retrieval alone is not enough. The hard part is deciding what to keep, what to reject, and when the evidence is strong enough to act.
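As a rough illustration of filtering before writing, the Python sketch below prunes a broad candidate pool down to rows that clear both a fit threshold and an evidence threshold. The `Candidate` fields, the threshold values, and the example pool are all hypothetical; the traces describe the behaviour, not this particular implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    fit_score: float     # hypothetical 0-1 estimate of fit against the ICP
    evidence_count: int  # hypothetical count of supporting sources found


def prune(candidates: List[Candidate], min_fit: float = 0.7,
          min_evidence: int = 1) -> List[Candidate]:
    """Keep only rows that pass both checks, ranked by fit (best first)."""
    kept = [
        c for c in candidates
        if c.fit_score >= min_fit and c.evidence_count >= min_evidence
    ]
    return sorted(kept, key=lambda c: c.fit_score, reverse=True)


# Invented candidate pool, for illustration only.
pool = [
    Candidate("Acme", 0.9, 2),
    Candidate("Globex", 0.8, 0),     # good fit, but nothing to support it
    Candidate("Initech", 0.4, 3),    # well evidenced, but a poor fit
    Candidate("Umbrella", 0.75, 1),
]

print([c.name for c in prune(pool)])  # ['Acme', 'Umbrella']
```

Rejecting a plausible but unsupported row, as with the second candidate here, trades volume for rows that hold up under review.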
"[Quote placeholder — Sam Daish, CTO, or Max Polaczuk, lead author. 1–2 sentences on the benchmark methodology / open-source release. To be drafted and approved.]"
From the frontier.
[Placeholder module — quotes to be sourced and approved per partner. Carousel built by design; cards below are layout mocks with lorem ipsum.]
An early step toward better evaluation for GTM agents.
Future versions will expand task coverage, improve reproducibility, and incorporate richer buyer-affinity signals — going beyond likely fit to estimate whether a matched prospect is actually likely to buy.
Our goal is to help the GTM and AI communities understand where agents are useful today and how to make them better over time. That will take input from more than Blackpearl. If you want to contribute tasks, challenge the rubric, or run your own system against the benchmark, the code and leaderboard are at gtm-bench.ai, or reach the research team at research@blackpearl.com.
