τ²-Bench Telecom (Tau2-Bench Telecom)

Multi-turn customer-service agent benchmark in a telecom domain: the model must take real tool actions while a simulated customer pushes back on incomplete or wrong answers.

Agentic · Text · Accuracy · Max 100.0% · Released Jun 2025


Results

Models scored: 14
Top: 99.3% (Gemini 3.1 Pro)
Median: 92.5%
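The median above can be checked against the scores in the "All results" table. This is a minimal sketch, assuming the median is taken over the 14 canonical rows (one per model); the page does not state how the figure is computed or rounded.

```python
from statistics import median

# Scores (%) copied from the "All results" table, one canonical row per model.
scores = [
    99.3, 98.9, 98.2, 98.0, 97.9, 96.7, 94.7,
    90.2, 90.1, 83.0, 79.5, 79.0, 77.8, 38.6,
]

# With an even number of rows, the median is the mean of the two middle
# values: here 94.7 and 90.2.
mid = median(scores)
print(f"{len(scores)} models, median = {mid:.2f}%")
```

The midpoint of the two middle scores is 92.45%, which is consistent with the 92.5% shown on the page if the display rounds half up to one decimal place (an assumption).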

Best results

Top primary scores; one row per model.

Gemini 3.1 Pro: 99.3%
GPT 5.4: 98.9%
Claude Opus 4.5: 98.2%
GPT 5.5: 98.0%
Claude Sonnet 4.6: 97.9%
GPT 5 (Thinking): 96.7%
Trinity Large Thinking: 94.7%
Gemini 3 Flash (Thinking): 90.2%
Kimi K2.7 Code: 90.1%
Claude Haiku 4.5: 83.0%

Frontier over time

Each dot is one model result; the line traces the running best score.
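The running-best line can be reproduced from the rows of the "All results" table. The sketch below is illustrative only: it assumes the chart's x-axis is the eval date and that, when several results share a date, only the highest score of that day can advance the frontier. The function name `running_best` is hypothetical and not taken from the site.

```python
from datetime import date

# (model, score %, eval date) copied from the "All results" table.
results = [
    ("Gemini 3.1 Pro", 99.3, date(2026, 2, 19)),
    ("GPT 5.4", 98.9, date(2026, 3, 5)),
    ("Claude Opus 4.5", 98.2, date(2025, 10, 24)),
    ("GPT 5.5", 98.0, date(2026, 4, 23)),
    ("Claude Sonnet 4.6", 97.9, date(2026, 2, 17)),
    ("GPT 5 (Thinking)", 96.7, date(2025, 8, 7)),
    ("Trinity Large Thinking", 94.7, date(2026, 4, 1)),
    ("Gemini 3 Flash (Thinking)", 90.2, date(2025, 12, 17)),
    ("Kimi K2.7 Code", 90.1, date(2026, 6, 12)),
    ("Claude Haiku 4.5", 83.0, date(2025, 10, 15)),
    ("Gemini 2.5 Flash (Thinking)", 79.5, date(2025, 12, 17)),
    ("Qwen 3.5 27B", 79.0, date(2026, 2, 24)),
    ("Gemini 2.5 Pro (Thinking)", 77.8, date(2025, 12, 17)),
    ("GPT 5", 38.6, date(2025, 8, 7)),
]


def running_best(rows):
    """Return (eval date, model, score) for each result that sets a new best."""
    frontier = []
    best = float("-inf")
    # Sort by date; within a date, put the highest score first so that
    # lower same-day scores never count as a frontier step (an assumption).
    for model, score, when in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best:
            best = score
            frontier.append((when, model, score))
    return frontier


for when, model, score in running_best(results):
    print(when.isoformat(), model, f"{score}%")
```

Under these assumptions the frontier has three steps: GPT 5 (Thinking), then Claude Opus 4.5, then Gemini 3.1 Pro. Every other row is a dot that falls below the line.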

All results

Showing one canonical row per model.

#	Model	Score	Conditions	Eval date	Source	Flags
1	Gemini 3.1 Pro	99.3%	CoT	19 Feb 2026	Self-reported	Primary
2	GPT 5.4	98.9%	—	05 Mar 2026	Self-reported	Primary
3	Claude Opus 4.5	98.2%	—	24 Oct 2025	Self-reported	Primary
4	GPT 5.5	98.0%	CoT	23 Apr 2026	Self-reported	Primary
5	Claude Sonnet 4.6	97.9%	—	17 Feb 2026	Self-reported	Primary
6	GPT 5 (Thinking)	96.7%	—	07 Aug 2025	Self-reported	Primary
7	Trinity Large Thinking	94.7%	0-shot · standard	01 Apr 2026	Self-reported	Primary
8	Gemini 3 Flash (Thinking)	90.2%	—	17 Dec 2025	Self-reported	Primary
9	Kimi K2.7 Code	90.1%	0-shot · agentic	12 Jun 2026	Third-party	Primary Verified
10	Claude Haiku 4.5	83.0%	—	15 Oct 2025	Self-reported	Primary
11	Gemini 2.5 Flash (Thinking)	79.5%	—	17 Dec 2025	Self-reported	Primary
12	Qwen 3.5 27B	79.0%	—	24 Feb 2026	Third-party	Primary Verified
13	Gemini 2.5 Pro (Thinking)	77.8%	—	17 Dec 2025	Self-reported	Primary
14	GPT 5	38.6%	—	07 Aug 2025	Self-reported	Primary
