τ²-Bench Telecom

Multi-turn customer-service agent benchmark in a telecom domain: the model must take real tool actions while a simulated customer pushes back on incomplete or wrong answers.

Agentic · Text · accuracy · Max 100.0% · Released Jun 2025
Results: 12
Models scored: 12
Top: Gemini 3.1 Pro (99.3%)
Median: 93.5%
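
The summary figures can be rechecked from the scores in the All results table. The following is a minimal sketch, not the site's own code: the `scores` dictionary is hand-copied from that table, and the names `scores`, `top_model`, and `median_score` are illustrative. With 12 scores the median is the mean of the 6th and 7th ranked values (96.7 and 90.2), which is 93.45 and matches the displayed 93.5% after rounding to one decimal place.

```python
from statistics import median

# Scores (%) copied by hand from the "All results" table, one per model.
scores = {
    "Gemini 3.1 Pro": 99.3,
    "GPT 5.4": 98.9,
    "Claude Opus 4.5": 98.2,
    "GPT 5.5": 98.0,
    "Claude Sonnet 4.6": 97.9,
    "GPT 5 (Thinking)": 96.7,
    "Gemini 3 Flash (Thinking)": 90.2,
    "Claude Haiku 4.5": 83.0,
    "Gemini 2.5 Flash (Thinking)": 79.5,
    "Qwen 3.5 27B": 79.0,
    "Gemini 2.5 Pro (Thinking)": 77.8,
    "GPT 5": 38.6,
}

# Model with the highest score.
top_model = max(scores, key=scores.get)

# For an even number of values, statistics.median averages the two middle ones.
median_score = median(scores.values())

print(f"Models scored: {len(scores)}")
print(f"Top: {top_model} ({scores[top_model]}%)")
print(f"Median: {median_score:.2f}%")
```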

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
[Figure: Best score over time. Vertical axis 0.0 to 100.0; horizontal axis Aug 2025 to Apr 2026.]
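
The running-best line can be approximated from the evaluation dates and scores in the All results table. This is a sketch under stated assumptions: the `results` list is hand-copied from that table, `running_best` is a hypothetical helper, and the tie-break for results sharing a date (highest score first) is a guess, since the page does not say how it orders same-day results.

```python
from datetime import date

# (eval date, model, score %) copied by hand from the "All results" table.
results = [
    (date(2026, 2, 19), "Gemini 3.1 Pro", 99.3),
    (date(2026, 3, 5), "GPT 5.4", 98.9),
    (date(2025, 10, 24), "Claude Opus 4.5", 98.2),
    (date(2026, 4, 23), "GPT 5.5", 98.0),
    (date(2026, 2, 17), "Claude Sonnet 4.6", 97.9),
    (date(2025, 8, 7), "GPT 5 (Thinking)", 96.7),
    (date(2025, 12, 17), "Gemini 3 Flash (Thinking)", 90.2),
    (date(2025, 10, 15), "Claude Haiku 4.5", 83.0),
    (date(2025, 12, 17), "Gemini 2.5 Flash (Thinking)", 79.5),
    (date(2026, 2, 24), "Qwen 3.5 27B", 79.0),
    (date(2025, 12, 17), "Gemini 2.5 Pro (Thinking)", 77.8),
    (date(2025, 8, 7), "GPT 5", 38.6),
]


def running_best(rows):
    """Return the results that strictly raised the best score so far.

    Rows are visited in date order; within one date the highest score is
    visited first (an assumed tie-break, not taken from the page).
    """
    frontier = []
    best = float("-inf")
    for day, model, score in sorted(rows, key=lambda r: (r[0], -r[2])):
        if score > best:
            best = score
            frontier.append((day, model, score))
    return frontier


for day, model, score in running_best(results):
    print(f"{day.isoformat()}  {score:5.1f}%  {model}")
```

Under these assumptions only three results move the frontier: GPT 5 (Thinking) in Aug 2025, Claude Opus 4.5 in Oct 2025, and Gemini 3.1 Pro in Feb 2026.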

All results

Showing one canonical row per model.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|-------|-------|------------|-----------|--------|-------|
| 1 | Gemini 3.1 Pro | 99.3% | CoT | Feb 19, 2026 | self reported | primary |
| 2 | GPT 5.4 | 98.9% | | Mar 5, 2026 | self reported | primary |
| 3 | Claude Opus 4.5 | 98.2% | | Oct 24, 2025 | self reported | primary |
| 4 | GPT 5.5 | 98.0% | CoT | Apr 23, 2026 | self reported | primary |
| 5 | Claude Sonnet 4.6 | 97.9% | | Feb 17, 2026 | self reported | primary |
| 6 | GPT 5 (Thinking) | 96.7% | | Aug 7, 2025 | self reported | primary |
| 7 | Gemini 3 Flash (Thinking) | 90.2% | | Dec 17, 2025 | self reported | primary |
| 8 | Claude Haiku 4.5 | 83.0% | | Oct 15, 2025 | self reported | primary |
| 9 | Gemini 2.5 Flash (Thinking) | 79.5% | | Dec 17, 2025 | self reported | primary |
| 10 | Qwen 3.5 27B | 79.0% | | Feb 24, 2026 | third party | primary, verified |
| 11 | Gemini 2.5 Pro (Thinking) | 77.8% | | Dec 17, 2025 | self reported | primary |
| 12 | GPT 5 | 38.6% | | Aug 7, 2025 | self reported | primary |