Tau2-Bench Telecom
Multi-turn customer-service agent benchmark in a telecom domain: the model must take real tool actions while a simulated customer pushes back on incomplete or wrong answers.
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | 99.3% | CoT | Feb 19, 2026 | self reported | primary |
| 2 | GPT 5.4 | 98.9% | — | Mar 5, 2026 | self reported | primary |
| 3 | Claude Opus 4.5 | 98.2% | — | Oct 24, 2025 | self reported | primary |
| 4 | GPT 5.5 | 98.0% | CoT | Apr 23, 2026 | self reported | primary |
| 5 | Claude Sonnet 4.6 | 97.9% | — | Feb 17, 2026 | self reported | primary |
| 6 | GPT 5 (Thinking) | 96.7% | — | Aug 7, 2025 | self reported | primary |
| 7 | Gemini 3 Flash (Thinking) | 90.2% | — | Dec 17, 2025 | self reported | primary |
| 8 | Claude Haiku 4.5 | 83.0% | — | Oct 15, 2025 | self reported | primary |
| 9 | Gemini 2.5 Flash (Thinking) | 79.5% | — | Dec 17, 2025 | self reported | primary |
| 10 | Qwen 3.5 27B | 79.0% | — | Feb 24, 2026 | third party | primary, verified |
| 11 | Gemini 2.5 Pro (Thinking) | 77.8% | — | Dec 17, 2025 | self reported | primary |
| 12 | GPT 5 | 38.6% | — | Aug 7, 2025 | self reported | primary |
