GPT for Excel on SpreadsheetBench: 370 correct, 19 debatable, 11 wrong
What is SpreadsheetBench?
SpreadsheetBench is a public benchmark for spreadsheet AI agents, introduced at NeurIPS 2024¹. GPT for Excel entered the V1 - Verified (400) track: 400 real-world spreadsheet tasks across the Lookup & Aggregation (161), Data Manipulation (142) and Computation & Logic (97) categories.
Every task is a real question pulled from a public Excel forum (ExcelForum, Chandoo, MrExcel, or ExcelGuru), paired with the original messy workbook: the kind of question a real Excel user actually posted.
Example tasks:
- Lookup & Aggregation
I need to return a value from a third column by looking up and matching data contained within two other specific columns. I have attempted to use VLOOKUP and MATCH functions without success. The attached example, Sample.xlsx, demonstrates what I'm trying to achieve.
- Data Manipulation
Extract rows from Sheet1 where column B equals 'TELIVISION' OR column C equals 'CLASS III' OR column C equals 'CLASS IV'. Put the results in Sheet2 starting at A3 with headers in A2
- Computation & Logic
I need to round the amount in the second column to the nearest $5. And I want the formula in C1 to round down to 365.00 when the difference to the nearest 5 is in the middle exactly.
Evaluation is single-pass: each task gets one attempt, with no re-runs, and the output must exactly match every cell of the golden file. An off-by-one row, a trailing space, or a rounded percentage is enough for the task to count as a failure.
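To make the strictness concrete, here is a minimal sketch of an exact cell-by-cell comparison. It is not the benchmark's actual evaluator: the grid representation (lists of rows) and the function names `mismatched_cells` and `passes` are assumptions made for illustration.

```python
from typing import List, Tuple

Grid = List[List[object]]  # rows of cell values (illustrative representation)


def mismatched_cells(output: Grid, golden: Grid) -> List[Tuple[int, int]]:
    """Return zero-based (row, col) positions where the two grids differ."""
    diffs = []
    for r in range(max(len(output), len(golden))):
        out_row = output[r] if r < len(output) else []
        gold_row = golden[r] if r < len(golden) else []
        for c in range(max(len(out_row), len(gold_row))):
            out_val = out_row[c] if c < len(out_row) else None
            gold_val = gold_row[c] if c < len(gold_row) else None
            if out_val != gold_val:  # strict equality: no trimming, no tolerance
                diffs.append((r, c))
    return diffs


def passes(output: Grid, golden: Grid) -> bool:
    """A task passes only if every cell matches."""
    return not mismatched_cells(output, golden)


golden = [["Name", "Rate"], ["Ann", "66.67%"]]
print(passes([["Name", "Rate"], ["Ann", "66.67%"]], golden))   # True
print(passes([["Name", "Rate"], ["Ann ", "66.67%"]], golden))  # False (trailing space)
print(passes([["Name", "Rate"], ["Ann", "67%"]], golden))      # False (rounded percentage)
```

Under this kind of check, a single differing cell fails the whole task, however reasonable the rest of the output is.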
Our score
GPT for Excel scored 92.50% - 370 of 400 tasks correct on the first attempt. The agent ran on Opus 4.7 at medium reasoning.
The 30 misses split into two very different buckets:
- 19 debatable - defensible answers that didn't match the golden file (more below)
- 11 real failures - genuine errors (in formulas, in scripts), each folded back into the agent (more below)
Performance is consistent across all three task families:
| Category | N | GPT for Excel |
|---|---|---|
| Lookup & Aggregation | 161 | 91% |
| Data Manipulation | 142 | 94% |
| Computation & Logic | 97 | 93% |
| Overall | 400 | 92.5% |
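As a quick sanity check, the per-category figures can be weighted by task count to see that they are consistent with the overall score. This is only a sketch: the category percentages in the table are rounded, so the weighted result is an approximation of the exact 370 of 400.

```python
# (task count, rounded accuracy in percent) per category, taken from the table
categories = {
    "Lookup & Aggregation": (161, 91),
    "Data Manipulation": (142, 94),
    "Computation & Logic": (97, 93),
}

total_tasks = sum(n for n, _ in categories.values())
weighted_sum = sum(n * pct for n, pct in categories.values())  # in task-percent units

estimated_correct = weighted_sum / 100         # approximate number of correct tasks
estimated_overall = weighted_sum / total_tasks  # approximate overall accuracy in percent

print(total_tasks)        # 400
print(estimated_correct)  # 370.2
print(estimated_overall)  # 92.55
```

The small gap between 92.55% and the reported 92.5% comes from rounding in the category rows.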
"For the robust, an error is information." — Nassim Taleb
We reviewed every miss line by line to understand what went wrong and how we could improve. Here is what we found.
19 debatable: defensible answers to ambiguous prompts
Not every failure is a wrong answer. 19 of our 30 misses came from prompt ambiguities or reasonable alternative interpretations. The agent made a sound decision that simply didn't match the expected output exactly.
A few patterns came up over and over:
i. The prompt and the example contradict each other
Case 49036 asks the agent:
"Additionally, how can I append the text 'WIN RATE' to the displayed percentage result without altering the number format which should show as an integer percentage (e.g., 67% WIN RATE) with .00 decimal"
The prompt contradicts itself: it asks for an integer percentage and for two decimals at the same time. Our agent followed the example (67%); the benchmark expected the .00 form (66.67%). Either reading is defensible, and the agent corrects it easily once you clarify your expectations.
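The two readings differ only in the number of decimals applied before the label is appended. A minimal sketch, assuming a hypothetical 2 wins out of 3 games (the actual data in the case workbook is not shown here) and a made-up helper name `win_rate_label`:

```python
def win_rate_label(wins: int, games: int, decimals: int) -> str:
    """Format a win rate as a percentage with the given decimals, plus a label."""
    rate = wins / games
    return f"{rate:.{decimals}%} WIN RATE"


# Hypothetical input: 2 wins out of 3 games.
print(win_rate_label(2, 3, decimals=0))  # 67% WIN RATE
print(win_rate_label(2, 3, decimals=2))  # 66.67% WIN RATE
```

Both strings answer the prompt as written, but only one of them can match the golden file.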
ii. The prompt under-specifies a key choice
Case 374-31 asks the agent:
Delete the blank rows that appear above a cell with the text 'Code' in column A
What counts as a blank row? Rows that are entirely empty? Rows where only column A is blank? Both readings are common; the prompt picks neither. Our agent went with the strict reading (entirely empty) and described the alternative in its answer.
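The two readings can be sketched as two predicates passed to the same deletion routine. This is an illustration, not the agent's code: the row representation, the helper names, and the choice to delete only the contiguous run of blank rows directly above each 'Code' row are all assumptions.

```python
from typing import Callable, List, Optional

Row = List[Optional[str]]


def is_empty(value: Optional[str]) -> bool:
    return value is None or str(value).strip() == ""


def entirely_empty(row: Row) -> bool:
    """Strict reading: every cell in the row is empty."""
    return all(is_empty(v) for v in row)


def column_a_empty(row: Row) -> bool:
    """Loose reading: only column A needs to be empty."""
    return not row or is_empty(row[0])


def drop_blank_rows_above_code(rows: List[Row], is_blank: Callable[[Row], bool]) -> List[Row]:
    """Remove the run of blank rows sitting directly above each 'Code' row."""
    to_delete = set()
    for i, row in enumerate(rows):
        if row and row[0] == "Code":
            j = i - 1
            while j >= 0 and is_blank(rows[j]):
                to_delete.add(j)
                j -= 1
    return [row for i, row in enumerate(rows) if i not in to_delete]


sheet = [
    ["Header", "x"],
    [None, "note"],   # column A blank, but the row is not empty
    [None, None],     # entirely empty
    ["Code", "A1"],
]
print(len(drop_blank_rows_above_code(sheet, entirely_empty)))  # 3
print(len(drop_blank_rows_above_code(sheet, column_a_empty)))  # 2
```

The same prompt yields two different sheets depending on the predicate, which is exactly the kind of divergence an exact-match evaluation penalizes.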
We could fill another page with patterns like these (e.g. unclear inclusive-exclusive bounds). In total, we proposed 21 corrections to the SpreadsheetBench team. The breakdown:
- 6 wrongly-named files
- 6 real evaluation errors
- 9 debatable prompt readings
They agreed with us on four cases and adjusted the score accordingly.
Learning from 11 real misses, 7 of them unlikely to recur
Most of the remaining failures were one-off slip-ups: a missing $ anchor in a formula, a destructive edit in a script, and so on. They rarely reproduce.
Our re-runs indicate that most of these come from run-to-run variance rather than a capability gap: 7 of the 11 failures succeed on at least 50% of re-runs (2 succeed every time). Other models available in the product (such as Opus 4.6 or GPT-5.5) score even higher on these specific cases.

| Success rate on 10 runs | Opus 4.7 Medium | Opus 4.6 High | GPT-5.5 |
|---|---|---|---|
| >= 50% | 7 | 9 | 8 |
| > 0% and < 50% | 1 | 2 | 2 |
| 0% | 3 | 0 | 1 |
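The bucketing in this table can be sketched as follows. The per-case success counts below are hypothetical placeholders, chosen only so that they reproduce the Opus 4.7 Medium column (7 / 1 / 3, with 2 cases succeeding every time); the real per-case numbers are not published here.

```python
from collections import Counter


def bucket(successes: int, runs: int = 10) -> str:
    """Assign a failed case to a bucket based on its re-run success rate."""
    rate = successes / runs
    if rate >= 0.5:
        return ">= 50%"
    if rate > 0:
        return "> 0% and < 50%"
    return "0%"


# Hypothetical successes out of 10 re-runs for each of the 11 failed cases.
hypothetical_successes = [10, 10, 8, 7, 6, 5, 5, 2, 0, 0, 0]

counts = Counter(bucket(s) for s in hypothetical_successes)
print(counts[">= 50%"], counts["> 0% and < 50%"], counts["0%"])  # 7 1 3
```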
This still puts us at #2 on the public leaderboard, behind Tetra-Beta-2 (DealGlass) at 94.25% and ahead of every other entry.
Reviewing the misses surfaced two areas to improve in our product:
- Reinforcing agent rules (e.g., checking the target range is empty or safe to overwrite before filling formulas, tightening Office.js handling for things like table conversion and deletion)
- Improving how the agent sees the spreadsheet: in 2 of the 3 cases that consistently failed, the relevant information was already in the sheet but the agent missed it.
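The first rule, checking that a target range is empty before filling it, can be sketched in a few lines. The product works through Office.js; the Python below is only a language-neutral illustration on a dictionary-based sheet, and the names `occupied_cells` and `fill_if_empty` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, int]  # (row, column), zero-based


def occupied_cells(sheet: Dict[Cell, object], top_left: Cell, n_rows: int, n_cols: int) -> List[Cell]:
    """List the non-empty cells inside the target rectangle."""
    r0, c0 = top_left
    return [
        (r, c)
        for r in range(r0, r0 + n_rows)
        for c in range(c0, c0 + n_cols)
        if sheet.get((r, c)) not in (None, "")
    ]


def fill_if_empty(sheet: Dict[Cell, object], top_left: Cell, values: Sequence[Sequence[object]]) -> List[Cell]:
    """Write values only if the target range is empty; otherwise return the blockers."""
    n_rows = len(values)
    n_cols = max((len(row) for row in values), default=0)
    blockers = occupied_cells(sheet, top_left, n_rows, n_cols)
    if blockers:
        return blockers  # nothing is written; the caller decides what to do next
    r0, c0 = top_left
    for dr, row in enumerate(values):
        for dc, value in enumerate(row):
            sheet[(r0 + dr, c0 + dc)] = value
    return []


sheet = {(0, 0): "Amount", (1, 0): 362.5, (2, 1): "keep me"}
formulas = [["=A2*2"], ["=A3*2"]]
print(fill_if_empty(sheet, (1, 1), formulas))  # [(2, 1)]  -> blocked, sheet untouched
print(fill_if_empty(sheet, (1, 2), formulas))  # []        -> written to column C
```

Returning the blocking cells instead of overwriting them gives the agent a chance to pick another range or ask the user first.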
Production systems optimize for more than one-shot accuracy
Real-world performance is more than accuracy. Several dimensions sit outside this benchmark's scope:
- Speed: spreadsheet work is highly interactive, so latency directly affects the user experience. A fast agent that misses occasionally - but is quick to re-prompt - often beats a slower one with marginally better accuracy. What matters is the user's total time-to-result.
- Raw AI cost: maxing out accuracy on a benchmark usually means running the most expensive model on every task. The per-task cost then prices out most users - winning the benchmark, losing the customer.
GPT for Excel is ~5× faster than Tetra
Tetra reports an average of 112 seconds per task on this benchmark (median 94s), per their blog post.
Our agent averages 24 seconds, with a median of 20. 90% of our runs finish in under 45 seconds. That gap matters: at 20 seconds, the agent is part of your workflow. At two minutes, it's a background job you forget to check back on later.
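A rough way to combine speed and accuracy into a time-to-result figure is sketched below. It rests on simplifying assumptions that are not part of the benchmark: each retry is treated as independent and as likely to succeed as a first attempt, and the 15-second re-prompt overhead is a made-up value. The average times and scores are the ones quoted in this post.

```python
def expected_time_to_result(seconds_per_attempt: float, success_rate: float,
                            reprompt_seconds: float = 0.0) -> float:
    """Expected seconds until a correct answer, assuming independent retries."""
    expected_attempts = 1.0 / success_rate       # mean of a geometric distribution
    expected_failures = expected_attempts - 1.0  # each failure costs one re-prompt
    return seconds_per_attempt * expected_attempts + reprompt_seconds * expected_failures


# Average seconds per task and single-pass scores quoted above;
# reprompt_seconds=15 is a hypothetical overhead for noticing a miss and re-asking.
ours = expected_time_to_result(24, 0.925, reprompt_seconds=15)
tetra = expected_time_to_result(112, 0.9425, reprompt_seconds=15)

print(round(112 / 24, 2))  # 4.67, the raw speed ratio behind "~5x"
print(f"{ours:.0f}s vs {tetra:.0f}s expected time-to-result")
```

Under these assumptions, the slightly higher accuracy of the slower agent does not come close to offsetting its per-attempt latency.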
Reasonable cost
On this benchmark, GPT for Excel's average task cost was ~$0.14 (at provider API pricing), with 90% of runs under $0.24 and a worst case of $1.40. Opus 4.7 is one of the pricier models, and one of the strengths of GPT for Excel is that it offers a range of models for different budgets. GPT-5.4, for instance, strikes a strong cost-quality balance, well suited to users who prefer to iterate a little in exchange for a lower bill.
We couldn't find any information about run cost for other listed tools.
Towards a benchmark representative of real world usage
Full disclosure: we love benchmarks. They make for a remarkably fair comparison. A benchmark proves capability on a snapshot: what an agent does on a fixed set of tasks, in one attempt, on one day. SpreadsheetBench in particular is a rigorous, public, and reproducible benchmark, and we take that 92.5% seriously.
But reality is wider than a snapshot.
GPT for Excel has thousands of daily users, and many of their everyday tasks sit outside what SpreadsheetBench measures:
- Visual spreadsheet operations (charts, pivot tables, conditional formatting)
- Bulk operations at scale (cleaning, categorization, deduplication, scoring, enrichment, content generation, translation - across thousands of rows)
Accuracy is only part of the picture, too. We want to measure speed and per-task cost alongside quality - and to capture the choices that actually shape the product experience: which model, which reasoning level, when to fall back to a cheaper option.
That's why we're building our own benchmark: broader in task scope, and multi-model / multi-reasoning by design. The full methodology and results will follow soon. In the meantime, we can share a preview.
Stay tuned.
Footnotes
1. Ma, Z., Zhang, B., Zhang, J., Yu, J., Zhang, X., Zhang, X., Luo, S., Wang, X., & Tang, J. (2024). SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Datasets and Benchmarks Track. Renmin University of China, Tsinghua University, Zhipu.AI. https://arxiv.org/abs/2406.14991


