We participated in SpreadsheetBench and scored 92.5%

GPT for Excel on SpreadsheetBench: 370 correct, 19 debatable, 11 wrong

What is SpreadsheetBench?

SpreadsheetBench is a public benchmark for spreadsheet AI agents, introduced at NeurIPS 2024¹. GPT for Excel entered the V1 - Verified (400) track: 400 real-world spreadsheet tasks across the Lookup & Aggregation (161), Data Manipulation (142) and Computation & Logic (97) categories.

Every task is a real question pulled from a public Excel forum (ExcelForum, Chandoo, MrExcel, or ExcelGuru) and attached to the original messy workbook: the kind of question a real Excel user actually posted.

Example tasks:

Lookup & Aggregation

I need to return a value from a third column by looking up and matching data contained within two other specific columns. I have attempted to use VLOOKUP and MATCH functions without success. The attached example, Sample.xlsx, demonstrates what I'm trying to achieve.

Data Manipulation

Extract rows from Sheet1 where column B equals 'TELIVISION' OR column C equals 'CLASS III' OR column C equals 'CLASS IV'. Put the results in Sheet2 starting at A3 with headers in A2

Computation & Logic

I need to round the amount in the second column to the nearest $5. And I want the formula in C1 to round down to 365.00 when the difference to the nearest 5 is in the middle exactly.

Evaluation is single-pass: each task gets one attempt with no re-runs, and the output must exactly match every cell of the golden file. An off-by-one row, a trailing space, or a rounded percentage is enough for the task to count as a failure.
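To make the exact-match rule concrete, here is a minimal sketch of a cell-by-cell comparison. It is not the benchmark's actual evaluator: the `first_mismatch` helper and the grids are hypothetical, and workbooks are simplified to lists of rows.

```python
from typing import List, Optional, Tuple

Grid = List[List[Optional[object]]]


def first_mismatch(output: Grid, golden: Grid) -> Optional[Tuple[int, int]]:
    """Return (row, col) of the first differing cell, or None if the grids match."""
    for r in range(max(len(output), len(golden))):
        out_row = output[r] if r < len(output) else []
        gold_row = golden[r] if r < len(golden) else []
        for c in range(max(len(out_row), len(gold_row))):
            # Cells missing on one side are treated as empty (None).
            out_val = out_row[c] if c < len(out_row) else None
            gold_val = gold_row[c] if c < len(gold_row) else None
            if out_val != gold_val:
                return (r, c)
    return None


# Hypothetical golden file and three candidate outputs.
golden = [["Team", "Win rate"], ["A", "66.67%"]]
exact = [["Team", "Win rate"], ["A", "66.67%"]]
trailing_space = [["Team", "Win rate"], ["A ", "66.67%"]]
rounded = [["Team", "Win rate"], ["A", "67%"]]
```

Under this sketch only `exact` passes. The trailing space and the rounded percentage each produce one differing cell, which is enough to fail the whole task.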

Our score

GPT for Excel scored 92.50% - 370 of 400 tasks correct on the first attempt. The agent ran on Opus 4.7 at medium reasoning.

The 30 misses split into two very different buckets:

19 debatable - defensible answers that didn't match the golden file (more below)
11 real failures - genuine errors (in formulas, in scripts), each folded back into the agent (more below)

Performance is consistent across all three task families:

Category	N	GPT for Excel
Lookup & Aggregation	161	91%
Data Manipulation	142	94%
Computation & Logic	97	93%
Overall	400	92.5%
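As a quick sanity check, the headline figures can be recomputed from the numbers above. The per-category percentages in the table are rounded, so the weighted average is only expected to land close to the overall score, not exactly on it.

```python
correct, total = 370, 400
overall = 100 * correct / total  # headline score in percent

# The 30 misses split into debatable answers and real failures.
debatable, real_failures = 19, 11
misses = total - correct

# Category sizes and rounded per-category scores, as listed in the table.
categories = {
    "Lookup & Aggregation": (161, 91),
    "Data Manipulation": (142, 94),
    "Computation & Logic": (97, 93),
}
task_count = sum(n for n, _ in categories.values())
weighted = sum(n * pct for n, pct in categories.values()) / task_count

print(f"overall={overall}%, weighted category average={weighted}%")
```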

"For the robust, an error is information." — Nassim Taleb

We reviewed every miss line by line to understand what went wrong and how we could improve. Let's dig into the misses.

19 debatable: defensible answers to ambiguous prompts

Not every failure is a wrong answer. 19 of our 30 misses came from prompt ambiguities or reasonable alternative interpretations. The agent made a sound decision that simply didn't match the expected output exactly.

A few patterns came up over and over:

i. The prompt and the example contradict each other

Case 49036 asks the agent:

"Additionally, how can I append the text 'WIN RATE' to the displayed percentage result without altering the number format which should show as an integer percentage (e.g., 67% WIN RATE) with .00 decimal"

The prompt contradicts itself: it asks for an integer percentage and for two decimals at once. Our agent followed the example (67%); the benchmark expected the .00 form (66.67%). Either reading is defensible, and the agent easily corrects its answer once you clarify your expectations.
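The two readings differ only in the number format applied to the same value. Here is a small sketch, assuming a hypothetical record of 2 wins in 3 games (the actual workbook data is not reproduced here):

```python
wins, games = 2, 3  # hypothetical record, for illustration only
rate = wins / games

# Reading 1: integer percentage, as in the prompt's own example.
as_integer = f"{rate:.0%} WIN RATE"

# Reading 2: two decimals, as in the ".00" part of the prompt.
as_two_decimals = f"{rate:.2%} WIN RATE"

print(as_integer)       # 67% WIN RATE
print(as_two_decimals)  # 66.67% WIN RATE
```

Under exact-match grading the two strings are different cells, so only one reading can score.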

ii. The prompt under-specifies a key choice

Case 374-31 asks the agent:

Delete the blank rows that appear above a cell with the text 'Code' in column A

What counts as a blank row? Rows that are entirely empty? Rows where only column A is blank? Both readings are common; the prompt picks neither. Our agent went with the strict reading (entirely empty) and described the alternative in its answer.

Example of a debatable case: ambiguous definition of blank rows
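The two readings of "blank row" can be sketched as two predicates plugged into the same deletion routine. Everything here is hypothetical: the helper names, the sample sheet, and the assumption that "above" means the run of blank rows directly above each 'Code' row.

```python
from typing import Callable, List, Optional

Row = List[Optional[str]]


def entirely_empty(row: Row) -> bool:
    """Strict reading: every cell in the row is empty."""
    return all(cell in (None, "") for cell in row)


def column_a_empty(row: Row) -> bool:
    """Loose reading: only column A needs to be empty."""
    return not row or row[0] in (None, "")


def delete_blank_rows_above_code(
    rows: List[Row], is_blank: Callable[[Row], bool]
) -> List[Row]:
    """Drop each run of blank rows sitting directly above a row with 'Code' in column A."""
    kept: List[Row] = []
    for row in rows:
        if row and row[0] == "Code":
            while kept and is_blank(kept[-1]):
                kept.pop()
        kept.append(row)
    return kept


sheet = [
    ["Item", "x"],
    [None, "note"],  # column A is blank, but the row has content
    [None, None],    # entirely empty
    ["Code", "y"],
]
strict = delete_blank_rows_above_code(sheet, entirely_empty)
loose = delete_blank_rows_above_code(sheet, column_a_empty)
```

On this sample, the strict reading keeps the "note" row and the loose reading deletes it, so the two outputs cannot both match a single golden file.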

We could fill another page with patterns like these (e.g. unclear inclusive-exclusive bounds). In total, we proposed 21 corrections to the SpreadsheetBench team. The breakdown:

6 wrongly-named files
6 real evaluation errors
9 debatable on prompt reading

They agreed with us on four cases and adjusted the score accordingly.

Learning from 11 real misses, but 7 unlikely to recur

Most of the remaining failures were one-off slip-ups: a missing $ anchor in a formula, a destructive edit in a script, and so on. They rarely reproduce.

Our re-runs confirm that this is variance rather than a capability gap: 7 of the 11 failures flip to success on at least 50% of re-runs (2 even flip every time). Other models available in the product (such as Opus 4.6 or GPT-5.5) score even higher on these specific cases.

Success rate on 10 runs	Opus 4.7 Medium	Opus 4.6 High	GPT-5.5
>= 50%	7	9	8
> 0% and < 50%	1	2	2
0%	3	0	1
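The three rows of the table are buckets over each case's success rate across 10 re-runs. A minimal sketch of that bucketing follows, with a check that every column covers all 11 failed cases. The `bucket` helper is hypothetical; the counts come from the table.

```python
def bucket(successes: int, runs: int = 10) -> str:
    """Map a case's re-run successes to the table's row label."""
    rate = successes / runs
    if rate >= 0.5:
        return ">= 50%"
    if rate > 0:
        return "> 0% and < 50%"
    return "0%"


# Counts per model, in row order: >= 50%, > 0% and < 50%, 0%.
table = {
    "Opus 4.7 Medium": (7, 1, 3),
    "Opus 4.6 High": (9, 2, 0),
    "GPT-5.5": (8, 2, 1),
}
column_totals = {model: sum(counts) for model, counts in table.items()}
```

A case that succeeds on exactly 5 of 10 re-runs falls in the top row, which is why the boundary is written as `>=` rather than `>`.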

All of which still puts us at #2 on the public leaderboard, behind Tetra-Beta-2 (DealGlass) at 94.25% and ahead of every other entry.

GPT for Excel cost per task on SpreadsheetBench

Reviewing the misses surfaced two areas to improve in our product:

Reinforcing agent rules (e.g., checking the target range is empty or safe to overwrite before filling formulas, tightening Office.js handling for things like table conversion and deletion)
Improving how the agent sees the spreadsheet: in 2 of the 3 cases that consistently failed, the relevant information was already in the sheet but the agent missed it.
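The first rule, checking that the target range is empty before filling formulas, can be sketched as a guard around the write. This is an illustration on a plain in-memory grid, not the product's Office.js code; `occupied_cells` and `fill_if_safe` are hypothetical names.

```python
from typing import List, Optional, Tuple

Grid = List[List[Optional[object]]]


def occupied_cells(
    grid: Grid, top: int, left: int, height: int, width: int
) -> List[Tuple[int, int]]:
    """Return coordinates inside the target range that already hold a value."""
    hits = []
    for r in range(top, top + height):
        for c in range(left, left + width):
            if r < len(grid) and c < len(grid[r]) and grid[r][c] not in (None, ""):
                hits.append((r, c))
    return hits


def fill_if_safe(grid: Grid, top: int, left: int, values: Grid) -> bool:
    """Write `values` only if the target range is empty; report whether it was written."""
    height = len(values)
    width = max((len(row) for row in values), default=0)
    if occupied_cells(grid, top, left, height, width):
        return False  # refuse to overwrite existing data
    while len(grid) < top + height:
        grid.append([])
    for dr, row in enumerate(values):
        target = grid[top + dr]
        while len(target) < left + width:
            target.append(None)
        for dc, value in enumerate(row):
            target[left + dc] = value
    return True


grid = [["a", None], ["b", None], ["c", "keep"]]
first = fill_if_safe(grid, 0, 1, [["=A1"], ["=A2"]])  # B1:B2 is empty
second = fill_if_safe(grid, 1, 1, [["x"], ["y"]])     # B2:B3 is not
```

Returning a flag instead of silently overwriting leaves the decision to the caller, which could ask the user or pick another range.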

Production systems optimize for more than one-shot accuracy

Real-world performance is more than accuracy. Several dimensions sit outside this benchmark's scope:

Speed: spreadsheet work is highly interactive, so latency directly affects the user experience. A fast agent that misses occasionally - but is quick to re-prompt - often beats a slower one with marginally better accuracy. What matters is the user's total time-to-result.
Raw AI cost: maxing out accuracy on a benchmark usually means running the most expensive model on every task. The per-task cost then prices out most users - winning the benchmark, losing the customer.

GPT for Excel is ~5× faster than Tetra

Tetra reports an average of 112 seconds per task on this benchmark (median 94s), per their blog post.

Our agent averages 24 seconds, with a median of 20. 90% of our runs finish in under 45 seconds. That gap matters: at 20 seconds, the agent is part of your workflow. At two minutes, it's a background job you forget to check back on later.
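The "~5×" in the heading follows from the reported timings, whether the comparison uses means or medians:

```python
# Seconds per task, as reported above.
tetra_mean_s, tetra_median_s = 112, 94
ours_mean_s, ours_median_s = 24, 20

mean_ratio = tetra_mean_s / ours_mean_s        # about 4.67
median_ratio = tetra_median_s / ours_median_s  # 4.7

print(f"mean: {mean_ratio:.2f}x, median: {median_ratio:.2f}x")
```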

Reasonable cost

On this benchmark, GPT for Excel's average cost per task was ~$0.14 (at provider API pricing), with 90% of runs under $0.24 and a worst case of $1.40. Opus 4.7 is one of the pricier models, and one of the strengths of GPT for Excel is that it offers a range of models for different budgets. GPT-5.4, for instance, strikes a strong cost-quality balance, well suited to users who prefer to iterate a little in exchange for a lower bill.

We couldn't find any information about run cost for other listed tools.

Towards a benchmark representative of real world usage

Full disclosure: we love benchmarks. They make for a remarkably fair comparison. A benchmark proves capability on a snapshot: what an agent does on a fixed set of tasks, in one attempt, on one day. SpreadsheetBench in particular is a rigorous, public, and reproducible benchmark, and we take that 92.5% seriously.

But reality is wider than a snapshot.

GPT for Excel has thousands of daily users, and many of their everyday tasks sit outside what SpreadsheetBench measures:

Visual spreadsheet operations (charts, pivot tables, conditional formatting)
Bulk operations at scale (cleaning, categorization, deduplication, scoring, enrichment, content generation, translation - across thousands of rows)

Accuracy is only part of the picture, too. We want to measure speed and per-task cost alongside quality - and to capture the choices that actually shape the product experience: which model, which reasoning level, when to fall back to a cheaper option.

That's why we're building our own benchmark: broader in task scope, and multi-model / multi-reasoning by design. The full methodology and results will follow soon. In the meantime, we can share a preview.

Stay tuned.

¹ Ma, Z., Zhang, B., Zhang, J., Yu, J., Zhang, X., Zhang, X., Luo, S., Wang, X., & Tang, J. (2024). SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Datasets and Benchmarks Track. Renmin University of China, Tsinghua University, Zhipu.AI. https://arxiv.org/abs/2406.14991
