
How AI is Changing the Way We Benchmark Data Tools

Written by: Chris Parmer and Greg Wilson · October 21, 2024

We have known since the late 1970s that experts try to figure out as quickly as possible why their plans won't work, so that they can zero in on viable solutions sooner. This idea is sometimes summed up by saying that an expert doesn't build a proof of concept: instead, she tries to build a disproof of concept to help her understand what's going to be hard and what might need to be rethought entirely. At Plotly, we've recently been exploring how AI tools can accelerate the process of benchmarking data solutions; this blog post shares what we have been doing and how well it has worked.

Our starting point is a technical question: How fast are various tools for manipulating tabular data, and how do their APIs compare? How do Pandas and Polars compare on binning time-series data? Is DuckDB faster than SQLite at sorting aggregates, and how do the answers change as the data grows larger? What's a one-liner in one library but a multi-step imperative workflow in another? And what does the code actually look like?

Why benchmarking data tools is so time-consuming

The most reliable way to find out is to do the same thing with each system and measure how long each takes, but it takes time to learn how to use any tool, and learning several well enough to be sure we're building meaningful tests is expensive. Finding or generating test data is another headache, as is validating our benchmarks (i.e. making sure that the code is actually calculating the right thing).
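The measurement itself needn't be fancier than wall-clock timing around each query. A minimal sketch of the kind of harness we mean (the helper and its names are ours, not from any benchmarking library):

import time

def benchmark(label, run_query, repeats=5):
    """Time a zero-argument callable several times; report the best run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # e.g. lambda: df.groupby('item_category')['price'].mean()
        timings.append(time.perf_counter() - start)
    print(f"{label}: {min(timings):.4f}s (best of {repeats})")

Taking the best of several repeats dampens cache and GC noise; for the database engines you also have to decide whether connection setup and data loading count as part of the query.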

As for comparing syntax, we could read the examples in the documentation for each library, but those examples use different datasets and different queries, and they often highlight the strengths and elegance of the tool rather than its weaknesses or how it compares directly with other libraries.

Using AI for faster benchmarking

Enter AI, slicing through learning curves and speeding up tedious manual effort. We turned to Claude (Anthropic’s competitor to ChatGPT) and asked it to generate two million rows of data for grocery sales in cities in Canada, the US, and Mexico. After a few tweaks to the prompt, Claude gave us this Python code for geography:

geo_data = {
    'USA': {
        'California': ['Los Angeles', 'San Francisco', 'San Diego'],
        'New York': ['New York City', 'Buffalo', 'Albany'],
        'Texas': ['Houston', 'Austin', 'Dallas']
    },
    'Canada': {
        'Ontario': ['Toronto', 'Ottawa', 'Hamilton'],
        'Quebec': ['Montreal', 'Quebec City', 'Gatineau'],
        'British Columbia': ['Vancouver', 'Victoria', 'Kelowna']
    },
    'Mexico': {
        'Jalisco': ['Guadalajara', 'Puerto Vallarta', 'Zapopan'],
        'Nuevo Leon': ['Monterrey', 'San Pedro Garza García', 'Apodaca'],
        'Baja California': ['Tijuana', 'Mexicali', 'Ensenada']
    }
}

and this for items:

item_categories = ['Produce', 'Dairy', 'Bakery', 'Meat', 'Beverages']
item_subcategories = {
    'Produce': ['Fruits', 'Vegetables'],
    'Dairy': ['Milk', 'Cheese', 'Yogurt'],
    'Bakery': ['Bread', 'Pastries'],
    'Meat': ['Beef', 'Chicken', 'Pork'],
    'Beverages': ['Soda', 'Water', 'Juice']
}
item_names = {
    'Fruits': ['Apple', 'Banana', 'Orange'],
    'Vegetables': ['Carrot', 'Broccoli', 'Spinach'],
    'Milk': ['Whole Milk', 'Skim Milk', 'Almond Milk'],
    'Cheese': ['Cheddar', 'Mozzarella', 'Swiss'],
    'Yogurt': ['Greek Yogurt', 'Regular Yogurt', 'Flavored Yogurt'],
    'Bread': ['White Bread', 'Whole Wheat Bread', 'Sourdough'],
    'Pastries': ['Croissant', 'Muffin', 'Danish'],
    'Beef': ['Ground Beef', 'Steak', 'Roast'],
    'Chicken': ['Whole Chicken', 'Chicken Breast', 'Chicken Thighs'],
    'Pork': ['Pork Chops', 'Bacon', 'Ham'],
    'Soda': ['Cola', 'Lemon-Lime', 'Root Beer'],
    'Water': ['Spring Water', 'Sparkling Water', 'Flavored Water'],
    'Juice': ['Orange Juice', 'Apple Juice', 'Grape Juice']
}

A few more prompts gave us sales dates, profit margins, serial numbers, and everything else we needed to create complex dataframes to run our benchmarks on.
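We won't reproduce Claude's full generator here, but its shape was a loop of random draws over the dictionaries above. A condensed sketch under our own assumptions (the price, margin, and date distributions here are illustrative placeholders, not Claude's exact output):

import random
import pandas as pd

def make_sales(n_rows=2_000_000, seed=42):
    """Build a mock grocery-sales dataframe by sampling the dictionaries above."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        country = rng.choice(list(geo_data))
        state = rng.choice(list(geo_data[country]))
        category = rng.choice(item_categories)
        subcategory = rng.choice(item_subcategories[category])
        rows.append({
            'country': country,
            'state': state,
            'city': rng.choice(geo_data[country][state]),
            'item_category': category,
            'item_subcategory': subcategory,
            'item_name': rng.choice(item_names[subcategory]),
            'price': round(rng.uniform(0.5, 30.0), 2),    # placeholder range
            'margin': round(rng.uniform(0.05, 0.4), 2),   # placeholder range
            'date_sold': pd.Timestamp('2023-01-01')
                         + pd.Timedelta(days=rng.randint(0, 364)),
        })
    # A plain Python loop is slow at 2M rows; vectorized numpy sampling
    # is the faster route, but this keeps the logic readable.
    return pd.DataFrame(rows)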

Then we asked Claude to generate three equivalent benchmarks to use with SQLite, Pandas, and Polars. The only thing we have fixed is indentation to make the code fit nicely in this post:

-- SQLite
SELECT
    item_category,
    item_subcategory,
    COUNT(*) as count,
    AVG(price) as avg_price
FROM sales
GROUP BY item_category, item_subcategory
ORDER BY item_category, avg_price DESC;

# Pandas
result = df.groupby(['item_category', 'item_subcategory'])\
    .agg({'price': ['count', 'mean']}).reset_index()
result.columns = ['item_category', 'item_subcategory', 'count', 'avg_price']
result = result.sort_values(['item_category', 'avg_price'], ascending=[True, False])

# Polars
result = (
    df.group_by(['item_category', 'item_subcategory'])
    .agg([
        pl.count().alias('count'),
        pl.mean('price').alias('avg_price')
    ])
    .sort(['item_category', 'avg_price'], descending=[False, True])
)
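The benchmark script also ran DuckDB, whose snippet we haven't reproduced. For reference, DuckDB's Python API can run essentially the same SQL directly against a Pandas dataframe in scope; a hedged sketch, assuming `sales` is the dataframe described earlier:

import duckdb

# DuckDB's "replacement scans" let SQL reference a local Pandas dataframe
# by its variable name, so the SQLite query above ports almost verbatim.
result = duckdb.sql("""
    SELECT item_category, item_subcategory,
           COUNT(*) AS count, AVG(price) AS avg_price
    FROM sales
    GROUP BY item_category, item_subcategory
    ORDER BY item_category, avg_price DESC
""").df()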

These queries show how different tools embody different conceptual models of aggregation. SQL is “inside out” (the GROUP BY is declared in the middle of the statement) while Polars and Pandas flow more linearly. Polars has very little implementation-detail boilerplate embedded in its syntax: with the exception of alias, every character relates to the actual aggregation computation. Pandas, on the other hand, carries boilerplate like reset_index() and spreads the work over three statements instead of one chained expression.

While readability is strongly influenced by familiarity, we find the Polars code easier to follow than the alternatives: it’s more concise, it reads left to right without as much mental backtracking, and in general it feels less cluttered.

The real takeaway, though, is how much time we saved getting this far. The snippet of Polars shown above, and the other 29 snippets included in our benchmarking, were all generated in response to a single prompt:

The attached script runs a set of 30 queries in Pandas, SQLite, and DuckDB. Modify it to also run Polars code.

Of course, not all prompts are that simple:

OK, now generate different pivot-table like queries that go through the entire set of possibilities of what a pivot table could do. I’d expect there to be at least 30 but be systematic to go through all possible permutations.

These queries should involve four main things:

  1. Grouping / Binning Columns - creating a grouping column or columns. Create queries for each data type (category, number, date). Provide examples of a single grouping column and multiple grouping columns of different types. The numerical and date binning should be “smart”, taking into account the min/max of the column and coming up with an appropriate number of bins.
  2. Aggregating columns - creating aggregations for columns. Provide examples of sum, mean, median, min, max, count, and any others that might be useful. Again, data type aware.
  3. Sorting columns - sort the values by the values of the different aggregations or by the grouping columns (e.g. alphabetically) or by the values of the columns themselves.
  4. Limiting the number of values displayed (“Top N”) - for the categorical bins, only show the top N values based on the sort column or by a different column.

Do it systematically so that you can show you’ve covered all the possibilities.

That prompt is a little over 200 words and is about as detailed as what a conscientious instructor would write for an exercise in a programming class. (One beneficial side effect of programmers adopting LLMs might be that they finally start writing the kinds of problem specifications that their teachers have always told them to.)
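For concreteness, the “smart” numeric binning asked for in item 1 of that longer prompt might come out something like this sketch (our own paraphrase of the idea, not Claude's verbatim output; it assumes the column has a non-trivial range):

import numpy as np
import pandas as pd

def smart_bin(series, target_bins=10):
    """Bin a numeric column with round-numbered edges derived from its range."""
    lo, hi = float(series.min()), float(series.max())
    raw = (hi - lo) / target_bins          # raw bin width
    mag = 10 ** np.floor(np.log10(raw))    # its order of magnitude
    step = mag * round(raw / mag)          # width rounded to 1 significant figure
    edges = np.arange(np.floor(lo / step) * step, hi + step, step)
    return pd.cut(series, bins=edges)

# e.g. df.groupby(smart_bin(df['price']), observed=True)['price'].count()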
We didn't just copy and paste Claude’s answers and trust that they were doing the right thing (see Simon Willison’s blog and many other sources for why not). Instead, we read through them to find functions or methods we might want to use, to see how we could chain them together, and so on. Doing this helped us build our mental model of these tools more quickly; in particular, Claude's suggestions steered us toward things that we would eventually have found on our own, if we'd had time for "eventually". As just one example, consider the following SQL:

-- Top 3 item categories by profit in each country
SELECT *
FROM (
    SELECT
        country,
        item_category,
        SUM(price * margin) as total_profit,
        ROW_NUMBER() OVER (
            PARTITION BY country ORDER BY SUM(price * margin) DESC
        ) as rn
    FROM sales
    GROUP BY country, item_category
)
WHERE rn <= 3
ORDER BY country, total_profit DESC;

One of us has written a tutorial on SQL for data scientists, but he no longer uses SQL every day and would have struggled for an hour or more to construct this query and verify that it’s correct. That last point, verification, highlights another benefit of this approach: by getting Claude to generate code for several different systems, we could compare their results as well as their performance to check that they were doing the right thing. There’s still a chance that Claude is making the same mistake four times over, and that those mistakes are so subtle that our review didn’t catch them, but the same would be true of code written by a human being.
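The comparison itself is mechanical once every engine's answer is in a dataframe; a minimal sketch of what we mean (the helper is ours):

import pandas as pd

def assert_same_answer(a, b, keys):
    """Check two engines agree, ignoring row order, index, and float fuzz.

    `a` and `b` are Pandas dataframes; convert Polars results with
    .to_pandas() and pull SQLite/DuckDB results into Pandas first.
    """
    a = a.sort_values(keys).reset_index(drop=True)
    b = b.sort_values(keys).reset_index(drop=True)
    pd.testing.assert_frame_equal(a, b, check_dtype=False, check_exact=False)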

Our benchmarks are not exhaustive or authoritative, but they do compare the tools we were interested in, and they do so the way we want to compare them. Benchmarks like ClickBench and TPC-H compare databases but don’t include Pandas or Polars; other benchmarks often use specialized, domain-specific datasets that aren’t as easily understood as our mock grocery-store dataset.

A few takeaways

We learned a few things from this exercise:

  1. We’re not looking for exact performance metrics; we want orders of magnitude to help guide design. That said: (a) Polars and DuckDB perform within the same order of magnitude (DuckDB was mostly 1-5x faster than Polars, though Polars was 1-4x faster on a few queries); (b) both are one to two orders of magnitude faster than Pandas (2-20x); (c) both are roughly two orders of magnitude faster than SQLite (14-93x).
  2. Syntax varies widely. Certain analytic queries are elegant in one library and opaque in another. For example, a seemingly simple query like “Top 3 items by profit in each country” requires ROW_NUMBER in SQL and multiple lines of imperative code in Pandas, but can be written as one long chained expression in Polars (see the code after this list).
  3. The syntax is what the AI thinks the syntax should be. There may be simpler or more elegant ways to write these queries, but short of investing tens or hundreds of hours learning a library, or asking an expert (if you know one!), how do you know? Sometimes our software intuition tells us that something “looks wrong”, or makes us ask “is there really not a simpler way to do this?”, but that doesn’t always work. We’re more ambitious with AI, delving into languages and tools we wouldn’t have touched otherwise, but in doing so we might be doing things the popular-but-wrong way. It reminds us of copying and pasting snippets from Stack Overflow at the start of our careers.
  4. Backwards incompatibility in libraries hurts. The newer or less frequently used something is, the more carefully you need to check what the AI produces. Every piece of SQL that Claude generated for SQLite was correct the first time; we had to make a couple of minor corrections to its offerings for Pandas and DuckDB, and there were several issues with Polars. Most of the latter were related to Claude using an older version of the Polars API; we can only hope that adoption of LLM-based tools will serendipitously encourage programmers to pay more attention to backward compatibility.
-- SQL
SELECT *
FROM (
    SELECT
        country,
        item_category,
        SUM(price * margin) as total_profit,
        ROW_NUMBER() OVER (
            PARTITION BY country ORDER BY SUM(price * margin) DESC
        ) as rn
    FROM sales
    GROUP BY country, item_category
)
WHERE rn <= 3
ORDER BY country, total_profit DESC;

# Pandas
df['profit'] = df['price'] * df['margin']
result = df.groupby(['country', 'item_category'])['profit'].sum().reset_index(name='total_profit')
result = result.sort_values(['country', 'total_profit'], ascending=[True, False])
result = result.groupby('country').head(3)

# Pandas again, for a finer-grained variant (top 3 categories by sales
# within each country and state):
result = df.groupby(['country', 'state', 'item_category']).agg({
    'price': ['count', 'sum']
}).reset_index()
result.columns = ['country', 'state', 'item_category', 'count', 'total_sales']
result = result.sort_values(['country', 'state', 'total_sales'], ascending=[True, True, False])
result = result.groupby(['country', 'state']).head(3)

# Polars (the same finer-grained variant, as one chained expression)
result = (
    df.group_by(['country', 'state', 'item_category'])
    .agg([
        pl.count().alias('count'),
        pl.sum('price').alias('total_sales')
    ])
    .sort(['country', 'state', 'total_sales'], descending=[False, False, True])
    .group_by(['country', 'state'])
    .head(3)
)
And things like dates are gnarly across the board, but especially so in Polars:
-- SQL
strftime('%Y-Q%d', date_sold, 'start of month', '-2 months', 'weekday 4') AS quarter,

# Pandas
df['date_sold'].dt.to_period('Q').astype(str)

# Polars
(
    df.with_columns(
        pl.concat_str([
            pl.col('date_sold').dt.year().cast(pl.Utf8),
            pl.lit('-Q'),
            pl.col('date_sold').dt.quarter().cast(pl.Utf8)
        ]).alias('quarter')
    )
    .group_by('quarter')
)
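Two hedged footnotes on those Polars snippets: the API breakage mentioned in takeaway 4 was mostly renames of the kind sketched below (version boundaries are approximate; check the Polars changelog), and newer Polars can build the quarter label more directly with pl.format:

from datetime import date
import polars as pl

# Renames of the kind Claude tripped over (approximate versions):
#   df.groupby(...)  ->  df.group_by(...)   (around Polars 0.19)
#   pl.count()       ->  pl.len()           (around Polars 0.20)

# A shorter quarter label than the concat_str chain above:
df = pl.DataFrame({'date_sold': [date(2024, 2, 14), date(2024, 8, 3)]})
df = df.with_columns(
    pl.format(
        '{}-Q{}',
        pl.col('date_sold').dt.year(),
        pl.col('date_sold').dt.quarter(),
    ).alias('quarter')
)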

The final takeaway is that if you know what you want, AI is much more than just autocomplete on steroids or Stack Overflow without the snarkiness. Like many people, we have questions about environmental impact, bias, copyright infringement, and intentional misuse, but for benchmarking, writing unit tests, and building a mental model of a new tool, we now believe that Jon Udell and others are right: it's the biggest thing to happen to programming since the invention of the Internet.
