
DataAgent

A benchmark research paper on data science capabilities in large language models

As LLMs start taking on more real-world work, benchmarks have proliferated. We've seen benchmarks for math, physics, and software engineering tasks, but none specifically for data exploration and analysis. That's the gap we wanted to fill.

Data science is dominated by heavy data processing, complex analysis, and extracting deep insights from datasets. It's also repetitive: a lot of the work comes down to basic code execution and answering straightforward questions about data. As LLMs get access to code interpreters and learn how data is structured and fetched, they should theoretically be able to handle these tasks. But how well do they actually perform?

What We Built

We created a benchmark dataset to evaluate GPT-3.5 as a "Language Data Scientist" (LDS) that answers natural language questions about datasets zero-shot, without any task-specific examples. The system works in three phases: gathering background information on the dataset using Pandas, generating a plain-language action plan using Chain-of-Thought and SayCan prompting techniques, and then executing that plan to answer the original query.
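
Below is a minimal sketch of that three-phase flow, assuming a hypothetical call_llm helper that wraps a GPT-3.5 completion call. The Pandas calls in the metadata phase are real, but the prompt wording and helper names are illustrative, not the exact prompts or code from the paper.

    import pandas as pd

    def call_llm(prompt: str) -> str:
        """Hypothetical wrapper around a GPT-3.5 chat-completion call."""
        raise NotImplementedError("plug in your LLM client here")

    def gather_background(df: pd.DataFrame) -> str:
        # Phase 1: collect dataset metadata with Pandas.
        return (
            f"Columns and dtypes:\n{df.dtypes}\n\n"
            f"Shape: {df.shape}\n\n"
            f"Preview:\n{df.head().to_string()}"
        )

    def generate_plan(background: str, question: str) -> str:
        # Phase 2: ask for a step-by-step, plain-language plan
        # (Chain-of-Thought / SayCan style prompting).
        prompt = (
            "You are a data scientist. Given the dataset summary below, "
            "think step by step and write a numbered plan of Pandas "
            "operations that answers the question.\n\n"
            f"{background}\n\nQuestion: {question}\nPlan:"
        )
        return call_llm(prompt)

    def execute_plan(plan: str, df: pd.DataFrame) -> str:
        # Phase 3: turn the plan into code and run it against the dataframe.
        code = call_llm("Write Python (Pandas) code for this plan, storing "
                        f"the result in a variable named answer:\n{plan}")
        scope = {"df": df}
        exec(code, scope)  # the real executor is more careful than a bare exec
        return str(scope.get("answer", "no answer produced"))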

We tested it on 15 benchmark datasets ranging from 50 to 300 rows, covering everything from weather data to sales figures. Each dataset came with 15 questions of varying difficulty—some as simple as "how many rows are in this dataset?" and others more complex like "use linear regression to predict the next value in this column."
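
To make that difficulty range concrete, here is the kind of Pandas code the LDS is expected to produce for those two example questions. The file name and the "sales" column are placeholders for illustration, not columns from the actual benchmark datasets.

    import numpy as np
    import pandas as pd

    df = pd.read_csv("example_dataset.csv")  # placeholder for one of the 15 datasets

    # "How many rows are in this dataset?"
    row_count = len(df)

    # "Use linear regression to predict the next value in this column."
    # Fit value vs. row index with a degree-1 polynomial, then extrapolate one step.
    y = df["sales"].to_numpy(dtype=float)    # 'sales' is a placeholder column name
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)
    next_value = slope * len(y) + intercept

    print(row_count, next_value)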

The Results

Running on GPT-3.5, the system achieved 32.89% accuracy (74 out of 225 questions correct). Performance was surprisingly consistent across dataset sizes: 33% for small datasets, 29% for medium, and 36% for large ones.

The model excelled at straightforward data extraction and simple correlations. But it struggled with multi-part questions: ask for three different statistics in one query and it would often nail one or two and completely miss the rest. Edge cases were rough too. Ask it for the median of a categorical column (which doesn't make sense) and, instead of flagging that the operation is invalid, it would return something incorrect, usually an answer based on the first column in the dataset.
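
A more defensive pipeline could catch that failure mode by checking the column's dtype before computing a median and refusing instead of guessing. This is a sketch of the idea, not part of the published system.

    import pandas as pd
    from pandas.api.types import is_numeric_dtype

    def safe_median(df: pd.DataFrame, column: str) -> float:
        # Refuse to compute a median on non-numeric data rather than
        # silently falling back to some unrelated column.
        if column not in df.columns:
            raise KeyError(f"no such column: {column}")
        if not is_numeric_dtype(df[column]):
            raise TypeError(f"'{column}' is categorical; median is undefined")
        return float(df[column].median())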

One specific challenge: questions asking "which column has the most missing values?" in datasets with no missing values would consistently trip it up. The model would confidently return a wrong answer instead of recognizing the edge case. We saw this pattern across all current LLMs: hallucinations are common, which undermines their reliability on tasks with a single correct answer, like data analysis.
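
The honest answer when nothing is missing is to say so. Again, this guard is a sketch of what a wrapper around the model could do, not something the benchmarked system implements.

    import pandas as pd

    def column_with_most_missing(df: pd.DataFrame) -> str:
        # Count NaNs per column and handle the all-complete edge case explicitly.
        missing = df.isna().sum()
        if missing.max() == 0:
            return "No column has missing values."
        return str(missing.idxmax())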

Looking Forward

Right now the score sits at 32.89% on GPT-3.5, but we expect it to climb significantly as newer LLMs are released. OpenAI reports that GPT-4 is 40% more likely to produce factual responses than GPT-3.5, and it handles complex, multi-part instructions far better. This benchmark dataset will serve as a measuring stick for how well future models can handle zero-shot data science tasks.

If LLMs can handle the repetitive, low-level data work, data scientists can focus on the actually interesting problems. We're not there yet, but this research shows it's possible.

Preprint on arXiv: Link to preprint

Published on IEEE Xplore: Link to document