Research & Benchmarks

AI Ad Creative Testing, Validated.

We tested Kettio’s creative performance prediction system against 5 independent academic benchmarks with real human judgment data. Every result is published here. Nobody else in this space has done the same.

ρ = 0.775 · UW Consumer Panel (Spearman correlation)
70.3% · Pairwise accuracy (CreativeRanking CTR)
5 · Benchmarks (independent datasets)
The Problem

Every ad testing tool claims accuracy. We’re the only ones who prove it.

The ad creative testing market is full of platforms that promise to predict your next winning creative. They say “AI-powered.” They say “data-driven.” None of them tell you how accurate their predictions actually are.

We ran our creative performance prediction system against five independent academic datasets—published by university research labs, peer-reviewed at CVPR and ACL, with ground truth from thousands of real human evaluators. We didn’t cherry-pick the ones where we looked good. We published every result.

If your current creative testing software can’t tell you its Spearman correlation against real human judgments, ask yourself why.
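For readers who want the metric made concrete: Spearman correlation (ρ) measures how well one ranking agrees with another, in this case ads ranked by predicted score versus ads ranked by average human rating. A minimal, dependency-free sketch in Python (all scores below are illustrative, not benchmark data):

```python
def rank(xs):
    """Average 1-based ranks; tied values get the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # group consecutive equal values so ties share an averaged rank
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rho = Pearson correlation computed on the ranks."""
    ra, rb = rank(a), rank(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = (sum((x - ma) ** 2 for x in ra) * sum((y - mb) ** 2 for y in rb)) ** 0.5
    return num / den

# Hypothetical scores for five ads: model predictions vs. mean human ratings
predicted = [0.82, 0.45, 0.91, 0.30, 0.67]
human = [0.78, 0.50, 0.88, 0.35, 0.60]
print(round(spearman(predicted, human), 3))  # identical orderings -> 1.0
```

A ρ of 1.0 means the two rankings agree exactly, 0 means no relationship, and a negative value means the model ranks ads backwards; a reported 0.775 means the model's ordering closely tracks the panel's.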

The Benchmarks

5 Datasets. Real Human Judgments. Every Result.

Each benchmark uses ground-truth data from real human evaluators. No synthetic validation. No grading our own homework.

UW Consumer Panel · University research dataset
Purchase intent · 160 ads
ρ = 0.775 (Spearman correlation)

Real consumer panel data scoring ads on purchase intent. Kettio's predictions correlate at 0.775 with human panel consensus — approaching the reliability ceiling of human-to-human agreement.

PittAds (CVPR) · University of Pittsburgh, CVPR publication
Video emotional resonance · 2,400+ video ads
ρ = 0.681 (Spearman correlation)

A computer vision benchmark of real video advertisements with crowdsourced persuasion and emotional response ratings. Validates that our system extends beyond static images to video creative.

CreativeRanking · Industry CTR benchmark
Real click-through rates · 500+ ad pairs
70.3% (pairwise accuracy)

Given two ads from the same campaign, Kettio correctly identifies the one with higher real-world CTR 70.3% of the time — without seeing any performance data.
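Pairwise accuracy is the easiest of these metrics to reason about: for each pair of ads, did the model's preferred ad actually earn the higher real-world click-through rate? A small illustrative sketch in Python (every number below is hypothetical, not benchmark data):

```python
# Hypothetical tuples: (model score A, model score B, real CTR A, real CTR B)
pairs = [
    (0.81, 0.44, 0.031, 0.012),  # model prefers A; A had the higher CTR -> correct
    (0.35, 0.70, 0.009, 0.024),  # model prefers B; B had the higher CTR -> correct
    (0.60, 0.55, 0.010, 0.022),  # model prefers A; B had the higher CTR -> wrong
]

def pairwise_accuracy(pairs):
    """Fraction of pairs where the score ordering matches the CTR ordering."""
    correct = sum(
        1 for score_a, score_b, ctr_a, ctr_b in pairs
        if (score_a > score_b) == (ctr_a > ctr_b)
    )
    return correct / len(pairs)

print(pairwise_accuracy(pairs))  # 2 of 3 pairs ranked correctly
```

Random guessing scores 50% on this metric, which is why results are judged against that baseline rather than against zero.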

ADD1000 · Academic benchmark with eye-tracking
Subjective ad ratings · 1,000 ad images
ρ = 0.535 (Spearman correlation)

One thousand advertisement images scored by human evaluators with accompanying eye-tracking data from 57 participants. Validates performance on a large-scale, diverse ad corpus.

PVP (ACL 2025) · Seoul National University, ACL publication
Visual persuasion · 1,089 images / 2,521 annotators
Beat GPT-4o (zero-shot, no training data)

1,089 images scored by 2,521 personality-profiled annotators. Kettio outperformed GPT-4o on both Spearman correlation (0.221 vs. 0.19) and Pearson correlation (0.195 vs. 0.19). Zero-shot: no fine-tuning, no training data.

Head to Head

Kettio vs. Direct AI Scoring

“Can’t I just ask ChatGPT to rate my ads?” You can. Here’s what happens when you do.

PVP Benchmark (ACL 2025)    Kettio    GPT-4o (direct)    Fine-tuned LLaMA
Spearman ρ                  0.221     0.19               0.25
Pearson r                   0.195     0.19               0.25
Pairwise accuracy           58.1%     n/a                n/a
Training data required      None      None               Full dataset

PVP benchmark: 1,089 images, 2,521 personality-profiled annotators. Kettio used zero-shot inference with no training data.
Read the full breakdown →

1,089 images scored on the PVP benchmark
2,521 human annotators validating ground truth
5,000+ total ads across all benchmarks
Our Approach

Why It Works (Without the Secret Sauce)

Most AI ad testing asks a model to rate your ad on a scale of 1 to 5. The problem is well-documented: language models are terrible at producing reliable numerical ratings. They regress to the middle, produce distributions that look nothing like real human responses, and fail to differentiate meaningfully between creatives.

Kettio takes a fundamentally different approach. Our proprietary scoring system is grounded in published behavioral science research and adapted for ad creative evaluation. Rather than asking for a number, we simulate how a specific consumer persona—built from your target audience profile—would actually respond to each creative. The system then translates those simulated responses into calibrated performance predictions.

The result is a prediction that accounts for who your audience is, not just what the ad looks like. That’s why audience targeting produces an 8–12 percentage point accuracy lift—and why no one running real ad campaigns targets “everyone.”

For a deeper look at the research integrity behind these numbers, see “We Ran 9 Experiments and Kept Zero”—a post about what happens when we test modifications to the system and they don’t beat the baseline.

Key Findings

What 16 Experiments Taught Us

+12pp

Audience Targeting Is Everything

Generic AI scoring gets you generic results. When we condition predictions on your specific target audience — age, shopping behavior, brand awareness, price sensitivity — accuracy jumps 8–12 percentage points. This is the single highest-leverage variable in the entire system.

ρ=0.681

Video Works Too

On the PittAds benchmark (CVPR), Kettio achieves ρ=0.681 on video ad emotional resonance — validating that creative performance prediction extends beyond static images to the short-form video formats that dominate social advertising.

3 Languages

Cross-Cultural Validation

Our benchmarks span English, Chinese, and Korean datasets. The same system works across languages and cultural contexts without localization or retraining — a critical requirement for brands running global campaigns.

0 Training

Zero-Shot Generalization

Kettio has never been fine-tuned on any benchmark dataset. Every result reported here is zero-shot: the system had never seen these ads or their scores before. It works on your ads the same way it works on ours — no category-specific training needed.

FAQ

Common Questions About AI Ad Testing

Creative performance prediction uses AI to forecast which ad creatives will perform best with a target audience before any media spend. Unlike A/B testing, which requires live traffic and budget, predictive creative testing scores ads against simulated audience responses calibrated to real human judgment data. Kettio's system is validated against 5 independent academic benchmarks.

On the UW Consumer Panel benchmark, Kettio achieves a Spearman correlation of 0.775 against real consumer panel data — approaching the ceiling of human-to-human test-retest reliability. Traditional consumer panels cost $10,000–$15,000 per study and take 4–6 weeks. Kettio delivers results in minutes.

Yes, with measurable accuracy. On the CreativeRanking benchmark using real click-through rate data, Kettio correctly identifies the stronger ad in a pair 70.3% of the time — without seeing any performance data. This is well above the 50% random baseline and competitive with expensive human panel studies.

On the PVP benchmark (ACL 2025, 1,089 images, 2,521 human annotators), Kettio outperformed GPT-4o on both Spearman and Pearson correlation. Direct LLM scoring suffers from well-documented biases: regression to the mean, position bias, and inability to produce calibrated probability distributions. Kettio's proprietary methodology was specifically designed to overcome these limitations.

Both. On the PittAds benchmark (CVPR), Kettio achieves ρ=0.681 on video ad emotional resonance. The system is validated across static images, video creative, and both English and non-English markets.

Zero-shot means Kettio was never trained, fine-tuned, or shown any examples from the benchmark datasets. Every result reported on this page was generated by a system that had never seen these ads or their human scores before — the same system that scores your ads in production.

See It Yourself

Test It on Your Own Ads

Upload your creatives, define your target audience, and see how Kettio scores them. The same system that beat GPT-4o on academic benchmarks, running on your actual ads.

Score your ads free · For agencies