AI Ad Creative Testing, Validated.
We tested Kettio’s creative performance prediction system against five independent academic benchmarks with real human judgment data. Every result is published here. Nobody else in this space has done the same.
Every ad testing tool claims accuracy. We’re the only ones who prove it.
The ad creative testing market is full of platforms that promise to predict your next winning creative. They say “AI-powered.” They say “data-driven.” None of them tell you how accurate their predictions actually are.
We ran our creative performance prediction system against five independent academic datasets—published by university research labs in peer-reviewed venues like CVPR and ACL, with ground truth from thousands of real human evaluators. We didn’t cherry-pick the ones where we looked good. We published every result.
If your current creative testing software can’t tell you its Spearman correlation against real human judgments, ask yourself why.
5 Datasets. Real Human Judgments. Every Result.
Each benchmark uses ground-truth data from real human evaluators. No synthetic validation. No grading our own homework.
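For readers who want to see what a validation like this means mechanically, here is a minimal, self-contained sketch of the metric in question: Spearman rank correlation between human judgments and model predictions. This is not Kettio’s pipeline, and the scores below are made-up placeholder data; it only illustrates that Spearman compares orderings, so the two score scales never need to match.

```python
def rank(values):
    """Fractional (average) ranks, 1-based, with ties averaged."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Group tied values so they share an averaged rank.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the rank vectors = Spearman's rho."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical example: mean human ratings vs. model scores for 5 ads.
human = [4.2, 3.1, 4.8, 2.5, 3.9]   # 1-5 annotator scale
model = [0.71, 0.42, 0.88, 0.55, 0.30]  # 0-1 model scale
print(round(spearman(human, model), 3))  # → 0.6
```

A rho of 1.0 would mean the model ranks every creative exactly as humans do; 0 means no relationship. This is the number a vendor should be able to report against independent ground truth.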
Kettio vs. Direct AI Scoring
“Can’t I just ask ChatGPT to rate my ads?” You can. Here’s what happens when you do.
PVP benchmark: 1,089 images, 2,521 personality-profiled annotators. Kettio used zero-shot inference with no training data.
Read the full breakdown →
Why It Works (Without the Secret Sauce)
Most AI ad testing asks a model to rate your ad on a scale of 1 to 5. The problem is well-documented: language models are terrible at producing reliable numerical ratings. They regress to the middle, produce distributions that look nothing like real human responses, and fail to differentiate meaningfully between creatives.
Kettio takes a fundamentally different approach. Our proprietary scoring system is grounded in published behavioral science research and adapted for ad creative evaluation. Rather than asking for a number, we simulate how a specific consumer persona—built from your target audience profile—would actually respond to each creative. The system then translates those simulated responses into calibrated performance predictions.
The result is a prediction that accounts for who your audience is, not just what the ad looks like. That’s why audience targeting produces an 8–12 percentage point accuracy lift—and why no one running real ad campaigns targets “everyone.”
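To make the contrast concrete, here is a hypothetical sketch of the two prompting styles described above. Every name, field, and prompt string here is illustrative only; Kettio’s actual scoring system is proprietary and not shown.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    # Illustrative audience fields; a real profile would be richer.
    age_range: str
    shopping_behavior: str
    brand_awareness: str
    price_sensitivity: str

def direct_rating_prompt(ad_text: str) -> str:
    # The naive approach: ask a model for a number. Ratings like
    # this tend to regress toward the middle of the scale.
    return f"Rate this ad from 1 to 5:\n{ad_text}"

def persona_prompt(ad_text: str, p: Persona) -> str:
    # Persona-conditioned approach: ask for a role-played reaction
    # from a specific consumer, which a downstream step would then
    # translate into a calibrated performance prediction.
    return (
        f"You are a {p.age_range} shopper who {p.shopping_behavior}, "
        f"with {p.brand_awareness} of this brand and {p.price_sensitivity} "
        f"price sensitivity. Describe, in first person, how you would "
        f"react to this ad and whether you would click it:\n{ad_text}"
    )

p = Persona("25-34", "buys mostly on mobile", "low awareness", "high")
print(persona_prompt("Free shipping on all orders this weekend.", p))
```

The point of the second prompt is that the audience profile changes the question itself, which is why conditioning on the target audience moves accuracy where a bare 1-to-5 request cannot.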
For a deeper look at the research integrity behind these numbers, see “We Ran 9 Experiments and Kept Zero”—a post about what happens when we test modifications to the system and they don’t beat the baseline.
What 16 Experiments Taught Us
Audience Targeting Is Everything
Generic AI scoring gets you generic results. When we condition predictions on your specific target audience — age, shopping behavior, brand awareness, price sensitivity — accuracy jumps 8–12 percentage points. This is the single highest-leverage variable in the entire system.
Video Works Too
On the PittAds benchmark (CVPR), Kettio achieves ρ=0.681 on video ad emotional resonance — validating that creative performance prediction extends beyond static images to the short-form video formats that dominate social advertising.
Cross-Cultural Validation
Our benchmarks span English, Chinese, and Korean datasets. The same system works across languages and cultural contexts without localization or retraining — a critical requirement for brands running global campaigns.
Zero-Shot Generalization
Kettio has never been fine-tuned on any benchmark dataset. Every result reported here is zero-shot: the system had never seen these ads or their scores before. It works on your ads the same way it works on ours — no category-specific training needed.
Common Questions About AI Ad Testing
Test It on Your Own Ads
Upload your creatives, define your target audience, and see how Kettio scores them. The same system that beat GPT-4o on academic benchmarks, running on your actual ads.