We Tested Our Ad Scoring System Against 1,000+ Images and Beat ChatGPT. Here's What We Learned.

Kettio predicts which ad creatives will win before you spend. We built a system that takes your ads, builds a digital version of your target audience, and tells you which creative will perform best and why. No survey panels. No focus groups. No wasted media budget on test flights.

Bold claim. So we tested it.

The Benchmark

We ran Kettio’s scoring pipeline against the PVP dataset, an academic benchmark from Seoul National University published at ACL 2025. It contains over 28,000 images scored by 2,521 human annotators across 20 topics, including advertising. Each annotator was profiled with Big Five personality traits and Schwartz values, which let us test whether our synthetic audience personas could predict how real people respond to visual persuasion.

We scored 1,089 images. No fine-tuning. No training data. Kettio had never seen any of these images or scores before.

Here’s what happened.

Metric	Kettio SSR	GPT-4o (PVP paper)	Fine-tuned LLaMA (PVP paper)
Spearman correlation	0.221	0.19	0.25
Pearson correlation	0.195	0.19	0.25
Pairwise accuracy	0.581	n/a	n/a

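Kettio beat GPT-4o on both correlation metrics. Zero-shot. No training data. The margin is clearest on Spearman (0.221 vs. 0.19) and narrow on Pearson (0.195 vs. 0.19). And we came close to a fine-tuned LLaMA model that was trained specifically on this dataset.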

The pairwise accuracy number is the one that matters most. It means that, given two creatives, Kettio picks the one humans preferred 58% of the time, against a coin-flip baseline of 50%. And that is on a benchmark where the humans themselves barely agree with each other (the dataset’s inter-annotator agreement is essentially zero).
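
For readers who want to check numbers like these on their own data, here is a minimal sketch of the two kinds of metric in the table, written in plain Python. The function names are ours for illustration, and skipping pairs that humans scored equally is an assumption, not necessarily how the PVP paper or our pipeline handles ties.

```python
from itertools import combinations


def pairwise_accuracy(predicted, human):
    """Share of item pairs where the predicted order matches the human order.

    Pairs with tied human scores are skipped (an assumption for this sketch).
    """
    correct = 0
    total = 0
    for i, j in combinations(range(len(human)), 2):
        if human[i] == human[j]:
            continue
        total += 1
        # Same sign of the two differences means the same ordering.
        if (predicted[i] - predicted[j]) * (human[i] - human[j]) > 0:
            correct += 1
    return correct / total if total else float("nan")


def _average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(_average_ranks(x), _average_ranks(y))


# Toy scores for four creatives (made-up numbers, not benchmark data).
model_scores = [0.2, 0.5, 0.4, 0.9]
human_scores = [1, 2, 3, 4]
print(pairwise_accuracy(model_scores, human_scores))
print(spearman(model_scores, human_scores))
```

Pairwise accuracy only asks whether the ordering of each pair is right, which is why it maps so directly onto the "which of these two ads should I run" decision.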

On advertising images specifically, our correlation jumped to 0.37. That’s our home turf and the signal is strong.

How It Works

Most AI ad testing asks a model to rate an ad on a numerical scale. The problem is that LLMs are terrible at this. They regress to the middle, spit out 3s and 4s on everything, and produce distributions that look nothing like how real humans respond to ads.

We don’t ask for a number. We ask for a reaction.

Kettio builds a persona from your target audience profile (age, ad skepticism, price sensitivity, shopping intent, brand familiarity, and more), then shows them the creative and asks: “You’re scrolling through your feed. You see this. What do you do next and why?”

The model writes a free-text rationale. Something like: “I paused for a second because the visual caught my eye, but the messaging feels too premium for my budget. I’d keep scrolling.”
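
To make the persona step concrete, here is a rough sketch of how a profile could be turned into a prompt. The `Persona` class, its field formats, and `build_prompt` are hypothetical names for illustration, not Kettio's production code. Only the attribute list and the question itself come from the description above.

```python
from dataclasses import dataclass

# The reaction question quoted above, shown to every persona.
REACTION_QUESTION = (
    "You're scrolling through your feed. You see this. "
    "What do you do next and why?"
)


@dataclass
class Persona:
    # Field names mirror the attributes listed above; the value formats
    # (free-text levels such as "high") are assumptions for this sketch.
    age: int
    ad_skepticism: str
    price_sensitivity: str
    shopping_intent: str
    brand_familiarity: str


def build_prompt(persona: Persona) -> str:
    """Combine the persona profile with the open-ended reaction question."""
    profile = (
        f"You are {persona.age} years old. "
        f"Ad skepticism: {persona.ad_skepticism}. "
        f"Price sensitivity: {persona.price_sensitivity}. "
        f"Shopping intent: {persona.shopping_intent}. "
        f"Brand familiarity: {persona.brand_familiarity}."
    )
    # Note that the prompt never asks for a number, only for a reaction.
    return f"{profile}\n\n{REACTION_QUESTION}"


shopper = Persona(
    age=28,
    ad_skepticism="high",
    price_sensitivity="high",
    shopping_intent="browsing",
    brand_familiarity="low",
)
print(build_prompt(shopper))
```

The creative itself would be attached to this prompt as an image when it is sent to a vision-capable model. That call is left out here because it depends on the model provider.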

We then take that rationale, embed it, and compare it against calibrated anchor statements at each point on a scoring scale. The cosine similarity between the rationale and each anchor is converted into a probability distribution over the scale, not a single number. That distribution becomes the score.
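
The scoring step can be sketched in a few lines. This is an illustration under stated assumptions, not our exact implementation: the vectors below are tiny made-up stand-ins for real embeddings, the softmax with a `temperature` parameter is one simple way to turn similarities into probabilities, and `ssr_distribution` is a name we chose for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ssr_distribution(rationale_vec, anchor_vecs, temperature=0.1):
    """Turn rationale-to-anchor similarities into a probability distribution.

    A softmax over similarities is an assumption for this sketch; other
    normalizations are possible.
    """
    sims = [cosine_similarity(rationale_vec, a) for a in anchor_vecs]
    peak = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(weights)
    return [w / total for w in weights]


# Toy 3-dimensional "embeddings" for anchors at scale points 1 to 5.
# Real embeddings would come from an embedding model and be far larger.
anchor_vecs = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.7, 0.7],
    [0.0, 0.0, 1.0],
]
rationale_vec = [0.1, 0.9, 0.3]

distribution = ssr_distribution(rationale_vec, anchor_vecs)
expected_score = sum((k + 1) * p for k, p in enumerate(distribution))
print(distribution)
print(expected_score)
```

Keeping the whole distribution, rather than collapsing it to the expected score straight away, preserves how uncertain the persona's reaction was.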

This approach is based on Semantic Similarity Rating (SSR), a method published by PyMC Labs and Colgate-Palmolive that achieved 90% of human test-retest reliability on purchase intent surveys. We adapted it for ad creative scoring with our own persona system, our own anchor bank, and a completely different use case.

[Figure: A large creative benchmark is narrowed through competing evaluation lenses to compare the clarity of their ranking signals.]

The Surprising Finding: Hedging Is the Signal

During testing, we tried to improve the system by forcing the model to be more direct. Instead of letting it hedge (“this is somewhat persuasive”), we told it: “State clearly whether you would act or not. Do not hedge.”

The scores spread out. But the accuracy tanked.

It turns out the hedging is doing real work. When the model hedges a lot (“while this doesn’t particularly appeal to me, I can see how some viewers might…”), that maps to low persuasion. When it hedges less (“this is quite compelling, I’d probably look into this”), that maps to higher persuasion. The degree of qualification creates a subtle gradient that the embedding model picks up.

Forcing directness collapsed that gradient into binary yes/no responses and destroyed the information.

This independently confirms what the original SSR researchers found: pushing LLMs toward more extreme, more human-looking response distributions can actually reduce their ranking accuracy. The safe, hedged middle contains more signal than the confident extremes.

This is the same reason you can’t simply ask the model for a number, as you would on a Likert scale.

What This Means for Agencies

If you’re running creative for clients, you already know the problem. You produce 3 to 5 variants, pick the ones the team likes best, run them for a week, then kill the losers based on performance data. That first week of spend on losing creatives is pure waste.

Kettio gives you a preflight check. Upload your creatives, define your audience, and in minutes you get a ranked list with win probabilities and written rationales explaining why each persona segment responded the way they did.
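
One way to get from per-creative score distributions to win probabilities is sketched below. It is an illustrative assumption, not a description of our production ranking: it treats each creative's score as an independent draw from its distribution, enumerates every combination, and splits ties evenly. The name `win_probabilities` is ours for the example.

```python
from itertools import product


def win_probabilities(distributions):
    """Probability that each creative gets the top score.

    `distributions` holds one probability list per creative, all over the
    same scale. Scores are treated as independent draws (an assumption),
    and ties for the top score are split evenly among the tied creatives.
    """
    n = len(distributions)
    scale = range(len(distributions[0]))
    wins = [0.0] * n
    for outcome in product(scale, repeat=n):
        prob = 1.0
        for creative, point in enumerate(outcome):
            prob *= distributions[creative][point]
        if prob == 0.0:
            continue
        best = max(outcome)
        winners = [c for c, point in enumerate(outcome) if point == best]
        for c in winners:
            wins[c] += prob / len(winners)
    return wins


# Made-up distributions over a 5-point scale for three creatives.
creatives = {
    "variant_a": [0.05, 0.15, 0.40, 0.30, 0.10],
    "variant_b": [0.10, 0.30, 0.40, 0.15, 0.05],
    "variant_c": [0.02, 0.08, 0.30, 0.40, 0.20],
}
probs = win_probabilities(list(creatives.values()))
ranked = sorted(zip(creatives, probs), key=lambda item: item[1], reverse=True)
for name, p in ranked:
    print(f"{name}: {p:.2f}")
```

Exhaustive enumeration is fine for a handful of variants. With many creatives, sampling from the distributions would be the more practical route.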

The rationale is the real product. No survey panel tells you: “Your 28-year-old skeptical browser paused because the visual created urgency but bounced because the price signal felt premium and she’s highly price-sensitive.” Kettio does. For every creative, for every audience segment, in minutes.

At 58% pairwise accuracy on a worst-case academic benchmark (cross-cultural annotators who can’t agree with each other), we expect real-world accuracy on your actual audience, with well-defined personas, to be higher. And even when the system picks wrong on rank, the rationale still tells you something useful about how your audience sees the work.

[Figure: Distributed confidence across several ads produces a more reliable shortlist than one overconfident winner selection.]

Where We’re Going

We’re extending SSR to video. Social advertising is video-first now and the same methodology applies: build the persona, show them the content, capture the reaction, score it. For video, we’re adding separate Hook, Persuasion, and Action scores because the first three seconds of a social video are a completely different evaluation than the full narrative.

We’re also running pilot programs with agencies. If you’re spending on creative testing or just guessing which variant to back, we should talk.

The Bottom Line

We built this system from scratch. It beats GPT-4o on an academic benchmark and produces richer output than a survey panel. The approach is grounded in published research and validated against real human judgments.

Preflight your ads against your audience. See winners before you spend.

That’s Kettio.

If you want early access or want to run a pilot, reach out at spencer@kettio.com.