Industry · 7 min read

Silicon Personalities: Why Most AI Audiences Are Lying to You

Spencer Merrill

Every AI ad testing company will tell you they’ve built synthetic audiences. They’ll say they simulate your target buyer. They’ll show you scores, feedback, preferences—all generated by an AI pretending to be a 28-year-old mom in Dallas or a Gen Z sneakerhead in Brooklyn.

There’s just one problem: the personas are fake.

Not fake in the philosophical sense. Fake in the measurable, peer-reviewed, published-at-a-top-conference sense. A new study from the ACM Web Conference 2026 tested whether persona-conditioned LLMs can actually represent real people. The answer, for most implementations, is no.

The Paper That Should Worry Every AI Audience Company

“Assessing the Reliability of Persona-Conditioned LLMs as Synthetic Survey Respondents” (Taday Morocho et al., WWW ’26) ran over 70,000 persona-question instances using real U.S. microdata from the World Values Survey. They conditioned LLMs on multi-attribute demographic profiles—age, gender, income, education, religion, ethnicity, occupation—and compared the outputs against how those real people actually answered.

The findings are damning:

  • Persona prompting does not yield a clear aggregate improvement. In many cases, adding demographic conditioning made the model’s predictions worse than asking the same question with no persona at all.
  • Marginalized groups experience disproportionate distortions. The model doesn’t just get things slightly wrong for underrepresented demographics—it caricatures them. Error concentrates in exactly the subgroups that matter most for targeted advertising.
  • Most items show minimal change, while a small subset blows up. Persona effects are wildly inconsistent. The model might get age right but completely fabricate religious or ethnic response patterns.
  • Errors redistribute, not reduce. Demographic conditioning doesn’t make the model smarter about people. It just moves the mistakes around—sometimes into places where they cause more damage.
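To make "errors redistribute, not reduce" concrete, here's a minimal sketch of the kind of subgroup-level error comparison the paper's methodology implies. The records, subgroup labels, and answer scale are invented for illustration; this is not the paper's data or code.

```python
# Hypothetical sketch: compare prediction error with vs. without persona
# conditioning, broken down by demographic subgroup. Data is invented.
from collections import defaultdict

# (subgroup, true_answer, no_persona_pred, persona_pred) on a 1-5 scale
records = [
    ("majority", 3, 3, 3),
    ("majority", 2, 3, 2),
    ("minority", 4, 3, 1),  # persona conditioning makes this one worse
    ("minority", 5, 4, 2),
]

def mean_abs_error(rows, use_persona):
    errs = defaultdict(list)
    for group, truth, base, persona in rows:
        pred = persona if use_persona else base
        errs[group].append(abs(truth - pred))
    return {g: sum(v) / len(v) for g, v in errs.items()}

baseline = mean_abs_error(records, use_persona=False)
conditioned = mean_abs_error(records, use_persona=True)
# Persona conditioning can improve the majority subgroup while
# the minority subgroup's error balloons.
```

Running this on the toy data, the majority subgroup's error drops (0.5 to 0.0) while the minority subgroup's error triples (1.0 to 3.0): the aggregate number hides exactly the caricature effect the paper describes.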

The researchers put it plainly: LLM outputs under persona conditioning are “model-dependent artifacts rather than measurements of the world.”

What This Means for Ad Testing

Think about what most AI audience tools actually do. They take a demographic profile—say, “35-year-old female, household income $75K, interested in fitness”—and paste it into a system prompt. Then they ask the model to evaluate your ad creative “as” that person.

That’s persona prompting. That’s exactly what this paper tested. And that’s exactly what fails.

The model doesn’t have training data that represents how a specific demographic subgroup actually responds to advertising stimuli. It has internet text. It has stereotypes. It has patterns from Reddit threads and product reviews and marketing copy. When you ask it to “be” a 45-year-old Black woman evaluating a skincare ad, it’s not simulating her perspective—it’s generating a plausible-sounding response that conforms to whatever statistical patterns it absorbed during training.

The result? You get feedback that looks real. It’s articulate. It references the right concerns. It even uses appropriate language. But the underlying signal—whether this person would actually stop scrolling, actually click, actually buy—is noise dressed up as data.

Silicon Personalities Need Silicon Grounding

Here’s where it gets interesting. The paper doesn’t say synthetic audiences are impossible. It says the naive approach—prompt-and-pray persona conditioning—doesn’t work. The research community has known for years that when you ground LLM outputs in real data, the picture changes dramatically.

That’s the entire premise behind Kettio.

We don’t hand a model a demographic profile and ask it to roleplay. We ground your audience in real-world behavioral data first. Before we ever generate a synthetic response, we validate whether the model can accurately simulate your specific target audience. If it can’t—if the grounding data shows that the model’s representation of your audience diverges from reality—we tell you that upfront.

No other platform does this. They can’t, because they skipped the hard part.

Two Systems, Not One

Kettio doesn’t rely on a single persona-prompted response to score your ads. We built two distinct evaluation methodologies:

  1. Bradley-Terry Voting for CTR Prediction. Instead of asking one synthetic persona to rate your ad, we run a panel of voters that compare creatives head-to-head. The Bradley-Terry model aggregates pairwise preferences into a ranking that predicts which creative wins the click. Our panel approach hit 70.3% within-product pairwise accuracy—the first time we broke 70% on click-through prediction. No single persona prompt gets you there.
  2. Semantic Similarity Rating (SSR) for Audience Response. Our SSR pipeline doesn’t ask the model for a number. It generates a free-text behavioral reaction, embeds it, and scores it against calibrated anchor statements using semantic similarity. The scoring happens in embedding space, not in the model’s head. This means the model’s demographic biases get filtered through a measurement layer that’s grounded in real human response patterns.
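For readers unfamiliar with Bradley-Terry aggregation, here's a minimal, self-contained sketch of how pairwise votes turn into a ranking, using the classic MM fixed-point iteration. The vote counts and creative names are made up; this illustrates the statistical model, not Kettio's actual implementation.

```python
# Bradley-Terry strength estimation from pairwise votes via the
# standard MM (minorization-maximization) iteration. Votes are invented.
from collections import Counter

# (winner, loser) outcomes from a panel of voters comparing creatives
votes = ([("A", "B")] * 7 + [("B", "A")] * 3 +
         [("A", "C")] * 8 + [("C", "A")] * 2 +
         [("B", "C")] * 6 + [("C", "B")] * 4)

items = sorted({x for pair in votes for x in pair})
wins = Counter(w for w, _ in votes)
pair_counts = Counter(frozenset(p) for p in votes)  # comparisons per pair

strength = {i: 1.0 for i in items}
for _ in range(100):  # iterate to a fixed point
    new = {}
    for i in items:
        denom = sum(pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                    for j in items if j != i)
        new[i] = wins[i] / denom
    total = sum(new.values())
    strength = {i: s / total for i, s in new.items()}  # normalize

ranking = sorted(items, key=strength.get, reverse=True)
```

The point of the model is that it pools many noisy head-to-head judgments into one consistent strength scale, so no single voter's quirks decide the winner.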

Both systems are grounded. Both are validated against real-world outcomes. Neither relies on the naive persona prompting that this paper proved unreliable.
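The SSR mechanism described above can be sketched in a few lines. The bag-of-words "embedding" below is a deliberately crude stand-in for a real sentence encoder, and the anchor statements and calibrated scores are invented; only the scoring-in-embedding-space structure is the point.

```python
# Toy Semantic Similarity Rating: score a free-text reaction by its
# cosine similarity to calibrated anchor statements. The embed() here
# is a stand-in for a real sentence encoder; anchors are invented.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())  # crude bag-of-words vector

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm

anchors = {  # anchor statement -> calibrated score on a 1-5 intent scale
    "i would definitely stop and click on this ad": 5,
    "i might glance at this but probably keep scrolling": 3,
    "i would ignore this ad completely": 1,
}

def ssr_score(reaction):
    sims = {score: cosine(embed(reaction), embed(anchor))
            for anchor, score in anchors.items()}
    # similarity-weighted average of the calibrated anchor scores
    return sum(s * w for s, w in sims.items()) / sum(sims.values())

high = ssr_score("i would definitely stop and click on this ad")
low = ssr_score("i would ignore this ad completely")
```

Because the final number comes from distances to calibrated anchors rather than from the model asserting a rating, the measurement layer, not the model's persona, owns the scale.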

The Uncomfortable Truth About “AI Audiences”

Most companies selling AI audience testing are doing exactly what this research says doesn’t work. They’re conditioning a model on demographics, asking it to evaluate content, and packaging the output as consumer insight.

You can’t tell the difference from the outside. The reports look professional. The feedback sounds human. The scores have decimal places. But if the underlying method is ungrounded persona prompting, the results are—to use the technical term—model-dependent artifacts.

They’re not predictions. They’re hallucinations with confidence intervals.

What We Actually Validate

When you build an audience on Kettio, here’s what happens behind the scenes:

  1. Grounding check. We assess whether the model can reliably represent your target demographic for the specific evaluation task. Not all audiences are equally simulable—we’re honest about that.
  2. Platform-specific behavioral calibration. A TikTok shopper and an Instagram browser don’t evaluate ads the same way. We apply platform-specific behavioral heuristics that align the model’s vocabulary with real platform behavior patterns.
  3. Multi-voter aggregation. Single-persona responses are noisy. We aggregate across multiple synthetic voters using statistical models designed to extract signal from noisy pairwise comparisons.
  4. Embedding-space scoring. The final score doesn’t come from the LLM’s opinion. It comes from measuring the semantic distance between the model’s behavioral response and calibrated anchors derived from real consumer data.
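The grounding check in step 1 can be illustrated with a distribution-divergence test: compare how simulated respondents answer a benchmark question against how the real audience answered it, and flag the audience when the two distributions drift too far apart. The distributions and threshold below are assumptions for illustration, not Kettio's actual values or method.

```python
# Hypothetical grounding check: total variation distance between real
# and simulated answer distributions. All numbers are illustrative.

def total_variation(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# answer option -> probability (each distribution sums to 1)
real_responses = {"agree": 0.55, "neutral": 0.25, "disagree": 0.20}
simulated      = {"agree": 0.40, "neutral": 0.20, "disagree": 0.40}

DIVERGENCE_THRESHOLD = 0.15  # assumed tolerance for distribution drift

tv = total_variation(real_responses, simulated)
simulable = tv <= DIVERGENCE_THRESHOLD
# Here tv = 0.20 > 0.15, so this audience would be flagged as
# poorly simulable rather than silently scored.
```

The design choice worth noting: a check like this fails loudly before any creative gets scored, which is the opposite of the prompt-and-pray approach the paper critiques.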

Every step adds a layer of grounding that pure persona prompting doesn’t have.

The Bottom Line

The research is clear: telling an LLM to “be” your target customer doesn’t make it your target customer. The model’s internal representation of demographics is incomplete, biased, and inconsistent—especially for the specific subgroups that advertisers most need to reach.

Silicon personalities need silicon grounding. Without it, you’re not testing your ads against your audience. You’re testing them against a language model’s best guess about what your audience might be like.

Kettio was built to close that gap. We ground first, simulate second, and validate always.

Stop testing your ads against hallucinated audiences. Try Kettio and see what grounded synthetic feedback actually looks like.

Tags: synthetic audiences, AI personas, persona conditioning, research, creative testing, Bradley-Terry, audience grounding