A/B Test Ads with AI Agents: Champion-Challenger Testing Without Media Spend

Spencer Merrill · API Guide

Traditional A/B testing is expensive, slow, and requires real media spend. You create two ad variants, run them both with real budget, wait for statistical significance, and then kill the loser — after you've already paid to show it to thousands of people.

AI-powered A/B testing flips this model. You test creatives before they ever run, using synthetic audience evaluation to predict which one will win. No media budget. No waiting days for data. Results in seconds.

This guide covers Kettio's Champion-Challenger API — a statistically rigorous head-to-head comparison system that uses Bradley-Terry pairwise voting to determine winners with confidence intervals.

Champion-Challenger vs. Rank API: When to Use Which

Kettio offers two ways to compare creatives, and they serve different purposes:

| Feature | Rank API | Champion-Challenger |
|---|---|---|
| Purpose | Rank many creatives quickly | Rigorous head-to-head comparison |
| Assets | 1-20 assets | 2+ assets (1 champion vs. challengers) |
| Method | Absolute scoring with SSR | Bradley-Terry pairwise voting (6 votes) |
| Output | Scores + rankings | Win probability + confidence level |
| Best for | Screening many variants, finding top candidates | Final decision between 2-3 finalists |

The typical agent workflow: use the Rank API to screen 10+ variants down to 2-3 finalists, then use Champion-Challenger for the final decision with statistical confidence.

How Bradley-Terry Voting Works

The Champion-Challenger API doesn't just score each ad independently — it runs a blinded pairwise comparison. Here's the methodology:

  1. Blinding. The two creatives are shown to the synthetic audience without labels. The model doesn't know which is the "champion" and which is the "challenger." This eliminates anchoring bias.
  2. Balanced voting. Six independent votes are cast. Presentation order is rotated to prevent position bias (the tendency to prefer whichever option is shown first).
  3. Bradley-Terry model. Votes are aggregated using the Bradley-Terry probability model, which converts win/loss records into a continuous win probability. This is the same model used in chess rankings (Elo) and academic preference studies.
  4. Confidence classification. The win probability is classified into actionable confidence levels: high (clear winner), likely (probable winner), too-close (essentially tied), or inconclusive (not enough signal).
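The four steps above can be sketched as a voting loop. This is an illustrative outline, not Kettio's implementation: the `castVote` callback stands in for one blinded synthetic-audience judgment, and the add-one (Laplace) prior is one common smoothing choice, assumed here for illustration.

```javascript
// Illustrative sketch of balanced pairwise voting with a smoothed
// win-probability estimate. Not Kettio's implementation.
function estimateWinProbability(castVote, totalVotes = 6) {
  let challengerWins = 0;
  for (let i = 0; i < totalVotes; i++) {
    // Rotate presentation order on each vote to cancel position bias.
    const challengerFirst = i % 2 === 0;
    if (castVote(challengerFirst)) challengerWins++;
  }
  // Laplace-style Bayesian prior (add-one smoothing) keeps the estimate
  // away from 0 or 1 on small samples like n = 6.
  const winProbability = (challengerWins + 1) / (totalVotes + 2);
  return { wins: challengerWins, losses: totalVotes - challengerWins, winProbability };
}

// Example: a deterministic stub where the challenger wins 4 of 6 votes.
let call = 0;
const stubVote = () => [true, true, false, true, true, false][call++];
const result = estimateWinProbability(stubVote);
```

With 4 wins out of 6, the smoothed estimate is (4 + 1) / (6 + 2) = 0.625, slightly more conservative than the raw 4/6.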

Running a Champion-Challenger Test

Basic Example

curl -X POST https://kettio.com/api/champion-challenger \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "ads": [
      {
        "id": "current-hero",
        "imageUrl": "https://cdn.example.com/hero-current.png",
        "label": "Current Hero"
      },
      {
        "id": "variant-minimal",
        "imageUrl": "https://cdn.example.com/hero-minimal.png",
        "label": "Minimalist Variant"
      },
      {
        "id": "variant-bold",
        "imageUrl": "https://cdn.example.com/hero-bold.png",
        "label": "Bold Variant"
      }
    ],
    "championId": "current-hero",
    "evaluation_goal": "purchase-intent",
    "persona_data": {
      "description": "Budget-conscious parents aged 35-44, comparing products before buying"
    },
    "platform": "instagram"
  }'

Response

{
  "champion": {
    "id": "current-hero",
    "imageUrl": "https://cdn.example.com/hero-current.png"
  },
  "challengers": [
    {
      "challengerId": "variant-minimal",
      "winProbability": 0.72,
      "confidence": "likely",
      "votes": { "wins": 4, "losses": 2, "total": 6 },
      "recommendation": "The minimalist variant likely outperforms the current hero. Consider testing in a small live campaign."
    },
    {
      "challengerId": "variant-bold",
      "winProbability": 0.38,
      "confidence": "too-close",
      "votes": { "wins": 2, "losses": 4, "total": 6 },
      "recommendation": "No clear winner between the bold variant and the current hero. They perform similarly for this audience."
    }
  ]
}

Interpreting Results

| Confidence | Win Probability | Agent Action |
|---|---|---|
| high | 85%+ | Clear winner. Switch to the challenger. |
| likely | 65-84% | Probable winner. Run a small live test to confirm. |
| too-close | 40-64% | Essentially tied. Pick based on secondary factors (brand consistency, production cost, etc.). |
| inconclusive | <40% | Not enough signal. Try different audiences or goals. |
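The table above translates directly into a small helper an agent can branch on. The thresholds come from the table; the function name is ours.

```javascript
// Map a win probability to the confidence bands from the table above.
function classifyConfidence(winProbability) {
  if (winProbability >= 0.85) return 'high';
  if (winProbability >= 0.65) return 'likely';
  if (winProbability >= 0.40) return 'too-close';
  return 'inconclusive';
}

const band = classifyConfidence(0.72);
```

For the 0.72 win probability in the response above, this returns 'likely', matching the API's own classification of the minimalist variant.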

JavaScript Integration

interface ChampionChallengerRequest {
  ads: Array<{
    id: string;
    imageUrl: string;
    ssrScore?: number;
    label?: string;
  }>;
  championId?: string;
  evaluation_goal: string;
  persona_data?: { description: string };
  platform?: string;
}

interface ChallengerResult {
  challengerId: string;
  winProbability: number;
  confidence: 'high' | 'likely' | 'too-close' | 'inconclusive';
  votes: { wins: number; losses: number; total: number };
  recommendation: string;
}

async function runChampionChallenger(
  token: string,
  request: ChampionChallengerRequest
): Promise<{ champion: any; challengers: ChallengerResult[] }> {
  const response = await fetch('https://kettio.com/api/champion-challenger', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${token}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(request),
  });

  if (!response.ok) throw new Error(`API error: ${response.status}`);
  return response.json();
}
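In production agents, a thin retry wrapper around this call is often worthwhile, since any HTTP API can return transient 429/5xx errors. This is a generic sketch, not part of Kettio's SDK.

```javascript
// Generic retry helper with exponential backoff for transient failures.
// Not part of the Kettio API; a common pattern around calls like
// runChampionChallenger above.
async function withRetries(fn, attempts = 3, baseDelayMs = 500) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Backoff: 500 ms, 1 s, 2 s, ...
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}

// Usage:
// const result = await withRetries(() => runChampionChallenger(token, request));
```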

Agent Pattern: Screen → Test → Decide

Here's the full workflow an AI marketing agent should follow when choosing the best creative from a large set:

async function findBestCreative(apiKey, adUrls, audience, goal) {
  // Phase 1: SCREEN — Rank all candidates to find top 3
  const screenResult = await fetch('https://kettio.com/api/v1/rank', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      assets: adUrls.map((url, i) => ({ url, id: `ad-${i}` })),
      audience,
      goal
    })
  }).then(r => r.json());

  const topThree = screenResult.ranked.slice(0, 3);
  console.log('Top 3 candidates:', topThree.map(a => a.asset_id));

  // Phase 2: TEST — Run Champion-Challenger on the finalists
  const testResult = await fetch('https://kettio.com/api/champion-challenger', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      ads: topThree.map(a => ({
        id: a.asset_id,
        imageUrl: a.asset_url,
        ssrScore: a.score
      })),
      championId: topThree[0].asset_id,
      evaluation_goal: goal,
      persona_data: { description: audience.description }
    })
  }).then(r => r.json());

  // Phase 3: DECIDE — Act on the confidence level
  const bestChallenger = testResult.challengers
    .filter(c => c.confidence === 'high' || c.confidence === 'likely')
    .sort((a, b) => b.winProbability - a.winProbability)[0];

  if (bestChallenger) {
    console.log(`Switch to ${bestChallenger.challengerId} (win prob: ${bestChallenger.winProbability})`);
    return bestChallenger.challengerId;
  } else {
    console.log(`Keep champion: ${topThree[0].asset_id}`);
    return topThree[0].asset_id;
  }
}

This three-phase approach is both efficient (screening many candidates with the cheaper Rank API) and rigorous (making the final call with pairwise voting).

Testing Across Multiple Audiences

A creative might beat the champion for one audience segment but lose for another. Test across your key segments:

const audiences = [
  { name: 'Core audience', description: 'Budget-conscious parents, 35-44' },
  { name: 'Expansion', description: 'Young professionals, 25-34, urban' },
  { name: 'Lookalike', description: 'Suburban homeowners, 45-54, moderate income' },
];

const crossAudienceResults = await Promise.all(
  audiences.map(audience =>
    runChampionChallenger(token, {
      ads: [
        { id: 'current', imageUrl: currentAdUrl },
        { id: 'challenger', imageUrl: newAdUrl }
      ],
      championId: 'current',
      evaluation_goal: 'purchase-intent',
      persona_data: { description: audience.description }
    })
  )
);

for (let i = 0; i < audiences.length; i++) {
  const result = crossAudienceResults[i];
  const challenger = result.challengers[0];
  console.log(
    `${audiences[i].name}: ${challenger.confidence} (${(challenger.winProbability * 100).toFixed(0)}% win)`
  );
}
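To turn those per-segment results into one rollout decision, an agent needs an aggregation rule. The rule below is our convention, not an API feature: promote the challenger only if it is at least "likely" to win on every segment.

```javascript
// Aggregate per-audience Champion-Challenger results into one decision.
// This conservative rule is a convention, not part of the Kettio API:
// promote only if the challenger wins with at least "likely" confidence
// on every audience segment.
function shouldPromoteChallenger(results) {
  return results.every(result => {
    const challenger = result.challengers[0];
    return challenger.confidence === 'high' || challenger.confidence === 'likely';
  });
}

// Example with hand-written results in the response shape shown earlier:
const decision = shouldPromoteChallenger([
  { challengers: [{ confidence: 'high', winProbability: 0.9 }] },
  { challengers: [{ confidence: 'likely', winProbability: 0.7 }] },
]);
```

A looser rule (e.g. "likely on the core audience, at worst too-close elsewhere) may suit expansion campaigns; the point is to decide the rule before running the tests.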

When to Use Champion-Challenger

  • Before launching a new campaign — test your top 2-3 creative options against each other before committing budget.
  • When refreshing creative — test the new version against the current winner to make sure you're actually improving.
  • When the Rank API scores are close — if two creatives score within 0.5 points of each other, Champion-Challenger gives you a more decisive answer.
  • For high-stakes campaigns — when the budget is large enough that picking the wrong creative has real financial consequences.

Statistical Methodology

For those who want to understand the math: the Champion-Challenger API uses the Bradley-Terry model, the same probabilistic framework behind Elo ratings in chess and the "arena" comparisons used to evaluate LLMs.

Given k wins out of n comparisons, the estimated win probability is:

P(challenger wins) = k / n, adjusted by a Bayesian prior to handle small sample sizes

With 6 votes and presentation-order balancing, the system achieves enough statistical power to distinguish between creatives with meaningfully different appeal while keeping costs low (each comparison uses a small number of evaluation credits).

The confidence levels map to these probability ranges to give agents clear action triggers rather than raw probabilities that require human interpretation.
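The Elo connection can be made concrete. In its general form, Bradley-Terry assigns each item a latent strength s, with P(A beats B) = s_A / (s_A + s_B); Elo is the special case where s = 10^(rating/400). The numbers below are illustrative, not Kettio outputs.

```javascript
// Bradley-Terry in strength-parameter form, and its Elo equivalent.
// Illustrative only; these are standard formulas, not Kettio internals.
function bradleyTerry(strengthA, strengthB) {
  return strengthA / (strengthA + strengthB); // P(A beats B)
}

function eloWinProbability(ratingA, ratingB) {
  // Elo's expected score is Bradley-Terry with s = 10^(rating / 400).
  return 1 / (1 + 10 ** ((ratingB - ratingA) / 400));
}
```

For example, a creative twice as "strong" as its rival wins 2/3 of the time, and two equally rated players each have a 0.5 expected score.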

Stop Guessing, Start Testing

Run AI A/B tests on your ad creatives before spending a dollar on media.

Get Started Free →

Frequently Asked Questions

How many credits does a Champion-Challenger test cost?

Each A/B comparison uses 6 evaluation votes. The exact credit cost depends on the number of challengers — each challenger is compared against the champion independently.

Can I test more than 2 creatives at once?

Yes. You can include multiple challengers in a single request. Each challenger is independently compared against the champion using the same Bradley-Terry methodology.

How does this compare to real A/B testing?

AI A/B testing predicts which creative will win based on synthetic audience evaluation. It's faster and cheaper than real A/B tests, but it's a prediction, not a measurement of actual performance. Use it to filter out clearly inferior options before committing real budget. For high-confidence results, you can still run a small live test to validate.

What does "too-close" confidence mean?

It means the two creatives perform similarly for this audience and goal. Neither is clearly better. You can safely pick either one, or differentiate based on other factors like brand consistency, production cost, or performance on different audience segments.

Can I specify the platform for testing?

Yes. Pass the platform field (e.g., "instagram", "facebook", "tiktok") to contextualize the evaluation. The platform affects how the synthetic audience evaluates the creative — an ad optimized for TikTok will be judged differently than one for Facebook.

Tags: AI A/B testing, ad testing, champion-challenger, Bradley-Terry, creative testing, ad optimization