compere: Multi-Armed Bandit Ranking with Fewer Comparisons

How compere uses bandit algorithms to rank items with minimal pairwise comparisons — applications in search evaluation, recommendation, and tournament design, vs Bradley-Terry and TrueSkill.

The problem

You have n items and you want to rank them. You can run pairwise comparisons (item A is better than item B) but each comparison costs money, time, or human attention. How do you find the correct ranking with the fewest comparisons?

The naive approach is O(n²): compare every pair. For 100 items, that’s 4,950 comparisons. For 1,000 items, that’s 499,500.

The bandit approach is O(n log n): actively choose which pair to compare next, based on what you’ve learned so far. For 100 items, that’s ~700 comparisons. For 1,000 items, that’s ~10,000. A 50-100x reduction.

compere is the bandit approach, packaged as a Python library.

What compere is

compere is a Python library that implements multi-armed bandit algorithms for the active ranking problem: choose which pair of items to compare next to minimise the total number of comparisons needed to find the correct ranking.

The core API is:

from compere import Ranker, BradleyTerryModel

ranker = Ranker(
    model=BradleyTerryModel(),
    algorithm="dueling_bandits",  # or "borda_bandits", "plackett_luce_bandits"
)

# Add items
for item in items:
    ranker.add_item(item.id)

# Active comparison loop
while not ranker.is_confident():
    i, j = ranker.choose_pair()  # bandit picks the best pair
    winner = get_human_judgment(i, j)  # or model, or A/B test, etc.
    ranker.update(i, j, winner)

# Get the ranking
ranking = ranker.ranking()

The library also includes a Bradley-Terry model for fitting the comparison data, and a Plackett-Luce model for partial rankings.

The competitive landscape

ApproachProjectComparisons neededBest for
Naive O(n²)(manual)n(n-1)/2Tiny n, exact
Bradley-Terry(scikit-crowd, choice-learn)O(n²) samples, O(n²) fitOff-line, many samples
TrueSkill / TrueSkill-2trueskill, openskillO(n²) matchesCompetitive games
Elo(implementations everywhere)O(n²)Chess, simple ratings
Plackett-Luce(research)O(n²)Partial rankings
Active ranking (bandit)compereO(n log n)Limited budget, online

compere is the active ranking option. It assumes the budget for comparisons is the bottleneck (human labelling cost, A/B test opportunity cost, or expert attention).

The math, briefly

A dueling-bandit algorithm maintains a confidence interval on the relative rank of every pair. At each step, it picks the pair whose confidence interval is widest (the one we know least about) and updates. When all confidence intervals are below a threshold, we stop.

The guarantees: with high probability, the algorithm returns the correct ranking in O(n log n) comparisons. Naive O(n²) is provably worse in the worst case.

The catch: the constants in O(n log n) are larger than naive. For very small n (say, n < 10), naive may be faster. For n > 50, bandit is consistently 5-50x faster.

When compere is the right answer

  • Search evaluation. You have 100 search result variants and want to know which ranks best. Humans labelling pairwise costs money. Use compere.
  • Recommendation A/B tests. You have 50 candidate recommenders and want to pick the top 5. Random A/B testing is O(n²) opportunity cost. Use compere.
  • Tournament design. You have 64 players and want a fair bracket with the minimum number of matches. Use compere.
  • LLM output ranking. You have 100 LLM completions and want to pick the best. Use compere with the LLM as the comparator (cheaper than human labels).

When compere is NOT the right answer

  • The budget for comparisons is unlimited. Then naive O(n²) is simpler.
  • You have fewer than ~20 items. Then the constants matter more than the asymptotics.
  • The items have features you can use to predict ranking a priori. Then a learning-to-rank model (LambdaMART, ListNet) is better.
  • The comparison outcomes are not iid (e.g. positional bias, order effects). Then you need a different model (Plackett-Luce with positional correction).

A concrete example: LLM output ranking

Imagine you’re comparing 30 prompt templates for a production LLM application. You want to find the top 5.

from compere import Ranker, BradleyTerryModel
import openai

ranker = Ranker(model=BradleyTerryModel(), algorithm="dueling_bandits")
for i, prompt in enumerate(prompts):
    ranker.add_item(f"prompt_{i}")

judging_prompt = """
Which output is better for the task "{task}"?
Output A: {a}
Output B: {b}
Answer with just "A" or "B".
"""

while not ranker.is_confident(confidence=0.95):
    i, j = ranker.choose_pair()
    a = openai.ChatCompletion.create(model="gpt-4o", messages=[{"role": "user", "content": prompts[i].format(task=task)}])
    b = openai.ChatCompletion.create(model="gpt-4o", messages=[{"role": "user", "content": prompts[j].format(task=task)}])
    response = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judging_prompt.format(task=task, a=a, b=b)}]
    )
    winner = "A" if "A" in response.choices[0].message.content else "B"
    ranker.update(f"prompt_{i}", f"prompt_{j}", winner)

ranking = ranker.ranking()
print("Top 5 prompts:", ranking[:5])

With 30 prompts and a 95% confidence target, compere needs ~80-120 LLM comparisons. The naive approach would need 435. That’s a 4-5x reduction in LLM cost.