LMArena
Platforms & Tools

A crowdsourced platform where users compare AI models head-to-head in blind conversations, producing Elo-style rankings that reflect real human preferences.
LMArena (formerly LMSYS Chatbot Arena) is a crowdsourced evaluation platform where users interact with two anonymous AI models simultaneously and vote on which response they prefer. With over 6 million votes collected, it produces rankings based on real human preferences rather than automated benchmarks, making it one of the most widely cited measures of LLM quality in practice.
The platform works by presenting users with two anonymous model responses to the same prompt. After reading both responses, users vote for their preferred one or declare a tie. The models' identities are revealed only after voting, which prevents brand bias. These pairwise comparisons are aggregated using the Bradley-Terry model (a statistical model closely related to the Elo rating system used in chess) to produce stable rankings with confidence intervals. The shift from online Elo updates to Bradley-Terry yields more stable ratings and scales better to the growing number of models being evaluated.
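To make the aggregation step concrete, here is a minimal sketch of fitting Bradley-Terry strengths from pairwise win counts using the classic iterative (Zermelo/MM) maximum-likelihood update. This is an illustration of the general technique, not LMArena's actual implementation; the function name, data format, and iteration count are assumptions, and ties are ignored for simplicity.

```python
import math
from collections import defaultdict

def bradley_terry(wins, n_iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] = number of times model a beat model b.
    Under the model, P(a beats b) = p_a / (p_a + p_b).
    Returns strengths normalized so their geometric mean is 1.
    """
    models = set()
    for a, b in wins:
        models.update((a, b))
    p = {m: 1.0 for m in models}

    # Total wins per model and total games per unordered pair.
    total_wins = defaultdict(float)
    games = defaultdict(float)
    for (a, b), count in wins.items():
        total_wins[a] += count
        games[frozenset((a, b))] += count

    for _ in range(n_iters):
        new_p = {}
        for i in models:
            # MM update: p_i = W_i / sum_j n_ij / (p_i + p_j)
            denom = 0.0
            for pair, n in games.items():
                if i in pair:
                    j = next(m for m in pair if m != i)
                    denom += n / (p[i] + p[j])
            new_p[i] = total_wins[i] / denom if denom else p[i]
        # Normalize the geometric mean to 1 to fix the scale.
        g = math.exp(sum(math.log(v) for v in new_p.values()) / len(new_p))
        p = {m: v / g for m, v in new_p.items()}
    return p
```

The fitted strengths can be mapped onto a familiar Elo-like scale via `rating = 1000 + 400 * math.log10(p[m])`, since both models share the same logistic form for win probability. For example, if model A beats model B in 70 of 100 battles, the fit recovers an implied win probability of about 0.7 for A.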
LMArena is valued because it captures aspects of model quality that automated benchmarks miss, such as writing style, helpfulness, nuance, and instruction following. However, it also has limitations: votes skew toward longer, more detailed responses regardless of accuracy, and the user population may not represent all use cases. Despite these caveats, LMArena rankings have become a de facto industry standard, frequently referenced in model release announcements and competitive comparisons between AI labs.
Last updated: February 26, 2026