Articles

How Prolific is building the world’s most user-centered LLM leaderboard

Dr Andrew Gordon
April 22, 2025

Prolific's AI User Experience Leaderboard is built on an evaluation framework that uses representative human participants to assess model performance on practical tasks. Unlike technical benchmarks that focus primarily on model capabilities, our methodology measures how real people experience AI models when completing everyday goals.

Here, we explain the methodology behind the leaderboard and why it represents the future of AI evaluation.

Discover the leaderboard

A different kind of evaluation

Most LLM comparisons rely on technical tests, expert reviewers, or narrowly defined academic benchmarks. These are useful, but they don't tell the full story. What really matters is how well an AI model helps someone with day-to-day tasks. 

So we have built an AI leaderboard, one that captures how people actually experience and evaluate AI models.

The study design

We ran a large-scale study with 514 participants, recruited to match the US population. Everyone gave informed consent and provided comprehensive demographic information designed to facilitate our modelling approach (specifically: political affiliation, age bracket, sex, ethnicity, education level, and urbanicity) before starting.

Each participant completed six tasks, one at a time. For each task, they were randomly assigned one of six LLMs (participants were blind to which model they were using) and were required to use the LLM to complete the task with at least four interactions. After completing the task, they rated the model across seven key criteria: helpfulness, communication, understanding, adaptiveness, trustworthiness, personality, and background/culture. They also rated how likely they would be to use the model again, along with 21 ‘sub-items’ (see ‘Evaluation metrics’).

The assignment of tasks to models was randomized at the individual level to avoid any potential task-LLM bias. The randomization was constrained: each task was randomly assigned to a model, but under the condition that all six tasks were present in every study session.

This meant that the allocation of tasks to LLMs was not perfectly balanced, but after running the study we found that the coefficient of variation (Abdi, 2010), which measures variability in task distribution across models, was very low (0.0958). We therefore remain confident that any performance differences reflect the models themselves, not the tasks or participants.
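As a rough sketch of what this balance check involves (the assignment rule, labels, and counts below are illustrative assumptions, not the study's actual code), the coefficient of variation is simply the standard deviation of the task-model cell counts divided by their mean:

```python
import random
import numpy as np

MODELS = ["DeepSeek-R1", "GPT-4o", "Claude 3.7 Sonnet", "o1",
          "Llama-3.1-405B-Instruct", "Gemini-2.0-Flash-001"]
TASKS = ["job email", "meal plan", "travel itinerary",
         "day trading explainer", "gift idea", "tech purchase"]

def assign_models(tasks, models):
    """One plausible reading of the constrained randomization: every participant
    sees all six tasks, and each task independently draws a random model."""
    return {task: random.choice(models) for task in tasks}

# Simulate 514 participants and count how often each task-model pairing occurs.
counts = {(t, m): 0 for t in TASKS for m in MODELS}
for _ in range(514):
    for task, model in assign_models(TASKS, MODELS).items():
        counts[(task, model)] += 1

cell_counts = np.array(list(counts.values()))
cv = cell_counts.std() / cell_counts.mean()  # coefficient of variation (Abdi, 2010)
print(f"Coefficient of variation across task-model cells: {cv:.4f}")
```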

The models we used

We tested six leading LLMs:

  • DeepSeek-R1 (DeepSeek)
  • GPT-4o (OpenAI)
  • Claude 3.7 Sonnet (Anthropic)
  • o1 (OpenAI)
  • Llama-3.1-405B-Instruct (Meta)
  • Gemini-2.0-Flash-001 (Google)

All models were accessed via openrouter.ai using Dynabench. We selected them because they represent a broad mix of cutting-edge proprietary systems and top open-source alternatives. Together, they power most AI applications available today, which makes them ideal for real-world evaluation.

All model temperatures were set to 1 to strike an effective balance between creativity and determinism, and to mirror the default state that most real-world users experience when interacting with these models through a standard UI. To allow the models some liberty in their verbosity, we set a minimum token limit of 50 and a maximum of 5,000. The upper bound was set high enough to ensure that the reasoning models tested (which tend to produce longer output) were not constrained.
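For concreteness, here is a minimal sketch of what a single call with these settings could look like through OpenRouter's OpenAI-compatible API (the model slug, prompt, and surrounding code are our own illustrative assumptions, not the study's actual harness):

```python
import os
import requests

# Minimal sketch of a single model call with the settings described above.
# The endpoint and fields follow OpenRouter's OpenAI-compatible chat completions schema.
API_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

payload = {
    "model": "openai/gpt-4o",   # one of the six models under test
    "messages": [{"role": "user", "content": "Draft a follow-up email to a hiring manager."}],
    "temperature": 1,           # balance between creativity and determinism
    "max_tokens": 5000,         # generous ceiling so reasoning models are not cut off
    # The study also enforced a 50-token minimum on responses; how that floor was
    # applied (e.g. in the serving layer) is not specified, so it is omitted here.
}

response = requests.post(API_URL, headers=HEADERS, json=payload, timeout=120)
print(response.json()["choices"][0]["message"]["content"])
```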

The tasks

The tasks were designed to reflect common use cases for the general public. Each one encouraged natural interaction and allowed participants to judge the models in the context of real-world interactions. All tasks were presented with minimal constraints and participants were encouraged to be flexible in how they approached each task and to personalize it to the same degree they would if they were completing the task unprompted. 

We believe that this approach enhances ecological validity by allowing us to observe how models perform in naturalistic interaction contexts rather than under highly controlled but artificial conditions.

The tasks were selected to represent how people from all backgrounds and demographics actually use AI in their daily lives, meaning our evaluation captures the experiences that matter to everyone, not just technical specialists or early adopters. To develop the tasks we drew on data from the WildChat dataset (Zhao et al., 2024), Statista, and several other recent papers (Thompson, 2024; Wang et al., 2023) that have analysed large numbers of interactions with leading models and categorised the most popular types of prompt. Here’s a quick overview:

Following up on a job application

Task: Imagine you have recently had a job interview for a role at a company you're really excited about. At the end of the interview, they told you that you'd hear back by last Friday. It's now Tuesday, and you haven't received any update. You therefore need to craft a follow-up email to enquire about the status of your application. 

Use the LLM to draft a follow-up email to the hiring manager. Make sure to continue refining the email by making it more aligned with your personality and by incorporating details from the job specification or the interview (feel free to use your imagination). Keep refining the email until it reaches a state that you would feel comfortable sending.

Planning a week of meals

Task: Imagine that you've just been diagnosed with a food intolerance (e.g., gluten intolerance or lactose intolerance - or any others, it's up to you), and it's thrown off your usual eating habits. You therefore need to rethink your typical diet to avoid certain foods while still making sure your food is satisfying and nutritionally balanced. 

Use the LLM to generate a one-week meal plan for you that avoids the foods that you have an intolerance for, keeping in mind your real budget and food preferences. Keep refining the output by asking for clarifications, swapping out items you don't like, ensuring variety, and making sure it fits around your schedule. Keep adjusting until you have a meal plan that you would feel confident following.

Creating a travel itinerary

Task: Imagine that you're planning to go on a European city break for up to five days in a few months' time. You want to make the most of your trip by making sure you see all the main sights, but also have a vacation that is suited to your personal pace.

Use the LLM to help you plan your vacation. It is up to you to pick the destination, the length of the stay, and what you want to do, but you should leverage the LLM to suggest ideas, create itineraries, and give you feedback on your plans. 

Continue to use the LLM to make sure that the plan aligns with a workable budget for yourself, by adding or removing activities based on your interests, and making sure the schedule is realistic (e.g., enough time between locations). Keep iterating until you have an itinerary that you would be excited to use.

Understanding a complex topic

Task: Imagine that you've recently become interested in day trading, but every time you try to learn about it, the explanations seem overly technical and filled with financial jargon. You'd like to learn whether it could be something you are interested in doing but you don't know where to start. 

You therefore need a clear explanation of the topic that makes it more approachable. Use the LLM to help you understand the basics of day trading in simple terms. Once you receive an initial explanation, interrogate the LLM further by requesting clarifications on anything unclear, or by asking for real-world examples to illustrate key concepts (whatever helps you understand it better). 

Keep asking follow-up questions until you feel like you have a solid foundational understanding of day trading and could confidently explain the basics to someone else.

Generating a creative idea

Task: Imagine that your best friend is about to have their birthday, and you want to do something special to celebrate. You've given them typical gifts in the past, but this time, you want something unique and meaningful. Think of a real friend that you have and use the LLM to generate some creative ideas for birthday gifts or surprises for them. 

Give the LLM context on who your friend is, their likes, dislikes, etc. and make sure to use a realistic budget. Keep interacting with the LLM by refining its ideas until you settle on an idea that feels truly special and that you would be happy to purchase or provide to your friend.

Making a decision between options

Task: Imagine that you're in the market for a new piece of tech; this could be a phone, laptop, earphones, or any tech that you use regularly. Select an item that you are interested in purchasing and then use the LLM to find out more about the top products in that category that would be a good fit for you personally.

Make sure to use the LLM to compare the top products based on what you would personally find useful or prefer, and use a realistic budget. Once you receive some comparisons, refine your shortlist by asking for things like real-world user experiences, potential drawbacks, or recommendations based on your specific habits. Keep refining until you settle on a specific product that you would feel confident purchasing.

Evaluation metrics

Once a task was complete, participants rated the LLM across seven key metrics. We chose these metrics on the basis of published research demonstrating the importance of each one in high-quality human-AI interactions. Each core criterion was rated on a one-to-seven Likert scale, from extremely poor to extremely good (with the exception of the repeat-usage question, which was rated from very unlikely to very likely).

We also included a set of three sub-items after each core criterion, designed to explore why the overall criterion rating was given. These sub-items were scored on a one-to-five scale with the end points varying for each item, although the polarity was kept constant such that 1 = poor performance and 5 = good performance.

The seven key metrics form the basis for the overall scores on the leaderboard (and therefore how the models are ranked). We provide the data from the sub-items and the repeat usage question separately as they do not factor into the overall scores. 

This approach provides a more holistic view of AI performance than single-metric evaluations, helping us understand not just which model performs best overall on core factors that are important for human-AI interaction, but also why those overall ratings were given.

Below are the questions we asked:

Helpfulness (Grice, 1975; Zamani & Croft, 2020; Pasch & Ha, 2025; Gao et al., 2023)

Overall, how would you rate the helpfulness of this LLM?

  • How effectively did the model help you accomplish your specific goal?
  • How comprehensive was the model's response in addressing all aspects of your request?
  • How useful were the model's suggestions or solutions for your needs?

Communication (Clark, 1996; Brennan & Clark, 1996; Rezwana & Maher, 2022)

Overall, how would you rate the communication of this LLM?

  • How well did the model match its tone and language style to the context of your interaction?
  • How natural and conversational were the model's responses?
  • How appropriate was the level of detail and technical language for your needs?

Understanding (Winograd, 1972; Searle, 1980; Abedin et al., 2022; Ferrada & Camarinha-Matos, 2024)

Overall, how would you rate the understanding of this LLM?

  • How accurately did the model interpret your initial request?
  • How well did the model maintain context throughout the conversation?
  • How well did the model pick up on implicit aspects of your request without requiring explicit explanation?

Adaptiveness (Pickering & Garrod, 2004; Branigan & Pearson, 2022)

Overall, how would you rate the adaptiveness of this LLM?

  • How effectively did the model adjust its responses based on your feedback?
  • How well did the model clarify ambiguities or misunderstandings?
  • How well did the model build upon previous exchanges in the conversation?

Trustworthiness (Hancock et al., 2020; Luger & Sellen, 2016)

Overall, how would you rate the trustworthiness of this LLM?

  • How consistent were the model's responses across similar questions?
  • How confident were you in the accuracy of the model's information?
  • How transparent was the model about its limitations or uncertainties?

Personality (Nass & Moon, 2000; Cassell et al., 1999; Kruijssen & Emmons, 2025; León-Domínguez et al., 2024)

How good was the model at displaying a distinct and recognizable personality?

  • How consistent was the LLM's personality?
  • How well-defined was the LLM's personality?
  • How much did the LLM respond in a way that aligned with your expectations of honesty, empathy, or fairness?

Culture and representation (Kashyap & Hullman, 2023; Blodgett et al., 2020)

Overall, how well do you feel that the LLM understood your background and culture in its responses?

  • How aligned with your culture, viewpoint, or values was the LLM?
  • How well did the LLM recognize when your cultural perspective was relevant?
  • How free from stereotypes or bias was the LLM's response?

And finally, they answered a key question: Would you use this model again?

Analysis protocol

To increase the representativeness of our data and estimate how the US population would rate the performance of different AI models, we used a statistical technique called Multilevel Regression with Poststratification (MRP: Wang, Rothschild, Goel & Gelman, 2015). This approach helps us account for demographic representation and draw conclusions about the broader population from our sample.

In advance of the data modelling, we performed several data cleaning steps to ensure quality. 

We verified that each participant completed all six model evaluations and removed instances where multiple model responses were missing. We also removed any interactions where a model failed to respond to two or more user messages.

We standardized demographic variables, recoding ethnicity categories into consistent groups ("White," "African American," "Hispanic," and "Asian") and consolidating education levels into "College" and "No College." Participants identifying as "Other" for ethnicity or gender were excluded due to insufficient census-level information for post-stratification.
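A minimal sketch of this recoding step, assuming a flat table of ratings with one row per participant-task pair (column names and category labels are ours, not the study's):

```python
import pandas as pd

# Illustrative recoding of the demographic variables.
df = pd.read_csv("ratings.csv")  # hypothetical export of the survey data

ethnicity_map = {
    "White": "White", "Black or African American": "African American",
    "Hispanic or Latino": "Hispanic", "Asian": "Asian",
}
df["ethnicity"] = df["ethnicity"].map(ethnicity_map)  # unmapped values become NaN
df["education"] = df["education"].map(
    lambda e: "College" if "degree" in str(e).lower() else "No College"
)

# Drop participants whose ethnicity or gender cannot be matched to census-level cells.
df = df[df["ethnicity"].notna() & df["sex"].isin(["Male", "Female"])]
```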

We then created a detailed population profile based on six key demographics: education level (college vs. no college), age (six groups from 18-24 to 65+), urbanicity (urban/suburban/rural location), sex (male/female), political leaning (Democrat, Republican, Independent), and ethnicity (White/African American/Hispanic/Asian). This profile, drawn from US voter files via TargetSmart, shows how many Americans fall into each possible demographic combination.
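Structurally, the population profile is just the full cross of these six demographics with a population count attached to each cell. A sketch of that frame (the counts are placeholders; the real counts come from the TargetSmart voter file):

```python
import itertools
import pandas as pd

# One row per demographic cell, with the number of Americans in that cell.
levels = {
    "education": ["College", "No College"],
    "age": ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"],
    "urbanicity": ["Urban", "Suburban", "Rural"],
    "sex": ["Male", "Female"],
    "party": ["Democrat", "Republican", "Independent"],
    "ethnicity": ["White", "African American", "Hispanic", "Asian"],
}
frame = pd.DataFrame(list(itertools.product(*levels.values())), columns=list(levels))
frame["N"] = 0  # placeholder population count per cell, filled from the voter file
print(len(frame))  # 2 x 6 x 3 x 2 x 3 x 4 = 864 demographic cells
```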

For analysis, we built statistical models (either ordinal logistic or multinomial regression) for each AI model and rating dimension to predict how people with different demographic characteristics would respond to our questions. Ordinal logistic regression was applied by default unless response categories lacked sufficient observations, in which case we used multinomial logistic regression. Model selection was determined algorithmically based on response distributions. 
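A simplified sketch of that selection rule using statsmodels (the sparsity threshold, column names, and dummy coding are assumptions for illustration, not the study's exact specification):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.miscmodels.ordinal_model import OrderedModel

MIN_PER_CATEGORY = 5  # assumed sparsity threshold; the exact rule is not stated

def fit_rating_model(data: pd.DataFrame, rating_col: str):
    """Fit one statistical model for a single AI model and rating dimension."""
    X = pd.get_dummies(
        data[["education", "age", "urbanicity", "sex", "party", "ethnicity"]],
        drop_first=True,
    ).astype(float)
    y = data[rating_col]  # 1-7 Likert rating
    if y.value_counts().min() >= MIN_PER_CATEGORY:
        # Default: ordinal logistic regression (proportional odds)
        return OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)
    # Fallback when some response categories are too sparse
    return sm.MNLogit(y, sm.add_constant(X)).fit(disp=False)
```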

To generate more robust estimates we then implemented parametric bootstrapping of the model coefficients to properly quantify the uncertainty in our estimates. Specifically, we:

  • Created 1,000 simulations of each statistical model by drawing from the uncertainty distribution of the model parameters
  • Predicted response probabilities for all demographic combinations in each simulation
  • Calculated an Expected Value (EV) score (0-100) for each demographic group in each simulation. This score is a single value that represents the distribution of responses on the scale.
  • Determined the mean estimates, standard errors, and 95% confidence intervals from these simulations

This simulation-based approach ensured that uncertainty was properly propagated through all stages of our analysis, from initial model estimation to final population-level metrics.
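Here is a sketch of how that bootstrap could be implemented for an ordinal model like the one above. The multivariate normal approximation to the coefficient distribution and the 1-7 rescaling are our assumptions about the details; it also assumes all seven response categories were observed.

```python
import numpy as np

def bootstrap_evs(result, cells, n_sims=1000, scale_points=7, seed=0):
    """Parametric bootstrap of a fitted ordinal model.
    `result` is a fitted model as in the previous sketch; `cells` is the design
    matrix for every demographic combination in the poststratification frame."""
    rng = np.random.default_rng(seed)
    # Draw coefficient vectors from the estimated sampling distribution.
    draws = rng.multivariate_normal(np.asarray(result.params),
                                    np.asarray(result.cov_params()), size=n_sims)
    scale_values = np.arange(1, scale_points + 1)             # Likert points 1..7
    ev_sims = np.empty((n_sims, len(cells)))
    for i, params in enumerate(draws):
        probs = result.model.predict(params, exog=cells)       # cell-by-category probabilities
        expected = probs @ scale_values                        # expected Likert response per cell
        ev_sims[i] = (expected - 1) / (scale_points - 1) * 100 # rescale to a 0-100 EV
    # Point estimate, standard error, and 95% interval per demographic cell.
    return (ev_sims.mean(axis=0), ev_sims.std(axis=0),
            np.percentile(ev_sims, [2.5, 97.5], axis=0))
```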

For demographic aggregation, we computed group-specific weighted averages by summing the product of predicted probabilities and population counts, then dividing by the total population count for that group. We also calculated national averages by applying the same weighting procedure across the entire population dataset. Statistical uncertainty was propagated through this aggregation by appropriately combining the standard errors from our simulations.

The resulting estimates provide interpretable EVs on a 0-100 scale, all accompanied by confidence intervals that reflect our statistical certainty about each estimate. This comprehensive approach gives us a nuanced picture of population preferences while accounting for demographic representation and statistical uncertainty.

To calculate our ‘overall’ score we take the unweighted average of the EVs for our seven key metrics, both for the national-level data and for each demographic subgroup.
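Putting the last two steps together, a compact sketch of the population weighting and the overall-score calculation (the cell counts and EVs below are placeholders; the metric names match the seven criteria above):

```python
import numpy as np
import pandas as pd

KEY_METRICS = ["helpfulness", "communication", "understanding", "adaptiveness",
               "trustworthiness", "personality", "culture"]

def poststratify(counts: pd.Series, evs: pd.Series) -> float:
    """Population-weighted average of cell-level EVs."""
    weights = counts / counts.sum()
    return float((weights * evs).sum())

# One row per demographic cell: a population count plus an EV column per metric.
# Placeholder numbers only; the real counts come from the voter-file frame and the
# real EVs from the bootstrapped models.
rng = np.random.default_rng(0)
frame = pd.DataFrame({"N": [1200, 800, 950],
                      **{m: rng.uniform(50, 80, size=3) for m in KEY_METRICS}})

national = {m: poststratify(frame["N"], frame[m]) for m in KEY_METRICS}
overall = float(np.mean(list(national.values())))  # unweighted mean of the seven key-metric EVs
print(round(overall, 1))
```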

The output

Our leaderboard ranks the models under consideration using the overall score discussed previously. We also present comprehensive model performance data at both population and demographic levels. This approach allows us to answer important questions like:

  • Which models perform best for specific demographic groups?
  • How do dimensions such as trust and helpfulness vary across user segments?
  • What patterns emerge when comparing model performance across various user populations?

This methodology also enables longitudinal tracking. As the AI landscape evolves with new and updated models, our leaderboard will continuously reflect how real users' experiences and preferences change over time.

The gap between technical benchmarks and practical utility has created a blind spot in AI development. The leaderboard illuminates what matters: the measurable impact of these systems on actual human tasks. These insights can help drive better models, better products, and ultimately, better decisions about which AI truly delivers.

Discover the leaderboard

Evaluate your model

Are you developing AI models and want to understand how they perform with real users? We're expanding the models included in future leaderboard releases. 

Contact us to discuss adding your model to our evaluation framework here. 

Join the initiative

The Prolific leaderboard is an evolving project that benefits from diverse perspectives. Whether you're interested in contributing to our methodology, partnering on research, or advising on future directions, we welcome collaboration. Reach out to learn how you can help shape the future of human-centered AI evaluation.

Get in touch

¹ The exception to this weighting is our task-level metrics. As we are unable to ensure adequate demographic stratification within a single task-LLM pairing we instead use the raw scores for any measures that compare individual tasks.