
Navigating biases in human feedback for AI training 

Dr Tom Hosking
June 7, 2024

Tom Hosking, Phil Blunsom and Max Bartolo of Cohere recently published a study on Navigating Biases in Human Feedback for AI Training.

Below, I dive into our analysis, which reveals how factors like assertiveness and complexity can skew human annotations, calling into question their reliability as evaluation metrics or training objectives.
 

Motivation

Large Language Models (LLMs) have exploded in popularity in the last few years. They have shown impressive capabilities at solving tasks in natural language, and are able to adapt to new problems and datasets. However, this adaptability introduces a challenge: how can we evaluate their performance? Until recently, models were predominantly developed to be task-specific – that is, focused on accomplishing a particular well-defined task such as sentiment classification – where it is possible to determine suitable evaluation criteria. But, since LLMs are general systems able to perform a wide range of tasks (including ones we haven’t thought of yet!), we need a similarly flexible way to measure their performance.

Human feedback can begin to provide an answer. If we show people a request or prompt that was given to the model, and the response that the model gave, then we can simply ask them to judge how good the answer is. This approach has become the main way of evaluating LLMs, and is increasingly being used as a signal to train them. 

Until now, the community has mostly assumed that these human preference scores are reliable and act as a gold standard for evaluating LLMs. This seems like a pretty reasonable assumption - if we want LLMs to be useful for people, then asking people how good they are seems like a sensible approach. We wanted to challenge this assumption, and in our paper [https://arxiv.org/abs/2309.16349] we ask two key research questions: Do human preference ratings account for all the aspects of LLM output that we would like them to? And if not, can we rely on more granular ratings about each aspect?

Do preference scores cover a wide range of properties?

In a first attempt at answering this question, we selected 10 types of error that LLMs should avoid making:

  • Harmful – Is the response unsafe, harmful or likely to cause offence in some way? 
  • Fluency – Is the response grammatically incorrect, or does it contain spelling mistakes? 
  • Scope – Does the response exceed the scope limits of a chatbot? Does the response give opinions or otherwise act as if it is a person, or offer to take actions that it cannot (e.g. make a call, access the internet)? 
  • Repetition – Does the response repeat itself? For example, if there is a list in the response, are any items repeated? Does the response reuse the same phrase again and again?
  • Refusal – If the request is reasonable, does the response refuse to answer it (e.g. “I’m sorry, I can’t help you with that”)? 
  • Formatting – Does the response fail to conform to any formatting or length requirements from the prompt? 
  • Relevance – Does the response go off topic or include information that is not relevant to the request? 
  • Factuality – Is the response factually incorrect (regardless of what the request said)? 
  • Inconsistency – Does the response incorrectly represent or change information from the request? This criterion is often also referred to as faithfulness. 
  • Contradiction – Is the response inconsistent with itself, or does it contradict itself?

These are minimum criteria - they should apply to all tasks and prompts.

Once we’d decided on the error criteria, we designed an experiment with two groups of annotators. We showed each group a prompt given to an LLM, alongside the LLM’s response. We asked the first group to rate the response from 1 to 5, based on whatever criteria they felt were important. We asked the second group to annotate whether each response contained each of the 10 error types. Then, we trained a regression model to predict the overall ratings from the error annotations. The weights of this regression model should tell us how important each error type is to the overall scores, and therefore how well the overall scores cover the range of errors.
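
To make this step concrete, here is a minimal sketch of that regression, assuming the error annotations are binary flags per response and the overall ratings are 1–5 scores. The data and the use of scikit-learn below are illustrative placeholders, not our exact pipeline.

```python
# Minimal sketch: fit a linear regression from per-error annotations to overall ratings.
# The data below are random placeholders standing in for the real annotations.
import numpy as np
from sklearn.linear_model import LinearRegression

ERROR_TYPES = [
    "harmful", "fluency", "scope", "repetition", "refusal",
    "formatting", "relevance", "factuality", "inconsistency", "contradiction",
]

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, len(ERROR_TYPES)))  # 1 = error type flagged in response
y = rng.integers(1, 6, size=500)                      # overall 1-5 rating from the other group

model = LinearRegression().fit(X, y)

# A strongly negative weight means that error type drags the overall rating down a lot,
# i.e. the overall preference scores "cover" that error type well.
for name, coef in sorted(zip(ERROR_TYPES, model.coef_), key=lambda p: p[1]):
    print(f"{name:>13}: {coef:+.3f}")
```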

Setting up an annotation interface

We opted to use Prolific as a participant provider for our experiment based on positive experiences from colleagues. Setting up the recruitment process was straightforward - we filtered participants to be L1 English speakers living in the UK or US, with a minimum of 100 successfully completed experiments. We didn’t want to bias our findings by specifying any more specific requirements, so we selected the “Representative sample” option for the participant distribution.

Setting up an interface for the annotators to use was a little more involved. Prolific requires you to use third-party software to host the study. This allows for more flexibility, but at the time we ran our experiments there wasn’t a perfect fit for what we needed. We decided to use Potato, an open-source annotation tool, with a few modifications of our own to improve the way it handles participant returns and to make the interface a little more slick.

We collected ratings in pairs - we showed participants one input, and two corresponding outputs from different systems. This allows participants to calibrate their scores - it’s much easier to come up with an accurate rating for an output if you can compare different outputs to each other. 
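
As a rough illustration of this setup, the sketch below assembles paired comparisons for the annotation interface. The data structures, and the assumption that we counterbalance which side each system appears on, are mine for illustration rather than details from this post.

```python
# Illustrative sketch: build side-by-side pairs of outputs from two systems for the
# same prompt, randomising left/right position. Data structures are placeholders.
import random

def make_pairs(prompts, outputs_by_system, system_a, system_b, seed=0):
    """Yield (prompt, left_output, right_output) tuples for the annotation interface."""
    rng = random.Random(seed)
    for prompt in prompts:
        pair = [outputs_by_system[system_a][prompt], outputs_by_system[system_b][prompt]]
        rng.shuffle(pair)  # counterbalance which side each system lands on
        yield prompt, pair[0], pair[1]
```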

Preference scores under-represent factuality

Back to our experiment. We trained a regression model to predict the overall rating scores from the error annotations. The weights then represent how much each type of error affects the overall scores. Scope, Fluency, and Harmfulness don’t carry any weight because there weren’t any errors of those types in our data. Current models are already very fluent, and none of the prompts we used asked the model to do anything that was beyond the scope of the model, or was unsafe.

The error type that has the biggest effect is “Incorrect refusal”. This makes sense - it’s pretty annoying when you ask an LLM a benign question, e.g. “Tell me how to bake a banana cake”, and get a response like “As an AI assistant trained to be helpful and harmless, I’m sorry, I can’t help with that.”, so this should probably contribute strongly to the overall quality of a response. Nonetheless, it is interesting just how strongly it contributes.

But, factuality errors contribute much less to the overall ratings. This is quite concerning! For most applications, particularly real-world enterprise use-cases, factuality is a critical requirement, but it seems that human preferences don’t capture this very well.

Quality checks

As part of our annotation process, we included some pairs of examples where both responses were taken from the same model, but one response in the pair was generated for a different input prompt, called a distractor. In these cases, the errors that are independent of the input (e.g., formatting, safety etc.) should be annotated the same for both examples in the pair, but errors that are dependent on the input (e.g., relevance) should be detected more often for the distractor.
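
The check itself boils down to a simple comparison of detection rates. A sketch of how it could be computed is below, with illustrative column names rather than our actual annotation schema.

```python
# Sketch: compare how often each error type is flagged when a response is shown with
# its true prompt vs. a distractor prompt. Rows and columns are illustrative.
import pandas as pd

annotations = pd.DataFrame([
    # one row per (response, error type) judgement
    {"error_type": "relevance",  "flagged": True,  "is_distractor": True},
    {"error_type": "relevance",  "flagged": False, "is_distractor": False},
    {"error_type": "formatting", "flagged": False, "is_distractor": True},
    {"error_type": "formatting", "flagged": False, "is_distractor": False},
    # ... many more rows in practice
])

# Detection rate per error type in each condition, then the change under distractors.
rates = annotations.groupby(["error_type", "is_distractor"])["flagged"].mean().unstack()
rates["change_under_distractor"] = rates[True] - rates[False]
print(rates)
```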

This plot shows the change in detected error rates between normal and distractor samples: as expected, formatting is essentially unchanged, while relevance is heavily penalized. But, we also find that factuality is somewhat penalized. Participants were asked to judge the absolute factuality of responses, which should be independent of the input, but it seems that they struggle to disentangle this from the input.

This raises further questions: what other factors might be influencing annotators? Are there other undesirable biases that skew their ratings of the error types themselves?

Are the error annotations reliable?

We hypothesized that annotators (and indeed people in general) might be biased by the assertiveness or complexity of responses. A statement that is presented confidently as fact seems less likely to be questioned than one that is ‘hedged’ or reported with uncertainty. Similarly, a statement that includes lots of jargon or complex language (like this one) might be considered to indicate that it came from a knowledgeable expert. This concept of ‘language ideology’ has been studied before with respect to accents and demographics of people, but not for automatic systems.

We therefore set up an experiment similar to the one before - but this time, we asked the LLM to vary the style of its output, making it more or less complex and more or less assertive. Then, we asked two groups of participants to annotate the outputs for overall ratings and errors, as before. There is a possible confound here: it may be the case that asking the model to be more assertive or complex actually leads to genuinely better responses with fewer errors. So, we also annotated 300 examples ourselves, to act as ‘expert’ annotations. We are of course also human, but we had a vested interest in the study and so took the time to carefully check each output.
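
For illustration, this kind of style variation can be done purely through the prompt. The modifier wording and the generate call in the sketch below are hypothetical, not our exact instructions.

```python
# Hypothetical sketch of varying output style via prompt instructions.
STYLE_MODIFIERS = {
    ("assertiveness", "high"): "State your answer confidently, as established fact.",
    ("assertiveness", "low"):  "Hedge your answer and acknowledge any uncertainty.",
    ("complexity", "high"):    "Use technical vocabulary and complex sentence structure.",
    ("complexity", "low"):     "Use simple, plain language.",
}

def build_prompt(user_request: str, axis: str, level: str) -> str:
    """Append a style instruction to the original request."""
    return f"{user_request}\n\n{STYLE_MODIFIERS[(axis, level)]}"

prompt = build_prompt("Explain how vaccines work.", "assertiveness", "low")
# response = generate(prompt)  # hypothetical call to the LLM under study
```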

LLMs can successfully vary the style of output

As a check, we asked a third group of participants to score the outputs from 1 to 5 as to how assertive and complex they were.

When told to be more assertive, the LLM does indeed generate more responses that are perceived as assertive, and the same for complexity. So far, so good!

Annotators are less likely to find factuality errors in assertive outputs

When we plot the difference between error rates as determined by annotators and by experts, we find that assertiveness has little to no effect on whether annotators pick up on errors like formatting and contradiction. 

However, annotators are less likely to find factuality and inconsistency errors when the output is more assertive. In other words, the assertiveness of an LLM output is a confounding factor when judging how factual it is. This is highly concerning if we’re going to use human annotators to judge the factuality of LLMs - we might instead just be measuring how assertive they are.

We ran the same analysis for different complexities of output, but didn’t find a corresponding effect in that case. Assertiveness is a biasing factor, but complexity doesn’t seem to be.
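
In both cases, the analysis amounts to measuring how often annotators miss errors that the experts caught, split by style condition. A minimal sketch follows, with illustrative column names rather than our actual data format.

```python
# Sketch: per error type, the gap between expert and crowd detection rates,
# split by perceived assertiveness. Rows and columns are illustrative.
import pandas as pd

df = pd.DataFrame([
    # one row per (response, error type): crowd flag, expert flag, style condition
    {"error_type": "factuality", "crowd": False, "expert": True,  "assertiveness": "high"},
    {"error_type": "factuality", "crowd": True,  "expert": True,  "assertiveness": "low"},
    # ... many more rows in practice
])

gap = (
    df.assign(missed=lambda d: d["expert"].astype(int) - d["crowd"].astype(int))
      .groupby(["error_type", "assertiveness"])["missed"]
      .mean()
      .unstack()
)
# Positive values mean the crowd under-detects errors relative to experts; a larger
# value for "high" than "low" assertiveness is the bias described above.
print(gap)
```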

Does this bias propagate downstream?

We mentioned before that human preference scores are being used to train models, as well as evaluate them. So, does this problem of assertiveness bias get exacerbated when we use human preferences as a training objective? This is tricky to determine, but we can do some initial analysis; one of the models we experimented with was trained using Reinforcement Learning from Human Feedback (RLHF) while the others were not.

If we plot the perceived quality of outputs against the perceived assertiveness, we find that the two properties are correlated. This is reasonable - maybe telling the model to be more assertive also makes it more helpful, or maybe people just prefer the model to be more confident. 

Interestingly, the points for Command 52B (which was not trained using RLHF) all fall in the upper left corner of the plot, and the points for Llama 2 13B (which was trained with RLHF) all fall to the bottom right. It seems like Command is more humble than Llama - if you compare the models at the same level of quality, Command is less assertive. We think this is a desirable property in a model: “useful and cautious” is surely better than “overconfident and wrong”.
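
For reference, the comparison behind that plot reduces to correlating the two perceived scores and looking at per-model positions. Here is a toy sketch with made-up numbers; the values below are not our results.

```python
# Toy sketch: correlate perceived quality with perceived assertiveness, and compare
# per-model averages. All numbers here are invented for illustration.
import pandas as pd

scores = pd.DataFrame([
    {"model": "model_a_no_rlhf", "quality": 3.6, "assertiveness": 2.2},
    {"model": "model_a_no_rlhf", "quality": 4.3, "assertiveness": 3.0},
    {"model": "model_b_rlhf",    "quality": 3.6, "assertiveness": 3.4},
    {"model": "model_b_rlhf",    "quality": 4.3, "assertiveness": 4.4},
    # ... one row per system and style condition in practice
])

print(scores["quality"].corr(scores["assertiveness"]))               # overall correlation
print(scores.groupby("model")[["quality", "assertiveness"]].mean())  # per-model positions
```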

We’d love to see a more detailed and targeted analysis of this behavior. Our work shows that the human scores contain biases, but what happens if you train a model to predict those scores? Does the bias get amplified? Are there any new biases introduced? What about when we train an LLM on those predicted scores?

Possible mitigations

OK, so human feedback is flawed and biased - what can we do about it? There are two broad directions here.

First, we can try to account for the biases in human annotators and reduce them. Properly compensating annotators for their time, careful training and precise definitions, and having multiple diverse annotators are all likely to help mitigate the problems we identify. However, we suspect that these mitigations are not perfect. We have identified assertiveness as a confounder, but there are likely to be others, and it seems an impossible task to try to find them all.

Secondly, we can step back and look at the bigger picture. What exactly is it we want models to do? Human feedback captures how pleasing model outputs are, but we probably want models to be genuinely useful. Finding a way to measure this more directly could align our evaluation methods more closely to what we want.

Conclusion

To sum up, using overall human preference scores to evaluate the quality of Large Language Models under-represents important factors like factuality. And, even if you directly ask annotators whether LLM outputs are factual, they can be biased by the assertiveness of the output, or other confounders. We hope that our work starts a conversation around how human preference scores are used for evaluating and training LLMs. There’s certainly a lot more work to do in this area, whether investigating how these biases propagate when preference data is used as a training objective or validating the effectiveness of possible mitigation strategies. If you’re interested in hearing more, please do reach out to us!

About Tom Hosking

Tom is a final-year PhD student in the Informatics department at the University of Edinburgh, advised by Mirella Lapata. His primary research interest is natural language generation, including improving the structure of representations and data structures within models, and developing methods for automatic and human evaluation of system outputs. Tom has worked as a research intern at Cohere, and before starting his PhD he spent time working on NLP at multiple startups and as a quant trader.

About Max Bartolo

Max leads the Command modeling team at Cohere and chairs the Data-centric Machine Learning Research (DMLR) working group at MLCommons. He has pioneered work on dynamic adversarial data collection with humans and machines in-the-loop, with efforts such as AdversarialQA and Dynabench. His current research focuses on language model robustness, complex reasoning, and learning from human feedback.