Data quality and AI safety: 4 ways bad data affects AI and how to avoid it
People assume AI means a computer program acting intelligently on its own to accomplish a task. But that's not true. AI is reactive: it only acts on what it is told to act on. Driverless cars driving dangerously, unfair mortgage and insurance practices, and even medical chatbots dispensing deadly advice can all be linked to bad AI. And those issues all started with bad data (data corrupted by bias, flawed by incorrect standardization, or obtained through improper research) that was baked into their frameworks.
This is why AI safety is a concern. AI safety is about protecting your input and making sure you use clean data to get the most unbiased, fair, and effective results.
Once bad data gets into your AI, you could run into any one of these four problems.
1. Incorrect data input creates bad results
Data needs a quality standard because once you feed something into an artificial intelligence or machine learning algorithm, the material is processed and spat out regardless of whether it is correct. AI doesn’t differentiate between good and bad input data; it simply follows its logic. And bad data costs businesses an estimated $15 million a year, according to Gartner.
Imagine a calculator whose 3 key actually enters a 7, without you realizing it. Every calculation that uses 3 is now corrupt and will produce incorrect results. If the fault goes unnoticed, everything built on those results will be flawed.
Making sure you have clean, high-quality, and unbiased data is the only way to guarantee trustworthy and safe results. You can start ensuring cleaner input data by removing duplicate observations, filtering out unwanted edge-case data, and fixing structural errors (naming, labeling, etc.). Once you have safe and clean information as a base, your AI won’t be misled by incorrect data.
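To make that concrete, here is a minimal sketch of what that first cleaning pass could look like with pandas. The file name, column names, and thresholds are hypothetical, purely for illustration:

```python
import pandas as pd

# Hypothetical raw survey export; the file and column names are illustrative only.
df = pd.read_csv("responses.csv")

# Remove duplicate observations, e.g. the same participant submitting twice.
df = df.drop_duplicates(subset=["participant_id"])

# Fix structural errors: stray whitespace and inconsistent labels.
df["country"] = df["country"].str.strip().str.title().replace({"Uk": "United Kingdom"})

# Filter out unwanted edge cases, such as impossible ages or empty responses.
df = df[df["age"].between(18, 100)]
df = df.dropna(subset=["response"])

print(f"{len(df)} clean rows ready to use as training data")
```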
2. Bad data shuts down projects
Imagine the data used to develop your AI is on a conveyor belt. The data moves along the belt, and everyone on the factory floor is accessing it. If the data is bad, multiple departments can be affected — managers, developers, designers, and more. Once the bad data is discovered, you have to go back to the beginning and ensure the data is safe and correct. Project costs could soar as employees now have to double-check and update the information instead of being able to implement it. You risk the project shutting down entirely.
Zillow, the online real estate marketplace, had to shut down a business unit after bad AI caused millions in damages. Starting in 2018, Zillow bought tens of thousands of homes in the hope of quickly “flipping” them and reselling them at a much higher profit. It purchased homes based on the “Zestimate”, an AI model it developed specifically to estimate home values and the likelihood that a home could be profitably renovated.
It did not go as planned.
The culprit? Manipulated AI algorithms that failed to foresee COVID-19 and the nationwide labor shortage that followed. The AI algorithm had a median error rate of 1.9%, which climbed as high as 6.9% for off-market homes. The bad data fed into the algorithm caused the AI to overvalue homes, which led to Zillow paying well above market value for them.
Timothy Chan, a former Facebook data scientist, explains: “To aggressively scale the Zillow Offers business, Zillow executives intentionally adjusted their algorithm estimates upwards, which accomplished the goal of increasing buying conversion rates but at higher offer prices. Zillow Offers, coming off a terrific Q2 with 15% gross margins thanks to generous price appreciations was feeling pretty confident and continued to expand. Unfortunately, the market in Q3 reversed and instead of +12% growth, the housing market saw -5–7% drops, resulting in $300M in losses and an expected $245M in write-downs.”
Zillow Offers took a $304 million write-down in Q3, with roughly $245 million more in losses expected in Q4, laid off 25% of its staff, and saw Zillow's stock plummet by around 25%. The company was forced to quickly sell the roughly 7,000 homes in its inventory to make up for its losses.
Cleaning your data at the input stage is your most effective weapon against bad data filtering down the conveyor belt. Watch for the telltale signs of poor data quality (it’s old, it’s inconsistent, it’s siloed within departments). This is where Prolific can help source the best, most reliable research participants for your data.
3. Biased data creates biased AI
“When the data we feed the machines reflects the history of our own unequal society, we are, in effect, asking the program to learn our own biases.” — The Guardian’s Inequality Project
Bias in the data collected for an AI will skew its results and undermine its fair application. The data scientist’s credo is ‘Garbage In, Garbage Out’, and that applies to bias as well. If garbage theories and speculation go into the algorithm, then garbage conclusions will come out.
To catch bias before the data is collected, first identify the common types of bias prevalent in data collection:
- Confirmation bias - seeking out and accepting only the data that validates your existing argument or decision. For instance, you think Product A is awesome, so you find a group on Facebook that also loves Product A and never consider the opinions of people who don’t like it. That’s confirmation bias.
- Historical bias - prejudice baked into data drawn from archives and systemic records that are not representative of the general population. For example, a widely used healthcare risk algorithm was found to be flawed because it used past healthcare spending as a proxy for health needs, and since less had historically been spent on Black patients, the algorithm underestimated how much care they needed.
- Survivorship bias - drawing conclusions only from the people or things that made it through a selection process while overlooking those that didn’t. For example, Bill Gates and Mark Zuckerberg never finished college, which might lead some to assume that college is not necessary, ignoring the far larger number of dropouts who never built successful companies.
- Availability bias - judging how likely something is by how easily examples come to mind. For instance, would you think it’s more dangerous to be a police officer or a lumberjack? Given what we see in the media about police officers, you might assume policing is more dangerous. But lumberjacks have one of the highest occupational fatality rates.
- Sampling bias - collecting data from a non-random, unrepresentative subset of the group you want to study, so the results don’t reflect the wider population.
When you don’t consistently and rigorously monitor the quality of your data, not only do your business and reputation suffer, but the data quality issues can also lead to bad AI algorithms that have a serious impact on lives.
Take PredPol, for instance. PredPol is an AI program used by law enforcement agencies to predict likely areas where future crime might occur. Police departments, besieged by negative publicity in the wake of high-profile incidents, now turn to data to help them solve crimes, absolving themselves of responsibility by saying, “The bias has been removed. Let the data run the show.”
Unfortunately, PredPol data is based on crime reports that may have systemic racism built into them. Those crime reports are basically driven by data sets from overpoliced Black and Brown neighborhoods. For instance, if the crime reports are using a data set that says 50 square blocks of a city are most likely to have high crime rates based on the arrests made there, police officers may then “over patrol” that area, expecting trouble. They will then make arrests based on their prejudice from the data rather than on actual patrol and observation, as suggested by this report on predictive policing by the UK government’s Centre for Data Ethics and Innovation.
To counteract biased data, you have to start by collecting data as objectively as possible. Use well-prepared questions that do not guide respondents toward a particular answer. To remove further bias, ensure the data is representative of the population or group you are studying. If volunteers are helping you collect data, make sure everyone is collecting and recording it in the same way and understands the need to avoid prompting respondents toward particular answers.
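As one example of how you might sanity-check representativeness before the data goes anywhere near a model, the sketch below compares sample demographics against known population proportions. The age groupings, 5-point threshold, file name, and column names are assumptions made purely for illustration:

```python
import pandas as pd

# Hypothetical population proportions (e.g. from census data) for one demographic variable.
population = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# Hypothetical collected responses with an "age_group" column.
sample = pd.read_csv("responses.csv")
observed = sample["age_group"].value_counts(normalize=True)

# Flag any group that deviates from the population by more than 5 percentage points.
for group, expected in population.items():
    share = observed.get(group, 0.0)
    if abs(share - expected) > 0.05:
        print(f"{group}: sample {share:.0%} vs population {expected:.0%}; consider recruiting more or reweighting")
```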
Prolific is invested in the fight for clean data so that clean AI can be a reality and not a guessing game. We do this by connecting researchers with vetted and reliable participants around the world, enabling fast, reliable, and large-scale data collection.
4. Bad data has dangerous real-world implications
Bad data not born of bias is equally harmful. It’s the data scientist and researcher’s responsibility to make sure clean, unbiased data is implemented as a base for AI algorithms. Here are two of the most high-profile examples we’ve seen of flawed data turning into AI nightmares.
Bad AI drives Tesla to the brink
Self-driving cars used to be relegated to sci-fi literature, but thanks to advancements in AI, auto manufacturers are making it a reality. That reality is turning into a nightmare for some, however, as self-driving cars have been veering into dangerous territory lately.
Although a number of manufacturers are incorporating some level of self-driving technology, the most well-known is Tesla. The company has made headlines with accidents resulting from AI failures in its self-driving cars. In one case, the AI failed to bring the vehicle to a stop at a stop sign. In another, a fatality occurred when the AI mistook a trailer truck for “bright sky”. According to an official statement from Tesla, this occurred because “Neither Autopilot nor the driver noticed the white side of the tractor-trailer against a brightly lit sky, so the brake was not applied.”
It's been suggested that this has to do with Tesla’s AI relying on “neural networks” to do the thinking for its system. Neural networks usually have an input layer, which takes in data from sources like data files, images, and hardware sensors; one or more hidden layers that process the data; and an output layer that produces the final prediction.
The problem with neural networks is that they only “mimic” what the human brain can achieve, and they can't account for situations they haven't seen. If the data set doesn’t include “elephants in the road”, then the AI won’t know what to do when it encounters an elephant in the road. Incomplete data leads to faulty AI.
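To illustrate the point in the simplest possible terms (this is a toy sketch with made-up classes and random weights, not anything resembling Tesla's actual system), a classifier's output layer can only ever pick from the classes it was trained on, so an unseen obstacle gets forced into a known category:

```python
import numpy as np

# Toy classifier: the output layer is fixed to the classes seen during training.
# "elephant" was never in the training data, so it can never be predicted.
CLASSES = ["car", "pedestrian", "stop_sign"]

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, len(CLASSES)))  # made-up weights for a 4-feature input

def predict(features: np.ndarray) -> str:
    """Return the most likely class among those the model knows about."""
    logits = features @ weights
    probs = np.exp(logits) / np.exp(logits).sum()
    return CLASSES[int(np.argmax(probs))]

# Even a feature vector produced by an elephant-shaped obstacle is forced into
# one of the known classes, often with misplaced confidence.
print(predict(np.array([0.9, 0.1, 0.7, 0.3])))
```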
Amazon recruitment “learns” discrimination
Amazon wanted to be at the forefront of AI in human resources and developed a recruiting AI system to help sort through the millions of resumes it receives annually. The problem was that flawed training data caused the AI to incorporate bias into the system, corrupting the outcome.
The AI absorbed the unconscious bias present in its training data. Amazon’s system was trained to find applicants by following patterns in resumes sent to the company over a 10-year period. Since most of those resumes came from men, reflecting the preponderance of males in the tech industry, the AI “learned” that male candidates were preferable. Resumes that included the word “women’s”, as in “women’s chess club captain”, were penalized, as were graduates of two all-women’s colleges.
Amazon did try to neutralize the problem beforehand. The company edited the programs to treat these particular terms neutrally, but the AI kept finding ways to discriminate on its own.
Amazon later scrapped the program, claiming it was never used by Amazon recruiters. The company would not disclose if other recruiters had used the program.
Clean data is a moral responsibility
For data scientists, clean data is a moral obligation, one that will create safe AI that is free from bias and threat. The same applies to data collection. A safe, fair, and unbiased system of collecting data is the only way AI will become better at what it does.
Discover how to collect clean and ethical data for AI in The Quick Guide to AI Ethics for Researchers. This handy guide has everything you need to get to grips with AI ethics, plus 4 key tips that will help you train AI ethically and responsibly. Download your copy now.