How to Measure Gender and Racial Bias in Large Language Model Outputs

When you ask a large language model to rank job applicants, it doesn’t just read resumes-it makes assumptions. And those assumptions often reflect deep-seated biases in society. A 2024 study in PNAS tested this by feeding over 361,000 fake resumes to top LLMs, changing only the names, genders, and racial cues. The results? The models didn’t stay neutral. They reinforced stereotypes-sometimes in surprising ways.

What Gets Scored Higher? The Numbers Don’t Lie

The study looked at how models like GPT-3.5 Turbo, GPT-4, Claude-3, and Llama2 rated candidates with identical qualifications but different identities. Here’s what they found:

  • White female candidates scored 0.223 points higher on average than white males.
  • Black female candidates scored 0.379 points higher-the highest of all groups.
  • Black male candidates scored 0.303 points lower than white males.

That’s not noise. Those differences are statistically significant and could mean the difference between getting an interview or being passed over. For example, at a hiring threshold where only the top 10% of applicants move forward, a 0.3-point difference could push a qualified Black male candidate out of the pool-while a white female candidate with the same resume slips in.

And here’s the twist: the bias got worse with better resumes. In the top 10% of applicants, GPT-3.5 Turbo showed even stronger favoritism toward white women. Meanwhile, the strongest penalty against Black men showed up in the bottom 10%-meaning the model didn’t just ignore them, it actively punished them when they seemed less qualified.

Intersectionality Isn’t Just a Buzzword-It’s in the Code

Most people assume bias works like a simple math problem: gender bias + racial bias = total bias. But it doesn’t. The study found that the effect of being both Black and female wasn’t the sum of being Black and being female-it was more than the sum. Black women got higher scores than white women, even though white women also got a boost. That’s not a bug. It’s a pattern.

This is called intersectional bias. It means the model doesn’t see gender and race as separate traits. It sees them as combined identities-and it treats those combinations in ways no one designed it to. And current debiasing techniques? They don’t catch this. Reinforcement learning from human feedback (RLHF), adversarial training, fairness constraints-all were used in these models. And still, the bias stayed.

Gender Stereotypes Are Hardwired

It’s not just about hiring. Ask a model: "Who is a nurse?" "Who is a CEO?" The answers are predictable. One study found that when a female pronoun was used, the model chose a stereotypically female job 6.8 times more often than when a male pronoun was used. For male pronouns, it picked male jobs 3.4 times more often.

But here’s what’s worse: the model doesn’t just match stereotypes-it amplifies them. Women were steered toward caregiving roles even when their resumes showed leadership experience. Men? Their career paths stayed broad. The model didn’t push men into nursing. It just didn’t push them out of engineering, finance, or management.

Look at WinoBias, a dataset designed to test stereotype resistance. GPT-3.5 got anti-stereotypical questions wrong 2.8 times more often than stereotypical ones. GPT-4? 3.2 times. That means if you ask, "The nurse was in the hospital. Was it a man or a woman?" the model is far more likely to say "woman," even when the context gives no clue.

A mechanical brain with gears of stereotypes lifting one applicant and pushing another down, in Golden Age comic art.

Race Bias Is Stronger Than Gender Bias

A separate analysis across four bias dimensions-race, gender, health, and religion-found that race showed the strongest bias, followed by gender, then health and religion. That’s not just about names. It’s about how models interpret cultural signals: accents in text, university names, neighborhood references, even word choices.

Asian and Hispanic candidates showed mixed results across models. Some models favored Asian applicants; others penalized them. That inconsistency suggests bias isn’t universal-it’s shaped by training data sources. If most of the data came from U.S. tech companies, the model learns that "white male" = "standard." If it was trained on data from Europe or India, the patterns shift.

Bigger Models Don’t Mean Fairer Results

You’d think more powerful models would be smarter. But the opposite is true. GPT-4, Claude-3-Opus, and Llama2-70B-all large models-showed larger biases than smaller ones like Llama2-7B or Alpaca-7B.

Why? Because bigger models absorb more data. And more data doesn’t mean cleaner data. It means more history, more stereotypes, more inequality. The models aren’t learning fairness. They’re learning the world as it is-not as it should be.

Real-World Consequences

This isn’t theoretical. These models are already being used in:

  • Resume screening for Fortune 500 companies
  • Loan applications processed by automated underwriting systems
  • Healthcare triage tools that prioritize patient care
  • Customer service chatbots that determine who gets help first

Imagine a Black man with a degree from Howard University applying for a job at a company in Georgia. His resume gets flagged as "low fit" by an AI trained mostly on data from Silicon Valley. He never gets a call. Meanwhile, a white woman with the same resume gets an interview. That’s not luck. That’s bias baked into the system.

A courtroom with a computer chip gavel and a screen showing biased resumes, illustrated in classic comic book style.

How to Measure Bias in Your Own LLM

If you’re building or using an LLM, here’s how to test for bias-no PhD needed:

  1. Design a controlled test: Create 500 synthetic profiles. Change only gender, race, and location. Keep education, experience, and skills identical.
  2. Run the test: Ask the model to rank, score, or filter candidates. Record outcomes.
  3. Compare by group: Look at scores for white males as your baseline. Then see how others stack up.
  4. Test intersectionality: Don’t just check "Black" and "female" separately. Check "Black female" as a combined group.
  5. Test across quality levels: Run the test on top 10%, middle 80%, and bottom 10% of resumes. Bias often hides in extremes.

You don’t need fancy tools. Google Sheets and a prompt template will do. The key is consistency. If you’re seeing differences larger than 0.1 points across groups, you’ve got a problem.

Why Debiasing Isn’t Working

Companies try to fix this with "fairness filters" or "bias audits." But here’s the truth: you can’t scrub bias out of a model after it’s trained. The patterns are too deep. They’re in the data. They’re in the way the model learns connections.

Debiasing techniques like RLHF often make things worse. Why? Because they’re trained on human feedback-and humans are biased too. If most reviewers are white and male, the model learns to please them.

What actually helps? Changing the data. Training on diverse, balanced datasets. Auditing for underrepresented voices. Including more women, people of color, and non-binary individuals in the data collection process. And that’s expensive. That’s hard. That’s why most companies don’t do it.

What’s Next?

Measuring bias is the first step. But fixing it? That’s the real challenge. We need:

  • Regulations that require bias testing for AI used in hiring or lending
  • Transparency: companies must publish bias reports, not just claim they "tested for fairness"
  • Independent audits: third parties, not the companies themselves, should run the tests
  • More diverse teams building these models

Until then, every time you use an AI to make a decision about a person, you’re betting on a system that’s been trained on centuries of inequality. And that’s not AI. That’s history.

Can LLMs be completely freed from bias?

No-not with current methods. Bias is embedded in the data, the architecture, and the training process. Even models trained on "neutral" data inherit bias because society itself is biased. The goal isn’t perfection-it’s reduction. And that requires constant testing, transparency, and real diversity in development teams.

Do all LLMs show the same bias patterns?

No. While gender and racial bias are common across models, the severity and direction vary. GPT-4 shows stronger bias than smaller models like Alpaca-7B. Claude-3 shows different patterns than Llama2. Models trained on U.S.-centric data show stronger racial bias than those trained on global datasets. There’s no universal pattern-only shared trends.

Is bias worse in hiring AI than in other applications?

Hiring is one of the most high-stakes areas because bias directly affects livelihoods. But bias in healthcare triage, loan approvals, or even customer service can be just as damaging. The difference is visibility-hiring bias is easier to measure because resumes are structured and outcomes are tracked. Other areas are harder to audit.

Why do female candidates get higher scores?

It’s not because the model "likes" women. It’s because training data associates women with traits like "detail-oriented," "collaborative," and "reliable’-traits often valued in administrative or support roles. The model doesn’t know the difference between a CEO and a receptionist. It just matches patterns. And in many datasets, women are overrepresented in lower-level roles, so the model learns to favor them there-even if they’re qualified for higher ones.

Can I test my own AI system for bias?

Yes. Start with 100 synthetic profiles that vary only in gender, race, and location. Use the same prompts your system uses. Record how it scores or ranks each candidate. Compare results across groups. If differences exceed 0.1 points, dig deeper. You don’t need AI experts-you just need to be careful and consistent.

Write a comment