Introduction
Discriminatory behavior towards certain groups by machine learning (ML) models is especially concerning in critical applications such as hiring. This blog post explains one source of discrimination: the reliance of ML models on different groups’ data distributions. We will show that when ML models use noisy features (which are pervasive in the real world, e.g., exam scores), they’re incentivized to devalue a good candidate from a lower-performing group. This blog post is based on:
Fereshte Khani and Percy Liang, “Feature Noise Induces Loss Discrepancy Across Groups.” International Conference on Machine Learning (ICML), PMLR, 2020.
The findings are illustrated by reviewing the hiring process in the
fictitious city of Ney, where recently a group of people has accused the
government of discrimination.
Hiring people in Ney
The government of Ney wants to hire qualified people. Each person in Ney has a skill level that is normally distributed with a mean of \(\mu\) and a standard deviation of \(\sigma_\text{skill}\). A person is qualified if their skill level is greater than 0 and non-qualified otherwise. The government wants to hire all qualified people (all people with skills greater than 0). For example, Alice, with a skill level of 2, is qualified, but Bob, with a skill level of -1, is not.
To assess people’s skills, the government created an exam. The exam score is a noisy indicator of the applicant’s skill, since it cannot capture the true skill of a person (e.g., the same applicant would score differently on different versions of the SAT). In the city of Ney, the exam noise is nice and simple: if an individual has skill \(z\), then their score is distributed as \(\mathcal{N}(z, \sigma_\text{noise}^2)\), where \(\sigma_\text{noise}^2\) is the variance of the noise on the exam.
The government wants to choose a threshold \(\tau\) and hire all people whose exam scores are greater than \(\tau\). Writing \(x\) for a person’s exam score and \(z\) for their skill, there are two kinds of errors that the government can make:
- Not hiring a qualified person (\(z > 0 \land x \le \tau\))
- Hiring a non-qualified person (\(z \le 0 \land x > \tau\))
For simplicity, let’s assume the government cares about these two types of errors equally and wants to minimize the overall error, i.e., the number of non-qualified people hired plus the number of qualified people not hired.
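In symbols (up to normalization by the population size), the quantity to minimize is

\[
\text{error}(\tau) \;=\; \mathbb{P}(z \le 0 \,\land\, x > \tau) \;+\; \mathbb{P}(z > 0 \,\land\, x \le \tau).
\]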
Given all exam scores and knowledge of the skill distribution of the people,
what cut-off threshold should the government use to minimize the error (the above equation)?
Is it a good strategy for the government to simply use 0 as the
threshold and hire all individuals with scores greater than zero?
Let’s consider an example where the skill distribution is \(\mathcal{N}(-1, 1)\) and the exam noise has a standard deviation of \(\sigma_\text{noise} = 1\). The lines of code below plot the average error for various thresholds in this example. As illustrated, 0 is not the best threshold to use; in fact, in this example, a threshold of \(\tau = 1\) leads to the minimum error.
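The original snippet is not reproduced here; the following is a minimal sketch of such a simulation (assuming NumPy and Matplotlib; the names and threshold grid are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 1_000_000

# Skill ~ N(mu, sigma_skill^2); exam score = skill + N(0, sigma_noise^2).
mu, sigma_skill, sigma_noise = -1.0, 1.0, 1.0
z = rng.normal(mu, sigma_skill, n)        # true skills
x = z + rng.normal(0.0, sigma_noise, n)   # noisy exam scores

# Average error = hired but non-qualified + qualified but not hired.
thresholds = np.linspace(-2.0, 4.0, 121)
errors = [np.mean((x > t) & (z <= 0)) + np.mean((x <= t) & (z > 0))
          for t in thresholds]

plt.plot(thresholds, errors)
plt.xlabel(r"threshold $\tau$")
plt.ylabel("average error")
plt.show()

print("best threshold:", thresholds[np.argmin(errors)])  # close to 1
```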
The government wants to minimize the number of hired people with negative skill levels + the number of non-hired people with positive skill levels. Hiring all people with positive exam scores (a noisy indicator of the skill) is not optimal.
If 0 is not always the optimal threshold, then what is the optimal threshold for minimizing error for different values of \(\mu\), \(\sigma_\text{skill}\), and \(\sigma_\text{noise}\)? Generally, given a person’s exam score (\(x\)) and the skill level distribution (\(\mathbb{P}(z)\)), what can we infer about their real skill (\(z\))? Here is where Bayesian inference comes in.
Bayesian inference
Let’s see what we can infer about a person’s skill given their exam score and knowledge of the skill level distribution \(\mathbb{P}(z)\) (known as the prior distribution, since it reflects what we believe about a person’s skill before seeing any evidence). Using Bayes’ rule, we can calculate \(\mathbb{P}(z \mid x)\) (known as the posterior distribution, since it is the distribution over a person’s skill after observing their score).
Let’s first consider two extreme cases:
- If the exam is completely precise (i.e., \(\sigma_\text{noise} = 0\)), then the exam score is the exact indicator of a person’s skill (irrespective of the prior distribution).
- If the exam is pure noise (i.e., \(\sigma_\text{noise} \rightarrow \infty\)), then the exam score is meaningless, and the best estimate of a person’s skill is the average skill \(\mu\) (irrespective of the exam score).
Intuitively, when the noise variance is between \(0\) and \(\infty\), the best estimate of a person’s skill is a number between their exam score (\(x\)) and the average skill (\(\mu\)). The figure below shows the standard formulation of the posterior distribution \(\mathbb{P}(z \mid x)\) after observing an exam score (\(x_0\)). For more details on how to derive this formula, see this.
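Concretely, for a normal prior \(\mathcal{N}(\mu, \sigma_\text{skill}^2)\) and normal exam noise, the posterior is itself normal:

\[
\mathbb{P}(z \mid x = x_0) \;=\; \mathcal{N}\!\left(\frac{\sigma_\text{skill}^2\, x_0 + \sigma_\text{noise}^2\, \mu}{\sigma_\text{skill}^2 + \sigma_\text{noise}^2},\;\; \frac{\sigma_\text{skill}^2\, \sigma_\text{noise}^2}{\sigma_\text{skill}^2 + \sigma_\text{noise}^2}\right).
\]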
Based on this formula (and as hypothesized), depending on the amount of noise, \(\mathbb{E}[z \mid x]\) is a number between \(x\) and \(\mu\).
An applicant’s expected skill level is between their exam score and the average skill among Ney people. If the exam is noisier, it is closer to the average skill; if the exam is more precise, it is closer to the exam score.
Optimal threshold
Now that we have exactly characterized the posterior distribution (\(\mathbb{P}(z \mid x)\)), the government can find the optimal threshold. For any exam score \(x\), if the government hires people with score \(x\), it incurs an error of \(\mathbb{P}(z \le 0 \mid x)\) (the probability of hiring a non-qualified person). On the other hand, if it does not hire people with score \(x\), it incurs an error of \(\mathbb{P}(z > 0 \mid x)\) (the probability of not hiring a qualified person). Thus, in order to minimize the error, the government should hire a person iff \(\mathbb{P}(z > 0 \mid x) > \mathbb{P}(z \le 0 \mid x)\). Since the posterior distribution is a normal distribution, the government must hire an applicant iff \(\mathbb{E}[z \mid x] > 0\).
Using the formulation in the previous section, we have:
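\[
\mathbb{E}[z \mid x] \;=\; \frac{\sigma_\text{skill}^2\, x + \sigma_\text{noise}^2\, \mu}{\sigma_\text{skill}^2 + \sigma_\text{noise}^2},
\]

which is greater than 0 exactly when \(x > -\frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2}\,\mu\).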
Therefore, the optimal threshold is:
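\[
\tau^* \;=\; -\frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2}\,\mu.
\]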
In our running example with average skill \(\mu = -1\) and \(\sigma_\text{skill} = \sigma_\text{noise} = 1\), the optimal threshold is 1.
The figure below shows how the optimal threshold varies according to \(\mu\) and \(\sigma_\text{noise}\). As \(\sigma_\text{noise}\) increases or \(\mu\) decreases, the optimal threshold moves farther away from \(0\).
As exams become noisier or the average skill becomes more negative, the optimal threshold moves further away from 0.
What does machine learning have to do with all of this?
So far, we precisely identified the optimal cut-off threshold given exact knowledge of \(\mu\), \(\sigma_\text{skill}\), and \(\sigma_\text{noise}\). But how can the government find the optimal threshold using observational data? This is where machine learning (ML) comes into the picture.
Let’s imagine very favorable conditions. Assume everyone (an infinite number of people!) takes the exam, the government hires all of them, and it observes their true skills. Further, assume the modeling assumptions are perfectly correct (i.e., both the true prior distribution and the conditional distribution are normal). What would happen if the government trains a model on an infinite number of \((x, z)\) pairs?
Before delving into this, we would like to note that in real-world scenarios we do not have infinite data (finite-data issues); the government does not hire everyone (selection-bias issues); and the true skill is not perfectly observable (target noise/bias issues). Furthermore, the modeling assumptions are often incorrect (model-misspecification issues). Each of these issues may affect the model adversely; however, in this blog post our goal is to analyze the model’s decisions when none of these issues exist. In the next section, we will show that discrimination occurs even under these ideal conditions.
Under these very favorable conditions and with the right loss function, machine learning algorithms can perfectly predict \(\mathbb{E}[z \mid x]\) from \(x\), and can therefore find the optimal threshold that minimizes the error. The few lines of Python code below show how linear regression and logistic regression fit the data. In this example, we set \(\mu = -1\), \(\sigma_\text{skill} = \sigma_\text{noise} = 1\), and, as shown in the accompanying figure, the cut-off threshold predicted by the model is one, which matches the optimal threshold we derived previously.
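The post’s exact snippet is not reproduced here; the following is a minimal sketch under the same setup (assuming NumPy and scikit-learn; names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 200_000
mu, sigma_skill, sigma_noise = -1.0, 1.0, 1.0

z = rng.normal(mu, sigma_skill, n)                         # observed true skills
x = (z + rng.normal(0.0, sigma_noise, n)).reshape(-1, 1)   # exam scores

# Linear regression estimates E[z | x]; its zero-crossing is the implied cut-off threshold.
lin_reg = LinearRegression().fit(x, z)
tau_lin = -lin_reg.intercept_ / lin_reg.coef_[0]

# Logistic regression estimates P(z > 0 | x); its 0.5-crossing is the implied cut-off threshold.
log_reg = LogisticRegression().fit(x, (z > 0).astype(int))
tau_log = -log_reg.intercept_[0] / log_reg.coef_[0, 0]

print(tau_lin, tau_log)   # both should be close to the optimal threshold of 1
```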
Under very favorable conditions, machine learning models find the optimal threshold, which is a function of average skill, exam noise, and skill variance among people.
Optimal thresholds for different groups
So far, we have shown how to calculate the optimal threshold and illustrated that ML models also recover this threshold. Let’s now analyze the optimal threshold when different groups exist in the population. There are two kinds of people in the city of Ney: blue and red. The blue people’s skills are normally distributed centered on \(\mu_\text{blue}\), and the red people’s skills are normally distributed centered on \(\mu_\text{red}\). The standard deviation for both groups is \(\sigma_\text{skill}\). There can be various reasons for disparities between groups; for example, historically, blue people might not have been allowed to attend school.
First of all, let’s see what happens if the exam is completely precise. As previously discussed, in this case the optimal threshold is 0 for both groups, independent of their distributions. Thus, both groups are held to the same standard, and the government’s error is 0.
If there is no noise in the exam, then zero is the optimal threshold for both groups and leads to zero error.
Now let’s analyze the case where the exam is noisy (\(\sigma_\text{noise} > 0\)). As discussed in the prior sections, the optimal threshold depends on the mean of the prior distribution, so the optimal threshold differs between the blue and red groups. Therefore, if the government knows the demographic information, it is a better strategy (in order to minimize the error) to classify the two groups separately. In particular, the government can calculate the optimal threshold for blue and red people using Bayesian inference, as shown below.
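Applying the threshold formula from above to each group separately gives

\[
\tau^*_\text{blue} \;=\; -\frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2}\,\mu_\text{blue},
\qquad
\tau^*_\text{red} \;=\; -\frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2}\,\mu_\text{red},
\]

so the group with the lower average skill faces the higher threshold.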
People in a group that has lower average skills need to pass a higher bar for hiring! Not only do blue people need to overcome other associated effects of being in a group with lower average skills, they also need to pass a higher bar to get hired.
As stated, the government uses a higher threshold for people in a group
with a lower average skill! Consider two individuals with the same skill
level but from different groups. The blue person is less likely to get
hired by the government than the red person. Surprisingly, blue people
who are already in a group with a lower average skill (which probably
affects their confidence and society’s view of them) need to also pass a
higher bar to get hired!
Finally, note that the gap between thresholds for the different groups
grows as the noise increases.
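Concretely, using the per-group thresholds above,

\[
\tau^*_\text{blue} - \tau^*_\text{red} \;=\; \frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2}\,\big(\mu_\text{red} - \mu_\text{blue}\big),
\]

which grows in proportion to the noise variance \(\sigma_\text{noise}^2\).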
A blue person has a lower chance of getting hired in comparison with a red person with the same skill.
Conclusion
We examined the discriminatory effect of relying on noisy features. When ML models use noisy features, they’re naturally incentivized to devalue a good score when the candidate in question comes from an overall lower-performing group. Note that noisy features are prevalent in any real-world application (here, we assumed that noise is the same among all individuals, but it’s usually worse for disadvantaged groups). Ideally, we would like to improve the features to better reflect a candidate’s skill/potential or make the features more closely approximate the job requirements. If that’s not possible, it’s important to be conscious that the “optimal decision” is to discriminate, and we should adjust our process (e.g., hiring) in acknowledgment that group membership can shade an individual’s evaluation.
Frequently asked questions
Can we just remove the group membership information, so the model treats individuals from both groups similarly?
Unlike this example where group membership is a removable feature,
real-world datasets are more complex. Usually, datasets contain many
features such that the group membership can be predicted from them
(recall that ML models benefit from predicting group membership since it
lowers error). Thus, it is not obvious how to remove group membership in
these datasets. See
[1,2,3]
for some efforts on removing group information.
Why should we treat these two groups similarly when their distributions are inherently different? Utilizing group membership information reduces error overall and for both groups!
Fairness in machine learning usually studies the impact of ML algorithms
on groups according to protected attributes such as sex, sexual
orientation, race, etc. Usually, there has been some discrimination
towards these groups throughout history, which leads to huge disparities
among their distributions. For example, women (because of their sex)
were not allowed to go to universities. Thus, these disparities are not
inherent and could (and probably should!) change over time. For
instance, see women in the labor force
[4].
Another reason to avoid relying on disparities among protected groups in models is feedback loops. Feedback loops might exacerbate distributional disparities among protected groups over time (e.g., few women get accepted → self-doubt among women increases → women perform worse on the exam → even fewer women get accepted, and so on). For instance, see [5] and [6].
Finally, note that although the government’s objective may be to minimize the error by weighting the costs of hiring non-qualified candidates and of not hiring qualified candidates equally, it is not clear whether the group objectives should be the same. For example, a group might be hurt more by the government not hiring its qualified members than by the government hiring its non-qualified members (for example, in settings where the lack of minority role models in higher-level positions leads to a lower perceived sense of belonging among other members of the group). Thus, using group membership to minimize the error is not necessarily the most beneficial outcome for a group, and depending on the context we might need to minimize other objectives.
What about other notions of fairness in machine learning?
In this blog post, we studied the ML model’s prediction for two similar individuals (here, the same \(z\)) from different groups (blue vs. red). This is referred to as the counterfactual notion of fairness. There is another common notion of fairness known as the statistical notion of fairness, which looks at groups as a whole and compares their incurred error (it is also common to compare the error incurred only by the qualified members of different groups, known as equal opportunity [7]). Statistical and counterfactual notions of fairness are independent of each other, and satisfying one does not guarantee satisfying the other. Another consequence of feature noise is that it induces a trade-off between these two notions of fairness, which is beyond this blog post’s scope. See our paper [8] for critiques of these two notions and for the effect of feature noise on statistical notions of fairness.
Acknowledgement
I would like to thank Percy Liang, Megha Srivastava, Frieda Rong, Rishi Bommasani, Yeganeh Alimohammadi, and Michelle Lee for their useful comments.