There’s good news from the scholarly community working on the assessment of fairness in algorithms. Computer scientists and statisticians are developing a host of new measures of fairness aimed at giving companies, policymakers, and advocates new tools to assess fairness in different contexts.
The essential insight of the movement is that the field needs many different measures of fairness to capture the variety of normative concepts that are used in different business and legal contexts. Alexandra Chouldechova, Assistant Professor of Statistics and Public Policy at Heinz College, says “There is no single notion of fairness that will work for every decision context or for every goal.”
To find the right measure for the job at hand, she advises, “Start with the context in which you’re going to apply [your decision], and work backwards from there.”
This issue came to a head in the controversy surrounding the COMPAS score. This score attempts to predict the likelihood that a person convicted of a crime will commit another crime within a certain period of time. It does this by comparing current offenders to those with similar histories of offending. It is used for a variety of purposes, including identifying offenders who should be targeted for interventions or special conditions of supervision. Controversially, it is also used as a factor in determining whether an offender should be incarcerated and for how long. Its use for these purposes has been approved after court challenge, but only under very limited conditions, including a warning about its potential for racial bias.
Last year, ProPublica published a criticism of the score, noting that it mistakenly labeled African-Americans as having a high risk of re-offending at roughly twice the rate at which it made the same mistake for whites. In addition to this higher rate of false positives for African-Americans, it also had a higher rate of false negatives – this time in favor of whites. It mistakenly gave a low-risk score to whites who went on to re-offend at roughly twice the rate it did for African-Americans.
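The two error rates at issue can be made concrete with a short sketch. The counts below are hypothetical, chosen only to show the pattern described above; they are not ProPublica's figures, and the function name `error_rates` is an assumption made for illustration.

```python
def error_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate) from outcome counts.

    tp: labeled high risk and re-offended
    fp: labeled high risk but did not re-offend
    tn: labeled low risk and did not re-offend
    fn: labeled low risk but re-offended
    """
    fpr = fp / (fp + tn)  # share of non-re-offenders wrongly labeled high risk
    fnr = fn / (fn + tp)  # share of re-offenders wrongly labeled low risk
    return fpr, fnr


# Hypothetical counts for two groups (illustrative only, not real COMPAS data).
fpr_a, fnr_a = error_rates(tp=350, fp=200, tn=300, fn=150)
fpr_b, fnr_b = error_rates(tp=200, fp=100, tn=400, fn=300)

print(f"Group A: FPR={fpr_a:.2f}, FNR={fnr_a:.2f}")
print(f"Group B: FPR={fpr_b:.2f}, FNR={fnr_b:.2f}")
```

In this made-up example, group A is wrongly labeled high risk twice as often as group B, while group B is wrongly labeled low risk twice as often as group A.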
This set off alarm bells, as it should. Imagine a study showing that, in a particular court, innocent black defendants were convicted at twice the rate of innocent whites, while guilty whites were let off at twice the rate of guilty blacks!
Yet the company that developed COMPAS and other analysts, including the authors of the critical report, noted that the score predicted risk of re-offending – the higher the score the greater the risk of re-offending. Moreover, according to one common measure of fairness, it was not biased. The score predicted the same risk of re-offending regardless of racial grouping. Whites with a score of 7 were just as likely as African-Americans with a score of 7 to re-offend.
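One way to picture this measure of fairness is to compare observed re-offense rates for people who received the same score in different groups. The sketch below does that for a handful of hypothetical records; the data, the group labels "A" and "B", and the function name are all assumptions for illustration, not the method used by the company or by ProPublica.

```python
from collections import defaultdict


def reoffense_rate_by_group_and_score(records):
    """Map (group, score) to the observed share of people who re-offended."""
    totals = defaultdict(int)
    reoffended = defaultdict(int)
    for group, score, did_reoffend in records:
        totals[(group, score)] += 1
        reoffended[(group, score)] += int(did_reoffend)
    return {key: reoffended[key] / totals[key] for key in totals}


# Hypothetical records: (group, score, re-offended?). Illustrative only.
records = (
    [("A", 7, True)] * 6 + [("A", 7, False)] * 4
    + [("B", 7, True)] * 3 + [("B", 7, False)] * 2
)

rates = reoffense_rate_by_group_and_score(records)
for (group, score), rate in sorted(rates.items()):
    print(f"group={group} score={score} observed re-offense rate={rate:.2f}")
```

When the rates match across groups at each score, as they do in this toy data, the score means the same thing regardless of group, which is the sense of fairness described above.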
Researchers wondered whether the two points of view could be reconciled. Maybe predictive parity, which the COMPAS score had, needed to be supplemented with balance in error rates, which COMPAS did not have. A fair score might need both.
Hopes for an irenic compromise solution vanished, however, when several scholars including Chouldechova and Kleinberg discovered that the conflict was mathematical. In certain circumstances, it is impossible to have equal error rates and predictive parity. A score can have one or the other but not both.
The conflict arises when the outcome to be predicted is distributed unequally between the groups. In the case of the COMPAS score, African-Americans re-offended at higher rates than whites. When this base rate differs across racial groupings, a score like COMPAS that predicts the same level of risk regardless of racial grouping will not have equal error rates across those groups.
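The arithmetic behind the conflict can be sketched in a few lines. Working from the definitions of base rate p, positive predictive value (PPV), and false negative rate (FNR), the false positive rate must satisfy FPR = p/(1-p) × (1-PPV)/PPV × (1-FNR). The numbers below are hypothetical, and the function name `implied_fpr` is an assumption for illustration.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate implied by base rate, PPV, and false negative rate.

    Follows from the definitions:
      true positives  = (1 - FNR) * p * N
      false positives = true positives * (1 - PPV) / PPV
      FPR             = false positives / ((1 - p) * N)
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical inputs: identical PPV and FNR for both groups, different base rates.
fpr_high_base = implied_fpr(base_rate=0.5, ppv=0.6, fnr=0.4)
fpr_low_base = implied_fpr(base_rate=0.4, ppv=0.6, fnr=0.4)

print(f"FPR with base rate 0.5: {fpr_high_base:.3f}")
print(f"FPR with base rate 0.4: {fpr_low_base:.3f}")
```

With predictive value and false negative rate held equal, the group with the higher base rate ends up with the higher false positive rate. Short of a perfect predictor, the only way to equalize all of these quantities is for the base rates themselves to be equal.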
It would be nice if a single tool could do it all. But the research community is telling policymakers, companies and advocates there is no such magic bullet. What is the right thing to do? When there is an inherent trade-off, do we want predictive parity or equal error rates?
We cannot rely on statistical analysis to resolve what is essentially a normative dilemma. If we move toward equal error rates, we risk losing predictive parity, which is an appealing notion of fairness in the criminal justice context. Some have claimed that this statistical notion of fairness embodies the fundamental principle of treating equals equally, so that any departure from it would raise constitutional questions of equal treatment requiring strict judicial scrutiny. They also point out that equalizing error rates would release defendants who are more likely to re-offend, thereby reducing public safety.
On the other hand, the principle that the innocent should not be punished is a bedrock principle of our legal system. As William Blackstone put it, “It is better that ten guilty persons escape than that one innocent suffer.” Should the criminal justice system apply that principle less vigorously to African-Americans than to whites?
We need substantial public discussion and reflection to resolve this issue. And the answer we come up with in the criminal justice context will not automatically apply to all decision-making contexts. Tools to aid decision making are in use everywhere – employment, insurance, credit granting, authentication, fraud prevention, marketing, personalized learning, recommendation engines, online advertising. A focus on increasing the positive predictive value of these tools needs to be balanced with an assessment of error rates. Companies, policymakers, advocates and scholars have much to learn from each other as this conversation on algorithmic fairness continues.