What We Learned Auditing Sophisticated AI for Bias

A recently passed law in New York City requires audits for bias in AI-based hiring systems. And for good reason. AI systems fail frequently, and bias is often to blame. A recent sampling of headlines features sociological bias in generated images, a chatbot, and a virtual rapper. These examples of denigration and stereotyping are troubling and harmful, but what happens when the same types of systems are used in more sensitive applications? Leading scientific publications assert that algorithms used in healthcare in the U.S. diverted care away from millions of Black people. The government of the Netherlands resigned in 2021 after an algorithmic system wrongly accused 20,000 families, disproportionately minorities, of tax fraud. Data can be wrong. Predictions can be wrong. System designs can be wrong. These errors can hurt people in very unfair ways.

When we use AI in security applications, the risks become even more direct. In security, bias isn’t just offensive and harmful. It’s a weakness that adversaries will exploit. What could happen if a deepfake detector works better on people who look like President Biden than on people who look like former President Obama? What if a named entity recognition (NER) system, based on a cutting-edge large language model (LLM), fails for Chinese, Cyrillic, or Arabic text? The answer is simple—bad things and legal liabilities.

As AI technologies are adopted more broadly in security and other high-risk applications, we’ll all need to know more about AI audit and risk management. This article introduces the basics of AI audit, through the lens of our practical experience at BNH.AI, a boutique law firm focused on AI risks, and shares some general lessons we’ve learned from auditing sophisticated deepfake detection and LLM systems.

What Are AI Audits and Assessments?

Audit of decision-making and algorithmic systems is a niche vertical, but not necessarily a new one. Audit has been an integral aspect of model risk management (MRM) in consumer finance for years, and colleagues at BLDS and QuantUniversity have been conducting model audits for some time. Then there’s the new cadre of AI audit firms like ORCAA, Parity, and babl, with BNH.AI being the only law firm of the bunch. AI audit firms tend to perform a mix of audits and assessments. Audits are usually more official, tracking adherence to some policy, regulation, or law, and tend to be conducted by independent third parties with varying degrees of limited interaction between auditor and auditee organizations. Assessments tend to be more informal and cooperative. AI audits and assessments may focus on bias issues or other serious risks including safety, data privacy harms, and security vulnerabilities.

While standards for AI audits are still immature, they do exist. For our audits, BNH.AI applies external authoritative standards from laws, regulations, and AI risk management frameworks. For example, we may audit anything from an organization’s adherence to the nascent New York City employment law, to obligations under Equal Employment Opportunity Commission regulations, to MRM guidelines, to fair lending regulations, or to NIST’s draft AI risk management framework (AI RMF).

From our perspective, regulatory frameworks like MRM present some of the clearest and most mature guidance for audit, which is critical for organizations looking to minimize their legal liabilities. The internal control questionnaire in the Office of the Comptroller of the Currency’s MRM Handbook (starting pg. 84) is an extraordinarily polished and complete audit checklist, and the Interagency Guidance on Model Risk Management (also known as SR 11-7) puts forward clear-cut advice on audit and the governance structures that are necessary for effective AI risk management writ large. Given that MRM is likely too stuffy and resource-intensive for nonregulated entities to adopt fully today, we can also look to NIST’s draft AI Risk Management Framework and the risk management playbook for a more general AI audit standard. In particular, NIST’s SP1270 Towards a Standard for Identifying and Managing Bias in Artificial Intelligence, a resource associated with the draft AI RMF, is extremely useful in bias audits of newer and complex AI systems.¹

For audit results to be recognized, audits have to be transparent and fair. Using a public, agreed-upon standard for audits is one way to enhance fairness and transparency in the audit process. But what about the auditors? They too must be held to some standard that ensures ethical practices. For instance, BNH.AI is held to the Washington, DC, Bar’s Rules of Professional Conduct. Of course, there are other emerging auditor standards, certifications, and principles. Understanding the ethical obligations of your auditors, as well as the existence (or not) of nondisclosure agreements or attorney-client privilege, is a key part of engaging with external auditors. You should also consider the objective standards for the audit.

In terms of what your organization could expect from an AI audit, and for more information on audits and assessments, the recent paper Algorithmic Bias and Risk Assessments: Lessons from Practice is a great resource. If you’re thinking of a less formal internal assessment, the influential Closing the AI Accountability Gap puts forward a solid framework with worked documentation examples.

What Did We Learn From Auditing a Deepfake Detector and an LLM for Bias?

Being a law firm, BNH.AI is almost never allowed to discuss our work because most of it is privileged and confidential. However, we’ve had the good fortune to work with IQT Labs over the past months, and they generously shared summaries of BNH.AI’s audits. One audit addressed potential bias in a deepfake detection system and the other considered bias in LLMs used for NER tasks. BNH.AI audited these systems for adherence to the AI Ethics Framework for the Intelligence Community. We also tend to use standards from US nondiscrimination law and the NIST SP1270 guidance to fill in any gaps around bias measurement or specific LLM concerns. Here’s a brief summary of what we learned to help you think through the basics of audit and risk management when your organization adopts complex AI.

Bias is about more than data and models

Most people involved with AI understand that unconscious biases and overt prejudices are recorded in digital data. When that data is used to train an AI system, that system can replicate our bad behavior with speed and scale. Unfortunately, that’s just one of many mechanisms by which bias sneaks into AI systems. By definition, new AI technology is less mature. Its operators have less experience and associated governance processes are less fleshed out. In these scenarios, bias has to be approached from a broad social and technical perspective. In addition to data and model problems, decisions in initial meetings, homogeneous engineering perspectives, improper design choices, insufficient stakeholder engagement, misinterpretation of results, and other issues can all lead to biased system outcomes. If an audit or other AI risk management control focuses only on tech, it’s not effective.

If you’re struggling with the notion that social bias in AI arises from mechanisms besides data and models, consider the concrete example of screenout discrimination. This occurs when those with disabilities are unable to access an employment system, and they lose out on employment opportunities. For screenout, it may not matter if the system’s outcomes are perfectly balanced across demographic groups when, for example, someone can’t see the screen, can’t be understood by voice recognition software, or struggles with typing. In this context, bias is often about system design and not about data or models. Moreover, screenout is a potentially serious legal liability. If you’re thinking that deepfakes, LLMs, and other advanced AI wouldn’t be used in employment scenarios, sorry, that’s wrong too. Many organizations now perform fuzzy keyword matching and resume scanning based on LLMs. And several new startups are proposing deepfakes as a way to make foreign accents more understandable for customer service and other work interactions that could easily spill over to interviews.

Data labeling is a problem

When BNH.AI audited FakeFinder (the deepfake detector), we needed to know demographic information about people in deepfake videos to gauge performance and outcome differences across demographic groups. If plans are not made to collect that kind of information from the people in the videos beforehand, then a tremendous manual data labeling effort is required to generate this information. Race, gender, and other demographics are not straightforward to guess from videos. Worse, in deepfakes, bodies and faces can be from different demographic groups. Each face and body needs a label. For the LLM and NER task, BNH.AI’s audit plan required demographics associated with entities in raw text, and possibly text in multiple languages. While there are many interesting and useful benchmark datasets for testing bias in natural language processing, none provided these types of exhaustive demographic labels.

Quantitative measures of bias are often important for audits and risk management. If your organization wants to measure bias quantitatively, you’ll probably need test data with demographic labels. The difficulties of obtaining these labels should not be underestimated. As newer AI systems consume and generate ever-more complicated types of data, labeling data for training and testing is going to get more complicated too. Despite the possibilities for feedback loops and error propagation, we may end up needing AI to label data for other AI systems.

We’ve also observed organizations claiming that data privacy concerns prevent data collection that would enable bias testing. Generally, this is not a defensible position. If you’re using AI at scale for commercial purposes, consumers have a reasonable expectation that AI systems will protect their privacy and engage in fair business practices. While this balancing act may be extremely difficult, it’s usually possible. For example, large consumer finance organizations have been testing models for bias for years without direct access to demographic data. They often use a process called Bayesian-improved surname geocoding (BISG) that infers race from name and ZIP code to comply with nondiscrimination and data minimization obligations.
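The core idea behind BISG can be sketched as a single Bayes update: a surname-based probability of each race group is combined with how likely each group is to live in a given area. The snippet below is a minimal illustration only, not the procedure any particular lender or regulator uses. The group names and every probability in it are invented placeholders (real implementations draw these tables from census surname lists and geographic population counts), and `bisg_posterior` is a hypothetical helper name.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography evidence with Bayes' rule.

    posterior(race) is proportional to
        P(race | surname) * P(geography | race)
    Both arguments are dicts keyed by the same group labels.
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("No group has nonzero probability for this input.")
    return {race: value / total for race, value in unnormalized.items()}


# Invented placeholder numbers, NOT real census figures.
surname_prior = {"group_a": 0.6, "group_b": 0.4}   # P(race | surname)
geo_likelihood = {"group_a": 0.1, "group_b": 0.3}  # P(ZIP code | race)

posterior = bisg_posterior(surname_prior, geo_likelihood)
print(posterior)
```

The output is a probability for each group rather than a hard label, so downstream bias metrics are typically computed on these probabilistic proxies, with the added uncertainty kept in mind.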

Despite flaws, start with simple metrics and clear thresholds

There are many mathematical definitions of bias. More are published all the time. More formulas and measurements are published because the existing definitions are always found to be flawed and simplistic. While new metrics tend to be more sophisticated, they’re often harder to explain and lack agreed-upon thresholds at which values become problematic. Starting an audit with complex risk measures that can’t be explained to stakeholders and without known thresholds can result in confusion, delay, and loss of stakeholder engagement.

As a first step in a bias audit, we recommend converting the AI outcome of interest to a binary or a single numeric outcome. Final decision outcomes are often binary, even if the learning mechanism driving the outcome is unsupervised, generative, or otherwise complex. With deepfake detection, a deepfake is detected or not. For NER, known entities are recognized or not. A binary or numeric outcome allows for the application of traditional measures of practical and statistical significance with clear thresholds.
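As a rough sketch of that first step, the snippet below turns raw detector scores into binary detected/not-detected outcomes and then computes the detection rate for each demographic group. The 0.5 cutoff, the group labels, the scores, and the function names are all assumptions made for illustration; they are not taken from the audited systems.

```python
from collections import defaultdict


def binarize(scores, threshold=0.5):
    """Map raw scores to 1 (detected) or 0 (not detected)."""
    return [1 if score >= threshold else 0 for score in scores]


def rate_by_group(outcomes, groups):
    """Return the share of positive outcomes within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for outcome, group in zip(outcomes, groups):
        totals[group] += 1
        positives[group] += outcome
    return {group: positives[group] / totals[group] for group in totals}


# Toy data: hypothetical scores with a demographic label per example.
scores = [0.91, 0.42, 0.77, 0.30, 0.45, 0.58]
groups = ["a", "a", "a", "b", "b", "b"]

outcomes = binarize(scores)
rates = rate_by_group(outcomes, groups)
print(outcomes, rates)
```

Once outcomes are in this form, group-level rates can be compared directly with traditional measures of practical and statistical significance.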

These metrics focus on outcome differences across demographic groups, for example, the rates at which different race groups are identified in deepfakes or the difference in mean raw output scores for men and women. As for formulas, they have names like standardized mean difference (SMD, Cohen’s d), the adverse impact ratio (AIR) and its four-fifths rule threshold, and basic statistical hypothesis testing (e.g., t-, χ²-, binomial z-, or Fisher’s exact tests). When traditional metrics are aligned to existing laws and regulations, this first pass helps address important legal questions and informs subsequent, more sophisticated analyses.
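To make these measures concrete, here is a minimal standard-library sketch of three of them: the adverse impact ratio with a four-fifths check, a standardized mean difference using a pooled standard deviation, and a two-sided two-proportion z-test. The counts and scores are invented for illustration, the function names are hypothetical, and which outcome counts as "favorable" for a given system is a judgment the auditor has to make.

```python
import math
from statistics import mean, variance


def adverse_impact_ratio(rate_protected, rate_reference):
    """AIR: favorable-outcome rate of one group over the reference group's."""
    return rate_protected / rate_reference


def passes_four_fifths(air, threshold=0.8):
    """Four-fifths rule: an AIR below 0.8 is commonly flagged for review."""
    return air >= threshold


def standardized_mean_difference(protected, reference):
    """Cohen's d: difference in means divided by the pooled std deviation."""
    n1, n2 = len(protected), len(reference)
    pooled_var = ((n1 - 1) * variance(protected)
                  + (n2 - 1) * variance(reference)) / (n1 + n2 - 2)
    return (mean(protected) - mean(reference)) / math.sqrt(pooled_var)


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions."""
    p_pooled = (x1 + x2) / (n1 + n2)
    std_err = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Invented counts: 60 of 100 favorable outcomes vs. 80 of 100.
air = adverse_impact_ratio(60 / 100, 80 / 100)
z, p_value = two_proportion_z_test(60, 100, 80, 100)
smd = standardized_mean_difference([0.2, 0.4, 0.6], [0.5, 0.7, 0.9])
print(air, passes_four_fifths(air), smd, z, p_value)
```

Reporting a practical-significance measure (AIR or SMD) next to a statistical test helps separate differences that are large from differences that are merely detectable in a big sample.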

What to Expect Next in AI Audit and Risk Management?

Many emerging municipal, state, federal, and international data privacy and AI laws are incorporating audits or related requirements. Authoritative standards and frameworks are also becoming more concrete. Regulators are taking notice of AI incidents, with the FTC “disgorging” three algorithms in three years. If today’s AI is as powerful as many claim, none of this should come as a surprise. Regulation and oversight are commonplace for other powerful technologies like aviation or nuclear power. If AI is truly the next big transformative technology, get used to audits and other risk management controls for AI systems.

Footnotes

¹ Disclaimer: I am a co-author of that document (NIST SP1270).