Performance: Improving Quality while Mitigating Risks with Tests and Verification Procedures
By Dr. Rebecca Portnoff, Head of Data Science at Thorn
What does it look like when AI goes wrong — and how can we protect our organizations? In this week’s Navigating AI article, Dr. Rebecca Portnoff, Head of Data Science at Thorn, explores how to mitigate risk. For example, concepts like false positives and false negatives aren't just abstract statistical terms: They have real-world implications for the people involved in the system. She outlines testing and verification methods to help you navigate this new reality. I hope that you and your organization can explore these issues together.
-Raffi
What does it look like when AI goes wrong? And how can we protect ourselves against those risks? While these may seem like new concerns, machine learning (ML)/AI researchers, practitioners, and ethicists have been considering and addressing these questions for several decades.
I have had the privilege of wearing all three of these hats (ML/AI researcher, practitioner, and ethicist) in an issue space to which I’ve dedicated my career: protecting children from sexual abuse. For over a decade, I have worked at the intersection of child safety and ML/AI, building tools and driving initiatives to defend those most vulnerable among us. I hold a B.S.E. from Princeton and a Ph.D. from UC Berkeley, both in Computer Science. My Ph.D. dissertation focused on what my team at Thorn does today: building ML/AI to protect children from sexual abuse.
At Thorn, we have built a suite of image, video, and text classifiers to detect sexual harms against children: harms like new child sexual abuse material (CSAM) that depicts a child in an active abuse scenario, text in which a bad actor is sexually extorting a child, or a viral video of a survivor’s CSAM that is shared over and over again. Our solutions are built to arm frontline defenders against the scale of material they are facing in order to defend the children we serve. With ML/AI, we have seen firsthand that we can unlock the power of rapid prioritization and triage of abuse material across all the data domains in which it manifests. We are working towards a world where that child in need doesn’t have to wait another second to get their life back.
The stakes are real — and very high — with the work that we do. The fact of the matter is that every ML/AI system, regardless of the use case, will make errors. Understanding concepts like false positives and false negatives is crucial. These aren't just abstract statistical terms – they have real-world implications for the people involved in the system. In Thorn’s mission space, false negatives could mean that a child in harm’s way remains unidentified and continues to suffer abuse for days, months, or even years. False positives could waste the time of frontline defenders – including trust and safety moderators, hotline analysts, and victim identification specialists – diverting their attention from urgent cases. Navigating false negatives and positives in a privacy-preserving way is always front of mind in our work because any use of ML/AI will result in both.
Effectively navigating this reality is crucial, and testing and verification are critical to achieving it. That’s why employing a holistic approach across the ML/AI lifecycle is essential, one that incorporates the following three stages.
Stage 1: Develop
In this first stage, ML/AI engineers focus on research and development to build the desired model. This involves significant iteration: experimenting with different ML/AI frameworks and modeling techniques, conducting exploratory data analysis, gathering requirements from stakeholders, benchmarking different approaches, iterative data-cleaning and data-labeling, and more. The goal is to build the “best” model, but how do we define “best?” This will depend on the performance metric chosen to evaluate the model against the validation and testing datasets. A validation dataset is used to tune different model parameters, assessing the model’s performance across multiple iterative experiments. The testing dataset is a hold-out dataset typically used to measure the model’s likely performance in the wild before it is deployed into production. When evaluating the model’s performance against both the validation and testing datasets, your ML/AI engineers must pick the right benchmark metric for the desired task (binary vs. multi-class classification, multi-label classification, regression, etc.). Some common metrics used to evaluate a model’s performance for classification tasks include:
Accuracy - Accuracy is what most people think of when considering a metric for classification. It is defined as the total number of correct predictions the model makes divided by the total number of predictions it makes. When working with balanced data, in which the expected population has an even distribution across classes – a class being one of the labels across a collection of related data, e.g. “is abuse material” or “is not abuse material” – accuracy can be a good measurement of performance. However, many tasks that we encounter do not have balanced data: e.g., detecting an imminent neurological stroke (more people do not have strokes than have strokes), or detecting online abuse (while sadly, the scale of reported abuse material online continues to grow rapidly, there are still more non-abusive interactions and experiences online than abusive ones). In these cases, accuracy has some key limitations.
When navigating imbalanced datasets, accuracy may not provide a complete picture of how well the model performs. For example: Let’s say you build a model to distinguish between abusive and non-abusive content online. Let’s further say that your model does a good job predicting when an interaction is non-abusive but doesn’t do as well predicting when an interaction is abusive. When evaluated against a testing dataset that is representative of that imbalanced reality of data, the nuance of underperformance on abusive data won’t be reflected in the final accuracy score. It will be drowned out by the signal of how well the model performs on the majority (non-abusive) class.
Accuracy also assumes that the negative downstream impact of a false positive is the same as that of a false negative. This isn’t always true. Depending on your use case, a false positive might be significantly more damaging than a false negative, and vice versa. For example: While both would be difficult to experience, not knowing that you have cancer (false negative) could lead to a worse health outcome than thinking you have cancer when you don't (false positive). The impact of false positives versus false negatives will also depend on whether a system is in place to contest a prediction – e.g., whether your bank allows you to contest a transaction that was flagged as fraudulent (false positive) or indicate a transaction that wasn't flagged as fraudulent but should have been (false negative). Depending on what kind of systems are in place to correct for either type of error, the human cost can vary drastically.
Precision - Precision, which measures the quality of a model’s predictions, can be a better metric for measuring performance for classification tasks on imbalanced datasets when the target “positive” class is in the minority. When the model predicts that a piece of data is in the “positive” class, how often is the model correct? This is calculated by dividing the total number of correct positive predictions that the model makes by the total number of positive predictions that the model makes, both correct and incorrect. One obvious limitation of this metric is that it doesn’t take false negatives into account. This brings us to recall.
Recall - Recall is typically paired with precision for classification tasks, especially for cases where mitigating both false positives and false negatives is a priority. Recall measures the breadth of a model’s predictions: Of all the pieces of data that are truly in the “positive” class, how many does the model correctly identify as in that class? This is calculated by dividing the total number of correct positive predictions (true positives) by the total number of actual positive examples in the dataset (both those the model correctly predicts as positive and those it incorrectly predicts as negative). Precision and recall are typically in tension: the higher you set your model’s confidence threshold, the higher your precision tends to be, at the cost of recall.
False Positive Rate - When dealing with classification tasks on imbalanced datasets where the target “positive” class is in the majority, the false positive rate can be an effective metric, and is typically paired with the true positive rate, which is equivalent to recall. The false positive rate measures “false alarms,” the number of positive predictions the model made that are truly negative, divided by the total number of negative examples (both false positives and true negatives).
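To make these four metrics concrete, here is a minimal sketch in plain Python that computes them from a small, imbalanced toy dataset; the labels and predictions are invented for illustration and are not drawn from any real system.

```python
# Minimal sketch: accuracy, precision, recall, and false positive rate on a
# small, imbalanced toy dataset (labels and predictions are made up).
# 1 = "positive" class (e.g. abusive), 0 = "negative" class (non-abusive).

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # only 2 of 10 examples are positive
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]   # model output for each example

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives

accuracy  = (tp + tn) / len(y_true)   # correct predictions / all predictions
precision = tp / (tp + fp)            # how often a positive call is right
recall    = tp / (tp + fn)            # how many true positives are found
fpr       = fp / (fp + tn)            # "false alarms" among true negatives

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} fpr={fpr:.2f}")
# accuracy=0.80 looks respectable, yet precision=0.50 and recall=0.50 show the
# model misses half of the rare positive class -- exactly the gap accuracy hides.
```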
In addition to selecting the appropriate metric to evaluate the model’s performance across multiple iterative experiments in the development stage, ML/AI engineers must also ensure that the data they are using to train and evaluate the model is appropriately curated. This requires a thorough understanding of not just the data but the downstream business and user needs. For example, some predictive ML/AI models are built for the purpose of predicting future behavior or future states of individuals (e.g., is a person at high risk for diabetes?). Other predictive ML/AI models are built for the purpose of identifying current behavior or states of individuals (e.g., is a person a minor or an adult?). In either case, if the data isn’t representative of the intended population or is biased toward certain sub-categories within that population, you will end up with false positives and false negatives that are skewed against the under-represented parts of that population. However, what it means to have representative data may look different for different use cases. For example, when predicting future states, data that reflects the time series nature of the problem will be necessary, capturing data points over time at consistent intervals.
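To ground the time-series point, here is a minimal sketch of splitting data by observation time rather than at random for a future-state prediction task, so the model is evaluated only on records collected after the ones it was trained on; the record fields, labels, and cut-off dates are hypothetical.

```python
# Sketch: a time-ordered train/validation/test split for a future-state
# prediction task (e.g. diabetes risk). Splitting at random could leak
# "future" information into training; splitting by observation date keeps
# evaluation honest. Field names and cut-off dates are hypothetical.
from datetime import date

records = [
    {"observed": date(2022, 3, 1), "features": {"bmi": 31.2}, "label": 0},
    {"observed": date(2022, 9, 1), "features": {"bmi": 29.8}, "label": 1},
    {"observed": date(2023, 2, 1), "features": {"bmi": 33.0}, "label": 1},
    {"observed": date(2023, 8, 1), "features": {"bmi": 27.4}, "label": 0},
]  # label: 1 if the (hypothetical) person later developed the condition

records.sort(key=lambda r: r["observed"])   # order by when each point was captured

train_cutoff = date(2023, 1, 1)             # train only on earlier observations
valid_cutoff = date(2023, 7, 1)             # validate on the next window in time

train = [r for r in records if r["observed"] < train_cutoff]
valid = [r for r in records if train_cutoff <= r["observed"] < valid_cutoff]
test  = [r for r in records if r["observed"] >= valid_cutoff]  # most recent hold-out
```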
Computers may be at the core of this work, but ultimately, verifying that your training, testing, and validation data are representative, unbiased, and appropriate for the particular task at hand requires manually annotating the data. This allows you to know where there are gaps in the data that must be supplemented. There are several important elements to a robust, responsible, and ethical labeling strategy. Annotators should ideally be domain experts in the classification task or have been trained by domain experts to understand the task. Working with multiple annotators and measuring inter-annotator agreement on labels those annotators provide helps mitigate individual annotator bias that may be present. Iteratively labeling in batches allows for continuous feedback and improvement. Providing wellness support for labeling tasks that may be challenging to an annotator’s mental health (depending on the graphic nature of the material or the scale of the labeling task) is also crucial.
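As one hedged illustration of measuring inter-annotator agreement, the sketch below computes Cohen’s kappa for two annotators by hand; the labels are invented, and in practice a team might rely on an existing implementation (for example, scikit-learn’s cohen_kappa_score) rather than rolling its own.

```python
# Sketch: Cohen's kappa for two annotators labeling the same items
# (1 = "is abuse material", 0 = "is not"). Labels here are invented.
from collections import Counter

annotator_a = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
annotator_b = [1, 0, 0, 0, 0, 0, 0, 1, 0, 1]

n = len(annotator_a)
observed = sum(a == b for a, b in zip(annotator_a, annotator_b)) / n

# Agreement expected by chance, given each annotator's own labeling rates.
counts_a, counts_b = Counter(annotator_a), Counter(annotator_b)
classes = set(annotator_a) | set(annotator_b)
expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in classes)

kappa = (observed - expected) / (1 - expected)
print(f"observed agreement={observed:.2f}, kappa={kappa:.2f}")
# Kappa close to 1 means strong agreement beyond chance; values near 0 suggest
# the labeling guidelines (or annotator training) need another look.
```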
As a part of thorough data curation, ML/AI engineers will also need to vigorously deduplicate their datasets, eliminating additional copies of repeating data to ensure there is no leakage between the training, testing, and validation datasets. Without this type of deduplication in place, they run the risk of both building models that are overfit to a particular dataset – preventing them from generalizing to new data – and reporting metrics that don’t accurately represent how the model performs.
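Below is a minimal sketch of exact deduplication before splitting, assuming text data and a simple content hash; real pipelines for images or video typically layer near-duplicate detection (e.g. perceptual hashing) on top, which this example does not attempt.

```python
# Sketch: exact deduplication by content hash before splitting, so the same
# item can never appear in both training and testing data.
import hashlib
import random

def dedupe(items):
    seen, unique = set(), []
    for text in items:
        digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:          # keep only the first copy of each item
            seen.add(digest)
            unique.append(text)
    return unique

corpus = ["example message a", "Example message A", "example message b"]
unique_items = dedupe(corpus)           # the two near-identical copies collapse to one

random.seed(7)
random.shuffle(unique_items)            # split only AFTER deduplication
split = int(0.8 * len(unique_items))
train, test = unique_items[:split], unique_items[split:]
```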
Stages 2 and 3: Deploy and Maintain
In the Deploy stage, the model is integrated into a production environment such that the model’s outputs can now be surfaced to users. This stage typically includes significant collaboration between multiple teams (engineering, product, ML/AI, data science, etc.) – collaboration that should be included in Stage 1 as well.
It is also tightly coupled with the Maintain stage, in which ML/AI engineers work to uphold the quality of their models’ performance in the face of data drift, model drift and changes in user behavior. This requires building infrastructure that enables continuous assessment of the model’s performance, and incorporates a human into the loop for model correction and user feedback. This infrastructure allows ML/AI engineers to make continuous improvements that mitigate risks of false positives and negatives. There are three main strategies that can be used to accomplish these goals:
Continuous Assessment - Batch predictions allow the team to analyze the distribution and quality of predictions of the new incoming data before surfacing results to the end user. Rigorous unit and integration tests are important for ensuring the ML/AI system works as intended end-to-end. Metrics dashboards that alert the team when performance has dipped below a predefined performance threshold can be used to trigger a retrain (in which new training data is used to update the model, but no other design changes are made) or rebuild (in which the design of the model is revisited, altering the fundamental build) of the model. Phased deployment that incrementally expands the user base allows for paced assessment of the model as your group of users grows and changes.
Human in the Loop - When surfacing predictions, teams can transparently indicate the model’s confidence in the prediction to the end user, enabling the user to make an informed, human decision on any necessary action (with their full context). Similarly, for prioritization and triage use cases, product offerings can be designed to fit into existing workflows for end users, such that the ML/AI model is plugged in as one indicator amongst many for the human to use when making a final determination. Functionality that empowers users to provide feedback – both on final output predictions and on intermediate results as the model is iteratively optimized – is critical for the long-term success of a model’s performance. This feedback can be used in both retrains and rebuilds of the model, allowing for continuous short- and long-term performance improvements (as sketched below).
Ongoing Research - Monitoring and maintaining performance becomes more challenging when dealing with data in the wild versus in a controlled testing environment. This is particularly true for those building ML/AI in an adversarial space. The unfortunate reality is that bad actors change their behaviors in order to make identifying victims, stopping revictimization and preventing abuse more difficult. Bad actors are often early adopters of new technology, finding ways to use it to further sexual harms against children.
Therefore, continuously conducting research to understand how your domain or issue space is evolving is part of a complete strategy for performance that actually solves the underlying problem you set out to address. Doing this will allow you to measure your model’s performance against those emerging threats, or make a decision as to whether a new solution needs to be built to keep up with the changes.
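To illustrate how continuous assessment and human-in-the-loop feedback can fit together, here is a minimal sketch of a monitoring check that recomputes precision and recall from human-reviewed predictions and flags when either drops below a predefined floor. The thresholds, record format, and retrain trigger are hypothetical illustrations, not a description of Thorn’s systems.

```python
# Sketch: recompute precision and recall from human-reviewed predictions and
# flag when either dips below a predefined floor. Thresholds, record format,
# and the retrain decision are hypothetical.
from dataclasses import dataclass

@dataclass
class ReviewedPrediction:
    model_said_positive: bool   # what the classifier predicted
    human_said_positive: bool   # what the human reviewer concluded

PRECISION_FLOOR = 0.90          # hypothetical alert thresholds
RECALL_FLOOR = 0.80

def check_performance(reviews: list[ReviewedPrediction]) -> bool:
    """Return True if a retrain (or rebuild) should be considered."""
    tp = sum(r.model_said_positive and r.human_said_positive for r in reviews)
    fp = sum(r.model_said_positive and not r.human_said_positive for r in reviews)
    fn = sum(not r.model_said_positive and r.human_said_positive for r in reviews)

    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    # Note: recall here only reflects cases that reached human review (e.g. via
    # user reports), a known blind spot of feedback-based monitoring.

    if precision < PRECISION_FLOOR or recall < RECALL_FLOOR:
        print(f"ALERT: precision={precision:.2f}, recall={recall:.2f} "
              "below floor; consider a retrain or rebuild.")
        return True
    return False
```

In a real system this check would run against a metrics dashboard and a feedback pipeline rather than a list in memory, but the underlying idea, tying human-reviewed outcomes to a retrain or rebuild decision, is the same.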
Conclusion
As we move into a world in which AI is increasingly integrated into our processes and systems, it is important to remember that it will be humans who bear the consequences, both positive and negative. While measurement on its own isn’t enough to steward this shift toward AI, it’s also true that you can’t fix what you haven’t measured.
I work in a space filled with good intentions. But good intentions are only the starting point of a mission. When building technology, achieving actual positive impact requires understanding where and when your technology is performing as expected, and where and when it isn’t – and then letting go of ego and fixing it. And when it comes to ML/AI, the testing, validation, and feedback loops necessary to answer the question “Is your technology performing as expected?” are never one and done. ML/AI is inherently iterative, and that iteration has to happen as part of a fully built-out system, with checks and balances in place to ensure continuous improvement.