Automating the Assessment of Training Data Quality with Encord

Eric Landau

Jun 22, 2022

When building AI models, machine learning engineers run into two problems with labelling training data: the quantity problem and the quality problem.

For a long time, machine learning engineers were stuck on the quantity problem. Supervised machine learning models need a lot of labelled data, and the model’s performance depends on having enough labelled training data to cover all the different types of scenarios and edge cases that the model might run into in the real world. As they gained access to more and more data, machine learning teams had to find ways to label it efficiently.

In the past few years, these teams have started to find solutions to this quantity problem – either by hiring large groups of people to annotate the data or by using new tools that automate the process and generate a lot of labels in more systematic ways.

Unfortunately, the quality problem only truly began to reveal itself once solutions to the quantity problem emerged. Solving the quantity problem first made sense: after all, the first thing you need to train a model is a lot of labelled training data. However, once you train a model on that data, it quickly becomes apparent that the model's performance is a function not only of the amount of training data but also of the quality of that data's annotations.

The Training Data Quality Problem

Data quality issues arise for a number of reasons. The quality of the training data itself depends on having a strong pipeline for sourcing, cleaning, and organising the data to make sure that your model isn't trained on duplicate, corrupt, or irrelevant data. After putting together a strong pipeline for sourcing and managing data, machine learning teams must be certain that the labels identifying features in the data are error-free.

That’s no easy task because mistakes in data annotations arise from human error, and the reasons for these errors are as varied as the human annotators themselves. All annotators can make mistakes, especially if they’re labelling for eight hours a day. Sometimes, annotators don’t have the domain expertise required to label the data accurately. Sometimes, they haven’t been trained appropriately for the task at hand. Other times, they aren’t conscientious or consistent: they either aren’t careful or haven’t been taught best practices in data annotation. 

A misplaced box around a polyp (from the Hyper Kvasir dataset)

Regardless of the cause, poor data labelling can result in all types of model errors. For example, if trained on inaccurately labelled data, models might make miscategorisation errors, such as mistaking a horse for a cow. Or if trained on data where the bounding boxes haven’t been drawn tightly around an object, models might make geometric errors, such as failing to distinguish the target object from the background or other objects in the frame. A recent study revealed that 10 of the most cited AI datasets have serious labelling errors: the famous ImageNet test set has an estimated label error of 5.8 percent.

When you have errors in your labels, your model suffers because it's learning from incorrect information. In use cases where the consequences of a model's mistake are severe, such as autonomous vehicles and medical diagnosis, the labels must be specific and accurate: there's no room for these types of labelling errors or poor-quality data. In these situations, where a model must operate at 99.99 percent accuracy, small margins in its performance really matter.

The breakdown in model performance from poor data quality is an insidious problem because machine learning engineers often don’t know whether the problem is in the model or in the data. They can spin their wheels trying to improve a model only to realise that the model will never improve because the problem was in the labels themselves. Taking a data- rather than model-centric approach to AI can relieve some of the headaches. After all, these sorts of problems are best first addressed by improving the quality of the training data itself before looking to improve the quality of the model. However, data-centric AI can’t reach its potential until we solve the data quality problem.

Currently, assuring data quality depends on manually intensive review processes. This approach to quality is problematic and unscalable because the volume of data that needs to be checked is far greater than the number of human reviewers available. And reviewers also make mistakes, so there's human inconsistency throughout the labelling chain. To correct for these errors, a company can have multiple reviewers look at the same data, but now the cost and the workload have doubled, so it's not an efficient or economical solution.

Encord’s Fully Automated Data Quality and Label Assessment Tool

When we began Encord, we were focused on the quantity problem. We wanted to solve the human bottleneck in data labelling by automating the process. However, after talking to many AI practitioners, and in particular those at more sophisticated companies, we quickly realised that they were stuck on the quality problem. From these conversations, we decided to turn our attention to solving the data quality problem, too. We realised that the quantity problem would only truly be solved if we got smarter about ensuring that the data going into the pot was also high-quality data.

Encord has created and launched the first fully automated label and data quality assessment tool for machine learning. This tool replaces the manual process that makes AI development expensive, time-consuming, and difficult to scale.

A Quick Tour of the Data Quality Assessment Tool

Within Encord’s platform, we have developed a quality feature that detects likely errors within a client's project using a semi-supervised learning algorithm. The client chooses all the labels and objects that they want to inspect from the project, runs the algorithm, and then receives an automated ranking of the labels by the probability of error.

Each label receives a score, so rather than having a human review every individual label for quality, teams can use the algorithm to curate the data for human review in an intelligent way.

The score reflects whether the label is likely to be high or low quality. The client can set a threshold to send everything above a certain score to the model and send anything below a certain score for manual review. The human can then accept or reject the label based on its quality. The humans are still in the loop, but the data quality assessment tool saves them as much time as possible, using their time efficiently and when it matters the most.
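The threshold-based triage described above can be sketched in a few lines. This is a minimal illustration, not Encord's actual API: the `score` field, the threshold value, and the function name are all assumptions for the example.

```python
# Hypothetical sketch of score-based label triage (not Encord's actual API).
# Each label carries a quality score in [0, 1]; labels at or above the
# threshold pass straight through to the model, the rest are queued
# for human review.

def triage_labels(labels, threshold=0.9):
    """Split labels into auto-accepted and needs-review buckets."""
    accepted, needs_review = [], []
    for label in labels:
        if label["score"] >= threshold:
            accepted.append(label)
        else:
            needs_review.append(label)
    return accepted, needs_review

labels = [
    {"id": "a", "object": "chair", "score": 0.97},
    {"id": "b", "object": "chair", "score": 0.873},
    {"id": "c", "object": "bed", "score": 0.95},
]
accepted, needs_review = triage_labels(labels, threshold=0.9)
# → label "b" lands in needs_review; "a" and "c" go straight through
```

Raising the threshold sends more labels to human reviewers; lowering it trusts the algorithm more. The right setting depends on how costly a model mistake is for the use case.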

In the example below, the client has annotated different objects in the room. The bounding box in the image should be identifying a chair, but it isn't tight to the chair and misses some of the object. That's a label that a reviewer might want to inspect to see if it could be improved. Its score is .873, so if the threshold were set to .90 or above, this label would automatically be sent for review. It would never make it to the model unless a human passed it on.

The tool also aggregates statistics on the human rejection rate of different items, so machine learning teams can get a better understanding of how often humans reject certain labels. With this information, they can focus on improving labelling for more difficult objects. In the below example, beds and chairs have the highest rejection rate.
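Aggregating the per-class rejection rate is straightforward bookkeeping. The sketch below shows one way to compute it from a stream of review decisions; the data shape is assumed for illustration and is not Encord's schema.

```python
from collections import defaultdict


def rejection_rates(reviews):
    """Compute the fraction of rejected labels per object class.

    Each review is an (object_class, accepted) pair, where
    accepted is a bool recording the human reviewer's decision.
    """
    totals = defaultdict(int)
    rejected = defaultdict(int)
    for obj, accepted in reviews:
        totals[obj] += 1
        if not accepted:
            rejected[obj] += 1
    return {obj: rejected[obj] / totals[obj] for obj in totals}


reviews = [("chair", False), ("chair", True), ("bed", False), ("sofa", True)]
rates = rejection_rates(reviews)
# rates["chair"] == 0.5, rates["bed"] == 1.0, rates["sofa"] == 0.0
```

Classes with the highest rejection rates, such as the beds and chairs in the example, are the natural place to focus annotator training or clearer labelling guidelines.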

The tool currently works with object detection because that is the greatest need among our clients, but we're also working on ground-breaking research to extend it to other computer vision tasks, such as segmentation.

Increased Efficiency: Combining Automated Data Labelling with the Quality Data Assessment Tool

Encord’s platform allows you to create labels manually and through automation (e.g. interpolation, tracking, or using our native model-assisted labelling). It also allows you to import model predictions via our API and Python SDK. Labels or imported model predictions are often subjected to manual review to ensure that they are of the highest possible quality or to validate results.

Now, however, using our automated quality assessment tool, our clients can perform an automated review of the labels generated by the aforementioned different labelling agents without changing any of their workflows and at scale. 

The quality feature reassures customers about the quality of machine-generated labels. In fact, our platform aggregates information to show which label-generating agents (human annotators, imported labels, or automatically produced labels) are doing the best job. In other words, the tool doesn't distinguish between human- and model-produced labels when ranking the labels within a dataset. As a result, this feature helps build confidence in using several different label-generating methods to produce high-quality training data.

With both automated label generation using micro-models and the automated data quality assessment tool, Encord is optimising the human-in-the-loop’s time as much as possible. In doing so, we can cherish people’s time by using it only for the most necessary and meaningful contributions to machine learning.

Machine learning and data operations teams of all sizes use Encord's collaborative applications, automation features, and APIs to build models and to annotate, manage, and evaluate their datasets. Check us out here.