
Labeling tools are great, but what about quality checks?

Jakub Piotr Cłapa
January 17, 2023

Modern datasets contain hundreds of thousands to millions of labels that must be kept accurate. In practice, some errors in the dataset average out and can be ignored, but systematic biases transfer to the model. After quick initial wins in areas where abundant data is readily available, deep learning needs to become more data-efficient to help solve difficult business problems. In the words of deep learning pioneer Andrew Ng:

In many industries where giant data sets simply don’t exist, I think the focus has to shift from big data to good data. – Andrew Ng: Unbiggen AI - IEEE Spectrum

Over the course of 2022, we worked on an open-source tool that combines novel unsupervised machine-learning pipelines with a new user interface concept that, together, help annotators and machine-learning engineers identify and filter out label errors.

Key takeaways

  • Even carefully curated AI datasets have errors that can be spotted and fixed to improve the accuracy of resulting models.
  • Existing labeling tools do not have good support for doing quality assurance.
  • Fixing the roughly 3% of speed limit sign labels that were wrong lowered the model's error rate on those signs by almost 2 percentage points, although exact results will depend on the dataset and task.
  • Thanks to MLfix, even a big dataset like the Mapillary Traffic Sign Dataset could be fully verified and fixed by a single person over a few days of work.

The QA problem in data labeling

Labeling is a difficult cognitive task, and accurate labels require a serious Quality Assurance (QA) process. Most existing labeling tools (both commercial and open-source) have only minimal support for review. Frequently, the QA process is more difficult (and expensive!) than the initial labeling, since you are forced to use an interface optimized for drawing bounding boxes to verify whether all labels were assigned correctly. Here is the process described by a leading annotation service provider:

Annotations are reviewed four times in order to confirm accuracy. Two annotators label a given object, a supervisor then checks the quality of their work. – keymakr, a leading annotation provider

How hard can it be?

Can you spot the mistake in the following photo? If not, I can't blame you. This is hard because it requires expert knowledge and a lot of cognitive resources to read all the labels, remember what each of these signs should look like, and finally spot the ones that are incorrect.

What if instead we show the exact same data like this:

Now it's not so difficult to spot the one speed limit sign that does not fit with the rest (the 30 km/h speed limit). You only need to keep a single type of object in your working memory at a time, and the task taps into the intuitive skill of spotting items that stand out from the rest. It also takes an order of magnitude less time.

This insight led directly to the creation of MLfix. Its streamlined interface lets us perform the QA process more than 10 times faster and avoid missing as many as 30% of the errors.
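The core idea, reviewing one class at a time instead of one photo at a time, can be illustrated with a minimal sketch. This is not MLfix's actual code: the record layout, the file names, and the `group_by_label` helper are assumptions made up for the illustration.

```python
from collections import defaultdict

# Hypothetical annotation records: one labeled object (crop) per record.
annotations = [
    {"image": "a.jpg", "box": (10, 10, 40, 40), "label": "speed-limit-60"},
    {"image": "a.jpg", "box": (80, 12, 110, 42), "label": "stop"},
    {"image": "b.jpg", "box": (5, 30, 35, 60), "label": "speed-limit-60"},
    {"image": "c.jpg", "box": (50, 50, 90, 90), "label": "speed-limit-60"},
]


def group_by_label(records):
    """Collect records that share a label so each class is reviewed as one batch."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["label"]].append(rec)
    return dict(groups)


review_batches = group_by_label(annotations)

# A reviewer would see all crops of one class side by side,
# so an object that does not belong stands out visually.
for label, batch in sorted(review_batches.items()):
    print(f"{label}: {len(batch)} crops to review together")
```

Each batch would then be rendered as a grid of cropped objects, which is what turns the review into a "spot the odd one out" task.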

MLfix in action

The video below shows a user quickly scrolling through 40 objects belonging to 5 classes and finding 6 mislabeled examples.

You can also try it yourself on a selection of 60 km/h speed limit signs from the Mapillary Traffic Sign Dataset. Note that, depending on demand, the live demo can take some time to start.

Slicing the data in many ways

MLfix can be used as a standalone tool, but it can also be embedded directly into Jupyter notebooks that are used by data scientists to prepare and train deep learning networks. Thanks to that, MLfix can tap into all the metadata you have about your dataset and also utilize networks you've trained to help you with the QA process. You can:

  1. Slice the images based on the ground truth label:

  2. Show visually similar images together (based on LPIPS metric or a novel sorting network pretrained in an unsupervised manner):

  3. Show the output of your model (sorted by loss) on the validation set images to fish out mistakes. Here we are looking at the ground-truth class other-sign that the model believed to be the do-not-enter sign; we can see that it was right most of the time:
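The third option, sorting model outputs by loss, can be sketched in plain Python. This is only an illustration under stated assumptions: the sample records, the `probs` dictionaries, and the `suspicious_examples` helper are hypothetical and not part of MLfix, and cross-entropy is assumed as the loss.

```python
import math

# Hypothetical validation samples: ground-truth label plus the model's
# predicted class probabilities.
samples = [
    {"id": "v1", "label": "other-sign", "probs": {"other-sign": 0.05, "do-not-enter": 0.95}},
    {"id": "v2", "label": "other-sign", "probs": {"other-sign": 0.90, "do-not-enter": 0.10}},
    {"id": "v3", "label": "other-sign", "probs": {"other-sign": 0.30, "do-not-enter": 0.70}},
    {"id": "v4", "label": "do-not-enter", "probs": {"other-sign": 0.20, "do-not-enter": 0.80}},
]


def cross_entropy(probs, true_class):
    """Negative log-probability of the ground-truth class (clamped to avoid log(0))."""
    return -math.log(max(probs.get(true_class, 0.0), 1e-12))


def suspicious_examples(records, true_label, predicted_label):
    """Select samples labeled true_label that the model assigned to
    predicted_label, and return (loss, id) pairs with the highest loss first."""
    selected = []
    for rec in records:
        predicted = max(rec["probs"], key=rec["probs"].get)
        if rec["label"] == true_label and predicted == predicted_label:
            selected.append((cross_entropy(rec["probs"], rec["label"]), rec["id"]))
    return sorted(selected, reverse=True)


ranked = suspicious_examples(samples, "other-sign", "do-not-enter")
for loss, sample_id in ranked:
    print(f"{sample_id}: loss={loss:.2f}")
```

Samples at the top of this ranking are the ones where the model disagrees most confidently with the ground truth, which makes them good candidates for a label check.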

All right, but is it worth doing?

We ran a comparison on the Mapillary Traffic Sign Dataset, an extensive dataset of 206 thousand traffic signs divided into 401 classes. Among these, there are 6,400 annotations of speed limit signs, and with MLfix we found and removed the roughly 3% of them that were erroneous in about 30 minutes. In other words, we corrected 0.11% of all the labels in the whole dataset.

We trained image classification models (based on the ResNet50 backbone) on both the original and the fixed dataset 20 times and averaged the accuracy metrics. After fixing the dataset, the overall model error rate went down from 7.28% to 7.05%, and the error rate for speed limit signs dropped by almost 2 percentage points (from 10.42% to 8.49%), which is a significant improvement for a very modest amount of effort. More information about these experiments (including the code to reproduce the results) can be found in the GitHub repo jpc/mlfix-mapillary-traffic-signs. The accuracy histograms show that the improvement is consistent across multiple training runs.
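To put the error rates quoted above in perspective, the short calculation below separates the absolute change (in percentage points) from the relative change (as a share of the original error rate). The `error_reduction` helper is just an illustrative name; the four input numbers are the ones reported in this post.

```python
def error_reduction(before, after):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before - after
    relative = absolute / before * 100
    return absolute, relative


# Error rates (in percent) before and after fixing the labels.
overall = error_reduction(7.28, 7.05)
speed_signs = error_reduction(10.42, 8.49)

print(f"overall:     -{overall[0]:.2f} points ({overall[1]:.1f}% relative)")
print(f"speed signs: -{speed_signs[0]:.2f} points ({speed_signs[1]:.1f}% relative)")
```

For the speed limit signs, the drop of almost 2 percentage points corresponds to roughly 18.5% fewer errors relative to the original model.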

Outlook

Our work would not have been possible without the help of countless open-source resources. We hope MLfix will help the annotation community build the next generation of innovative technology.

If you have questions or ideas, join us on our Gitter #lounge channel or leave a comment in the comment section.
