Labeling Function Analysis

Lets users view the results of their labeling functions, including coverage, overlaps, and conflicts, and improve performance by training the label model.

To get results from labeling function analysis, you must first create some labeling functions and predict labels.

Labeling Function Analysis Window

The labeling function analysis window can be opened in two ways: by clicking the labeling function button, or through the "See labeling function analysis" button after predicting the labels.

If you haven't predicted the labels yet, the labeling function analysis page will show empty values.

After clicking predict labels, the results are shown in the labeling function analysis window. There are three metrics: coverage, overlaps, and conflicts (a sketch computing all three follows the list).

  1. Coverage is the fraction of the dataset that each labeling function labels.

  2. Overlaps are the fraction of the dataset where a labeling function and at least one other labeling function both assign a label.

  3. Conflicts are the fraction of the dataset where a labeling function and at least one other labeling function both assign a label, and those labels disagree.
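As a concrete illustration, the Python sketch below computes all three metrics with NumPy. The label matrix `L` and its values are hypothetical; they follow the common weak-supervision convention of one row per data point, one column per labeling function, and `-1` for abstain, which may differ from how this product stores labels internally.

```python
import numpy as np

# Hypothetical label matrix: one row per data point, one column per
# labeling function; -1 means "abstain" (the LF did not label the row).
L = np.array([
    [ 0, -1,  0],
    [ 1,  1, -1],
    [-1,  0,  1],
    [ 0,  0,  0],
    [-1, -1, -1],
])

labeled = L != -1                        # which LF labeled which row
fired_per_row = labeled.sum(axis=1)      # how many LFs labeled each row

# Coverage: fraction of rows each LF labels.
coverage = labeled.mean(axis=0)

# Overlaps: fraction of rows where an LF and at least one other LF
# both assign a label.
overlaps = (labeled & (fired_per_row > 1)[:, None]).mean(axis=0)

# Conflicts: fraction of rows where an LF labels and at least one
# other LF assigns a different (non-abstain) label.
n_lfs = L.shape[1]
conflicts = np.zeros(n_lfs)
for j in range(n_lfs):
    others = np.delete(L, j, axis=1)
    disagree = (others != -1) & (others != L[:, [j]])
    conflicts[j] = (labeled[:, j] & disagree.any(axis=1)).mean()

print("coverage: ", coverage)
print("overlaps: ", overlaps)
print("conflicts:", conflicts)
```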

If you add a new labeling function or change an existing one, you need to re-predict labels in order to update the analysis values.

How to Improve Labeling Function Performance

The ideal situation for a labeling function is high coverage, high overlaps, and low conflicts. Below are two common labeling function performance conditions and how to improve each:

Fairly high coverage, high overlaps, and high conflicts.

This means our labeling functions label many data points, and the majority of those data points are assigned different labels by more than one labeling function. Consider the example metric values below.

  1. Coverage = 50%

  2. Overlaps = 30%

  3. Conflicts = 27%

These numbers show that even though coverage and overlaps are large, labeling functions disagree on roughly half of the covered data points (27% conflicts against 50% coverage). To improve this, we need to train the label model, which estimates the accuracies of and correlations between labeling functions; this matters because some labeling functions give stronger or weaker signals about the true label, as shown in the sketch below.
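As a hedged illustration, the sketch below trains a label model with the open-source Snorkel library (`snorkel.labeling.model.LabelModel`); this product's built-in label model may work differently. The label matrix `L` reuses the hypothetical convention from the earlier sketch.

```python
import numpy as np
from snorkel.labeling.model import LabelModel

# Hypothetical label matrix: rows = data points, columns = labeling
# functions, -1 = abstain, 0/1 = the two classes.
L = np.array([
    [ 0, -1,  0],
    [ 1,  1, -1],
    [-1,  0,  1],
    [ 0,  0,  0],
    [ 1, -1,  0],
])

# Fit the label model; it estimates each LF's accuracy from the
# agreement/disagreement patterns in L, without ground-truth labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=123)

# Estimated per-LF accuracy weights: low weights flag weak LFs.
print("estimated LF weights:", label_model.get_weights())

# Probabilistic labels aggregated across the (possibly conflicting) LFs.
print("predicted labels:", label_model.predict(L))
```

Labeling functions with noticeably low estimated weights are good candidates for revision.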

Low coverage, high overlaps, and high conflicts.

We need to add several new labeling functions to raise coverage, and identify which labeling function creates the most conflicts by experimenting with them one by one, as in the sketch below. After identifying it, we can revise that labeling function and re-evaluate the set.
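One hypothetical way to run that one-by-one experiment outside the UI is a leave-one-out check: drop each labeling function in turn and see how the overall conflict rate changes. The sketch below assumes the same hypothetical label matrix convention as before.

```python
import numpy as np

def conflict_rate(L):
    """Fraction of rows where two or more non-abstaining LFs disagree."""
    conflicted = 0
    for row in L:
        votes = row[row != -1]           # ignore abstains
        if len(np.unique(votes)) > 1:    # more than one distinct label
            conflicted += 1
    return conflicted / len(L)

# Hypothetical label matrix (-1 = abstain).
L = np.array([
    [ 0, -1,  0],
    [ 1,  1, -1],
    [-1,  0,  1],
    [ 0,  1,  0],
])

print(f"all LFs: conflict rate = {conflict_rate(L):.2f}")

# Leave each LF out in turn; the biggest drop in conflict rate points
# to the labeling function that causes the most disagreement.
for j in range(L.shape[1]):
    reduced = np.delete(L, j, axis=1)
    print(f"without LF {j}: conflict rate = {conflict_rate(reduced):.2f}")
```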
