Evaluation Metrics

Easily evaluate your labeling process

To access this page, click the triple dots on a specific project, select "View Project Details," and navigate to the "Evaluation Metrics" tab.

Note that evaluation metrics are available only after a project is completed. During calculation, the labels from Reviewer Mode serve as the ground truth, and each labeler is evaluated against that Reviewer Mode version.

Currently, evaluation metrics are only available for Row Labeling projects with dropdown, hierarchical dropdown, checkbox, and radio button questions (excluding multiple answers).

How to Use

Datasaur acknowledges that evaluation metrics are commonly employed to assess models rather than labeling work. This distinguishes them from IAA (Inter-Annotator Agreement), whose primary purpose is to gauge the agreement level among labelers. As previously mentioned, the system compares the Reviewer Mode labels (as the ground truth) against each individual labeler. However, you can still assess a model using one of the following approaches:

  1. Integrate the model's inference results through ML-Assisted Labeling, invoked by a specific labeler who represents the model. The system functions as usual, but evaluating that labeler against the Reviewer Mode effectively evaluates the model. This approach works for evaluating multiple ML models.

    1. Create the project as usual, setting no consensus requirement so that reviewers can enter the correct answers without handling conflicts. To simplify the work, assign one person who can act as both Labeler and Reviewer. If you want to assess more than one ML model, assign the labelers accordingly.

    2. Open the project; you will be in Labeler Mode by default. Run ML-Assisted Labeling, which represents the inference from the model.

    3. Change to Reviewer Mode and finish the labeling to provide the ground truth.

    4. Complete the project, triggering the Evaluation Metrics calculation.

  2. Integrate the model's inference results as pre-labeled data using the Datasaur Schema format. Similarly, the pre-labels are evaluated against the Reviewer Mode. This approach only works for evaluating one ML model.

    1. Create the project with pre-labeled input. To simplify the work, assign one person who can act as both Labeler and Reviewer.

    2. Open the project; you will be in Labeler Mode by default.

    3. Change to Reviewer Mode and finish the labeling to provide the ground truth.

    4. Complete the project, triggering the Evaluation Metrics calculation.

Metrics

Accuracy

  • The proportion of correctly labeled instances among the total labels.

  • Calculation = total correct labels divided by total labels.
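
To make the formula concrete, here is a minimal Python sketch (illustrative only, not Datasaur's internal implementation) that compares one labeler's answers against the Reviewer Mode labels; the label values and variable names are hypothetical:

```python
# Hypothetical data: Reviewer Mode labels (ground truth) vs. one labeler's answers.
reviewer = ["spam", "ham", "ham", "spam", "ham"]
labeler  = ["spam", "ham", "spam", "spam", "ham"]

# Accuracy = total correct labels / total labels.
correct = sum(r == l for r, l in zip(reviewer, labeler))
accuracy = correct / len(reviewer)
print(f"Accuracy: {accuracy:.2f}")  # 4 correct out of 5 -> 0.80
```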

Precision

  • The ratio of correctly labeled positive instances to the total instances predicted as positive.

  • Calculation = True Positives / (True Positives + False Positives).

  • "Of all the instances predicted as positive, how many were actually positive?"

  • Real-world example: Maximizing precision for email spam detection, because we do not want a perfectly normal email to be incorrectly classified as spam, essentially minimizing false positives.
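
A similar sketch for precision on a single option, treating that option as the positive class (same hypothetical data as in the accuracy example, not Datasaur's code):

```python
reviewer = ["spam", "ham", "ham", "spam", "ham"]
labeler  = ["spam", "ham", "spam", "spam", "ham"]
positive = "spam"

# True positives: the labeler chose "spam" and the reviewer agrees.
tp = sum(l == positive and r == positive for r, l in zip(reviewer, labeler))
# False positives: the labeler chose "spam" but the reviewer did not.
fp = sum(l == positive and r != positive for r, l in zip(reviewer, labeler))

precision = tp / (tp + fp)
print(f"Precision for '{positive}': {precision:.2f}")  # 2 / (2 + 1) -> 0.67
```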

Recall

  • The ratio of correctly labeled positive instances to the total actual positive instances.

  • Calculation = True Positives / (True Positives + False Negatives).

  • "Of all the actual positive instances, how many were correctly identified?"

  • Real-world example: Maximizing recall for medical diagnostic tools, because the system cannot afford to label a cancerous case as non-cancerous, essentially minimizing false negatives.
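
The corresponding sketch for recall counts the positive instances the labeler missed (same hypothetical data as above):

```python
reviewer = ["spam", "ham", "ham", "spam", "ham"]
labeler  = ["spam", "ham", "spam", "spam", "ham"]
positive = "spam"

# True positives: the labeler chose "spam" and the reviewer agrees.
tp = sum(l == positive and r == positive for r, l in zip(reviewer, labeler))
# False negatives: the reviewer marked "spam" but the labeler chose something else.
fn = sum(r == positive and l != positive for r, l in zip(reviewer, labeler))

recall = tp / (tp + fn)
print(f"Recall for '{positive}': {recall:.2f}")  # 2 / (2 + 0) -> 1.00
```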

F1 Score

  • The harmonic mean of precision and recall.

  • Calculation = 2 * ((Precision * Recall) / (Precision + Recall)).

  • Like the other metrics, it ranges between 0 and 1. For this metric specifically, 1 indicates a perfect balance between precision and recall.
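
A short sketch combining the values from the precision and recall examples above (purely illustrative):

```python
precision, recall = 0.67, 1.00  # hypothetical values from the sketches above

# F1 is the harmonic mean of precision and recall.
f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 score: {f1:.2f}")  # -> 0.80
```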

Confusion Matrix

  • A tabular representation that provides a detailed breakdown of the labels. Each column corresponds to one option of a particular question.
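
For illustration, a small sketch that builds such a matrix from the same hypothetical reviewer and labeler answers, with one row and column per answer option (again, not Datasaur's internal implementation):

```python
from collections import Counter

reviewer = ["spam", "ham", "ham", "spam", "ham"]
labeler  = ["spam", "ham", "spam", "spam", "ham"]
options  = ["spam", "ham"]  # the options of a particular question

# Count (ground truth, labeled) pairs.
counts = Counter(zip(reviewer, labeler))

# One row per ground-truth option, one column per labeled option.
print("truth \\ labeled", *options, sep="\t")
for truth in options:
    print(truth, *(counts[(truth, labeled)] for labeled in options), sep="\t")
```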

Filter

There are three types of filters for metric calculation, with the default averaging all data:

  1. By documents

  2. By questions

  3. By labelers
