Data Programming

Assisted-labeling feature that helps you generate labels using rules

Datasaur's Data Programming feature automates data labeling, a task typically performed manually, by combining algorithms and heuristics. It is especially useful for large datasets, where it can improve labeling efficiency and accuracy and lets users concentrate on data analysis and model training.

The Data Programming extension employs numerous heuristics or rules, known as Labeling Functions. These functions individually might not be highly accurate, but collectively, they provide better predictions than random selection.
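To illustrate why several weak rules can be useful together, here is a minimal sketch that combines three toy labeling functions with a simple majority vote. The function names, the example labels, and the `ABSTAIN` marker are all hypothetical, and majority voting is only one possible way to combine votes; this is not a description of how Datasaur aggregates predictions internally.

```python
from collections import Counter

ABSTAIN = None  # hypothetical marker meaning "this function has no opinion"


def lf_contains_refund(text):
    # Weak rule: mentions of "refund" suggest a billing issue.
    return "billing" if "refund" in text.lower() else ABSTAIN


def lf_contains_invoice(text):
    # Weak rule: mentions of "invoice" suggest a billing issue.
    return "billing" if "invoice" in text.lower() else ABSTAIN


def lf_contains_crash(text):
    # Weak rule: mentions of "crash" suggest a bug report.
    return "bug" if "crash" in text.lower() else ABSTAIN


def majority_vote(text, labeling_functions):
    """Return the most common non-abstaining vote, or ABSTAIN if nobody votes."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]


LFS = [lf_contains_refund, lf_contains_invoice, lf_contains_crash]
print(majority_vote("Please refund the invoice, the app did crash once", LFS))
```

Each rule alone misfires easily, but two of the three agree on this sentence, so the combined vote is less sensitive to any single rule.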

Labeling Function

Labeling functions are designed to apply weak heuristics and rules for predicting labels on unlabeled data. These can be based on expert knowledge or other labeling models. The labeling functions should be coded in Python and can include:

  • Keyword searches with regular expressions. For instance, detecting severity scores in both numeric and Roman numeral formats.

  • Advanced preprocessing with libraries such as NLTK, spaCy, or TextBlob, which provide POS tagging, sentiment analysis, NER, dependency parsing, syntax tree creation, stop-word lists, similarity measures, etc.
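As a sketch of the first kind of rule, the snippet below looks for severity scores written either as digits or as Roman numerals. The pattern, the trigger words "grade" and "severity", and the 1 to 5 range are assumptions chosen for illustration, not part of any Datasaur template.

```python
import re

# Hypothetical pattern: "grade" or "severity" followed by 1-5 or I-V.
# "IV" is listed before "V" and "I{1,3}" so that it is matched as a whole.
SEVERITY = re.compile(
    r"\b(?:grade|severity)\s+(?:[1-5]|IV|V|I{1,3})\b",
    re.IGNORECASE,
)


def mentions_severity(text):
    """Row-style check: True if the text contains a severity score."""
    return SEVERITY.search(text) is not None


def find_severity_spans(text):
    """Span-style check: [start, end] character positions of each score."""
    return [[m.start(), m.end()] for m in SEVERITY.finditer(text)]


text = "Patient presented with grade III burns and severity 2 pain."
print(mentions_severity(text))
print([text[start:end] for start, end in find_severity_spans(text)])
```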

You can see a list of our Labeling Function examples here.

Supported Libraries

If you require any additional libraries, please reach out to us by contacting support@datasaur.ai.

Name             Version
pandas           1.4.4 and later
textblob         0.17.1 and later
nltk             3.7 and later
spacy            3.4.1 and later
scipy            1.9.1
numpy            1.23.3
transformers     4.28.1
requests         2.28.1 and later
datasets         2.7.0
openai           0.27.0
stanza           1.5.0 and later
spacy-fastlang   1.0.1 and later
lxml             4.9.2
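If you draft labeling functions on your own machine before pasting them into the editor, it can help to check that your local packages are at least as new as the versions listed above. The sketch below is one way to do that with the standard library; the helper names are hypothetical, only a few rows are copied in, and the version parsing is deliberately simple (leading numeric parts only).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from a few rows of the list above.
MINIMUMS = {"pandas": "1.4.4", "nltk": "3.7", "spacy": "3.4.1"}


def as_tuple(version_string):
    """Turn '1.4.4' into (1, 4, 4); stop at the first non-numeric part."""
    parts = []
    for piece in version_string.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is the minimum version or newer."""
    return as_tuple(installed) >= as_tuple(minimum)


def check_environment(minimums):
    """Report, per package, whether the local install satisfies the minimum."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = meets_minimum(installed, minimum)
        report[name] = "ok" if ok else f"too old ({installed})"
    return report


print(check_environment(MINIMUMS))
```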

Tutorial

How to use Data Programming in General

  1. Add the Data Programming extension from the Manage Extension menu.

  2. The Data Programming Extension will appear on your right. Let's break down what we have here:

    • Target Question/Label Set: Choose the questions or label set that you want to target for Data Programming usage.

    • Multiple Label Template: When turned on, this creates a labeling function template that can predict multiple labels. By default, the Labeling Function logic predicts only 1 label out of all defined labels; Datasaur also provides a multilabel template for cases where the logic needs to cover more than 1 label. Please find the template here.

    • Labeling Functions: This button will take you to the Data Programming pop-up, covering Labeling Function Settings, Labeling Function Analysis, and Inter-Annotator Agreement.

    • Predict Labels: After creating your Labeling Functions, you can start predicting the answers or labels using those functions.

  3. You can create Labeling Functions by clicking the "Labeling Functions" button. It will display the Labeling Function Settings, where you can add your Labeling Functions. It also provides you with a code template for a Labeling Function based on your label set. Note: Pay attention to the comment we've included there; you can start editing your logic where we write (Start editing here!). The code and lines before that comment are not supposed to be edited.

  4. Close the Labeling Functions editor and click Predict Labels in the Data Programming extension. An "Applying labeling function" loading bar will show up while labels are being predicted.

Labeling Function Template

Edit and adjust Labeling Functions

  1. First, click the + Add button to create a Labeling Function; the Labeling Function Editor will generate a template for you.

  2. To rename a Labeling Function, click the pencil icon next to it, type the new name, then click the ✔️ button to confirm or the X button to cancel.

  3. Labeling functions can be removed one at a time or several at once: select one or more labeling functions via the checkbox and click the 'Delete' button.

  4. A confirmation pop-up will ask you to confirm the deletion. After you click OK, the selected labeling functions will be deleted.

  5. Use the toggle in line with a labeling function to activate or deactivate it for prediction.

Build Labeling Functions in detail

  • By default, labeling_function assigns only 1 label, which is defined on this line:

    # Assign target label based on LABELS dictionary
    @apply_label(label=LABELS['labelA'])
  • By default, labeling_function processes text that contains all columns in one row.

    ## please check <link section 'text' in gitbook> for more info
    text = list(sample.values())[0]

    To process only one column, use:

    text = sample[<COLUMN_NAME>]

    To process specific columns, use:

    text = ' '.join([sample[<COLUMN_NAME_A>], sample[<COLUMN_NAME_B>]])
  • Labeling Function returns a boolean for row-based labeling and a match_list for span-based labeling.

    match_list is a list of matched token index pairs (format: [start_index, end_index]).

    For example:

    >>> text = "Russian and American Alien Spaceship Claims Raise Eyebrows"
    >>> target_token = ["Russian", "American"]
    >>> match_list = [[0,7],[12,20]]

    Here, match_list holds the position of each target_token within text.

    Special Notes:

    If you use regex in your logic, you can build match_list with re.finditer:

    # Each entry in TARGET_KEYWORDS can be a keyword or a regex pattern.
    >>> import re
    >>> match_list = [re.finditer(target, text) for target in TARGET_KEYWORDS]

    or

    >>> date = re.compile(r"(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])")
    >>> PATTERNS = [date]
    >>> match_list = [re.finditer(pattern, text) for pattern in PATTERNS]
  • Labeling function remover

    • Row Labeling

      def labeling_function(sample):
          return False
    • Span Labeling

      def labeling_function(sample):
          match_list = []
          return match_list
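The points above can be pulled together in a small sketch that runs outside the platform. It shows column selection, a row-based function returning a boolean, and a span-based function returning explicit [start, end] pairs. The `sample` row, its column names, and the helper `select_text` are hypothetical; the `@apply_label` decorator and the `LABELS` dictionary come from the generated template and are left out here so the snippet is self-contained. The span function spells out the index pairs purely to show the format; it does not settle whether the platform expects these pairs or the `re.finditer` iterators shown earlier.

```python
import re

# Hypothetical sample row: column name -> cell text.
sample = {
    "headline": "Russian and American Alien Spaceship Claims Raise Eyebrows",
    "source": "Example Wire",
}

TARGET_KEYWORDS = ["Russian", "American"]


def select_text(sample, columns=None):
    """Join the chosen columns into one string; default to the first column."""
    if columns is None:
        return list(sample.values())[0]
    return " ".join(sample[name] for name in columns)


def row_labeling_function(sample):
    """Row-based: True when any keyword occurs in the text."""
    text = select_text(sample)
    return any(re.search(re.escape(k), text) for k in TARGET_KEYWORDS)


def span_labeling_function(sample):
    """Span-based: one [start, end] pair per keyword occurrence."""
    text = select_text(sample)
    match_list = []
    for keyword in TARGET_KEYWORDS:
        for match in re.finditer(re.escape(keyword), text):
            match_list.append([match.start(), match.end()])
    return match_list


print(row_labeling_function(sample))   # True
print(span_labeling_function(sample))  # [[0, 7], [12, 20]]
```

The keywords are passed through `re.escape` so that plain keywords containing characters such as "." or "+" are matched literally; drop that call if the entries are meant to be regex patterns.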
