Datasaur

Data Programming

The Data Programming extension generates labels by aggregating many heuristics or rules (called labeling functions) that may not be accurate individually but, taken as a group, predict better than random.
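To make the aggregation idea concrete, here is a minimal sketch that combines the votes of several labeling functions by simple majority, ignoring abstentions. It is an illustration only: the `majority_vote` helper is hypothetical, and the extension itself may aggregate votes differently (for example, with a learned label model such as the one Snorkel provides).

```python
from collections import Counter

ABSTAIN = -1  # same convention as the labeling function template


def majority_vote(votes):
    """Return the most common non-abstain vote, or ABSTAIN on a tie or when nobody votes."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between the top two labels
    return ranked[0][0]


# Votes from three hypothetical labeling functions on a single row
print(majority_vote([2, ABSTAIN, 2]))  # two votes for label 2, one abstention
print(majority_vote([0, 1, ABSTAIN]))  # one vote each, so the result is a tie
```

A single noisy rule can be outvoted this way, which is why individually weak rules can still be useful as a group.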

Labeling Function

Labeling functions let you define weak heuristics and rules that predict a label for unlabeled data. These heuristics can be derived from expert knowledge or from other labeling models. Labeling functions must be written in Python. Common approaches include:
  • Multiple keyword searches (using regular expressions) within the text. For example, to find a severity score, we searched for the phrase in both numeric and roman numeral formats.
  • Complex preprocessors built on models from NLTK, spaCy, or TextBlob
    • POS tagging, sentiment analysis, general NER, dependency parsing, syntax tree construction, stopword lists, similarity, etc.
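As a sketch of the multi-pattern search described in the first bullet, the snippet below looks for a severity score written either as digits or as a roman numeral. The phrase "severity score", the 1-to-5 range, and the helper name `find_severity` are assumptions made for illustration.

```python
import re

ROMAN_TO_INT = {"i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5}

# One pattern per notation; both expect the (assumed) phrase "severity score"
NUMERIC = re.compile(r"severity score\s*(?:of|is|:)?\s*([1-5])\b", re.IGNORECASE)
ROMAN = re.compile(r"severity score\s*(?:of|is|:)?\s*(iii|ii|iv|i|v)\b", re.IGNORECASE)


def find_severity(text):
    """Return the severity as an int, or None when neither pattern matches."""
    match = NUMERIC.search(text)
    if match:
        return int(match.group(1))
    match = ROMAN.search(text)
    if match:
        return ROMAN_TO_INT[match.group(1).lower()]
    return None


print(find_severity("Severity score: 3"))
print(find_severity("severity score IV"))
```

Inside a labeling function, a `None` result would typically translate to returning ABSTAIN.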

Supported Libraries

pydantic: ^1.9.2
snorkel: 0.9.9
pandas: ^1.4.4
textblob: ^0.17.1
nltk: ^3.7
spaCy: ^3.4.1
numpy: 1.22.4
scipy: 1.9.1

Keyword-based labeling function

import re
from snorkel.labeling import labeling_function

# Return ABSTAIN (-1) from a @labeling_function to abstain from voting
ABSTAIN = -1

# This is your constant labels dictionary, SHOULD NOT BE EDITED
LABELS = {
    'business': 0,
    'science': 1,
    'sports': 2,
    'world': 3
}

# This function should follow this template and correctly implement the labeling_function interface.
# Variable x contains the column contents, the column names, and a helper dictionary. Its attributes are listed below.
"""
x.line: int
x.columns: List[str]
x.column_names: List[str]
x.column_name_to_index: dict # key -> column_name
"""

KEYWORDS = {
    'business': ['workers'],
    'science': ['space', 'chemistry', 'researcher', 'education', 'school', 'virus'],
    'sports': ['medal', 'record', 'bicep', 'football', 'physic', 'game'],
    'world': ['confrontation', 'violent', 'harassed', 'fight', 'vehicle', 'government', 'employment', 'military', 'war']
}

@labeling_function()
def labeling_function(x):
    # Implement your logic here
    text = ''.join(x.columns)
    for label, keyword_list in KEYWORDS.items():
        for key in keyword_list:
            if re.search(key, text, re.IGNORECASE):
                return LABELS[label]
    return ABSTAIN
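One caveat of plain substring matching is that a short keyword such as 'war' also matches inside longer words like 'award' or 'software'. A possible refinement, sketched below, wraps each keyword in word boundaries. This is a variant, not the template's behavior: the keyword list is shortened, the `make_row` helper is hypothetical and only mimics the `x.columns` attribute so the logic can be tried outside Datasaur, and the decorator is omitted so the snippet runs without snorkel.

```python
import re
from types import SimpleNamespace

ABSTAIN = -1
LABELS = {'business': 0, 'science': 1, 'sports': 2, 'world': 3}
KEYWORDS = {'sports': ['medal', 'football'], 'world': ['war', 'military']}  # shortened list


def keyword_label(x):
    # Join columns with a space so words from adjacent columns do not merge
    text = ' '.join(x.columns)
    for label, keyword_list in KEYWORDS.items():
        for key in keyword_list:
            # \b restricts the match to whole words; re.escape guards special characters
            if re.search(r'\b' + re.escape(key) + r'\b', text, re.IGNORECASE):
                return LABELS[label]
    return ABSTAIN


def make_row(*columns):
    """Hypothetical stand-in for the row object Datasaur passes in."""
    return SimpleNamespace(columns=list(columns))


print(keyword_label(make_row('She won an award', 'for software design')))  # abstains
print(keyword_label(make_row('The war ended', 'after two years')))         # 'world' label
```

The trade-off is that whole-word matching no longer catches inflected forms, so a stem like 'physic' would stop matching 'physical' and would need to be listed explicitly.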

Regex rule labeling function

@labeling_function()
def labeling_function(x):
    # Implement your logic here
    text = ''.join(x.columns)
    # Match score-like patterns such as "3-1" or "21-17"
    score = re.compile(r"\b(0|[1-9]\d*)-(0|[1-9]\d*)\b")
    if score.search(text) is None:
        return ABSTAIN
    return LABELS['sports']
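To see what the score pattern accepts, the sketch below runs it against a few sample strings. The sample sentences are made up. Note that the pattern also fires on other hyphenated number pairs, such as year ranges, so it may be worth combining it with a keyword check.

```python
import re

score = re.compile(r"\b(0|[1-9]\d*)-(0|[1-9]\d*)\b")

samples = [
    "United won 3-1 at home",      # a match result
    "Budget talks resume Monday",  # no numbers at all
    "The 2019-2020 season",        # a year range, not a score
]
for text in samples:
    match = score.search(text)
    # Show the matched substring, or None when the pattern does not match
    print(text, "->", match.group(0) if match else None)
```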

Labeling Function with complex preprocessor (spaCy)

from snorkel.labeling import labeling_function
from snorkel.preprocess.nlp import SpacyPreprocessor

# Label values assumed for this example (it follows Snorkel's spam tutorial, where HAM = 0)
ABSTAIN = -1
HAM = 0

# The SpacyPreprocessor parses the text in text_field and
# stores the new enriched representation in doc_field
spacy = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

@labeling_function(pre=[spacy])
def has_person(x):
    """Ham comments mention specific people and are short."""
    if len(x.doc) < 20 and any(ent.label_ == "PERSON" for ent in x.doc.ents):
        return HAM
    else:
        return ABSTAIN

Labeling Function with complex preprocessor (TextBlob)

import re
from snorkel.labeling import labeling_function

# Return ABSTAIN (-1) from a @labeling_function to abstain from voting
ABSTAIN = -1

# This is your constant labels dictionary, SHOULD NOT BE EDITED
LABELS = {
    'negative': 0,
    'positive': 1
}

# This function should follow this template and correctly implement the labeling_function interface.
# Variable x contains the column contents, the column names, and a helper dictionary. Its attributes are listed below.
"""
x.line: int
x.columns: List[str]
x.column_names: List[str]
x.column_name_to_index: dict # key -> column_name
"""

from snorkel.preprocess import preprocessor
from textblob import TextBlob

@preprocessor(memoize=True)
def textblob_sentiment(x):
    # Compute the polarity once per row and attach it to x
    text = ''.join(x.columns)
    scores = TextBlob(text)
    x.polarity = scores.sentiment.polarity
    return x

@labeling_function(pre=[textblob_sentiment])
def labeling_function(x):
    # Implement your logic here
    if x.polarity > 0.16:
        return LABELS['positive']
    return LABELS['negative']
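The function above always votes: every text with a polarity at or below 0.16 is labeled negative, including neutral ones. If that is too aggressive, one option is to abstain inside a neutral band. The sketch below shows only the thresholding logic, taking a polarity value in TextBlob's range of -1.0 to 1.0. The function name and the lower threshold of -0.16 are assumptions to tune for your data.

```python
ABSTAIN = -1
LABELS = {'negative': 0, 'positive': 1}


def polarity_to_label(polarity, upper=0.16, lower=-0.16):
    """Map a polarity score in [-1.0, 1.0] to a label, abstaining when it is near neutral."""
    if polarity > upper:
        return LABELS['positive']
    if polarity < lower:
        return LABELS['negative']
    return ABSTAIN


# Inside the labeling function this would be called as polarity_to_label(x.polarity)
for value in (0.8, 0.05, -0.6):
    print(value, polarity_to_label(value))
```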

Tutorials

Use Data Programming in general

First, add the Data Programming service to My Extension.
Click the Labeling Functions button to edit labeling functions as described in this tutorial.
Create a labeling function in the labeling function editor, which includes the labeling function code template. Note that you can only edit the code after line 19; the preceding lines are part of the template and should not be edited.
Close the Labeling Functions editor and click Predict labels in the Data Programming extension. The "Applying labeling function" loading bar appears while labels are being predicted.

Edit and adjust labeling functions

First, click the + button to create a labeling function; the labeling function editor will appear.
Second, to rename a labeling function, click the pencil icon next to its name, type the new name, then click the ✔️ button to confirm or the x button to cancel.
Third, the editor view can be adjusted in two ways while editing a labeling function: maximize or minimize.
  • Clicking the maximize icon next to the X button expands the view to fit your screen.
  • Clicking the minimize icon next to the X button shrinks the view to a dialog at the bottom left of your screen.
Fourth, labeling functions can be removed one at a time or several at once: select one or more labeling functions using their checkboxes, then click the ‘Delete’ button.
A confirmation pop-up will ask you to confirm the deletion. After you click OK, the selected labeling functions will be deleted.
Fifth, use the toggle in line with a labeling function to activate or deactivate it for prediction.

Build labeling functions in detail

Dataset column filter
The line text = ''.join(x.columns) in the template means that all columns will be processed in the labeling_function() logic.
If you only need one column as the labeling function input/context, line 24 can be changed to:
text = x.columns[x.column_name_to_index[COLUMN_NAME]]
For example, if the dataset has the columns ‘Title’ and ‘Description’ and you only want to process the ‘Description’ column in the labeling_function() logic, line 24 can be changed to:
text = x.columns[x.column_name_to_index['Description']]
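
To check the lookup outside the editor, the sketch below builds a stand-in for the row object with the attributes listed in the template. The `make_row` helper and the sample values are hypothetical; only `columns`, `column_names`, and `column_name_to_index` come from the template.

```python
from types import SimpleNamespace


def make_row(values_by_column):
    """Hypothetical stand-in exposing the same attributes as the template's x."""
    names = list(values_by_column)
    return SimpleNamespace(
        columns=[values_by_column[name] for name in names],
        column_names=names,
        column_name_to_index={name: i for i, name in enumerate(names)},
    )


x = make_row({'Title': 'Cup final', 'Description': 'The home team won 3-1'})

all_text = ''.join(x.columns)  # default: every column, joined without a separator
description_only = x.columns[x.column_name_to_index['Description']]
print(description_only)
```
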
Remove labeling function prediction
To remove previous labeling function predictions, click the Predict labels button with a labeling function that contains no logic and only returns ABSTAIN.
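
A labeling function of that kind could look like the sketch below. The decorator from the template is left out so that the snippet runs without snorkel; in the editor, keep the template lines as they are.

```python
ABSTAIN = -1


# In the Datasaur editor, the template's @labeling_function() decorator
# stays above this definition; it is omitted here to avoid the snorkel import.
def labeling_function(x):
    # No logic: abstain on every row
    return ABSTAIN


print(labeling_function(None))  # any input yields ABSTAIN
```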

Tutorials on building common labeling functions

Keyword-based labeling function

State the keywords dictionary.
Implement the logic to match keywords against the text, and return the matching label or ABSTAIN.

Regex rule labeling function

Define the regex pattern.
Search for the regex pattern in the text and return the label or ABSTAIN.

NER using spaCy

Import spacy and load the model.
Define ner_mapping to map NER predictions to target labels.
Predict entities using the loaded spaCy model.
If the spaCy prediction has an ents (entity) attribute, map the predicted entity label to the target label; otherwise return ABSTAIN.
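
The steps above can be sketched as follows. The `ner_mapping` entries, the label names, and the `ner_label` and `stub_nlp` helpers are hypothetical. The function takes the loaded pipeline as a parameter so the mapping logic can be exercised with a stub instead of a real model.

```python
from types import SimpleNamespace

ABSTAIN = -1
LABELS = {'business': 0, 'science': 1, 'sports': 2, 'world': 3}

# Hypothetical mapping from spaCy entity labels to target labels
ner_mapping = {'ORG': 'business', 'GPE': 'world', 'NORP': 'world'}


def ner_label(text, nlp):
    """Return the label for the first entity whose type is in ner_mapping, else ABSTAIN."""
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in ner_mapping:
            return LABELS[ner_mapping[ent.label_]]
    return ABSTAIN


def stub_nlp(text):
    """Stand-in for a spaCy pipeline: tags 'France' as GPE and nothing else."""
    ents = [SimpleNamespace(label_='GPE')] if 'France' in text else []
    return SimpleNamespace(ents=ents)


print(ner_label('Talks continue in France', stub_nlp))
print(ner_label('Nothing to tag here', stub_nlp))
```

With spaCy installed, `nlp` would instead come from `spacy.load(...)` with a downloaded model such as `en_core_web_sm`, and the labeling function would pass the joined column text to `ner_label`.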

Textblob polarity

Import TextBlob.
Predict a polarity score for each text and define a threshold to decide whether the text is positive or negative.