Datasaur

Data Programming

An assisted-labeling feature that helps you generate labels using rules.
The Data Programming extension generates labels by aggregating many heuristics or rules (called labeling functions). Each rule may be inaccurate on its own, but taken as a group they predict better than random.
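To make the idea of aggregation concrete, here is a minimal sketch of a plain majority vote over the outputs of several labeling functions. It is an illustration only: this page does not describe how Datasaur combines votes, and Snorkel also ships learned aggregators. The function name majority_vote and the sample votes are hypothetical.

```python
from collections import Counter

ABSTAIN = -1  # conventional "no vote" value, as in the Snorkel examples

def majority_vote(votes):
    """Return the most common non-abstain label, or ABSTAIN on no votes or a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between the top two labels
    return ranked[0][0]

# Votes from three hypothetical labeling functions on a single row
print(majority_vote([1, ABSTAIN, 1]))
print(majority_vote([0, 1, ABSTAIN]))
```

A single noisy rule can be outvoted by the others, which is why many weak rules together can be more reliable than any one of them.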

Labeling Function

Labeling functions let you define weak heuristics and rules that predict a label for unlabeled data. These heuristics can be derived from expert knowledge or from other labeling models. Labeling functions must be written in Python. Common approaches include:
  • Multiple keyword searches (using regular expressions) within the text. For example, to find a severity score, we searched for the phrase in both numeric and Roman numeral format.
  • Complex preprocessor models from NLTK, spaCy, or TextBlob
    • POS tagging, sentiment, general NER, dependency parsing, syntax trees, stop-word lists, similarity, etc.
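The severity-score case mentioned above could look roughly like the sketch below. The exact phrase, the 1 to 5 range, and the helper name find_severity are assumptions made for illustration; the original rule is not shown on this page.

```python
import re

# Hypothetical phrase "severity score" followed by 1-5 as a digit or Roman numeral.
# The trailing \b keeps "I" from matching only a prefix of "III".
SEVERITY = re.compile(
    r"\bseverity\s+score\s*:?\s*(?:of\s+)?(III|II|IV|I|V|[1-5])\b",
    re.IGNORECASE,
)
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5}

def find_severity(text):
    """Return the severity as an int, or None when no score is mentioned."""
    match = SEVERITY.search(text)
    if match is None:
        return None
    value = match.group(1).upper()
    return ROMAN.get(value) or int(value)

print(find_severity("Patient admitted with severity score IV."))
print(find_severity("Severity score: 3"))
```

Handling both formats in one pattern keeps the rule in a single labeling function instead of two that could vote against each other.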

Supported Libraries

  1. pydantic: ^1.9.2
  2. snorkel: 0.9.9
  3. pandas: ^1.4.4
  4. textblob: ^0.17.1
  5. nltk: ^3.7
  6. spaCy: ^3.4.1
  7. numpy: 1.22.4
  8. scipy: 1.9.1

Example of Building Label Functions

Example of using Snorkel

Regex rule labeling function
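import re
from snorkel.labeling import labeling_function

# ABSTAIN and LABELS are defined by the editor template
@labeling_function()
def labeling_function(x):
    # Implement your logic here
    text = ''.join(x.columns)
    # Match score-like patterns such as "6-2" or "21-19"
    score = re.compile(r"\b(0|[1-9]\d*)-(0|[1-9]\d*)\b")
    if score.search(text) is None:
        return ABSTAIN
    return LABELS['sports']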
Labeling Function with complex preprocessor (spaCy)
from snorkel.labeling import labeling_function
from snorkel.preprocess.nlp import SpacyPreprocessor

# The SpacyPreprocessor parses the text in text_field and
# stores the new enriched representation in doc_field
spacy = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

# HAM and ABSTAIN are label constants assumed to be defined elsewhere
@labeling_function(pre=[spacy])
def has_person(x):
    """Ham comments mention specific people and are short."""
    if len(x.doc) < 20 and any(ent.label_ == "PERSON" for ent in x.doc.ents):
        return HAM
    return ABSTAIN
Labeling Function with complex preprocessor (TextBlob)
import re
from snorkel.labeling import labeling_function

# This variable or integer -1 is used for @labeling_function to abstain from voting
ABSTAIN = -1

# This is your constant labels dictionary, SHOULD NOT BE EDITED
LABELS = {
    'negative' : 0,
    'positive' : 1
}

# This function should follow this template and correctly implement the labeling_function interface.
# Variable x contains the content of columns, column_names, and a helper dictionary. Below are the variables.
"""
x.line: int
x.columns: List[str]
x.column_names: List[str]
x.column_name_to_index: dict # key -> column_name
"""

from snorkel.preprocess import preprocessor
from textblob import TextBlob

@preprocessor(memoize=True)
def textblob_sentiment(x):
    # Attach the TextBlob polarity score (-1.0 to 1.0) to the data point
    text = ''.join(x.columns)
    scores = TextBlob(text)
    x.polarity = scores.sentiment.polarity
    return x

@labeling_function(pre=[textblob_sentiment])
def labeling_function(x):
    # Implement your logic here
    if x.polarity > 0.16:
        return LABELS['positive']
    return LABELS['negative']

Example of using Datasaur Stegosaurus

Row Based

Keyword Search "world"
import re
from stegosaurus.annotator import target_label

# This is your constant labels dictionary, SHOULD NOT BE EDITED
LABELS = {
    'business' : 0,
    'science' : 1,
    'sports' : 2,
    'world' : 3
}

# Labeling function definition (Start editing here!)
# Assign target label based on LABELS dictionary
@target_label(label=LABELS['world'])
def label_function(sample):
    text = list(sample.values())[0]

    # Implement your logic here
    TARGET_KEYWORDS = ['confrontation', 'violent', 'harassed', 'fight', 'vehicle', 'government', 'employment', 'military', 'war']
    # Return True as soon as any keyword is found in the text
    for keyword in TARGET_KEYWORDS:
        keyword = keyword.replace("\\", '')
        if re.search(keyword, text, re.IGNORECASE):
            return True
    return False
Regex rule labeling function "sports"
import re
from stegosaurus.annotator import target_label

# This is your constant labels dictionary, SHOULD NOT BE EDITED
LABELS = {
    'business' : 0,
    'science' : 1,
    'sports' : 2,
    'world' : 3
}

# Labeling function definition (Start editing here!)
# Assign target label based on LABELS dictionary
@target_label(label=LABELS['sports'])
def label_function(sample):
    text = list(sample.values())[0]

    # Implement your logic here
    # Match score-like patterns such as "6-2"
    score = re.compile(r"\b(0|[1-9]\d*)-(0|[1-9]\d*)\b")
    PATTERNS = [score]
    for pattern in PATTERNS:
        if re.search(pattern, text):
            return True
    return False
Labeling Function with complex preprocessor (spaCy) "world"
import re
from stegosaurus.annotator import target_label

# This is your constant labels dictionary, SHOULD NOT BE EDITED
LABELS = {
    'business' : 0,
    'science' : 1,
    'sports' : 2,
    'world' : 3
}

# Labeling function definition (Start editing here!)
import spacy
nlp = spacy.load("en_core_web_sm")
NER_LABELS = ["NORP"]

# Assign target label based on LABELS dictionary
@target_label(label=LABELS['world'])
def label_function(sample):
    text = list(sample.values())[0]

    # Implement your logic here
    spacy_pred = nlp(text)
    # Collect the text of every entity whose NER label is in NER_LABELS
    TARGET_KEYWORDS = []
    for token in spacy_pred.ents:
        token_label = token.label_
        if token_label in NER_LABELS:
            TARGET_KEYWORDS.append(str(token))

    for keyword in TARGET_KEYWORDS:
        # re.escape keeps entity text from being read as a regex pattern
        if re.search(re.escape(keyword), text, re.IGNORECASE):
            return True
    return False
Labeling Function with complex preprocessor (TextBlob) "positive"
import re
from stegosaurus.annotator import target_label

# This is your constant labels dictionary, SHOULD NOT BE EDITED
LABELS = {
    'positive' : 0,
    'negative' : 1,
}

# Labeling function definition (Start editing here!)
from textblob import TextBlob

# Assign target label based on LABELS dictionary
@target_label(label=LABELS['positive'])
def label_function(sample):
    text = list(sample.values())[0]

    # Implement your logic here
    scores = TextBlob(text)
    polarity = scores.sentiment.polarity
    # Any polarity above zero counts as positive
    if polarity > 0:
        return True
    return False

Token Based

Keyword Search "world"
import re
from stegosaurus.annotator import target_label

# This is your constant labels dictionary, SHOULD NOT BE EDITED
LABELS = {
    'business' : 0,
    'science' : 1,
    'sports' : 2,
    'world' : 3
}

# Labeling function definition (Start editing here!)
# Assign target label based on LABELS dictionary
@target_label(label=LABELS['world'])
def label_function(sample):
    text = sample['text'] #list(sample.values())[0]

    # Implement your logic here
    TARGET_KEYWORDS = ['confrontation', 'violent', 'harassed', 'fight', 'vehicle', 'government', 'employment', 'military', 'war']
    # One iterator of matches per keyword
    match_list = [re.finditer(target, text) for target in TARGET_KEYWORDS]
    return match_list
Regex rule labeling function "sports"
import re
from stegosaurus.annotator import target_label

# This is your constant labels dictionary, SHOULD NOT BE EDITED
LABELS = {
    'business' : 0,
    'science' : 1,
    'sports' : 2,
    'world' : 3
}

# Labeling function definition (Start editing here!)
# Assign target label based on LABELS dictionary
@target_label(label=LABELS['sports'])
def label_function(sample):
    text = sample['text']

    # Implement your logic here
    score = re.compile(r"\b(0|[1-9]\d*)-(0|[1-9]\d*)\b")
    PATTERNS = [score]
    # One iterator of matches per pattern
    match_list = [re.finditer(pattern, text) for pattern in PATTERNS]
    return match_list
Labeling Function with complex preprocessor (spaCy) "world"
import re
from stegosaurus.annotator import target_label

# This is your constant labels dictionary, SHOULD NOT BE EDITED
LABELS = {
    'business' : 0,
    'science' : 1,
    'sports' : 2,
    'world' : 3
}

# Labeling function definition (Start editing here!)
# Assign target label based on LABELS dictionary
import spacy
nlp = spacy.load("en_core_web_sm")
NER_LABELS = ["NORP"]

@target_label(label=LABELS['world'])
def label_function(sample):
    text = sample['text']

    # Implement your logic here
    spacy_pred = nlp(text)
    # Collect the text of every entity whose NER label is in NER_LABELS
    TARGET_KEYWORDS = []
    for token in spacy_pred.ents:
        token_label = token.label_
        if token_label in NER_LABELS:
            TARGET_KEYWORDS.append(str(token))

    # re.escape keeps entity text from being read as a regex pattern
    match_list = [re.finditer(re.escape(target), text) for target in TARGET_KEYWORDS]
    return match_list
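The token-based examples above return a list of re.finditer iterators rather than True or False. The sketch below shows what such a list contains when it is unpacked. How Stegosaurus itself consumes the iterators is not described on this page, so treat the flattening step as an illustration; the sample sentence is made up.

```python
import re

text = "The military said the war ended after a fight near the border."
TARGET_KEYWORDS = ['fight', 'military', 'war']

# Same shape as the token-based examples: one finditer iterator per keyword
match_list = [re.finditer(target, text) for target in TARGET_KEYWORDS]

# Flatten into (start, end, matched_text) tuples to see which spans were found
spans = sorted((m.start(), m.end(), m.group()) for it in match_list for m in it)
for start, end, word in spans:
    print(start, end, word)
```

Each match carries character offsets, which is the information a token-level label needs in order to be attached to a specific span of the text.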

Example of using Datasaur Stegosaurus and OpenAI

Row Based

Labeling Function with OpenAI for Text Classification
import re
from stegosaurus.annotator import target_label

# This is your constant labels dictionary, SHOULD NOT BE EDITED
LABELS = {
    'business' : 0,
    'science' : 1,
    'sports' : 2,
    'world' : 3
}

# Labeling function definition (Start editing here!)
import openai
# Replace YOUR_API_KEY with your own OpenAI API key
openai.api_key = YOUR_API_KEY

# Assign target label based on LABELS dictionary
@target_label(label=LABELS['sports'])
def label_function(sample):
    text = list(sample.values())[0]

    # Implement your logic here
    # Note: openai.ChatCompletion is the legacy interface of the openai package (versions before 1.0)
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an Expert Data Labeler in classifying text as sports by detecting match score"},
            # Two worked examples (few-shot) that show the expected "[scores];yes|no" reply format
            {
                "role": "user",
                "content": """Sentence: Treat Huey, the two-time All-Met Player of the year, suffered a 6-2, 6-0 loss to 29-year-old journeyman Mashiska Washington, the younger brother of former pro and 1996 Wimbledon finalist Malivai.
What is the match score from sentence above? Is the text above classified as sports?"""},
            {"role": "assistant", "content": "[6-2,6-0];yes"},
            {
                "role": "user",
                "content": """Sentence: Pay for the Washington area's top executives rose significantly last year, reversing the downward trend that set in with the recession in 2001.
What is the match score from sentence above? Is the text above classified as sports?"""},
            {"role": "assistant", "content": '[];no'},
            # The actual row to classify
            {
                "role": "user",
                "content": """Sentence: {}
What is the match score from sentence above? Is the text above classified as sports?""".format(text)}
        ]
    )

    match_text = completion['choices'][0]['message']['content']
    detected_score, is_sport = match_text.split(';')
    if detected_score != '[]' and is_sport == 'yes':
        return True
    return False
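The example above assumes the model always answers in the exact "[scores];yes" format and would raise an error on any other reply. A more defensive parser might look like the sketch below; parse_reply is a hypothetical helper, not part of Stegosaurus or the OpenAI library.

```python
def parse_reply(reply):
    """Parse a '[scores];yes|no' reply; return (scores, is_sport) or None if malformed."""
    parts = reply.strip().split(';')
    if len(parts) != 2:
        return None
    raw_scores, verdict = parts[0].strip(), parts[1].strip().lower()
    if not (raw_scores.startswith('[') and raw_scores.endswith(']')):
        return None
    # Drop the brackets, then split the comma-separated scores
    scores = [s.strip() for s in raw_scores[1:-1].split(',') if s.strip()]
    return scores, verdict == 'yes'

print(parse_reply("[6-2,6-0];yes"))
print(parse_reply("Sure! This looks like sports."))
```

Inside a labeling function, a None result could simply be treated as "no match" so that an unexpected reply does not abort the whole prediction run.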

Token Based

Labeling Function with OpenAI for Token
import re
from stegosaurus.annotator import target_label

# This is your constant labels dictionary, SHOULD NOT BE EDITED
LABELS = {
    'art' : 0,
    'eve' : 1,
    'geo' : 2,
    'gpe' : 3,
    'nat' : 4,
    'org' : 5,
    'per' : 6,
    'tim' : 7
}

# Labeling function definition (Start editing here!)
import openai
# Replace YOUR_API_KEY with your own OpenAI API key
openai.api_key = YOUR_API_KEY

# Assign target label based on LABELS dictionary
@target_label(label=LABELS['eve'])
def label_function(sample):
    text = list(sample.values())[0]

    # Implement your logic here
    # Note: openai.ChatCompletion is the legacy interface of the openai package (versions before 1.0)
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an Expert Data Labeler in classifying sports event"},
            # Two worked examples (few-shot) that show the expected list-style reply
            {
                "role": "user",
                "content": """Sentence: The former Soviet republic was playing in an Asian Cup finals tie for the first time .
What is the sports event from the sentence above?"""},
            {"role": "assistant", "content": str(["Asian Cup"])},
            {
                "role": "user",
                "content": """Sentence: 68 , Trevor Dodds ( Namibia ) 72 69 142 Don Robertson ( U.S. ) 73 69 , Dion Fourie 69 73 ,
What is the sports event from the sentence above?"""},
            {"role": "assistant", "content": '[]'},
            # The actual row to label
            {
                "role": "user",
                "content": """Sentence: {}
What is the sports event from the sentence above?""".format(text)}
        ]
    )

    match_text = completion['choices'][0]['message']['content']
    match_text = match_text.replace('[','').replace(']','').replace("'",'')
    # Split on commas and drop surrounding whitespace and empty entries
    TARGET_KEYWORDS = [kw.strip() for kw in match_text.split(',') if kw.strip()]
    match_list = [re.finditer(re.escape(target), text, re.IGNORECASE) for target in TARGET_KEYWORDS]
    return match_list

Tutorials

Use Data Programming in general

  1. Add the Data Programming service to My Extension.
  2. Click the Labeling Functions button to edit labeling functions using this tutorial.
  3. Create a labeling function using the labeling function editor, which includes the labeling function code template. Note that you can only edit the code after line 19; the code before that is the template and should not be edited.
  4. Close the Labeling Functions editor and click Predict labels in the Data Programming extension. The Applying labeling function loading bar appears while Data Programming predicts the labels.

Edit and adjust labeling functions

  1. Click the + button to create a labeling function; the labeling function editor will open.
  2. To rename a labeling function, click the pencil icon next to its name, type the new name, then click the ✔️ button to confirm or the x button to cancel.
  3. The editor view can be adjusted in two ways while editing a labeling function: maximize or minimize.
    • Click the maximize icon, aligned with the X button, to maximize the view to your screen size.
    • Click the minimize icon, aligned with the X button, to shrink the view to a dialog at the bottom left of your screen.
  4. Labeling functions can be removed one at a time or several at once. Select one or more labeling functions via the check boxes and click the ‘Delete’ button. A confirmation pop-up will ask you to confirm the deletion. After you click OK, the selected labeling functions will be deleted.
  5. Use the toggle in line with a labeling function to activate or deactivate it for prediction.

Build labeling functions in detail

Dataset column filter: the template line that joins the columns (text = ''.join(x.columns) in the examples above) means that all columns will be processed in the labeling_function() logic.
If you only need one column as the labeling function input/context, line 24 can be changed to:
text = x.columns[x.column_name_to_index[COLUMN_NAME]]
For example, if the dataset has the columns ‘Title’ and ‘Description’ and you only want to process the ‘Description’ column in the labeling_function() logic, line 24 can be changed to:
text = x.columns[x.column_name_to_index['Description']]
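The difference between the two forms can be tried outside the editor with a stand-in for x. The object below is a hypothetical mock that only mimics the documented fields (line, columns, column_names, column_name_to_index); the real object is supplied by the platform.

```python
from types import SimpleNamespace

# Hypothetical row mimicking the documented fields of x
x = SimpleNamespace(
    line=0,
    columns=["Stocks rally", "Markets closed higher after the earnings report."],
    column_names=["Title", "Description"],
    column_name_to_index={"Title": 0, "Description": 1},
)

# Default template behavior: every column is concatenated
all_text = ''.join(x.columns)

# Column filter: only the 'Description' column is used
description = x.columns[x.column_name_to_index['Description']]

print(all_text)
print(description)
```

Restricting the input to one column avoids matches caused by text in columns that are irrelevant to the label.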

Remove labeling function prediction

To remove a previous labeling function prediction, run Predict labels with a labeling function that only returns ABSTAIN and contains no other logic.
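Such a function could be as short as the sketch below. In the Datasaur editor it would keep the @labeling_function() decorator from the template; the decorator is left out here only so that the snippet runs on its own.

```python
ABSTAIN = -1  # same abstain value as the template

# In the editor, keep the template's @labeling_function() decorator above this function
def labeling_function(x):
    # No logic: abstain on every row
    return ABSTAIN
```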

Tutorials on building common labeling functions

Keywords-based labeling function

  1. Define the keywords dictionary.
  2. Implement the logic that matches the keywords against the text. Return the matching label or ABSTAIN.
  3. Define the regex pattern.
  4. Search for the regex pattern in the text and return the label or ABSTAIN.
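The four steps above can be sketched as two small functions. The label subset, the keyword list, and the mock row are assumptions for illustration, and the Snorkel decorator is omitted so the snippet runs without the library installed.

```python
import re
from types import SimpleNamespace

ABSTAIN = -1
LABELS = {'business': 0, 'sports': 2}  # hypothetical subset of a label set

# Step 1: keywords dictionary, mapping each label name to its trigger words
KEYWORDS = {'business': ['earnings', 'shares', 'profit']}
# Step 3: regex pattern for match scores such as "6-2"
SCORE = re.compile(r"\b(0|[1-9]\d*)-(0|[1-9]\d*)\b")

def keyword_lf(x):
    # Step 2: return the label of the first keyword group that matches, else ABSTAIN
    text = ''.join(x.columns)
    for label, words in KEYWORDS.items():
        if any(re.search(r"\b" + re.escape(w) + r"\b", text, re.IGNORECASE) for w in words):
            return LABELS[label]
    return ABSTAIN

def regex_lf(x):
    # Step 4: search for the pattern and return the label or ABSTAIN
    text = ''.join(x.columns)
    return LABELS['sports'] if SCORE.search(text) else ABSTAIN

row = SimpleNamespace(columns=["Huey suffered a 6-2, 6-0 loss."])
print(keyword_lf(row), regex_lf(row))
```

Wrapping each keyword in word boundaries is a design choice: it stops short keywords from matching inside longer words.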

NER using spaCy

  1. Import spaCy and load the model.
  2. Define ner_mapping to map each NER prediction to a target label.
  3. Predict NER using the loaded spaCy model.
  4. If the spaCy prediction has entities (the ents attribute), map the predicted entity label to the target label; otherwise return ABSTAIN.
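A minimal sketch of these steps follows. The contents of ner_mapping and the label set are illustrative assumptions. The mapping logic is kept separate from the model call so it can be tried with a stand-in document; in real use, nlp would be a pipeline loaded with spacy.load("en_core_web_sm").

```python
from types import SimpleNamespace

ABSTAIN = -1
LABELS = {'world': 3, 'business': 0}  # hypothetical label set

# Step 2: map spaCy entity labels to target labels (this mapping is illustrative)
ner_mapping = {'NORP': LABELS['world'], 'GPE': LABELS['world'], 'MONEY': LABELS['business']}

def label_from_doc(doc):
    """Steps 3-4: return the target label of the first mapped entity, else ABSTAIN."""
    for ent in getattr(doc, 'ents', ()):
        if ent.label_ in ner_mapping:
            return ner_mapping[ent.label_]
    return ABSTAIN

def ner_lf(text, nlp):
    # Step 1 happens outside: nlp is an already loaded spaCy pipeline
    return label_from_doc(nlp(text))

# Stand-in for a parsed document so the sketch runs without a model download
fake_doc = SimpleNamespace(ents=[SimpleNamespace(label_='PERSON'), SimpleNamespace(label_='NORP')])
print(label_from_doc(fake_doc))
```

Loading the model once at module level, as the Stegosaurus examples do, avoids reloading it for every row.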

Textblob polarity

  1. Import TextBlob.
  2. Predict the polarity score for each text and define the threshold that decides whether the text is [positive] or [negative].
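The thresholding step can be sketched as below. The 0.16 cut-off is taken from the Snorkel TextBlob example earlier on this page and should be tuned for your data; the function names are hypothetical. The TextBlob import sits inside polarity_lf so that the threshold logic can be tried without the library installed.

```python
LABELS = {'negative': 0, 'positive': 1}
THRESHOLD = 0.16  # cut-off from the Snorkel TextBlob example; tune for your data

def label_from_polarity(polarity, threshold=THRESHOLD):
    """Map a polarity score in [-1.0, 1.0] to a label."""
    return LABELS['positive'] if polarity > threshold else LABELS['negative']

def polarity_lf(text):
    # Step 1: import TextBlob (done lazily here)
    from textblob import TextBlob
    # Step 2: predict the polarity and apply the threshold
    return label_from_polarity(TextBlob(text).sentiment.polarity)

print(label_from_polarity(0.5), label_from_polarity(-0.3))
```

A threshold above zero treats mildly positive or neutral text as negative, so a variant that abstains for scores near the threshold may be worth considering when the two classes are not exhaustive.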