Just as Microsoft Word and PowerPoint have templates, Datasaur has project templates that let you get started quickly with pre-built settings. Let's explore each one in turn.
Named Entity Recognition
Named Entity Recognition (NER) is also referred to as Named Entity Extraction. It describes the process of identifying and extracting specific entities in a text. These entities are classified into predefined categories that represent real-world objects, such as people, organizations, and locations.
Named entities are not always single tokens. In the example above, WED Enterprises is an entity that has two tokens. Multi-token labeling is common in NER labeling.
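As a minimal sketch of multi-token labeling, the snippet below turns a labeled token span into BIO tags, a common convention for NER data. The sentence, the ORG label, and the spans_to_bio helper are hypothetical illustrations and do not describe Datasaur's own file format.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end_exclusive, label) token spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the same entity
    return tags


# Hypothetical sentence; "WED Enterprises" is one entity made of two tokens.
tokens = ["She", "joined", "WED", "Enterprises", "yesterday"]
spans = [(2, 4, "ORG")]

bio_tags = spans_to_bio(tokens, spans)
print(list(zip(tokens, bio_tags)))
```

Both WED and Enterprises carry the same label, with B- marking where the entity begins and I- marking its continuation.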
Sample file: sherlockholmes-ner.tsv (836B), Sherlock Holmes
Part of Speech
Part of Speech (POS) tagging is the process of labeling each word in a text with its part of speech based on the context of the sentence. Once the role of each word is defined, the labels can be used to train a model to understand the structure and meaning of a sentence.
You are welcome to define your own parts of speech for labeling.
Note: one standard industry practice for English POS is to follow the parts of speech as defined by the Penn Treebank Project.
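To illustrate working with a fixed tag set, here is a small sketch that checks labeled tokens against an allowed list. The tags shown are a few real Penn Treebank tags (NNP, VBD, DT, NN), but the sentence, the ALLOWED_TAGS subset, and the invalid_tags helper are assumptions made for this example. If you define your own parts of speech, you would swap in your own set.

```python
# A small subset of Penn Treebank tags, chosen for this example only.
ALLOWED_TAGS = {"NNP", "VBD", "DT", "NN"}


def invalid_tags(tagged_tokens, allowed):
    """Return the (token, tag) pairs whose tag is not in the allowed set."""
    return [(token, tag) for token, tag in tagged_tokens if tag not in allowed]


# Hypothetical labeled sentence.
tagged = [("Sherlock", "NNP"), ("became", "VBD"), ("a", "DT"), ("detective", "NN")]

print(invalid_tags(tagged, ALLOWED_TAGS))
print(invalid_tags([("became", "VERB")], ALLOWED_TAGS))
```

The first call finds nothing to report, while the second flags VERB because it is not part of the chosen tag set.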
Sample file: sherlockholmes-pos.tsv (1KB), Sherlock Holmes
Coreference
Coreference resolution is the task of identifying all expressions that refer to the same entity in a text. This kind of task can be beneficial for many applications, including information extraction, text summarization, question answering, and machine translation.
Coreference resolution usually involves nouns, noun phrases, proper nouns, and pronouns. For example, "his" is a pronoun that refers to "Sherlock Holmes", which is a noun phrase. Coreference resolution helps eliminate ambiguity when interpreting a document. It often requires labeling phrases first, then drawing arrows from one to another.
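The "label phrases, then draw arrows" workflow can be sketched as a list of mentions plus links between them, which are then grouped into clusters that refer to the same entity. This is only an illustration: the mention list, the link format, and the coreference_clusters helper are hypothetical and are not Datasaur's data model.

```python
def coreference_clusters(num_mentions, links):
    """Group mention indices into clusters using a simple union-find."""
    parent = list(range(num_mentions))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for source, target in links:            # each link is one "arrow"
        parent[find(source)] = find(target)

    groups = {}
    for mention in range(num_mentions):
        groups.setdefault(find(mention), []).append(mention)
    return sorted(groups.values())


# Hypothetical labeled phrases; the arrow goes from "his" to "Sherlock Holmes".
mentions = ["Sherlock Holmes", "his", "Watson"]
links = [(1, 0)]

clusters = coreference_clusters(len(mentions), links)
print([[mentions[i] for i in cluster] for cluster in clusters])
```

Mentions connected by arrows end up in the same cluster, while an unlinked mention such as Watson stays on its own.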
Sample file: sherlockholmes (1).txt (525B), Sherlock Holmes
Dependency
Dependency parsing is the task of labeling relations between words. These relations consist of a head and a dependent. Please consider the example below. Sherlock is the subject of the verb became.
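A head/dependent relation can be written down as a small record. The sketch below encodes only the relation described here (Sherlock as the subject of became). The Arc class, the dependents_of helper, and the nsubj label, which is borrowed from the Universal Dependencies convention, are assumptions for illustration. Your project may use different relation names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    head: str        # the word that governs the relation
    dependent: str   # the word that depends on the head
    relation: str    # the label on the arrow


# "Sherlock" is the subject of the verb "became".
arcs = [Arc(head="became", dependent="Sherlock", relation="nsubj")]


def dependents_of(head, arcs):
    """List (dependent, relation) pairs attached to the given head."""
    return [(arc.dependent, arc.relation) for arc in arcs if arc.head == head]


print(dependents_of("became", arcs))
```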
Document Labeling
Document labeling is the task of classifying and categorizing data. This type of labeling differs from the types discussed above because the labeler answers questions about the text rather than labeling spans of tokens within it. It can be beneficial for projects such as sentiment analysis or applying metadata to a document.
In Datasaur, document labeling can be done on a per-row basis or on a per-document basis.
Sample file: bookreview2020 (1) (1) (1) (1).xlsx (5KB), Book Review
You can also label images, PDFs, and even GIFs in document labeling.
Sample file: imagesamplefiles.zip (1MB), Book Cover
Note: when you create projects via the DOC project template, the following settings apply by default.
Any uploaded questions will be set as required.
Answer sets can be edited. If the labeler types something in the text box that does not match an existing label, she can click "Add <your answer> as a new answer".
OCR
We currently provide this feature to help you label OCR results.
Note: when uploading pairs of OCR documents, please make sure each image file and its corresponding transcription have the same file name, for example covid-unicef.jpg and covid-unicef.txt.
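Before uploading, you may want to check the naming rule locally. The following is a minimal sketch, not a Datasaur tool: the unpaired_files helper is hypothetical, and the extension sets are assumptions (only .jpg and .txt appear in the note above).

```python
from pathlib import PurePath

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}   # assumption: adjust to your files
TEXT_EXTENSIONS = {".txt"}


def unpaired_files(filenames):
    """Return base names that have an image or a transcription, but not both."""
    images = {
        PurePath(name).stem
        for name in filenames
        if PurePath(name).suffix.lower() in IMAGE_EXTENSIONS
    }
    texts = {
        PurePath(name).stem
        for name in filenames
        if PurePath(name).suffix.lower() in TEXT_EXTENSIONS
    }
    return sorted(images ^ texts)   # symmetric difference: missing a partner


print(unpaired_files(["covid-unicef.jpg", "covid-unicef.txt"]))
print(unpaired_files(["covid-unicef.jpg", "covid-unicef.txt", "report.jpg"]))
```

An empty result means every image has a matching transcription. Any name that is returned is missing its counterpart.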
Create your own template!
After signing up successfully, every new user will see Datasaur's built-in project templates in their workspace. You can also create your own template!
The first step is to create a project with all of its settings configured. Then, click the triple-dots on the project and choose Save as template. You can rename the template as you like, and even upload an avatar.
If you are in a team workspace and need a custom script for project creation, this will surely save you time!