Project Templates
Exploring Datasaur's pre-built project templates
Just as Microsoft Word and PowerPoint have templates, Datasaur has project templates that let you get started quickly with pre-built settings. Let's explore each one in turn.

Named Entity Recognition

Named Entity Recognition (NER), also referred to as Named Entity Extraction, is the process of identifying and extracting specific entities in a text. These entities are classified into predefined categories that represent real-world objects, such as people's names, organizations, and locations.
Named entities are not always single tokens. In the example above, "WED Enterprises" is a single entity made up of two tokens. Multi-token entities are common in NER labeling.
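To make the multi-token point concrete, here is a minimal sketch that converts token spans into BIO tags, a widely used encoding for NER labels. It is not Datasaur's export format; the sentence, the `ORG` label, and the `spans_to_bio` helper are all assumptions made for illustration.

```python
from typing import List, Tuple

def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, label) token spans into BIO tags."""
    tags = ["O"] * len(tokens)  # "O" marks tokens outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the same entity
    return tags

# Hypothetical sentence: one entity covering two tokens.
tokens = ["WED", "Enterprises", "designed", "the", "park"]
spans = [(0, 2, "ORG")]
bio_tags = spans_to_bio(tokens, spans)
```

Here "WED" receives `B-ORG` and "Enterprises" receives `I-ORG`, so the two tokens are kept together as one entity.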
Sample file (Sherlock Holmes): sherlockholmes-ner.tsv, 836B

Part of Speech

Part of Speech (POS) tagging is the process of labeling each word in a text with its part of speech, based on the context of the sentence. Knowing the role of each word helps when training an algorithm to understand the structure and meaning of a sentence.
You are welcome to define your own parts of speech for labeling.
Note: one standard industry practice for English POS is to follow the parts of speech as defined by the Penn Treebank Project.
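As a rough illustration, POS labels can be thought of as (token, tag) pairs checked against whichever tag set you choose. The sketch below uses a small subset of Penn Treebank tags; the sentence and the `invalid_tags` helper are hypothetical and do not reflect Datasaur's internal format.

```python
from typing import List, Set, Tuple

# A small subset of Penn Treebank tags, chosen for this example only.
ALLOWED_TAGS: Set[str] = {"NNP", "VBD", "DT", "NN", "JJ", "IN"}

def invalid_tags(tagged: List[Tuple[str, str]], allowed: Set[str]) -> List[Tuple[str, str]]:
    """Return the (token, tag) pairs whose tag is not in the allowed tag set."""
    return [(token, tag) for token, tag in tagged if tag not in allowed]

# Hypothetical sentence: proper noun, past-tense verb, determiner, noun.
tagged = [("Sherlock", "NNP"), ("solved", "VBD"), ("the", "DT"), ("case", "NN")]
unknown = invalid_tags(tagged, ALLOWED_TAGS)
```

If you define your own parts of speech, only the contents of `ALLOWED_TAGS` would change.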
Sample file (Sherlock Holmes): sherlockholmes-pos.tsv, 1KB

Coreference

Coreference resolution is the task of identifying all expressions that refer to the same entity in a text. This kind of task can be beneficial for many applications, including information extraction, text summarization, question answering, and machine translation.
Coreference resolution usually involves nouns, noun phrases, proper nouns, and pronouns. In the example, "his" is a pronoun that refers to "Sherlock Holmes", which is a noun phrase. Coreference resolution helps eliminate ambiguity when interpreting a document. It often requires labeling phrases first, then drawing arrows from one to another.
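The "label phrases, then draw arrows" workflow can be sketched as mentions plus links, which are then grouped into clusters of expressions that refer to the same entity. This is an illustrative sketch only: the mentions, the link, and the `build_clusters` helper are assumptions, and a real dataset would identify mentions by span offsets rather than by their text.

```python
from typing import Dict, List, Tuple

def build_clusters(mentions: List[str], links: List[Tuple[str, str]]) -> List[List[str]]:
    """Group mentions into coreference clusters by following links (union-find)."""
    parent: Dict[str, str] = {m: m for m in mentions}

    def find(m: str) -> str:
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # shorten the chain as we walk up
            m = parent[m]
        return m

    for source, target in links:
        parent[find(source)] = find(target)  # merge the two groups

    clusters: Dict[str, List[str]] = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())

# Hypothetical mentions; the single arrow points from "his" to "Sherlock Holmes".
mentions = ["Sherlock Holmes", "his", "Watson"]
links = [("his", "Sherlock Holmes")]
clusters = build_clusters(mentions, links)
```

Mentions with no arrows, such as "Watson" here, end up in a cluster of their own.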
Sample file (Sherlock Holmes): sherlockholmes (1).txt, 525B

Dependency

Dependency parsing is the task of labeling relations between words. Each relation consists of a head and a dependent. In the example below, "Sherlock" is the subject of the verb "became": "became" is the head and "Sherlock" is its dependent.
Note: for Dependency label sets, common industry standards include Universal Dependencies and the Stanford typed dependencies manual.
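A head/dependent relation can be sketched as a labeled arc between two token positions. The example below is a minimal illustration, not Datasaur's file format: the sentence is hypothetical, the arc list is deliberately partial, and `nsubj` and `det` are relation names from Universal Dependencies.

```python
from typing import Dict, List, Tuple

# Hypothetical sentence for illustration.
tokens = ["Sherlock", "became", "a", "detective"]

# (head_index, dependent_index, relation); a partial list of arcs.
arcs: List[Tuple[int, int, str]] = [
    (1, 0, "nsubj"),  # "Sherlock" is the subject of "became"
    (3, 2, "det"),    # "a" is the determiner of "detective"
]

def head_of(arcs: List[Tuple[int, int, str]]) -> Dict[int, Tuple[int, str]]:
    """Map each dependent index to its (head index, relation)."""
    heads: Dict[int, Tuple[int, str]] = {}
    for head, dependent, relation in arcs:
        if dependent in heads:
            # In a dependency tree, every word has at most one head.
            raise ValueError(f"token {dependent} has more than one head")
        heads[dependent] = (head, relation)
    return heads

heads = head_of(arcs)
head_index, relation = heads[0]  # look up the head of "Sherlock"
```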
Sample file (Sherlock Holmes): sherlockholmes-dependency (1).tsv, 871B

Document Labeling

Document labeling is the task of classifying and categorizing data. This type of labeling is different from the types discussed above, because the labeler is answering questions about the text, rather than labeling spans of tokens within the text. It can be beneficial for projects such as sentiment analysis or applying metadata to a document.
In Datasaur, document labeling can be done on a per-row basis or on a per-document basis.
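The difference between the two modes can be sketched with plain data structures: per-row labeling stores one answer for each row, while per-document labeling stores one answer for the whole file. The rows, the "sentiment" question, and the answers below are hypothetical examples, not data from the sample file, and this is not Datasaur's storage format.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical rows, e.g. one book review per spreadsheet row.
rows: List[str] = ["A gripping mystery.", "The pacing dragged.", "Loved the ending."]

# Per-row labeling: one answer for each row.
row_labels: List[str] = ["positive", "negative", "positive"]

# Per-document labeling: one answer per question for the whole file.
document_label: Dict[str, str] = {"sentiment": "positive"}

def summarize_row_labels(labels: List[str]) -> Dict[str, int]:
    """Count how often each answer was given across rows."""
    return dict(Counter(labels))

summary = summarize_row_labels(row_labels)
```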
Sample file (Book Review): bookreview2020 (3).xlsx, 5KB
You can also label images, PDF files, and even GIFs in document labeling.
Sample file (Book Cover): imagesamplefiles (1).zip, 1MB
Note: when you create projects via the DOC project template, the following settings apply by default.
    Any uploaded questions will be set as required.
    Answer sets can be edited. If the labeler types something in the text box that does not match an existing label, they can click "Add <your answer> as a new answer".

Optical Character Recognition

Optical Character Recognition (OCR) is the task of converting text inside images or scanned documents into machine-readable text data. Common applications of OCR include processing invoices, receipts, and legal documents.
We currently provide this feature to help you label OCR results.
Note: when uploading pairs of OCR documents, please make sure each image file and its corresponding transcription have the same file name (apart from the extension). For example, covid-unicef.jpg and covid-unicef.txt.
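Before uploading, it may help to check that every image has a transcription with the same base name. The sketch below is a local pre-check written for this purpose and is not part of Datasaur; the `pair_ocr_files` helper is hypothetical, and the set of image extensions is an assumption (the note above mentions only .jpg and .txt).

```python
from pathlib import PurePath
from typing import Dict, List, Tuple

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}  # assumption: adjust to the formats you use
TEXT_EXT = ".txt"

def pair_ocr_files(names: List[str]) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Pair images with transcriptions that share a base name; list the leftovers."""
    images: Dict[str, str] = {}
    texts: Dict[str, str] = {}
    for name in names:
        path = PurePath(name)
        ext = path.suffix.lower()
        if ext in IMAGE_EXTS:
            images[path.stem] = name
        elif ext == TEXT_EXT:
            texts[path.stem] = name
    pairs = [(images[s], texts[s]) for s in sorted(images.keys() & texts.keys())]
    unmatched = sorted(
        [images[s] for s in images.keys() - texts.keys()]
        + [texts[s] for s in texts.keys() - images.keys()]
    )
    return pairs, unmatched

# "receipt.jpg" is a hypothetical file with no matching transcription.
pairs, unmatched = pair_ocr_files(["covid-unicef.jpg", "covid-unicef.txt", "receipt.jpg"])
```

Any file listed in `unmatched` would need a counterpart with the same base name before the pair can be uploaded together.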
Sample file (COVID-19 Unicef): ocr-samples.zip, 973KB