Project Templates

Exploring Datasaur's pre-built project templates

Just like Microsoft Word or PowerPoint have templates, Datasaur has project templates that allow you to quickly get started with pre-built settings. Let's explore each one in turn.

Named Entity Recognition

Named Entity Recognition (NER), also called Named Entity Extraction, is the process of identifying and extracting specific entities in a text. These entities are classified into predefined categories that represent real-world objects, such as person names, organizations, and locations.

Named entities are not always single tokens. In the example above, WED Enterprises is a single entity made up of two tokens. Multi-token entities are common in NER.
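To make the idea of a multi-token entity concrete, here is a minimal sketch in Python. The sentence, the EntitySpan class, and the ORG and LOC labels are all hypothetical and chosen for illustration; this is not Datasaur's export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    """A labeled run of tokens (hypothetical structure, for illustration only)."""
    start: int  # index of the first token in the span (inclusive)
    end: int    # index of the last token in the span (inclusive)
    label: str


def span_text(tokens, span):
    """Join the tokens covered by a span back into readable text."""
    return " ".join(tokens[span.start:span.end + 1])


# Hypothetical example sentence, already split into tokens.
tokens = ["She", "joined", "WED", "Enterprises", "in", "Glendale"]

spans = [
    EntitySpan(2, 3, "ORG"),  # two tokens, one entity
    EntitySpan(5, 5, "LOC"),  # single-token entity
]

labeled = [(span_text(tokens, s), s.label) for s in spans]
print(labeled)
```

Storing token indices rather than the text itself keeps a two-token entity such as WED Enterprises as one label instead of two.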

Part of Speech

Part of Speech (POS) tagging is the process of labeling each word in a text with its part of speech based on the context of the sentence. Knowing the role of each word helps train a model to understand the structure and meaning of a sentence.

You are welcome to define your own parts of speech for labeling.

Note: one standard industry practice for English POS is to follow the parts of speech as defined by the Penn Treebank Project.
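As a rough illustration, the sketch below tags a short hypothetical sentence with Penn Treebank tags and checks each tag against an allowed label set. PENN_SUBSET is only a small, hand-picked subset of the full tag set, and validate_tags is a helper invented for this example.

```python
# A small subset of Penn Treebank POS tags (the full set is larger).
PENN_SUBSET = {
    "DT",    # determiner
    "NN",    # noun, singular or mass
    "NNS",   # noun, plural
    "NNP",   # proper noun, singular
    "VB",    # verb, base form
    "VBD",   # verb, past tense
    "VBZ",   # verb, 3rd person singular present
    "JJ",    # adjective
    "IN",    # preposition or subordinating conjunction
    "PRP",   # personal pronoun
    "PRP$",  # possessive pronoun
    "RB",    # adverb
}


def validate_tags(tagged, allowed):
    """Return the (token, tag) pairs whose tag is not in the allowed set."""
    return [(token, tag) for token, tag in tagged if tag not in allowed]


# Hypothetical sentence with one tag per token.
tagged = [("Sherlock", "NNP"), ("became", "VBD"), ("a", "DT"), ("detective", "NN")]

invalid = validate_tags(tagged, PENN_SUBSET)
print(invalid)
```

If you define your own parts of speech, the same check works with your custom label set in place of PENN_SUBSET.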

Coreference

Coreference resolution is the task of identifying all expressions that refer to the same entity in a text. This kind of task can be beneficial for many applications, including information extraction, text summarization, question answering, and machine translation.

Coreference resolution usually involves nouns, noun phrases, proper nouns, and pronouns. We can see that "his" is a pronoun and refers to "Sherlock Holmes", which is a noun phrase. Coreference resolution helps eliminate ambiguity when interpreting a document. It often requires labeling phrases first, then drawing arrows from one to another.
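The two steps described above (label the phrases, then draw arrows between them) can be sketched as data: mentions are token spans and arrows are links between mention ids. The sentence, the ids m1 and m2, and the clusters helper are assumptions made for this example, not a Datasaur format.

```python
# Hypothetical sentence, already split into tokens.
tokens = ["Sherlock", "Holmes", "picked", "up", "his", "hat"]

# Step 1: label the phrases. Each mention is a (start, end) token span, inclusive.
mentions = {"m1": (0, 1), "m2": (4, 4)}

# Step 2: draw arrows. Each link goes from a mention to the one it refers to.
links = [("m2", "m1")]  # "his" -> "Sherlock Holmes"


def clusters(mentions, links):
    """Group mentions connected by links into coreference clusters."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers until reaching the cluster representative.
        while parent[m] != m:
            m = parent[m]
        return m

    for source, target in links:
        parent[find(source)] = find(target)

    groups = {}
    for m in mentions:
        groups.setdefault(find(m), []).append(m)
    return list(groups.values())


def mention_text(tokens, span):
    """Return the text covered by a (start, end) span."""
    start, end = span
    return " ".join(tokens[start:end + 1])


result = [
    [mention_text(tokens, mentions[m]) for m in group]
    for group in clusters(mentions, links)
]
print(result)
```

Every mention in a cluster refers to the same entity, so a longer chain of arrows still collapses into a single group.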

Dependency

Dependency parsing is the task of labeling relations between words. Each relation consists of a head and a dependent. Please consider the example below: "Sherlock" is the subject of the verb "became".

Note: for Dependency label sets, common industry standards include Universal Dependencies and the Stanford typed dependencies manual.
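A head-dependent relation can be represented as a small record, as in the sketch below. The Relation class and the sentence are hypothetical. The labels nsubj (nominal subject) and det (determiner) come from Universal Dependencies, and the parse is deliberately partial, covering only two relations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One labeled arc between two tokens (hypothetical structure)."""
    head: int       # token index of the head
    dependent: int  # token index of the dependent
    label: str      # relation label, e.g. from Universal Dependencies


# Hypothetical sentence, already split into tokens.
tokens = ["Sherlock", "became", "a", "detective"]

# Partial parse: only two of the sentence's relations are shown.
relations = [
    Relation(head=1, dependent=0, label="nsubj"),  # Sherlock is the subject of became
    Relation(head=3, dependent=2, label="det"),    # "a" is the determiner of detective
]


def describe(tokens, relation):
    """Format a relation as label(head, dependent)."""
    return f"{relation.label}({tokens[relation.head]}, {tokens[relation.dependent]})"


described = [describe(tokens, r) for r in relations]
print(described)
```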

Document Labeling

Document labeling is the task of classifying and categorizing data. This type of labeling is different from the types discussed above, because the labeler is answering questions about the text, rather than labeling spans of tokens within the text. It can be beneficial for projects such as sentiment analysis or applying metadata to a document.

In Datasaur, document labeling can be done on a per-row basis or on a per-document basis.

You can also label images, PDF files, and even GIF files in document labeling.

Note: when creating projects via the DOC project template, the following settings apply by default.

  • Any uploaded questions will be set as required.

  • Answer sets can be edited. If the labeler types something in the text box that does not match an existing label, they can click "Add <your answer> as a new answer".

Optical Character Recognition

Optical Character Recognition (OCR) is the task of translating text inside images or scanned documents into machine-readable text data. Common applications of OCR include processing invoices, receipts, and legal documents.

We currently provide this feature to help you label OCR results.

Note: when uploading pairs of OCR documents, please make sure each image file and its corresponding transcription have the same file name, differing only in extension. For example, covid-unicef.jpg and covid-unicef.txt.
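Before uploading, a quick local check can catch mismatched names. The sketch below is a hypothetical helper, not part of Datasaur. It assumes transcriptions use the .txt extension, as in the example above, and the set of image extensions is an assumption you should adjust to your own files.

```python
from pathlib import PurePath

# Assumption: the image extensions to treat as OCR images.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}
# Assumption: transcriptions are plain-text files, as in covid-unicef.txt.
TRANSCRIPTION_EXTENSION = ".txt"


def unmatched_files(filenames):
    """Return (images without a transcription, transcriptions without an image).

    Files are matched by name without the extension.
    """
    images = set()
    transcriptions = set()
    for name in filenames:
        path = PurePath(name)
        suffix = path.suffix.lower()
        if suffix in IMAGE_EXTENSIONS:
            images.add(path.stem)
        elif suffix == TRANSCRIPTION_EXTENSION:
            transcriptions.add(path.stem)
    return sorted(images - transcriptions), sorted(transcriptions - images)


# Hypothetical upload folder contents.
files = ["covid-unicef.jpg", "covid-unicef.txt", "invoice.png", "notes.txt"]
missing_text, missing_image = unmatched_files(files)
print(missing_text, missing_image)
```

Both returned lists should be empty before you upload. Any name that appears in either list points to a file without its counterpart.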