Common Terminology

A glossary of words we frequently use

Natural Language Processing (NLP)

The field of artificial intelligence that deals with text and language.


Labeler

The person doing the labeling. Sometimes also referred to as an annotator or tagger.


Reviewer

Someone assigned to review the labels applied by a colleague.


Project

At Datasaur, every labeling task starts with a project. A project can have multiple files, and each file will likely contain many labels.


Entity

A conceptual person, object, or location mentioned in a document. Often, this is the token or span of tokens to be labeled in a Named Entity Recognition (NER) project.


Cell

A Cell is a box that is used to display data in the Editor. For example, in the above picture, the box that contains the text "Sherlock Holmes become widely popular in 1891" is a Cell. Cells are arranged in a matrix-like grid.
A Cell's Line is its position along the vertical axis, numbered from 0 from top to bottom.
A Cell's Index is its position along the horizontal axis, numbered from 0 from left to right.
We refer to a Cell by its Line and Index. For example, the Cell with Line 3 and Index 0 is the Cell that contains the text "All but one are set in the Victorian or Edwardian eras, between about 1880 and 1914."
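As a rough illustration of addressing a Cell by Line and Index, the grid can be modeled as a list of rows. This is a minimal sketch in Python. The names `grid` and `get_cell` and the cell texts are hypothetical, and they do not reflect Datasaur's actual data model or API.

```python
# Hypothetical in-memory model of the Editor grid (not Datasaur's real data model).
# The outer list position is the Line (top to bottom, starting at 0);
# the inner list position is the Index (left to right, starting at 0).
grid = [
    ["first cell", "second cell"],   # Line 0
    ["third cell", "fourth cell"],   # Line 1
]


def get_cell(grid, line, index):
    """Return the text of the Cell at the given 0-based Line and Index."""
    return grid[line][index]


# Line 1, Index 0 is the first Cell of the second row.
print(get_cell(grid, 1, 0))
```

Under this model, the Line selects the row first and the Index then selects the Cell within that row.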

Tokens and Characters

A Token is typically the atomic unit of a document. It is usually a single word, but it can also be a punctuation mark such as '.'.
Tokens are indexed from 0 within each Cell, and the characters within each Token are indexed from 0 as well. For example, the Token "popularity" has index 1 in the current Cell, and the character "u" has index 3 in "popularity".
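The two levels of 0-based indexing can be sketched in Python as follows. The example assumes tokens are separated by whitespace, which is a simplification and may differ from the tokenizer Datasaur uses. The `tokenize` helper and the cell text are hypothetical.

```python
def tokenize(cell_text):
    """Split a Cell's text into Tokens on whitespace (a simplifying assumption)."""
    return cell_text.split()


# Hypothetical Cell text in which "popularity" is the second Token.
cell_text = "Holmes's popularity grew quickly"
tokens = tokenize(cell_text)

token = tokens[1]      # Token index 1 within the Cell
character = token[3]   # character index 3 within that Token

print(token, character)  # prints "popularity u"
```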

Label Set

A Label Set is a set of labels that are related to one another. Each project in Datasaur can have up to 5 different Label Sets. Label Sets are indexed from 0 to 4. The Label Set Index was previously called the Layer.
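The limit of 5 Label Sets and the 0 to 4 index range can be illustrated with a short Python sketch. The names `MAX_LABEL_SETS` and `add_label_set` and the example labels are hypothetical and are not part of any Datasaur API.

```python
MAX_LABEL_SETS = 5  # a project can have up to 5 Label Sets (indexes 0 to 4)


def add_label_set(label_sets, labels):
    """Append a Label Set and return its 0-based Label Set Index."""
    if len(label_sets) >= MAX_LABEL_SETS:
        raise ValueError("a project can have at most 5 Label Sets")
    label_sets.append(list(labels))
    return len(label_sets) - 1


# Hypothetical project with two Label Sets.
project_label_sets = []
first_index = add_label_set(project_label_sets, ["PERSON", "LOCATION"])
second_index = add_label_set(project_label_sets, ["POSITIVE", "NEGATIVE"])
print(first_index, second_index)  # prints "0 1"
```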